


Quick Summary
Zylon and Onyx both enable organizations to deploy AI over private company knowledge, with comparable capabilities across chat, connectors, and retrieval. The real difference appears when teams move from an initial pilot to production: Onyx provides the application layer, while Zylon delivers a fuller private AI stack that also includes the inference layer. For enterprises that need AI to remain fully inside their environment, scale reliably under real usage, and avoid taking on infrastructure complexity themselves, that distinction becomes highly relevant.

What Is Zylon
Zylon is an enterprise AI platform built to run inside the customer’s infrastructure—on‑prem, in a cloud VPC, or fully air‑gapped—so organizations can deploy generative AI with full data control, governance, and compliance.
Zylon’s platform description breaks the product into a full stack that is deployed and operated as one system:
AI Core: described as self-contained AI infrastructure including local LLMs, vector databases, and GPU orchestration, deployable across private cloud, on‑prem, and air‑gapped environments.
API Gateway: OpenAI-compatible endpoints with built-in authentication, logging, rate limiting, and observability for integrating Zylon into existing tools and workflows.
Workspace: a product surface for teams to use AI over internal data with no external dependencies (when deployed privately).
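Because the API Gateway exposes OpenAI-compatible endpoints, any OpenAI-style client can target it by changing the base URL. A minimal sketch follows; the gateway URL, token, and model name are placeholders I've invented for illustration, not documented Zylon values:

```python
import json

# Hypothetical values: this URL and token are placeholders, not documented
# Zylon endpoints. The point is the payload shape, which follows the
# OpenAI chat completions convention.
GATEWAY_URL = "https://zylon.internal/v1/chat/completions"

def build_chat_request(prompt, model="local-llm", api_key="internal-token"):
    """Build an OpenAI-style chat completion request for an on-prem gateway."""
    headers = {
        "Authorization": f"Bearer {api_key}",  # gateway-side auth / rate limiting
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return headers, body

headers, body = build_chat_request("Summarize our internal risk policy.")
# The request would be POSTed to GATEWAY_URL inside the private network;
# nothing in this flow requires an external provider.
```

Because the payload matches the OpenAI wire format, existing SDKs and tooling can typically be repointed at a private gateway by swapping only the base URL and credentials.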
A core theme across Zylon’s documentation is complete on-prem operation including model execution, paired with a fixed-cost / unlimited-usage model (no token restrictions at the platform level).
Operationally, Zylon is designed to be installed and upgraded through a CLI-driven process. The online installation guide states that the system automatically downloads required dependencies, container images, and AI models, and it itemizes what the installation process handles: Kubernetes (k0s), NVIDIA drivers and CUDA where applicable, tools such as kubectl and helm, and Zylon's container images.
What Is Onyx
Onyx positions itself as open-source enterprise search + an AI assistant: “the open-source AI chat connected to your docs, apps, and people,” with deep research and advanced chat features.
In its architecture documentation, Onyx describes the application as a set of Docker containers. It documents a core stack that includes:
Application layer: Next.js web frontend, FastAPI API server, and background workers.
Data layer: Postgres, a retrieval layer described as Vespa keyword search engine + vector store, Redis cache, and MinIO object storage.
Infrastructure layer: Nginx as a request router.
For deployment, Onyx offers multiple modes and packaging options:
Onyx Lite vs Onyx Standard: Onyx’s deploy overview states that Lite is a lightweight Chat UI requiring minimal resources and explicitly “does not include the vector database, background workers, or model inference servers.” Standard includes those pieces plus “AI model inference servers for deep learning models used during indexing and inference,” along with Redis/MinIO performance optimizations.
Docker Compose deployment is documented for local/self-hosted use.
Kubernetes via Helm: Onyx’s Kubernetes guide states that “the Onyx Helm chart packages all the required services (API, web, PostgreSQL, Vespa, etc.) into a single deployment.”
Security and governance controls are also documented: for example, Onyx documents SSO support (OAuth/OIDC/SAML) and notes that RBAC is available in the Enterprise Edition. Onyx also states (in its architecture FAQ) that documents and queries are sent to third-party LLMs, but that deployments can be configured to use only chosen providers or connect to a self-hosted LLM.
Architecture and Deployment Comparison
This section focuses on the practical question regulated buyers ask first: What exactly must we run and operate to keep sensitive data fully private in production? The answer depends on where your inference boundary sits.
Side-by-side comparison
| Dimension | Zylon | Onyx |
|---|---|---|
| Core positioning | Private on‑prem AI platform for regulated industries, designed to run inside your infrastructure (including air‑gapped). | Open-source AI chat connected to docs/apps/people, deployable on your infra; includes deep research and RAG/search features. |
| Packaging / install model | CLI-driven installation that automatically downloads dependencies, container images, and AI models; the install process explicitly handles Kubernetes (k0s), NVIDIA drivers/CUDA (if applicable), container tools (kubectl/helm), and Zylon images. | Deployable via Docker Compose or Helm on Kubernetes; the Helm chart packages required services into a single deployment. |
| Default stack components (documented) | Full platform: AI Core + API Gateway + Workspace. AI inference is operated as part of the platform: Triton Inference Server is referenced in AI preset configuration and troubleshooting; vLLM is referenced as the inference backend. | Core stack: web frontend, FastAPI API server, background workers; Postgres + Vespa (keyword + vector) + Redis + MinIO; Nginx router. |
| On-prem LLM inference responsibility | Zylon is designed to run “entirely on‑premise, including AI models,” and describes AI Core as including an inference server. Zylon docs reference Triton + vLLM as the deployed inference layer. | Onyx sends requests to an admin‑configured LLM. Its FAQ states documents/queries are sent to third‑party LLMs unless you connect Onyx to a self‑hosted LLM. Onyx’s Ollama guide instructs you to set up Ollama and deploy your models, then point Onyx at it. |
| What this means operationally | The LLM serving layer is treated as part of the platform lifecycle (install, configure presets, benchmark, upgrade). Zylon docs explicitly discuss concurrency, GPU memory tuning, shared memory for Triton, and model/version compatibility. | Onyx’s platform stack can be deployed quickly, but your privacy and performance posture depends on the configured LLM provider. If you use a self-hosted LLM (e.g., Ollama), you own that server’s production operations. |
The key on‑prem difference that matters in production
The critical distinction is who absorbs the complexity of running inference reliably under enterprise load.
Onyx documents that you can connect to a self-hosted LLM, and its Ollama guide explicitly frames the model server as something you deploy and run separately (“Setup Ollama and Deploy your Models,” with the self-hosted default port noted, and then Onyx configured to use that provider). Architecturally, Onyx also frames the system boundary around an “admin configured LLM” when describing query flow and external communications.
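The self-hosted pattern just described can be sketched as follows. Ollama's default port (11434) and its OpenAI-compatible `/v1` API are real Ollama behavior; the function name, config keys, and the `llama3` model tag are illustrative, and this is not Onyx's actual configuration code:

```python
# Sketch of pointing an assistant at a self-hosted Ollama server.
# Assumptions: Ollama's documented default port (11434) and its
# OpenAI-compatible /v1 endpoint; "llama3" is just an example model tag.
def ollama_provider_config(host="localhost", port=11434, model="llama3"):
    """Return a generic OpenAI-compatible provider config for a local Ollama."""
    return {
        "provider": "openai-compatible",
        "base_url": f"http://{host}:{port}/v1",  # traffic stays inside your network
        "model": model,
    }

cfg = ollama_provider_config()
# Everything behind base_url -- GPUs, drivers, concurrency, upgrades --
# is a separate production system that your team deploys and operates.
```

The one-line config is the easy part; the operational weight sits behind that URL, which is exactly the boundary the next paragraph examines.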
That design is reasonable for teams who already run a model serving layer or who are comfortable outsourcing inference to cloud providers. But for regulated enterprises insisting that no data leaves their infrastructure, the model-serving layer becomes an unavoidable responsibility—covering GPUs, NVIDIA drivers/CUDA compatibility, concurrency/queuing, upgrades to support new model versions, and ongoing reliability.
Zylon’s approach is to ship the platform with the inference layer treated as a first-class part of the deployment and operations lifecycle. Zylon’s docs don’t just mention “a model endpoint”—they document the inference engine as Triton (with shared memory for throughput and latency) and reference vLLM as the inference backend, plus operational guidance for out-of-memory failures, model-version support issues, and concurrency benchmarking. The Zylon installer also explicitly includes the “messy middle” pieces (Kubernetes distribution, NVIDIA drivers/CUDA where needed) as part of what the install process handles.
A useful way to summarize it for buyers: Functionally, both products can look similar in the UI (chat, connectors, RAG). The difference is operational: who owns the hard part of making private AI work in production—especially the inference layer.
Security, Governance, and Compliance Considerations
Security posture in enterprise AI assistants is typically determined by two layers: (1) platform controls (identity, audit logs, encryption, access governance), and (2) the trust boundary for model inference (where prompts/context are processed).
Onyx’s security architecture FAQ states that documents and queries are sent to third‑party LLMs, while also noting you can restrict providers or connect to a self-hosted LLM. This has a direct compliance implication: if the configured LLM is external, the organization must treat that vendor as part of the data processing chain; if the LLM is self-hosted, the organization must ensure the self-hosted inference service is secured and governed like any other production system.
Zylon’s documentation repeatedly emphasizes that it runs on-prem “including AI models,” aligning the trust boundary with infrastructure the enterprise already governs. For regulated buyers, Zylon also documents governance tooling such as audit logging (including an admin audit log described as containing “every single thing that happens in the platform,” with export options via API).
On the identity/access side, Onyx documents SSO support (OAuth/OIDC/SAML) and clarifies that RBAC controls are available in its Enterprise Edition. Zylon’s operator documentation includes configuration guides for enterprise setup and hardening (for example, security-oriented guides like disk encryption and airgap hardening are explicitly part of the operator manual structure).
The practical takeaway: both platforms can participate in an enterprise security program, but Zylon’s primary security “move” is isolation (everything runs inside your environment), while Onyx’s security posture depends materially on the LLM provider configuration and whether inference is self-hosted or external.
Cost, Operations, and Scaling in the Real World
For many teams, the deciding factor isn’t a feature checklist—it’s the operational cost of keeping the system stable at scale.
Zylon explicitly markets and documents an unlimited-usage model: its API documentation describes that there are “no restrictions on tokens or inference executions,” enabling scaling without additional per-token costs. From an operator perspective, the same documentation set shows that Zylon anticipates production scaling concerns—multi-GPU configuration guidance, shared-memory tuning for Triton to improve throughput/latency, and performance benchmarking under concurrency.
Onyx’s resourcing documentation shows that Onyx Standard includes multiple containers with explicitly stated CPU/memory sizing (including indexing_model_server and inference_model_server), and it notes that using cloud-based embedding models reduces the memory needs for those model-server containers. This reinforces an important operational reality: Onyx can either run parts of the ML workload locally or offload some ML services to external providers—again making architectural choices (and compliance boundaries) part of day-to-day operations.
The “pilot-to-production” inflection point typically happens at concurrency and reliability. Zylon’s inference performance troubleshooting guide explicitly states that the platform allocates compute resources to maintain consistent response times under concurrent load (8–10 simultaneous users) and provides a benchmarking script for TTFT, throughput, and latency under concurrent requests. Those are exactly the operational concerns that become painful when the LLM serving layer is treated as “someone else’s problem” until usage spikes.
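The kind of concurrency benchmarking described above (latency and throughput under simultaneous users) can be sketched with a small harness. The inference call below is a stub and every name is illustrative; this is not Zylon's benchmarking script:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stub standing in for a call to the deployed model endpoint; in a real
# benchmark this would issue an HTTP request and also record time-to-first-token.
def fake_inference(prompt):
    time.sleep(0.01)  # simulated generation time
    return f"answer to: {prompt}"

def benchmark(call, n_users=8, requests_per_user=3):
    """Fire n_users * requests_per_user requests concurrently; report latency."""
    latencies = []
    def one_request(i):
        start = time.perf_counter()
        call(f"request {i}")
        latencies.append(time.perf_counter() - start)  # list.append is thread-safe
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        list(pool.map(one_request, range(n_users * requests_per_user)))
    latencies.sort()
    total = len(latencies)
    return {
        "requests": total,
        "mean_latency_s": sum(latencies) / total,
        "p95_latency_s": latencies[int(0.95 * total) - 1],
    }

stats = benchmark(fake_inference)
print(stats["requests"])  # -> 24
```

Even this toy harness surfaces the questions that matter at the pilot-to-production boundary: does p95 latency hold as `n_users` grows, and who is accountable when it does not.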
A neutral way to describe the difference:
If you view “private AI” as an application + connectors + RAG, Onyx is a strong open-source option and is easy to package and deploy.
If you view “private AI” as an application + connectors + RAG + production-grade inference operations, Zylon is explicitly designed to ship the inference layer as part of the platform lifecycle (install, tune, benchmark, upgrade) rather than outsourcing it to each customer.
Decision Guide and FAQ
When Zylon is the better fit
Zylon is typically the best choice when:
Inference must be fully private by default (including the models), because policy prohibits sending prompts/context to external LLM providers. Zylon’s product and docs emphasize running entirely on-prem, including AI models.
You want one on‑prem platform lifecycle to own (install + upgrades + GPU stack + inference tuning), rather than assembling and operating multiple systems (chat/RAG + separate LLM server + separate upgrade cycles). Zylon’s installation guide explicitly includes Kubernetes + NVIDIA drivers/CUDA where needed as part of the install process, and the inference layer is documented as Triton + vLLM backend.
You care about predictable operational behavior under concurrency, and want vendor documentation that treats concurrency and inference stability as core platform requirements.
When Onyx can make sense
Onyx can be a good choice when:
You want an open-source AI chat + search platform and are comfortable adopting and operating it as a containerized OSS stack (Docker Compose or Helm).
You already have a preferred model-serving strategy (cloud LLM, internal inference team, or a self-hosted LLM like Ollama), and you want the assistant to be model-provider flexible. Onyx explicitly supports configuring LLM providers, including self-hosted LLMs.
Your primary need is fast experimentation with chat/agents and connectors, and your compliance boundary permits third‑party LLM usage (or you plan to self-host separately).
FAQ
Does Onyx include an inference server or not?
Onyx’s docs describe “model inference servers” as part of Onyx Standard (and not part of Onyx Lite), and its resourcing guide lists indexing_model_server and inference_model_server containers. However, Onyx also documents that it uses an “admin configured LLM” for query flow, and it explicitly supports connecting to a self-hosted LLM such as Ollama (which you set up and run separately).
So the practical distinction for regulated on‑prem deployments is: you still own the LLM serving boundary yourself, unless you accept an external LLM provider into your data processing chain.
How does Zylon operationalize the inference layer differently?
Zylon’s docs treat inference as a first-class part of platform operations: Triton is referenced as the inference server with shared-memory tuning, vLLM is referenced as the inference backend, and the docs explicitly discuss concurrency behavior and troubleshooting for inference failures. Zylon’s installer also explicitly includes the GPU stack pieces (NVIDIA drivers/CUDA where applicable) and Kubernetes distribution as part of a standard installation flow.
If both UIs look similar (chat + RAG), why does inference ownership matter so much?
Because LLM serving is its own production subsystem. Industry references describe vLLM and Ollama as LLM serving frameworks—i.e., an “inference server component” within a larger architecture. In regulated environments, this subsystem is often where the hardest operational problems appear (GPU memory tuning, concurrency, upgrade cadence, and reliability). Zylon’s product approach is to ship that subsystem as part of the platform and document it accordingly; Onyx’s approach is to let you choose/configure your LLM provider (including a self-hosted LLM), which shifts more operational responsibility to the customer.
Bottom line:
If your organization needs a truly private, production-ready on‑prem AI platform where the inference layer is included and operated as part of one supported system, Zylon is typically the safer long-term choice.
If you want an open-source assistant layer and are comfortable owning the broader architecture—especially the LLM serving boundary—Onyx can be a strong OSS option.
Author: Cristina Traba Deza, Product Designer at Zylon
Published: April 2026
Last updated: April 2026
Cristina designs secure, on-premise AI platforms for regulated industries, specializing in enterprise AI deployments for financial services, healthcare, and public sector organizations requiring full data control, governance, and compliance.