Enterprise-Grade Self-Managed AI
In this article, we share how Saphes IT Systems has approached enterprise-grade, self-managed AI from day one. We started down this path for two reasons: we needed to meet strict compliance requirements, and we wanted to keep full control instead of depending on a single provider.
Our work is closely tied to intellectual property and sensitive client data. Sending proprietary knowledge and assets to an external AI provider without clear guarantees and control raises a simple question: what if we end up giving away our IP for nothing?
After more than two years of hands-on experimentation in production conditions, we've built deep expertise in designing and operating a competitive self-managed AI stack. Today, that experience translates into a platform that is not only compliant and controllable, but also faster and significantly more cost-effective than many alternatives.
In the next sections, we review the key building blocks of an enterprise self-managed AI setup:
- Hardware & sizing (capacity/SLO-driven planning)
- Model strategy (what to run for which tasks and how to manage lifecycle)
- Software stack (inference, orchestration, RAG, observability)
Hardware & sizing
Sizing starts with a few practical targets: expected user count, peak concurrency, acceptable latency (time‑to‑first‑token and tokens/sec), and workload mix (chat, RAG, batch jobs, embeddings). From there, you can choose:
- Compute: CPU vs GPU, GPU class (consumer vs datacenter), and VRAM needs driven by model size + context length + batch size. See PassMark GPU benchmarks for consumer-grade and Pro comparisons.
- Memory & storage: enough RAM for supporting services plus fast local storage for model weights and caches; plan for growth as models and context windows increase.
- Network: low‑latency networking between inference and supporting services (vector DB, storage, observability), and enough egress capacity if multiple apps call the model.
For teams already managing servers, this looks like capacity planning for any other internal platform: define SLOs, measure, then scale.
Observations
These observations may not match your exact requirements. We strongly recommend analysing your future usage with an expert instead of guessing. The examples below are real-world observations, but they are not precise enough to choose the right setup on their own.
| Company size | Usage types | Hardware options | Initial cost |
|---|---|---|---|
| Small team (2-20 users), low usage | Internal chat, ad-hoc Q&A, low concurrency | CPU-first (Apple M3 Ultra) or 1× Pro GPU 48GB VRAM (e.g. RTX PRO 5000) with 64GB RAM and fast NVMe | ~2k-5k€ |
| Small engineers team (2-10 users), high usage | Internal chat, ad-hoc Q&A, high concurrency, RAG, code generation, code completion | 2× Pro GPU 96GB VRAM (RTX PRO 5000) or 4× premium consumer-grade and cost-effective GPU (RTX 3090) with 128GB RAM and fast NVMe | ~5k-15k€ |
| Medium company (50-200 users), mixed usage | Internal assistant + department RAG (HR, Sales, Support), moderate concurrency, some batch jobs | 2× datacenter GPU 80GB (A100 / H100) or 2× 96GB Pro RTX 6000 GPUs, 256-512GB RAM, 2-4× NVMe (RAID); consider a separate node for vector DB/ETL | ~20k€-50k€ |
Frequently asked questions
How much will electricity cost to run the system?
This is negligible compared to the initial cost or to AI providers' costs per token. Let's say you use 20% of the time 100% of your capacity with 4 consumer-grade GPUs (which is an intensive usage). You will have a bill around 100€/month. With the same intensive usage, on AI providers you would be around 3k-6k€ a month.
How to choose the right motherboard and what about PCIe?
That depends on your needs. Some professional motherboards such as MZ32-AR0 are recommended, especially if you want to use multiple models and load/unload them quickly (PCIe version and lanes per slot, here PCIe 4 16x). If you rely on only one or two models, you can keep them loaded and use a consumer-grade motherboard such as a B550 Eagle, which is very cost-effective. In fact, the lane speed, even at PCIe 3 x1, is enough and there is no difference in inference speed, only model loading becomes much longer.
How to power 4 GPUs? Can I use multiple PSU?
Yes. We recommend using multiple PSUs if you don't find one matching your requirements. A PSU can be activated without the requirement to connect it to the workstation.
What CPU should I pair with high-end GPUs?
Prioritize PCIe lanes and memory bandwidth: workstation/server CPUs (Threadripper/EPYC/Xeon are often a better fit than consumer CPUs when you run 2-4+ GPUs. Avoid CPU bottlenecks for token streaming and RAG pipelines.
What storage setup works best for model weights and RAG?
Use fast NVMe for model weights and hot caches. Separate volumes (or separate nodes) for the vector DB and logs. RAID1/10 helps availability; also plan backups for indexes and documents.
Is consumer GPU hardware reliable enough for production?
In most cases, yes. However, if you serve clients and do not accept higher operational risk (no ECC VRAM, less predictable thermals), prefer datacenter GPUs, validated chassis, and enterprise support.
How do we handle cooling, noise, and power limits?
Plan airflow first (front-to-back, high static-pressure fans), size PSUs with margin, and set GPU power caps to stabilise performance/watts. In offices, consider remote rack placement or quieter workstation builds (avoid blower systems).
Do we need high-speed networking (10/25GbE)?
Not really. AI traffic is mostly text, and text is light, so this is rarely the bottleneck.
Can I use AMD or Intel GPUs?
The short answer is yes,but practically, it's often not worth it. The issue is ecosystem support (ROCm and Intel oneAPI) compared to CUDA. Even if you save a bit on initial cost, you'll likely struggle more, and sometimes you simply won't be able to use a given model or feature.
Model selection
Enterprise setups benefit from a deliberate model policy rather than “one model for everything”. In practice, we recommend defining a small “model portfolio”:
- Primary assistant model (best quality you can run within your latency/SLO constraints)
- Cost-efficient model for high-volume tasks (classification, extraction, routing, drafts)
- Embeddings model for RAG (and optionally a reranker if retrieval quality matters)
Selection criteria we use:
- Fit to tasks: chat vs translations vs structured extraction vs coding; don't overspend on a large model for simple classification.
- Serving efficiency: KV cache size grows with the context window; context is often the real VRAM bottleneck.
- Licensing & governance: validate licence compatibility (internal use, customer-facing use, redistribution) and document approved models. Check Hugging Face's model licenses before deployment.
Our baseline recommendations
The table below is meant as a starting point (exact VRAM depends on the backend, batch size, context, and whether you keep multiple models loaded).
| Model | Quantization | Context window | Size VRAM | Usage |
|---|---|---|---|---|
| Qwen3.6 35b (a3b) | Q5_K_M | 256k | 50GB | Coding, agentic tasks, general tasks |
| Qwen3.6 27b | Q5_K_M | 256k | 50GB | Same as the 35b, more accurate in benchmarks, but slower |
| GLM 4.7 flash | Q5_K_M | 198k | 50GB | Coding, agentic tasks, general tasks |
| TranslateGemma | Q5_K_M | 128k | 35GB | Translations |
| Qwen2.5 Coder 14b (base) | Q5_K_M | 8k (up to 32k) | 25GB | Code completion |
| Nomic Embed 0.3b | FP16 | 2k | 1GB | Light, reliable and portable embedding model |
| Qwen3 Embedding 4b | Q8_0 | 40k | 20GB | Very accurate embedding model, but heavy, useful if you need a big context window |
| Flux 2 Klein 9b | FP8 | / | 10GB | Good for photo realistic results |
| GLM Image | FP8/FP16 | / | 20GB | Good for common image including text |
Frequently asked questions
What throughput could you expect?
Every one of our examples is highly practical. For example, a 4× RTX 3090 running Qwen 3.6 35B with a 256k context window on llama.cpp delivers 120 tokens/sec for a single user and 70 tokens/sec for 3 concurrent requests, which is better than the Claude 4.6 API for example, without any usage limitations.
How do we choose the right context window?
Start from the product need, not the model spec. For most enterprise assistants, 8k-16k is enough if your RAG pipeline is good (retrieve less, retrieve better). Increase context only when you have real long-document workflows (contracts, large technical specs, multi-file code reviews). Remember: longer context increases VRAM usage (KV cache) and often reduces throughput.
How do we choose the right quantization (FP16 vs 8-bit vs 4-bit)?
Never use FP16 except for very small models or image generation; it's not worth it. 8-bit is very good for accuracy-sensitive tasks with better memory savings. 4-bit is usually the best cost/performance for chat and RAG in production, but validate on your own eval set (some tasks like strict extraction or multilingual nuance can degrade). At Saphes we recommend Q5 quantisation; the balance is perfect for our usage, and we've never identified a degraded output at this quantisation.
What impacts VRAM the most?
Three main drivers: (1) model weights, (2) context window (KV cache), and (3) batching/concurrency. Many teams underestimate the KV cache: doubling context can significantly increase VRAM needs, especially at higher concurrency.
Should we run one big model or multiple smaller ones?
Prefer multiple models when you have clear task separation: a strong assistant for “hard” requests, and a cheaper model for routing, summarization, extraction, or drafts. This reduces cost and improves latency under load.
Do we need a reranker for RAG?
If your documents are noisy/long and retrieval quality matters (support knowledge bases, legal/technical content), a reranker often improves accuracy more than increasing context size. If your corpus is small and clean, you may skip it.
Why not a big model like Kimi k2.5?
Once you reach a certain scale, adding parameters has less and less impact on quality. A model like Qwen 3.6 is more than enough for complex agentic tasks today; you don't have to rely on “frontier” open source models like Deepseek V4. The performance gap isn't worth it: a smaller, well-configured model is more than enough, and much faster.
Should I use multiple models or one big flagship model?
We recommend having a flagship model, always loaded on your GPUs, used for most tasks. Keep some VRAM available for useful smaller models such as a translation specialist and code completion, these don't take much VRAM and are very fast at what they do. This also helps offload concurrent requests from your flagship model.
Software stack / Tooling
A typical self‑managed stack is modular and can start small:
Inference
For inference, there are currently three real challengers. Each exposes an OpenAI-compatible API, and each has strengths and trade-offs:
| Solution | Pros | Cons |
|---|---|---|
| Ollama | Fast to set up, user friendly: a few commands and your model is running. | Slower than the other solutions; frequently loads/unloads models if a request doesn't match what's already loaded. |
| Llama server (llama.cpp) | Very fast and solid for concurrency; high level of customization; broad model choice and format support. | Loads one model at a time; longer to set up than Ollama. |
| vLLM | Excellent for concurrency; the right choice for a flagship model at scale; allows a high level of customization. | Slower than llama.cpp for single requests; longer to set up than Ollama; fewer supported formats than llama.cpp. |
UI / Gateway
There is one standout option: Open WebUI. Open WebUI acts as a gateway and provides granular access control. It also comes with many features, including proxying Ollama (if you use it) as well as other connected OpenAI-compatible APIs.
Open WebUI can be coupled with ComfyUI for image generation, they both interact well, creating a GPT like image generation experience.
RAG
A RAG (Retrieval-Augmented Generation) system is crucial for many use cases, and the more documents your company has, the more important it becomes. RAG lets you store chunks of your documents and, instead of injecting 300 pages into the context window for every question, retrieve only the relevant parts along with metadata (document name, page reference, etc.). Learn more about RAG architecture and best practices.
For development purposes, RAG is often less useful (for example, re,embedding a whole codebase for each branch is not realistic). In contrast, embedding structured documentation for a framework or language can be very helpful.
Before setting up a RAG system (which is straightforward to get started with), the most important part is defining what should be in your database and which metadata you need. Architecture matters: it determines whether your RAG system will be usable in practice.
You can run ChromaDB or Qdrant on your server. Both are good; we recommend Qdrant because it's fast and well designed (this recommendation is opinionated).
Code editors
For local-first, AI-assisted development, we recommend the following setup:
Zed: our default recommendation. It integrates AI natively, supports ACP and other modern tooling, and is extremely fast and responsive. See Zed's AI docs for setup details.
VS Code or JetBrains:
- Continue for code completion and quick chat-style answers.
- Cline for more agentic workflows (multi-step tasks, repo-wide changes, tooling orchestration).
MCPs
There are plenty of MCPs on the market. The right ones are simply the ones you need. What matters most is deployment: MCPs can run in multiple ways and support different connection types. We recommend streamable HTTP connections for MCPs that must connect to your server, and running them directly on the server.
A solid set of MCPs, or building the ones you need, can be a game changer. MCPs are mostly limited by imagination: you can develop an MCP to talk to your e-commerce platform, CMS, create and fetch content, create products, and much more. Learn more about the Model Context Protocol specification and explore the MCP Registry for community-built integrations.
Final thoughts
Today, integrating self-managed AI into a company is no longer a moonshot: it’s concrete and achievable. Models are now smart and fast enough to handle complex tasks, a well-configured RAG system can outperform most automated RAG solutions on the market, UI tools have become user-friendly, and ROI is often reached quickly, typically within 6 to 24 months depending on the complexity of your systems.
That said, there are important considerations. The initial setup can be complicated: driver issues, connecting tools, configuring the RAG system, and selecting the right models. For us, the biggest cost is not the hardware, but the time-consuming setup, while ongoing maintenance is usually much simpler.