How do we choose the right context window for enterprise AI?

Start from the product need, not the model spec. For most enterprise assistants, 8k-16k is sufficient if your RAG pipeline is effective. Increase context only for long-document workflows like contracts or large technical specs. Note that longer contexts increase VRAM usage (KV cache) and reduce throughput.

What impacts VRAM the most in LLM inference?

The three main drivers are model weights, context window (KV cache), and batching/concurrency. Teams often underestimate the KV cache impact: doubling the context size can significantly increase VRAM needs, especially at higher concurrency levels.

Why not use a very large model like Kimi k2.5?

Beyond a certain scale, adding parameters yields diminishing returns on quality. Models like Qwen 3.6 are sufficient for complex agentic tasks without relying on frontier open-source models like Deepseek V4. The performance gap is rarely worth the cost; a smaller, well-configured model is faster and often sufficient.

How much will electricity cost to run a self-managed AI system?

Electricity costs are negligible compared to initial hardware costs or API provider costs. For intensive usage (e.g., 20% of time at 100% capacity with 4 consumer-grade GPUs), the monthly bill is around 100€. The same usage on AI providers would cost between 3,000€ and 6,000€ per month.

How to choose the right motherboard and what about PCIe for AI hardware?

Choose based on your needs. Professional motherboards like the Gigabyte MZ32-AR0 are recommended for quick model loading/unloading via PCIe 4 16x lanes. For single or dual model setups, consumer boards like the B550 Eagle are cost-effective. PCIe lane speed (even PCIe 3 x1) does not affect inference speed, only model loading time.

How to power 4 GPUs? Can I use multiple PSUs?

Yes, you can use multiple PSUs if you cannot find a single unit meeting your requirements. A PSU can be activated without being connected to the workstation.

What CPU should I pair with high-end GPUs for AI workloads?

Prioritize PCIe lanes and memory bandwidth. Workstation or server CPUs like Threadripper, EPYC, or Xeon are often better than consumer CPUs when running 2-4+ GPUs. This helps avoid CPU bottlenecks for token streaming and RAG pipelines.

How do we handle cooling, noise, and power limits in a self-hosted AI setup?

Plan airflow first (front-to-back with high static-pressure fans), size PSUs with margin, and set GPU power caps to stabilize performance per watt. In office environments, consider remote rack placement or quieter workstation builds, avoiding blower systems.

Do we need high-speed networking (10/25GbE) for AI inference?

Not really. AI traffic is mostly text, which is lightweight, so network speed is rarely the bottleneck for inference.

Can I use AMD or Intel GPUs for self-managed AI?

Technically yes, but it is often not worth it due to ecosystem support gaps in ROCm and Intel oneAPI compared to CUDA. You may struggle with compatibility, and some models or features may not work, negating initial cost savings.

Enterprise-Grade Self-Managed AI

Q: What throughput could you expect from a self-hosted LLM setup?

For example, a 4x RTX 3090 running Qwen 3.6 35b with a 256k context window on llama.cpp delivers 120 tokens/sec for a single user and 70 tokens/sec for 3 concurrent requests. This performance exceeds many API-based alternatives like Claude 4.6 without usage limitations.

Q: How do we choose the right quantization (FP16 vs 8-bit vs 4-bit)?

Avoid FP16 except for small models or image generation. 8-bit is ideal for accuracy-sensitive tasks with good memory savings. 4-bit is usually the best cost/performance for chat and RAG, though validation on your specific eval set is recommended. At Saphes, Q5 quantization offers the perfect balance with no observed degradation in output quality.

Q: Should we run one big model or multiple smaller ones?

Prefer multiple models when you have clear task separation. Use a strong assistant for complex requests and cheaper models for routing, summarization, extraction, or drafts. This approach reduces cost and improves latency under load.

Q: Do we need a reranker for RAG?

If your documents are noisy or long, or if retrieval quality is critical (e.g., legal/technical content), a reranker often improves accuracy more than increasing context size. For small, clean corpora, you may skip it.

Q: Should I use multiple models or one big flagship model?

We recommend a flagship model always loaded on your GPUs for most tasks. Keep some VRAM available for specialized smaller models like translation or code completion specialists. These take minimal VRAM, are very fast, and help offload concurrent requests from the flagship model.

In this article, we share how Saphes IT Systems has approached enterprise-grade, self-managed AI from day one. We started down this path for two reasons: we needed to meet strict compliance requirements, and we wanted to keep full control instead of depending on a single provider.

Our work is closely tied to intellectual property and sensitive client data. Sending proprietary knowledge and assets to an external AI provider without clear guarantees and control raises a simple question: what if we end up giving away our IP for nothing?

After more than two years of hands-on experimentation in production conditions, we've built deep expertise in designing and operating a competitive self-managed AI stack. Today, that experience translates into a platform that is not only compliant and controllable, but also faster and significantly more cost-effective than many alternatives.

In the next sections, we review the key building blocks of an enterprise self-managed AI setup:

Hardware & sizing (capacity/SLO-driven planning)
Model strategy (what to run for which tasks and how to manage lifecycle)
Software stack (inference, orchestration, RAG, observability)

Hardware & sizing

Sizing starts with a few practical targets: expected user count, peak concurrency, acceptable latency (time‑to‑first‑token and tokens/sec), and workload mix (chat, RAG, batch jobs, embeddings). From there, you can choose:

Compute: CPU vs GPU, GPU class (consumer vs datacenter), and VRAM needs driven by model size + context length + batch size. See PassMark GPU benchmarks for consumer-grade and Pro comparisons.
Memory & storage: enough RAM for supporting services plus fast local storage for model weights and caches; plan for growth as models and context windows increase.
Network: low‑latency networking between inference and supporting services (vector DB, storage, observability), and enough egress capacity if multiple apps call the model.

For teams already managing servers, this looks like capacity planning for any other internal platform: define SLOs, measure, then scale.

Observations

These observations may not match your exact requirements. We strongly recommend analysing your future usage with an expert instead of guessing. The examples below are real-world observations, but they are not precise enough to choose the right setup on their own.

Company size	Usage types	Hardware options	Initial cost
Small team (2-20 users), low usage	Internal chat, ad-hoc Q&A, low concurrency	CPU-first (Apple M3 Ultra) or 1× Pro GPU 48GB VRAM (e.g. RTX PRO 5000) with 64GB RAM and fast NVMe	~2k-5k€
Small engineers team (2-10 users), high usage	Internal chat, ad-hoc Q&A, high concurrency, RAG, code generation, code completion	2× Pro GPU 96GB VRAM (RTX PRO 5000) or 4× premium consumer-grade and cost-effective GPU (RTX 3090) with 128GB RAM and fast NVMe	~5k-15k€
Medium company (50-200 users), mixed usage	Internal assistant + department RAG (HR, Sales, Support), moderate concurrency, some batch jobs	2× datacenter GPU 80GB (A100 / H100) or 2× 96GB Pro RTX 6000 GPUs, 256-512GB RAM, 2-4× NVMe (RAID); consider a separate node for vector DB/ETL	~20k€-50k€

Frequently asked questions

How much will electricity cost to run the system?

This is negligible compared to the initial cost or to AI providers' costs per token. Let's say you use 20% of the time 100% of your capacity with 4 consumer-grade GPUs (which is an intensive usage). You will have a bill around 100€/month. With the same intensive usage, on AI providers you would be around 3k-6k€ a month.

How to choose the right motherboard and what about PCIe?

That depends on your needs. Some professional motherboards such as MZ32-AR0 are recommended, especially if you want to use multiple models and load/unload them quickly (PCIe version and lanes per slot, here PCIe 4 16x). If you rely on only one or two models, you can keep them loaded and use a consumer-grade motherboard such as a B550 Eagle, which is very cost-effective. In fact, the lane speed, even at PCIe 3 x1, is enough and there is no difference in inference speed, only model loading becomes much longer.

How to power 4 GPUs? Can I use multiple PSU?

Yes. We recommend using multiple PSUs if you don't find one matching your requirements. A PSU can be activated without the requirement to connect it to the workstation.

What CPU should I pair with high-end GPUs?

Prioritize PCIe lanes and memory bandwidth: workstation/server CPUs (Threadripper/EPYC/Xeon are often a better fit than consumer CPUs when you run 2-4+ GPUs. Avoid CPU bottlenecks for token streaming and RAG pipelines.

What storage setup works best for model weights and RAG?

Use fast NVMe for model weights and hot caches. Separate volumes (or separate nodes) for the vector DB and logs. RAID1/10 helps availability; also plan backups for indexes and documents.

Is consumer GPU hardware reliable enough for production?

In most cases, yes. However, if you serve clients and do not accept higher operational risk (no ECC VRAM, less predictable thermals), prefer datacenter GPUs, validated chassis, and enterprise support.

How do we handle cooling, noise, and power limits?

Plan airflow first (front-to-back, high static-pressure fans), size PSUs with margin, and set GPU power caps to stabilise performance/watts. In offices, consider remote rack placement or quieter workstation builds (avoid blower systems).

Do we need high-speed networking (10/25GbE)?

Not really. AI traffic is mostly text, and text is light, so this is rarely the bottleneck.

Can I use AMD or Intel GPUs?

The short answer is yes,but practically, it's often not worth it. The issue is ecosystem support (ROCm and Intel oneAPI) compared to CUDA. Even if you save a bit on initial cost, you'll likely struggle more, and sometimes you simply won't be able to use a given model or feature.

Model selection

Enterprise setups benefit from a deliberate model policy rather than “one model for everything”. In practice, we recommend defining a small “model portfolio”:

Primary assistant model (best quality you can run within your latency/SLO constraints)
Cost-efficient model for high-volume tasks (classification, extraction, routing, drafts)
Embeddings model for RAG (and optionally a reranker if retrieval quality matters)

Selection criteria we use:

Fit to tasks: chat vs translations vs structured extraction vs coding; don't overspend on a large model for simple classification.
Serving efficiency: KV cache size grows with the context window; context is often the real VRAM bottleneck.
Licensing & governance: validate licence compatibility (internal use, customer-facing use, redistribution) and document approved models. Check Hugging Face's model licenses before deployment.

Our baseline recommendations

The table below is meant as a starting point (exact VRAM depends on the backend, batch size, context, and whether you keep multiple models loaded).

Model	Quantization	Context window	Size VRAM	Usage
Qwen3.6 35b (a3b)	Q5_K_M	256k	50GB	Coding, agentic tasks, general tasks
Qwen3.6 27b	Q5_K_M	256k	50GB	Same as the 35b, more accurate in benchmarks, but slower
GLM 4.7 flash	Q5_K_M	198k	50GB	Coding, agentic tasks, general tasks
TranslateGemma	Q5_K_M	128k	35GB	Translations
Qwen2.5 Coder 14b (base)	Q5_K_M	8k (up to 32k)	25GB	Code completion
Nomic Embed 0.3b	FP16	2k	1GB	Light, reliable and portable embedding model
Qwen3 Embedding 4b	Q8_0	40k	20GB	Very accurate embedding model, but heavy, useful if you need a big context window
Flux 2 Klein 9b	FP8	/	10GB	Good for photo realistic results
GLM Image	FP8/FP16	/	20GB	Good for common image including text

Frequently asked questions

What throughput could you expect?

Every one of our examples is highly practical. For example, a 4× RTX 3090 running Qwen 3.6 35B with a 256k context window on llama.cpp delivers 120 tokens/sec for a single user and 70 tokens/sec for 3 concurrent requests, which is better than the Claude 4.6 API for example, without any usage limitations.

How do we choose the right context window?

Start from the product need, not the model spec. For most enterprise assistants, 8k-16k is enough if your RAG pipeline is good (retrieve less, retrieve better). Increase context only when you have real long-document workflows (contracts, large technical specs, multi-file code reviews). Remember: longer context increases VRAM usage (KV cache) and often reduces throughput.

How do we choose the right quantization (FP16 vs 8-bit vs 4-bit)?

Never use FP16 except for very small models or image generation; it's not worth it. 8-bit is very good for accuracy-sensitive tasks with better memory savings. 4-bit is usually the best cost/performance for chat and RAG in production, but validate on your own eval set (some tasks like strict extraction or multilingual nuance can degrade). At Saphes we recommend Q5 quantisation; the balance is perfect for our usage, and we've never identified a degraded output at this quantisation.

What impacts VRAM the most?

Three main drivers: (1) model weights, (2) context window (KV cache), and (3) batching/concurrency. Many teams underestimate the KV cache: doubling context can significantly increase VRAM needs, especially at higher concurrency.

Should we run one big model or multiple smaller ones?

Prefer multiple models when you have clear task separation: a strong assistant for “hard” requests, and a cheaper model for routing, summarization, extraction, or drafts. This reduces cost and improves latency under load.

Do we need a reranker for RAG?

If your documents are noisy/long and retrieval quality matters (support knowledge bases, legal/technical content), a reranker often improves accuracy more than increasing context size. If your corpus is small and clean, you may skip it.

Why not a big model like Kimi k2.5?

Once you reach a certain scale, adding parameters has less and less impact on quality. A model like Qwen 3.6 is more than enough for complex agentic tasks today; you don't have to rely on “frontier” open source models like Deepseek V4. The performance gap isn't worth it: a smaller, well-configured model is more than enough, and much faster.

Should I use multiple models or one big flagship model?

We recommend having a flagship model, always loaded on your GPUs, used for most tasks. Keep some VRAM available for useful smaller models such as a translation specialist and code completion, these don't take much VRAM and are very fast at what they do. This also helps offload concurrent requests from your flagship model.

Software stack / Tooling

A typical self‑managed stack is modular and can start small:

Inference

For inference, there are currently three real challengers. Each exposes an OpenAI-compatible API, and each has strengths and trade-offs:

Solution	Pros	Cons
Ollama	Fast to set up, user friendly: a few commands and your model is running.	Slower than the other solutions; frequently loads/unloads models if a request doesn't match what's already loaded.
Llama server (llama.cpp)	Very fast and solid for concurrency; high level of customization; broad model choice and format support.	Loads one model at a time; longer to set up than Ollama.
vLLM	Excellent for concurrency; the right choice for a flagship model at scale; allows a high level of customization.	Slower than llama.cpp for single requests; longer to set up than Ollama; fewer supported formats than llama.cpp.

UI / Gateway

There is one standout option: Open WebUI. Open WebUI acts as a gateway and provides granular access control. It also comes with many features, including proxying Ollama (if you use it) as well as other connected OpenAI-compatible APIs.

Open WebUI can be coupled with ComfyUI for image generation, they both interact well, creating a GPT like image generation experience.

RAG

A RAG (Retrieval-Augmented Generation) system is crucial for many use cases, and the more documents your company has, the more important it becomes. RAG lets you store chunks of your documents and, instead of injecting 300 pages into the context window for every question, retrieve only the relevant parts along with metadata (document name, page reference, etc.). Learn more about RAG architecture and best practices.

For development purposes, RAG is often less useful (for example, re,embedding a whole codebase for each branch is not realistic). In contrast, embedding structured documentation for a framework or language can be very helpful.

Before setting up a RAG system (which is straightforward to get started with), the most important part is defining what should be in your database and which metadata you need. Architecture matters: it determines whether your RAG system will be usable in practice.

You can run ChromaDB or Qdrant on your server. Both are good; we recommend Qdrant because it's fast and well designed (this recommendation is opinionated).

Code editors

For local-first, AI-assisted development, we recommend the following setup:

Zed: our default recommendation. It integrates AI natively, supports ACP and other modern tooling, and is extremely fast and responsive. See Zed's AI docs for setup details.

VS Code or JetBrains:

Continue for code completion and quick chat-style answers.
Cline for more agentic workflows (multi-step tasks, repo-wide changes, tooling orchestration).

MCPs

There are plenty of MCPs on the market. The right ones are simply the ones you need. What matters most is deployment: MCPs can run in multiple ways and support different connection types. We recommend streamable HTTP connections for MCPs that must connect to your server, and running them directly on the server.

A solid set of MCPs, or building the ones you need, can be a game changer. MCPs are mostly limited by imagination: you can develop an MCP to talk to your e-commerce platform, CMS, create and fetch content, create products, and much more. Learn more about the Model Context Protocol specification and explore the MCP Registry for community-built integrations.

Final thoughts

Today, integrating self-managed AI into a company is no longer a moonshot: it’s concrete and achievable. Models are now smart and fast enough to handle complex tasks, a well-configured RAG system can outperform most automated RAG solutions on the market, UI tools have become user-friendly, and ROI is often reached quickly, typically within 6 to 24 months depending on the complexity of your systems.

That said, there are important considerations. The initial setup can be complicated: driver issues, connecting tools, configuring the RAG system, and selecting the right models. For us, the biggest cost is not the hardware, but the time-consuming setup, while ongoing maintenance is usually much simpler.