Skip to main content
EngineeringArtificial Intelligence
Paul Weinsberg

Enterprise-Grade Self-Managed AI

In this article, we share how Saphes IT Systems has approached enterprise-grade, self-managed AI from day one. We started down this path for two reasons: we needed to meet strict compliance requirements, and we wanted to keep full control instead of depending on a single provider.

Our work is closely tied to intellectual property and sensitive client data. Sending proprietary knowledge and assets to an external AI provider without clear guarantees and control raises a simple question: what if we end up giving away our IP for nothing?

After more than two years of hands-on experimentation in production conditions, we've built deep expertise in designing and operating a competitive self-managed AI stack. Today, that experience translates into a platform that is not only compliant and controllable, but also faster and significantly more cost-effective than many alternatives.

In the next sections, we review the key building blocks of an enterprise self-managed AI setup:

  • Hardware & sizing (capacity/SLO-driven planning)
  • Model strategy (what to run for which tasks and how to manage lifecycle)
  • Software stack (inference, orchestration, RAG, observability)

Hardware & sizing

Sizing starts with a few practical targets: expected user count, peak concurrency, acceptable latency (time‑to‑first‑token and tokens/sec), and workload mix (chat, RAG, batch jobs, embeddings). From there, you can choose:

  • Compute: CPU vs GPU, GPU class (consumer vs datacenter), and VRAM needs driven by model size + context length + batch size. See PassMark GPU benchmarks for consumer-grade and Pro comparisons.
  • Memory & storage: enough RAM for supporting services plus fast local storage for model weights and caches; plan for growth as models and context windows increase.
  • Network: low‑latency networking between inference and supporting services (vector DB, storage, observability), and enough egress capacity if multiple apps call the model.

For teams already managing servers, this looks like capacity planning for any other internal platform: define SLOs, measure, then scale.

Observations

These observations may not match your exact requirements. We strongly recommend analysing your future usage with an expert instead of guessing. The examples below are real-world observations, but they are not precise enough to choose the right setup on their own.

Company sizeUsage typesHardware optionsInitial cost
Small team (2-20 users), low usageInternal chat, ad-hoc Q&A, low concurrencyCPU-first (Apple M3 Ultra) or 1× Pro GPU 48GB VRAM (e.g. RTX PRO 5000) with 64GB RAM and fast NVMe~2k-5k€
Small engineers team (2-10 users), high usageInternal chat, ad-hoc Q&A, high concurrency, RAG, code generation, code completion2× Pro GPU 96GB VRAM (RTX PRO 5000) or 4× premium consumer-grade and cost-effective GPU (RTX 3090) with 128GB RAM and fast NVMe~5k-15k€
Medium company (50-200 users), mixed usageInternal assistant + department RAG (HR, Sales, Support), moderate concurrency, some batch jobs2× datacenter GPU 80GB (A100 / H100) or 2× 96GB Pro RTX 6000 GPUs, 256-512GB RAM, 2-4× NVMe (RAID); consider a separate node for vector DB/ETL~20k€-50k€

Frequently asked questions

How much will electricity cost to run the system?

This is negligible compared to the initial cost or to AI providers' costs per token. Let's say you use 20% of the time 100% of your capacity with 4 consumer-grade GPUs (which is an intensive usage). You will have a bill around 100€/month. With the same intensive usage, on AI providers you would be around 3k-6k€ a month.

How to choose the right motherboard and what about PCIe?

That depends on your needs. Some professional motherboards such as MZ32-AR0 are recommended, especially if you want to use multiple models and load/unload them quickly (PCIe version and lanes per slot, here PCIe 4 16x). If you rely on only one or two models, you can keep them loaded and use a consumer-grade motherboard such as a B550 Eagle, which is very cost-effective. In fact, the lane speed, even at PCIe 3 x1, is enough and there is no difference in inference speed, only model loading becomes much longer.

How to power 4 GPUs? Can I use multiple PSU?

Yes. We recommend using multiple PSUs if you don't find one matching your requirements. A PSU can be activated without the requirement to connect it to the workstation.

What CPU should I pair with high-end GPUs?

Prioritize PCIe lanes and memory bandwidth: workstation/server CPUs (Threadripper/EPYC/Xeon are often a better fit than consumer CPUs when you run 2-4+ GPUs. Avoid CPU bottlenecks for token streaming and RAG pipelines.

What storage setup works best for model weights and RAG?

Use fast NVMe for model weights and hot caches. Separate volumes (or separate nodes) for the vector DB and logs. RAID1/10 helps availability; also plan backups for indexes and documents.

Is consumer GPU hardware reliable enough for production?

In most cases, yes. However, if you serve clients and do not accept higher operational risk (no ECC VRAM, less predictable thermals), prefer datacenter GPUs, validated chassis, and enterprise support.

How do we handle cooling, noise, and power limits?

Plan airflow first (front-to-back, high static-pressure fans), size PSUs with margin, and set GPU power caps to stabilise performance/watts. In offices, consider remote rack placement or quieter workstation builds (avoid blower systems).

Do we need high-speed networking (10/25GbE)?

Not really. AI traffic is mostly text, and text is light, so this is rarely the bottleneck.

Can I use AMD or Intel GPUs?

The short answer is yes,but practically, it's often not worth it. The issue is ecosystem support (ROCm and Intel oneAPI) compared to CUDA. Even if you save a bit on initial cost, you'll likely struggle more, and sometimes you simply won't be able to use a given model or feature.

Model selection

Enterprise setups benefit from a deliberate model policy rather than “one model for everything”. In practice, we recommend defining a small “model portfolio”:

  • Primary assistant model (best quality you can run within your latency/SLO constraints)
  • Cost-efficient model for high-volume tasks (classification, extraction, routing, drafts)
  • Embeddings model for RAG (and optionally a reranker if retrieval quality matters)

Selection criteria we use:

  • Fit to tasks: chat vs translations vs structured extraction vs coding; don't overspend on a large model for simple classification.
  • Serving efficiency: KV cache size grows with the context window; context is often the real VRAM bottleneck.
  • Licensing & governance: validate licence compatibility (internal use, customer-facing use, redistribution) and document approved models. Check Hugging Face's model licenses before deployment.

Our baseline recommendations

The table below is meant as a starting point (exact VRAM depends on the backend, batch size, context, and whether you keep multiple models loaded).

ModelQuantizationContext windowSize VRAMUsage
Qwen3.6 35b (a3b)Q5_K_M256k50GBCoding, agentic tasks, general tasks
Qwen3.6 27bQ5_K_M256k50GBSame as the 35b, more accurate in benchmarks, but slower
GLM 4.7 flashQ5_K_M198k50GBCoding, agentic tasks, general tasks
TranslateGemmaQ5_K_M128k35GBTranslations
Qwen2.5 Coder 14b (base)Q5_K_M8k (up to 32k)25GBCode completion
Nomic Embed 0.3bFP162k1GBLight, reliable and portable embedding model
Qwen3 Embedding 4bQ8_040k20GBVery accurate embedding model, but heavy, useful if you need a big context window
Flux 2 Klein 9bFP8/10GBGood for photo realistic results
GLM ImageFP8/FP16/20GBGood for common image including text

Frequently asked questions

What throughput could you expect?

Every one of our examples is highly practical. For example, a 4× RTX 3090 running Qwen 3.6 35B with a 256k context window on llama.cpp delivers 120 tokens/sec for a single user and 70 tokens/sec for 3 concurrent requests, which is better than the Claude 4.6 API for example, without any usage limitations.

How do we choose the right context window?

Start from the product need, not the model spec. For most enterprise assistants, 8k-16k is enough if your RAG pipeline is good (retrieve less, retrieve better). Increase context only when you have real long-document workflows (contracts, large technical specs, multi-file code reviews). Remember: longer context increases VRAM usage (KV cache) and often reduces throughput.

How do we choose the right quantization (FP16 vs 8-bit vs 4-bit)?

Never use FP16 except for very small models or image generation; it's not worth it. 8-bit is very good for accuracy-sensitive tasks with better memory savings. 4-bit is usually the best cost/performance for chat and RAG in production, but validate on your own eval set (some tasks like strict extraction or multilingual nuance can degrade). At Saphes we recommend Q5 quantisation; the balance is perfect for our usage, and we've never identified a degraded output at this quantisation.

What impacts VRAM the most?

Three main drivers: (1) model weights, (2) context window (KV cache), and (3) batching/concurrency. Many teams underestimate the KV cache: doubling context can significantly increase VRAM needs, especially at higher concurrency.

Should we run one big model or multiple smaller ones?

Prefer multiple models when you have clear task separation: a strong assistant for “hard” requests, and a cheaper model for routing, summarization, extraction, or drafts. This reduces cost and improves latency under load.

Do we need a reranker for RAG?

If your documents are noisy/long and retrieval quality matters (support knowledge bases, legal/technical content), a reranker often improves accuracy more than increasing context size. If your corpus is small and clean, you may skip it.

Why not a big model like Kimi k2.5?

Once you reach a certain scale, adding parameters has less and less impact on quality. A model like Qwen 3.6 is more than enough for complex agentic tasks today; you don't have to rely on “frontier” open source models like Deepseek V4. The performance gap isn't worth it: a smaller, well-configured model is more than enough, and much faster.

Should I use multiple models or one big flagship model?

We recommend having a flagship model, always loaded on your GPUs, used for most tasks. Keep some VRAM available for useful smaller models such as a translation specialist and code completion, these don't take much VRAM and are very fast at what they do. This also helps offload concurrent requests from your flagship model.

Software stack / Tooling

A typical self‑managed stack is modular and can start small:

Inference

For inference, there are currently three real challengers. Each exposes an OpenAI-compatible API, and each has strengths and trade-offs:

SolutionProsCons
OllamaFast to set up, user friendly: a few commands and your model is running.Slower than the other solutions; frequently loads/unloads models if a request doesn't match what's already loaded.
Llama server (llama.cpp)Very fast and solid for concurrency; high level of customization; broad model choice and format support.Loads one model at a time; longer to set up than Ollama.
vLLMExcellent for concurrency; the right choice for a flagship model at scale; allows a high level of customization.Slower than llama.cpp for single requests; longer to set up than Ollama; fewer supported formats than llama.cpp.

UI / Gateway

There is one standout option: Open WebUI. Open WebUI acts as a gateway and provides granular access control. It also comes with many features, including proxying Ollama (if you use it) as well as other connected OpenAI-compatible APIs.

Open WebUI can be coupled with ComfyUI for image generation, they both interact well, creating a GPT like image generation experience.

RAG

A RAG (Retrieval-Augmented Generation) system is crucial for many use cases, and the more documents your company has, the more important it becomes. RAG lets you store chunks of your documents and, instead of injecting 300 pages into the context window for every question, retrieve only the relevant parts along with metadata (document name, page reference, etc.). Learn more about RAG architecture and best practices.

For development purposes, RAG is often less useful (for example, re,embedding a whole codebase for each branch is not realistic). In contrast, embedding structured documentation for a framework or language can be very helpful.

Before setting up a RAG system (which is straightforward to get started with), the most important part is defining what should be in your database and which metadata you need. Architecture matters: it determines whether your RAG system will be usable in practice.

You can run ChromaDB or Qdrant on your server. Both are good; we recommend Qdrant because it's fast and well designed (this recommendation is opinionated).

Code editors

For local-first, AI-assisted development, we recommend the following setup:

Zed: our default recommendation. It integrates AI natively, supports ACP and other modern tooling, and is extremely fast and responsive. See Zed's AI docs for setup details.

VS Code or JetBrains:

  • Continue for code completion and quick chat-style answers.
  • Cline for more agentic workflows (multi-step tasks, repo-wide changes, tooling orchestration).

MCPs

There are plenty of MCPs on the market. The right ones are simply the ones you need. What matters most is deployment: MCPs can run in multiple ways and support different connection types. We recommend streamable HTTP connections for MCPs that must connect to your server, and running them directly on the server.

A solid set of MCPs, or building the ones you need, can be a game changer. MCPs are mostly limited by imagination: you can develop an MCP to talk to your e-commerce platform, CMS, create and fetch content, create products, and much more. Learn more about the Model Context Protocol specification and explore the MCP Registry for community-built integrations.

Final thoughts

Today, integrating self-managed AI into a company is no longer a moonshot: it’s concrete and achievable. Models are now smart and fast enough to handle complex tasks, a well-configured RAG system can outperform most automated RAG solutions on the market, UI tools have become user-friendly, and ROI is often reached quickly, typically within 6 to 24 months depending on the complexity of your systems.

That said, there are important considerations. The initial setup can be complicated: driver issues, connecting tools, configuring the RAG system, and selecting the right models. For us, the biggest cost is not the hardware, but the time-consuming setup, while ongoing maintenance is usually much simpler.