HealthcareOn-prem#LLM/RAG#On-prem/Edge#MLOps

A clinical-document assistant that never let patient data leave the building

A European healthcare provider

[A European healthcare provider] · LLM/RAG · On-prem/Edge · MLOps · On-prem

A clinical-document assistant that never let patient data leave the building

Context

A European healthcare provider with a large internal corpus of clinical documents — patient records, care guidelines, internal protocols — wanted clinicians to be able to ask questions in plain language and get grounded, cited answers instead of hunting through document systems by hand. The catch was non-negotiable from day one: under GDPR and local health-data regulations, this data could not leave their network. Not to a cloud API, not for "processing," not at all.

Challenge

Most off-the-shelf LLM assistants assume a hosted API. For this client that assumption was a hard stop. Sending patient information to an external endpoint was legally impossible and reputationally unthinkable. They needed the capability of a modern language model with a deployment posture that kept every byte of clinical data inside their own infrastructure — and the system still had to be accurate enough that a clinician would trust a cited answer over their own manual search.

Approach

We scoped the project around the constraint rather than treating it as a footnote. If nothing could leave the network, the architecture had to be built on open-weights models running on the client's own hardware from the start — retrofitting privacy onto a cloud design was never on the table.

The first decisions were about trust, not technology. A wrong answer in a clinical setting is worse than no answer, so we built the system to ground every response in retrieved source documents and to cite them, and to refuse when retrieval confidence was low rather than guess. We worked with the client's clinical staff to assemble an evaluation set of real questions tied to the documents that answered them, so we could measure retrieval quality objectively instead of by impression.

For the model, we selected an open-weights model in the Llama class and fine-tuned it on a de-identified corpus to adapt it to clinical language and the house style of their documents. De-identification for the fine-tuning data was itself part of the engagement — training material was stripped of identifiers before it was ever used.

Architecture

Everything ran on the client's own GPU servers, inside their network, with no external dependency at inference time.

Inference: an open-weights model (Llama-class), fine-tuned on the de-identified corpus, served from the client's on-prem GPU servers behind their firewall.
Retrieval: documents were chunked along their natural structure (sections, protocol steps, guideline clauses) rather than fixed character windows, embedded, and indexed in a vector store that also lived entirely on-prem. Metadata — source, section, effective date — was preserved so answers could cite and filter to current material.
Grounding and refusal: responses were generated only from retrieved context, with inline citations, and a retrieval-score threshold that triggered an explicit "I don't have a document covering this" rather than a fabricated answer.
MLOps, locally: because there was no managed cloud platform to lean on, we set up the retraining, re-indexing, and monitoring pipeline to run on the client's own infrastructure, so they could refresh the index and re-evaluate the model as their document corpus evolved.

The defining property of the architecture is what's absent: there is no path by which clinical data reaches an external network. That was the whole point, and it's verifiable rather than promised.

Results

100% on-prem inference — no patient data left the client's network at any point.
Clinician search time for answers in the document corpus dropped from minutes to seconds.
GDPR and local health-data requirements met by construction, because the data never moved.
Grounded, cited answers with an explicit refusal path, so clinicians could verify every response against its source.
A retraining and re-indexing pipeline the client's own team can run to keep the assistant current.

Stack

Open-weights LLM (Llama-class), fine-tuned · vLLM-style on-prem inference serving · on-prem vector store · structure-aware RAG pipeline · on-prem MLOps (retraining, re-indexing, monitoring) · client-owned GPU servers.

This is the work we lead with: a real regulatory constraint turned into an architecture, not an excuse. If your data can't leave the building, see how we think about on-prem AI →, or read the hybrid cost-rescue case for the other side of the deploy-anywhere story.

Have a similar problem?

Talk to us