<span id="ai-attachments-and-rag-end-to-end-guide"></span>
= AI Attachments and RAG: End-to-End Guide =


This document explains how the BioInsights AI uses attached documents (personality files, progress note templates, and patient documents) when generating progress notes. It is intended for '''teams''' (product, engineering, support) and '''clients''' who need a clear picture of how the system works, what the “Full content” vs “RAG” toggle means, and how retrieval is performed.


----
<span id="purpose-and-scope"></span>
== 1. Purpose and scope ==


When generating an AI progress note, the system can use:
* '''Personality documents''' – Guidelines, tone, and instructions attached to an AI personality.
* '''Template documents''' – Instructions and structure attached to a progress note template.
* '''Patient documents''' – Files attached to the current encounter (e.g. lab results, referrals).


The AI does '''not''' receive raw file binaries. Instead, it receives '''text''' that comes from those documents in one of two ways:


# '''Full content''' – The entire document text is fetched and placed in the AI’s context.
# '''RAG (Retrieval-Augmented Generation)''' – Only the most relevant parts of the document (chunks) are retrieved using semantic search and then added to the context.


This guide describes both modes, how documents are prepared (indexing, chunking), how retrieval works (including multi-query and the configurable chunk limit), and the end-to-end flow from setup to AI response.


----
<span id="high-level-overview"></span>
== 2. High-level overview ==


<pre>
┌─────────────────────────────────────────────────────────────────────────────┐
│  Admin configures personality / template with attached files                │
│  → Each file can be "Full content" or "RAG only"                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  Documents are stored and indexed                                            │
│  → Full content: read at request time via Document Extractor                 │
│  → RAG: split into chunks, embedded, stored in Solr vector store             │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  User runs AI progress note (with conversation + optional patient files)     │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  System builds AI context:                                                   │
│  • Full-content files → full text injected into system prompt                │
│  • RAG files → semantic search over conversation messages → top chunks       │
│  • Patient docs → same RAG retrieval (vector search by patient + file IDs)   │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  OpenAI API is called with augmented prompt (no file_search tool)            │
│  → AI generates the note using only the provided context                     │
└─────────────────────────────────────────────────────────────────────────────┘
</pre>


----


<span id="attachment-modes-full-content-vs-rag"></span>
== 3. Attachment modes: “Full content” vs “RAG” ==


When you attach a file to a '''personality''' or a '''progress note template''', you can choose how that file is used:


{| class="wikitable"
|-
! Mode
! What it means
! When to use it
|-
| '''Full content'''
| The '''entire''' document text is loaded and added to the AI’s system prompt.
| Short, critical docs (e.g. short guidelines, required structure) where nothing should be missed.
|-
| '''RAG only'''
| The document is '''not''' sent in full. Only '''relevant chunks''' are retrieved using the current conversation and injected as context.
| Longer docs (e.g. long manuals, large templates) where you want the AI to focus on the parts that match the conversation.
|}


* '''Patient documents''' (files attached to the encounter) are always retrieved via '''RAG''' (vector search); there is no “full content” option for them.
* If you do '''not''' set the toggle (e.g. older templates/personalities with only a list of file IDs), the system treats all non-patient attachments as '''full content''' for backward compatibility.
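The mode resolution (including the backward-compatibility rule) can be sketched as follows. This is an illustrative sketch only: the interface and function names are assumptions, modeled on the <code>file_ids</code> / <code>file_ids_full_content</code> fields mentioned in section 6.

```typescript
interface AttachmentConfig {
  fileIds: string[];              // all attached file IDs
  fileIdsFullContent?: string[];  // subset marked "Full content"; absent on older records
}

function resolveModes(cfg: AttachmentConfig): { fullContent: string[]; ragOnly: string[] } {
  // Backward compatibility: if the toggle was never saved, every
  // non-patient attachment is treated as full content.
  if (cfg.fileIdsFullContent === undefined) {
    return { fullContent: [...cfg.fileIds], ragOnly: [] };
  }
  const full = new Set(cfg.fileIdsFullContent);
  return {
    fullContent: cfg.fileIds.filter((id) => full.has(id)),
    ragOnly: cfg.fileIds.filter((id) => !full.has(id)),
  };
}
```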
 


----


<span id="how-documents-are-prepared-for-the-ai"></span>
== 4. How documents are prepared for the AI ==


<span id="storing-and-indexing"></span>
=== 4.1 Storing and indexing ===


* '''Storage''': Files are stored in the application’s file storage (e.g. S3 or local drive) and linked to the personality or template (or to the patient/encounter for patient documents).
* '''Vector store (for RAG)''':
** Documents that can be used for RAG are '''chunked''' (split into overlapping segments of roughly 1,500 characters).
** Each chunk is converted into a vector (embedding) and stored in Apache Solr (vector core).
** When the user runs the AI, the system runs '''semantic search''' over these chunks using the conversation as the query (see below).
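A minimal sketch of the chunking step, assuming a fixed window with overlap (the guide only says "roughly 1,500 characters"; the 200-character overlap here is an assumption for illustration):

```typescript
// Split text into overlapping fixed-size windows. Sizes are illustrative
// defaults, not the exact values used by the real indexing pipeline.
function chunkText(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // each window starts `step` chars after the last
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each returned chunk would then be embedded and written to the Solr vector core alongside its file ID and chunk index.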


<span id="full-content-path"></span>
=== 4.2 Full-content path ===


* For files marked '''“Full content”''', the system does '''not''' use the vector store at request time.
* It uses the '''Document Extractor''' to read the file (e.g. PDF, DOCX) and get the full text.
* That full text is then injected into the system prompt so the model sees the whole document.


<span id="rag-path"></span>
=== 4.3 RAG path ===


* For files marked '''“RAG only”''' (and for patient documents), the system uses '''only''' the vector store.
* It does '''not''' send the full document. It runs a '''multi-query retrieval''' (see next section), then injects only the retrieved chunks into the prompt, up to a character budget.


----
<span id="how-rag-retrieval-works"></span>
== 5. How RAG retrieval works ==


Previously, retrieval used '''only the last''' user/developer/assistant message and a '''fixed''' number of chunks (e.g. 20). The current behavior is:


<span id="multi-query-retrieval"></span>
=== 5.1 Multi-query retrieval ===


* Every user, developer, and assistant message in the conversation that has non-empty content is used as a separate query.
* For each such message:
** The system generates an embedding for that message.
** It runs vector search in Solr for:
*** Non-patient files (personality/template files that are RAG-only): search by file IDs.
*** Patient files (if present): search by patient ID and attached file IDs.
** Results are collected and then '''merged'''.
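The loop above can be sketched roughly like this. The <code>embed</code> and <code>vectorSearch</code> parameters stand in for the real embedding and Solr clients; all names and signatures here are illustrative assumptions, not the actual API.

```typescript
interface Chunk { fileId: string; chunkIndex: number; score: number; text: string }
type Msg = { role: "user" | "assistant" | "developer"; content: string };

// One vector search per non-empty message; results are merged (dedup in 5.2).
function multiQueryRetrieve(
  messages: Msg[],
  fileIds: string[],
  embed: (query: string) => number[],              // stand-in for the embedding client
  vectorSearch: (vec: number[], ids: string[]) => Chunk[], // stand-in for the Solr client
): Chunk[] {
  const results: Chunk[] = [];
  for (const m of messages) {
    if (!m.content.trim()) continue;   // skip messages with empty content
    const vec = embed(m.content);      // one embedding per message
    results.push(...vectorSearch(vec, fileIds));
  }
  return results;                      // merged results, possibly with duplicates
}
```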


<span id="deduplication-and-limit"></span>
=== 5.2 Deduplication and limit ===


* Chunks are identified by a key (e.g. <code>fileId:chunkIndex</code>). If the same chunk appears in results for multiple messages, it is '''deduplicated''' (one entry per chunk, keeping the best score).
* After merging and sorting by score, the system keeps at most '''N''' chunks for non-patient docs and '''N''' for patient docs, where '''N''' is the '''configurable RAG chunk limit''' (see Configuration below).
* So: “each message” improves recall (earlier context can pull in relevant chunks); the limit and deduplication keep context size and cost under control.
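A sketch of the dedup-then-limit step, using the <code>fileId:chunkIndex</code> key from above (the function name and <code>Chunk</code> shape are assumptions):

```typescript
interface Chunk { fileId: string; chunkIndex: number; score: number; text: string }

function dedupeAndLimit(chunks: Chunk[], limit: number): Chunk[] {
  const best = new Map<string, Chunk>();
  for (const c of chunks) {
    const key = `${c.fileId}:${c.chunkIndex}`;           // one entry per chunk
    const prev = best.get(key);
    if (!prev || c.score > prev.score) best.set(key, c); // keep the best score
  }
  return [...best.values()]
    .sort((a, b) => b.score - a.score)  // highest-scoring chunks first
    .slice(0, limit);                   // keep at most N (the RAG chunk limit)
}
```

In the real system this would run once for non-patient chunks and once for patient chunks, each with the same limit '''N'''.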


<span id="where-the-chunk-limit-came-from"></span>
=== 5.3 Where the chunk limit came from ===


* The previous hardcoded value (e.g. 20) was an arbitrary default, not derived from a formal requirement.
* The design intention was always to make this '''configurable''' via environment (see <code>SOLR_RAG_CHUNK_LIMIT</code> in <code>env.ts</code>). The code now uses that setting everywhere instead of a fixed 20.
* Default in config is '''50'''; the application caps it between 1 and 500 for safety.
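The default-and-cap behavior might be parsed roughly like this (a sketch; the exact parsing in <code>env.ts</code> may differ):

```typescript
// Resolve SOLR_RAG_CHUNK_LIMIT: default 50 when unset or invalid,
// clamped between 1 and 500 for safety, as described above.
function ragChunkLimit(raw: string | undefined): number {
  const n = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(n)) return 50;        // default in config
  return Math.min(500, Math.max(1, n));  // safety cap: 1..500
}
```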


<span id="context-budget"></span>
=== 5.4 Context budget ===


* Even if many chunks are retrieved, the total '''character count''' of the RAG context sent to the model is capped (e.g. <code>RAG_CONTEXT_MAX_CHARS</code> or a model-specific override). Chunks are added in score order until the budget is reached; the rest are dropped and a warning is logged.
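The budget rule can be sketched as a simple packing loop (illustrative; input is assumed to already be sorted by score, highest first):

```typescript
// Add chunk texts in score order until the character budget is reached;
// everything after that is dropped (the real system also logs a warning).
function packContext(texts: string[], maxChars: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const t of texts) {
    if (used + t.length > maxChars) break; // budget reached: drop the rest
    kept.push(t);
    used += t.length;
  }
  return kept;
}
```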


----
<span id="end-to-end-flow-step-by-step"></span>
== 6. End-to-end flow (step by step) ==


# '''Setup (personality / template)'''
#* Admin attaches files and, for each file, chooses '''Full content''' or '''RAG only'''.
#* Data is saved (e.g. <code>file_ids</code> and <code>file_ids_full_content</code>).
#* Files are stored; RAG-only (and patient) documents are chunked and indexed in Solr when the indexing pipeline runs.
# '''User starts progress note'''
#* User may attach patient documents to the encounter.
#* Those are also chunked and indexed in Solr (by patient and file).
# '''User runs AI'''
#* The frontend sends the conversation (messages) and options (e.g. template ID, personality ID, patient ID, attached file IDs).
#* Backend resolves which files are personality/template and which are patient, and which of the former are full-content vs RAG-only.
# '''Building context'''
#* '''Full-content (non-patient) files''': Document Extractor fetches full text; that text is added to the system prompt.
#* '''RAG-only (non-patient) files''' and '''patient files''':
#** If there is already full-document context from the step above, the system runs multi-query RAG for '''RAG-only non-patient''' files and appends those chunks to the system message.
#** If there is '''no''' full-document context (e.g. all attachments are RAG-only or only patient docs), the system uses the '''fallback''' path: multi-query RAG for both non-patient and patient files, then builds a single RAG context (chunks + file list + instructions) and injects it into the system prompt.
# '''API call'''
#* The AI provider (e.g. OpenAI) is called with the '''augmented''' system prompt and the conversation.
#* No “file_search” or similar tool is used; all document content is in the prompt.
# '''Response'''
#* The model generates the note using only the provided context. The UI shows the note and, where applicable, “Documents referenced” so the user knows which files were used.
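The branch in step 4 (full-content context present vs the fallback path) can be sketched as follows. This is a heavily simplified assumption-laden sketch: the real system builds richer context (file list, instructions) and the names here are invented for illustration.

```typescript
interface ContextInputs {
  fullContentText: string;  // extracted text of full-content files ("" if none)
  ragChunks: string[];      // retrieved chunks for RAG-only non-patient files
  patientChunks: string[];  // retrieved chunks for patient files
}

function buildSystemContext(i: ContextInputs): string {
  if (i.fullContentText) {
    // Full-document context exists: append the RAG-only chunks to it.
    return [i.fullContentText, ...i.ragChunks].join("\n\n");
  }
  // Fallback path: no full-document context, so build one combined RAG
  // context (the real system also adds a file list and instructions here).
  return ["Relevant document excerpts:", ...i.ragChunks, ...i.patientChunks].join("\n\n");
}
```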


----
<span id="configuration-for-operations-team"></span>
== 7. Configuration (for operations / team) ==


Relevant environment variables:


{| class="wikitable"
|-
! Variable
! Purpose
! Default / notes
|-
| <code>SOLR_RAG_CHUNK_LIMIT</code>
| Max number of chunks to retrieve per source (non-patient and patient) in RAG.
| Optional; default in code is 50. Capped between 1 and 500.
|-
| <code>RAG_CONTEXT_MAX_CHARS</code>
| Max total characters of RAG context injected into the prompt.
| Optional; can be overridden per model with <code>RAG_CONTEXT_MAX_CHARS_&lt;MODEL&gt;</code>.
|}


Other Solr/vector and embedding settings (e.g. core name, dimensions) are documented in <code>env.ts</code> and in the Solr/vector store docs referenced below.


----
<span id="summary-for-clients"></span>
== 8. Summary for clients ==


* '''Two ways to use a file''': “Full content” (entire document in the prompt) or “RAG only” (only the most relevant parts, based on the conversation).
* '''RAG''' uses the '''whole conversation''' (every user/assistant message) to find relevant sections, not just the last message.
* The '''number of chunks''' used is '''configurable''' (<code>SOLR_RAG_CHUNK_LIMIT</code>), so you can tune how much document material the AI sees.
* '''Patient documents''' are always used via RAG (semantic search); personality and template documents can be either full content or RAG.


----

''Last updated: February 2025''
