= AI Attachments and RAG: End-to-End Guide =
This document explains how the BioInsights AI uses attached documents (personality files, progress note templates, and patient documents) when generating progress notes. It is intended for '''teams''' (product, engineering, support) and '''clients''' who need a clear picture of how the system works, what the "Full content" vs "RAG" toggle means, and how retrieval is performed.
----


== 1. Purpose and scope ==
When generating an AI progress note, the system can use:
* '''Personality documents''' – Guidelines, tone, and instructions attached to an AI personality.
* '''Template documents''' – Instructions and structure attached to a progress note template.
* '''Patient documents''' – Files attached to the current encounter (e.g. lab results, referrals).
The AI does '''not''' receive raw file binaries. Instead, it receives '''text''' that comes from those documents in one of two ways:
# '''Full content''' – The entire document text is fetched and placed in the AI's context.
# '''RAG (Retrieval-Augmented Generation)''' – Only the most relevant parts of the document (chunks) are retrieved using semantic search and then added to the context.
This guide describes both modes, how documents are prepared (indexing, chunking), how retrieval works (including multi-query and the configurable chunk limit), and the end-to-end flow from setup to AI response.
----


== 2. High-level overview ==


<pre>
+-----------------------------------------------------------------------------+
|  Admin configures personality / template with attached files                |
|  -> Each file can be "Full content" or "RAG only"                           |
+-----------------------------------------------------------------------------+
                                        |
                                        v
+-----------------------------------------------------------------------------+
|  Documents are stored and indexed                                           |
|  -> Full content: read at request time via Document Extractor               |
|  -> RAG: split into chunks, embedded, stored in Solr vector store           |
+-----------------------------------------------------------------------------+
                                        |
                                        v
+-----------------------------------------------------------------------------+
|  User runs AI progress note (with conversation + optional patient files)    |
+-----------------------------------------------------------------------------+
                                        |
                                        v
+-----------------------------------------------------------------------------+
|  System builds AI context:                                                  |
|  * Full-content files -> full text injected into system prompt              |
|  * RAG files -> semantic search over conversation messages -> top chunks    |
|  * Patient docs -> same RAG retrieval (vector search by patient + file IDs) |
+-----------------------------------------------------------------------------+
                                        |
                                        v
+-----------------------------------------------------------------------------+
|  OpenAI API is called with augmented prompt (no file_search tool)           |
|  -> AI generates the note using only the provided context                   |
+-----------------------------------------------------------------------------+
</pre>
 
----
 
== 3. Attachment modes: "Full content" vs "RAG" ==
 
When you attach a file to a '''personality''' or a '''progress note template''', you can choose how that file is used:
 
{| border="1" cellpadding="5" cellspacing="0"
|-
| '''Mode'''
| '''What it means'''
| '''When to use it'''
|-
| '''Full content'''
| The '''entire''' document text is loaded and added to the AI's system prompt.
| Short, critical docs (e.g. short guidelines, required structure) where nothing should be missed.
|-
| '''RAG only'''
| The document is '''not''' sent in full. Only '''relevant chunks''' are retrieved using the current conversation and injected as context.
| Longer docs (e.g. long manuals, large templates) where you want the AI to focus on the parts that match the conversation.
|}
 
* '''Patient documents''' (files attached to the encounter) are always retrieved via '''RAG''' (vector search); there is no "full content" option for them.
* If you do '''not''' set the toggle (e.g. older templates/personalities with only a list of file IDs), the system treats all non-patient attachments as '''full content''' for backward compatibility.
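
As a minimal sketch of the backward-compatibility rule above (assuming the stored fields <code>file_ids</code> and <code>file_ids_full_content</code> mentioned in section 6; the type and function names are illustrative, not the actual implementation):

```typescript
type AttachmentMode = "full_content" | "rag_only";

interface AttachmentConfig {
  file_ids: string[];                // all attached files
  file_ids_full_content?: string[];  // subset marked "Full content" (absent on older records)
}

function resolveMode(fileId: string, config: AttachmentConfig): AttachmentMode {
  // Backward compatibility: if the toggle was never set (older templates /
  // personalities), treat every non-patient attachment as full content.
  if (config.file_ids_full_content === undefined) return "full_content";
  return config.file_ids_full_content.includes(fileId) ? "full_content" : "rag_only";
}
```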
 
----


== 4. How documents are prepared for the AI ==


=== 4.1 Storing and indexing ===
* '''Storage''': Files are stored in the application's file storage (e.g. S3 or local drive) and linked to the personality or template (or to the patient/encounter for patient documents).
* '''Vector store (for RAG)''':
** Documents that can be used for RAG are '''chunked''' (split into overlapping segments of roughly 1,500 characters).
** Each chunk is converted into a '''vector (embedding)''' and stored in '''Apache Solr''' (vector core).
** When the user runs the AI, the system runs '''semantic search''' over these chunks using the conversation as the query (see below).
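
A minimal sketch of the chunking step, assuming simple character-based splitting; the roughly-1,500-character size comes from this guide, while the overlap value and function name are illustrative:

```typescript
// Split text into overlapping segments of roughly `chunkSize` characters.
// The 200-character overlap is an assumed value for illustration.
function chunkText(text: string, chunkSize = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap; // each chunk starts `step` chars after the previous one
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Each chunk would then be embedded and written to the Solr vector core alongside its file ID and chunk index.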


=== 4.2 Full-content path ===
* For files marked '''"Full content"''', the system does '''not''' use the vector store at request time.
* It uses the '''Document Extractor''' to read the file (e.g. PDF, DOCX) and get the full text.
* That full text is then injected into the system prompt so the model sees the whole document.


=== 4.3 RAG path ===
* For files marked '''"RAG only"''' (and for patient documents), the system uses '''only''' the vector store.
* It does '''not''' send the full document. It runs a '''multi-query retrieval''' (see next section), then injects only the retrieved chunks into the prompt, up to a character budget.
----


== 5. How RAG retrieval works ==
Previously, retrieval used '''only the last''' user/developer/assistant message and a '''fixed''' number of chunks (e.g. 20). The current behavior is:


=== 5.1 Multi-query retrieval ===
* '''Every''' user, developer, and assistant message in the conversation that has non-empty content is used as a separate query.
* For each such message:
** The system generates an '''embedding''' for that message.
** It runs '''vector search''' in Solr for:
*** '''Non-patient files''' (personality/template files that are RAG-only): search by file IDs.
*** '''Patient files''' (if present): search by patient ID and attached file IDs.
** Results are collected and then '''merged'''.
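
The multi-query loop above can be sketched as follows. The real embedding and Solr calls are asynchronous network requests; here they are synchronous stand-ins, and all names are illustrative:

```typescript
interface Message { role: "user" | "assistant" | "developer"; content: string }
interface ScoredChunk { key: string; score: number; text: string }

// One vector search per non-empty message; results are merged into one list.
// Deduplication and the chunk limit are applied afterwards (section 5.2).
function multiQueryRetrieve(
  messages: Message[],
  embed: (text: string) => number[],              // stand-in for the embedding call
  vectorSearch: (embedding: number[]) => ScoredChunk[], // stand-in for the Solr query
): ScoredChunk[] {
  const results: ScoredChunk[] = [];
  for (const message of messages) {
    if (message.content.trim() === "") continue; // only non-empty content is used
    const embedding = embed(message.content);
    results.push(...vectorSearch(embedding));
  }
  return results;
}
```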


=== 5.2 Deduplication and limit ===
* Chunks are identified by a key (e.g. <code>fileId:chunkIndex</code>). If the same chunk appears in results for multiple messages, it is '''deduplicated''' (one entry per chunk, keeping the best score).
* After merging and sorting by score, the system keeps at most '''N''' chunks for non-patient docs and '''N''' for patient docs, where '''N''' is the '''configurable RAG chunk limit''' (see Configuration below).
* In short: querying with each message improves recall (earlier context can pull in relevant chunks), while deduplication and the limit keep context size and cost under control.
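
A sketch of the deduplication-and-limit step, keyed by <code>fileId:chunkIndex</code> as described above (names are illustrative):

```typescript
interface ScoredChunk { key: string; score: number; text: string }

function dedupeAndLimit(chunks: ScoredChunk[], limit: number): ScoredChunk[] {
  // One entry per fileId:chunkIndex key, keeping the best score seen.
  const best = new Map<string, ScoredChunk>();
  for (const chunk of chunks) {
    const existing = best.get(chunk.key);
    if (existing === undefined || chunk.score > existing.score) {
      best.set(chunk.key, chunk);
    }
  }
  // Sort highest score first, then keep at most `limit` chunks.
  return [...best.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

This is applied separately to non-patient and patient results, so each source can contribute up to N chunks.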


=== 5.3 Where the chunk limit came from ===
* The previous hardcoded value (e.g. 20) was an arbitrary default, not derived from a formal requirement.
* The design intention was always to make this '''configurable''' via environment (see <code>SOLR_RAG_CHUNK_LIMIT</code> in <code>env.ts</code>). The code now uses that setting everywhere instead of a fixed 20.
* Default in config is '''50'''; the application caps it between 1 and 500 for safety.
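
A sketch of how <code>SOLR_RAG_CHUNK_LIMIT</code> could be parsed with the default of 50 and the 1–500 cap; the helper name is illustrative, not the actual <code>env.ts</code> code:

```typescript
// Parse the raw env value; fall back to 50 when unset or invalid,
// and clamp valid values into the 1..500 safety range.
function resolveChunkLimit(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return 50; // default
  const parsed = Number(raw);
  if (!Number.isFinite(parsed)) return 50;               // default on bad input
  return Math.min(500, Math.max(1, Math.trunc(parsed))); // cap between 1 and 500
}
```

In <code>env.ts</code> this would presumably be fed from <code>process.env.SOLR_RAG_CHUNK_LIMIT</code>.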


=== 5.4 Context budget ===
* Even if many chunks are retrieved, the total '''character count''' of the RAG context sent to the model is capped (e.g. <code>RAG_CONTEXT_MAX_CHARS</code> or a model-specific override). Chunks are added in score order until the budget is reached; the rest are dropped and a warning is logged.
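
A sketch of the character budget: chunks are consumed in score order until the cap would be exceeded, and the remainder is dropped (names are illustrative; per the description above, the real code also logs a warning for dropped chunks):

```typescript
interface ScoredChunk { key: string; score: number; text: string }

function applyContextBudget(
  chunksByScore: ScoredChunk[], // already sorted, highest score first
  maxChars: number,             // e.g. RAG_CONTEXT_MAX_CHARS or a per-model override
): { kept: ScoredChunk[]; dropped: number } {
  const kept: ScoredChunk[] = [];
  let used = 0;
  for (const chunk of chunksByScore) {
    if (used + chunk.text.length > maxChars) break; // budget exhausted
    kept.push(chunk);
    used += chunk.text.length;
  }
  return { kept, dropped: chunksByScore.length - kept.length };
}
```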
----


== 6. End-to-end flow (step by step) ==
# '''Setup (personality / template)'''
#* Admin attaches files and, for each file, chooses '''Full content''' or '''RAG only'''.
#* Data is saved (e.g. <code>file_ids</code> and <code>file_ids_full_content</code>).
#* Files are stored; RAG-only (and patient) documents are chunked and indexed in Solr when the indexing pipeline runs.
# '''User starts progress note'''
#* User may attach '''patient documents''' to the encounter.
#* Those are also chunked and indexed in Solr (by patient and file).
# '''User runs AI'''
#* The frontend sends the conversation (messages) and options (e.g. template ID, personality ID, patient ID, attached file IDs).
#* Backend resolves which files are personality/template and which are patient, and which of the former are full-content vs RAG-only.
# '''Building context'''
#* '''Full-content (non-patient) files''': Document Extractor fetches full text; that text is added to the system prompt.
#* '''RAG-only (non-patient) files''' and '''patient files''':
#** If there is already full-document context from step above, the system runs multi-query RAG for '''RAG-only non-patient''' files and appends those chunks to the system message.
#** If there is '''no''' full-document context (e.g. all attachments are RAG-only or only patient docs), the system uses the '''fallback''' path: multi-query RAG for both non-patient and patient files, then builds a single RAG context (chunks + file list + instructions) and injects it into the system prompt.
# '''API call'''
#* The AI provider (e.g. OpenAI) is called with the '''augmented''' system prompt and the conversation.
#* No "file_search" or similar tool is used; all document content is in the prompt.
# '''Response'''
#* The model generates the note using only the provided context. The UI shows the note and, where applicable, "Documents referenced" so the user knows which files were used.
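
The branch in step 4 can be summarized as a small decision function (a sketch for illustration only; names are not from the actual code):

```typescript
interface ContextPlan {
  kind: "append_rag" | "fallback_rag";
  sources: string[]; // which document sources go through multi-query RAG
}

function planRagContext(hasFullDocumentContext: boolean): ContextPlan {
  if (hasFullDocumentContext) {
    // Full text is already in the system prompt; append RAG chunks only for
    // the RAG-only non-patient files.
    return { kind: "append_rag", sources: ["non_patient_rag"] };
  }
  // No full-document context: fallback path builds a single RAG context
  // (chunks + file list + instructions) covering both sources.
  return { kind: "fallback_rag", sources: ["non_patient_rag", "patient"] };
}
```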
----


== 7. Configuration (for operations / team) ==
Relevant environment variables:
{| border="1" cellpadding="5" cellspacing="0"
|-
| '''Variable'''
| '''Purpose'''
| '''Default / notes'''
|-
| <code>SOLR_RAG_CHUNK_LIMIT</code>
| Max number of chunks to retrieve per source (non-patient and patient) in RAG.
| Optional; default in code is 50. Capped between 1 and 500.
|-
| <code>RAG_CONTEXT_MAX_CHARS</code>
| Max total characters of RAG context injected into the prompt.
| Optional; can be overridden per model with <code>RAG_CONTEXT_MAX_CHARS_&lt;MODEL&gt;</code>.
|}
Other Solr/vector and embedding settings (e.g. core name, dimensions) are documented in <code>env.ts</code> and in the Solr/vector store docs referenced below.
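
A sketch of the per-model override lookup for <code>RAG_CONTEXT_MAX_CHARS</code>; the key-normalization rule (uppercase, non-alphanumerics to underscores) is an assumption for the example, not the documented behavior:

```typescript
// Prefer RAG_CONTEXT_MAX_CHARS_<MODEL> if set, then the global
// RAG_CONTEXT_MAX_CHARS, then a built-in default.
function resolveContextMaxChars(
  env: Record<string, string | undefined>,
  model: string,
  fallbackDefault: number,
): number {
  const modelKey =
    "RAG_CONTEXT_MAX_CHARS_" + model.toUpperCase().replace(/[^A-Z0-9]/g, "_");
  const raw = env[modelKey] ?? env["RAG_CONTEXT_MAX_CHARS"];
  const parsed = Number(raw);
  return raw !== undefined && Number.isFinite(parsed) ? parsed : fallbackDefault;
}
```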
----


== 8. Summary for clients ==


* '''Two ways to use a file''': "Full content" (entire document in the prompt) or "RAG only" (only the most relevant parts, based on the conversation).
* '''RAG''' uses the '''whole conversation''' (every user/developer/assistant message) to find relevant sections, not just the last message.
* The '''number of chunks''' used is '''configurable''' (<code>SOLR_RAG_CHUNK_LIMIT</code>), so you can tune how much document material the AI sees.
* '''Patient documents''' are always used via RAG (semantic search); personality and template documents can be either full content or RAG.

== 9. Related documentation ==
 
----