<span id="ai-attachments-and-rag-end-to-end-guide"></span>
= AI Attachments and RAG: End-to-End Guide =


This document explains how the BioInsights AI uses attached documents (personality files, progress note templates, and patient documents) when generating progress notes. It is intended for '''teams''' (product, engineering, support) and '''clients''' who need a clear picture of how the system works, what the “Full content” vs “RAG” toggle means, and how retrieval is performed.


----
<span id="purpose-and-scope"></span>
== 1. Purpose and scope ==


When generating an AI progress note, the system can use:
* '''Personality documents''' – Guidelines, tone, and instructions attached to an AI personality.
* '''Template documents''' – Instructions and structure attached to a progress note template.
* '''Patient documents''' – Files attached to the current encounter (e.g. lab results, referrals).


The AI does '''not''' receive raw file binaries. Instead, it receives '''text''' that comes from those documents in one of two ways:


# '''Full content''' – The entire document text is fetched and placed in the AI’s context.
# '''RAG (Retrieval-Augmented Generation)''' – Only the most relevant parts of the document (chunks) are retrieved using semantic search and then added to the context.


This guide describes both modes, how documents are prepared (indexing, chunking), how retrieval works (including multi-query and the configurable chunk limit), and the end-to-end flow from setup to AI response.


----
<span id="high-level-overview"></span>
== 2. High-level overview ==


<pre>
┌─────────────────────────────────────────────────────────────────────────────┐
│  Admin configures personality / template with attached files                │
│  → Each file can be "Full content" or "RAG only"                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  Documents are stored and indexed                                            │
│  → Full content: read at request time via Document Extractor                 │
│  → RAG: split into chunks, embedded, stored in Solr vector store             │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  User runs AI progress note (with conversation + optional patient files)     │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  System builds AI context:                                                   │
│  • Full-content files → full text injected into system prompt                │
│  • RAG files → semantic search over conversation messages → top chunks       │
│  • Patient docs → same RAG retrieval (vector search by patient + file IDs)   │
└─────────────────────────────────────────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│  OpenAI API is called with augmented prompt (no file_search tool)            │
│  → AI generates the note using only the provided context                     │
└─────────────────────────────────────────────────────────────────────────────┘
</pre>


----


<span id="attachment-modes-full-content-vs-rag"></span>
== 3. Attachment modes: “Full content” vs “RAG” ==


When you attach a file to a '''personality''' or a '''progress note template''', you can choose how that file is used:


{| class="wikitable"
|-
! Mode
! What it means
! When to use it
|-
| '''Full content'''
| The '''entire''' document text is loaded and added to the AI’s system prompt.
| Short, critical docs (e.g. short guidelines, required structure) where nothing should be missed.
|-
| '''RAG only'''
| The document is '''not''' sent in full. Only '''relevant chunks''' are retrieved using the current conversation and injected as context.
| Longer docs (e.g. long manuals, large templates) where you want the AI to focus on the parts that match the conversation.
|}


* '''Patient documents''' (files attached to the encounter) are always retrieved via '''RAG''' (vector search); there is no “full content” option for them.
* If you do '''not''' set the toggle (e.g. older templates/personalities with only a list of file IDs), the system treats all non-patient attachments as '''full content''' for backward compatibility.
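The mode resolution (including the backward-compatibility rule) can be sketched as follows. This is an illustrative sketch only: the interface and function names are assumptions, modeled on the <code>file_ids</code> / <code>file_ids_full_content</code> fields mentioned in section 6.

```typescript
interface AttachmentConfig {
  fileIds: string[];              // all attached file IDs
  fileIdsFullContent?: string[];  // subset marked "Full content"; absent on older records
}

function resolveModes(cfg: AttachmentConfig): { fullContent: string[]; ragOnly: string[] } {
  // Backward compatibility: if the toggle was never saved, every
  // non-patient attachment is treated as full content.
  if (cfg.fileIdsFullContent === undefined) {
    return { fullContent: [...cfg.fileIds], ragOnly: [] };
  }
  const full = new Set(cfg.fileIdsFullContent);
  return {
    fullContent: cfg.fileIds.filter((id) => full.has(id)),
    ragOnly: cfg.fileIds.filter((id) => !full.has(id)),
  };
}
```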
 


----


<span id="how-documents-are-prepared-for-the-ai"></span>
== 4. How documents are prepared for the AI ==


<span id="storing-and-indexing"></span>
=== 4.1 Storing and indexing ===


* '''Storage''': Files are stored in the application’s file storage (e.g. S3 or local drive) and linked to the personality or template (or to the patient/encounter for patient documents).
* '''Vector store (for RAG)''':
** Documents that can be used for RAG are '''chunked''' (split into overlapping segments of roughly 1,500 characters).
** Each chunk is converted into a vector (embedding) and stored in Apache Solr (vector core).
** When the user runs the AI, the system runs '''semantic search''' over these chunks using the conversation as the query (see below).
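A minimal sketch of the chunking step, assuming a fixed window with overlap (the guide only says "roughly 1,500 characters"; the 200-character overlap here is an assumption for illustration):

```typescript
// Split text into overlapping fixed-size windows. Sizes are illustrative
// defaults, not the exact values used by the real indexing pipeline.
function chunkText(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // each window starts `step` chars after the last
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each returned chunk would then be embedded and written to the Solr vector core alongside its file ID and chunk index.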


<span id="full-content-path"></span>
=== 4.2 Full-content path ===


* For files marked '''“Full content”''', the system does '''not''' use the vector store at request time.
* It uses the '''Document Extractor''' to read the file (e.g. PDF, DOCX) and get the full text.
* That full text is then injected into the system prompt so the model sees the whole document.


<span id="rag-path"></span>
=== 4.3 RAG path ===


* For files marked '''“RAG only”''' (and for patient documents), the system uses '''only''' the vector store.
* It does '''not''' send the full document. It runs a '''multi-query retrieval''' (see next section), then injects only the retrieved chunks into the prompt, up to a character budget.


----
<span id="how-rag-retrieval-works"></span>
== 5. How RAG retrieval works ==


Previously, retrieval used '''only the last''' user/developer/assistant message and a '''fixed''' number of chunks (e.g. 20). The current behavior is:


<span id="multi-query-retrieval"></span>
=== 5.1 Multi-query retrieval ===


* Every user, developer, and assistant message in the conversation that has non-empty content is used as a separate query.
* For each such message:
** The system generates an embedding for that message.
** It runs vector search in Solr for:
*** Non-patient files (personality/template files that are RAG-only): search by file IDs.
*** Patient files (if present): search by patient ID and attached file IDs.
** Results are collected and then '''merged'''.
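The loop above can be sketched roughly like this. The <code>embed</code> and <code>vectorSearch</code> parameters stand in for the real embedding and Solr clients; all names and signatures here are illustrative assumptions, not the actual API.

```typescript
interface Chunk { fileId: string; chunkIndex: number; score: number; text: string }
type Msg = { role: "user" | "assistant" | "developer"; content: string };

// One vector search per non-empty message; results are merged (dedup in 5.2).
function multiQueryRetrieve(
  messages: Msg[],
  fileIds: string[],
  embed: (query: string) => number[],              // stand-in for the embedding client
  vectorSearch: (vec: number[], ids: string[]) => Chunk[], // stand-in for the Solr client
): Chunk[] {
  const results: Chunk[] = [];
  for (const m of messages) {
    if (!m.content.trim()) continue;   // skip messages with empty content
    const vec = embed(m.content);      // one embedding per message
    results.push(...vectorSearch(vec, fileIds));
  }
  return results;                      // merged results, possibly with duplicates
}
```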


<span id="deduplication-and-limit"></span>
=== 5.2 Deduplication and limit ===


* Chunks are identified by a key (e.g. <code>fileId:chunkIndex</code>). If the same chunk appears in results for multiple messages, it is '''deduplicated''' (one entry per chunk, keeping the best score).
* After merging and sorting by score, the system keeps at most '''N''' chunks for non-patient docs and '''N''' for patient docs, where '''N''' is the '''configurable RAG chunk limit''' (see Configuration below).
* So: “each message” improves recall (earlier context can pull in relevant chunks); the limit and deduplication keep context size and cost under control.
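A sketch of the dedup-then-limit step, using the <code>fileId:chunkIndex</code> key from above (the function name and <code>Chunk</code> shape are assumptions):

```typescript
interface Chunk { fileId: string; chunkIndex: number; score: number; text: string }

function dedupeAndLimit(chunks: Chunk[], limit: number): Chunk[] {
  const best = new Map<string, Chunk>();
  for (const c of chunks) {
    const key = `${c.fileId}:${c.chunkIndex}`;           // one entry per chunk
    const prev = best.get(key);
    if (!prev || c.score > prev.score) best.set(key, c); // keep the best score
  }
  return [...best.values()]
    .sort((a, b) => b.score - a.score)  // highest-scoring chunks first
    .slice(0, limit);                   // keep at most N (the RAG chunk limit)
}
```

In the real system this would run once for non-patient chunks and once for patient chunks, each with the same limit '''N'''.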


<span id="where-the-chunk-limit-came-from"></span>
=== 5.3 Where the chunk limit came from ===


* The previous hardcoded value (e.g. 20) was an arbitrary default, not derived from a formal requirement.
* The design intention was always to make this '''configurable''' via environment (see <code>SOLR_RAG_CHUNK_LIMIT</code> in <code>env.ts</code>). The code now uses that setting everywhere instead of a fixed 20.
* Default in config is '''50'''; the application caps it between 1 and 500 for safety.
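The default-and-cap behavior might be parsed roughly like this (a sketch; the exact parsing in <code>env.ts</code> may differ):

```typescript
// Resolve SOLR_RAG_CHUNK_LIMIT: default 50 when unset or invalid,
// clamped between 1 and 500 for safety, as described above.
function ragChunkLimit(raw: string | undefined): number {
  const n = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(n)) return 50;        // default in config
  return Math.min(500, Math.max(1, n));  // safety cap: 1..500
}
```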


<span id="context-budget"></span>
=== 5.4 Context budget ===


* Even if many chunks are retrieved, the total '''character count''' of the RAG context sent to the model is capped (e.g. <code>RAG_CONTEXT_MAX_CHARS</code> or a model-specific override). Chunks are added in score order until the budget is reached; the rest are dropped and a warning is logged.
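The budget rule can be sketched as a simple packing loop (illustrative; input is assumed to already be sorted by score, highest first):

```typescript
// Add chunk texts in score order until the character budget is reached;
// everything after that is dropped (the real system also logs a warning).
function packContext(texts: string[], maxChars: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const t of texts) {
    if (used + t.length > maxChars) break; // budget reached: drop the rest
    kept.push(t);
    used += t.length;
  }
  return kept;
}
```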


----
<span id="end-to-end-flow-step-by-step"></span>
== 6. End-to-end flow (step by step) ==


# '''Setup (personality / template)'''
#* Admin attaches files and, for each file, chooses '''Full content''' or '''RAG only'''.
#* Data is saved (e.g. <code>file_ids</code> and <code>file_ids_full_content</code>).
#* Files are stored; RAG-only (and patient) documents are chunked and indexed in Solr when the indexing pipeline runs.
# '''User starts progress note'''
#* User may attach patient documents to the encounter.
#* Those are also chunked and indexed in Solr (by patient and file).
# '''User runs AI'''
#* The frontend sends the conversation (messages) and options (e.g. template ID, personality ID, patient ID, attached file IDs).
#* Backend resolves which files are personality/template and which are patient, and which of the former are full-content vs RAG-only.
# '''Building context'''
#* '''Full-content (non-patient) files''': Document Extractor fetches full text; that text is added to the system prompt.
#* '''RAG-only (non-patient) files''' and '''patient files''':
#** If there is already full-document context from the step above, the system runs multi-query RAG for '''RAG-only non-patient''' files and appends those chunks to the system message.
#** If there is '''no''' full-document context (e.g. all attachments are RAG-only or only patient docs), the system uses the '''fallback''' path: multi-query RAG for both non-patient and patient files, then builds a single RAG context (chunks + file list + instructions) and injects it into the system prompt.
# '''API call'''
#* The AI provider (e.g. OpenAI) is called with the '''augmented''' system prompt and the conversation.
#* No “file_search” or similar tool is used; all document content is in the prompt.
# '''Response'''
#* The model generates the note using only the provided context. The UI shows the note and, where applicable, “Documents referenced” so the user knows which files were used.
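The branch in step 4 (full-content context present vs the fallback path) can be sketched as follows. This is a heavily simplified assumption-laden sketch: the real system builds richer context (file list, instructions) and the names here are invented for illustration.

```typescript
interface ContextInputs {
  fullContentText: string;  // extracted text of full-content files ("" if none)
  ragChunks: string[];      // retrieved chunks for RAG-only non-patient files
  patientChunks: string[];  // retrieved chunks for patient files
}

function buildSystemContext(i: ContextInputs): string {
  if (i.fullContentText) {
    // Full-document context exists: append the RAG-only chunks to it.
    return [i.fullContentText, ...i.ragChunks].join("\n\n");
  }
  // Fallback path: no full-document context, so build one combined RAG
  // context (the real system also adds a file list and instructions here).
  return ["Relevant document excerpts:", ...i.ragChunks, ...i.patientChunks].join("\n\n");
}
```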


----
<span id="configuration-for-operations-team"></span>
== 7. Configuration (for operations / team) ==


Relevant environment variables:


{| class="wikitable"
|-
! Variable
! Purpose
! Default / notes
|-
| <code>SOLR_RAG_CHUNK_LIMIT</code>
| Max number of chunks to retrieve per source (non-patient and patient) in RAG.
| Optional; default in code is 50. Capped between 1 and 500.
|-
| <code>RAG_CONTEXT_MAX_CHARS</code>
| Max total characters of RAG context injected into the prompt.
| Optional; can be overridden per model with <code>RAG_CONTEXT_MAX_CHARS_&lt;MODEL&gt;</code>.
|}


Other Solr/vector and embedding settings (e.g. core name, dimensions) are documented in <code>env.ts</code> and in the Solr/vector store docs referenced below.


----
<span id="summary-for-clients"></span>
== 8. Summary for clients ==


* '''Two ways to use a file''': “Full content” (entire document in the prompt) or “RAG only” (only the most relevant parts, based on the conversation).
* '''RAG''' uses the '''whole conversation''' (every user/assistant message) to find relevant sections, not just the last message.
* The '''number of chunks''' used is '''configurable''' (<code>SOLR_RAG_CHUNK_LIMIT</code>), so you can tune how much document material the AI sees.
* '''Patient documents''' are always used via RAG (semantic search); personality and template documents can be either full content or RAG.


----

''Last updated: February 2025''
