RAG Security, Classification, and Knowledge Graphs
The RAG foundations and AI features described in Parts 1 and 2 are only trustworthy with proper security, classification, and provenance. A vector search that returns sensitive salary data to every user, or an agent that summarises confidential deal terms into a public-facing Pulse, undermines the entire system. This document covers the three pillars that make RAG production-ready: data classification (per-chunk, not per-file), citations (verifiable provenance), and soft references (an emergent knowledge graph that grows as agents work).
Data Classification
The Problem
The platform has property-level data classification (classification = PII, Financial, Secret, etc.) that controls what users and roles can see via ClassificationFilter. But when files are chunked for RAG ingestion, the chunks lose this context. A salary spreadsheet uploaded to an entity gets chunked into osy.FileChunk rows and embedded into osy.EntityContext — and those chunks contain PII/Financial data that the vector search will happily return to any user who can see the parent entity.
Structured property classification is enforced. Unstructured file content classification is not. That’s the gap.
Design
Both osy.EntityContext and osy.FileChunk carry a DataClassification field — the same integer scale (0-6) used by PropertyMetadata.DataClassification:
ensure entity osy.EntityContext:
    ...existing properties...
    DataClassification: Int32 default = 0
    ClassificationSource: String(50)  # "auto", "inherited", "manual"

ensure entity osy.FileChunk:
    ...existing properties...
    DataClassification: Int32 default = 0
    ClassificationSource: String(50)
The classification on osy.FileChunk is the source of truth. The osy.EntityContext row copies it at embedding time. This means re-chunking (new strategy) inherits classification without re-classifying, and re-embedding (new model) carries the classification forward.
Classification Sources
Classification can come from three sources, in priority order:
| Source | When | Example |
|---|---|---|
| inherited | The parent entity’s collection property has a classification | An HrDocuments collection classified as PII: all chunks from files in this collection inherit PII |
| auto | LLM-based classification during chunking | The chunk contains salary figures or SSNs, and the classifier detects this |
| manual | User or admin explicitly classifies a file | Override for edge cases |
Inherited classification is the cheapest and most reliable. If the app builder declared:
entity Employee:
    Name: String required
    PublicDocuments: collection(osy.FileAsset, Owner)
    HrDocuments: collection(osy.FileAsset, Owner) classification = PII
Then all chunks from files uploaded to HrDocuments inherit classification = PII. No LLM needed. The classification follows the collection property’s declaration.
Auto-classification handles the case where the collection itself isn’t classified but the content is sensitive. The distillation LLM classifies each chunk during processing.
Per-Chunk Classification
A single file often contains mixed sensitivity — a report with a public executive summary, an internal project timeline, and a confidential salary appendix. Rather than classifying the entire file at the highest level (which makes the public summary invisible to users without confidential clearance), the system classifies at the chunk level.
This means:
- A user with Internal clearance retrieves the executive summary and timeline chunks but not the salary appendix
- The agent assembling context for a PulseServiceRole sees only what that role’s classification level permits, per chunk
- Retrieval is maximally useful at every clearance level instead of all-or-nothing per file
The Paragraph-Level Classification Pipeline
The correct granularity for classification is the paragraph, not the section. A section about a project might contain one paragraph mentioning a specific salary (PII) surrounded by public project updates. Only the salary paragraph should be classified as PII.
The pipeline has four passes:
Pass 1: Paragraph-level classification (LLM)
|
Pass 2: Group into classification segments
|
Pass 3: Chunk within each segment
|
Pass 4: Merge undersized adjacent chunks (same classification only)
Pass 1 — Paragraph-level classification. The document is pre-split into paragraphs (by blank lines / markdown blocks) and sent with index markers. The LLM returns a classification per paragraph:
{
  "paragraphs": [
    { "index": 0, "classification": 0 },
    { "index": 1, "classification": 0 },
    { "index": 2, "classification": 3 },
    { "index": 3, "classification": 3 },
    { "index": 4, "classification": 0 },
    { "index": 5, "classification": 0 }
  ]
}
The classification prompt is auto-generated from the app’s seeded osy.DataClassification choice items, or uses a custom classificationPrompt from the classifications: block (see below).
Pass 2 — Group into classification segments. Adjacent paragraphs with the same classification level form a segment:
Segment 1: paragraphs 0-1, classification = Public (0)
Segment 2: paragraphs 2-3, classification = PII (3)
Segment 3: paragraphs 4-5, classification = Public (0)
A “Public / Internal (2 sentences) / Public again” document produces 3 segments, which become at least 3 chunks. The 2 Internal sentences become a small chunk. A small chunk is better than leaking classified content into a Public chunk.
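As a minimal sketch of Pass 2, the grouping can be expressed as a run-length collapse over the Pass 1 output. The Segment type and the group_into_segments name are illustrative, not part of the platform; only the paragraph JSON shape comes from the example above.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Segment:
    start: int           # index of the first paragraph (inclusive)
    end: int             # index of the last paragraph (inclusive)
    classification: int  # the single level shared by every paragraph


def group_into_segments(paragraphs: list[dict]) -> list[Segment]:
    """Collapse runs of adjacent paragraphs that share a classification."""
    ordered = sorted(paragraphs, key=lambda p: p["index"])
    segments = []
    # groupby only groups *adjacent* equal keys, which is exactly the rule:
    # a Public / PII / Public document yields three segments, not two.
    for level, run in groupby(ordered, key=lambda p: p["classification"]):
        run = list(run)
        segments.append(Segment(run[0]["index"], run[-1]["index"], level))
    return segments


# Pass 1 output from the example above.
pass1 = [{"index": i, "classification": c} for i, c in enumerate([0, 0, 3, 3, 0, 0])]
segments = group_into_segments(pass1)
```

For the six-paragraph example this produces the same three segments listed above: paragraphs 0-1 at level 0, 2-3 at level 3, and 4-5 at level 0.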
Pass 3 — Chunk within each segment. The recursive chunker runs independently within each segment using the configured chunkSize and chunkOverlap. A large Public segment might produce 5 chunks. A tiny Internal segment (2 sentences) produces 1 small chunk. The chunker does not see across segment boundaries.
Pass 4 — Merge undersized adjacent chunks (same classification only). After chunking, adjacent chunks with the same classification that are both under half the target chunk size are merged. This handles the case where Pass 2 created many tiny same-classification segments. Never merge across classification boundaries. Two adjacent chunks with different classifications stay separate regardless of size.
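The merge rule in Pass 4 can be sketched as a single left-to-right sweep. This is an illustration under assumptions: chunk size is measured in characters, merged text is joined with a blank line, and the Chunk type and merge_undersized name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    classification: int


def merge_undersized(chunks: list[Chunk], target_size: int) -> list[Chunk]:
    """Merge adjacent chunks when both are under half the target size
    and share a classification. Never merges across classifications."""
    half = target_size / 2
    merged: list[Chunk] = []
    for chunk in chunks:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.classification == chunk.classification
            and len(prev.text) < half
            and len(chunk.text) < half
        ):
            # Same level and both small: fold into the previous chunk.
            merged[-1] = Chunk(prev.text + "\n\n" + chunk.text, prev.classification)
        else:
            merged.append(Chunk(chunk.text, chunk.classification))
    return merged


small = [Chunk("a" * 10, 0), Chunk("b" * 10, 0), Chunk("c" * 10, 3), Chunk("d" * 10, 0)]
result = merge_undersized(small, target_size=100)
```

In this run the two leading Public chunks merge, while the PII chunk and the trailing Public chunk stay separate even though all of them are tiny, because the PII chunk sits between them.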
The classification split priority is: (1) classification boundary, (2) header split, (3) paragraph split, (4) sentence split. Classification boundaries take precedence because mixing classification levels within a chunk would force the entire chunk to the highest level, defeating the purpose.
The invariant: every chunk has exactly one classification level.
Windowed Processing for Large Documents
The initial implementation truncated documents at 50K characters before sending them to the LLM for classification. A 200-page PDF might be 400K+ characters — so the LLM only classified the first ~25 pages. Everything after got DataClassification = 0. That’s a real gap.
The fix is windowed classification:
200-page PDF
|
Text extraction (per-page with PageInfo boundaries)
|
Split into classification windows (~15-20 pages each)
Windows overlap by 2 pages for boundary continuity
|
Each window -> LLM paragraph-level classification + window summary
|
Stitch results across windows (higher classification wins on overlap)
|
Four-pass chunking pipeline
|
Final LLM call: synthesize all window summaries into one file-level summary
Window sizing: Each window targets ~40K characters (well within LLM context limits). Windows split at page boundaries, not mid-paragraph. Each window overlaps with the next by 2 pages.
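One way to picture the window sizing rule is a greedy page packer. The sketch below assumes per-page character counts are available from text extraction; the split_into_windows name and its parameters are hypothetical, and the real implementation may size windows differently.

```python
def split_into_windows(
    page_lengths: list[int],
    target_chars: int = 40_000,
    overlap_pages: int = 2,
) -> list[tuple[int, int]]:
    """Return (first_page, last_page) pairs, 0-based and inclusive.

    Windows only ever break at page boundaries, and each window re-reads
    the last `overlap_pages` pages of the previous one.
    """
    windows = []
    n = len(page_lengths)
    start = 0
    while start < n:
        end = start
        total = page_lengths[start]
        # Grow the window one whole page at a time until the target is hit.
        while end + 1 < n and total + page_lengths[end + 1] <= target_chars:
            end += 1
            total += page_lengths[end]
        windows.append((start, end))
        if end == n - 1:
            break
        # Step back by the overlap, but always advance by at least one page
        # so an oversized single page cannot stall the loop.
        start = max(end + 1 - overlap_pages, start + 1)
    return windows


# 200 pages at roughly 2,000 characters each (about 400K characters in total).
windows = split_into_windows([2_000] * 200)
```

With uniform 2,000-character pages each window holds 20 pages and consecutive windows share 2 pages, so the document is covered in 11 windows.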
Stitching overlap pages: Both window N and window N+1 classify the overlap pages. If they agree, use the shared classification. If they disagree, take the higher classification (the conservative choice). Window N+1’s reading of the overlap pages is the starting point, because it has the forward context that window N lacked; a paragraph’s level is then raised to window N’s value wherever window N rated it higher.
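The "higher classification wins" rule for overlap pages reduces to a per-paragraph maximum. In this sketch each window result is assumed to be keyed by a global (page, paragraph-on-page) pair so that the same paragraph can be recognised in two windows; that keying and the stitch name are assumptions for illustration.

```python
ParagraphKey = tuple[int, int]  # (page number, paragraph index on that page)


def stitch(window_results: list[dict[ParagraphKey, int]]) -> dict[ParagraphKey, int]:
    """Merge per-window classifications into one map for the whole document.

    Paragraphs seen by only one window keep that window's level.
    Paragraphs in an overlap keep the higher of the two levels.
    """
    stitched: dict[ParagraphKey, int] = {}
    for result in window_results:
        for key, level in result.items():
            stitched[key] = max(level, stitched.get(key, level))
    return stitched


window_n = {(14, 0): 0, (15, 0): 0, (15, 1): 3}
window_n_plus_1 = {(15, 0): 5, (15, 1): 3, (16, 0): 0}
final = stitch([window_n, window_n_plus_1])
```

Here page 15 is the overlap: the two windows disagree on its first paragraph, so the higher level (5) is kept, and they agree on the second, so it stays at 3.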
Window-level distillation piggybacks on classification. Since each window already goes to the LLM for paragraph-level classification, the system asks for a window summary in the same call — zero additional LLM calls:
{
  "windowSummary": "Pages 15-30 cover the financial projections and risk factors...",
  "paragraphs": [
    { "index": 0, "classification": 0 },
    { "index": 1, "classification": 5 }
  ]
}
The window summaries are in-memory only. After all windows are processed, one final LLM call synthesizes them into a single file-level summary. This solves the truncation problem for distillation — the entire document gets summarised, not just the first ~25 pages.
Cost: A 200-page document at 15 pages per window needs about 14 combined classification + distillation calls (200 / 15, rounded up; the 2-page overlap adds one or two more), plus 1 synthesis call. At ~$0.01-0.02 per call (Haiku/Flash), that’s ~$0.15-0.30 per document. For bulk ingestion of non-sensitive documents, the app developer can turn off classification; distillation then still runs, but as a single call (the current behaviour, truncated at 50K characters).
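The cost estimate can be reproduced with a few lines of arithmetic. The helper names below are hypothetical and the per-call price range is the rough figure quoted above, expressed in whole cents to avoid floating-point noise.

```python
import math


def window_count(pages: int, pages_per_window: int, overlap_pages: int = 0) -> int:
    """Number of windows needed to cover `pages` pages."""
    if pages <= pages_per_window:
        return 1
    stride = pages_per_window - overlap_pages  # new pages covered per extra window
    return 1 + math.ceil((pages - pages_per_window) / stride)


def cost_range_cents(calls: int, low_cents: int = 1, high_cents: int = 2) -> tuple[int, int]:
    """Cost range in cents for `calls` LLM calls at 1-2 cents per call."""
    return calls * low_cents, calls * high_cents


no_overlap = window_count(200, 15)        # the headline figure: 14 windows
with_overlap = window_count(200, 15, 2)   # counting the 2-page overlap: 16 windows

# Add one synthesis call on top of the per-window calls.
headline_cost = cost_range_cents(no_overlap + 1)      # (15, 30) cents
overlap_cost = cost_range_cents(with_overlap + 1)     # (17, 34) cents
```

Counting the overlap moves the estimate from 15 to 17 calls, which stays in the same cost band as the headline figure.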
The Classification Hierarchy Bug
The initial implementation assumed data classification is a strict hierarchy. ClassificationFilter.GetMaxClassificationLevel(roles) returned the highest level the user can access, and the vector search filtered with:
ec."DataClassification" <= @p_maxclass
This means: if you can see Financial (5), you can see PII (3). That’s wrong. An HR manager might see PII but not Financial. A CFO might see Financial but not PII-Sensitive. Classification levels are independent access grants, not a hierarchy.
The fix replaced GetMaxClassificationLevel with GetAllowedClassificationLevels, and the SQL changed from:
ec."DataClassification" <= @p_maxclass
to:
ec."DataClassification" = ANY(@p_allowed_levels)
Where @p_allowed_levels is an int[] parameter. Public (0) is always included. Each role grants access to specific levels independently:
- CFO role: [0, 5] (Public + Financial, but not PII)
- HR Manager role: [0, 3] (Public + PII, but not Financial)
- Admin role: [0, 1, 2, 3, 4, 5, 6] (everything)
- Viewer role: [0] (Public only)
This change affected all retrieval paths — both vector and keyword branches, Pulse context assembly, and agent context assembly.
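The difference between the two filters is easy to see in a small model. The sketch below is not the platform’s ClassificationFilter; the ROLE_GRANTS table and function names are stand-ins that mirror the role examples above.

```python
PUBLIC = 0

# Hypothetical role-to-level grants mirroring the examples above.
ROLE_GRANTS: dict[str, set[int]] = {
    "CFO": {5},
    "HRManager": {3},
    "Admin": {1, 2, 3, 4, 5, 6},
    "Viewer": set(),
}


def allowed_levels(roles: list[str]) -> list[int]:
    """Union of each role's independent grants; Public is always included."""
    levels = {PUBLIC}
    for role in roles:
        levels |= ROLE_GRANTS.get(role, set())
    return sorted(levels)


def filter_chunks(chunks: list[dict], roles: list[str]) -> list[dict]:
    """Equivalent of: DataClassification = ANY(@p_allowed_levels)."""
    allowed = set(allowed_levels(roles))
    return [c for c in chunks if c["classification"] in allowed]


def filter_chunks_buggy(chunks: list[dict], roles: list[str]) -> list[dict]:
    """The original hierarchy assumption: DataClassification <= max level."""
    max_level = max(allowed_levels(roles))
    return [c for c in chunks if c["classification"] <= max_level]


chunks = [
    {"id": "summary", "classification": 0},
    {"id": "home-addresses", "classification": 3},
    {"id": "revenue", "classification": 5},
]
cfo_sees = [c["id"] for c in filter_chunks(chunks, ["CFO"])]
cfo_saw_before_fix = [c["id"] for c in filter_chunks_buggy(chunks, ["CFO"])]
```

With the set-based filter the CFO gets the summary and revenue chunks only; the old max-level filter also returned the PII chunk.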
Custom Classification Levels
The default classification levels (Public, Internal, Confidential, PII, PII-Sensitive, Financial, Secret) are seeded as osy.DataClassification choice items. App developers can add domain-specific levels:
seed data osy.DataClassification (Label, Value, Icon, Color):
    "Export Controlled", 7, "Globe", "red"
    "Attorney-Client", 8, "Scale", "purple"
And map roles to them in the application: block:
application:
    classifications:
        classification:
            value = @osy.DataClassification[Label = "Export Controlled"]
            roles = "Admin, ComplianceOfficer"
        classification:
            value = @osy.DataClassification[Label = "Attorney-Client"]
            roles = "Admin, LegalCounsel"
Three layers — choice definition (platform), choice items (seeded defaults, app-extendable), role mappings (application block) — keep the system both extensible and type-safe. The compiler validates @osy.DataClassification[Label = "..."] references exist.
Classification Prompt
The LLM classification prompt is auto-generated from the seeded choice items. Custom levels automatically appear in the classification instructions:
Classify each major section's sensitivity using these levels:
- Public (0): No sensitive information
- Internal (1): Internal business information
- Confidential (2): Business-confidential material
- PII (3): Personally identifiable information
- Financial (5): Financial data -- salaries, accounts, revenue
- Export Controlled (7): Subject to export control regulations
For apps that need domain-specific classification guidance beyond what the choice item names convey, classificationPrompt overrides the auto-generated prompt:
application:
    classifications:
        classificationPrompt = """
            Classify each section using the company's data handling policy:
            - Public: press releases, marketing materials, published research
            - Internal: project plans, meeting notes, internal presentations
            - Confidential: M&A targets, board materials, unreleased financials
            - Export Controlled: anything referencing ITAR/EAR controlled technology
            - Attorney-Client: legal memos, litigation strategy, privilege logs
        """
File Access Model
If chunks are classified and filtered from retrieval, but the original file is freely downloadable, classification is theatre. The file access model enforces classification at every layer.
Three Access Layers
Layer 1 — Chunk-based retrieval (always available, per-chunk filtering). The RAG retrieval surface. Chunks filtered by ClassificationFilter based on the user’s role. A user without Financial clearance never sees salary chunks in search results, Pulse synthesis, or agent context — but they still get the public/internal chunks from the same file.
Getting 60% of a document’s content (the parts you’re cleared for) is far better than getting nothing. The chunks alone tell you what’s in the document at your clearance level.
Layer 2 — Document reader (extracted content, per-chunk filtering). The primary reading experience when someone clicks a citation or opens a document. Assembled from extracted chunks, filtered by classification. Sections the user isn’t cleared for show a placeholder: “Content requires Financial clearance.”
This view is an extraction, not the original. It must be clearly marked as such — “Platform extraction — open original for authoritative version.” The extraction quality depends on the strategy used (see below).
Layer 3 — Original file viewer / download (page-level or file-level gating). For PDFs: rendered in-browser via pdf.js with page-level classification gating. Pages are classified based on the chunks they contain (via PageNumber on osy.FileChunk):
- Pages where every chunk’s classification is within the user’s allowed levels: rendered normally
- Pages where any chunk falls outside the user’s allowed levels: redacted or hidden entirely
- Citation references scroll to the page if the user has clearance
For other formats (DOCX, XLSX): download-only. Full file download (any format) requires clearance for every classification level present in the file’s chunks, which matches the "Cleared for all content" row of the access matrix.
Access Matrix
| User clearance vs. file content | Chunks (retrieval) | Document reader | PDF viewer | Download |
|---|---|---|---|---|
| Cleared for all content | All chunks | Full document | All pages | Yes |
| Cleared for some content | Cleared chunks only | Partial with redaction placeholders | Cleared pages, others redacted | No |
| Cleared for no content | No chunks | Not accessible | Not accessible | No |
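The PDF viewer and download columns of the matrix can be derived from chunk data alone. This sketch assumes each chunk is reduced to a (page number, classification) pair and uses the allowed-levels model from the Classification Hierarchy Bug section; the function names are illustrative.

```python
def page_visibility(chunks: list[tuple[int, int]], allowed: set[int]) -> dict[int, bool]:
    """Map page number -> True when every chunk on that page is allowed."""
    pages: dict[int, bool] = {}
    for page, level in chunks:
        # One disallowed chunk is enough to redact the whole page.
        pages[page] = pages.get(page, True) and level in allowed
    return pages


def can_download(chunks: list[tuple[int, int]], allowed: set[int]) -> bool:
    """Full-file download needs clearance for every level present in the file."""
    return all(level in allowed for _, level in chunks)


# (page, classification): page 2 mixes Public and Financial, page 3 holds PII.
file_chunks = [(1, 0), (1, 0), (2, 5), (2, 0), (3, 3)]

cfo_pages = page_visibility(file_chunks, {0, 5})
cfo_download = can_download(file_chunks, {0, 5})
full_download = can_download(file_chunks, {0, 3, 5})
```

A CFO-style grant of levels 0 and 5 renders pages 1 and 2, redacts page 3, and blocks download; adding level 3 unlocks the whole file.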
Extraction Strategies
| Strategy | Quality | Cost | Best for |
|---|---|---|---|
| text-extract | Low (no layout) | Free | Simple text-heavy digital PDFs |
| layout-parse | Medium (gets tables/headers mostly right) | Free | Well-structured digital PDFs |
| vision-llm | High (understands complex layouts, scans, tables) | ~$0.01-0.05/page | Scanned docs, complex layouts, financial tables |
Critical honesty: The extracted view is an interpretation, not a lossless conversion. The vision LLM might restructure, omit, or subtly alter meaning. For anything where exact wording matters (legal clauses, signed contracts, regulatory filings), users must access the original via Layer 3.
The extraction strategy is configurable per-app default, per-file override, or via automatic upgrade (if text-extract produces low-confidence output, escalate to vision-llm).
Why pdf.js for Rendering, Not Extraction
pdf.js is a browser-based PDF renderer (used by Firefox). It’s excellent for rendering — faithful display of the original document. It’s poor for extraction — gives raw text without semantic structure (no markdown headers, table detection fails, no layout understanding).
The platform uses pdf.js for Layer 3 (original file viewing with page-level redaction), not for the ingestion pipeline. Extraction for chunking always happens server-side where LLM-based processing, classification, and embedding can be applied.
Pulse Security
The Pulse is a cached string on the entity — visible to anyone who can read the entity. But the inputs to the Pulse (classified properties, restricted documents, child entities with row-level security) may include data that not all readers should see.
The Pulse agent runs as a defined role, not as a privileged user. The existing SecurityManager filters what the agent can see during context assembly:
agent PulseAgent using CheapLlm:
    runAs = agent
    role = PulseServiceRole
    name = "Pulse Generator"
    purpose = "Generate entity Pulse summaries"

application MyApp:
    pulseAgent = PulseAgent
When the Pulse agent queries the entity’s context, SecurityManager filters based on PulseServiceRole. If that role can’t read sensitive risk log entries, they never enter the prompt, and therefore never appear in PulseContent.
The developer controls what the Pulse sees through the same security policies as everything else:
security RiskLogPolicy on ProjectRiskLog:
    defaults: deny all

    rule TeamCanReadNonSensitive:
        allow: Read
        when = "@CurrentUser.Roles CONTAINS 'ProjectMember'"
        rowFilter = "Category != @RiskCategory[Label = 'Sensitive']"

    rule PulseCanReadNonSensitive:
        allow: Read
        when = "@CurrentUser.Roles CONTAINS 'PulseServiceRole'"
        rowFilter = "Category != @RiskCategory[Label = 'Sensitive']"

    rule LegalCanReadAll:
        allow: Read
        when = "@CurrentUser.Roles CONTAINS 'Legal'"
The Pulse shows what PulseServiceRole can see — a safe, general-audience summary. Users with higher clearance open the chat (which runs as runAs = user) and ask about the sensitive details — the chat sees everything their role permits.
Audit trail shows: “Pulse generated by Agent PulseAgent running as PulseServiceRole.” Clear provenance.
Graph-Aware RAG
Entities don’t exist in isolation. A Project has Tasks, Risks, and Team Members. A Customer has Orders, Tickets, and Contracts. The RAG layer should understand these relationships without the agent having to manually traverse them.
The Relationship Context Type
When an entity’s relationships change (child created, deleted, or modified), the platform generates a Relationship context chunk that captures the entity’s neighbourhood:
[Project: Apollo] [Relationships]
>> "Project Apollo has 24 Tasks (18 completed, 3 in progress, 3 blocked).
3 open Risks: foundation crack ($45K), permit delay (3 weeks),
subcontractor availability.
5 Team Members: Sarah Chen (Owner), Mike Torres (Architect), ...
2 attached Files: Site_Report.pdf (42 pages), Budget_v3.xlsx.
Parent Organization: Acme Construction."
This chunk is embedded like any other context row. When the agent searches for “what’s blocking Apollo,” the relationship context surfaces alongside file chunks and property snapshots — giving the agent awareness of the entity’s full graph without recursive queries.
Generation Strategy
The relationship context is generated from the entity’s collection properties — the metadata model already knows all parent-child relationships. For each collection:
- Count the children
- Group by relevant status/category properties (if they exist)
- Summarise the top items by recency or priority
- Include parent references
This is a metadata-driven operation — the platform walks the entity’s property metadata, finds Collection-type properties, queries the counts and summaries, and assembles the text. No app-builder configuration needed beyond defining the relationships.
Staleness Propagation and the Thundering-Herd Problem
When a child entity changes, the parent’s relationship context becomes stale:
Task completed -> parent Project's relationship context stale
-> parent Project's Pulse stale
This is controlled by a propagation depth (default 1 — only direct parent). Deep propagation (grandparent, etc.) is opt-in to avoid cascading updates.
The thundering-herd problem. Consider: a CSV import creates 500 Task children under a Project. Each Task creation fires a property-change event, which marks the parent Project’s relationship context as stale. Without coalescing, the platform would:
- Regenerate the Project’s Relationship context chunk 500 times
- Re-embed that chunk 500 times (500 embedding API calls)
- Mark the Project’s Pulse stale 500 times and potentially regenerate it 500 times (500 LLM calls)
At scale this turns a $0.02 import into a $5+ LLM bill and a saturated ingestion queue.
Coalescing as a First-Class Primitive
Coalescing is lifted from a Pulse-specific optimisation to a platform-level primitive that governs all staleness propagation:
- Staleness is a flag, not a trigger. When a child changes, the platform sets PulseStaleSince = now() on the parent (if not already set). It does NOT immediately enqueue regeneration. The flag is idempotent: 500 child changes set the same flag once.
- Coalescing window. A background job periodically scans for stale entities. It collects all entities where PulseStaleSince IS NOT NULL AND PulseStaleSince < now() - coalesce_window. A burst of 500 child changes within the window produces exactly one regeneration.
- Batch-aware ingestion. Bulk operations (CSV import, API batch create, seed data) are identifiable as a batch. During a batch, staleness propagation is deferred entirely: the parent is marked stale once when the batch commits, not per-row.
- Rate limiting per entity. Even outside batches, an entity’s context is not regenerated more than once per coalescing window. If the window is 30 seconds, a Project with continuous child updates gets at most 2 regenerations per minute.
The coalescing window is configurable per entity type via the rag: block:
entity Project:
    rag:
        context = auto
        coalesce = 30s        # default
        propagation = parent  # depth 1, direct parent only

entity ActivityLog:
    rag:
        context = auto
        coalesce = 120s       # high-churn entity, longer window
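The flag-plus-scan behaviour can be modelled in memory as follows. This is a sketch only: the StaleTracker class is hypothetical, a dictionary stands in for the PulseStaleSince column, and the caller passes the clock explicitly so the behaviour is easy to follow.

```python
from datetime import datetime, timedelta


class StaleTracker:
    """In-memory stand-in for the PulseStaleSince flag and the background scan."""

    def __init__(self, coalesce_window: timedelta):
        self.window = coalesce_window
        self.stale_since: dict[str, datetime] = {}

    def mark_stale(self, entity_id: str, now: datetime) -> None:
        # Idempotent: only the first change in a burst sets the timestamp.
        self.stale_since.setdefault(entity_id, now)

    def due(self, now: datetime) -> list[str]:
        """Entities whose flag is older than the window; clears their flags."""
        cutoff = now - self.window
        ready = [e for e, since in self.stale_since.items() if since < cutoff]
        for entity_id in ready:
            del self.stale_since[entity_id]
        return ready


tracker = StaleTracker(timedelta(seconds=30))
t0 = datetime(2026, 1, 1, 12, 0, 0)

# A CSV import: 500 child changes arrive within half a second.
for i in range(500):
    tracker.mark_stale("project-apollo", t0 + timedelta(milliseconds=i))

too_early = tracker.due(t0 + timedelta(seconds=10))
after_window = tracker.due(t0 + timedelta(seconds=31))
second_scan = tracker.due(t0 + timedelta(seconds=32))
```

The 500 changes leave a single flag, the scan at 10 seconds finds nothing, and the scan after the window returns the Project exactly once.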
Multi-Hop Retrieval
For complex queries (“which projects in the Construction division are over budget?”), the agent can combine:
1. Cross-entity search on entity_type = 'Project' for “over budget”
2. Relationship context on each result to see the division (parent Organization)
3. Property data to verify the budget numbers
The graph-aware context makes step 2 unnecessary in most cases — the relationship chunk already contains “Parent Organization: Acme Construction,” so the vector search naturally ranks Construction-division projects higher for that query.
Soft References — Agent-Driven Knowledge Graphs
Traditional systems rely on hard relations (EntityRef/Collection) defined at schema time by the app builder. But some of the most valuable connections between entities emerge at runtime — during conversations, document reviews, or agent analysis. A user reviewing Contract A asks the agent to look up a similar contract and says “add a reference to that.” No schema relationship exists between them, but the connection is meaningful, discoverable, and should be navigable.
The key insight: agents eliminate the friction. Manually creating cross-references is high-effort, low-immediate-reward work — nobody does it consistently. But in a conversation with an agent, the references are a natural byproduct. The agent already found the target, already knows why it’s relevant, already has the context. Creating the reference is one sentence from the user, and the agent writes a better reason than most humans would bother to.
This compounds over time. Six months in, an entity has a web of references that nobody planned but everyone benefits from. A new team member opens a contract and immediately sees “this was based on Template X, the indemnity clause was modelled on Contract Y, three risks were extracted from it.” That’s institutional knowledge that normally lives in someone’s head.
The Reference Entity
choice ReferenceKind:
    Citation         # Agent RAG output referenced a source
    CrossReference   # Related entities linked by user or agent
    DerivedFrom      # Entity was created/extracted from the target
    Bookmark         # User saved a specific location for later
    Supersedes       # This version replaces the target
    BuildsOn         # This entity extends or builds upon the target
    Contradicts      # This entity conflicts with or disputes the target
    UpdateOf         # Newer version of the same logical content
    SupportedBy      # Evidence or justification from the target

choice ReferenceLocationType:
    Page             # PDF/document page number
    Section          # Named section/heading within a document
    Message          # Specific chat message
    Property         # Specific property on an entity
    Chunk            # File chunk (for precise RAG citations)

ensure entity osy.Reference:
    description = "Soft link between any two entity instances"
    SourceType: osy.EntityMetadata required
    Source: dynamic reference(SourceType) required
    TargetType: osy.EntityMetadata required
    Target: dynamic reference(TargetType) required
    Label: String(500)
    Reason: String
    Kind: @ReferenceKind required
    LocationType: @ReferenceLocationType
    LocationDetail: Json
    DataClassification: Int32 default = 0
Both sides use dynamic references — any entity type can be a source or target without predefined FKs. The platform’s existing system properties (CreatedAt, CreatedBy) provide full provenance.
Platform Integration
Because osy.Reference is a platform entity, several capabilities work automatically without app builder configuration:
Agent tools. The platform provides AddReference and SearchReferences as built-in function tools available to any agent. The agent populates SourceType, TargetType, Kind, Reason from conversation context. The app builder doesn’t wire these up — they exist because osy.Reference exists.
RAG citations. When Pulse or chat produces citations (see below), the runtime writes Citation-kind references automatically. When a Pulse regenerates, old Citation references for that source are replaced.
Reference panel. The client renders a References section on any entity that has references (as source or target). The app builder can place it explicitly in a view layout, or the platform shows it in a default location.
Bidirectional discovery. “Show me everything that references this template contract” is a query on Target = templateId. “What was this contract based on?” is a query on Source = contractId, Kind = DerivedFrom. Both directions are first-class.
Graph enrichment for RAG. When generating Relationship context, the platform includes soft references alongside hard relations: “Referenced by 3 other contracts, derived from Template X.” Future agent searches benefit from connections created in past conversations.
Navigation. A single reference component handles all target types using TargetType + Target + LocationType + LocationDetail:
| LocationType | Navigation action |
|---|---|
| Page | Open file viewer, scroll to page |
| Section | Open file viewer, scroll to section heading |
| Message | Open chat session, scroll to message |
| Property | Open entity default view, highlight property |
| Chunk | Open file viewer at chunk’s page/section |
| (null) | Open entity default view |
Examples
Agent-created during chat:
User: "Look up contracts with similar indemnity language"
Agent: [searches, finds Contract B]
Agent: "Contract B (Acme MSA 2024) has a similar mutual indemnity
structure in section 8..."
User: "Add a reference to that"
Agent: [calls AddReference]
-> Source: Contract A, Target: Contract B
-> Kind: CrossReference
-> Label: "Similar indemnity clause structure"
-> Reason: "Mutual indemnity in section 8 of Acme MSA 2024 uses
the same cap-at-fees model. Identified during legal
review of liability terms."
Automatic from RAG citation:
Pulse generates: "Foundation crack detected in sector 7 [src:a1b2c3d4]"
Runtime creates:
-> Source: Project Apollo (PulseContent), Target: osy.FileAsset (Site_Report.pdf)
-> Kind: Citation
-> Label: "Site_Report.pdf, p.12 section 3.2 Risks"
-> LocationType: Page, LocationDetail: { page: 12, section: "3.2 Risks" }
Insight extraction chain:
User: "Add those risks to the risk log with the references"
Agent creates Risk entity + Reference:
-> Source: Risk "Liability cap below standard", Target: Contract A
-> Kind: DerivedFrom
-> Label: "Extracted from MasterAgreement.pdf section 8.1"
-> Reason: "Identified during contract review -- liability cap of
$50K is below industry standard of $250K for this
contract value."
-> LocationType: Page, LocationDetail: { page: 8, section: "8.1 Liability" }
Extended Reference Kinds and Salience
The extended ReferenceKind values carry different implications for relevance. When a reference is created, the Kind gives downstream systems a machine-readable signal about the nature of the relationship:
| Kind | Effect on target salience | Effect on source salience |
|---|---|---|
| Citation | Neutral (referenced, but by an output, not by a decision) | N/A |
| CrossReference | Slight positive (something is related) | Slight positive |
| DerivedFrom | Positive (foundational, things are built from this) | Neutral |
| Supersedes | Negative (this has been replaced) | Positive |
| BuildsOn | Positive (still being built upon) | Neutral |
| Contradicts | Positive (live tension, needs resolution) | Positive |
| UpdateOf | Negative (older version) | Positive |
| SupportedBy | Positive (used as evidence) | Neutral |
| Bookmark | Neutral (user intent, not graph structure) | N/A |
The Kind is the structured, machine-readable signal. The Reason field is where the agent captures nuance that the kind alone can’t express:
Kind: BuildsOn
Reason: "The indemnity clause in section 8.2 uses the same mutual-cap-at-fees
structure established in the Acme MSA 2024, with the liability
threshold increased from $50K to $100K to reflect the larger
contract value."
Citation System
Stable Chunk Identifiers
Every chunk assembled into LLM context gets a short stable ID derived from the osy.EntityContext row’s primary key (UUID), truncated to 8 hex characters for prompt compactness:
[src:a1b2c3d4] Financial performance summary from Q4 2025 report...
[src:e5f6a7b8] Customer satisfaction scores from annual survey...
[src:c9d0e1f2] Board resolution from March 2026 meeting...
The src: prefix distinguishes source citations from other bracket syntax. The 8-char hex is enough to be unique within a single prompt. The full UUID is resolved server-side from the truncated ID.
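A minimal sketch of the ID scheme, assuming the short ID is the first 8 hex characters of the UUID (the text above only says "truncated") and that the context assembly manifest is a simple short-ID-to-UUID map. Raising on a collision is also an assumption; with 32 bits of ID and at most a few hundred chunks per prompt, collisions should be extremely rare.

```python
import uuid


def short_id(context_id: uuid.UUID) -> str:
    """First 8 hex characters of the UUID, as used in [src:ID] tags."""
    return context_id.hex[:8]


def build_manifest(context_ids: list[uuid.UUID]) -> dict[str, uuid.UUID]:
    """Map each short ID in this prompt back to its full UUID."""
    manifest: dict[str, uuid.UUID] = {}
    for cid in context_ids:
        sid = short_id(cid)
        if sid in manifest and manifest[sid] != cid:
            # Hypothetical policy: refuse to build an ambiguous prompt.
            raise ValueError(f"short ID collision on {sid}")
        manifest[sid] = cid
    return manifest


row = uuid.UUID("a1b2c3d4-0000-4000-8000-000000000000")
manifest = build_manifest([row, uuid.uuid4()])
tag = f"[src:{short_id(row)}]"
```

The manifest lives only for the duration of one prompt, which is why uniqueness within a single prompt is the only requirement.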
Prompt Injection
When assembling context for chat or Pulse, each chunk is prefixed with its [src:ID] tag. Citation instructions travel with the context, not in the system prompt header — keeping them close to the chunks in the LLM’s attention:
When you reference information from the provided context, cite the source
using its [src:ID] tag. Place citations inline at the end of the sentence
or paragraph that uses the information. Example: "Revenue grew 15% in Q4
[src:a1b2c3d4]."
Only cite sources that were provided in the context. Do not fabricate
source IDs.
This is injected by the platform — the app developer doesn’t write it.
Post-Processing
After the LLM responds, the platform parses [src:XXXXXXXX] markers from the response text:
- Extract all [src:ID] occurrences
- Resolve each 8-char ID to the full osy.EntityContext UUID (lookup from the context assembly manifest)
- Validate: does this ID match a chunk that was actually in the prompt? If not, flag as hallucinated and strip
- Create an osy.Reference row for each valid citation:
  - SourceType/Source = the entity the chat/Pulse is about
  - TargetType/Target = the entity the cited chunk belongs to
  - Kind = Citation
  - LocationType = Chunk
  - LocationDetail = { "entityContextId": "full-uuid", "sourceFileId": "..." }
  - DataClassification = the cited chunk’s classification value
  - Label = first 100 chars of cited chunk content
  - Reason = the sentence that cited it
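The extract, resolve, and validate steps can be sketched with a single regular-expression pass. The process_citations name and the shape of the manifest (short ID to full UUID string) are assumptions; writing the osy.Reference rows is left out.

```python
import re

# Optional leading space so a stripped marker does not leave "fell ." behind.
SRC_TAG = re.compile(r" ?\[src:([0-9a-f]{8})\]")


def process_citations(response: str, manifest: dict[str, str]):
    """Return (cleaned_text, valid_full_ids, hallucinated_short_ids)."""
    valid: list[str] = []
    hallucinated: list[str] = []

    def check(match: re.Match) -> str:
        short = match.group(1)
        if short in manifest:
            full_id = manifest[short]
            if full_id not in valid:
                valid.append(full_id)
            return match.group(0)  # keep the marker for the UI to render
        hallucinated.append(short)
        return ""                  # strip markers that were never in the prompt

    cleaned = SRC_TAG.sub(check, response)
    return cleaned, valid, hallucinated


manifest = {"a1b2c3d4": "a1b2c3d4-0000-4000-8000-000000000000"}
response = "Revenue grew 15% in Q4 [src:a1b2c3d4]. Costs fell [src:deadbeef]."
cleaned, valid, hallucinated = process_citations(response, manifest)
```

Each entry in valid would then become one Citation-kind osy.Reference row with the fields listed above.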
Opt-In via DSL Directives
Citations are opt-in per feature:
entity Order:
    rag:
        context = auto
        pulse = auto
        pulseReferences = true  # Pulse cites its sources
        chatReferences = true   # Entity chat cites its sources
When pulseReferences = true, the Pulse context assembly adds [src:ID] prefixes, the Pulse prompt includes citation instructions, and after generation, citations are extracted and stored as osy.Reference rows. The PulseContent retains the markers — the UI renders them as clickable links.
When chatReferences = true, the same mechanism applies to entity chat responses.
Default: both false. Citations cost more prompt tokens and more post-processing — they should be explicit.
Classification-Aware Citations
osy.Reference carries a DataClassification property (Int32, default 0) — set to the target chunk’s classification at creation time. This means:
- References are filtered the same way chunks are: DataClassification IN (...) against the user’s allowed levels
- The label and reason text (which may contain sensitive content from the target) are only visible to users with clearance
- When loading references for Pulse context assembly or chat, the classification filter applies — the LLM never sees references to content the acting role can’t access
Without this, a reference with Label = "Revenue grew 15% in Q4" pointing to a Financial-classified chunk would leak the classified content to any user who can see the source entity.
For the UI: if a user doesn’t have clearance for a reference’s classification level, it renders as “[source redacted — requires {level} clearance].”
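A small sketch of both behaviours: filtering references before they reach the LLM, and redacting them for the UI. The `Reference` dataclass, the function names, and the `LEVEL_NAMES` map are hypothetical; only the PII (3) and Financial (5) values are taken from this document. The UI-side function returns a structured result and leaves the wording of the placeholder to the client.

```python
from dataclasses import dataclass

# Hypothetical label lookup; only these two level values appear in this document.
LEVEL_NAMES = {3: "PII", 5: "Financial"}


@dataclass
class Reference:
    label: str
    reason: str
    data_classification: int = 0


def visible_references(references, allowed_levels):
    """Keep only references whose classification the acting role is granted."""
    return [r for r in references if r.data_classification in allowed_levels]


def render_reference(ref, allowed_levels):
    """Return display data, withholding label and reason without clearance."""
    if ref.data_classification in allowed_levels:
        return {"redacted": False, "label": ref.label, "reason": ref.reason}
    level = LEVEL_NAMES.get(ref.data_classification, f"level {ref.data_classification}")
    return {"redacted": True, "required_level": level}
```

Note that the check is set membership, not a comparison against a maximum level, which matches the independent-grant model described under Evolution.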
UI Rendering
The [src:XXXXXXXX] markers in stored text (PulseContent, ChatMessage content) are rendered by the client as interactive citation elements:
- Inline badge: small superscript number or icon next to the cited text
- Hover: shows the first 200 chars of the cited chunk + source metadata (file name, page number, entity name)
- Click: navigates to the source — opens the document reader at the relevant page/section, or scrolls to the entity
The client resolves each marker in three hops: [src:ID] -> osy.Reference -> LocationDetail -> navigation target. All data is already in the database; the client just follows the references.
Reference Graph Traversal
The retrieval paths described in Part 2 search within the current entity’s scope. But the most valuable insights often come from connections across entities — a liability clause pattern flagged in a previous deal, a risk finding that applies to a similar situation.
The osy.Reference entity is the cross-entity memory. When someone reviews the Acme deal, files a Finding about a missing data breach liability carve-out, and creates a Reference linking that Finding to the specific contract clause — that knowledge should surface when the agent encounters a similar clause in the Apollo deal.
How Reference-Aware Retrieval Works
When search_knowledge receives a query with reference_depth > 0, the flow expands the search scope via the reference graph before the vector search:
1. Embed query ("liability clause") -> queryVector
2. Collect entity IDs from the current entity's scope
   (its collections, related entities)
3. Traverse references from those entities (BFS, bidirectional):
   depth 1: direct References (Finding #47 -> this clause)
   depth 2: Finding #47's own References (-> Acme contract section 12)
   Cycle-safe: visited set prevents re-traversal
   Capped: 50 entities per hop, 200 references per query
4. Vector search across the expanded scope using the SAME queryVector:
   osy.EntityContext WHERE EntityId IN (expanded set)
     AND Embedding <=> queryVector < threshold
     AND DataClassification IN (allowed levels)
5. Results tagged with viaReference: true/false
   so the LLM knows which came from graph traversal
The vector search on the expanded set is what makes this precise — traversal at depth 2 on a busy reference graph might reach hundreds of entities, but the vector search filters to only those semantically relevant to “liability clause.” A Finding about pricing on the same deal gets traversed but scores low and is excluded.
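The traversal in step 3 can be sketched as a bounded breadth-first search. This is an illustration under assumptions: `expand_scope` and its parameters are hypothetical names, references are modelled as simple (source, target) pairs, the `can_access` callback stands in for the RBAC check, and the way the two caps are applied (new entities per hop, references examined per query) is one plausible reading of the limits above.

```python
def expand_scope(seed_ids, edges, max_depth,
                 max_entities_per_hop=50, max_references=200,
                 can_access=lambda entity_id: True):
    """Return the seed entities plus everything reachable within max_depth hops."""
    # Bidirectional: a reference can be followed from either end.
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, []).append(target)
        neighbours.setdefault(target, []).append(source)

    visited = set(seed_ids)      # cycle-safe: never revisit an entity
    frontier = list(seed_ids)
    references_examined = 0

    for _ in range(max_depth):
        next_frontier = []
        for entity_id in frontier:
            for other in neighbours.get(entity_id, []):
                if references_examined >= max_references:
                    break        # per-query reference budget is spent
                references_examined += 1
                if other in visited or not can_access(other):
                    continue
                if len(next_frontier) >= max_entities_per_hop:
                    continue     # per-hop entity cap reached
                visited.add(other)
                next_frontier.append(other)
        if not next_frontier:
            break
        frontier = next_frontier
    return visited
```

The returned set is the expanded scope that the vector search in step 4 would run against.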
The Cross-Deal Example
- User asks “what about the liability clause?” on the Apollo deal
- search_knowledge returns chunks from Apollo’s Master Agreement section 8.1
- Reference expansion finds a CrossReference from Acme Finding #47 to the Apollo Master Agreement, with label “Similar liability cap structure”
- This Reference was created by the LegalReviewer agent during the Acme review: it noticed the same clause pattern and linked it
- The system loads Acme Finding #47’s content: “Missing data breach liability carve-out in section 12. Severity: Medium. Resolution: Added explicit carve-out during negotiation.”
- The agent now says: “The liability cap in section 8.1 is $4M [src:a1b2c3d4]. Note: this same clause structure was flagged in the Acme deal — the data breach carve-out was missing there too and had to be added during negotiation.”
The agent doesn’t have to know about the Acme deal. The reference graph surfaces it automatically. The previous reviewer’s work — filing the Finding, creating the Reference — is the mechanism.
Security in Traversal
TraverseReferencesAsync respects RBAC — entities the user can’t access are excluded from the traversal. The vector search on the traversed set also applies classification filtering (DataClassification IN (allowed_levels)). The structural permission boundary holds end-to-end: a user who can’t see Deal A’s confidential findings never has those findings surface during their review of Deal B, even if a reference connects them.
Reference Creation as Learning
Every Reference that an agent or user creates adds an edge that later retrievals can traverse, so the reference graph becomes more useful over time:
- Agent reviews a contract, files a Finding, references the clause: future agents find it
- User marks two findings as related (a cross-link): future searches surface the connection
- Agent in deal B finds something similar to deal A, creates a CrossReference: deal C benefits
This is why chatReferences = true and pulseReferences = true matter — they make the agents produce references automatically, which feeds the graph that makes future retrieval better.
Application Patterns
The entity-anchored RAG design maps onto the common RAG application patterns in the market today, as the table below shows. The differentiator: app builders don’t configure RAG; they define entities and relationships, and the platform provides the semantic layer automatically.
| Pattern | Market examples | Platform mapping | What the app builder does |
|---|---|---|---|
| Knowledge base | Notion AI, Confluence AI, Slite | Entity = Document/Page, files = attachments | Define Document entity, upload files |
| Customer support | Intercom Fin, Zendesk AI, Ada | Ticket + Message children, KB articles as separate entities | Define Ticket + Message entities, write agent instructions |
| Long-running chat | ChatGPT memory, Claude projects | Conversation + Message children, rolling summaries | Define Conversation entity, rag: block with children = auto |
| CRM intelligence | Salesforce Einstein, HubSpot AI | Deal/Contact/Company entities with file attachments | Define CRM entities, Pulse = “deal health” |
| Project management | Monday AI, Asana AI, Notion projects | Project -> Tasks/Risks/Milestones, file attachments | Define project entities, attach documents, agent reasons across everything |
| Legal research | Harvey, CaseText, Luminance | Case/Contract entities with deep document chunking | Define Case entity, upload legal documents |
| Enterprise search | Glean, Hebbia, Coveo | Cross-entity workspace search | Deploy app, everything is searchable automatically |
| Document Q&A | ChatPDF, DocuSign IAM | Single entity with file attachment | Upload file to entity, ask questions |
| Multi-agent workflows | CrewAI, AutoGen, LangGraph | Agents share context via entity_context | Define agents with different roles, share entity scope |
| Personal memory | Mem, Rewind, Limitless | User-scoped entities with auto-capture | Define Note/Memory entity, rag: block with children = auto |
What Makes This Different from Standalone RAG Tools
Standalone vector databases (Pinecone, Weaviate, Qdrant, Chroma) provide the storage layer but leave the application developer to build:
- Entity-to-chunk mapping
- Ingestion pipelines
- Staleness management
- Relationship awareness
- Structured + unstructured unification
- Citation tracking back to source
- Access control (which user can search which entities)
The platform provides all of this as infrastructure. The vector store is an implementation detail — the app builder works at the entity level, and the semantic shadow follows automatically.
Access Control Integration
Because osy.EntityContext rows carry Entity (dynamic reference) and EntityType (EntityRef to EntityMetadata), the existing security policy system applies directly. The platform first evaluates which entities the current user can see (per security policy), then performs the vector similarity search within that scoped set. A support agent searching “billing issues” only sees tickets from their assigned accounts, even though the vector index contains all tickets.
Evolution
The Classification Hierarchy Bug
The initial implementation assumed classification levels form a strict hierarchy: if you can see level 5, you can see levels 0-4. The SQL used <= @p_maxclass. This is wrong — classification levels are independent grants. An HR manager should see PII (3) but not Financial (5). A CFO should see Financial but not PII-Sensitive. The fix was conceptually simple (= ANY(@p_allowed_levels)) but required changes across every retrieval path — vector search, keyword search, Pulse context assembly, and agent context assembly.
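The difference between the two checks is easy to see in a few lines. This is a sketch with hypothetical function names; the level values PII = 3 and Financial = 5 come from the paragraph above, and the grant sets are illustrative.

```python
PII = 3
FINANCIAL = 5


def visible_hierarchical(chunk_level: int, max_level: int) -> bool:
    # Original (buggy) rule, equivalent to: DataClassification <= @p_maxclass
    return chunk_level <= max_level


def visible_by_grant(chunk_level: int, allowed_levels: set) -> bool:
    # Fixed rule, equivalent to: DataClassification = ANY(@p_allowed_levels)
    return chunk_level in allowed_levels


# Illustrative grants: each role holds independent levels, not a ceiling.
hr_manager = {0, PII}
cfo = {0, FINANCIAL}

# Under the hierarchy rule a CFO with ceiling 5 also sees PII chunks.
cfo_leak = visible_hierarchical(PII, max_level=FINANCIAL)
# Under the grant rule the CFO sees Financial but not PII, and HR the reverse.
cfo_sees_pii = visible_by_grant(PII, cfo)
hr_sees_financial = visible_by_grant(FINANCIAL, hr_manager)
```

Here `cfo_leak` is True while `cfo_sees_pii` and `hr_sees_financial` are both False, which is the behaviour the fix was meant to produce.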
Per-Chunk vs. Per-File Classification Granularity
The first design classified at the file level. A salary report with a public executive summary and confidential appendix would be classified entirely as Confidential — making the executive summary invisible to users without Confidential clearance. Per-chunk classification solved this by treating each chunk independently. The cost is a more complex ingestion pipeline; the benefit is maximally useful retrieval at every clearance level.
Section-Level vs. Paragraph-Level Classification
Even after moving to per-chunk classification, the initial implementation classified at the section header level — all chunks under a section header got the section’s classification. A section with one PII sentence surrounded by public content over-classified the entire section. The fix was paragraph-level classification: the LLM evaluates each paragraph independently, adjacent same-classification paragraphs form segments, and chunks are created within segments. This added a four-pass pipeline but eliminated over-classification.
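The segment-forming step can be sketched with `itertools.groupby`, which groups consecutive items that share a key. The input shape (a list of `(text, level)` pairs produced by the per-paragraph LLM pass) and the name `build_segments` are assumptions made for illustration; chunking would then run separately inside each segment.

```python
from itertools import groupby


def build_segments(classified_paragraphs):
    """Merge adjacent paragraphs with the same classification into segments.

    classified_paragraphs: list of (text, level) pairs in document order.
    Returns a list of (level, [texts]) segments, preserving order.
    """
    segments = []
    for level, group in groupby(classified_paragraphs, key=lambda p: p[1]):
        segments.append((level, [text for text, _ in group]))
    return segments


paragraphs = [
    ("Overview of the programme.", 0),
    ("Timeline and milestones.", 0),
    ("Named employee compensation details.", 3),  # the single PII paragraph
    ("Next steps.", 0),
]
segments = build_segments(paragraphs)
```

Only the one PII paragraph ends up in a level-3 segment; the public paragraphs on either side stay at level 0 instead of inheriting the stricter classification.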
The Bootstrap Refactor
Moving the osy.DataClassification choice from an internal module to the baseline application required a unified multi-file compilation model. Previously, each .osy file compiled independently. With classification choice items seeded in one file and referenced via @osy.DataClassification[Label = "..."] in another, cross-file references had to resolve in a single merged AST. This led to a three-pass seed execution model:
- Deferred seeds (choice items from inline blocks)
- Data seeds (can reference choice items from pass 1)
- Post-seed actions (classification configs with choice refs, can look up resolved choice data)
Reference Traversal Simplification
The original design for reference-aware retrieval proposed a two-pass approach: search the current entity scope, then traverse references, then search again on the traversed set, then merge results. The implementation simplified this to expanding the scopeEntityIds set before the search — the existing search post-filter naturally includes referenced entities. One search, not two. Simpler, and the vector similarity scoring naturally handles relevance across the expanded set.
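A sketch of the single-search shape over an expanded scope. It is an in-memory stand-in for the database query, under assumptions: rows are plain dicts, the function names are hypothetical, and `<=>` in the flow above is taken to mean cosine distance (as it does in pgvector).

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search_expanded_scope(rows, query_vector, base_scope, expanded_scope,
                          allowed_levels, threshold):
    """One pass over the expanded scope; results are tagged with via_reference."""
    hits = []
    for row in rows:
        if row["entity_id"] not in expanded_scope:
            continue  # outside current scope and not reached by traversal
        if row["classification"] not in allowed_levels:
            continue  # classification filter applies to traversed entities too
        distance = cosine_distance(row["embedding"], query_vector)
        if distance < threshold:
            hits.append({
                "entity_id": row["entity_id"],
                "distance": distance,
                # True when the row was only reachable through a reference
                "via_reference": row["entity_id"] not in base_scope,
            })
    return sorted(hits, key=lambda h: h["distance"])
```

Because relevance is scored once across the whole expanded set, a traversed but unrelated entity simply falls below the threshold and no merge step is needed.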