
RAG Security, Classification, and Knowledge Graphs

architecture draft Updated

The RAG foundations and AI features described in Parts 1 and 2 are only trustworthy with proper security, classification, and provenance. A vector search that returns sensitive salary data to every user, or an agent that summarises confidential deal terms into a public-facing Pulse, undermines the entire system. This document covers the three pillars that make RAG production-ready: data classification (per-chunk, not per-file), citations (verifiable provenance), and soft references (an emergent knowledge graph that grows as agents work).


Data Classification

The Problem

The platform has property-level data classification (classification = PII, Financial, Secret, etc.) that controls what users and roles can see via ClassificationFilter. But when files are chunked for RAG ingestion, the chunks lose this context. A salary spreadsheet uploaded to an entity gets chunked into osy.FileChunk rows and embedded into osy.EntityContext — and those chunks contain PII/Financial data that the vector search will happily return to any user who can see the parent entity.

Structured property classification is enforced. Unstructured file content classification is not. That’s the gap.

Design

Both osy.EntityContext and osy.FileChunk carry a DataClassification field — the same integer scale (0-6) used by PropertyMetadata.DataClassification:

ensure entity osy.EntityContext:
  ...existing properties...
  DataClassification: Int32 default = 0
  ClassificationSource: String(50)       # "auto", "inherited", "manual"

ensure entity osy.FileChunk:
  ...existing properties...
  DataClassification: Int32 default = 0
  ClassificationSource: String(50)

The classification on osy.FileChunk is the source of truth. The osy.EntityContext row copies it at embedding time. This means re-chunking (new strategy) inherits classification without re-classifying, and re-embedding (new model) carries the classification forward.

Classification Sources

Classification can come from three sources, in priority order:

| Source | When | Example |
|---|---|---|
| inherited | The parent entity’s collection property has a classification | A HrDocuments collection classified as PII: all chunks from files in this collection inherit PII |
| auto | LLM-based classification during chunking | The chunk contains salary figures or SSNs, and the classifier detects this |
| manual | User or admin explicitly classifies a file | Override for edge cases |

Inherited classification is the cheapest and most reliable. If the app builder declared:

entity Employee:
  Name: String required
  PublicDocuments: collection(osy.FileAsset, Owner)
  HrDocuments: collection(osy.FileAsset, Owner) classification = PII

Then all chunks from files uploaded to HrDocuments inherit classification = PII. No LLM needed. The classification follows the collection property’s declaration.

Auto-classification handles the case where the collection itself isn’t classified but the content is sensitive. The distillation LLM classifies each chunk during processing.

Per-Chunk Classification

A single file often contains mixed sensitivity — a report with a public executive summary, an internal project timeline, and a confidential salary appendix. Rather than classifying the entire file at the highest level (which makes the public summary invisible to users without confidential clearance), the system classifies at the chunk level.

This means:

  • A user with Internal clearance retrieves the executive summary and timeline chunks but not the salary appendix
  • The agent assembling context for a PulseServiceRole sees only what that role’s classification level permits, per chunk
  • Retrieval is maximally useful at every clearance level instead of all-or-nothing per file

The Paragraph-Level Classification Pipeline

The correct granularity for classification is the paragraph, not the section. A section about a project might contain one paragraph mentioning a specific salary (PII) surrounded by public project updates. Only the salary paragraph should be classified as PII.

The pipeline has four passes:

Pass 1: Paragraph-level classification (LLM)
  |
Pass 2: Group into classification segments
  |
Pass 3: Chunk within each segment
  |
Pass 4: Merge undersized adjacent chunks (same classification only)

Pass 1 — Paragraph-level classification. The document is pre-split into paragraphs (by blank lines / markdown blocks) and sent with index markers. The LLM returns a classification per paragraph:

{
  "paragraphs": [
    { "index": 0, "classification": 0 },
    { "index": 1, "classification": 0 },
    { "index": 2, "classification": 3 },
    { "index": 3, "classification": 3 },
    { "index": 4, "classification": 0 },
    { "index": 5, "classification": 0 }
  ]
}

The classification prompt is auto-generated from the app’s seeded osy.DataClassification choice items, or uses a custom classificationPrompt from the classifications: block (see below).

Pass 2 — Group into classification segments. Adjacent paragraphs with the same classification level form a segment:

Segment 1: paragraphs 0-1, classification = Public (0)
Segment 2: paragraphs 2-3, classification = PII (3)
Segment 3: paragraphs 4-5, classification = Public (0)

A “Public / Internal (2 sentences) / Public again” document produces 3 segments, which become at least 3 chunks. The 2 Internal sentences become a small chunk. A small chunk is better than leaking classified content into a Public chunk.
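A minimal sketch of the Pass 2 grouping, assuming the Pass 1 result has already been flattened into a list of levels in document order. The function name `group_segments` and the tuple layout are illustrative, not part of the platform.

```python
from itertools import groupby


def group_segments(paragraph_levels):
    """Group adjacent paragraphs with the same classification into segments.

    paragraph_levels: one classification level per paragraph, in document
    order. Returns a list of (first_index, last_index, level) tuples.
    """
    segments = []
    # groupby only merges *consecutive* equal keys, which is exactly the
    # "adjacent paragraphs" rule: a later Public run stays a separate segment.
    for level, run in groupby(enumerate(paragraph_levels), key=lambda pair: pair[1]):
        indices = [index for index, _ in run]
        segments.append((indices[0], indices[-1], level))
    return segments


print(group_segments([0, 0, 3, 3, 0, 0]))
# [(0, 1, 0), (2, 3, 3), (4, 5, 0)]
```

The output matches the three segments listed above: two Public runs separated by one PII run.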

Pass 3 — Chunk within each segment. The recursive chunker runs independently within each segment using the configured chunkSize and chunkOverlap. A large Public segment might produce 5 chunks. A tiny Internal segment (2 sentences) produces 1 small chunk. The chunker does not see across segment boundaries.

Pass 4 — Merge undersized adjacent chunks (same classification only). After chunking, adjacent chunks with the same classification that are both under half the target chunk size are merged. This handles the case where Pass 2 created many tiny same-classification segments. Never merge across classification boundaries. Two adjacent chunks with different classifications stay separate regardless of size.

The chunker's split priority is: (1) classification boundary, (2) header split, (3) paragraph split, (4) sentence split. Classification boundaries take precedence because mixing classification levels within a chunk would force the entire chunk to the highest level, defeating the purpose.

The invariant: every chunk has exactly one classification level.
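The Pass 4 merge rule can be sketched as follows. This is an illustration under stated assumptions: chunks are `(text, level)` tuples, size is measured in characters, and merged text is joined with a blank line. None of these details are confirmed by the design.

```python
def merge_undersized(chunks, target_size):
    """Merge adjacent chunks that share a classification and are both small.

    chunks: list of (text, level) tuples in document order.
    Two neighbours merge only when their levels match and each is under
    half of target_size. Chunks with different levels are never merged,
    so every output chunk still has exactly one classification level.
    """
    half = target_size / 2
    merged = []
    for text, level in chunks:
        if merged:
            prev_text, prev_level = merged[-1]
            if prev_level == level and len(prev_text) < half and len(text) < half:
                # A merged chunk can absorb further small neighbours
                # as long as it stays under half the target size.
                merged[-1] = (prev_text + "\n\n" + text, level)
                continue
        merged.append((text, level))
    return merged


chunks = [("a" * 100, 0), ("b" * 100, 0), ("c" * 50, 3), ("d" * 100, 0)]
result = merge_undersized(chunks, target_size=1000)
print([(len(text), level) for text, level in result])
# [(202, 0), (50, 3), (100, 0)]
```

The small PII chunk stays on its own even though both neighbours are Public and undersized.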

Windowed Processing for Large Documents

The initial implementation truncated documents at 50K characters before sending them to the LLM for classification. A 200-page PDF might be 400K+ characters — so the LLM only classified the first ~25 pages. Everything after got DataClassification = 0. That’s a real gap.

The fix is windowed classification:

200-page PDF
    |
Text extraction (per-page with PageInfo boundaries)
    |
Split into classification windows (~15-20 pages each)
    Windows overlap by 2 pages for boundary continuity
    |
Each window -> LLM paragraph-level classification + window summary
    |
Stitch results across windows (higher classification wins on overlap)
    |
Four-pass chunking pipeline
    |
Final LLM call: synthesize all window summaries into one file-level summary

Window sizing: Each window targets ~40K characters (well within LLM context limits). Windows split at page boundaries, not mid-paragraph. Each window overlaps with the next by 2 pages.
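A rough sketch of window planning, simplified to size windows by page count rather than by the ~40K character target. The function name `plan_windows` and its defaults are assumptions chosen for illustration.

```python
def plan_windows(page_count, window_pages=15, overlap_pages=2):
    """Return (start, end) page ranges, 0-based and end-exclusive.

    Each window begins overlap_pages before the previous window's end,
    so consecutive windows share exactly overlap_pages pages.
    """
    if overlap_pages >= window_pages:
        raise ValueError("overlap must be smaller than the window")
    if page_count <= 0:
        return []
    stride = window_pages - overlap_pages
    windows = []
    start = 0
    while True:
        end = min(start + window_pages, page_count)
        windows.append((start, end))
        if end >= page_count:
            break
        start += stride
    return windows


windows = plan_windows(200)
print(len(windows), windows[0], windows[1], windows[-1])
# 16 (0, 15) (13, 28) (195, 200)
```

With a 15-page window and a 2-page overlap the effective stride is 13 pages, so a 200-page document yields 16 windows, slightly more than an estimate that ignores the overlap.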

Stitching overlap pages: Both window N and window N+1 classify the overlap pages. If they agree, use the shared classification. If they disagree, take the higher classification (conservative), even though window N+1 generally has the better context for these pages: it sees what follows them, which window N did not.
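The conservative stitching rule can be sketched like this, assuming each window's result has been re-keyed from window-local indices to a document-wide paragraph id (that re-keying step is not shown and is an assumption of this sketch). Using `max` treats the integer value as a proxy for "more sensitive", which is a simplification.

```python
def stitch(window_results):
    """Combine per-window classifications into one map.

    window_results: list of dicts, each mapping a document-wide paragraph
    id to the level that window assigned. Paragraphs on overlap pages
    appear in two dicts; when the windows disagree, the higher level wins.
    """
    final = {}
    for result in window_results:
        for para_id, level in result.items():
            final[para_id] = max(level, final.get(para_id, level))
    return final


window_n = {0: 0, 1: 0, 2: 3}
window_n_plus_1 = {1: 2, 2: 0, 3: 5}
print(stitch([window_n, window_n_plus_1]))
# {0: 0, 1: 2, 2: 3, 3: 5}
```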

Window-level distillation piggybacks on classification. Since each window already goes to the LLM for paragraph-level classification, the system asks for a window summary in the same call — zero additional LLM calls:

{
  "windowSummary": "Pages 15-30 cover the financial projections and risk factors...",
  "paragraphs": [
    { "index": 0, "classification": 0 },
    { "index": 1, "classification": 5 }
  ]
}

The window summaries are in-memory only. After all windows are processed, one final LLM call synthesizes them into a single file-level summary. This solves the truncation problem for distillation — the entire document gets summarised, not just the first ~25 pages.

Cost: A 200-page document at 15 pages per window needs ~14 windows before overlap, or 16 once each window re-reads 2 pages of its predecessor. Each window is one combined classification + distillation call, plus 1 synthesis call at the end. At ~$0.01-0.02 per call (Haiku/Flash), that’s roughly $0.15-0.35 per document. For bulk ingestion of non-sensitive documents, the app developer turns off classification; distillation still runs but as a single call (current behaviour, truncated at 50K).

The Classification Hierarchy Bug

The initial implementation assumed data classification is a strict hierarchy. ClassificationFilter.GetMaxClassificationLevel(roles) returned the highest level the user can access, and the vector search filtered with:

ec."DataClassification" <= @p_maxclass

This means: if you can see Financial (5), you can see PII (3). That’s wrong. An HR manager might see PII but not Financial. A CFO might see Financial but not PII-Sensitive. Classification levels are independent access grants, not a hierarchy.

The fix replaced GetMaxClassificationLevel with GetAllowedClassificationLevels, and the SQL changed from:

ec."DataClassification" <= @p_maxclass

to:

ec."DataClassification" = ANY(@p_allowed_levels)

Where @p_allowed_levels is an int[] parameter. Public (0) is always included. Each role grants access to specific levels independently:

  • CFO role: [0, 5] — Public + Financial, but not PII
  • HR Manager role: [0, 3] — Public + PII, but not Financial
  • Admin role: [0, 1, 2, 3, 4, 5, 6] — everything
  • Viewer role: [0] — Public only

This change affected all retrieval paths — both vector and keyword branches, Pulse context assembly, and agent context assembly.
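To make the difference concrete, here is a small sketch of the allowed-levels model. The `ROLE_GRANTS` table and function names are hypothetical stand-ins for whatever `GetAllowedClassificationLevels` reads from the application block.

```python
PUBLIC = 0

# Hypothetical role grants mirroring the examples above.
ROLE_GRANTS = {
    "CFO": {5},
    "HRManager": {3},
    "Admin": {1, 2, 3, 4, 5, 6},
    "Viewer": set(),
}


def get_allowed_levels(roles):
    """Union of each role's grants; Public (0) is always included."""
    allowed = {PUBLIC}
    for role in roles:
        allowed |= ROLE_GRANTS.get(role, set())
    return sorted(allowed)


def filter_chunks(chunks, roles):
    """Keep chunks whose level is in the allowed set (the = ANY(...) check)."""
    allowed = set(get_allowed_levels(roles))
    return [chunk for chunk in chunks if chunk["classification"] in allowed]


chunks = [
    {"id": "summary", "classification": 0},
    {"id": "ssn-list", "classification": 3},
    {"id": "salaries", "classification": 5},
]
print(get_allowed_levels(["CFO"]))                     # [0, 5]
print([c["id"] for c in filter_chunks(chunks, ["CFO"])])
# ['summary', 'salaries']
```

Under the old `<= max` filter the CFO's maximum of 5 would also have admitted the level-3 chunk, which is the leak this change closes.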

Custom Classification Levels

The default classification levels (Public, Internal, Confidential, PII, PII-Sensitive, Financial, Secret) are seeded as osy.DataClassification choice items. App developers can add domain-specific levels:

seed data osy.DataClassification (Label, Value, Icon, Color):
  "Export Controlled", 7, "Globe", "red"
  "Attorney-Client", 8, "Scale", "purple"

And map roles to them in the application: block:

application:
  classifications:
    classification:
      value = @osy.DataClassification[Label = "Export Controlled"]
      roles = "Admin, ComplianceOfficer"
    classification:
      value = @osy.DataClassification[Label = "Attorney-Client"]
      roles = "Admin, LegalCounsel"

Three layers — choice definition (platform), choice items (seeded defaults, app-extendable), role mappings (application block) — keep the system both extensible and type-safe. The compiler validates @osy.DataClassification[Label = "..."] references exist.

Classification Prompt

The LLM classification prompt is auto-generated from the seeded choice items. Custom levels automatically appear in the classification instructions:

Classify each major section's sensitivity using these levels:
- Public (0): No sensitive information
- Internal (1): Internal business information
- Confidential (2): Business-confidential material
- PII (3): Personally identifiable information
- Financial (5): Financial data -- salaries, accounts, revenue
- Export Controlled (7): Subject to export control regulations
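One way the auto-generation could work is sketched below. The seed data shown earlier carries only Label, Value, Icon, and Color, so the per-level description text used here is an assumed extra input, and `build_classification_prompt` is a hypothetical name.

```python
DEFAULT_HEADER = "Classify each major section's sensitivity using these levels:"


def build_classification_prompt(levels, header=DEFAULT_HEADER):
    """Render the level list for the classifier prompt.

    levels: iterable of (label, value, description) tuples. The
    description is an assumption of this sketch, not a seeded column.
    """
    lines = [header]
    # Sort by numeric value so custom levels (7, 8, ...) land after defaults.
    for label, value, description in sorted(levels, key=lambda level: level[1]):
        lines.append(f"- {label} ({value}): {description}")
    return "\n".join(lines)


levels = [
    ("Export Controlled", 7, "Subject to export control regulations"),
    ("Public", 0, "No sensitive information"),
    ("PII", 3, "Personally identifiable information"),
]
print(build_classification_prompt(levels))
```

Because the list is built from the choice items at prompt time, adding a seeded level is enough for it to appear in the instructions.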

For apps that need domain-specific classification guidance beyond what the choice item names convey, classificationPrompt overrides the auto-generated prompt:

application:
  classifications:
    classificationPrompt = """
      Classify each section using the company's data handling policy:
      - Public: press releases, marketing materials, published research
      - Internal: project plans, meeting notes, internal presentations
      - Confidential: M&A targets, board materials, unreleased financials
      - Export Controlled: anything referencing ITAR/EAR controlled technology
      - Attorney-Client: legal memos, litigation strategy, privilege logs
    """

File Access Model

If chunks are classified and filtered from retrieval, but the original file is freely downloadable, classification is theatre. The file access model enforces classification at every layer.

Three Access Layers

Layer 1 — Chunk-based retrieval (always available, per-chunk filtering). The RAG retrieval surface. Chunks filtered by ClassificationFilter based on the user’s role. A user without Financial clearance never sees salary chunks in search results, Pulse synthesis, or agent context — but they still get the public/internal chunks from the same file.

Getting 60% of a document’s content (the parts you’re cleared for) is far better than getting nothing. The chunks alone tell you what’s in the document at your clearance level.

Layer 2 — Document reader (extracted content, per-chunk filtering). The primary reading experience when someone clicks a citation or opens a document. Assembled from extracted chunks, filtered by classification. Sections the user isn’t cleared for show a placeholder: “Content requires Financial clearance.”

This view is an extraction, not the original. It must be clearly marked as such — “Platform extraction — open original for authoritative version.” The extraction quality depends on the strategy used (see below).

Layer 3 — Original file viewer / download (page-level or file-level gating). For PDFs: rendered in-browser via pdf.js with page-level classification gating. Pages are classified based on the chunks they contain (via PageNumber on osy.FileChunk):

  • Pages where all chunks are at or below the user’s clearance: rendered normally
  • Pages where any chunk exceeds the user’s clearance: redacted or hidden entirely
  • Citation references scroll to the page if the user has clearance

For other formats (DOCX, XLSX): download-only, gated at the file’s highest classification level. Full file download (any format) requires clearance for the file’s highest chunk classification.
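A sketch of the Layer 3 gating logic, assuming the allowed-levels set model (independent grants) rather than a single maximum level. Chunks are reduced to `(page_number, level)` pairs and the function names are illustrative.

```python
from collections import defaultdict


def page_visibility(chunks, allowed_levels):
    """Return {page: True/False}; a page is visible only if every chunk
    on it has a classification the user is cleared for."""
    allowed = set(allowed_levels)
    visible = defaultdict(lambda: True)
    for page, level in chunks:
        visible[page] = visible[page] and level in allowed
    return dict(visible)


def can_download(chunks, allowed_levels):
    """Full download requires clearance for every chunk in the file."""
    allowed = set(allowed_levels)
    return all(level in allowed for _, level in chunks)


chunks = [(1, 0), (1, 0), (2, 0), (2, 5), (3, 3)]
print(page_visibility(chunks, [0, 3]))   # {1: True, 2: False, 3: True}
print(can_download(chunks, [0, 3]))      # False
print(can_download(chunks, [0, 3, 5]))   # True
```

Page 2 is redacted for the HR-style user because one Financial chunk sits on it, and the same chunk blocks the full download.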

Access Matrix

| User clearance vs. file content | Chunks (retrieval) | Document reader | PDF viewer | Download |
|---|---|---|---|---|
| Cleared for all content | All chunks | Full document | All pages | Yes |
| Cleared for some content | Cleared chunks only | Partial with redaction placeholders | Cleared pages, others redacted | No |
| Cleared for no content | No chunks | Not accessible | Not accessible | No |

Extraction Strategies

| Strategy | Quality | Cost | Best for |
|---|---|---|---|
| text-extract | Low (no layout) | Free | Simple text-heavy digital PDFs |
| layout-parse | Medium (gets tables/headers mostly right) | Free | Well-structured digital PDFs |
| vision-llm | High (understands complex layouts, scans, tables) | ~$0.01-0.05/page | Scanned docs, complex layouts, financial tables |

Critical honesty: The extracted view is an interpretation, not a lossless conversion. The vision LLM might restructure, omit, or subtly alter meaning. For anything where exact wording matters (legal clauses, signed contracts, regulatory filings), users must access the original via Layer 3.

The extraction strategy is configurable per-app default, per-file override, or via automatic upgrade (if text-extract produces low-confidence output, escalate to vision-llm).

Why pdf.js for Rendering, Not Extraction

pdf.js is a browser-based PDF renderer (used by Firefox). It’s excellent for rendering — faithful display of the original document. It’s poor for extraction — gives raw text without semantic structure (no markdown headers, table detection fails, no layout understanding).

The platform uses pdf.js for Layer 3 (original file viewing with page-level redaction), not for the ingestion pipeline. Extraction for chunking always happens server-side where LLM-based processing, classification, and embedding can be applied.


Pulse Security

The Pulse is a cached string on the entity — visible to anyone who can read the entity. But the inputs to the Pulse (classified properties, restricted documents, child entities with row-level security) may include data that not all readers should see.

The Pulse agent runs as a defined role, not as a privileged user. The existing SecurityManager filters what the agent can see during context assembly:

agent PulseAgent using CheapLlm:
  runAs = agent
  role = PulseServiceRole
  name = "Pulse Generator"
  purpose = "Generate entity Pulse summaries"

application MyApp:
  pulseAgent = PulseAgent

When the Pulse agent queries the entity’s context, SecurityManager filters based on PulseServiceRole. If that role can’t read sensitive risk log entries, they never enter the prompt, and therefore never appear in PulseContent.

The developer controls what the Pulse sees through the same security policies as everything else:

security RiskLogPolicy on ProjectRiskLog:
  defaults: deny all

  rule TeamCanReadNonSensitive:
    allow: Read
    when = "@CurrentUser.Roles CONTAINS 'ProjectMember'"
    rowFilter = "Category != @RiskCategory[Label = 'Sensitive']"

  rule PulseCanReadNonSensitive:
    allow: Read
    when = "@CurrentUser.Roles CONTAINS 'PulseServiceRole'"
    rowFilter = "Category != @RiskCategory[Label = 'Sensitive']"

  rule LegalCanReadAll:
    allow: Read
    when = "@CurrentUser.Roles CONTAINS 'Legal'"

The Pulse shows what PulseServiceRole can see — a safe, general-audience summary. Users with higher clearance open the chat (which runs as runAs = user) and ask about the sensitive details — the chat sees everything their role permits.

Audit trail shows: “Pulse generated by Agent PulseAgent running as PulseServiceRole.” Clear provenance.


Graph-Aware RAG

Entities don’t exist in isolation. A Project has Tasks, Risks, and Team Members. A Customer has Orders, Tickets, and Contracts. The RAG layer should understand these relationships without the agent having to manually traverse them.

The Relationship Context Type

When an entity’s relationships change (child created, deleted, or modified), the platform generates a Relationship context chunk that captures the entity’s neighbourhood:

[Project: Apollo] [Relationships]
>> "Project Apollo has 24 Tasks (18 completed, 3 in progress, 3 blocked).
    3 open Risks: foundation crack ($45K), permit delay (3 weeks),
    subcontractor availability.
    5 Team Members: Sarah Chen (Owner), Mike Torres (Architect), ...
    2 attached Files: Site_Report.pdf (42 pages), Budget_v3.xlsx.
    Parent Organization: Acme Construction."

This chunk is embedded like any other context row. When the agent searches for “what’s blocking Apollo,” the relationship context surfaces alongside file chunks and property snapshots — giving the agent awareness of the entity’s full graph without recursive queries.

Generation Strategy

The relationship context is generated from the entity’s collection properties — the metadata model already knows all parent-child relationships. For each collection:

  1. Count the children
  2. Group by relevant status/category properties (if they exist)
  3. Summarise the top items by recency or priority
  4. Include parent references

This is a metadata-driven operation — the platform walks the entity’s property metadata, finds Collection-type properties, queries the counts and summaries, and assembles the text. No app-builder configuration needed beyond defining the relationships.
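The first two steps and the parent reference can be sketched as below. This is a simplified illustration: children are plain dicts, the grouping property is assumed to be called `status`, and the top-items summary (step 3) is left out.

```python
from collections import Counter


def relationship_summary(entity_label, collections, parent=None):
    """Build the text of a Relationship context chunk.

    collections: {collection name: list of child dicts}. Children that
    carry a "status" key are grouped by it; others are only counted.
    """
    parts = []
    for name, children in collections.items():
        text = f"{len(children)} {name}"
        statuses = Counter(c["status"] for c in children if "status" in c)
        if statuses:
            breakdown = ", ".join(f"{n} {s}" for s, n in statuses.most_common())
            text += f" ({breakdown})"
        parts.append(text)
    summary = f"{entity_label} has " + ("; ".join(parts) or "no related items") + "."
    if parent:
        summary += f" Parent: {parent}."
    return summary


tasks = [{"status": "completed"}] * 2 + [{"status": "blocked"}]
print(relationship_summary(
    "Project Apollo",
    {"Tasks": tasks, "Files": [{}, {}]},
    parent="Acme Construction",
))
# Project Apollo has 3 Tasks (2 completed, 1 blocked); 2 Files. Parent: Acme Construction.
```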

Staleness Propagation and the Thundering-Herd Problem

When a child entity changes, the parent’s relationship context becomes stale:

Task completed -> parent Project's relationship context stale
                -> parent Project's Pulse stale

This is controlled by a propagation depth (default 1 — only direct parent). Deep propagation (grandparent, etc.) is opt-in to avoid cascading updates.

The thundering-herd problem. Consider: a CSV import creates 500 Task children under a Project. Each Task creation fires a property-change event, which marks the parent Project’s relationship context as stale. Without coalescing, the platform would:

  1. Regenerate the Project’s Relationship context chunk 500 times
  2. Re-embed that chunk 500 times (500 embedding API calls)
  3. Mark the Project’s Pulse stale 500 times and potentially regenerate it 500 times (500 LLM calls)

At scale this turns a $0.02 import into a $5+ LLM bill and a saturated ingestion queue.

Coalescing as a First-Class Primitive

Coalescing is lifted from a Pulse-specific optimisation to a platform-level primitive that governs all staleness propagation:

  • Staleness is a flag, not a trigger. When a child changes, the platform sets PulseStaleSince = now() on the parent (if not already set). It does NOT immediately enqueue regeneration. The flag is idempotent — 500 child changes set the same flag once.

  • Coalescing window. A background job periodically scans for stale entities. It collects all entities where PulseStaleSince IS NOT NULL AND PulseStaleSince < now() - coalesce_window. A burst of 500 child changes within the window produces exactly one regeneration.

  • Batch-aware ingestion. Bulk operations (CSV import, API batch create, seed data) are identifiable as a batch. During a batch, staleness propagation is deferred entirely — the parent is marked stale once when the batch commits, not per-row.

  • Rate limiting per entity. Even outside batches, an entity’s context is not regenerated more than once per coalescing window. If the window is 30 seconds, a Project with continuous child updates gets at most 2 regenerations per minute.
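The flag-plus-window behaviour can be illustrated with a small in-memory tracker. `StalenessTracker` and its methods are hypothetical. In the platform the flag lives in `PulseStaleSince` and the scan is a background job.

```python
from datetime import datetime, timedelta


class StalenessTracker:
    def __init__(self, coalesce_window=timedelta(seconds=30)):
        self.window = coalesce_window
        self.stale_since = {}  # entity id -> time it first went stale

    def mark_stale(self, entity_id, now):
        # Idempotent: only the first change in a burst records a timestamp.
        self.stale_since.setdefault(entity_id, now)

    def collect_due(self, now):
        """Return entities stale for longer than the window and clear
        their flags, so each burst yields a single regeneration."""
        cutoff = now - self.window
        due = [e for e, since in self.stale_since.items() if since < cutoff]
        for entity_id in due:
            del self.stale_since[entity_id]
        return due


t0 = datetime(2026, 1, 1, 12, 0, 0)
tracker = StalenessTracker()
for i in range(500):  # a 500-row import touching the same parent
    tracker.mark_stale("project-1", t0 + timedelta(milliseconds=i))

print(tracker.collect_due(t0 + timedelta(seconds=10)))  # []
print(tracker.collect_due(t0 + timedelta(seconds=31)))  # ['project-1']
```

Five hundred child changes produce one entry in the due list once the window has elapsed.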

The coalescing window is configurable per entity type via the rag: block:

entity Project:
  rag:
    context = auto
    coalesce = 30s         # default
    propagation = parent   # depth 1, direct parent only

entity ActivityLog:
  rag:
    context = auto
    coalesce = 120s        # high-churn entity, longer window

Multi-Hop Retrieval

For complex queries (“which projects in the Construction division are over budget?”), the agent can combine:

  1. Cross-entity search on entity_type = 'Project' for “over budget”
  2. Relationship context on each result to see the division (parent Organization)
  3. Property data to verify the budget numbers

The graph-aware context makes step 2 unnecessary in most cases — the relationship chunk already contains “Parent Organization: Acme Construction,” so the vector search naturally ranks Construction-division projects higher for that query.


Soft References — Agent-Driven Knowledge Graphs

Traditional systems rely on hard relations (EntityRef/Collection) defined at schema time by the app builder. But some of the most valuable connections between entities emerge at runtime — during conversations, document reviews, or agent analysis. A user reviewing Contract A asks the agent to look up a similar contract and says “add a reference to that.” No schema relationship exists between them, but the connection is meaningful, discoverable, and should be navigable.

The key insight: agents eliminate the friction. Manually creating cross-references is high-effort, low-immediate-reward work — nobody does it consistently. But in a conversation with an agent, the references are a natural byproduct. The agent already found the target, already knows why it’s relevant, already has the context. Creating the reference is one sentence from the user, and the agent writes a better reason than most humans would bother to.

This compounds over time. Six months in, an entity has a web of references that nobody planned but everyone benefits from. A new team member opens a contract and immediately sees “this was based on Template X, the indemnity clause was modelled on Contract Y, three risks were extracted from it.” That’s institutional knowledge that normally lives in someone’s head.

The Reference Entity

choice ReferenceKind:
  Citation              # Agent RAG output referenced a source
  CrossReference        # Related entities linked by user or agent
  DerivedFrom           # Entity was created/extracted from the target
  Bookmark              # User saved a specific location for later
  Supersedes            # This version replaces the target
  BuildsOn              # This entity extends or builds upon the target
  Contradicts           # This entity conflicts with or disputes the target
  UpdateOf              # Newer version of the same logical content
  SupportedBy           # Evidence or justification from the target

choice ReferenceLocationType:
  Page                  # PDF/document page number
  Section               # Named section/heading within a document
  Message               # Specific chat message
  Property              # Specific property on an entity
  Chunk                 # File chunk (for precise RAG citations)

ensure entity osy.Reference:
  description = "Soft link between any two entity instances"
  SourceType: osy.EntityMetadata required
  Source: dynamic reference(SourceType) required
  TargetType: osy.EntityMetadata required
  Target: dynamic reference(TargetType) required
  Label: String(500)
  Reason: String
  Kind: @ReferenceKind required
  LocationType: @ReferenceLocationType
  LocationDetail: Json
  DataClassification: Int32 default = 0

Both sides use dynamic references — any entity type can be a source or target without predefined FKs. The platform’s existing system properties (CreatedAt, CreatedBy) provide full provenance.

Platform Integration

Because osy.Reference is a platform entity, several capabilities work automatically without app builder configuration:

Agent tools. The platform provides AddReference and SearchReferences as built-in function tools available to any agent. The agent populates SourceType, TargetType, Kind, Reason from conversation context. The app builder doesn’t wire these up — they exist because osy.Reference exists.

RAG citations. When Pulse or chat produces citations (see below), the runtime writes Citation-kind references automatically. When a Pulse regenerates, old Citation references for that source are replaced.

Reference panel. The client renders a References section on any entity that has references (as source or target). The app builder can place it explicitly in a view layout, or the platform shows it in a default location.

Bidirectional discovery. “Show me everything that references this template contract” is a query on Target = templateId. “What was this contract based on?” is a query on Source = contractId, Kind = DerivedFrom. Both directions are first-class.

Graph enrichment for RAG. When generating Relationship context, the platform includes soft references alongside hard relations: “Referenced by 3 other contracts, derived from Template X.” Future agent searches benefit from connections created in past conversations.

Navigation. A single reference component handles all target types using TargetType + Target + LocationType + LocationDetail:

| LocationType | Navigation action |
|---|---|
| Page | Open file viewer, scroll to page |
| Section | Open file viewer, scroll to section heading |
| Message | Open chat session, scroll to message |
| Property | Open entity default view, highlight property |
| Chunk | Open file viewer at chunk’s page/section |
| (null) | Open entity default view |

Examples

Agent-created during chat:

User: "Look up contracts with similar indemnity language"
Agent: [searches, finds Contract B]
Agent: "Contract B (Acme MSA 2024) has a similar mutual indemnity
        structure in section 8..."
User: "Add a reference to that"
Agent: [calls AddReference]
  -> Source: Contract A, Target: Contract B
  -> Kind: CrossReference
  -> Label: "Similar indemnity clause structure"
  -> Reason: "Mutual indemnity in section 8 of Acme MSA 2024 uses
              the same cap-at-fees model. Identified during legal
              review of liability terms."

Automatic from RAG citation:

Pulse generates: "Foundation crack detected in sector 7 [src:a1b2c3d4]"
Runtime creates:
  -> Source: Project Apollo (PulseContent), Target: osy.FileAsset (Site_Report.pdf)
  -> Kind: Citation
  -> Label: "Site_Report.pdf, p.12 section 3.2 Risks"
  -> LocationType: Page, LocationDetail: { page: 12, section: "3.2 Risks" }

Insight extraction chain:

User: "Add those risks to the risk log with the references"
Agent creates Risk entity + Reference:
  -> Source: Risk "Liability cap below standard", Target: Contract A
  -> Kind: DerivedFrom
  -> Label: "Extracted from MasterAgreement.pdf section 8.1"
  -> Reason: "Identified during contract review -- liability cap of
              $50K is below industry standard of $250K for this
              contract value."
  -> LocationType: Page, LocationDetail: { page: 8, section: "8.1 Liability" }

Extended Reference Kinds and Salience

The extended ReferenceKind values carry different implications for relevance. When a reference is created, the Kind gives downstream systems a machine-readable signal about the nature of the relationship:

| Kind | Effect on target salience | Effect on source salience |
|---|---|---|
| Citation | Neutral (referenced, but by an output, not by a decision) | N/A |
| CrossReference | Slight positive (something is related) | Slight positive |
| DerivedFrom | Positive (foundational, things are built from this) | Neutral |
| Supersedes | Negative (this has been replaced) | Positive |
| BuildsOn | Positive (still being built upon) | Neutral |
| Contradicts | Positive (live tension, needs resolution) | Positive |
| UpdateOf | Negative (older version) | Positive |
| SupportedBy | Positive (used as evidence) | Neutral |
| Bookmark | Neutral (user intent, not graph structure) | N/A |

The Kind is the structured, machine-readable signal. The Reason field is where the agent captures nuance that the kind alone can’t express:

Kind: BuildsOn
Reason: "The indemnity clause in section 8.2 uses the same mutual-cap-at-fees
         structure established in the Acme MSA 2024, with the liability
         threshold increased from $50K to $100K to reflect the larger
         contract value."

Citation System

Stable Chunk Identifiers

Every chunk assembled into LLM context gets a short stable ID derived from the osy.EntityContext row’s primary key (UUID), truncated to 8 hex characters for prompt compactness:

[src:a1b2c3d4] Financial performance summary from Q4 2025 report...
[src:e5f6a7b8] Customer satisfaction scores from annual survey...
[src:c9d0e1f2] Board resolution from March 2026 meeting...

The src: prefix distinguishes source citations from other bracket syntax. The 8-char hex is enough to be unique within a single prompt. The full UUID is resolved server-side from the truncated ID.
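A minimal sketch of deriving the short IDs and the manifest used to resolve them. The explicit collision check is an added safeguard assumed for this sketch, since two UUIDs can in principle share their first 8 hex characters. The design does not say how such a case is handled.

```python
import uuid


def short_ids(context_ids):
    """Map 8-char short IDs to full UUIDs for the chunks in one prompt.

    context_ids: iterable of uuid.UUID (osy.EntityContext primary keys).
    Raises ValueError if two different UUIDs share the same prefix.
    """
    manifest = {}
    for full in context_ids:
        short = full.hex[:8]  # .hex is the 32-char lowercase form, no dashes
        if short in manifest and manifest[short] != full:
            raise ValueError(f"short ID collision on {short}")
        manifest[short] = full
    return manifest


chunk_id = uuid.UUID("a1b2c3d4-0000-4000-8000-000000000001")
manifest = short_ids([chunk_id])
print(f"[src:{next(iter(manifest))}]")
# [src:a1b2c3d4]
```

The manifest is what later lets the server resolve a cited short ID back to the full UUID.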

Prompt Injection

When assembling context for chat or Pulse, each chunk is prefixed with its [src:ID] tag. Citation instructions travel with the context, not in the system prompt header — keeping them close to the chunks in the LLM’s attention:

When you reference information from the provided context, cite the source
using its [src:ID] tag. Place citations inline at the end of the sentence
or paragraph that uses the information. Example: "Revenue grew 15% in Q4
[src:a1b2c3d4]."

Only cite sources that were provided in the context. Do not fabricate
source IDs.

This is injected by the platform — the app developer doesn’t write it.

Post-Processing

After the LLM responds, the platform parses [src:XXXXXXXX] markers from the response text:

  1. Extract all [src:ID] occurrences
  2. Resolve each 8-char ID to the full osy.EntityContext UUID (lookup from the context assembly manifest)
  3. Validate — does this ID match a chunk that was actually in the prompt? If not, flag as hallucinated and strip
  4. Create an osy.Reference row for each valid citation:
    • SourceType/Source = the entity the chat/Pulse is about
    • TargetType/Target = the entity the cited chunk belongs to
    • Kind = Citation
    • LocationType = Chunk
    • LocationDetail = { "entityContextId": "full-uuid", "sourceFileId": "..." }
    • DataClassification = the cited chunk’s classification value
    • Label = first 100 chars of cited chunk content
    • Reason = the sentence that cited it
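The four steps above can be sketched as a single pass over the response. This is an illustration under assumptions: ManifestEntry, process_citations, the flat Source/Target strings, and the naive sentence splitter are all hypothetical, and a real implementation would write osy.Reference rows rather than return dictionaries.

```python
import re
from dataclasses import dataclass

# Matches any [src:...] marker plus one optional leading space, so that
# stripping a marker does not leave a gap before the following punctuation.
SRC_MARKER = re.compile(r" ?\[src:([^\]\s]+)\]")


@dataclass
class ManifestEntry:            # hypothetical shape of one manifest row
    entity_context_id: str      # full osy.EntityContext UUID
    target: str                 # entity the cited chunk belongs to
    source_file_id: str
    classification: int
    content: str


def citing_sentence(text: str, marker_start: int) -> str:
    """Naive splitter: text between the previous '. ', '! ' or '? ' and the marker."""
    prefix = text[:marker_start]
    boundary = max(prefix.rfind(". "), prefix.rfind("! "), prefix.rfind("? "))
    return prefix[boundary + 1:].strip()


def process_citations(response: str, manifest: dict[str, ManifestEntry], source: str):
    """Return (cleaned_text, reference_rows); unknown IDs are stripped."""
    references = []

    def handle(match: re.Match) -> str:
        entry = manifest.get(match.group(1))
        if entry is None:
            return ""           # hallucinated ID: not in the prompt, so strip it
        references.append({
            "Source": source,
            "Target": entry.target,
            "Kind": "Citation",
            "LocationType": "Chunk",
            "LocationDetail": {
                "entityContextId": entry.entity_context_id,
                "sourceFileId": entry.source_file_id,
            },
            "DataClassification": entry.classification,
            "Label": entry.content[:100],
            "Reason": citing_sentence(match.string, match.start()),
        })
        return match.group(0)   # valid marker stays in the stored text

    cleaned = SRC_MARKER.sub(handle, response)
    return cleaned, references
```

Validating against the per-prompt manifest, rather than against the whole osy.EntityContext table, is what lets the platform tell a fabricated ID apart from a real chunk that simply was not part of this prompt.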

Opt-In via DSL Directives

Citations are opt-in per feature:

entity Order:
  rag:
    context = auto
    pulse = auto
    pulseReferences = true       # Pulse cites its sources
    chatReferences = true        # Entity chat cites its sources

When pulseReferences = true, the Pulse context assembly adds [src:ID] prefixes, the Pulse prompt includes citation instructions, and after generation, citations are extracted and stored as osy.Reference rows. The PulseContent retains the markers — the UI renders them as clickable links.

When chatReferences = true, the same mechanism applies to entity chat responses.

Default: both false. Citations cost more prompt tokens and more post-processing — they should be explicit.

Classification-Aware Citations

osy.Reference carries a DataClassification property (Int32, default 0) — set to the target chunk’s classification at creation time. This means:

  • References are filtered the same way chunks are: DataClassification IN (...) against the user’s allowed levels
  • The label and reason text (which may contain sensitive content from the target) are only visible to users with clearance
  • When loading references for Pulse context assembly or chat, the classification filter applies — the LLM never sees references to content the acting role can’t access

Without this, a reference with Label = "Revenue grew 15% in Q4" pointing to a Financial-classified chunk would leak the classified content to any user who can see the source entity.

For the UI: if a user doesn’t have clearance for a reference’s classification level, it renders as “[source redacted — requires {level} clearance].”
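A small sketch of the two filtering modes described in this section: references are dropped entirely before they reach the LLM, and redacted for the UI. The function names and the redacted record shape are assumptions; only the DataClassification membership test comes from the design.

```python
def references_for_llm(references: list[dict], allowed_levels: set[int]) -> list[dict]:
    """Drop every reference the acting role is not cleared for."""
    return [r for r in references if r["DataClassification"] in allowed_levels]


def references_for_ui(
    references: list[dict],
    allowed_levels: set[int],
    level_names: dict[int, str],
) -> list[dict]:
    """Keep cleared references as-is; replace the rest with a redaction stub.

    The stub carries no Label or Reason, since both may quote classified text.
    """
    shown = []
    for ref in references:
        level = ref["DataClassification"]
        if level in allowed_levels:
            shown.append(ref)
        else:
            shown.append({
                "Kind": ref["Kind"],
                "redacted": True,
                "requiredLevel": level_names.get(level, str(level)),
            })
    return shown
```

The client can then render the stub as the redaction notice, filling in the level name from requiredLevel.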

UI Rendering

The [src:XXXXXXXX] markers in stored text (PulseContent, ChatMessage content) are rendered by the client as interactive citation elements:

  • Inline badge: small superscript number or icon next to the cited text
  • Hover: shows the first 200 chars of the cited chunk + source metadata (file name, page number, entity name)
  • Click: navigates to the source — opens the document reader at the relevant page/section, or scrolls to the entity

The client resolves [src:ID] -> osy.Reference -> LocationDetail -> navigation target. All data is already in the database; the client just follows the references.


Reference Graph Traversal

The retrieval paths described in Part 2 search within the current entity’s scope. But the most valuable insights often come from connections across entities — a liability clause pattern flagged in a previous deal, a risk finding that applies to a similar situation.

The osy.Reference entity is the cross-entity memory. When someone reviews the Acme deal, files a Finding about a missing data breach liability carve-out, and creates a Reference linking that Finding to the specific contract clause — that knowledge should surface when the agent encounters a similar clause in the Apollo deal.

How Reference-Aware Retrieval Works

When search_knowledge receives a query with reference_depth > 0, the flow expands the search scope via the reference graph before the vector search:

1. Embed query ("liability clause") -> queryVector

2. Collect entity IDs from the current entity's scope
   (its collections, related entities)

3. Traverse references from those entities (BFS, bidirectional):
   depth 1: direct References (Finding #47 -> this clause)
   depth 2: Finding #47's own References (-> Acme contract section 12)
   Cycle-safe: visited set prevents re-traversal
   Capped: 50 entities per hop, 200 references per query

4. Vector search across the expanded scope using the SAME queryVector:
   osy.EntityContext WHERE EntityId IN (expanded set)
     AND Embedding <=> queryVector < threshold
     AND DataClassification IN (allowed levels)

5. Results tagged with viaReference: true/false
   so the LLM knows which came from graph traversal

The vector search on the expanded set is what makes this precise — traversal at depth 2 on a busy reference graph might reach hundreds of entities, but the vector search filters to only those semantically relevant to “liability clause.” A Finding about pricing on the same deal gets traversed but scores low and is excluded.
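Step 3 of the flow can be sketched as a bounded breadth-first search. This is an in-memory illustration, not the platform's TraverseReferencesAsync: the function name, the edge-list input, and the exact way the two caps are counted are assumptions.

```python
from collections import defaultdict


def expand_scope(
    seed_ids: list[str],
    references: list[tuple[str, str]],
    max_depth: int,
    max_entities_per_hop: int = 50,
    max_references: int = 200,
) -> dict[str, int]:
    """Return {entity_id: depth}; depth 0 marks the original scope."""
    # Bidirectional: every reference can be followed from either end.
    neighbours: dict[str, list[str]] = defaultdict(list)
    for source, target in references:
        neighbours[source].append(target)
        neighbours[target].append(source)

    depths = {entity: 0 for entity in seed_ids}   # doubles as the visited set
    frontier = list(seed_ids)
    followed = 0
    for depth in range(1, max_depth + 1):
        next_frontier: list[str] = []
        for entity in frontier:
            for other in neighbours[entity]:
                if followed >= max_references or len(next_frontier) >= max_entities_per_hop:
                    break
                followed += 1
                if other not in depths:           # cycle-safe: never revisit
                    depths[other] = depth
                    next_frontier.append(other)
        if not next_frontier:
            break
        frontier = next_frontier
    return depths
```

Returning the depth alongside each entity makes the viaReference tag in step 5 a simple depth > 0 check.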

The Cross-Deal Example

  1. User asks “what about the liability clause?” on the Apollo deal
  2. search_knowledge returns chunks from Apollo’s Master Agreement section 8.1
  3. Reference expansion finds a CrossReference from Acme Finding #47 to the Apollo Master Agreement, with label “Similar liability cap structure”
  4. This Reference was created by the LegalReviewer agent during the Acme review — it noticed the same clause pattern and linked it
  5. The system loads Acme Finding #47’s content: “Missing data breach liability carve-out in section 12. Severity: Medium. Resolution: Added explicit carve-out during negotiation.”
  6. The agent now says: “The liability cap in section 8.1 is $4M [src:a1b2c3d4]. Note: this same clause structure was flagged in the Acme deal — the data breach carve-out was missing there too and had to be added during negotiation.”

The agent doesn’t have to know about the Acme deal. The reference graph surfaces it automatically. The previous reviewer’s work — filing the Finding, creating the Reference — is the mechanism.

Security in Traversal

TraverseReferencesAsync respects RBAC — entities the user can’t access are excluded from the traversal. The vector search on the traversed set also applies classification filtering (DataClassification IN (allowed_levels)). The structural permission boundary holds end-to-end: a user who can’t see Deal A’s confidential findings never has those findings surface during their review of Deal B, even if a reference connects them.
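To illustrate how the classification filter holds on the expanded set, here is a small in-memory stand-in for the vector search. It assumes the scope passed in was already produced by an RBAC-respecting traversal, that embeddings are non-zero vectors, and that distance is cosine distance; Chunk and search_expanded_scope are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    entity_id: str
    embedding: list[float]
    classification: int
    text: str


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def search_expanded_scope(
    query_vector: list[float],
    chunks: list[Chunk],
    scope_depths: dict[str, int],   # entity_id -> traversal depth (0 = own scope)
    allowed_levels: set[int],
    threshold: float,
) -> list[dict]:
    results = []
    for chunk in chunks:
        depth = scope_depths.get(chunk.entity_id)
        if depth is None:
            continue                              # outside the expanded scope
        if chunk.classification not in allowed_levels:
            continue                              # classification filter
        distance = cosine_distance(query_vector, chunk.embedding)
        if distance < threshold:
            results.append({
                "text": chunk.text,
                "distance": distance,
                "viaReference": depth > 0,
            })
    return sorted(results, key=lambda r: r["distance"])
```

A chunk reached through a reference is subject to exactly the same classification test as a chunk in the user's own scope, so a connecting reference never widens what the user can read.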

Reference Creation as Learning

Every time an agent or user creates a Reference, the system gets smarter. The reference graph grows over time:

  • Agent reviews a contract, files a Finding, references the clause — future agents find it
  • User marks two findings as related (a cross-link), and future searches surface the connection
  • Agent in deal B finds something similar to deal A, creates a CrossReference — deal C benefits

This is why chatReferences = true and pulseReferences = true matter — they make the agents produce references automatically, which feeds the graph that makes future retrieval better.


Application Patterns

The entity-anchored RAG design maps onto the major RAG application patterns in the market today, as the table below shows. The differentiator: app builders don’t configure RAG. They define entities and relationships, and the platform provides the semantic layer automatically.

| Pattern | Market examples | Platform mapping | What the app builder does |
|---|---|---|---|
| Knowledge base | Notion AI, Confluence AI, Slite | Entity = Document/Page, files = attachments | Define Document entity, upload files |
| Customer support | Intercom Fin, Zendesk AI, Ada | Ticket + Message children, KB articles as separate entities | Define Ticket + Message entities, write agent instructions |
| Long-running chat | ChatGPT memory, Claude projects | Conversation + Message children, rolling summaries | Define Conversation entity, rag: block with children = auto |
| CRM intelligence | Salesforce Einstein, HubSpot AI | Deal/Contact/Company entities with file attachments | Define CRM entities, Pulse = “deal health” |
| Project management | Monday AI, Asana AI, Notion projects | Project -> Tasks/Risks/Milestones, file attachments | Define project entities, attach documents, agent reasons across everything |
| Legal research | Harvey, CaseText, Luminance | Case/Contract entities with deep document chunking | Define Case entity, upload legal documents |
| Enterprise search | Glean, Hebbia, Coveo | Cross-entity workspace search | Deploy app, everything is searchable automatically |
| Document Q&A | ChatPDF, DocuSign IAM | Single entity with file attachment | Upload file to entity, ask questions |
| Multi-agent workflows | CrewAI, AutoGen, LangGraph | Agents share context via entity_context | Define agents with different roles, share entity scope |
| Personal memory | Mem, Rewind, Limitless | User-scoped entities with auto-capture | Define Note/Memory entity, rag: block with children = auto |

What Makes This Different from Standalone RAG Tools

Standalone vector databases (Pinecone, Weaviate, Qdrant, Chroma) provide the storage layer but leave the application developer to build:

  • Entity-to-chunk mapping
  • Ingestion pipelines
  • Staleness management
  • Relationship awareness
  • Structured + unstructured unification
  • Citation tracking back to source
  • Access control (which user can search which entities)

The platform provides all of this as infrastructure. The vector store is an implementation detail — the app builder works at the entity level, and the semantic shadow follows automatically.

Access Control Integration

Because osy.EntityContext rows carry Entity (dynamic reference) and EntityType (EntityRef to EntityMetadata), the existing security policy system applies directly. The platform first evaluates which entities the current user can see (per security policy), then performs the vector similarity search within that scoped set. A support agent searching “billing issues” only sees tickets from their assigned accounts, even though the vector index contains all tickets.


Evolution

The Classification Hierarchy Bug

The initial implementation assumed classification levels form a strict hierarchy: if you can see level 5, you can see levels 0-4. The SQL used <= @p_maxclass. This is wrong — classification levels are independent grants. An HR manager should see PII (3) but not Financial (5). A CFO should see Financial but not PII-Sensitive. The fix was conceptually simple (= ANY(@p_allowed_levels)) but required changes across every retrieval path — vector search, keyword search, Pulse context assembly, and agent context assembly.
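The difference between the two predicates is easy to see in a few lines. The level numbers for PII (3) and Financial (5) are the ones quoted above; the function names and the example grant set are illustrative assumptions.

```python
PII = 3
FINANCIAL = 5


def visible_hierarchical(chunk_level: int, max_level: int) -> bool:
    """Original behaviour: mirrors DataClassification <= @p_maxclass."""
    return chunk_level <= max_level


def visible_by_grant(chunk_level: int, allowed_levels: set[int]) -> bool:
    """Fixed behaviour: mirrors DataClassification = ANY(@p_allowed_levels)."""
    return chunk_level in allowed_levels


# A role granted public (0) and Financial (5) content, but not PII.
finance_grants = {0, FINANCIAL}

leaked = visible_hierarchical(PII, max_level=FINANCIAL)   # True: PII leaks
blocked = visible_by_grant(PII, finance_grants)           # False: PII stays hidden
```

Under the hierarchical check, any role cleared for level 5 implicitly sees level 3; the set-membership check only admits levels that were explicitly granted.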

Per-Chunk vs. Per-File Classification Granularity

The first design classified at the file level. A salary report with a public executive summary and confidential appendix would be classified entirely as Confidential — making the executive summary invisible to users without Confidential clearance. Per-chunk classification solved this by treating each chunk independently. The cost is a more complex ingestion pipeline; the benefit is maximally useful retrieval at every clearance level.

Section-Level vs. Paragraph-Level Classification

Even after moving to per-chunk classification, the initial implementation classified at the section header level — all chunks under a section header got the section’s classification. A section with one PII sentence surrounded by public content over-classified the entire section. The fix was paragraph-level classification: the LLM evaluates each paragraph independently, adjacent same-classification paragraphs form segments, and chunks are created within segments. This added a four-pass pipeline but eliminated over-classification.
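The segment-forming step can be sketched with a single grouping pass. This assumes each paragraph has already been assigned a level by the classifier; segment_paragraphs is a hypothetical name, and chunking within each segment is left out.

```python
from itertools import groupby


def segment_paragraphs(classified: list[tuple[str, int]]) -> list[tuple[int, list[str]]]:
    """Group adjacent paragraphs with the same classification into segments.

    `classified` holds (paragraph_text, level) pairs in document order.
    Only neighbouring paragraphs are merged, so document order is preserved.
    """
    return [
        (level, [text for text, _ in group])
        for level, group in groupby(classified, key=lambda pair: pair[1])
    ]
```

Because chunks are then created inside a segment, no chunk ever spans two classification levels, and a single PII paragraph no longer raises the level of the public paragraphs around it.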

The Bootstrap Refactor

Moving the osy.DataClassification choice from an internal module to the baseline application required a unified multi-file compilation model. Previously, each .osy file compiled independently. With classification choice items seeded in one file and referenced via @osy.DataClassification[Label = "..."] in another, cross-file references had to resolve in a single merged AST. This led to a three-pass seed execution model:

  1. Deferred seeds (choice items from inline blocks)
  2. Data seeds (can reference choice items from pass 1)
  3. Post-seed actions (classification configs with choice refs, can look up resolved choice data)

Reference Traversal Simplification

The original design for reference-aware retrieval proposed a two-pass approach: search the current entity scope, then traverse references, then search again on the traversed set, then merge results. The implementation simplified this to expanding the scopeEntityIds set before the search — the existing search post-filter naturally includes referenced entities. One search, not two. Simpler, and the vector similarity scoring naturally handles relevance across the expanded set.