Hands-On: Build a Knowledge Model in 30 Minutes

This tutorial walks through the same patterns exercised by the medical benchmark: defining relation semantics, indexing nodes/edges/chunks, and retrieving multi-hop context. Everything below is grounded in the beta-0.5 code and tests (no placeholders).


1) Add the Framework

<dependency>
    <groupId>org.ddse.ml</groupId>
    <artifactId>cef-framework</artifactId>
    <version>beta-0.5</version>
</dependency>

2) Configure a Tested Stack (DuckDB + Ollama)

src/main/resources/application.yml

cef:
  graph:
    store: jgrapht      # Tested to ~100K nodes in-memory
  vector:
    store: duckdb       # Embedded, no external DB
  llm:
    default-provider: ollama
    ollama:
      base-url: http://localhost:11434
      model: nomic-embed-text

spring:
  main:
    web-application-type: reactive

This tested combination matches the benchmark harness: DuckDB + JGraphT + Ollama embeddings. vLLM (Qwen3-Coder-30B) was used for generation; you can plug it in later without changing any code.


3) Declare Relation Semantics (Like JPA Mappings)

@Configuration
public class KnowledgeModelConfig {

    private final KnowledgeIndexer indexer;

    public KnowledgeModelConfig(KnowledgeIndexer indexer) {
        this.indexer = indexer;
    }

    @PostConstruct
    public void initializeRelations() {
        var relationTypes = List.of(
            new RelationType("TREATS", RelationSemantics.CAUSAL, true,
                "Doctor treats patient"),
            new RelationType("HAS_CONDITION", RelationSemantics.ASSOCIATIVE, false,
                "Patient has medical condition"),
            new RelationType("PRESCRIBED_MEDICATION", RelationSemantics.CAUSAL, false,
                "Patient prescribed medication")
        );

        indexer.initialize(relationTypes).block();
    }
}

These semantics mirror the benchmark scenarios (contraindications, comorbidities, shared doctors).


4) Index Nodes, Edges, and Chunks (Dual Persistence)

// Entity nodes (graph + optional vectorizable content)
Node patient = new Node(
    null, "Patient",
    Map.of("name", "John Doe", "age", 45, "gender", "M"),
    "45-year-old male with type 2 diabetes and hypertension."
);
Node condition = new Node(
    null, "Condition",
    Map.of("name", "Type 2 Diabetes", "icd10", "E11.9"),
    "**CONDITION PROFILE** Name: Type 2 Diabetes Mellitus..."
);
UUID patientId = indexer.indexNode(patient).block().getId();
UUID conditionId = indexer.indexNode(condition).block().getId();

// Typed relationship
Edge hasCondition = new Edge(
    null, "HAS_CONDITION", patientId, conditionId,
    Map.of("diagnosedOn", "2025-01-10"), 1.0
);
indexer.indexEdge(hasCondition).block();

// Additional chunks tied to the patient (semantic side)
Chunk encounter = new Chunk();
encounter.setContent("**CLINICAL ENCOUNTER NOTE** Patient presents with chest pain...");
encounter.setLinkedNodeId(patientId);
encounter.setMetadata(Map.of("source", "ehr", "encounterId", "ENC-1001"));
indexer.indexChunk(encounter).block();

All writes update both the graph store and vector store automatically (dual persistence).
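
The calls above use .block() for clarity. In a fully reactive service you can compose the same indexer methods without blocking; the sketch below is a minimal example, assuming indexChunk returns a Reactor Mono (as the blocking calls above imply), and the note contents are illustrative.

// Index several encounter notes for one patient without blocking.
// Assumes indexer.indexChunk(...) returns a Mono, as the .block() calls above imply;
// the note texts are illustrative.
List<String> notes = List.of(
    "**CLINICAL ENCOUNTER NOTE** Follow-up visit: HbA1c 7.2%, metformin tolerated...",
    "**CLINICAL ENCOUNTER NOTE** Blood pressure 150/95, lisinopril dose reviewed..."
);

Mono<Void> indexAll = Flux.fromIterable(notes)
    .map(text -> {
        Chunk chunk = new Chunk();
        chunk.setContent(text);
        chunk.setLinkedNodeId(patientId);            // ties each chunk to the graph node
        chunk.setMetadata(Map.of("source", "ehr"));
        return chunk;
    })
    .flatMap(indexer::indexChunk)                    // dual persistence per chunk
    .then();                                         // completes when all writes finish

Subscribing to indexAll (or returning it from a reactive endpoint) triggers the writes.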


5) Retrieve Multi-Hop Context (Same Flow as Benchmarks)

public Mono<RetrievalResult> findPatientContext(String patientName) {
    var graphQuery = new GraphQuery(
        List.of(new ResolutionTarget(
            patientName,   // semantic text used for entry-point resolution
            "Patient",     // type hint
            null           // property filter
        )),
        new TraversalHint(
            3,             // max depth
            List.of("HAS_CONDITION", "PRESCRIBED_MEDICATION"),
            null           // both directions
        )
    );

    var request = RetrievalRequest.builder()
        .query("Find context for " + patientName)
        .graphQuery(graphQuery)
        .topK(10)
        .maxGraphNodes(50)
        .maxTokenBudget(4000)
        .build();

    return retriever.retrieve(request);
}

retriever is the org.ddse.ml.cef.api.KnowledgeRetriever bean provided by Spring.
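
A minimal way to exercise the method is shown below. The getChunks() and getNodes() accessors are assumptions about RetrievalResult made for illustration; substitute the accessors the class actually exposes.

// Usage sketch; getChunks()/getNodes() are assumed accessor names on RetrievalResult.
findPatientContext("John Doe")
    .doOnNext(result -> {
        System.out.println("Chunks retrieved: " + result.getChunks().size());
        System.out.println("Graph nodes touched: " + result.getNodes().size());
    })
    .block(); // blocking here only for a quick demonstration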

Retrieval order (matches the benchmark harness):

  1. Resolve candidate Patient nodes via semantic search (with the type hint).
  2. Traverse HAS_CONDITION and PRESCRIBED_MEDICATION up to depth 3.
  3. Run vector search constrained to the traversed subgraph.
  4. Fall back to vector-only search if fewer than 3 results remain.

This is the same flow that delivered 12 vs 5 chunks for contraindication discovery.


6) Validate with the Built-In Benchmarks

Run the same suite that produced the published numbers:

cd cef-framework
mvn -Dtest=MedicalBenchmarkTest test
# Reports: cef-framework/BENCHMARK_REPORT.md, BENCHMARK_REPORT_2.md

Key expected outputs:

  • Scenario 1 (contraindications): 12 vs 5 chunks (+140%).
  • Scenario 4 (shared doctors): 16 vs 5 chunks (+220%).
  • Advanced separation/aggregation (Benchmark 2): +6 to +9 chunks over vector-only.

Where to Go Next

  • Swap storage backends (PostgreSQL/pgvector, Neo4j) in application.yml once you need larger graphs.
  • Expose the MCP tool to your LLM so it receives the schema and required fields automatically.
  • Add more relation types (TEMPORAL, HIERARCHY) to reflect your domain and guide traversal; see the sketch after this list.
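
As a starting point for that last item, here is a minimal sketch of a time-ordered relation. It assumes RelationSemantics exposes a TEMPORAL value, as the bullet suggests; the relation name and description are illustrative, not part of the shipped benchmark configuration.

// Hypothetical temporal relation; assumes RelationSemantics.TEMPORAL exists.
var followedBy = new RelationType(
    "FOLLOWED_BY",               // illustrative name, e.g. ordering clinical encounters
    RelationSemantics.TEMPORAL,  // assumed enum value, per the bullet above
    false,                       // direction matters for temporal ordering
    "Encounter is followed by a later encounter"
);

// Register it alongside the relation types declared in step 3's initialize(...) call.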