🧪 Testing the Explainable AI (XAI) Layer

📋 Overview

The Explainable AI (XAI) layer is evaluated using four distinct clinical scenarios generated by the Neo4j inference engine. These tests are designed to challenge the LLM’s ability to handle logical negations (blocking symptoms), differential weighting, and confidence assessment.

📐 Test Cases

TC1 — Negation Handling

Focus: Can the model correctly identify why a high-scoring disease was excluded?

Diseases Graph Visualization

Field	Value
Most Likely	nonparalytic poliomyelitis (Passed Filter: True)
Differentials	Ebola virus disease and poliomyelitis (Passed Filter: True)
Excluded conditions	Marburg hemorrhagic fever and West Nile fever (Passed Filter: False)
Blocking symptoms	maculopapular rash and/or chills (patient denied both)

Success criteria: The reasoning must explicitly mention the absence of the rash as the reason for excluding West Nile fever and both rash and chills for Marburg hemorrhagic fever.

TC2 — Broad Differential Diagnostic

Focus: Distinguishing between diseases with very similar symptom profiles.

Diseases Graph Visualization

Field	Value
Most Likely	hepatitis E (Passed Filter: True)
Differentials	hepatitis B, hepatitis C and hepatitis A (Passed Filter: True)
Excluded conditions	hepatitis D (Passed Filter: False)
Blocking symptoms	drowsiness and confusion (patient denied both)

Success criteria: Model assigns High Confidence to Hepatitis E, explains that A, B, and C are less likely due to lower symptomatic alignment and mentions the absence of drowsiness and confusion as the reason for excluding hepatitis D.

TC3 — Low Disease Coverage

Focus: Managing high-severity cases with low disease coverage.

Diseases Graph Visualization

Field	Value
Most Likely	West Nile encephalitis (Passed Filter: True)
Differentials	Powassan encephalitis and Eastern equine encephalitis (Passed Filter: True)
Excluded conditions	Japanese encephalitis and St. Louis encephalitis (Passed Filter: False)
Blocking symptoms	spastic paralysis (patient denied)

Success criteria: Model identifies West Nile as the primary diagnosis despite low confidence, explicitly recommends further testing and mentions the absence of spastic paralysis as the reason for excluding Japanese encephalitis and St. Louis encephalitis.

TC4 — Overlapping Symptoms

Focus: Handling overlapping symptoms where no conditions are excluded.

Diseases Graph Visualization

Field	Value
Most Likely	Powassan encephalitis (Passed Filter: True)
Differentials	nonparalytic poliomyelitis, La Crosse encephalitis and poliomyelitis (Passed Filter: True)
Excluded conditions	—
Blocking symptoms	—

Success criteria: Model identifies seizure as the clinical tie-breaker that elevates Powassan encephalitis over the competing candidates.

⚙️ Test Environment

Property	Value
OS	Windows-11-10.0.26200-SP0
CPU	AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
RAM	17.1 GB
GPU	NVIDIA GeForce RTX 3060
VRAM	12.0 GB
Python	3.12.9

🧪 Test 1: llama3.2:3b (local)

Overall assessment: The updated prompt improved JSON structure compliance with the new differential_comparison format. However, the model continues to exhibit critical failures in passed_filter logic — placing excluded diseases in differentials and hallucinating exclusion reasons for diseases that were never filtered out.

⏱️ Performance

Test Case	Total Time
TC1	34.96s
TC2	21.46s
TC3	29.74s
TC4	22.61s
Total	108.77s

📊 Performance Matrix

Metric	Result	Commentary
JSON structural integrity	100%	Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms)	25%	TC1: placed Marburg and West Nile in differentials; TC3: placed Japanese encephalitis and St. Louis encephalitis in differentials and in excluded conditions
Internal consistency	Fail	TC4: hallucinated exclusion of primary amebic meningoencephalitis (`exclusion_criteria`) despite passed_filter: true
Clinical tone	High	Medical vocabulary

💬 Qualitative Analysis

1. Negation & filter logic (TC1 & TC3)

In TC3, the model correctly identifies West Nile encephalitis as the most likely diagnosis. However, it fails the logical constraint test by placing Japanese encephalitis and St. Louis encephalitis into the differential diagnosis category, completely ignoring the passed_filter: false flag. These conditions should have been moved to excluded_conditions due to the presence of the blocking symptom spastic paralysis, which the patient explicitly denied. Furthermore, the model incorrectly justifies their inclusion by claiming they have lower scores, when in fact, they had higher raw scores but were excluded.

2. Hallucinated exclusion (TC4)

In TC4, all five diseases have passed_filter: true and no blocking symptoms exist. Despite this, the model placed primary amebic meningoencephalitis in excluded_conditions, falsely claiming the patient denied coma — a symptom that appears only in the Missing List, not the Blocking Symptoms list. This is the same missing/blocking confusion as before.

🧪 Test 2: llama3:8b (local)

Overall assessment: llama3:8b shows improved structural compliance with the new output format. TC3 and TC4 are handled correctly. However, TC1 still exhibits a critical passed_filter mapping error where Marburg hemorrhagic fever is placed in differentials instead of excluded conditions, and poliomyelitis is incorrectly placed in excluded conditions.

⏱️ Performance

Test Case	Total Time
TC1	22.45s
TC2	21.07s
TC3	20.36s
TC4	22.66s
Total	86.54s

📊 Performance Matrix

Metric	Result	Commentary
JSON structural integrity	100%	Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms)	50%	TC1: Marburg in differentials, poliomyelitis incorrectly excluded
Internal consistency	Partial	TC1 has contradictions
Clinical tone	High	Medical vocabulary

💬 Qualitative Analysis

1. Negation & filter logic (TC1, TC2 & TC3)

In TC1, the model placed Marburg hemorrhagic fever in differentials despite its passed_filter: false flag. At the same time, it incorrectly excluded poliomyelitis — a disease with passed_filter: true — citing flaccid paralysis as a blocking symptom. Flaccid paralysis is listed under Missing List, not Blocking Symptoms, making this an unjustified exclusion.

2. Correct handling (TC2, TC3, TC4)

TC2 correctly excluded hepatitis D with the right justification. TC3 correctly placed Japanese encephalitis and St. Louis encephalitis in excluded_conditions citing spastic paralysis. TC4 correctly identified Powassan encephalitis as most likely with all others as differentials and an empty excluded_conditions list.

🧪 Test 3: mistral-nemo:12b (local)

Overall assessment: Mistral Nemo 12b shows mixed results. It handles TC2, TC3, and TC4 correctly but fails in TC1 by incorrectly placing West Nile fever in differentials and excluding poliomyelitis using Missing List symptoms as justification.

⏱️ Performance

Test Case	Total Time
TC1	28.15s
TC2	26.26s
TC3	23.26s
TC4	18.16s
Total	95.83s

📊 Performance Matrix

💬 Qualitative Analysis

1. Negation & filter logic (TC1)

In TC1, West Nile fever (passed_filter: false) was placed in differentials instead of excluded_conditions. The model also incorrectly excluded poliomyelitis (passed_filter: true), justifying the exclusion with flaccid paralysis — which appears only in the Missing List and does not qualify as a blocking symptom.

2. Correct handling (TC2, TC3, TC4)

TC2 correctly excluded hepatitis D citing drowsiness and confusion. TC3 correctly excluded both Japanese and St. Louis encephalitis citing spastic paralysis. TC4 correctly identified Powassan encephalitis as most likely and included all other diseases as differentials with an empty excluded list.

3. TC4 differential comparison quality

In TC4, the model correctly identified seizure as absent from poliomyelitis and nonparalytic poliomyelitis, indirectly supporting Powassan encephalitis as the tie-breaker — though it did not explicitly name seizure as the deciding factor.

🧪 Test 4: phi4:14b (local)

Overall assessment: Phi4 14b demonstrates strong logical compliance across all test cases. All passed_filter mappings are correct. TC4 now includes all four differentials correctly and explicitly references seizure and stiff neck as the tie-breaking symptoms for Powassan encephalitis.

⏱️ Performance

Test Case	Total Time
TC1	50.28s
TC2	45.15s
TC3	45.86s
TC4	53.57s
Total	194.86s

📊 Performance Matrix

💬 Qualitative Analysis

1. Precise handling of negation (TC1, TC2, TC3)

In TC1, Marburg hemorrhagic fever and West Nile fever are correctly placed in excluded_conditions, with explicit references to the denied maculopapular rash and chills. In TC2, hepatitis D is correctly excluded citing drowsiness and confusion. In TC3, both Japanese and St. Louis encephalitis are correctly excluded citing spastic paralysis.

2. TC4 tie-breaker identification

In TC4, the model correctly identifies seizure and stiff neck as the symptoms that distinguish Powassan encephalitis from nonparalytic poliomyelitis and poliomyelitis, fulfilling the success criteria for this test case.

3. Trade-off: accuracy vs speed

Phi4 14b is the slowest local model tested, averaging nearly 49 seconds per test case. This is a significant trade-off compared to smaller models, and should be considered when evaluating it for production use.

🧪 Test 5: qwen2.5:14b (local)

Overall assessment: Qwen 2.5 14b demonstrates strong logical compliance across all test cases. All passed_filter mappings are correct. TC4 confidence is set to moderate — a more conservative and arguably more appropriate calibration than the high assigned by phi4, given that Powassan encephalitis covers only 50% of its known symptoms.

⏱️ Performance

Test Case	Total Time
TC1	49.49s
TC2	56.38s
TC3	52.73s
TC4	58.12s
Total	216.72s

📊 Performance Matrix

Metric	Result	Commentary
JSON structural integrity	100%	Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms)	100%	All passed_filter: false diseases correctly placed in excluded_conditions across all test cases
Internal consistency	High	Textual reasoning directly supports the content of the JSON arrays
Clinical tone	High	Medical vocabulary

💬 Qualitative Analysis

1. Precise handling of negation (TC1, TC2, TC3)

In TC1, Marburg hemorrhagic fever and West Nile fever are correctly excluded citing denied maculopapular rash and chills. In TC2, hepatitis D is correctly excluded citing drowsiness and confusion. In TC3, Japanese and St. Louis encephalitis are correctly excluded citing spastic paralysis.

2. TC4 differential comparison quality

In TC4, the model explicitly names seizure as the differentiating symptom across all differential comparisons — noting its absence from nonparalytic poliomyelitis, poliomyelitis, La Crosse encephalitis, and primary amebic meningoencephalitis. This is the most thorough fulfillment of the TC4 success criteria among all tested models.

3. Confidence calibration

The moderate confidence assigned in TC4 reflects a correct reading of the data — Disease Coverage is 50% and Input Coverage is 100%, which places the case in the moderate band per the prompt definition. This is more precise than phi4’s high assignment for the same scenario.

🧪 Test 6: meta-llama/llama-4-scout-17b-16e-instruct (cloud)

Overall assessment: The cloud-hosted Llama 4 Scout demonstrates strong logical compliance with the passed_filter flag and delivers clinically coherent reasoning. Its most notable characteristic is its inference speed — completing each test case in approximately 1.3–1.4 seconds, significantly faster than any local model tested.

⏱️ Performance

Test Case	Total Time
TC1	1.44s
TC2	1.32s
TC3	1.21s
TC4	1.41s
Total	5.38s

📊 Performance Matrix

Metric	Result	Commentary
JSON structural integrity	100%	Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms)	100%	All passed_filter: false diseases correctly placed in excluded_conditions across all test cases
Internal consistency	High	Textual reasoning directly supports the content of the JSON arrays
Clinical tone	High	Medical vocabulary

💬 Qualitative Analysis

1. Negation & filter logic (TC1, TC2 & TC3)

In TC1, correctly places Marburg hemorrhagic fever and West Nile fever in excluded_conditions, explicitly citing the absence of maculopapular rash and chills as the deciding factors.

In TC2, Hepatitis D is correctly excluded with a precise explanation referencing the absence of drowsiness and confusion as mandatory markers.

In TC3, Japanese encephalitis and St. Louis encephalitis are correctly excluded, with the model clearly attributing the exclusion to the denied spastic paralysis.

🧪 Test 7: openai/gpt-oss-120b (cloud)

Overall assessment: The largest model tested demonstrates strong clinical reasoning and correct filter logic compliance. However, it exhibits the most pronounced High-Intelligence Bias — it consistently generates more nuanced and detailed reasoning than other models, sometimes at the cost of strict adherence to the simplified expected output format.

⏱️ Performance

Test Case	Total Time
TC1	2.95s
TC2	2.24s
TC3	11.88s
TC4	33.34s
Total	50.41s

📊 Performance Matrix

Metric	Result	Commentary
JSON structural integrity	100%	Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms)	100%	All `passed_filter: false` diseases correctly placed in `excluded_conditions` across all test cases
Internal consistency	High	Textual reasoning directly supports the content of the JSON arrays
Clinical tone	High	Most detailed medical vocabulary

💬 Qualitative Analysis

1. Negation & filter logic (TC1, TC2 & TC3)

In TC1, correctly excludes Marburg hemorrhagic fever and West Nile fever, providing the most detailed blocking symptom explanation of all tested models.

In TC2, correctly excludes Hepatitis D with precise clinical justification referencing drowsiness and confusion.

In TC3, correctly excludes Japanese encephalitis and St. Louis encephalitis, citing spastic paralysis as the decisive blocking factor.

📊 Summary: Comparison of Model Performance

Model	Size	Type	JSON Integrity	Exclusion Logic	Internal Consistency	Clinical Tone	Total Time
llama3.2:3b	3B	Local	✅ 100%	❌ 25%	❌ Fail	✅ High	108.77s
llama3:8b	8B	Local	✅ 100%	⚠️ 50%	⚠️ Partial	✅ High	86.54s
mistral-nemo:12b	12B	Local	✅ 100%	⚠️ 50%	⚠️ Partial	✅ High	95.83s
phi4:14b	14B	Local	✅ 100%	✅ 100%	✅ High	✅ High	194.86s
qwen2.5:14b	14B	Local	✅ 100%	✅ 100%	✅ High	✅ High	216.72s
meta-llama/llama-4-scout-17b-16e-instruct	17b-16e (MoE)	Cloud	✅ 100%	✅ 100%	✅ High	✅ High	5.38s
openai/gpt-oss-120b	120B	Cloud	✅ 100%	✅ 100%	✅ High	✅ High	50.41s

🏁 Conclusion

The updated prompt with structured differential_comparison and exclusion_criteria formats produced measurable improvements across all models. The 14B threshold remains the critical boundary for full passed_filter compliance — both phi4 and qwen2.5 correctly handled all four test cases, while models below 12B consistently failed on at least two.

A notable new finding is the introduction of mistral-nemo:12b, which performs on par with llama3:8b despite having 50% more parameters. Both achieve 50% exclusion logic accuracy, suggesting that parameter count alone is not the primary factor — instruction-following capability and training data quality play an equally important role.

Cloud models offer significantly faster inference — llama-4-scout completed all four test cases in just 5 seconds compared to 70–90 seconds for local 14B models. However, gpt-oss-120b showed variable latency (2.5s to 21s per case), suggesting that response depth impacts inference time more than model size alone.

For production use, qwen2.5:14b represents the optimal local model — combining full logical compliance with reasonable inference speed. For cloud deployments where latency is critical, llama-4-scout-17b is the preferred choice.