๐Ÿงช Testing the Explainable AI (XAI) Layer

๐Ÿ“‹ Overview

The Explainable AI (XAI) layer is evaluated using four distinct clinical scenarios generated by the Neo4j inference engine. These tests are designed to challenge the LLMโ€™s ability to handle logical negations (blocking symptoms), differential weighting, and confidence assessment.


๐Ÿ“ Test Cases

TC1 โ€” Negation Handling

Focus: Can the model correctly identify why a high-scoring disease was excluded?

Diseases Graph Visualization

Field Value
Most Likely nonparalytic poliomyelitis (Passed Filter: True)
Differentials Ebola virus disease and poliomyelitis (Passed Filter: True)
Excluded conditions Marburg hemorrhagic fever and West Nile fever (Passed Filter: False)
Blocking symptoms maculopapular rash and/or chills (patient denied both)

Success criteria: The reasoning must explicitly mention the absence of the rash as the reason for excluding West Nile fever and both rash and chills for Marburg hemorrhagic fever.


TC2 โ€” Broad Differential Diagnostic

Focus: Distinguishing between diseases with very similar symptom profiles.

Diseases Graph Visualization

Field Value
Most Likely hepatitis E (Passed Filter: True)
Differentials hepatitis B, hepatitis C and hepatitis A (Passed Filter: True)
Excluded conditions hepatitis D (Passed Filter: False)
Blocking symptoms drowsiness and confusion (patient denied both)

Success criteria: Model assigns High Confidence to Hepatitis E, explains that A, B, and C are less likely due to lower symptomatic alignment and mentions the absence of drowsiness and confusion as the reason for excluding hepatitis D.


TC3 โ€” Low Disease Coverage

Focus: Managing high-severity cases with low disease coverage.

Diseases Graph Visualization

Field Value
Most Likely West Nile encephalitis (Passed Filter: True)
Differentials Powassan encephalitis and Eastern equine encephalitis (Passed Filter: True)
Excluded conditions Japanese encephalitis and St. Louis encephalitis (Passed Filter: False)
Blocking symptoms spastic paralysis (patient denied)

Success criteria: Model identifies West Nile as the primary diagnosis despite low confidence, explicitly recommends further testing and mentions the absence of spastic paralysis as the reason for excluding Japanese encephalitis and St. Louis encephalitis.


TC4 โ€” Overlapping Symptoms

Focus: Handling overlapping symptoms where no conditions are excluded.

Diseases Graph Visualization

Field Value
Most Likely Powassan encephalitis (Passed Filter: True)
Differentials nonparalytic poliomyelitis, La Crosse encephalitis and poliomyelitis (Passed Filter: True)
Excluded conditions โ€”
Blocking symptoms โ€”

Success criteria: Model identifies seizure as the clinical tie-breaker that elevates Powassan encephalitis over the competing candidates.


โš™๏ธ Test Environment

Property Value
OS Windows-11-10.0.26200-SP0
CPU AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
RAM 17.1 GB
GPU NVIDIA GeForce RTX 3060
VRAM 12.0 GB
Python 3.12.9

๐Ÿงช Test 1: llama3.2:3b (local)

Overall assessment: The updated prompt improved JSON structure compliance with the new differential_comparison format. However, the model continues to exhibit critical failures in passed_filter logic โ€” placing excluded diseases in differentials and hallucinating exclusion reasons for diseases that were never filtered out.

โฑ๏ธ Performance

Test Case Total Time
TC1 34.96s
TC2 21.46s
TC3 29.74s
TC4 22.61s
Total 108.77s

๐Ÿ“Š Performance Matrix

Metric Result Commentary
JSON structural integrity 100% Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms) 25% TC1: placed Marburg and West Nile in differentials; TC3: placed Japanese encephalitis and St. Louis encephalitis in differentials and in excluded conditions
Internal consistency Fail TC4: hallucinated exclusion of primary amebic meningoencephalitis (exclusion_criteria) despite passed_filter: true
Clinical tone High Medical vocabulary

๐Ÿ’ฌ Qualitative Analysis

1. Negation & filter logic (TC1 & TC3)

In TC3, the model correctly identifies West Nile encephalitis as the most likely diagnosis. However, it fails the logical constraint test by placing Japanese encephalitis and St. Louis encephalitis into the differential diagnosis category, completely ignoring the passed_filter: false flag. These conditions should have been moved to excluded_conditions due to the presence of the blocking symptom spastic paralysis, which the patient explicitly denied. Furthermore, the model incorrectly justifies their inclusion by claiming they have lower scores, when in fact, they had higher raw scores but were excluded.

2. Hallucinated exclusion (TC4)

In TC4, all five diseases have passed_filter: true and no blocking symptoms exist. Despite this, the model placed primary amebic meningoencephalitis in excluded_conditions, falsely claiming the patient denied coma โ€” a symptom that appears only in the Missing List, not the Blocking Symptoms list. This is the same missing/blocking confusion as before.


๐Ÿงช Test 2: llama3:8b (local)

Overall assessment: llama3:8b shows improved structural compliance with the new output format. TC3 and TC4 are handled correctly. However, TC1 still exhibits a critical passed_filter mapping error where Marburg hemorrhagic fever is placed in differentials instead of excluded conditions, and poliomyelitis is incorrectly placed in excluded conditions.

โฑ๏ธ Performance

Test Case Total Time
TC1 22.45s
TC2 21.07s
TC3 20.36s
TC4 22.66s
Total 86.54s

๐Ÿ“Š Performance Matrix

Metric Result Commentary
JSON structural integrity 100% Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms) 50% TC1: Marburg in differentials, poliomyelitis incorrectly excluded
Internal consistency Partial TC1 has contradictions
Clinical tone High Medical vocabulary

๐Ÿ’ฌ Qualitative Analysis

1. Negation & filter logic (TC1, TC2 & TC3)

In TC1, the model placed Marburg hemorrhagic fever in differentials despite its passed_filter: false flag. At the same time, it incorrectly excluded poliomyelitis โ€” a disease with passed_filter: true โ€” citing flaccid paralysis as a blocking symptom. Flaccid paralysis is listed under Missing List, not Blocking Symptoms, making this an unjustified exclusion.

2. Correct handling (TC2, TC3, TC4)

TC2 correctly excluded hepatitis D with the right justification. TC3 correctly placed Japanese encephalitis and St. Louis encephalitis in excluded_conditions citing spastic paralysis. TC4 correctly identified Powassan encephalitis as most likely with all others as differentials and an empty excluded_conditions list.


๐Ÿงช Test 3: mistral-nemo:12b (local)

Overall assessment: Mistral Nemo 12b shows mixed results. It handles TC2, TC3, and TC4 correctly but fails in TC1 by incorrectly placing West Nile fever in differentials and excluding poliomyelitis using Missing List symptoms as justification.

โฑ๏ธ Performance

Test Case Total Time
TC1 28.15s
TC2 26.26s
TC3 23.26s
TC4 18.16s
Total 95.83s

๐Ÿ“Š Performance Matrix

| Metric | Result | Commentary | |โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”-|โ€”โ€”โ€”-|โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”-| | JSON structural integrity | 100% | Perfectly followed the schema and maintained all keys | | Exclusion logic (blocking symptoms) | 50% | TC1: West Nile fever in differentials, poliomyelitis incorrectly excluded; | | Internal consistency | Partial | TC1 internally inconsistent; | | Clinical tone | High | Medical vocabulary |

๐Ÿ’ฌ Qualitative Analysis

1. Negation & filter logic (TC1)

In TC1, West Nile fever (passed_filter: false) was placed in differentials instead of excluded_conditions. The model also incorrectly excluded poliomyelitis (passed_filter: true), justifying the exclusion with flaccid paralysis โ€” which appears only in the Missing List and does not qualify as a blocking symptom.

2. Correct handling (TC2, TC3, TC4)

TC2 correctly excluded hepatitis D citing drowsiness and confusion. TC3 correctly excluded both Japanese and St. Louis encephalitis citing spastic paralysis. TC4 correctly identified Powassan encephalitis as most likely and included all other diseases as differentials with an empty excluded list.

3. TC4 differential comparison quality

In TC4, the model correctly identified seizure as absent from poliomyelitis and nonparalytic poliomyelitis, indirectly supporting Powassan encephalitis as the tie-breaker โ€” though it did not explicitly name seizure as the deciding factor.


๐Ÿงช Test 4: phi4:14b (local)

Overall assessment: Phi4 14b demonstrates strong logical compliance across all test cases. All passed_filter mappings are correct. TC4 now includes all four differentials correctly and explicitly references seizure and stiff neck as the tie-breaking symptoms for Powassan encephalitis.

โฑ๏ธ Performance

Test Case Total Time
TC1 50.28s
TC2 45.15s
TC3 45.86s
TC4 53.57s
Total 194.86s

๐Ÿ“Š Performance Matrix

| Metric | Result | Commentary | |โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”-|โ€”โ€”โ€”-|โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€“| | JSON structural integrity | 100% | Perfectly followed the schema and maintained all keys | | Exclusion logic (blocking symptoms) | 100% | All passed_filter: false diseases correctly placed in excluded_conditions across all test cases | | Internal consistency | High | Textual reasoning directly supports the content of the JSON arrays | | Clinical tone | High | Medical vocabulary |

๐Ÿ’ฌ Qualitative Analysis

1. Precise handling of negation (TC1, TC2, TC3)

In TC1, Marburg hemorrhagic fever and West Nile fever are correctly placed in excluded_conditions, with explicit references to the denied maculopapular rash and chills. In TC2, hepatitis D is correctly excluded citing drowsiness and confusion. In TC3, both Japanese and St. Louis encephalitis are correctly excluded citing spastic paralysis.

2. TC4 tie-breaker identification

In TC4, the model correctly identifies seizure and stiff neck as the symptoms that distinguish Powassan encephalitis from nonparalytic poliomyelitis and poliomyelitis, fulfilling the success criteria for this test case.

3. Trade-off: accuracy vs speed

Phi4 14b is the slowest local model tested, averaging nearly 49 seconds per test case. This is a significant trade-off compared to smaller models, and should be considered when evaluating it for production use.


๐Ÿงช Test 5: qwen2.5:14b (local)

Overall assessment: Qwen 2.5 14b demonstrates strong logical compliance across all test cases. All passed_filter mappings are correct. TC4 confidence is set to moderate โ€” a more conservative and arguably more appropriate calibration than the high assigned by phi4, given that Powassan encephalitis covers only 50% of its known symptoms.

โฑ๏ธ Performance

Test Case Total Time
TC1 49.49s
TC2 56.38s
TC3 52.73s
TC4 58.12s
Total 216.72s

๐Ÿ“Š Performance Matrix

Metric Result Commentary
JSON structural integrity 100% Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms) 100% All passed_filter: false diseases correctly placed in excluded_conditions across all test cases
Internal consistency High Textual reasoning directly supports the content of the JSON arrays
Clinical tone High Medical vocabulary

๐Ÿ’ฌ Qualitative Analysis

1. Precise handling of negation (TC1, TC2, TC3)

In TC1, Marburg hemorrhagic fever and West Nile fever are correctly excluded citing denied maculopapular rash and chills. In TC2, hepatitis D is correctly excluded citing drowsiness and confusion. In TC3, Japanese and St. Louis encephalitis are correctly excluded citing spastic paralysis.

2. TC4 differential comparison quality

In TC4, the model explicitly names seizure as the differentiating symptom across all differential comparisons โ€” noting its absence from nonparalytic poliomyelitis, poliomyelitis, La Crosse encephalitis, and primary amebic meningoencephalitis. This is the most thorough fulfillment of the TC4 success criteria among all tested models.

3. Confidence calibration

The moderate confidence assigned in TC4 reflects a correct reading of the data โ€” Disease Coverage is 50% and Input Coverage is 100%, which places the case in the moderate band per the prompt definition. This is more precise than phi4โ€™s high assignment for the same scenario.


๐Ÿงช Test 6: meta-llama/llama-4-scout-17b-16e-instruct (cloud)

Overall assessment: The cloud-hosted Llama 4 Scout demonstrates strong logical compliance with the passed_filter flag and delivers clinically coherent reasoning. Its most notable characteristic is its inference speed โ€” completing each test case in approximately 1.3โ€“1.4 seconds, significantly faster than any local model tested.

โฑ๏ธ Performance

Test Case Total Time
TC1 1.44s
TC2 1.32s
TC3 1.21s
TC4 1.41s
Total 5.38s

๐Ÿ“Š Performance Matrix

Metric Result Commentary
JSON structural integrity 100% Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms) 100% All passed_filter: false diseases correctly placed in excluded_conditions across all test cases
Internal consistency High Textual reasoning directly supports the content of the JSON arrays
Clinical tone High Medical vocabulary

๐Ÿ’ฌ Qualitative Analysis

1. Negation & filter logic (TC1, TC2 & TC3)

In TC1, correctly places Marburg hemorrhagic fever and West Nile fever in excluded_conditions, explicitly citing the absence of maculopapular rash and chills as the deciding factors.

In TC2, Hepatitis D is correctly excluded with a precise explanation referencing the absence of drowsiness and confusion as mandatory markers.

In TC3, Japanese encephalitis and St. Louis encephalitis are correctly excluded, with the model clearly attributing the exclusion to the denied spastic paralysis.


๐Ÿงช Test 7: openai/gpt-oss-120b (cloud)

Overall assessment: The largest model tested demonstrates strong clinical reasoning and correct filter logic compliance. However, it exhibits the most pronounced High-Intelligence Bias โ€” it consistently generates more nuanced and detailed reasoning than other models, sometimes at the cost of strict adherence to the simplified expected output format.

โฑ๏ธ Performance

Test Case Total Time
TC1 2.95s
TC2 2.24s
TC3 11.88s
TC4 33.34s
Total 50.41s

๐Ÿ“Š Performance Matrix

Metric Result Commentary
JSON structural integrity 100% Perfectly followed the schema and maintained all keys
Exclusion logic (blocking symptoms) 100% All passed_filter: false diseases correctly placed in excluded_conditions across all test cases
Internal consistency High Textual reasoning directly supports the content of the JSON arrays
Clinical tone High Most detailed medical vocabulary

๐Ÿ’ฌ Qualitative Analysis

1. Negation & filter logic (TC1, TC2 & TC3)

In TC1, correctly excludes Marburg hemorrhagic fever and West Nile fever, providing the most detailed blocking symptom explanation of all tested models.

In TC2, correctly excludes Hepatitis D with precise clinical justification referencing drowsiness and confusion.

In TC3, correctly excludes Japanese encephalitis and St. Louis encephalitis, citing spastic paralysis as the decisive blocking factor.


๐Ÿ“Š Summary: Comparison of Model Performance

Model Size Type JSON Integrity Exclusion Logic Internal Consistency Clinical Tone Total Time
llama3.2:3b 3B Local โœ… 100% โŒ 25% โŒ Fail โœ… High 108.77s
llama3:8b 8B Local โœ… 100% โš ๏ธ 50% โš ๏ธ Partial โœ… High 86.54s
mistral-nemo:12b 12B Local โœ… 100% โš ๏ธ 50% โš ๏ธ Partial โœ… High 95.83s
phi4:14b 14B Local โœ… 100% โœ… 100% โœ… High โœ… High 194.86s
qwen2.5:14b 14B Local โœ… 100% โœ… 100% โœ… High โœ… High 216.72s
meta-llama/llama-4-scout-17b-16e-instruct 17b-16e (MoE) Cloud โœ… 100% โœ… 100% โœ… High โœ… High 5.38s
openai/gpt-oss-120b 120B Cloud โœ… 100% โœ… 100% โœ… High โœ… High 50.41s

๐Ÿ Conclusion

The updated prompt with structured differential_comparison and exclusion_criteria formats produced measurable improvements across all models. The 14B threshold remains the critical boundary for full passed_filter compliance โ€” both phi4 and qwen2.5 correctly handled all four test cases, while models below 12B consistently failed on at least two.

A notable new finding is the introduction of mistral-nemo:12b, which performs on par with llama3:8b despite having 50% more parameters. Both achieve 50% exclusion logic accuracy, suggesting that parameter count alone is not the primary factor โ€” instruction-following capability and training data quality play an equally important role.

Cloud models offer significantly faster inference โ€” llama-4-scout completed all four test cases in just 5 seconds compared to 70โ€“90 seconds for local 14B models. However, gpt-oss-120b showed variable latency (2.5s to 21s per case), suggesting that response depth impacts inference time more than model size alone.

For production use, qwen2.5:14b represents the optimal local model โ€” combining full logical compliance with reasonable inference speed. For cloud deployments where latency is critical, llama-4-scout-17b is the preferred choice.