🧪 Testing the Natural Language Processing Layer (NLP)

📋 Introduction

This report documents the testing and evaluation of the Natural Language Processing (NLP) layer, whose primary role is the automatic extraction of medical symptoms from unstructured text in English.

The core challenge of this layer is accurately identifying symptoms in complex sentences, distinguishing confirmed from negated conditions, and normalizing extracted terms to canonical English equivalents compatible with medical ontologies (SYMP/DOID).

The goal of testing is to identify the optimal combination of LLM architecture and system prompt configuration in order to achieve high precision and reliability before data is passed to the next layer — the embedding (vectorization) layer.

Testing was conducted across 7 phases, varying model size (from 3B to 120B parameters) and instruction complexity, with a focus on reaching an 80% success threshold across key metrics. The following models were evaluated:

🤖 llama3.2:3b (local)
🤖 llama3:8b (local)
🤖 mistral-nemo:12b (local)
🤖 qwen2.5:14b (local)
🤖 phi4:14b (local)
☁️ meta-llama/llama-4-scout-17b-16e-instruct (cloud)
☁️ openai/gpt-oss-120b (cloud)

📐 Evaluation Methodology

Standard NLP metrics were used to assess extraction quality:

Precision: the model’s ability to extract only relevant and correct symptoms without introducing noise
Recall: the model’s ability to identify all symptoms mentioned in the text without omissions
F1 Score: the harmonic mean of precision and recall, a single figure that balances the two

🎯 Target Metrics

Metric	Success Threshold	Formula
Precision	✅ ≥ 80%	TP / (TP + FP) — or 0 if no symptoms extracted
Recall	✅ ≥ 80%	TP / (TP + FN) — or 0 if no symptoms expected
F1 Score	✅ ≥ 80%	2 × (Precision × Recall) / (Precision + Recall) — or 0 if both are 0
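The formulas above, including their zero fallbacks, can be sketched for a single test case as follows. This is a minimal illustration, not the evaluation script itself: the function name `score_case`, the use of sets, and exact string matching between labels are all assumptions.

```python
def score_case(expected: set[str], predicted: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for one test case.

    Assumes labels are compared by exact string match (hypothetical choice).
    """
    tp = len(expected & predicted)  # expected and extracted
    fp = len(predicted - expected)  # extracted but not expected
    fn = len(expected - predicted)  # expected but missed

    precision = tp / (tp + fp) if predicted else 0.0  # 0 if no symptoms extracted
    recall = tp / (tp + fn) if expected else 0.0      # 0 if no symptoms expected
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # 0 if both are 0
    return precision, recall, f1


# Illustrative labels only; two of three expected symptoms are found.
p, r, f1 = score_case({"fever", "headache", "cough"}, {"fever", "headache", "nausea"})
print(f"{p:.3f} {r:.3f} {f1:.3f}")  # prints "0.667 0.667 0.667"
```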

⚙️ Test Environment

Property	Value
OS	Windows-11-10.0.26200-SP0
CPU	AMD64 Family 23 Model 113 Stepping 0, AuthenticAMD
RAM	17.1 GB
GPU	NVIDIA GeForce RTX 3060
VRAM	12.0 GB
Python	3.12.9
Test Cases	100

🧪 Test 1

Field	Value
Test Set	symptom-extraction-test-data-en.json
Model	llama3.2:3b (local)
System Prompt	symptom-extraction-prompt-1.txt
Results	symptom-extraction-result-1.txt

🔍 Model Overview — `llama3.2:3b`

Property	Value
Developer	Meta
Released	September 2024
Architecture	Dense · Decoder-only Transformer
Parameters	3.21B
Attention	GQA + RoPE
Context Window	128K tokens
Quantization	Q4_K_M (Ollama)
Intended Use	On-device / edge deployment
License	Llama 3.2 Community License

Lightweight text-only model optimized for mobile and edge environments. Supports 8 languages. Smallest model in the Llama 3.2 family.

📊 Results

Metric	Score
Precision	0.747
Recall	0.730
F1 Score	0.731

⏱️ Performance

Metric	Value
Total Time	39.14s
Avg/Case	0.39s

💬 Commentary

The llama3.2:3b model delivers a surprisingly robust extraction baseline for its size, with balanced precision and recall (0.747 and 0.730). Its usefulness in strict ontology-based systems is limited by two weaknesses: it tends to generate pre-coordinated, hyper-specific clinical phrases (e.g., chest pain on inhalation), and it occasionally but significantly fails to resolve negations in complex clinical narratives (cases 10 and 41).

🧪 Test 2

Field	Value
Test Set	symptom-extraction-test-data-en.json
Model	llama3:8b (local)
System Prompt	symptom-extraction-prompt-1.txt
Results	symptom-extraction-result-2.txt

🔍 Model Overview — `llama3:8b`

Property	Value
Developer	Meta
Released	April 2024
Architecture	Dense · Decoder-only Transformer
Parameters	8.03B
Attention	GQA + RoPE
Layers	32
Hidden Dim	4096
Context Window	8K tokens
Quantization	Q4_0 (Ollama)
License	Meta Llama 3 Community License

First generation of the Llama 3 family. Trained on 15T tokens. Strong reasoning and instruction following for its size class.

📊 Results

Metric	Score
Precision	0.809
Recall	0.806
F1 Score	0.800

⏱️ Performance

Metric	Value
Total Time	57.39s
Avg/Case	0.57s

💬 Commentary

The llama3:8b model reached the 80% target with an F1 score of 0.800 across the 100 test cases. It excels at identifying standard symptoms such as fever and headache but struggles with semantic mapping: it often extracts a clinically accurate term (e.g., paresthesia) that does not match the expected “gold standard” label (e.g., hypoesthesia), or fails to split compound descriptions into separate entities. Despite these mapping issues, it made zero negation errors, which indicates high reliability in distinguishing present from absent symptoms.

🧪 Test 3

Field	Value
Test Set	symptom-extraction-test-data-en.json
Model	mistral-nemo:12b (local)
System Prompt	symptom-extraction-prompt-1.txt
Results	symptom-extraction-result-3.txt

🔍 Model Overview — `mistral-nemo:12b`

Property	Value
Developer	Mistral AI + NVIDIA (joint)
Released	July 2024
Architecture	Dense · Decoder-only Transformer
Parameters	12.2B
Layers	40
Attention	GQA (32 heads / 8 KV heads)
Context Window	128K tokens
Tokenizer	Tekken (Tiktoken-based, 131K vocab)
Quantization	FP8-aware training
Training Infra	NVIDIA Megatron-LM, 3072× H100 GPUs
License	Apache 2.0

Co-developed with NVIDIA on DGX Cloud. Drop-in replacement for Mistral 7B with significantly expanded context and multilingual capability across 11+ languages.

📊 Results

Metric	Score
Precision	0.784
Recall	0.810
F1 Score	0.790

⏱️ Performance

Metric	Value
Total Time	78.96s
Avg/Case	0.79s

💬 Commentary

The mistral-nemo:12b model achieved an F1 score of 0.790, nearly identical to llama3:8b (0.800), with slightly better recall (0.810 vs. 0.806). It frequently loses points due to overly technical synonym mapping (e.g., presyncope for lightheadedness) and a tendency to combine separate symptoms into single compound entities.

🧪 Test 4

Field	Value
Test Set	symptom-extraction-test-data-en.json
Model	qwen2.5:14b (local)
System Prompt	symptom-extraction-prompt-1.txt
Results	symptom-extraction-result-4.txt

🔍 Model Overview — `qwen2.5:14b`

Property	Value
Developer	Alibaba Cloud (Qwen Team)
Released	September 2024
Architecture	Dense · Decoder-only Transformer
Parameters	14.7B total / 13.1B non-embedding
Layers	48
Attention	GQA (40 Q heads / 8 KV heads) + RoPE
Activation	SwiGLU
Normalization	RMSNorm
Context Window	128K tokens (generates up to 8K)
Pretraining Data	~18 trillion tokens
Multilingual	29+ languages
License	Apache 2.0

Dense open-weight model from the Qwen2.5 family. Pretrained on the largest dataset among local models tested (~18T tokens). Strong structured output and instruction-following capabilities.

📊 Results

Metric	Score
Precision	0.835
Recall	0.825
F1 Score	0.825

⏱️ Performance

Metric	Value
Total Time	95.29s
Avg/Case	0.95s

💬 Commentary

The qwen2.5:14b model is the top performer with an F1 score of 0.825. It shows a superior ability to map descriptive symptoms to formal clinical terminology and handles negation without error, though it still occasionally loses points by grouping separate symptoms into single compound terms.

🧪 Test 5

Field	Value
Test Set	symptom-extraction-test-data-en.json
Model	phi4:14b (local)
System Prompt	symptom-extraction-prompt-1.txt
Results	symptom-extraction-result-5.txt

🔍 Model Overview — `phi4:14b`

Property	Value
Developer	Microsoft Research
Released	December 2024
Architecture	Dense · Decoder-only Transformer
Parameters	14B
Layers	40
Attention	GQA (40 heads / 10 KV heads) + RoPE
Context Window	16K tokens (extendable to 64K)
Tokenizer	tiktoken (vocab size 100,352)
Pretraining Data	~9.8T tokens (incl. ~400B synthetic)
Knowledge Cutoff	June 2024
License	MIT

STEM-focused SLM (Small Language Model) from Microsoft. Distinguished by heavy use of synthetic training data for mathematical and scientific reasoning. Punches above its weight class on GPQA and MATH benchmarks.

📊 Results

Metric	Score
Precision	0.771
Recall	0.783
F1 Score	0.772

⏱️ Performance

Metric	Value
Total Time	115.82s
Avg/Case	1.16s

💬 Commentary

The phi4:14b model shows strong clinical knowledge but reaches an F1 score of only 0.772. It frequently loses points by using advanced medical terminology (e.g., presyncope instead of faint) that does not match the dataset’s expected labels. It handles negations and complex symptoms well; its primary weakness is over-specification, as it often provides more detailed anatomical descriptions than the ground truth requires.

🧪 Test 6

Field	Value
Test Set	symptom-extraction-test-data-en.json
Model	meta-llama/llama-4-scout-17b-16e-instruct (cloud)
System Prompt	symptom-extraction-prompt-1.txt
Results	symptom-extraction-result-6.txt

🔍 Model Overview — `meta-llama/llama-4-scout-17b-16e-instruct`

Property	Value
Developer	Meta
Released	April 2025
Architecture	Mixture-of-Experts (MoE) · Auto-regressive
Active Parameters	17B (Top-2 routing per token)
Total Parameters	109B (distributed across experts)
MoE Experts	16 specialized + 1 shared (always active)
Transformer Layers	40 total (20 MoE blocks)
Activation	SwiGLU
Multimodality	✅ Early fusion (up to 5 images per prompt)
Context Window	128K tokens (cloud)
Knowledge Cutoff	August 2024
Multilingual	12 languages
License	Llama 4 Community License

First MoE model in the Llama family. Each token activates only 2 of 16 experts, giving the inference cost of a 17B dense model with the knowledge capacity of 109B parameters. Natively multimodal via early fusion architecture.

📊 Results

Metric	Score
Precision	0.749
Recall	0.806
F1 Score	0.769

⏱️ Performance

Metric	Value
Total Time	206.58s
Avg/Case	2.07s

💬 Commentary

The llama-4-scout-17b model provides a solid baseline for symptom extraction with an average F1 score of 0.769, but it frequently incurs false positives because its vocabulary is more clinically specific than the expected labels. Much like phi4:14b, it tends to extract more detailed or technical terms (e.g., presyncope or hyperhidrosis) when the ground truth expects simpler descriptions (e.g., lightheadedness or diaphoresis). Its negation detection is nearly perfect, so the overall score is capped primarily by this semantic gap between the model’s medical vocabulary and the dataset’s specific labels.

🧪 Test 7

Field	Value
Test Set	symptom-extraction-test-data-en.json
Model	openai/gpt-oss-120b (cloud)
System Prompt	symptom-extraction-prompt-1.txt
Results	symptom-extraction-result-7.txt

🔍 Model Overview — `openai/gpt-oss-120b`

Property	Value
Developer	OpenAI
Architecture	Mixture-of-Experts (MoE) · Decoder-only Transformer
Parameters	~117B total / ~5.1B active per token
Deployment	Cloud (API)

Large-scale MoE model with the highest total parameter count among all tested models. Exhibits the most pronounced High-Intelligence Bias: it consistently extracts more clinically nuanced and anatomically precise terms than the ground truth labels require.

📊 Results

Metric	Score
Precision	0.695
Recall	0.697
F1 Score	0.689

⏱️ Performance

Metric	Value
Total Time	362.56s
Avg/Case	3.63s

💬 Commentary

The gpt-oss-120b model achieves an average F1 score of 0.689, making it the most descriptive but least label-compliant model among those tested. Its primary failure mode is over-descriptive labeling: it frequently adds severity or anatomical qualifiers (e.g., severe abdominal pain or arm rash) that result in total mismatches with the simpler ground truth labels (e.g., abdominal pain or rash). Its extraction logic is clinically accurate and highly sensitive to nuances such as productive cough or high fever, but because its output is not constrained to the target ontology, its precision and recall are significantly lower than those of the Llama and Phi models.

📊 Overall Results

Model	Size	Type	Precision	Recall	F1	Total time (s)	Avg/case (s)
llama3.2:3b	3B	Local	0.747	0.730	0.731	39.14	0.39
llama3:8b	8B	Local	0.809	0.806	0.800	57.39	0.57
mistral-nemo:12b	12B	Local	0.784	0.810	0.790	78.96	0.79
qwen2.5:14b	14B	Local	0.835	0.825	0.825	95.29	0.95
phi4:14b	14B	Local	0.771	0.783	0.772	115.82	1.16
meta-llama/llama-4-scout-17b-16e-instruct	17B	Cloud	0.749	0.806	0.769	206.58	2.07
openai/gpt-oss-120b	120B	Cloud	0.695	0.697	0.689	362.56	3.63

✅ qwen2.5:14b achieved the highest F1 score across all tested models, surpassing the 80% target threshold and outperforming models up to 8x larger.
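In two rows of the table (llama3:8b and openai/gpt-oss-120b) the reported F1 is lower than both precision and recall, which a single harmonic mean of those two numbers cannot produce. One explanation consistent with the commentary's "average F1-score" wording is that each metric is computed per test case and then averaged over the 100 cases. The sketch below illustrates that effect under this assumption; the per-case counts are hypothetical and the helper `prf` is not taken from the evaluation script.

```python
from statistics import mean


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for one case, returning 0 where a ratio is undefined."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Hypothetical (tp, fp, fn) counts for two cases; not taken from the test set.
cases = [(1, 0, 1), (1, 1, 0)]
scores = [prf(*c) for c in cases]

# Average each metric across cases (macro-averaging).
macro_p = mean(s[0] for s in scores)
macro_r = mean(s[1] for s in scores)
macro_f1 = mean(s[2] for s in scores)

print(f"P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")  # P=0.750 R=0.750 F1=0.667
```

Here the averaged F1 (0.667) sits below both averaged precision and averaged recall (0.750), because one case is precise but incomplete and the other is complete but noisy.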

Conclusion

While the NLP layer achieved satisfactory performance (F1 = 0.825 for the best model, qwen2.5:14b), several challenges remain that directly impact the embedding layer.

⚠️ Semantic Fragmentation and Normalization

The most critical issue identified is the lack of strict canonical normalization, which leads to semantic fragmentation. Different lexical representations of the same clinical concept (e.g., lightheadedness, dizziness, presyncope) result in distinct vector embeddings. This divergence reduces similarity accuracy and negatively affects downstream disease matching in the graph database.
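One way to reduce this fragmentation is to collapse known variants to a single canonical label before the terms reach the embedding layer. The sketch below is only an illustration of that idea: the `CANONICAL` table, its entries, and the choice of "lightheadedness" as the canonical form are hypothetical, and a real mapping would more plausibly be derived from the synonym lists of the target ontology (SYMP/DOID).

```python
# Hypothetical synonym table for illustration only; not part of the tested pipeline.
CANONICAL = {
    "dizziness": "lightheadedness",
    "presyncope": "lightheadedness",
}


def normalize(term: str) -> str:
    """Lowercase, collapse whitespace, and map known variants to one canonical label."""
    key = " ".join(term.lower().split())
    return CANONICAL.get(key, key)


def normalize_all(terms: list[str]) -> list[str]:
    """Normalize every term and drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(normalize(t) for t in terms))


print(normalize_all(["Presyncope", "dizziness", "fever"]))  # ['lightheadedness', 'fever']
```

With this step in place, the three surface forms would yield one embedding instead of three; terms missing from the table pass through unchanged.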

🧠 Interpretation of Model Performance

It is important to emphasize that a lower F1 score does not necessarily indicate a “weaker” model. Analysis of the extraction logs reveals a distinct “High-Intelligence Bias” in larger architectures:

Descriptive Precision: Models like gpt-oss-120b often extract more technical or clinically accurate terms (e.g., productive cough instead of just cough).
Metric Penalty: These models are frequently penalized by the evaluation script for not strictly adhering to the simplified, flat labels of the ground truth, despite their output being medically valid.
Instruction Following: Mid-sized models like qwen2.5:14b demonstrate superior balance between clinical extraction and adherence to formatting constraints, leading to higher benchmark scores.
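The metric penalty described above is easy to see with exact label matching: a more specific term counts as a miss for the expected label and as noise at the same time. The sketch below assumes exact string comparison, which the report implies but does not state; `confusion_counts` is a hypothetical helper, and "fever" as the expected counterpart of "high fever" is an assumed label.

```python
def confusion_counts(expected: set[str], predicted: set[str]) -> tuple[int, int, int]:
    """Return (tp, fp, fn) under exact string matching (an assumption)."""
    tp = len(expected & predicted)
    fp = len(predicted - expected)
    fn = len(expected - predicted)
    return tp, fp, fn


# Medically valid but more specific output scores zero true positives.
expected = {"cough", "fever"}
predicted = {"productive cough", "high fever"}
print(confusion_counts(expected, predicted))  # (0, 2, 2)
```

Each near-miss is penalized twice, once as a false positive and once as a false negative, which lowers precision and recall together.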
