🧪 Testing the Natural Language Processing Layer (NLP)
📋 Introduction
This report documents the testing and evaluation of the Natural Language Processing (NLP) layer, whose primary role is the automatic extraction of medical symptoms from unstructured text in English.
The core challenge of this layer is accurately identifying symptoms in complex sentences, distinguishing confirmed from negated conditions, and normalizing extracted terms to canonical English equivalents compatible with medical ontologies (SYMP/DOID).
The goal of testing is to identify the optimal combination of LLM architecture and system prompt configuration in order to achieve high precision and reliability before data is passed to the next layer — the embedding (vectorization) layer.
Testing was conducted across 7 phases, varying model size (from 3B to 120B parameters) and instruction complexity, with a focus on reaching an 80% success threshold across key metrics. The following models were evaluated:
- 🤖 llama3.2:3b (local)
- 🤖 llama3:8b (local)
- 🤖 mistral-nemo:12b (local)
- 🤖 qwen2.5:14b (local)
- 🤖 phi4:14b (local)
- ☁️ meta-llama/llama-4-scout-17b-16e-instruct (cloud)
- ☁️ openai/gpt-oss-120b (cloud)
📐 Evaluation Methodology
Standard NLP metrics were used to assess extraction quality:
- Precision — the model’s ability to extract only relevant and correct symptoms without introducing noise
- Recall — the model’s ability to identify all symptoms mentioned in the text without omissions
- F1 Score — the harmonic mean of precision and recall, representing the overall accuracy of the model
🎯 Target Metrics
| Metric | Success Threshold | Formula |
|---|---|---|
| Precision | ✅ ≥ 80% | TP / (TP + FP) — or 0 if no symptoms extracted |
| Recall | ✅ ≥ 80% | TP / (TP + FN) — or 0 if no symptoms expected |
| F1 Score | ✅ ≥ 80% | 2 × (Precision × Recall) / (Precision + Recall) — or 0 if both are 0 |
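The table's formulas, including the zero-denominator fallbacks, can be sketched as a small helper (a minimal illustration, not the actual evaluation script):

```python
def prf1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 with the zero-denominator conventions above."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # 0 if no symptoms extracted
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # 0 if no symptoms expected
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)            # 0 if both are 0
    return precision, recall, f1
```

For example, `prf1(8, 2, 2)` yields precision and recall of 0.8, which sits exactly at the success threshold.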
🧪 Test 1
| Field | Value |
|---|---|
| Test Set | symptom-extraction-test-data-en.json |
| Model | llama3.2:3b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-1.txt |
🔍 Model Overview — llama3.2:3b
| Property | Value |
|---|---|
| Developer | Meta |
| Released | September 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 3.21B |
| Attention | GQA + RoPE |
| Context Window | 128K tokens |
| Quantization | Q4_K_M (Ollama) |
| Intended Use | On-device / edge deployment |
| License | Llama 3.2 Community License |
Lightweight text-only model optimized for mobile and edge environments. Supports 8 languages. Smallest model in the Llama 3.2 family.
📊 Results
| Metric | Score |
|---|---|
| Precision | 0.747 |
| Recall | 0.730 |
| F1 Score | 0.731 |
💬 Commentary
The llama3.2:3b model offers surprisingly robust baseline extraction for its size, characterized by a balanced precision-recall profile. However, its utility in strict ontology-based systems is limited by its tendency to generate pre-coordinated, hyper-specific clinical phrases (e.g., chest pain on inhalation) and its occasional yet significant failures in resolving logical negations (cases 10 and 41) within complex clinical narratives.
🧪 Test 2
| Field | Value |
|---|---|
| Test Set | symptom-extraction-test-data-en.json |
| Model | llama3:8b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-2.txt |
🔍 Model Overview — llama3:8b
| Property | Value |
|---|---|
| Developer | Meta |
| Released | April 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 8.03B |
| Attention | GQA + RoPE |
| Layers | 32 |
| Hidden Dim | 4096 |
| Context Window | 8K tokens |
| Quantization | Q4_0 (Ollama) |
| License | Meta Llama 3 Community License |
First generation of the Llama 3 family. Trained on 15T tokens. Strong reasoning and instruction following for its size class.
📊 Results
| Metric | Score |
|---|---|
| Precision | 0.809 |
| Recall | 0.806 |
| F1 Score | 0.800 |
💬 Commentary
The llama3:8B model achieved a solid 80% F1-score across the 100 test cases. While it excels at identifying standard symptoms like fever and headache, it struggles with semantic mapping—often extracting a clinically accurate term (e.g., paresthesia) that doesn’t perfectly match the expected “gold standard” label (e.g., hypoesthesia), or failing to split compound descriptions into separate entities. Despite these mapping nuances, the zero negation errors indicate high reliability in distinguishing between present and absent symptoms.
🧪 Test 3
| Field | Value |
|---|---|
| Test Set | symptom-extraction-test-data-en.json |
| Model | mistral-nemo:12b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-3.txt |
🔍 Model Overview — mistral-nemo:12b
| Property | Value |
|---|---|
| Developer | Mistral AI + NVIDIA (joint) |
| Released | July 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 12.2B |
| Layers | 40 |
| Attention | GQA (32 heads / 8 KV heads) |
| Context Window | 128K tokens |
| Tokenizer | Tekken (Tiktoken-based, 131K vocab) |
| Quantization | FP8-aware training |
| Training Infra | NVIDIA Megatron-LM, 3072× H100 GPUs |
| License | Apache 2.0 |
Co-developed with NVIDIA on DGX Cloud. Drop-in replacement for Mistral 7B with significantly expanded context and multilingual capability across 11+ languages.
📊 Results
| Metric | Score |
|---|---|
| Precision | 0.784 |
| Recall | 0.810 |
| F1 Score | 0.790 |
💬 Commentary
The mistral-nemo:12b model achieved a nearly identical 79% F1-score, showing slightly better recall than llama3:8b but frequently losing points due to overly technical synonym mapping (e.g., presyncope for lightheadedness) and a tendency to combine separate symptoms into single compound entities.
🧪 Test 4
| Field | Value |
|---|---|
| Test Set | symptom-extraction-test-data-en.json |
| Model | qwen2.5:14b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-4.txt |
🔍 Model Overview — qwen2.5:14b
| Property | Value |
|---|---|
| Developer | Alibaba Cloud (Qwen Team) |
| Released | September 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 14.7B total / 13.1B non-embedding |
| Layers | 48 |
| Attention | GQA (40 Q heads / 8 KV heads) + RoPE |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Context Window | 128K tokens (generates up to 8K) |
| Pretraining Data | ~18 trillion tokens |
| Multilingual | 29+ languages |
| License | Qwen Research License |
Dense open-weight model from the Qwen2.5 family. Pretrained on the largest dataset among local models tested (~18T tokens). Strong structured output and instruction-following capabilities.
📊 Results
| Metric | Score |
|---|---|
| Precision | 0.835 |
| Recall | 0.825 |
| F1 Score | 0.825 |
💬 Commentary
The qwen2.5:14b model is the current top performer with a 0.825 F1 score, showing a superior ability to map descriptive symptoms into formal clinical terminology while maintaining perfect accuracy in negation handling. It still occasionally penalizes itself, however, by grouping separate symptoms into single compound terms.
🧪 Test 5
| Field | Value |
|---|---|
| Test Set | symptom-extraction-test-data-en.json |
| Model | phi4:14b (local) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-5.txt |
🔍 Model Overview — phi4:14b
| Property | Value |
|---|---|
| Developer | Microsoft Research |
| Released | December 2024 |
| Architecture | Dense · Decoder-only Transformer |
| Parameters | 14B |
| Layers | 40 |
| Attention | GQA (24 heads / 8 KV heads) + RoPE |
| Context Window | 16K tokens (extendable to 64K) |
| Tokenizer | tiktoken (vocab size 100,352) |
| Pretraining Data | ~9.8T tokens (incl. ~400B synthetic) |
| Knowledge Cutoff | June 2024 |
| License | MIT |
STEM-focused SLM (Small Language Model) from Microsoft. Distinguished by heavy use of synthetic training data for mathematical and scientific reasoning. Punches above its weight class on GPQA and MATH benchmarks.
📊 Results
| Metric | Score |
|---|---|
| Precision | 0.771 |
| Recall | 0.783 |
| F1 Score | 0.772 |
💬 Commentary
The phi4:14b model demonstrates high clinical intelligence with an F1 score of 0.772, though it frequently loses points by using advanced medical terminology (e.g., presyncope instead of faint) that causes mismatches with the dataset's expected labels. While it excels at handling negations and complex symptoms, its primary challenge is over-specification, as it often provides more detailed anatomical descriptions than the ground truth requires.
🧪 Test 6
| Field | Value |
|---|---|
| Test Set | symptom-extraction-test-data-en.json |
| Model | meta-llama/llama-4-scout-17b-16e-instruct (cloud) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-6.txt |
🔍 Model Overview — meta-llama/llama-4-scout-17b-16e-instruct
| Property | Value |
|---|---|
| Developer | Meta |
| Released | April 2025 |
| Architecture | Mixture-of-Experts (MoE) · Auto-regressive |
| Active Parameters | 17B (Top-2 routing per token) |
| Total Parameters | 109B (distributed across experts) |
| MoE Experts | 16 specialized + 1 shared (always active) |
| Transformer Layers | 40 total (20 MoE blocks) |
| Activation | SwiGLU |
| Multimodality | ✅ Early fusion (up to 5 images per prompt) |
| Context Window | 128K tokens (cloud) |
| Knowledge Cutoff | August 2024 |
| Multilingual | 12 languages |
| License | Llama 4 Community License |
First MoE model in the Llama family. Each token activates only 2 of 16 experts, giving the inference cost of a 17B dense model with the knowledge capacity of 109B parameters. Natively multimodal via early fusion architecture.
📊 Results
| Metric | Score |
|---|---|
| Precision | 0.746 |
| Recall | 0.799 |
| F1 Score | 0.763 |
💬 Commentary
The llama-4-scout-17b model shows a solid baseline for symptom extraction with an average F1 score of 0.763, but it frequently incurs false positives precisely because of its high clinical specificity. Much like the previous model, it tends to extract more detailed or technical terms (e.g., presyncope or hyperhidrosis) when the ground truth expects simpler descriptions (e.g., lightheadedness or diaphoresis). While its negation detection is nearly perfect, the overall score is primarily capped by this semantic gap between the model’s medical vocabulary and the dataset’s specific labels.
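This kind of penalty follows directly from exact-match scoring, under which one clinically valid synonym costs both a false positive and a false negative. A minimal sketch with hypothetical labels, assuming set-based exact matching (the actual evaluation script may be more elaborate):

```python
def score_case(predicted: list[str], expected: list[str]) -> tuple[int, int, int]:
    """Exact-match set scoring for one test case: returns (TP, FP, FN)."""
    pred, gold = set(predicted), set(expected)
    return len(pred & gold), len(pred - gold), len(gold - pred)

# A valid synonym ("presyncope" for "lightheadedness") is double-penalized:
# it counts as one false positive AND one false negative.
tp, fp, fn = score_case(["presyncope", "nausea"], ["lightheadedness", "nausea"])
```

Here `tp, fp, fn` comes out as `(1, 1, 1)`: only `nausea` matches, even though both outputs describe the same two symptoms.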
🧪 Test 7
| Field | Value |
|---|---|
| Test Set | symptom-extraction-test-data-en.json |
| Model | openai/gpt-oss-120b (cloud) |
| System Prompt | symptom-extraction-prompt-1.txt |
| Results | symptom-extraction-result-7.txt |
🔍 Model Overview — openai/gpt-oss-120b
| Property | Value |
|---|---|
| Developer | OpenAI |
| Released | August 2025 |
| Architecture | Mixture-of-Experts (MoE) · Decoder-only Transformer |
| Parameters | ~117B total / ~5.1B active |
| Deployment | Cloud (API) |
| License | Apache 2.0 |
Large-scale open-weight MoE model. Highest total parameter count among all tested models. Exhibits the most pronounced High-Intelligence Bias — consistently extracts more clinically nuanced and anatomically precise terms than the ground truth labels require.
📊 Results
| Metric | Score |
|---|---|
| Precision | 0.697 |
| Recall | 0.701 |
| F1 Score | 0.691 |
💬 Commentary
The gpt-oss-120b model achieves an average F1 score of 0.691, making it the most descriptive but least “label-compliant” model among those tested. Its primary failure mode is over-descriptive labeling: it frequently includes qualifiers and anatomical details (e.g., severe abdominal pain or arm rash) that result in total mismatches with the simpler ground truth labels (e.g., abdominal pain or rash). While its extraction logic is clinically accurate and highly sensitive to nuances like productive cough or high fever, its lack of constraint to the target ontology leads to significantly lower precision and recall than the llama or phi models.
📊 Overall Results
| Model | Size | Type | Precision | Recall | F1 |
|---|---|---|---|---|---|
| llama3.2:3b | 3B | Local | 0.747 | 0.730 | 0.731 |
| llama3:8b | 8B | Local | 0.809 | 0.806 | 0.800 |
| mistral-nemo:12b | 12B | Local | 0.784 | 0.810 | 0.790 |
| qwen2.5:14b | 14B | Local | 0.835 | 0.825 | 0.825 |
| phi4:14b | 14B | Local | 0.771 | 0.783 | 0.772 |
| meta-llama/llama-4-scout-17b-16e-instruct | 17B | Cloud | 0.746 | 0.799 | 0.763 |
| openai/gpt-oss-120b | 120B | Cloud | 0.697 | 0.701 | 0.691 |
✅ qwen2.5:14b achieved the highest F1 score across all tested models, surpassing the 80% target threshold and outperforming models up to 8x larger.
🏁 Conclusion
While the NLP layer achieved satisfactory performance (F1 ≥ 0.82 for the best model), several challenges remain that directly impact the embedding layer.
⚠️ Semantic Fragmentation and Normalization
The most critical issue identified is the lack of strict canonical normalization, which leads to semantic fragmentation. Different lexical representations of the same clinical concept (e.g., lightheadedness, dizziness, presyncope) result in distinct vector embeddings. This divergence reduces similarity accuracy and negatively affects downstream disease matching in the graph database.
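One mitigation is a canonical-mapping pass before vectorization, collapsing lexical variants onto a single label per concept. A minimal sketch — the synonym table and canonical choices below are illustrative placeholders, not a vetted SYMP/DOID mapping:

```python
# Illustrative synonym table; a production system would derive this
# from the SYMP/DOID ontologies rather than hard-code it.
CANONICAL = {
    "lightheadedness": "dizziness",
    "presyncope": "dizziness",
    "vertigo": "dizziness",
    "diaphoresis": "sweating",
    "hyperhidrosis": "sweating",
}

def normalize(symptoms: list[str]) -> list[str]:
    """Map each extracted term to its canonical label and drop duplicates,
    so synonymous extractions produce one embedding instead of several."""
    seen: set[str] = set()
    out: list[str] = []
    for s in symptoms:
        term = s.lower().strip()
        canon = CANONICAL.get(term, term)  # unknown terms pass through unchanged
        if canon not in seen:
            seen.add(canon)
            out.append(canon)
    return out
```

With this pass, `normalize(["Presyncope", "lightheadedness", "nausea"])` returns `["dizziness", "nausea"]`, so the two variants no longer yield distinct vectors downstream.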
🧠 Interpretation of Model Performance
It is important to emphasize that a lower F1 score does not necessarily indicate a “weaker” model. Analysis of the extraction logs reveals a distinct “High-Intelligence Bias” in larger architectures:
- Descriptive Precision: Models like gpt-oss-120b often extract more technical or clinically accurate terms (e.g., productive cough instead of just cough).
- Metric Penalty: These models are frequently penalized by the evaluation script for not strictly adhering to the simplified, flat labels of the ground truth, despite their output being medically valid.
- Instruction Following: Mid-sized models like qwen2.5:14b demonstrate a superior balance between clinical extraction and adherence to formatting constraints, leading to higher benchmark scores.