Model - Results
This page summarises the evaluation results for turkish-news-bert-base on the held-out test split (15% of the dataset, stratified by category).
For the full training report — including loss and metric curves per epoch, run configuration, and per-class breakdown — see the W&B report:
W&B Training Report →
Evaluation Metrics
Evaluated on the held-out test split.
| Metric | Score |
|---|---|
| Accuracy | 0.8705 |
| Macro F1 | 0.8674 |
| Weighted F1 | 0.8687 |
- Accuracy — fraction of correctly classified articles across all classes.
- Macro F1 — unweighted mean of per-class F1 scores; gives equal weight to each category regardless of sample count. This was the primary metric used to select the best model checkpoint during training.
- Weighted F1 — mean of per-class F1 scores weighted by class frequency in the test set.
The closeness of macro and weighted F1 (0.8674 vs 0.8687) is consistent with the dataset's balanced class distribution: when every class has the same number of test samples, the two averages coincide.
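To make the three definitions concrete, here is a minimal pure-Python sketch of how they can be computed from label lists. It is not the project's evaluation code; the function names `per_class_f1` and `summarise` and the toy labels are illustrative assumptions.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, using F1 = 2*TP / (2*TP + FP + FN) per class."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples scores 0.0 here.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def summarise(y_true, y_pred):
    """Compute accuracy, macro F1 and weighted F1 for two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    f1 = per_class_f1(y_true, y_pred, labels)
    support = Counter(y_true)  # number of true samples per class
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    macro = sum(f1.values()) / len(labels)                  # equal weight per class
    weighted = sum(f1[c] * support[c] for c in labels) / n  # weight by class frequency
    return {"accuracy": accuracy, "macro_f1": macro, "weighted_f1": weighted}


# Toy, deliberately imbalanced example (hypothetical labels, not project data):
# four samples of class 0, two of class 1, one class-1 sample misclassified.
toy = summarise([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 0])
```

On this imbalanced toy input, macro F1 is about 0.778 and weighted F1 about 0.815, because the weaker minority class pulls the unweighted mean down further. With equal class counts the two values are identical.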
Category Map
| ID | Label | Description |
|---|---|---|
| 0 | POLITIKA | Politics and government |
| 1 | EKONOMI | Economy and finance |
| 2 | SPOR | Sports |
| 3 | SAGLIK | Health and medicine |
| 4 | KULTUR_SANAT | Culture and arts |
| 5 | DUNYA | World / international news |
| 6 | TEKNOLOJI | Technology |
Notes
Dataset balance. The dataset was collected with explicit per-category targets to ensure balanced class distribution across all three sources. Per-category shortfalls in one source were compensated by higher quotas from others (e.g. the TEKNOLOJI quota was higher for Sabah and Sözcü to offset Hürriyet's lower technology coverage).
Split strategy. The 70/15/15 train/val/test split is stratified by category label. Before every training run, the splits are checked for data leakage by testing for overlap in both unique_id and SHA-256 text fingerprint across all split pairs. See the Code Flow page for details.
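The leakage check described above could look roughly like the following sketch. It assumes each record is a dict with a `unique_id` key and a `text` key (the `text` field name is hypothetical), and it hashes the raw text without any normalisation, which may differ from the project's actual fingerprinting; the Code Flow page remains the authoritative description.

```python
import hashlib
from itertools import combinations


def fingerprint(text):
    """SHA-256 hex digest of the UTF-8 encoded article text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_leakage(splits):
    """Report overlaps between every pair of splits.

    splits: dict mapping split name -> list of records, where each record
    is a dict with 'unique_id' and 'text' keys (assumed schema).
    Returns {(split_a, split_b): {'unique_id': set, 'fingerprint': set}}
    containing only the pairs that share at least one ID or fingerprint.
    """
    ids = {name: {r["unique_id"] for r in rows} for name, rows in splits.items()}
    prints = {name: {fingerprint(r["text"]) for r in rows} for name, rows in splits.items()}
    report = {}
    for a, b in combinations(splits, 2):
        shared_ids = ids[a] & ids[b]
        shared_prints = prints[a] & prints[b]
        if shared_ids or shared_prints:
            report[(a, b)] = {"unique_id": shared_ids, "fingerprint": shared_prints}
    return report
```

An empty report means no pair of splits shares an ID or an identical text. Checking fingerprints in addition to IDs catches the case where the same article text was stored under two different IDs.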
Macro vs. weighted F1. Macro F1 was used as metric_for_best_model in TrainingArguments because it treats each category equally and penalises poor performance on any single class. Weighted F1 is reported as a secondary measure of overall quality weighted by class frequency.
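As a rough illustration of the checkpoint-selection setting, the relevant TrainingArguments options might be collected as below. Only metric_for_best_model is confirmed by this page; the metric key "macro_f1", the output directory, and the per-epoch strategies are assumptions made for the sketch.

```python
# Keyword arguments intended for transformers.TrainingArguments(**training_kwargs).
# All values except the use of metric_for_best_model are illustrative assumptions.
training_kwargs = dict(
    output_dir="checkpoints",            # hypothetical path
    eval_strategy="epoch",               # named evaluation_strategy in older transformers releases
    save_strategy="epoch",               # must match the evaluation strategy for best-model loading
    load_best_model_at_end=True,         # reload the best checkpoint when training finishes
    metric_for_best_model="macro_f1",    # assumed key; must match a key returned by compute_metrics
    greater_is_better=True,              # a higher macro F1 is better
)
```

The Trainer adds the eval_ prefix to the metric name if it is missing, so a compute_metrics function that returns a "macro_f1" key is matched as eval_macro_f1.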