Model - Code Flow
This page explains the runtime sequence for both the training pipeline (train.py) and the inference service (server.py / app.py / grpc_server.py).
Training Pipeline
The full training pipeline is implemented in train.py and runs the following steps in order:
```
seed_everything()
  → W&B init
  → load_and_split()
  → verify_no_leakage()
  → tokenize_dataset()
  → load BertForSequenceClassification
  → Trainer.train()
  → evaluate_test()
  → save_artifacts()
  → (optional) copy to Google Drive
  → (optional) push_to_hub()
```
1. Seed
`seed_everything(seed)` sets seeds for `random`, `numpy`, `torch`, `torch.cuda`, `PYTHONHASHSEED`, and HuggingFace's `set_seed` so that training runs are reproducible.
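A minimal sketch of that step, reconstructed from the description above (the real function body may differ in details):

```python
import os
import random

import numpy as np
import torch
from transformers import set_seed


def seed_everything(seed: int) -> None:
    """Seed every RNG source named above."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines
    set_seed(seed)                    # HuggingFace convenience wrapper
```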
2. Data Loading and Splitting
`load_and_split()` loads the `data/data.jsonl` split from `mehmetraufoguz/turkish-news-dataset` and applies three transformations:
- SHA-256 deduplication — articles are fingerprinted by their `baslik` field; exact duplicates are dropped before splitting.
- ClassLabel casting — the `kategori` column is cast to a `ClassLabel` feature so HuggingFace's `train_test_split` can stratify by class.
- Two-step stratified split — first splits off 30% (val + test), then splits that 30% in half, yielding a 70 / 15 / 15 train / val / test ratio (see the sketch after this list).
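A minimal sketch of the two-step split, assuming a HuggingFace `Dataset` whose `kategori` column is already a `ClassLabel` (loading and deduplication are elided):

```python
from datasets import Dataset


def two_step_split(ds: Dataset, seed: int = 42):
    # Step 1: hold out 30% (val + test), stratified on the ClassLabel column.
    first = ds.train_test_split(
        test_size=0.30, stratify_by_column="kategori", seed=seed
    )
    # Step 2: split the held-out 30% in half -> 15% val / 15% test.
    second = first["test"].train_test_split(
        test_size=0.50, stratify_by_column="kategori", seed=seed
    )
    return first["train"], second["train"], second["test"]  # 70 / 15 / 15
```

Stratifying both steps keeps the class distribution approximately identical across all three splits.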
3. Leakage Verification
`verify_no_leakage()` checks all three splits for two kinds of overlap and raises `RuntimeError` on any hit:
- Record ID overlap — compares `unique_id` sets across all split pairs.
- Text fingerprint overlap — compares SHA-256 hashes of the combined `baslik + ozet` text across all split pairs.
This guarantees that the test set contains no data the model saw during training.
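A hedged sketch of the check; the `unique_id` column and field names follow the description above, and the fingerprint helper is hypothetical:

```python
import hashlib
from itertools import combinations


def _fingerprints(split) -> set:
    """SHA-256 over the combined baslik + ozet text (hypothetical helper)."""
    return {
        hashlib.sha256(f"{ex['baslik']} {ex['ozet']}".encode()).hexdigest()
        for ex in split
    }


def verify_no_leakage(train, val, test) -> None:
    splits = {"train": train, "val": val, "test": test}
    for a, b in combinations(splits, 2):
        if set(splits[a]["unique_id"]) & set(splits[b]["unique_id"]):
            raise RuntimeError(f"unique_id overlap between {a} and {b}")
        if _fingerprints(splits[a]) & _fingerprints(splits[b]):
            raise RuntimeError(f"text fingerprint overlap between {a} and {b}")
```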
4. Preprocessing and Tokenization
`tokenize_dataset()` applies `prepare_input(baslik, ozet)` to every example, which in turn calls `clean_and_lowercase()` on each field:
- `clean_and_lowercase(text)` — strips HTML tags, decodes HTML entities (`&lt;`, `&amp;`, `&quot;`, etc.), replaces `\n` / `\r` / `\t` with a space, removes bracket characters, collapses whitespace, strips, and lowercases. Returns `None` if the result is empty.
- `prepare_input(baslik, ozet)` — joins the cleaned fields as `"{baslik} |=| {ozet}"`. Empty fields become `""` so the separator is always present.
This preprocessing mirrors shared/src/utils/text-cleaner.ts → cleanAndLowercase() exactly, ensuring consistent behaviour between training and production inference.
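An approximate Python rendering of those rules (the exact regexes, and which bracket characters are removed, are assumptions; the shared TypeScript cleaner is the source of truth):

```python
import html
import re


def clean_and_lowercase(text):
    if not text:
        return None
    text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
    text = html.unescape(text)                # decode &lt;, &amp;, &quot;, ...
    text = re.sub(r"[\n\r\t]", " ", text)     # control whitespace -> space
    text = re.sub(r"[\[\]{}()]", "", text)    # remove bracket characters
    text = re.sub(r"\s+", " ", text).strip()  # collapse and strip whitespace
    return text.lower() or None               # None if nothing is left


def prepare_input(baslik, ozet) -> str:
    b = clean_and_lowercase(baslik) or ""
    o = clean_and_lowercase(ozet) or ""
    return f"{b} |=| {o}"                     # separator is always present
```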
Tokenization uses the BertTokenizer with truncation=True, max_length=256, and no padding at this stage. Padding is applied dynamically per-batch by DataCollatorWithPadding.
The kategori column is renamed to labels to match the expected input for BertForSequenceClassification.
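A sketch of this stage, assuming `dataset` is the `DatasetDict` returned by `load_and_split()` with the cleaned text in a `text` column:

```python
from transformers import BertTokenizer, DataCollatorWithPadding

tokenizer = BertTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")


def tokenize(batch):
    # No padding here; DataCollatorWithPadding pads each batch at collate time.
    return tokenizer(batch["text"], truncation=True, max_length=256)


tokenized = dataset.map(tokenize, batched=True)
tokenized = tokenized.rename_column("kategori", "labels")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

Dynamic per-batch padding pads only to the longest sequence in each batch, which wastes fewer tokens than padding everything to 256.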
5. Model
BertForSequenceClassification is loaded from dbmdz/bert-base-turkish-cased with num_labels=7. The id2label / label2id maps are injected into the model config at load time.
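The load step, reconstructed from that description; the class names are read from the `ClassLabel` feature rather than hard-coded here:

```python
from transformers import BertForSequenceClassification

# kategori was cast to ClassLabel during splitting, so its names are available.
class_names = dataset["train"].features["kategori"].names
id2label = dict(enumerate(class_names))
label2id = {name: i for i, name in id2label.items()}

model = BertForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased",
    num_labels=7,
    id2label=id2label,
    label2id=label2id,
)
```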
Architecture overview:
| Property | Value |
|---|---|
| Hidden layers | 12 |
| Attention heads | 12 |
| Hidden size | 768 |
| Vocabulary size | 32,000 |
| Max position embeddings | 512 |
6. Training Arguments and Evaluation
`build_training_args()` configures the HuggingFace `Trainer` (sketched below):
- Eval and save strategy: `"epoch"` (evaluate and checkpoint after each epoch)
- Best model selection metric: `eval_macro_f1`
- `fp16=True` on CUDA devices
- `EarlyStoppingCallback` with patience 2 — training stops if `eval_macro_f1` does not improve for 2 consecutive epochs
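A hedged sketch of those settings; hyperparameters not mentioned on this page (batch size, learning rate, epoch count) are omitted:

```python
import torch
from transformers import EarlyStoppingCallback, TrainingArguments


def build_training_args(output_dir: str) -> TrainingArguments:
    return TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",        # "evaluation_strategy" on older versions
        save_strategy="epoch",
        load_best_model_at_end=True,  # needed for best-model selection
        metric_for_best_model="eval_macro_f1",
        greater_is_better=True,
        fp16=torch.cuda.is_available(),
    )


# Passed to Trainer(callbacks=[...]); stops after 2 epochs without improvement.
early_stopping = EarlyStoppingCallback(early_stopping_patience=2)
```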
compute_metrics(eval_pred) is called after each eval step and returns:
| Metric | Method |
|---|---|
| `accuracy` | `accuracy_score` |
| `macro_f1` | Unweighted mean F1 across all 7 classes |
| `weighted_f1` | Class-frequency-weighted mean F1 |
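A sketch of `compute_metrics()` matching the table above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "weighted_f1": f1_score(labels, preds, average="weighted"),
    }
```

The `Trainer` prefixes these keys with `eval_`, which is how `macro_f1` becomes the `eval_macro_f1` selection metric.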
7. Test Evaluation and Artifact Saving
After training, evaluate_test() runs inference on the held-out test split, produces a full classification_report (sklearn), and logs all per-class and aggregate metrics to W&B under the test/ prefix.
`save_artifacts()` writes the best checkpoint as the following files (sketched below):
- `model.safetensors` — model weights in safe serialization format
- `tokenizer.*` files — tokenizer vocabulary and configuration
- `label_map.json` — `id2label`, `label2id`, `num_labels`, `separator`
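A sketch of the artifact-writing step; the `label_map.json` layout follows the field list above:

```python
import json
from pathlib import Path


def save_artifacts(model, tokenizer, id2label, label2id, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    model.save_pretrained(out, safe_serialization=True)  # model.safetensors
    tokenizer.save_pretrained(out)                       # tokenizer.* files
    (out / "label_map.json").write_text(json.dumps(
        {
            "id2label": id2label,
            "label2id": label2id,
            "num_labels": len(id2label),
            "separator": "|=|",
        },
        ensure_ascii=False,
        indent=2,
    ))
```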
Inference Flow
Combined Server (server.py)
The production entry point runs FastAPI and gRPC concurrently in a single Python process:
```
main()
  → _ModelSingleton.load(model_dir)   # weights loaded once
  → asyncio.TaskGroup
       ├── _run_grpc(port)            # grpc.aio async server
       └── _run_fastapi(port)         # uvicorn.Server (signal handlers disabled)
```
Both servers share the same `_ModelSingleton` instance. The model is loaded once before the `TaskGroup` starts; neither server begins accepting requests until the weights are fully loaded.
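A condensed sketch of that startup sequence; `_ModelSingleton` is the loader from the diagram, and the gRPC runner body is elided:

```python
import asyncio

import uvicorn
from fastapi import FastAPI

app = FastAPI()  # the real app defines the /predict route shown below


class _NoSignalServer(uvicorn.Server):
    # uvicorn normally installs its own SIGTERM/SIGINT handlers; disabling
    # them lets the surrounding process coordinate shutdown for both servers.
    def install_signal_handlers(self) -> None:
        pass


async def _run_fastapi(port: int) -> None:
    await _NoSignalServer(uvicorn.Config(app, port=port)).serve()


async def _run_grpc(port: int) -> None:
    ...  # grpc.aio server setup, elided here


async def main(model_dir: str, grpc_port: int, http_port: int) -> None:
    _ModelSingleton.load(model_dir)        # weights in memory before serving
    async with asyncio.TaskGroup() as tg:  # requires Python 3.11+
        tg.create_task(_run_grpc(grpc_port))
        tg.create_task(_run_fastapi(http_port))
```

If either task crashes, the `TaskGroup` cancels the other, so the process never runs with only one of the two servers alive.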
FastAPI Predict Path
```
POST /predict
  → validate PredictRequest (Pydantic: id, baslik, ozet)
  → prepare_input(baslik, ozet)         # HTML-strip + lowercase + "|=|" join
  → tokenizer(text, truncation, max_length=256)
  → model(**inputs)                     # torch.no_grad()
  → softmax(logits)
  → return PredictResponse {
        predicted_category,
        confidence,
        all_confidences
    }
```
Returns 503 if the model singleton is not loaded, 422 if the input is empty after preprocessing.
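A hedged sketch of the handler; the singleton accessors and the exact empty-input test are assumptions, and `prepare_input` is the shared cleaner sketched in the training section:

```python
import torch
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()


class PredictRequest(BaseModel):
    id: str
    baslik: str
    ozet: str


@app.post("/predict")
def predict(req: PredictRequest) -> dict:
    if not _ModelSingleton.loaded:                      # assumed flag
        raise HTTPException(status_code=503, detail="model not loaded")
    model, tokenizer, id2label = _ModelSingleton.get()  # assumed accessor
    text = prepare_input(req.baslik, req.ozet)
    if text.strip() == "|=|":  # both fields empty after cleaning
        raise HTTPException(status_code=422, detail="empty input after preprocessing")
    inputs = tokenizer(text, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    top = int(probs.argmax())
    return {
        "predicted_category": id2label[top],
        "confidence": float(probs[top]),
        "all_confidences": {id2label[i]: float(p) for i, p in enumerate(probs)},
    }
```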
gRPC Predict Path
```
ModelService.Predict(PredictRequest { id, baslik, ozet })
  → _ModelSingleton.predict(baslik, ozet)
  → (same inference path as FastAPI)
  → return PredictResponse { predicted_category, confidence, all_confidences }
```
On error, the gRPC handler sets `grpc.StatusCode.INTERNAL`.
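A sketch of the servicer's error path; the generated `model_pb2` / `model_pb2_grpc` module names and the shape of the singleton's return value are assumptions:

```python
import grpc


class ModelService(model_pb2_grpc.ModelServiceServicer):
    def Predict(self, request, context):
        try:
            result = _ModelSingleton.predict(request.baslik, request.ozet)
            return model_pb2.PredictResponse(
                predicted_category=result["predicted_category"],
                confidence=result["confidence"],
                all_confidences=result["all_confidences"],
            )
        except Exception as exc:
            # Any failure surfaces as INTERNAL, as described above.
            context.set_code(grpc.StatusCode.INTERNAL)
            context.set_details(str(exc))
            return model_pb2.PredictResponse()
```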
The standalone gRPC server (grpc_server.py) uses a ThreadPoolExecutor(max_workers=cpu_count) and handles graceful shutdown on SIGTERM / SIGINT with a 5-second grace period.
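A sketch of that shutdown handling; servicer registration is elided:

```python
import os
import signal
import threading
from concurrent import futures

import grpc


def serve(port: int) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=os.cpu_count()))
    # model_pb2_grpc.add_ModelServiceServicer_to_server(ModelService(), server)
    server.add_insecure_port(f"[::]:{port}")
    server.start()

    # Block until SIGTERM or SIGINT, then drain in-flight RPCs.
    stop = threading.Event()
    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, lambda *_: stop.set())
    stop.wait()
    server.stop(grace=5).wait()  # 5-second grace period
```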