dscli — Dataset Gathering & Scraping Methods

This document describes the goal of the datasets-gathering project, how the dscli command-line tool works, and how Turkish news articles were scraped from Sabah, Hürriyet, and Sözcü using a two-phase pipeline.

Goal

The aim is to build a clean, labeled Turkish news dataset covering seven content categories: Politics, Economy, Sports, Health, Culture & Art, World, and Technology. Articles are collected from three major Turkish news sources, stored in a local PostgreSQL database, and then exported to formats suitable for model training or upload to Hugging Face.

Each article is stored with its title, summary, body text, publication date, source, category, and original URL. The dataset was uploaded to turkish-news-dataset on Hugging Face and is kept private for educational and learning purposes.

How the Two-Phase Pipeline Works

Data collection follows a two-phase process for every source.

Phase 1 — Metadata harvesting. The scraper visits a category listing page and collects article titles and URLs. No full article content is fetched at this stage. The result is a list of articles identified by source, category, and URL. Each record is stored in the database with is_filled=0, indicating it is not yet enriched.

Phase 2 — Content enrichment. For each URL collected in Phase 1, the scraper visits the full article page and extracts the body text, a summary, and the publication date. Once successfully processed, the record is updated and marked is_filled=1.

This split keeps the pipeline resilient: if Phase 2 fails on a particular article due to a network error, a changed page layout, or insufficient content, the Phase 1 metadata record is preserved and Phase 2 can be re-run independently for the failed items without losing any previously collected data.
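
As a concrete illustration, a scraper might drive the two phases through dscli roughly as sketched below, using the flags documented later in this reference. The title, URL, and date are placeholder values, and the real scripts add error handling around each call:

```js
// Minimal sketch: run Phase 1 and Phase 2 for one article via dscli.
const { execFileSync } = require('node:child_process');

// Phase 1: insert metadata only (record starts with is_filled=0).
// dscli add prints the generated record_id (UUID) to stdout.
const recordId = execFileSync('dscli', [
  'add',
  '--source', 'sabah',
  '--baslik', 'Example article title',
  '--kategori', '6',
  '--kaynak_url', 'https://www.sabah.com.tr/example-article',
], { encoding: 'utf8' }).trim();

// Phase 2: enrich the same record with full content (sets is_filled=1).
execFileSync('dscli', [
  'update',
  '--record_id', recordId,
  '--ozet', 'Extractive summary of the article.',
  '--icerik', 'Full cleaned article body text.',
  '--yayim_tarihi', '2024-01-15',
]);
```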

Categories

Number  Label
0       POLITIKA
1       EKONOMI
2       SPOR
3       SAGLIK
4       KULTUR_SANAT
5       DUNYA
6       TEKNOLOJI

Category values are zero-indexed so labels map directly onto the conventions expected by machine learning models.
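
For reference, the mapping can be expressed as a constant; the name KATEGORI is illustrative rather than taken from the codebase:

```js
// Category labels and their zero-indexed numeric values.
const KATEGORI = {
  POLITIKA: 0,
  EKONOMI: 1,
  SPOR: 2,
  SAGLIK: 3,
  KULTUR_SANAT: 4,
  DUNYA: 5,
  TEKNOLOJI: 6,
};
```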

dscli — Command Reference

dscli is a local CLI tool that manages article records in a PostgreSQL database. It is built with Node.js and the Commander library. The database connection is hardcoded to postgres://postgres:postgres@localhost:5432/aan_db_scrape.

dscli list

Lists non-deleted records, 16 per page.

Option       Description
--source     Filter by source: sabah, hurriyet, or sozcu
--kategori   Filter by category number (0–6)
--by         Sort order: updated-desc (default), updated-asc, created-desc, created-asc
--page       Page number, zero-based (default: 0)

dscli add

Inserts a Phase 1 metadata record. The article is stored with is_filled=0, meaning it has a title and URL but no body content yet.

Option        Required  Description
--source      yes       Source slug: sabah, hurriyet, or sozcu
--baslik      yes       Article title
--kategori    yes       Category number (0–6)
--kaynak_url  yes       Full article URL

Duplicate detection is built in: a unique_id is derived by concatenating source + baslik + kategori, lowercased with spaces stripped. If a record with the same unique_id already exists, the command exits with an error — preventing duplicates from entering the database. On success, the generated record_id (UUID) is printed to stdout. Scraping scripts capture this value to reference the record during Phase 2.
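
The derivation rule can be sketched as follows. This is a simplified illustration; the exact implementation inside dscli may differ, for example in how it lowercases Turkish characters:

```js
// unique_id = source + baslik + kategori, lowercased, spaces removed.
function uniqueId(source, baslik, kategori) {
  return `${source}${baslik}${kategori}`
    .toLowerCase()
    .replace(/\s+/g, '');
}

// e.g. uniqueId('sabah', 'Yeni Teknoloji Haberi', 6)
//      -> 'sabahyeniteknolojihaberi6'
```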

dscli update

Enriches a Phase 1 record with full content. This is the Phase 2 step. Only records that exist, are not soft-deleted, and have is_filled=0 can be updated. Once updated, is_filled is set to 1.

Option          Required  Description
--record_id     yes       UUID of the record to update
--ozet          yes       Extractive summary (typically 2–5 sentences)
--icerik        yes       Full cleaned article body text
--yayim_tarihi  yes       Publication date in YYYY-MM-DD format

dscli detail

Prints all fields of a single record by its record_id. The icerik field is truncated to 200 characters in the display.

dscli delete

Soft-deletes a single record by setting its deleted_at timestamp. The record remains in the database but is excluded from all queries. Hard deletion is not supported.

dscli bulk_delete

Soft-deletes up to 16 records in a single operation, accepting a comma-separated list of UUIDs via --record_ids.

dscli dump

Exports records to a file for use in training pipelines or upload to Hugging Face.

Option       Required  Description
--source     yes       Source slug, or all for all sources
--format     yes       csv or jsonl
--path       yes       Output file path
--kategori   no        Filter by category number
--mode       no        full (all columns) or compact (title, summary, body, category, date, URL only); default: full
--randomize  no        Shuffle records using Fisher-Yates before writing
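
The --randomize behavior corresponds to a standard in-place Fisher-Yates pass, roughly as sketched below (not the literal dump implementation):

```js
// In-place Fisher-Yates shuffle over the exported records.
function shuffle(records) {
  for (let i = records.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [records[i], records[j]] = [records[j], records[i]];
  }
  return records;
}
```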

dscli status

Prints an aggregated dashboard showing how many records have been collected per source and per category, and what percentage of them have been fully enriched with Phase 2 content. Soft-deleted records are excluded from all counts.

Database Schema

Each article record contains the following fields:

Field          Description
record_id      UUID; primary key
source         Source slug: sabah, hurriyet, or sozcu
baslik         Article title
ozet           Extractive summary (nullable until Phase 2 is complete)
icerik         Full article body (nullable until Phase 2 is complete)
kategori       Integer 0–6
yayim_tarihi   Publication date as an ISO string (nullable)
kaynak_url     Original article URL
unique_id      Deduplication key derived from source + baslik + kategori, lowercased with spaces removed; has a UNIQUE constraint
is_filled      0 = metadata only; 1 = fully enriched
created_at     Record creation timestamp
updated_at     Last modification timestamp
deleted_at     Soft-delete timestamp (null while the record is active)

Scraping Methods by Source

Hürriyet

Tool: Node.js with the native fetch API and cheerio for HTML parsing. No browser automation is required.

Phase 1 — hurriyet_phase1.js

The script navigates Hürriyet category listing pages using ?p=<n> query parameters for standard pagination. Before collecting any URLs, it first calls dscli dump to load all previously known Hürriyet URLs into memory, preventing re-insertion of already-collected articles. For each listing page, multiple CSS selectors are tried to locate article cards (.category__list__item, .gallery-card, .list-item, and others), and the title and URL are extracted from link elements. Each new article is added to the database via dscli add.
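
A reduced sketch of that listing pass, assuming the selectors mentioned above; hurriyet_phase1.js also checks further selectors and handles relative URLs:

```js
// Sketch: fetch one listing page and pull title/URL pairs out of the
// first card selector that matches anything.
const cheerio = require('cheerio');

async function collectHurriyetListing(categoryUrl, pageNo) {
  const html = await (await fetch(`${categoryUrl}?p=${pageNo}`)).text();
  const $ = cheerio.load(html);

  for (const selector of ['.category__list__item', '.gallery-card', '.list-item']) {
    const articles = [];
    $(selector).each((_, card) => {
      const link = $(card).is('a') ? $(card) : $(card).find('a[href]').first();
      const url = link.attr('href');
      const title = link.attr('title') || link.text().trim();
      if (url && title) articles.push({ title, url });
    });
    if (articles.length > 0) return articles;
  }
  return [];
}
```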

Phase 2 — hurriyet_phase2.js

Records are processed with 4 concurrent workers. For each article URL, the following pipeline runs in sequence:

  1. Body extraction — 11 CSS selectors are tried in priority order (div[data-test-id="article-body"], div.article-body, and others). If none match, all paragraph tags on the page are used as a fallback.
  2. Update line removal — strips lines that contain editor update timestamps (lines matching patterns like "güncellendi:", "güncelleme tarihi:", etc.).
  3. Disclaimer removal — removes paragraphs containing copyright notices, publication restrictions, or references to hurriyet.com.tr.
  4. Text cleaning — strips HTML tags, decodes HTML entities, collapses whitespace, removes stray bracket characters, and lowercases the result.
  5. Summarization — selects the first 4 sentences of at least 20 characters each, skipping date or update-related lines.
  6. Date extraction — checks Open Graph meta tags (article:published_time), <time> elements, and JSON-LD structured data for the publication date.

Articles with fewer than 200 characters of cleaned body text are skipped entirely. Per-article intermediate workspace files (raw HTML, cleaned body, summary) are saved locally for review. The final result is stored via dscli update.
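
The date-extraction step (step 6 above) follows a priority order that can be sketched like this; the helper name and exact parsing are illustrative:

```js
// Publication date: Open Graph meta tag, then <time>, then JSON-LD.
function extractPublishDate($) {
  const og = $('meta[property="article:published_time"]').attr('content');
  if (og) return og.slice(0, 10);                       // keep YYYY-MM-DD

  const time = $('time[datetime]').first().attr('datetime');
  if (time) return time.slice(0, 10);

  for (const el of $('script[type="application/ld+json"]').toArray()) {
    try {
      const data = JSON.parse($(el).text());
      const node = Array.isArray(data) ? data.find((d) => d && d.datePublished) : data;
      if (node && node.datePublished) return node.datePublished.slice(0, 10);
    } catch { /* ignore malformed JSON-LD blocks */ }
  }
  return null;
}
```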

Sabah

Tool: Playwright headless Chromium for Phase 1. Phase 2 uses a three-strategy extractor: inline script parsing first, then vanilla fetch + cheerio, then Playwright headless as a last resort.

Phase 1 — sabah_phase1.js

Sabah's category pages are client-side rendered — articles load dynamically as the user scrolls rather than via traditional page links. A headless browser simulates scrolling and observes the DOM as new content appears. Article links are collected from <figcaption> elements and from a[title] links filtered to technology-path URLs. Collected URLs are deduplicated in memory. The browser scrolls for up to 300 iterations or until the target number of articles is reached. Checkpoints are written to disk on each navigation state change, and all activity is logged to a file.
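
A trimmed-down version of that scroll-and-collect loop might look like the sketch below, assuming Playwright's chromium API. The path filter and scroll step are illustrative, and the real script layers checkpointing and logging on top:

```js
const { chromium } = require('playwright');

async function collectSabahLinks(listingUrl, target) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(listingUrl, { waitUntil: 'domcontentloaded' });

  const seen = new Map();                          // url -> title, in-memory dedup
  for (let i = 0; i < 300 && seen.size < target; i++) {
    const links = await page.$$eval('figcaption a[href], a[title][href]', (els) =>
      els.map((a) => ({ url: a.href, title: a.getAttribute('title') || a.textContent.trim() }))
    );
    for (const { url, title } of links) {
      // Illustrative technology-path filter; the real filter may differ.
      if (url.includes('/teknoloji') && !seen.has(url)) seen.set(url, title);
    }
    await page.mouse.wheel(0, 2000);               // scroll to trigger lazy loading
    await page.waitForTimeout(500);
  }

  await browser.close();
  return [...seen].map(([url, title]) => ({ title, url }));
}
```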

Unlike other sources, Phase 1 for Sabah does not call dscli add directly. Results are saved to a JSON file and records are ingested into the database separately after the output is reviewed.

Phase 2 — sabah_phase2.js

This is the most complex script in the pipeline. Three extraction strategies are attempted in order for each article, advancing to the next only when the current strategy yields fewer than 200 characters of usable content:

  1. Inline script extraction — searches all <script> tags in the raw HTML for JSON-LD blocks, __NEXT_DATA__ payloads, or other embedded global state that may contain the article body. Keys checked include articleBody, body, content, description, lead, and text.
  2. Vanilla fetch + DOM parsing — fetches the page with a plain HTTP request and uses cheerio to locate the article body via semantic selectors (main, article, [itemprop="articleBody"], div.article-body, div.haber, div.detail, and others). Paragraph text is collected and joined.
  3. Playwright headless fallback — for articles where both previous strategies failed, a real browser loads the page, waits for article body selectors to appear, and smooth-scrolls the page up to 8 times while watching for content changes. This specifically handles Sabah's infinite gallery layout where body content is injected progressively as the page scrolls.
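
The cascade itself reduces to a small loop; the extractor functions below are placeholders standing in for the three strategies described above:

```js
const MIN_BODY_LENGTH = 200;   // below this, the next strategy is tried

async function extractSabahBody(url, html) {
  const strategies = [
    () => extractFromInlineScripts(html),  // 1. JSON-LD / __NEXT_DATA__ payloads
    () => extractWithCheerio(html),        // 2. plain fetch + semantic selectors
    () => extractWithPlaywright(url),      // 3. headless browser with scrolling
  ];
  for (const strategy of strategies) {
    const body = await strategy();
    if (body && body.length >= MIN_BODY_LENGTH) return body;
  }
  return null;                             // article is skipped entirely
}
```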

All extracted text passes through a canonical normalization pipeline:

  • HTML entities are decoded, including numeric (&#8230;) and named references
  • Invisible Unicode characters such as the combining dot above (U+0307) are stripped to fix mangled Turkish characters
  • Unicode NFC normalization is applied
  • Turkish-aware lowercasing uses toLocaleLowerCase('tr'), which correctly converts İ→i and I→ı (standard toLowerCase() handles Turkish incorrectly)
  • All whitespace and non-breaking spaces are collapsed to single spaces
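
Condensed into a single function, the steps above look roughly like this; entity decoding is reduced to a few common references for brevity:

```js
function normalizeTurkishText(raw) {
  return raw
    .replace(/&#(\d+);/g, (_, code) => String.fromCodePoint(Number(code))) // numeric entities
    .replace(/&nbsp;/g, ' ')
    .replace(/&amp;/g, '&')
    .replace(/\u0307/g, '')            // strip combining dot above (fixes mangled İ/i)
    .normalize('NFC')                  // canonical Unicode composition
    .toLocaleLowerCase('tr')           // Turkish-aware: İ -> i, I -> ı
    .replace(/[\s\u00A0]+/g, ' ')      // collapse whitespace and non-breaking spaces
    .trim();
}
```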

Summaries are generated using the Intl.Segmenter API with the Turkish locale for accurate sentence boundary detection, taking the first 3 to 5 sentences. If a sentence is a leading repetition of the next one, it is removed. The final body text is capped at 8,000 characters.
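
A sketch of that summarization step, assuming a plain startsWith check covers the leading-repetition case:

```js
function summarizeTr(body, maxSentences = 5) {
  const segmenter = new Intl.Segmenter('tr', { granularity: 'sentence' });
  const sentences = [...segmenter.segment(body)]
    .map((s) => s.segment.trim())
    .filter(Boolean);

  const picked = [];
  for (let i = 0; i < sentences.length && picked.length < maxSentences; i++) {
    // Drop a sentence that is only a leading repetition of the one after it.
    if (i + 1 < sentences.length && sentences[i + 1].startsWith(sentences[i])) continue;
    picked.push(sentences[i]);
  }
  return picked.join(' ');
}
```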

Sözcü

Tools: Playwright headless browser for Phase 1. Phase 2 uses two paths depending on the category: vanilla fetch + cheerio for articles that render without JavaScript, and the Obscura browser automation tool for categories where JavaScript execution is required.

Phase 1 — sozcu_phase1.js

Sözcü's listing pages are JavaScript-rendered and display a cookie consent overlay on first visit. The script automatically dismisses the consent banner by clicking known consent button selectors. Article links are then collected by repeatedly reading div.row.align-items-center.mb-4 elements from the page. When a "Load More" button is visible it is clicked; otherwise the page is scrolled to trigger new content to load. Collection stops when the target number of articles is gathered or when 12 consecutive iterations yield no new URLs. Results are saved to a JSON file with titles and source URLs.
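
The collect loop can be summarized as below. This is a sketch: the consent and load-more selectors are placeholders, and sozcu_phase1.js uses the selectors it actually knows about:

```js
async function collectSozcuLinks(page, target) {
  // Dismiss the cookie consent overlay first (placeholder selector).
  const consent = await page.$('#consent-accept');
  if (consent) await consent.click();

  const seen = new Set();
  let staleRounds = 0;
  while (seen.size < target && staleRounds < 12) {
    const before = seen.size;
    const urls = await page.$$eval('div.row.align-items-center.mb-4 a[href]',
      (els) => els.map((a) => a.href));
    urls.forEach((u) => seen.add(u));

    const loadMore = await page.$('button.load-more');   // placeholder selector
    if (loadMore && await loadMore.isVisible()) await loadMore.click();
    else await page.mouse.wheel(0, 2000);                // otherwise, scroll
    await page.waitForTimeout(500);

    staleRounds = seen.size === before ? staleRounds + 1 : 0;
  }
  return [...seen];
}
```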

Phase 2 — fetch path (sozcu_phase2.js)

Used for the politics category (kategori 0). Each article URL is fetched directly without a browser and parsed with cheerio:

  • Summary — extracted from the h2.description heading element. If the result is shorter than 40 characters, the first 4 sentences of the body text are used instead.
  • Body — extracted from div.article-body[property="articleBody"], with fallbacks to main, article, and div.col-6. Trailing "Photo:" and "Related News:" suffixes are stripped.
  • Date — parsed from the <time datetime=""> attribute.

Both summary and body are lowercased. Body text is capped at 8,000 characters. Per-article validation files are written locally for inspection.
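
A sketch of that extraction path using the selectors documented above; fallback handling and suffix stripping are simplified:

```js
const cheerio = require('cheerio');

async function extractSozcuArticle(url) {
  const $ = cheerio.load(await (await fetch(url)).text());

  const body = ($('div.article-body[property="articleBody"]').text()
    || $('main').text()
    || $('article').text()
    || $('div.col-6').text())
    .trim()
    .slice(0, 8000);                                    // cap at 8,000 characters

  let summary = $('h2.description').text().trim();
  if (summary.length < 40) {
    summary = body.split('. ').slice(0, 4).join('. ');  // crude sentence fallback
  }

  const yayimTarihi = $('time[datetime]').first().attr('datetime') || null;

  return {
    ozet: summary.toLocaleLowerCase('tr'),
    icerik: body.toLocaleLowerCase('tr'),
    yayim_tarihi: yayimTarihi,
  };
}
```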

Phase 2 — Obscura path (sozcu_phase2_obscura.sh + sozcu_phase2_obscura_normalize.js)

Used for the economy category (kategori 1), where Sözcü's article pages require JavaScript execution to render content. Obscura is a CLI browser automation tool that visits a list of URLs concurrently using real browser tabs, executes a provided JavaScript expression on each fully-rendered page, and returns all results as a single JSON file. The JavaScript expression passed to Obscura selects three values from each page: the publication timestamp from <time datetime="">, the article summary from the .description heading, and the body paragraphs from div.article-body[property="articleBody"].

The first run processes all URLs at a concurrency of 8. URLs that failed or timed out are identified and retried at a lower concurrency of 5 using a separate retry shell script. The retry results are merged back into the original output. A normalization script then processes the combined raw output — cleaning and lowercasing text, parsing date strings, and matching each article to its record_id value by cross-referencing a dscli dump export keyed by URL.
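
The page expression handed to Obscura is essentially a DOM snippet of the following shape. The exact expression in sozcu_phase2_obscura.sh may differ in detail, but it reads the same three values:

```js
// Evaluated in the context of each fully rendered article page.
({
  date: document.querySelector('time[datetime]')?.getAttribute('datetime') ?? null,
  summary: document.querySelector('.description')?.textContent.trim() ?? null,
  body: Array.from(
    document.querySelectorAll('div.article-body[property="articleBody"] p')
  ).map((p) => p.textContent.trim()).join(' '),
})
```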

Exporting for Hugging Face

Once records are fully enriched (Phase 2 complete, is_filled=1), dscli dump exports the dataset. The compact mode produces a clean output containing only the fields relevant for training: title, summary, body, category, publication date, and source URL. The --randomize flag shuffles records so that category ordering does not create bias in training batches. Both csv and jsonl formats are supported.

The scraped data has been uploaded to turkish-news-dataset on Hugging Face. The dataset is private and is used for educational and learning purposes only.