Structured & Detailed

This section translates the Raw notes below into a structured dataset schema, collection flow, and the CLI/storage contract used to prepare and export datasets for Hugging Face.

Schema

  • source: slug of origin (example: “sabah”, “hurriyet”, “sozcu”)
  • baslik: title of the content
  • ozet: short description (may be empty during initial metadata collection)
  • icerik: cleaned original article text (strip HTML, keep article body)
  • kategori: integer — value from the Kategori enum (see Raw)
  • yayim_tarihi: publication / release date
  • kaynak_url: original article URL
  • unique_id: concatenation of source+baslik+kategori (baslik and kategori should be cleaned: lowercased, spaces removed); see the sketch after this list.
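
The schema can be sketched as a TypeScript interface together with the unique_id derivation. This is a minimal sketch: NewsRecord, cleanKey, and buildUniqueId are hypothetical names, and Kategori is the enum from the Raw notes below.

export interface NewsRecord {
  source: string;       // origin slug, e.g. "sabah"
  baslik: string;       // title of the content
  ozet: string;         // short description; may be empty after Stage 1
  icerik: string;       // cleaned original article text
  kategori: Kategori;   // value from the Kategori enum (see Raw)
  yayim_tarihi: string; // publication / release date
  kaynak_url: string;   // original article URL
  unique_id: string;    // source + cleaned baslik + cleaned kategori
}

// Hypothetical cleaning helper: lowercase and remove spaces, per the
// unique_id rule above.
const cleanKey = (value: string): string => value.toLowerCase().replace(/\s+/g, "");

export function buildUniqueId(source: string, baslik: string, kategori: Kategori): string {
  // String(kategori) yields the numeric enum value; cleaning is a no-op
  // for numbers but still applies if the enum name is used instead.
  return source + cleanKey(baslik) + cleanKey(String(kategori));
}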

Collection flow

  • Stage 1 — metadata collection: gather and save source, baslik, kategori, and kaynak_url for candidate items.
  • Stage 2 — full-content collection: once per-source category targets are reached (see Kategori targets below), fetch and save ozet, icerik, and yayim_tarihi; clean icerik before storing.
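
The per-stage field split can be expressed as type picks over the NewsRecord sketch above; again a sketch, not a fixed contract:

type Stage1Metadata = Pick<NewsRecord, "source" | "baslik" | "kategori" | "kaynak_url">;
type Stage2Content  = Pick<NewsRecord, "ozet" | "icerik" | "yayim_tarihi">;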

Kategori targets (from Raw)

  • POLITIKA (1): 80 items per source
  • EKONOMI (2): 72 items per source
  • SPOR (3): 72 items per source
  • SAGLIK (4): 72 items per source
  • KULTUR_SANAT (5): 72 items per source
  • DUNYA (6): 72 items per source
  • TEKNOLOJI (7): 72 items per source — note: some sources (Sabah, Sözcü) may provide 108 items to support Hürriyet's Teknoloji coverage.
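
One way to encode these targets, sketched in TypeScript; TARGETS and targetFor are hypothetical names, and the TEKNOLOJI override reflects the note above:

const TARGETS: Record<Kategori, number> = {
  [Kategori.POLITIKA]: 80,
  [Kategori.EKONOMI]: 72,
  [Kategori.SPOR]: 72,
  [Kategori.SAGLIK]: 72,
  [Kategori.KULTUR_SANAT]: 72,
  [Kategori.DUNYA]: 72,
  [Kategori.TEKNOLOJI]: 72,
};

function targetFor(source: string, kategori: Kategori): number {
  // Assumed override: sabah and sozcu may collect 108 TEKNOLOJI items
  // to support Hürriyet's Teknoloji coverage.
  if (kategori === Kategori.TEKNOLOJI && (source === "sabah" || source === "sozcu")) {
    return 108;
  }
  return TARGETS[kategori];
}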

CLI & storage contract

  • Non-interactive CLI (no TUI).
  • Local storage: save records to a local PGlite store.
  • Export: dump a single source or all sources to csv or jsonl files for Hugging Face consumption.
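
A minimal storage sketch, assuming the @electric-sql/pglite package; the table and column names (records, deleted_at, created_at, updated_at) are assumptions, not a fixed contract:

import { PGlite } from "@electric-sql/pglite";

// Persist to a local data directory so records survive between runs.
const db = new PGlite("./data/pgdata");

await db.exec(`
  CREATE TABLE IF NOT EXISTS records (
    record_id     uuid PRIMARY KEY,
    source        text NOT NULL,
    baslik        text NOT NULL,
    ozet          text NOT NULL DEFAULT '',
    icerik        text NOT NULL DEFAULT '',
    kategori      int  NOT NULL,
    yayim_tarihi  text,
    kaynak_url    text NOT NULL,
    unique_id     text NOT NULL UNIQUE,
    deleted_at    timestamptz,            -- soft-delete marker
    created_at    timestamptz NOT NULL DEFAULT now(),
    updated_at    timestamptz NOT NULL DEFAULT now()
  );
`);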

CLI commands (summary)

  • help: show all commands and brief usage; each command also supports a detailed help view with examples.
  • list: console table of available data (first 16 items, ordered by updated-desc by default); only non-soft-deleted records; fields shown: source, baslik, kategori, kaynak_url, status (filled or not). Optional args: by (updated-desc|updated-asc|created-desc|created-asc), page (offset = page * limit; pages start at 0). See the pagination sketch after this list.
  • detail: show all information for a record. Required: record_id.
  • add: add only metadata. On success print the new record_id. A UUID may be used for record_id. Required args: source, baslik, kategori, kaynak_url. unique_id = source+baslik+kategori (cleaned).
  • update: add full-content fields; cannot update other metadata fields; cannot update records that are already updated or soft-deleted. Required args: record_id, ozet, icerik, yayim_tarihi.
  • delete: soft-delete a record. Required: record_id.
  • bulk_delete: soft-delete multiple records (max 16 ids, comma-separated). Required: record_ids.
  • dump: export records. Required args: source (all or a source-id), format (csv|jsonl), path.
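
A query sketch for the list contract above (16 rows per page, non-soft-deleted records only, offset = page * limit); it assumes the records table from the storage sketch, and status is approximated as "icerik filled":

const ORDERINGS: Record<string, string> = {
  "updated-desc": "updated_at DESC",
  "updated-asc": "updated_at ASC",
  "created-desc": "created_at DESC",
  "created-asc": "created_at ASC",
};

const LIMIT = 16;

function listSql(by: string = "updated-desc", page: number = 0): string {
  // Whitelisting the ordering keeps the interpolated ORDER BY safe.
  const order = ORDERINGS[by] ?? ORDERINGS["updated-desc"];
  const offset = page * LIMIT; // pages start at 0
  return `SELECT source, baslik, kategori, kaynak_url,
                 (icerik <> '') AS status
          FROM records
          WHERE deleted_at IS NULL
          ORDER BY ${order}
          LIMIT ${LIMIT} OFFSET ${offset}`;
}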

SKILL files

  • One SKILL.md for CLI usage, placed in the datasets-gathering folder.
  • Another SKILL.md for source gathering, placed in the datasets-gathering/scrape folder.
  • Reference: https://docs.github.com/en/copilot/how-tos/copilot-cli/customize-copilot/add-skills

Data handling notes

  • Persist metadata records even when ozet or icerik are missing; fill those fields during Stage 2.
  • Clean icerik to remove non-article elements and normalize whitespace before storing.
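
A rough cleaning sketch for icerik; a real pipeline would likely use a proper HTML parser, so this regex version is illustrative only:

function cleanIcerik(rawHtml: string): string {
  return rawHtml
    .replace(/<script[\s\S]*?<\/script>/gi, "") // drop embedded scripts
    .replace(/<style[\s\S]*?<\/style>/gi, "")   // drop embedded styles
    .replace(/<[^>]+>/g, " ")                   // strip remaining tags
    .replace(/\s+/g, " ")                       // normalize whitespace
    .trim();
}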

This part of the project is for learning and educational purposes.

Raw

export enum Kategori {
  POLITIKA = 1,     // each source: 80 items
  EKONOMI = 2,      // each source: 72 items
  SPOR = 3,         // each source: 72 items
  SAGLIK = 4,       // each source: 72 items
  KULTUR_SANAT = 5, // each source: 72 items
  DUNYA = 6,        // each source: 72 items
  TEKNOLOJI = 7,    // each source: 72 items
}

We need to gather and save these fields:

  • source -> where we got that content data, example: “sabah”
  • baslik -> title of the content
  • ozet -> short description of the content
  • icerik -> text of the original content; needs to be cleaned
  • kategori -> which Kategori enum value we gathered
  • yayim_tarihi -> content/release date
  • kaynak_url -> link to the content

Required content-gathering flow:

  • Collect content metadata from the sources and save these fields:
    • source
    • baslik
    • kategori
    • kaynak_url
  • When we fill the required content count for each source, gather the content data and fill the missing fields:
    • ozet
    • icerik
    • yayim_tarihi

Those gathered datasets need to be served on Hugging Face. To prepare those files we need to set up a CLI that will help clawbot organize the data properly.

This CLI tool should not have a TUI. It saves to local PGlite and dumps a single source or all sources into CSV or JSONL files. It needs to have these commands:

  • help -> shows all commands and a brief of how to use them
    • all commands also have a detailed help view with a description and an example of how to use them
  • list -> lists available data in a console table
    • lists the first 16 items ordered by updated-desc
    • lists only records that are not soft-deleted
    • lists these fields:
      • source
      • baslik
      • kategori
      • kaynak_url
      • status (filled or not)
    • optional args
      • by -> order by:
        • updated-desc
        • updated-asc
        • created-desc
        • created-asc
      • page -> a kind of offset (page * limit); pages start at 0
  • detail -> shows all information for a record
    • required args
      • record_id
  • add -> adds only brief information (metadata)
    • we might need a UUID for the record id
    • after a successful add, print the record id
    • unique_id -> source+baslik+kategori
      • baslik and kategori need to be:
        • cleaned
        • lowercased
        • stripped of spaces
    • required args
      • source
      • baslik
      • kategori
      • kaynak_url
  • update -> updates the content details of a record
    • it can't update other fields; if other fields need to be updated, the record needs to be deleted
    • it can't update already-updated or soft-deleted records
    • required args
      • record_id
      • ozet
      • icerik
      • yayim_tarihi
  • delete -> soft-deletes a record
    • required args
      • record_id
  • bulk_delete -> bulk soft-deletes records
    • max 16 record ids, comma-separated
    • required args
      • record_ids
  • dump -> dumps records matching the given args
    • required args
      • source
        • all
        • source-id
      • format
        • csv
        • jsonl
      • path

We also need SKILL.md files for CLI usage and for how to gather from the sources.

https://docs.github.com/en/copilot/how-tos/copilot-cli/customize-copilot/add-skills

One SKILL.md is for CLI use; prepare that file in the datasets-gathering folder. The other needs to be in the datasets-gathering/scrape folder.

Sources

Sabah Gazetesi Haber Sitesi -> slug: “sabah”

Behavior

Content is loaded part by part as you scroll to the bottom. Sometimes you get stuck at the footer, but if you scroll up a little and then go back to the bottom, more content loads.
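
A hypothetical Playwright loop for this scroll behavior, including the "scroll up a little, then back down" nudge described above; the URL and iteration count are illustrative:

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://www.sabah.com.tr/teknoloji"); // illustrative kategori URL

for (let i = 0; i < 20; i++) {
  await page.mouse.wheel(0, 4000);  // scroll toward the footer
  await page.waitForTimeout(1000);  // give the next batch time to load
  await page.mouse.wheel(0, -600);  // nudge up when stuck at the footer...
  await page.mouse.wheel(0, 600);   // ...then return to the bottom
}

const html = await page.content();
await browser.close();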

Rendering

Sabah example rendering

HTML Selector body > section[.contentFrame] > div[.container] > figcaption elements

HTML Element

<figcaption class="">
    <a href="/teknoloji/2025/08/28/nsosyal-kesfet-sekmesini-guncelledi" title="NSosyal Keşfet sekmesini güncelledi"
        class="">NSosyal "Keşfet" sekmesini güncelledi</a>
</figcaption>
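
An extraction sketch for this markup, assuming cheerio for parsing; only the Stage 1 fields available in the list view (baslik, kaynak_url) are pulled here:

import * as cheerio from "cheerio";

function parseSabahList(html: string): { baslik: string; kaynak_url: string }[] {
  const $ = cheerio.load(html);
  // Mirrors the selector above: body > section.contentFrame > div.container > figcaption
  return $("body > section.contentFrame > div.container figcaption > a")
    .map((_, a) => ({
      baslik: $(a).attr("title") ?? $(a).text().trim(),
      kaynak_url: new URL($(a).attr("href") ?? "", "https://www.sabah.com.tr").href,
    }))
    .get();
}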

Hürriyet Gazetesi Haber Sitesi -> slug: “hurriyet”

Behavior

Content is loaded on page load. The kategori link loads page 1. Later pages are reachable with a "?p=N" query on the URL; for example, page 3 of kategori 1: "https://www.hurriyet.com.tr/gundem/?p=3". Each page mostly contains 30 content items.
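
A small URL-builder sketch per the "?p=N" behavior; page 1 is the bare kategori URL:

const hurriyetPageUrl = (kategoriPath: string, page: number): string =>
  page <= 1
    ? `https://www.hurriyet.com.tr/${kategoriPath}/`
    : `https://www.hurriyet.com.tr/${kategoriPath}/?p=${page}`;

// hurriyetPageUrl("gundem", 3) -> "https://www.hurriyet.com.tr/gundem/?p=3"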

Rendering

Hurriyet example rendering

HTML Selector #content > section > div[.category__list] > div[.row] > div[.category__list__item]

HTML Element

<div class="category__list__item"><a target="_self"
        href="/gundem/batmanda-kaybolan-2-5-yasindaki-cocuk-17-saat-sonra-bulundu-43164146"
        title="Batmanda kaybolan 2,5 yaşındaki çocuk, 17 saat sonra bulundu" class="category__list__item--cover"><img
            data-src="https://image.hurimg.com/i/hurriyet/90/866x494/69f5b59aeb422ab0018d16fa.jpg"
            alt="Batmanda kaybolan 2,5 yaşındaki çocuk, 17 saat sonra bulundu"
            title="Batmanda kaybolan 2,5 yaşındaki çocuk, 17 saat sonra bulundu" class="entered loaded" width="866"
            height="494" data-ll-status="loaded"
            src="https://image.hurimg.com/i/hurriyet/90/866x494/69f5b59aeb422ab0018d16fa.jpg"></a>
    <div><a href="/gundem/batmanda-kaybolan-2-5-yasindaki-cocuk-17-saat-sonra-bulundu-43164146"
            title="Batmanda kaybolan 2,5 yaşındaki çocuk, 17 saat sonra bulundu" data-tag="h2">
            <h2>Batman'da kaybolan 2,5 yaşındaki çocuk, 17 saat sonra bulundu</h2>
        </a><a href="/gundem/batmanda-kaybolan-2-5-yasindaki-cocuk-17-saat-sonra-bulundu-43164146"
            title="Batmanda kaybolan 2,5 yaşındaki çocuk, 17 saat sonra bulundu" data-tag="p">
            <p>Batman'ın Gercüş ilçesine bağlı Bağlıca köyünde çoban babasının peşinden giderken kaybolan 2,5 yaşındaki
                Melike E., 17 saat sonra köye yaklaşık 6 kilometre mesafede bitkin halde bulundu. Melike'nin durumunun
                iyi olduğu öğrenildi.</p>
        </a><a target="_self" href="https://www.hurriyet.com.tr/haberleri/batman" title="Batman"
            data-google-interstitial="false" class="category__list__item--tag">#Batman</a></div>
</div>
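
An extraction sketch for this markup (cheerio assumed again); unlike Sabah's list view, the Hürriyet item also exposes an ozet candidate in the p element:

import * as cheerio from "cheerio";

function parseHurriyetList(html: string): { baslik: string; ozet: string; kaynak_url: string }[] {
  const $ = cheerio.load(html);
  // Mirrors the selector above: #content > section > div.category__list > div.row > div.category__list__item
  return $("#content > section div.category__list div.category__list__item")
    .map((_, el) => {
      const item = $(el);
      const titleLink = item.find("a[data-tag='h2']");
      return {
        baslik: titleLink.find("h2").text().trim(),
        ozet: item.find("a[data-tag='p'] p").text().trim(),
        kaynak_url: new URL(titleLink.attr("href") ?? "", "https://www.hurriyet.com.tr").href,
      };
    })
    .get();
}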

Sözcü Gazetesi Haber Sitesi -> slug: “sozcu”

Behavior

Content is loaded on scrolling to the bottom. First you need to reach the page footer. Between the content list and the footer there is a "Daha fazla yükle" ("Load more") button. You need to click it to load more content for the related kategori.
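
A hypothetical Playwright loop for the load-more flow; the kategori URL, button lookup by visible text, and iteration count are all assumptions:

import { chromium } from "playwright";

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto("https://www.sozcu.com.tr/saglik"); // illustrative kategori URL

for (let i = 0; i < 10; i++) {
  await page.keyboard.press("End");                 // reach the page footer first
  await page.getByText("Daha fazla yükle").click(); // load the next batch
  await page.waitForTimeout(1000);
}

const html = await page.content();
await browser.close();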

Rendering

Sozcu example rendering

HTML Selector body > div[.container] > section > div[.row] > div[.list-content] > div[.row]

HTML Element

<div class="row align-items-center mb-4">
    <div class="col-6">
        <a href="https://www.sozcu.com.tr/unlu-firmanin-urunlerinde-kanserojen-tespit-edildi-turkiye-de-de-satiliyordu-p316327"
            class="img-holder wide radius-base">
            <img loading="lazy"
                src="https://sozcu01.sozcucdn.com/sozcu/production/uploads/images/2026/5/pastelboya-3jpg-z_kUwQnHIkG8Qaht4T3yKQ.jpg?w=490&amp;h=276&amp;mode=crop&amp;scale=both"
                alt="Ünlü firmanın ürünlerinde kanserojen tespit edildi: Türkiye'de de satılıyordu">
        </a>
    </div>
    <div class="col-6">
        <a href="https://www.sozcu.com.tr/unlu-firmanin-urunlerinde-kanserojen-tespit-edildi-turkiye-de-de-satiliyordu-p316327"
            class="d-flex flex-column">
            <span class="d-block fs-5 fw-semibold text-truncate-2">Ünlü firmanın ürünlerinde kanserojen tespit edildi:
                Türkiye'de de satılıyordu</span>
            <span class="small text-secondary text-truncate-2">Türkiye'de de atış gerçekleştiren ünlü hobi ve kırtasiye
                ürünleri firmasının bazı model ürünlerinde kanserojen riski tespit edildi. İddia üzerine ünlü market
                ürünleri geri çağırdı. Yapılan incelemelerde ürünlerde yer alan bir maddenin solunum yolu ile vücuda
                girmesi halinde ciddi akciğer hastalıklarına hatta kansere neden olabileceği ortaya çıktı.</span>
        </a>
    </div>
</div>