Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Sep 12, 2025 - Python
Crawler for linguistic corpora
Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.
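A filter pipeline of the kind described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the repository's actual code: the function names, the tiny `STOP_WORDS` set, and the URL pattern are all assumptions made for the example.

```python
import re
import string
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "of", "the", "to"})


def strip_diacritics(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_punctuation(text):
    """Remove ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens found in the stop-word set."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)


def regex_filter(pattern, replacement=""):
    """Build a filter that substitutes every match of a custom regex."""
    compiled = re.compile(pattern)
    return lambda text: compiled.sub(replacement, text)


def preprocess(text, filters):
    """Apply each filter in order, then collapse runs of whitespace."""
    for apply_filter in filters:
        text = apply_filter(text)
    return " ".join(text.split())


# The order matters: URLs are removed before punctuation is stripped,
# otherwise the URL pattern would no longer match.
pipeline = [
    strip_diacritics,
    regex_filter(r"https?://\S+"),
    strip_punctuation,
    remove_stop_words,
]

print(preprocess("The café of Zoë: see https://example.org now!", pipeline))
```

Keeping each filter as a plain function from string to string makes the pipeline customizable: filters can be reordered, dropped, or replaced without touching `preprocess`.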
Extract text from Vikidia/Wikipedia articles [fr]
Crawl Ask.fm Q&A lists and build a corpus for machine learning.
EU AI Act RAG — End-to-end retrieval-augmented generation pipeline: SPARQL corpus builder, Cloudflare Workers AI backend, and Streamlit playground for querying Regulation (EU) 2024/1689
App and scripts that work with the corpus builder CorpusCook to update a corpus with corrections for wrong predictions
CLI tool to redact and publish spam/phishing emails as a public research corpus.
A Scrapy-based web scraper for collecting Kurdish text data from websites. The tool recursively crawls specified domains, extracts article content using Trafilatura, and filters results by language using Facebook's FastText language identification model.
Build an English corpus from YouTube comments and view the comment networks in Gephi