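Quick assessment
A local toolkit for cleaning and inspecting text. It extracts structured information (URLs, emails, phone numbers, IP addresses, dates, hashtags, monetary amounts) and redacts personally identifiable information (emails, phone numbers, credit cards, …).
Suitable tasks
- Reuse a mature task workflow as described in the SkillHub listing.
- Read the full Skill content offline from the download package.
- Use popularity metrics to prioritize which commonly used Skills to evaluate first.
Inputs and outputs
Input: the task goal, context material, file paths, constraints, or the content to process.
Output: documents, code, check results, plans, recommendations, or operating steps generated according to the Skill's instructions.
Example tasks
- Use Clean Text Toolkit to help with my current task, and explain which inputs I need to prepare.
- Based on the Clean Text Toolkit instructions, first list the safety checks to run before use.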
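Installation
- Download the Skill ZIP provided by this site and unzip it.
- Place the unzipped Skill directory in the skills directory supported by your current AI tool.
- To view the original content online, open SKILL.md on GitHub.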
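Risk boundaries
SkillHub links to the source site's security report, but this site is not a substitute for manual review. Before use, you still need to check permissions, external dependencies, and sensitive-data boundaries.
SKILL.md documentation overview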
clean-text-toolkit
v0.3.0
A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No pandas, no nltk, no pip installs, no remote calls.
This skill is the companion to clean-csv-toolkit: that one handles structured tabular data, this one handles unstructured text.
What this skill does
- scripts/extract.py: pull structured items out of any text file. Kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date. Output to stdout (one-per-line or JSON), or to a .txt / .json / .jsonl file. Optional --unique, --sort, --with-line (prefix with the source line number).
- scripts/normalize.py: clean up messy text. Chainable transforms applied in command-line order: --trim, --collapse-spaces, --strip-blank, --to-unix, --to-crlf, --dehyphenate (rejoin OCR/PDF hyphenated line-breaks), --unsmart (smart quotes / em-dashes → ASCII), --strip-bom, --strip-zwsp (zero-width spaces and joiners), --tabs-to-spaces N, --spaces-to-tabs N, --lower / --upper / --title, --normalize-unicode NFC|NFD|NFKC|NFKD.
- scripts/redact.py: anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: email, phone, ipv4, ipv6, url, credit-card (with Luhn validation to suppress false positives), ssn-us, uuid, hex-token (32+ hex chars, typical for tokens / hashes), aws-access-key (AKIA…), jwt (three base64url segments with the eyJ header). --keep-counts makes the same value always get the same placeholder; --preserve-length pads/truncates the placeholder to the original length.
- scripts/lines.py: line-oriented utilities. --op count | dedupe | sort | shuffle | head | tail. Streams count, head, tail. dedupe and sort are O(N) memory in the number of lines, but each line is small so 1 M lines is fine on a laptop. Options: --case-insensitive, --keep first|last, --numeric, --reverse, --seed for deterministic shuffles.
- scripts/wordcount.py: word / character / line / sentence statistics. Optional --top N for most-frequent words, --stopwords PATH, --min-length N, --ignore-case, --regex PATTERN (default [A-Za-z']+).
- scripts/diff_text.py: three-mode text diff using stdlib difflib. --mode unified (default), --mode side (custom two-column layout), --mode html (writes a full HTML file with red/green coloring). Options: --ignore-case, --ignore-whitespace, --context N.
- scripts/template.py (NEW in v0.2.0): substitute placeholders in a text file with values from a JSON object or inline --set key=value overrides. Mustache ({{name}}), dollar (${name}), or percent (%(name)s) syntax. Filters: upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode. Default values: {{name ?Unknown}}. Strict mode (--strict) exits 1 if any placeholder is unresolved. No Jinja2, no eval.
- scripts/slug.py (NEW in v0.2.0): turn strings into URL-safe slugs. Single string mode (--text "Hello World") or batch mode (line-in-file -> line-out-file). Options: --separator, --max-length, --no-lower, --ascii (Unicode -> ASCII transliteration via NFKD), --keep-dots (useful for filenames), --dedupe.
- scripts/markdown.py (NEW in v0.2.0): strip Markdown to plain text, render a minimal HTML approximation, or extract structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV. For text mode, --link-style anchor|url|both controls how [text](url) is rendered.
- scripts/replace.py (NEW in v0.3.0): find-and-replace with regex / literal / word-boundary modes, capture-group back-references (\1, \2), multiple --find / --replace pairs in a single pass, or a JSON --rules file with per-rule settings. --dry-run previews matches with line:col and context; --max N caps replacements per rule. Returns exit 1 when zero replacements happen so it slots into CI.
- scripts/check_deps.sh: verify python3 is available.
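The --keep-counts behaviour described for redact.py (the same value always gets the same placeholder) could be sketched roughly as follows. This is an illustration only, not the toolkit's code: the email pattern, the function name redact_emails, and the choice to start numbering at 1 are all assumptions.

```python
import re
from typing import Dict

# Assumption: a deliberately loose email pattern for illustration only.
# The toolkit's real regex is not shown in this document.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str, template: str = "[REDACTED_{kind}_{i}]") -> str:
    """Replace each email with a placeholder; a repeated value reuses its placeholder."""
    seen: Dict[str, str] = {}

    def substitute(match: re.Match) -> str:
        value = match.group(0)
        if value not in seen:
            # Assumption: counters start at 1 and advance per distinct value.
            seen[value] = template.format(kind="email", i=len(seen) + 1)
        return seen[value]

    return EMAIL_RE.sub(substitute, text)


if __name__ == "__main__":
    print(redact_emails("a@example.com wrote to b@example.com, then a@example.com replied."))
```

Keeping a dictionary from original value to placeholder is what lets a reader of the redacted text still see that two mentions refer to the same address.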
What this skill does not do
- It does not call any LLM, web service, or remote API.
- It does not load entire files into memory unless an operation truly needs the whole file (full-content normalization, sort-and-write, diff). Streaming-friendly operations (extract, lines --op count|head|tail, wordcount for chars/lines counters) read one line at a time.
- It does not write outside the input/output paths the caller provides.
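The line-at-a-time reading mentioned above can be pictured with a small generator. This is a minimal sketch under assumptions: the hashtag pattern, the name iter_matches, and the lineno:value print format are made up for illustration and may differ from what extract.py actually does.

```python
import io
import re
from typing import Iterable, Iterator, Tuple

# Assumption: a simple hashtag pattern chosen only for the demo.
HASHTAG_RE = re.compile(r"#\w+")


def iter_matches(lines: Iterable[str], pattern: re.Pattern) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, match) pairs while holding only one line in memory."""
    for lineno, line in enumerate(lines, start=1):
        for match in pattern.finditer(line):
            yield lineno, match.group(0)


if __name__ == "__main__":
    sample = io.StringIO("no tags\nship it #release #v3\n#done\n")
    for lineno, value in iter_matches(sample, HASHTAG_RE):
        print(f"{lineno}:{value}")
```

Because a file object is itself an iterable of lines, the same function works on open(path, encoding="utf-8") without reading the whole file first.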
Quick start
1. Pull every email out of a log file
python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique
2. Find every URL and tag it with the source line
python3 scripts/extract.py article.md --kind url --with-line
3. Clean up a messy OCR dump
python3 scripts/normalize.py scanned.txt clean.txt \
--strip-bom --to-unix --dehyphenate --collapse-spaces \
--unsmart --strip-blank --normalize-unicode NFC
The transforms run in the order you list them on the command line.
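Why the order matters can be seen in a small sketch. The three functions below are simplified stand-ins written for this example, not the toolkit's implementations; the names TRANSFORMS and apply_transforms are hypothetical.

```python
import re
from typing import Callable, Dict, List


def to_unix(text: str) -> str:
    """Convert CRLF and lone CR line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def dehyphenate(text: str) -> str:
    """Rejoin a word split by a hyphen at an LF line break."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def collapse_spaces(text: str) -> str:
    """Squeeze runs of spaces and tabs into a single space."""
    return re.sub(r"[ \t]+", " ", text)


# Hypothetical registry keyed by flag name (without the leading dashes).
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "to-unix": to_unix,
    "dehyphenate": dehyphenate,
    "collapse-spaces": collapse_spaces,
}


def apply_transforms(text: str, names: List[str]) -> str:
    """Apply each named transform in the order given."""
    for name in names:
        text = TRANSFORMS[name](text)
    return text


if __name__ == "__main__":
    raw = "exam-\r\nple   text"
    print(repr(apply_transforms(raw, ["to-unix", "dehyphenate", "collapse-spaces"])))
    print(repr(apply_transforms(raw, ["dehyphenate", "to-unix", "collapse-spaces"])))
```

In this sketch the dehyphenation step only recognises LF line breaks, so running it before the line-ending conversion leaves the CRLF-split word untouched. That is one reason to put --to-unix early in a chain.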
4. Redact PII before sharing a transcript
python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
--kinds email,phone --keep-counts
# Custom template
python3 scripts/redact.py log.txt safe.txt \
--token-template "<<{kind}#{i}>>"
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length
Credit-card matches are validated against the Luhn checksum, so a random run of 16 digits is unlikely to trigger a false positive (roughly one in ten random digit strings still passes the checksum).
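For readers unfamiliar with it, the Luhn checksum is short enough to sketch in full. This is the standard algorithm written out for illustration; the function name luhn_valid is an assumption, and how redact.py wires the check into its matching is not shown in this document.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the ASCII digits in `number` pass the Luhn checksum.

    Spaces, dashes, and any other non-digit characters are ignored.
    """
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # well-known test number
    print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A candidate match that fails this check can be left unredacted, which is what keeps order IDs and other long digit runs from being mistaken for card numbers most of the time.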
5. Line utilities
# Quick file stats
python3 scripts/lines.py haystack.txt --op count
# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt
# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse
# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42
# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
6. Word counts
# Basic stats
python3 scripts/wordcount.py essay.txt
# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt
# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
7. Text diff
# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt
# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side
# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html
# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
Exit codes
| Code | Meaning |
|---|---|
| 0 | success / one or more matches / files identical |
| 1 | zero matches / zero redactions / files differ / empty input |
| 2 | bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension |
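A caller driving the scripts from Python rather than a shell might lean on the table above like this. It is a sketch of caller-side code only: the toolkit itself makes no subprocess calls, and the names run_chain and describe are hypothetical.

```python
import subprocess
from typing import Callable, Sequence


def _run(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def run_chain(commands: Sequence[Sequence[str]],
              run: Callable[[Sequence[str]], int] = _run) -> int:
    """Run commands in order and stop at the first non-zero exit, like `&&`."""
    for argv in commands:
        code = run(argv)
        if code != 0:
            return code
    return 0


def describe(code: int) -> str:
    """Map an exit code to a short label based on the table above."""
    labels = {
        0: "success",
        1: "nothing matched, files differ, or empty input",
        2: "bad arguments, unsafe path, or missing input",
    }
    return labels.get(code, "unexpected exit code")
```

A call such as run_chain([["python3", "scripts/normalize.py", "raw.txt", "clean.txt", "--to-unix"], ["python3", "scripts/redact.py", "clean.txt", "safe.txt"]]) would stop after the first step if it exits non-zero. Note that exit code 1 also stops the chain, even though it signals "nothing found" rather than a failure.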
This 0 / 1 / 2 split is consistent across every script in the toolkit, so they slot into shell pipelines cleanly:
# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
&& python3 scripts/redact.py clean.txt safe.txt \
&& python3 scripts/wordcount.py safe.txt --top 10
Safety properties
- Pure Python 3 standard library. No third-party dependencies, no pip install.
- No subprocess calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, <, $, `, etc.). This is the same safe_path() helper that powers clean-csv-toolkit.
- Scripts only read the input paths the caller provides and write to the output paths the caller provides.
- All inputs and outputs default to UTF-8; reads fall back through utf-8-sig, cp1252, latin-1 if needed. Writes are always UTF-8.
- Deterministic where it matters: shuffle --seed N is reproducible; extract and wordcount always emit results in the same order for a given input.
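The allowlist idea behind the path check in the list above can be illustrated in a few lines. The actual regex used by safe_path() is not given in this document, so the character set and the function name check_path below are assumptions chosen for the example.

```python
import re

# Assumption: allow letters, digits, underscore, dot, slash, space, and hyphen.
# Anything outside this set (; | & > < $ ` and so on) fails the match.
_ALLOWED_PATH = re.compile(r"[A-Za-z0-9_./ -]+")


def check_path(path: str) -> str:
    """Return the path unchanged if every character is allowlisted, else raise."""
    if not _ALLOWED_PATH.fullmatch(path):
        raise ValueError(f"unsafe path: {path!r}")
    return path


if __name__ == "__main__":
    for candidate in ["data/clean.txt", "out.txt; rm -rf ~", "$(whoami).txt"]:
        try:
            print("ok:", check_path(candidate))
        except ValueError as exc:
            print("rejected:", exc)
```

An allowlist rejects anything it does not explicitly permit, which tends to be a safer default than trying to enumerate every dangerous character.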
Performance
- lines.py --op dedupe processes 100,000 short lines (500 distinct) in ~0.06 s.
- lines.py --op sort processes 100,000 lines in ~0.10 s.
- extract.py scans the file in a single streaming pass, so memory does not grow with file size.
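An order-preserving dedupe of the kind timed above is usually a single pass with a set or dictionary, which is where the O(N) memory comes from. The sketch below is one plausible way to do it, not the toolkit's code; in particular, placing a kept "last" occurrence at its final position is an assumption.

```python
from typing import Dict, Iterable, List, Set


def dedupe(lines: Iterable[str], case_insensitive: bool = False,
           keep: str = "first") -> List[str]:
    """Drop duplicate lines while preserving order."""

    def key(line: str) -> str:
        return line.casefold() if case_insensitive else line

    if keep == "first":
        seen: Set[str] = set()
        result: List[str] = []
        for line in lines:
            k = key(line)
            if k not in seen:
                seen.add(k)
                result.append(line)
        return result
    if keep == "last":
        # dicts preserve insertion order, so pop-then-insert moves the key
        # to the end and keeps the most recent spelling of the line.
        latest: Dict[str, str] = {}
        for line in lines:
            k = key(line)
            latest.pop(k, None)
            latest[k] = line
        return list(latest.values())
    raise ValueError("keep must be 'first' or 'last'")


if __name__ == "__main__":
    print(dedupe(["a", "B", "b", "a"], case_insensitive=True))
    print(dedupe(["a", "B", "b", "a"], case_insensitive=True, keep="last"))
```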
Known limitations
- The PII patterns are pragmatic heuristics, not strict RFC validators. The email regex accepts user@host.tld shapes but does not validate that host.tld resolves.
- phone accepts three telltale formats (+<digits>, (XXX) XXX-XXXX, XXX-XXX-XXXX / XXX XXX XXXX) so it doesn't grab IPs, dates, or credit-card numbers, but it will miss exotic local formats.
- credit-card uses the Luhn checksum, but hex-token (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
- diff_text.py --mode html produces the standard difflib.HtmlDiff markup, which embeds inline styles. The file is portable but the styling is not customizable.
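For context on the last limitation, this is roughly what producing a report with the standard library's difflib.HtmlDiff looks like. The wrapper name html_diff and the "before" / "after" column labels are assumptions for the example; how diff_text.py calls the class is not shown here.

```python
import difflib


def html_diff(before: str, after: str, context: bool = False, numlines: int = 3) -> str:
    """Build a complete HTML page comparing two texts with difflib.HtmlDiff."""
    return difflib.HtmlDiff().make_file(
        before.splitlines(),
        after.splitlines(),
        fromdesc="before",
        todesc="after",
        context=context,    # True limits output to changed regions
        numlines=numlines,  # context lines shown around each change
    )


if __name__ == "__main__":
    page = html_diff("alpha\nbeta\n", "alpha\ngamma\n")
    print(len(page), "characters of HTML")
```

The page template and styles come from difflib itself, which is why the result is self-contained but offers little room for restyling.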
v0.3.0 changes
- Added scripts/replace.py: sed-like find-and-replace with optional regex, capture-group back-references, multiple find/replace pairs in one pass, JSON --rules file, --dry-run preview with line:col context, --max N cap per rule, --word boundaries for literal mode.
- Fixed extract.py: --kind url was grabbing trailing sentence punctuation (period, closing parenthesis, comma, etc.) as part of the URL. It now strips a single trailing punctuation char, so "Visit https://example.com." correctly extracts "https://example.com" instead of "https://example.com.".
- Fixed slug.py: --text mode with input that slugifies to an empty string (e.g. "!!! @@@") now exits 1, matching the existing batch-mode behaviour. Previously it returned 0 silently.
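The URL fix described in the list above amounts to one post-processing step on each match. A minimal sketch, assuming a loose URL pattern and a guessed set of trailing characters (neither is taken from extract.py):

```python
import re
from typing import List

# Assumptions: both the pattern and the punctuation set are illustrative.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
TRAILING_PUNCTUATION = ".,;:!?)"


def extract_urls(text: str) -> List[str]:
    """Find URLs and drop at most one trailing punctuation character from each."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        if url[-1] in TRAILING_PUNCTUATION:
            url = url[:-1]
        urls.append(url)
    return urls


if __name__ == "__main__":
    print(extract_urls("Visit https://example.com."))
    print(extract_urls("Docs: https://example.org/b?x=1, then continue."))
```

Stripping only one character is conservative: it repairs the common "URL at the end of a sentence" case without eating punctuation that legitimately belongs to the URL. A URL followed by two punctuation marks would keep the first one.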
v0.2.0 changes
- Added scripts/template.py: no-Jinja2 template renderer. Three placeholder syntaxes (mustache {{x}}, dollar ${x}, percent %(x)s), pipe filters, fallback defaults, and an optional --strict mode for CI. Hand-rolled regex tokenizer, no eval, no subprocess.
- Added scripts/slug.py: URL-safe slug generator. Single-string mode (prints to stdout) or batch mode (one slug per input line). Unicode-aware with optional ASCII transliteration via NFKD; --keep-dots for filename use; --dedupe for batch outputs.
- Added scripts/markdown.py: three-mode Markdown processor. text strips all markup; html renders a minimal HTML approximation (headings, paragraphs, lists, blockquotes, fenced code, links, images, bold/italic/code); extract pulls structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV.
- All three new scripts share the same safe-path policy and 0 / 1 / 2 exit-code contract as the rest of the toolkit.
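The NFKD transliteration mentioned for slug.py is a common standard-library idiom, sketched below. The function name slugify and its parameters loosely mirror the documented flags but are assumptions; the real script's rules may differ.

```python
import re
import unicodedata
from typing import Optional


def slugify(text: str, separator: str = "-", max_length: Optional[int] = None,
            lower: bool = True, ascii_only: bool = True) -> str:
    """Turn a string into a URL-safe slug."""
    if ascii_only:
        # NFKD splits accented letters into base letter + combining mark;
        # encoding to ASCII with errors="ignore" then drops the marks.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    if lower:
        text = text.lower()
    # Replace every run of non-alphanumeric characters with the separator.
    text = re.sub(r"[\W_]+", separator, text).strip(separator)
    if max_length is not None:
        text = text[:max_length].rstrip(separator)
    return text


if __name__ == "__main__":
    print(slugify("Hello World"))
    print(slugify("Crème Brûlée!"))
    print(repr(slugify("!!! @@@")))
```

An input made only of punctuation produces an empty string here, which corresponds to the case where the script is documented to exit 1.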
v0.1.0 changes
- First public release of clean-text-toolkit.
- Six scripts: extract.py, normalize.py, redact.py, lines.py, wordcount.py, diff_text.py.
- Shared _common.py with safe_path, read_text, iter_lines, and write_text helpers (mirrors the design of clean-csv-toolkit/scripts/_common.py).
- Bug fixed during development: initial phone regex was too greedy and matched IPs / ISO dates / credit-card-with-spaces; tightened to three explicit shapes (international, parenthesized, 3-3-4 dashed) that don't collide with those other patterns. Tested against a mixed-content fixture with 5 valid phones and 3 confusable non-phones.
- Zero third-party dependencies; works on any system that ships Python 3.
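The read_text helper listed above is described under Safety properties as defaulting to UTF-8 and falling back through utf-8-sig, cp1252, and latin-1. One way such a fallback chain could look is sketched here; the function name decode_with_fallback and its return shape are assumptions, not the helper's real signature.

```python
from typing import Tuple

# Order taken from the description in this document.
ENCODINGS = ("utf-8", "utf-8-sig", "cp1252", "latin-1")


def decode_with_fallback(data: bytes) -> Tuple[str, str]:
    """Decode bytes with the first encoding that succeeds.

    Returns (text, encoding_used). latin-1 maps every byte to a character,
    so the loop always returns before falling through.
    """
    for encoding in ENCODINGS:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise AssertionError("unreachable: latin-1 accepts any byte sequence")


if __name__ == "__main__":
    for raw in ("héllo".encode("utf-8"), b"caf\xe9", b"\x81"):
        text, used = decode_with_fallback(raw)
        print(repr(text), "via", used)
```

Ending the chain with latin-1 means a read never fails on encoding grounds, at the cost of possibly showing the wrong characters for text in some other legacy encoding.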
Pairs well with
- clean-csv-toolkit: same author, same design philosophy (pure stdlib, exit-code contract, safe-path policy), for structured tabular data.
- openclaw-prompt-shield: pair extract.py --kind email,url with prompt-shield's redaction pipeline to scrub user-supplied text before passing it to an LLM.
License
MIT