Skill details

Clean Text Toolkit

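A local toolkit for cleaning and inspecting text. It extracts structured information (URLs, emails, phone numbers, IPs, dates, hashtags, monetary amounts) and redacts personally identifiable information (email / phone / credit card…).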

Source platform: SkillHub
Source identifier: SkillHub/clean-text-toolkit
Source file: original description
Category: Data processing
Risk level: High
Popularity: High attention
Downloads: 273
Stars: 1
Document version: 0.3.0

Quick assessment

Last verified: 2026-05-27
Download copy: ZIP available

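Suitable tasks

  • Reuse established task workflows as described in the SkillHub listing.
  • Read the full Skill content offline from the download package.
  • Use the popularity metrics to prioritize which commonly used Skills to evaluate.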

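Input and output

Input: task goals, context materials, file paths, constraints, or the content to be processed.

Output: documents, code, check results, plans, recommendations, or operating steps generated according to the Skill description.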

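Example tasks

  • Use Clean Text Toolkit to help with my current task, and explain which inputs I need to prepare.
  • Based on the Clean Text Toolkit description, first list the safety checks to run before use.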

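Installation

  1. Download the Skill ZIP provided by this site and extract it.
  2. Place the extracted Skill directory in the skills directory supported by your current AI tool.
  3. To view the original content online, open the SKILL.md on GitHub.

Original online path: skillhub-clean-text-toolkit/SKILL.md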

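Risk boundaries

SkillHub provides a link to the source site's security report, but this site is not a substitute for manual review. Before use, you still need to check permissions, external dependencies, and sensitive-data boundaries.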

SKILL.md documentation

clean-text-toolkit

v0.3.0

A small, honest local toolkit for the work agents end up doing constantly: read some text someone sent you, find the structured bits, clean it up, redact the secrets, and forward it downstream. Built on Python 3 standard library only. No pandas, no nltk, no pip installs, no remote calls.

This skill is the companion to clean-csv-toolkit: that one handles structured tabular data, this one handles unstructured text.

What this skill does

  • scripts/extract.py — pull structured items out of any text file. Kinds: url, email, phone, ipv4, ipv6, hashtag, mention, hex-color, money, iso-date. Output to stdout (one-per-line or JSON), or to a .txt / .json / .jsonl file. Optional --unique, --sort, --with-line (prefix with the source line number).
  • scripts/normalize.py — clean up messy text. Chainable transforms applied in command-line order: --trim, --collapse-spaces, --strip-blank, --to-unix, --to-crlf, --dehyphenate (rejoin OCR/PDF hyphenated line-breaks), --unsmart (smart quotes / em-dashes → ASCII), --strip-bom, --strip-zwsp (zero-width spaces and joiners), --tabs-to-spaces N, --spaces-to-tabs N, --lower / --upper / --title, --normalize-unicode NFC|NFD|NFKC|NFKD.
  • scripts/redact.py — anonymize text by replacing PII-like patterns with placeholder tokens. Kinds: email, phone, ipv4, ipv6, url, credit-card (with Luhn validation to suppress false positives), ssn-us, uuid, hex-token (32+ hex chars, typical for tokens / hashes), aws-access-key (AKIA…), jwt (three base64url segments with the eyJ header). --keep-counts makes the same value always get the same placeholder; --preserve-length pads/truncates the placeholder to the original length.
  • scripts/lines.py — line-oriented utilities. --op count | dedupe | sort | shuffle | head | tail. Streams count, head, tail. dedupe and sort are O(N) memory in the number of lines, but each line is small so 1 M lines is fine on a laptop. --case-insensitive, --keep first|last, --numeric, --reverse, --seed for deterministic shuffles.
  • scripts/wordcount.py — word / character / line / sentence statistics. Optional --top N for most-frequent words, --stopwords PATH, --min-length N, --ignore-case, --regex PATTERN (default [A-Za-z']+).
  • scripts/diff_text.py — three-mode text diff using stdlib difflib. --mode unified (default), --mode side (custom two-column layout), --mode html (writes a full HTML file with red/green coloring). --ignore-case, --ignore-whitespace, --context N.
  • scripts/template.py (NEW in v0.2.0) — substitute placeholders in a text file with values from a JSON object or inline --set key=value overrides. Mustache ({{name}}), dollar (${name}), or percent (%(name)s) syntax. Filters: upper, lower, title, strip, capitalize, reverse, len, escape-html, escape-json, urlencode. Default values: {{name ?Unknown}}. Strict mode (--strict) exits 1 if any placeholder is unresolved. No Jinja2, no eval.
  • scripts/slug.py (NEW in v0.2.0) — turn strings into URL-safe slugs. Single string mode (--text "Hello World") or batch mode (line-in-file -> line-out-file). Options: --separator, --max-length, --no-lower, --ascii (Unicode -> ASCII transliteration via NFKD), --keep-dots (useful for filenames), --dedupe.
  • scripts/markdown.py (NEW in v0.2.0) — strip Markdown to plain text, render a minimal HTML approximation, or extract structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV. For text mode, --link-style anchor|url|both controls how [text](url) is rendered.
  • scripts/replace.py (NEW in v0.3.0) — find-and-replace with regex / literal / word-boundary modes, capture-group back-references (\1, \2), multiple --find/--replace pairs in a single pass, or a JSON --rules file with per-rule settings. --dry-run previews matches with line:col and context; --max N caps replacements per rule. Returns exit 1 when zero replacements happen so it slots into CI.
  • scripts/check_deps.sh — verify python3 is available.
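
As one illustration of the techniques listed above, here is a minimal sketch of NFKD-based slug generation in the spirit of slug.py. The function name slugify and its parameters are hypothetical stand-ins for the documented options (--separator, --max-length, --no-lower, --ascii); the script's real implementation is not shown in this document and may differ.

```python
import re
import unicodedata

def slugify(text, separator="-", max_length=None, lower=True, ascii_only=False):
    """Illustrative helper; the real slug.py may differ in every detail."""
    if ascii_only:
        # NFKD splits an accented letter into a base letter plus a combining
        # mark; encoding to ASCII with errors="ignore" then drops the mark.
        text = unicodedata.normalize("NFKD", text)
        text = text.encode("ascii", "ignore").decode("ascii")
    if lower:
        text = text.lower()
    # Collapse every run of non-alphanumeric characters into one separator.
    slug = re.sub(r"[\W_]+", separator, text).strip(separator)
    if max_length is not None:
        slug = slug[:max_length].rstrip(separator)
    return slug

print(slugify("Crème Brûlée!", ascii_only=True))  # creme-brulee
print(slugify("Hello World"))                      # hello-world
print(repr(slugify("!!! @@@")))                    # ''
```

The last call shows the empty-slug case that the v0.3.0 notes say slug.py reports with exit code 1.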

What this skill does not do

  • It does not call any LLM, web service, or remote API.
  • It does not load entire files into memory unless an operation truly needs the whole file (full-content normalization, sort-and-write, diff). Streaming-friendly operations (extract, lines --op count|head|tail, wordcount for chars/lines counters) read one line at a time.
  • It does not write outside the input/output paths the caller provides.

Quick start

1. Pull every email out of a log file

python3 scripts/extract.py app.log --kind email --unique --sort
python3 scripts/extract.py app.log --kind email --output emails.txt --unique

2. Find every URL and tag it with the source line

python3 scripts/extract.py article.md --kind url --with-line
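
For readers who want to see the idea behind these commands, the sketch below scans lines one at a time and yields email matches, optionally de-duplicated or tagged with the line number. The regex and the name extract_emails are assumptions made for illustration; extract.py's actual patterns are not reproduced in this document.

```python
import io
import re

# Deliberately loose heuristic for "user@host.tld" shapes, not an RFC validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(lines, unique=False, with_line=False):
    """Yield matches while reading one line at a time."""
    seen = set()
    for lineno, line in enumerate(lines, start=1):
        for match in EMAIL_RE.finditer(line):
            value = match.group(0)
            if unique:
                if value in seen:
                    continue
                seen.add(value)
            yield (lineno, value) if with_line else value

sample = io.StringIO("to: a@example.com\ncc: b@example.org, a@example.com\n")
print(sorted(extract_emails(sample, unique=True)))
# ['a@example.com', 'b@example.org']
```

In this sketch the only state that grows is the set of distinct values kept for the unique option; without it, each line is processed and forgotten.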

3. Clean up a messy OCR dump

python3 scripts/normalize.py scanned.txt clean.txt \
    --strip-bom --to-unix --dehyphenate --collapse-spaces \
    --unsmart --strip-blank --normalize-unicode NFC

The transforms run in the order you list them on the command line.
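
Order can change the result. The sketch below uses three simplified, hypothetical transforms (they are not normalize.py's code) to show how trimming before dehyphenating differs from the reverse.

```python
import re

def trim(text):
    # Strip leading and trailing whitespace from every line.
    return "\n".join(line.strip() for line in text.split("\n"))

def collapse_spaces(text):
    # Squeeze runs of spaces and tabs into a single space.
    return re.sub(r"[ \t]+", " ", text)

def dehyphenate(text):
    # Rejoin a word split across a line break: "exam-\nple" -> "example".
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

TRANSFORMS = {
    "trim": trim,
    "collapse-spaces": collapse_spaces,
    "dehyphenate": dehyphenate,
}

def apply_in_order(text, names):
    """Apply the named transforms strictly in the order given."""
    for name in names:
        text = TRANSFORMS[name](text)
    return text

messy = "exam- \nple"  # note the stray space after the hyphen
print(repr(apply_in_order(messy, ["trim", "dehyphenate"])))  # 'example'
print(repr(apply_in_order(messy, ["dehyphenate", "trim"])))  # 'exam-\nple'
```

In the second call the stray space is still present when dehyphenate runs, so the split word is never rejoined. This naive dehyphenate would also merge genuine compounds such as "well-" / "known"; it is only meant to show the ordering effect.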

4. Redact PII before sharing a transcript

python3 scripts/redact.py transcript.txt safe.txt
# default kinds = all
# default placeholder = [REDACTED_{kind}_{i}]
# Only redact emails and phones, give the same email the same placeholder
python3 scripts/redact.py transcript.txt safe.txt \
    --kinds email,phone --keep-counts
# Custom template
python3 scripts/redact.py log.txt safe.txt \
    --token-template "<<{kind}#{i}>>"
# Pad placeholder to match original length (for fixed-width layouts)
python3 scripts/redact.py log.txt safe.txt --preserve-length
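
The effect of --keep-counts can be pictured with a small sketch. It assumes the counter starts at 1 and that the kind is written in lowercase; neither detail is stated above, and redact_emails is a hypothetical name rather than the script's API.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact_emails(text, template="[REDACTED_{kind}_{i}]", keep_counts=False):
    """Replace each email with a numbered placeholder."""
    assigned = {}  # value -> placeholder, consulted only when keep_counts is on
    counter = 0

    def replace(match):
        nonlocal counter
        value = match.group(0)
        if keep_counts and value in assigned:
            return assigned[value]  # same value, same placeholder
        counter += 1
        token = template.format(kind="email", i=counter)
        assigned[value] = token
        return token

    return EMAIL_RE.sub(replace, text)

text = "a@x.io b@x.io a@x.io"
print(redact_emails(text))
# [REDACTED_email_1] [REDACTED_email_2] [REDACTED_email_3]
print(redact_emails(text, keep_counts=True))
# [REDACTED_email_1] [REDACTED_email_2] [REDACTED_email_1]
print(redact_emails(text, template="<<{kind}#{i}>>", keep_counts=True))
# <<email#1>> <<email#2>> <<email#1>>
```

Stable placeholders keep a redacted transcript readable, because a reader can still tell that the first and third mentions refer to the same address.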

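Credit-card matches are validated against the Luhn checksum, which rejects most arbitrary 16-digit runs. The checksum is a single check digit, so roughly one random digit string in ten still passes; treat it as a false-positive filter, not as proof that a number is a real card.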
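
For reference, the Luhn checksum itself is short. The sketch below is a generic implementation of the published algorithm, not code taken from redact.py, and the name luhn_valid is illustrative.

```python
def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True  (a widely used test number)
print(luhn_valid("4111 1111 1111 1112"))  # False
print(luhn_valid("1234567812345678"))     # False

# For a fixed 15-digit prefix, exactly one final digit passes the check.
print(sum(luhn_valid("411111111111111" + str(d)) for d in range(10)))  # 1
```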

5. Line utilities

# Quick file stats
python3 scripts/lines.py haystack.txt --op count

# Drop duplicates, case-insensitive
python3 scripts/lines.py users.txt --op dedupe --case-insensitive --output unique.txt

# Numeric sort (so "100" > "23" > "7")
python3 scripts/lines.py scores.txt --op sort --numeric --reverse

# Deterministic shuffle
python3 scripts/lines.py prompts.txt --op shuffle --seed 42

# Look at the head and tail of a multi-gig log
python3 scripts/lines.py huge.log --op head -n 20
python3 scripts/lines.py huge.log --op tail -n 20
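
A rough sketch of what dedupe and a seeded shuffle involve, using only the standard library. The ordering chosen for keep="last" (each line sits where its last occurrence was) is an assumption; the document does not say how lines.py orders its output in that mode.

```python
import random

def dedupe(lines, case_insensitive=False, keep="first"):
    """Drop duplicate lines, keeping the first or the last occurrence."""
    chosen = {}  # comparison key -> line; dicts preserve insertion order
    for line in lines:
        key = line.lower() if case_insensitive else line
        if keep == "last":
            chosen.pop(key, None)  # re-insert so order follows the last hit
            chosen[key] = line
        elif key not in chosen:
            chosen[key] = line
    return list(chosen.values())

def shuffled(lines, seed=None):
    """Return a shuffled copy; the same seed gives the same order."""
    rng = random.Random(seed)
    result = list(lines)
    rng.shuffle(result)
    return result

names = ["Bob", "alice", "bob"]
print(dedupe(names, case_insensitive=True))               # ['Bob', 'alice']
print(dedupe(names, case_insensitive=True, keep="last"))  # ['alice', 'bob']
print(shuffled(names, seed=42) == shuffled(names, seed=42))  # True
```

Using a private random.Random instance keeps the shuffle reproducible without touching the global random state.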

6. Word counts

# Basic stats
python3 scripts/wordcount.py essay.txt

# Top words with stopwords filter
python3 scripts/wordcount.py essay.txt --top 20 --ignore-case --stopwords stop.txt

# Machine-readable output
python3 scripts/wordcount.py essay.txt --top 10 --json > stats.json
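
The top-N count can be approximated in a few lines with collections.Counter and the default word pattern quoted earlier ([A-Za-z']+). The function name top_words and its parameters are illustrative stand-ins for the documented flags, not the script's interface.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")  # the default pattern named in this document

def top_words(text, n=10, ignore_case=False, stopwords=(), min_length=1):
    """Return the n most frequent words as (word, count) pairs."""
    if ignore_case:
        text = text.lower()
    stop = set(stopwords)
    words = [
        word for word in WORD_RE.findall(text)
        if len(word) >= min_length and word not in stop
    ]
    return Counter(words).most_common(n)

essay = "The cat and the hat. The end."
print(top_words(essay, n=1, ignore_case=True, stopwords={"and"}))
# [('the', 3)]
```

Note that the default pattern only covers ASCII letters and apostrophes, so text in other scripts would need a different --regex.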

7. Text diff

# Standard unified diff
python3 scripts/diff_text.py before.txt after.txt

# Side-by-side
python3 scripts/diff_text.py before.txt after.txt --mode side

# HTML report (colorized) for sharing
python3 scripts/diff_text.py before.txt after.txt --mode html --output diff.html

# Whitespace-insensitive compare
python3 scripts/diff_text.py before.txt after.txt --ignore-whitespace
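
Since diff_text.py is described as built on the standard difflib module, the unified and HTML modes correspond to calls like the ones sketched here. The wrapper functions unified and html_report are hypothetical; only the difflib calls themselves are standard.

```python
import difflib

def unified(before, after, context=3):
    """Return a unified diff as one string; empty if the texts are identical."""
    return "\n".join(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before.txt",
            tofile="after.txt",
            n=context,
            lineterm="",
        )
    )

def html_report(before, after):
    """Return a complete HTML page with a side-by-side comparison table."""
    return difflib.HtmlDiff().make_file(
        before.splitlines(), after.splitlines(), "before.txt", "after.txt"
    )

diff = unified("a\nb\n", "a\nc\n")
print(diff)
# An exit status in the style of the toolkit: 0 if identical, 1 if different.
print(1 if diff else 0)
```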

Exit codes

| Code | Meaning |
|---|---|
| 0 | success / one or more matches / files identical |
| 1 | zero matches / zero redactions / files differ / empty input |
| 2 | bad arguments / unsafe path / missing input / unknown kind / bad regex / unsupported output extension |

This 0 / 1 / 2 split is consistent across the toolkit's scripts, so they slot into shell pipelines cleanly:

# Normalize, then redact, then count words in one shot
python3 scripts/normalize.py raw.txt clean.txt --to-unix --dehyphenate \
  && python3 scripts/redact.py clean.txt safe.txt \
  && python3 scripts/wordcount.py safe.txt --top 10
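
One consequence of chaining with && is that the shell stops at any non-zero status, so if redact.py reports zero redactions with exit 1 as the table describes, the word count never runs. A caller that wants to treat 1 as "nothing found" and only 2 as a failure could wrap the steps along these lines. This is a caller-side sketch; classify and run_step are hypothetical names, not part of the toolkit.

```python
import subprocess
import sys

def classify(code):
    """Map an exit code to the three buckets from the table above."""
    if code == 0:
        return "ok"
    if code == 1:
        return "empty"  # zero matches / zero redactions / files differ / empty input
    return "error"      # 2: bad arguments, unsafe path, missing input, ...

def run_step(argv):
    """Run one command and return its bucket; raise only on a hard error."""
    result = subprocess.run(argv)
    outcome = classify(result.returncode)
    if outcome == "error":
        raise RuntimeError(f"{argv[0]} failed with exit code {result.returncode}")
    return outcome

# Stand-in command that exits with status 1, in place of a real toolkit script.
print(run_step([sys.executable, "-c", "raise SystemExit(1)"]))  # empty
```

With real scripts, argv would be something like ["python3", "scripts/redact.py", "clean.txt", "safe.txt"], and the caller decides whether an "empty" outcome should halt the pipeline.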

Safety properties

  • Pure Python 3 standard library. No third-party dependencies, no pip install.
  • No subprocess calls. No shell invocation.
  • All file paths are validated against a strict allowlist regex that rejects shell metacharacters (;, |, &, >, <, $, `, etc.). This is the same safe_path() helper that powers clean-csv-toolkit.
  • Scripts only read the input paths the caller provides and write to the output paths the caller provides.
  • All inputs and outputs default to UTF-8; reads fall back through utf-8-sig, cp1252, latin-1 if needed. Writes are always UTF-8.
  • Deterministic where it matters: shuffle --seed N is reproducible; extract and wordcount always emit results in the same order for a given input.
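
Two of the properties above, the path allowlist and the encoding fallback, can be sketched as follows. The regex here is a guess at what a strict allowlist might look like; the real safe_path() pattern is not given in this document, and decode_with_fallback is a hypothetical name.

```python
import re

# Hypothetical allowlist: letters, digits, and a handful of path characters.
SAFE_PATH_RE = re.compile(r"[A-Za-z0-9_./-]+")

def safe_path(path):
    """Return path unchanged if every character is allowlisted, else raise."""
    if not SAFE_PATH_RE.fullmatch(path):
        raise ValueError(f"unsafe path: {path!r}")
    return path

def decode_with_fallback(data, encodings=("utf-8", "utf-8-sig", "cp1252", "latin-1")):
    """Try each encoding in turn and return the first successful decode."""
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no encoding could decode the input")

print(safe_path("data/in.txt"))                        # data/in.txt
print(decode_with_fallback("é".encode("cp1252")))      # é
print(repr(decode_with_fallback(b"\xef\xbb\xbfhi")))   # '\ufeffhi'
```

An allowlist rejects anything not explicitly permitted, so characters such as ; or $ fail without being enumerated. The last line shows that plain utf-8 decodes a BOM-prefixed file and keeps U+FEFF as the first character, which is the case --strip-bom exists for. Because latin-1 maps every byte, the final raise is reachable only with a custom encoding list.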

Performance

  • lines.py --op dedupe processes 100,000 short lines (500 distinct) in ~0.06 s.
  • lines.py --op sort processes 100,000 lines in ~0.10 s.
  • extract.py scans the file in a single streaming pass — memory does not grow with file size.

Known limitations

  • The PII patterns are pragmatic heuristics, not strict RFC validators. The email regex accepts user@host.tld shapes but does not validate that host.tld resolves. phone accepts three telltale formats (+<digits>, (XXX) XXX-XXXX, XXX-XXX-XXXX / XXX XXX XXXX) so it doesn't grab IPs, dates, or credit-card numbers — but it will miss exotic local formats.
  • credit-card uses the Luhn checksum, but hex-token (and similar high-recall patterns) intentionally over-match; review the count before sharing redacted output publicly.
  • diff_text.py --mode html produces the standard difflib.HtmlDiff markup, which carries its own embedded stylesheet. The file is portable, but the styling is not customizable.
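
To make the phone heuristic concrete, here is one possible regex for the three shapes described in the first limitation. It is a reconstruction for illustration only: the 7 to 15 digit range for the international form and the digit-boundary lookarounds are assumptions, and redact.py's actual pattern may differ.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?<!\d)                       # do not start in the middle of a digit run
    (?:
        \+\d{7,15}                # international: + followed by digits
      | \(\d{3}\)[ ]\d{3}-\d{4}   # (XXX) XXX-XXXX
      | \d{3}-\d{3}-\d{4}         # XXX-XXX-XXXX
      | \d{3}[ ]\d{3}[ ]\d{4}     # XXX XXX XXXX
    )
    (?!\d)                        # do not stop in the middle of a digit run
    """,
    re.VERBOSE,
)

sample = (
    "call +14155550123 or (415) 555-0123, 415-555-0123, 415 555 0123; "
    "ip 192.168.1.1, date 2024-01-15, card 4111 1111 1111 1111"
)
print(PHONE_RE.findall(sample))
# ['+14155550123', '(415) 555-0123', '415-555-0123', '415 555 0123']
```

The IP address, ISO date, and spaced card number are skipped because none of them fits a 3-3-4 grouping or starts with a plus sign.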

v0.3.0 changes

  • Added scripts/replace.py: sed-like find-and-replace with optional regex, capture-group back-references, multiple find/replace pairs in one pass, JSON --rules file, --dry-run preview with line:col context, --max N cap per rule, --word boundaries for literal mode.
  • Fixed extract.py: --kind url was capturing trailing sentence punctuation (such as ., ), or ,) as part of the URL. It now strips a single trailing punctuation character, so the text "Visit https://example.com." yields https://example.com rather than the URL with the final period attached.
  • Fixed slug.py: --text mode with input that slugifies to an empty string (e.g. "!!! @@@") now exits 1, matching the existing batch-mode behaviour. Previously it returned 0 silently.
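
The URL fix can be pictured with a short sketch. The URL pattern and the set of characters treated as trailing punctuation are assumptions; the notes above name only ".", ")", and "," explicitly.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>\"']+")
TRAILING = ".,;:!?)"  # assumed set of sentence punctuation

def extract_urls(text):
    """Find URLs and strip at most one trailing punctuation character."""
    urls = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        if url[-1] in TRAILING:
            url = url[:-1]  # exactly one character, as the changelog describes
        urls.append(url)
    return urls

print(extract_urls("Visit https://example.com."))
# ['https://example.com']
print(extract_urls("(see https://example.com/a)."))
# ['https://example.com/a)']
```

The second call shows the limit of removing a single character in this sketch: when a URL is followed by two punctuation marks, one of them remains.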

v0.2.0 changes

  • Added scripts/template.py: no-Jinja2 template renderer. Three placeholder syntaxes (mustache {{x}}, dollar ${x}, percent %(x)s), pipe filters, fallback defaults, and an optional --strict mode for CI. Hand-rolled regex tokenizer, no eval, no subprocess.
  • Added scripts/slug.py: URL-safe slug generator. Single-string mode (prints to stdout) or batch mode (one slug per input line). Unicode-aware with optional ASCII transliteration via NFKD; --keep-dots for filename use; --dedupe for batch outputs.
  • Added scripts/markdown.py: three-mode Markdown processor. text strips all markup; html renders a minimal HTML approximation (headings, paragraphs, lists, blockquotes, fenced code, links, images, bold/italic/code); extract pulls structured items (headings, links, images, code blocks, list items) as JSON / JSONL / TSV.
  • All three new scripts share the same safe-path policy and 0 / 1 / 2 exit-code contract as the rest of the toolkit.
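
A regex-based renderer of the kind described for template.py might look like the sketch below. It covers only the mustache syntax with pipe filters and a strict mode; the exact placeholder grammar, the filter subset, and the choice to leave unresolved placeholders untouched in non-strict mode are all assumptions, and render is a hypothetical name.

```python
import html
import re

FILTERS = {
    "upper": str.upper,
    "lower": str.lower,
    "title": str.title,
    "strip": str.strip,
    "escape-html": html.escape,
}

# {{ name }} or {{ name|filter|filter }}; no eval, just a regex.
PLACEHOLDER_RE = re.compile(r"\{\{\s*([A-Za-z_]\w*)((?:\s*\|\s*[\w-]+)*)\s*\}\}")

def render(template, values, strict=False):
    """Substitute placeholders from `values`, applying filters left to right."""
    def substitute(match):
        name, chain = match.group(1), match.group(2)
        if name not in values:
            if strict:
                raise KeyError(name)
            return match.group(0)  # assumed: leave unresolved text as-is
        result = str(values[name])
        for part in chain.split("|"):
            if part.strip():
                result = FILTERS[part.strip()](result)
        return result

    return PLACEHOLDER_RE.sub(substitute, template)

print(render("Hi {{ name|upper }}!", {"name": "ada"}))  # Hi ADA!
print(render("{{ t|escape-html }}", {"t": "<b>"}))      # &lt;b&gt;
print(render("{{ missing }}", {}))                      # {{ missing }}
```

Looking filters up in a fixed dictionary means a template can only call functions the renderer explicitly exposes, which is the safety argument for avoiding eval.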

v0.1.0 changes

  • First public release of clean-text-toolkit.
  • Six scripts: extract.py, normalize.py, redact.py, lines.py, wordcount.py, diff_text.py.
  • Shared _common.py with safe_path, read_text, iter_lines, and write_text helpers (mirrors the design of clean-csv-toolkit/scripts/_common.py).
  • Bug fixed during development: initial phone regex was too greedy and matched IPs / ISO dates / credit-card-with-spaces; tightened to three explicit shapes (international, parenthesized, 3-3-4 dashed) that don't collide with those other patterns. Tested against a mixed-content fixture with 5 valid phones and 3 confusable non-phones.
  • Zero third-party dependencies; works on any system that ships Python 3.

Pairs well with

  • clean-csv-toolkit — same author, same design philosophy (pure stdlib, exit-code contract, safe-path policy), for structured tabular data.
  • openclaw-prompt-shield — pair extract.py --kind email,url with prompt-shield's redaction pipeline to scrub user-supplied text before passing it to an LLM.

License

MIT
