
Skill Details

glmocr

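Trigger conditions: (1) the user wants to extract text, tables, formulas, or structured data from an image, PDF, or scanned document; (2) the user mentions "OCR", "text recognition", or "document parsing"; (3) the user has a document (screenshot, scanned page, invoice, paper document, whiteboard photo) and needs its content presented in structured form; (4) the user asks to parse, digitize, or extract content from a visual document. Parses documents through the Zhipu cloud API by calling the GLM-OCR SDK (pip install glmocr). No GPU required. Returns structured JSON (labeled regions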

Source platform: ModelScope
Source ID: ModelScope/zai-org/glmocr-sdk
Source file: original description
Category: Office productivity | Popularity: Very popular | Risk: High | Downloads: 262 | Visits: 1.5k | Stars: 142 | Tags: ModelScope, GitHub Copilot
Document version: master

Quick Assessment

Last verified: 2026-04-02
Download copy: ZIP available

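Suitable Tasks

  • Complete platform, development, or workflow tasks as described in the ModelScope listing.
  • Save the Skill content offline via the download package.
  • Use the download, visit, and like counts to assess priority.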

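Inputs and Outputs

Input: task goal, context material, platform information, file paths, constraints, or the content to be processed.

Output: documents, code, check results, plans, recommendations, or operating steps generated according to the Skill instructions.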

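Example Tasks

  • Use glmocr to help me complete the current task, and confirm the necessary context first.
  • Following the glmocr instructions, list the operating steps and risk checkpoints.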

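Installation

  1. Download the Skill ZIP provided by this site and unzip it.
  2. Place the unzipped Skill directory in the skills directory supported by your AI tool.
  3. To view the original content online, open SKILL.md on GitHub.

Original online location: modelscope-zai-org-glmocr-sdk/SKILL.md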

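Risk Boundaries

Before use, check permissions, external dependencies, and the types of data to be processed. For anything involving third-party platform data, payments, deployment, accounts, or keys, verify against the official documentation first.

SKILL.md Documentation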

OpenClaw Skill: glmocr

Parses documents (images, PDFs, scans) via the GLM-OCR SDK.

> 📌 On-demand: This skill requires only ZHIPU_API_KEY in the environment. No YAML config files or GPU needed.

⚡ Quick Start

# Install
pip install glmocr

# Set API key (once)
export ZHIPU_API_KEY=sk-xxx
# or add to .env file in working directory:
echo "ZHIPU_API_KEY=sk-xxx" >> .env

# One-liner (Python)
import glmocr
result = glmocr.parse("document.pdf")
print(result.markdown_result)
print(result.to_dict())

# CLI: pass the API key directly (no env setup needed)
glmocr parse image.png --api-key sk-xxx

# Or load from a specific .env file
glmocr parse image.png --env-file /path/to/.env

# Or rely on env var / auto-discovered .env (set once, then omit)
glmocr parse image.png
glmocr parse ./scans/ --output ./output/ --stdout

---

Configuration Priority

Constructor kwargs  >  os.environ  >  .env file  >  config.yaml  >  built-in defaults

Agents override everything via constructor kwargs or env vars — no YAML editing needed.
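
The precedence chain above can be pictured as a first-match lookup across the five sources. The following is a minimal sketch of that idea only; `resolve_setting` is a hypothetical helper, not part of the glmocr SDK, and it assumes every source has already been normalised to the same key names (the real SDK maps names such as `timeout` and `GLMOCR_TIMEOUT` itself).

```python
from typing import Any, Mapping


def resolve_setting(
    name: str,
    kwargs: Mapping[str, Any],
    environ: Mapping[str, Any],
    dotenv: Mapping[str, Any],
    yaml_cfg: Mapping[str, Any],
    defaults: Mapping[str, Any],
) -> Any:
    """Return the first non-None value, walking sources from highest to lowest priority.

    Hypothetical illustration of the documented order; not the SDK's implementation.
    """
    for source in (kwargs, environ, dotenv, yaml_cfg, defaults):
        if source.get(name) is not None:
            return source[name]
    return None


value = resolve_setting(
    "timeout",
    kwargs={},                   # nothing passed to the constructor
    environ={"timeout": "900"},  # set in the process environment
    dotenv={"timeout": "300"},   # set in .env, but lower priority
    yaml_cfg={},
    defaults={"timeout": 600},
)
print(value)  # prints 900
```

Values taken from the environment or a .env file arrive as strings, so a caller would still need to convert them (for example with `int(value)`) before use.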

Key Environment Variables

| Variable | Description | Example |
| ---------------------- | -------------------------------------- | ----------- |
| ZHIPU_API_KEY | API key (required for MaaS) | sk-abc123 |
| GLMOCR_MODEL | Model name | glm-ocr |
| GLMOCR_TIMEOUT | Request timeout (seconds) | 600 |
| GLMOCR_ENABLE_LAYOUT | Layout detection on/off | true |
| GLMOCR_LOG_LEVEL | DEBUG / INFO / WARNING / ERROR | INFO |

---

Python API

Convenience function (single call)

import glmocr

# Single file → PipelineResult
result = glmocr.parse("invoice.png")

# Multiple files → list[PipelineResult]
results = glmocr.parse(["page1.png", "page2.png", "report.pdf"])

Class-based (multiple calls / resource reuse)

from glmocr import GlmOcr

# Preferred: the context manager closes the parser automatically
with GlmOcr(api_key="sk-xxx") as parser:   # mode auto-set to "maas"
    result = parser.parse("document.png")
    print(result.markdown_result)

# Without `with`: call .close() yourself when done
parser = GlmOcr(mode="maas")        # reads ZHIPU_API_KEY from env
try:
    result = parser.parse("document.png")
finally:
    parser.close()

Constructor Parameters

| Parameter | Type | Description |
| --------------- | ------ | ----------------------------------------------- |
| api_key | str | API key. Providing this auto-enables MaaS mode. |
| api_url | str | Override MaaS endpoint URL |
| model | str | Model name override |
| timeout | int | Request timeout in seconds (default: 600) |
| enable_layout | bool | Enable layout detection |
| log_level | str | Logging level |

---

Working with PipelineResult

Fields

result.markdown_result    # str — full document as Markdown
result.json_result        # list[list[dict]] — structured regions per page
result.original_images    # list[str] — absolute paths of input images

json_result structure

List of pages → list of regions per page:

[
  [
    {
      "index": 0,
      "label": "title",
      "content": "Annual Report 2024",
      "bbox_2d": [100, 50, 900, 120]
    },
    {
      "index": 1,
      "label": "table",
      "content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |",
      "bbox_2d": [100, 140, 900, 400]
    }
  ]
]

Bounding boxes (bbox_2d): [x1, y1, x2, y2] normalised to 0–1000 scale.
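
To crop or draw a region on the source image, the normalised box has to be scaled back to pixels. A minimal sketch, assuming each axis is normalised independently against the page width and height; `bbox_to_pixels` is a hypothetical helper, not an SDK function.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox_2d: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Scale an [x1, y1, x2, y2] box from the 0-1000 grid to pixel coordinates.

    Assumes x is relative to the image width and y to the image height.
    """
    x1, y1, x2, y2 = bbox_2d
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )


# The title box from the example above, on a 2000 x 3000 pixel page.
print(bbox_to_pixels([100, 50, 900, 120], width=2000, height=3000))
# prints (200, 150, 1800, 360)
```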

Region labels: title, text, table, figure, formula, header, footer, page_number, reference, seal

Serialization

# Dict (JSON-serializable, for passing to other tools)
d = result.to_dict()
# Keys: json_result, markdown_result, original_images, usage (MaaS), data_info (MaaS)

# JSON string
json_str = result.to_json()                 # pretty-printed, ensure_ascii=False
json_str = result.to_json(indent=None)      # compact single line

# Save to disk: writes <stem>/<stem>.json + <stem>/<stem>.md + layout_vis/
result.save(output_dir="./output")
result.save(output_dir="./output", save_layout_visualization=False)

Error Handling

The SDK does not raise on MaaS errors — check to_dict() for an "error" key:

result = parser.parse("image.png")
d = result.to_dict()
if "error" in d:
    # Handle failure
    print("OCR failed:", d["error"])
else:
    print(d["markdown_result"])
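
Because failures are reported in the returned data rather than raised, callers that prefer exceptions can wrap the check once. A minimal sketch operating on the `to_dict()` output; `OcrError` and `markdown_or_raise` are hypothetical names introduced here, not part of the SDK.

```python
from typing import Any, Dict


class OcrError(RuntimeError):
    """Hypothetical exception type for this sketch; not defined by glmocr."""


def markdown_or_raise(d: Dict[str, Any]) -> str:
    """Return the Markdown text, or raise if the dict carries an "error" key."""
    if "error" in d:
        raise OcrError(str(d["error"]))
    return d.get("markdown_result", "")


# Hand-written dicts in the documented shape, standing in for result.to_dict().
print(markdown_or_raise({"markdown_result": "# Annual Report 2024"}))

try:
    markdown_or_raise({"error": "request timed out"})
except OcrError as exc:
    print("OCR failed:", exc)
```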

---

CLI Reference

> Agent-preferred interface: use the CLI for most operations. Set ZHIPU_API_KEY in env once, then invoke as needed.

Supported input formats: .jpg, .jpeg, .png, .bmp, .gif, .webp, .pdf
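
When the inputs come from a mixed folder, it can help to filter by extension before calling the CLI. A minimal sketch using the extension list above; `supported_inputs` is a hypothetical helper, and the case-insensitive match is an assumption rather than documented CLI behaviour.

```python
from pathlib import Path
from typing import Iterable, List

# Extensions copied from the supported-formats list above.
SUPPORTED_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp", ".gif", ".webp", ".pdf"}


def supported_inputs(paths: Iterable[str]) -> List[str]:
    """Keep only paths whose extension is in the supported set (case-insensitive)."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_SUFFIXES]


print(supported_inputs(["a.PNG", "notes.txt", "scan.pdf", "b.tiff"]))
# prints ['a.PNG', 'scan.pdf']
```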

Basic usage

# Parse a single file → saves to ./output/<stem>/
# MaaS mode is the default; ZHIPU_API_KEY must be set (or use --api-key)
glmocr parse image.png

# Pass API key directly without any env setup
glmocr parse image.png --api-key sk-xxx

# Parse a directory → saves each file to ./output/<stem>/
glmocr parse ./scans/

# Use self-hosted vLLM/SGLang instead of cloud
glmocr parse image.png --mode selfhosted

# Specify output directory
glmocr parse image.png --output ./results/

Read results in the terminal (agent-friendly)

# Print Markdown + JSON to stdout (and still save to disk)
glmocr parse image.png --stdout

# Print to stdout ONLY — do not write any files
glmocr parse image.png --stdout --no-save

# JSON only (no Markdown output)
glmocr parse image.png --stdout --json-only

# Pipe JSON into jq for structured extraction
glmocr parse image.png --stdout --json-only --no-save | jq '.[0] | map(select(.label=="table"))'

Save control

# Skip layout visualization images (faster, smaller output)
glmocr parse image.png --no-layout-vis

# Parse and save only JSON + Markdown, skip layout vis
glmocr parse image.png --no-layout-vis --output ./results/

Batch processing

# All images in a folder
glmocr parse ./invoice_scans/ --output ./parsed/ --no-layout-vis

# With progress visible in logs
glmocr parse ./docs/ --output ./parsed/ --log-level INFO

Debugging

glmocr parse image.png --log-level DEBUG

Full flag reference

| Flag | Default | Description |
| ----------------- | ---------- | ----------------------------------------------------- |
| --api-key / -k | env var | API key for MaaS mode (overrides ZHIPU_API_KEY) |
| --mode | maas | maas (cloud, default) or selfhosted (local GPU) |
| --env-file | auto | Path to .env file (default: auto-discover from cwd) |
| --output / -o | ./output | Output directory |
| --stdout | off | Print JSON + Markdown to stdout |
| --no-save | off | Skip writing files (use with --stdout) |
| --json-only | off | stdout JSON only, no Markdown |
| --no-layout-vis | off | Skip layout visualization images |
| --config / -c | none | Path to YAML config override |
| --log-level | INFO | DEBUG / INFO / WARNING / ERROR |
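
An agent that shells out to the CLI can assemble the argument list from the flags in this table. A minimal sketch; `build_parse_command` is a hypothetical helper that only uses flags listed above, and it deliberately leaves the API key to the `ZHIPU_API_KEY` environment variable so the key does not appear in the process arguments.

```python
import shlex
from typing import List, Optional


def build_parse_command(
    source: str,
    output: Optional[str] = None,
    stdout: bool = False,
    save: bool = True,
    json_only: bool = False,
    layout_vis: bool = True,
    log_level: Optional[str] = None,
) -> List[str]:
    """Build an argv list for `glmocr parse` from the documented flags."""
    cmd = ["glmocr", "parse", source]
    if output is not None:
        cmd += ["--output", output]
    if stdout:
        cmd.append("--stdout")
    if json_only:
        cmd.append("--json-only")
    if not save:
        cmd.append("--no-save")
    if not layout_vis:
        cmd.append("--no-layout-vis")
    if log_level is not None:
        cmd += ["--log-level", log_level]
    return cmd


cmd = build_parse_command("image.png", stdout=True, save=False, json_only=True)
print(shlex.join(cmd))
# prints: glmocr parse image.png --stdout --json-only --no-save
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)`; passing a list rather than a shell string avoids quoting problems with file names that contain spaces.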

---

Typical Agent Workflow

receive document path / URL
       │
       ▼
glmocr.parse(path)            ← single call, handles PDF/image
       │
       ▼
result.to_dict()              ← safe to pass as tool output
       │
       ├── markdown_result    → hand to LLM for reading / summarization
       └── json_result        → structured extraction (tables, formulas, regions by label)

Filter by label

import glmocr

result = glmocr.parse("report.png")
regions = result.json_result[0]  # first page

tables = [r for r in regions if r["label"] == "table"]
formulas = [r for r in regions if r["label"] == "formula"]
body_text = [r for r in regions if r["label"] == "text"]
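
Once table regions are filtered out, their `content` usually needs to become rows and cells. A minimal sketch, assuming the content is a Markdown pipe table like the one in the json_result example; real output may use another table format, and `parse_markdown_table` is a hypothetical helper that does not handle escaped pipes inside cells.

```python
from typing import List


def parse_markdown_table(content: str) -> List[List[str]]:
    """Split a simple Markdown pipe table into rows of stripped cell strings."""
    rows: List[List[str]] = []
    for line in content.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the header separator row, whose cells contain only dashes and colons.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


# The table region from the json_result example above.
region = {"label": "table", "content": "| Q1 | Q2 |\n|---|---|\n| 120 | 145 |"}
print(parse_markdown_table(region["content"]))
# prints [['Q1', 'Q2'], ['120', '145']]
```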

Multi-page PDF → iterate pages

from glmocr import GlmOcr

with GlmOcr(api_key="sk-xxx") as parser:
    result = parser.parse("document.pdf")   # all pages in one PipelineResult
    for page_idx, page_regions in enumerate(result.json_result):
        print(f"Page {page_idx + 1}: {len(page_regions)} regions")
        for region in page_regions:
            print(f"  [{region['label']}] {region['content'][:60]}")
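
The single-page label filter extends naturally to multi-page results by keeping the page number alongside each match. A minimal sketch over the documented `list[list[dict]]` shape; `regions_with_label` is a hypothetical helper and the sample data is hand-written, not real SDK output.

```python
from typing import Any, Dict, List, Tuple

Region = Dict[str, Any]


def regions_with_label(
    json_result: List[List[Region]], label: str
) -> List[Tuple[int, Region]]:
    """Collect (1-based page number, region) pairs whose label matches."""
    return [
        (page_no, region)
        for page_no, page in enumerate(json_result, start=1)
        for region in page
        if region.get("label") == label
    ]


# Sample data in the documented shape, standing in for result.json_result.
pages = [
    [{"index": 0, "label": "title", "content": "Annual Report 2024"}],
    [
        {"index": 0, "label": "table", "content": "| Q1 | Q2 |"},
        {"index": 1, "label": "text", "content": "Revenue grew."},
    ],
]
for page_no, region in regions_with_label(pages, "table"):
    print(page_no, region["content"])
# prints: 2 | Q1 | Q2 |
```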

Programmatic config (no env vars)

from glmocr.config import GlmOcrConfig

cfg = GlmOcrConfig.from_env(
    api_key="sk-xxx",
    mode="maas",
    timeout=600,
    log_level="DEBUG",
)

---

Output Directory Layout

After result.save(output_dir):

output_dir/
  <image_stem>/
    <image_stem>.json         ← structured regions
    <image_stem>.md           ← full Markdown (with cropped figure images)
    imgs/                     ← cropped figures referenced in Markdown
    layout_vis/               ← layout detection overlay images (if enabled)
      <image_stem>.jpg
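
To read saved results back later, the paths can be derived from the layout above. A minimal sketch, assuming `<image_stem>` is the input file name without its extension; `expected_outputs` is a hypothetical helper, and it only computes paths without checking that the files exist.

```python
from pathlib import Path
from typing import Dict


def expected_outputs(output_dir: str, input_path: str) -> Dict[str, Path]:
    """Derive the per-document output paths from the documented directory layout."""
    stem = Path(input_path).stem  # assumption: stem = file name without extension
    base = Path(output_dir) / stem
    return {
        "json": base / f"{stem}.json",
        "markdown": base / f"{stem}.md",
        "images": base / "imgs",
        "layout_vis": base / "layout_vis",
    }


paths = expected_outputs("./output", "scans/invoice.png")
print(paths["json"].as_posix())
# prints output/invoice/invoice.json
```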

---

Common Pitfalls

  • ZHIPU_API_KEY not set: SDK defaults to MaaS mode. Without a key, parse() will fail with a clear error message and quick-fix instructions. Set via export ZHIPU_API_KEY=sk-xxx, add to a .env file, or pass --api-key sk-xxx to the CLI.
  • Large PDFs: The default timeout is 600 s. For very long documents, raise it, for example timeout=1200.
  • result.json_result is a string: Happens when the model returns malformed JSON. The SDK preserves the raw string — parse or log it manually.
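
For the last pitfall, a defensive caller can normalise `json_result` before iterating over it. A minimal sketch; `coerce_json_result` is a hypothetical helper that retries parsing with the standard `json` module and returns `None` when the raw string cannot be recovered, so the caller can log it instead.

```python
import json
from typing import Any, List, Optional


def coerce_json_result(json_result: Any) -> Optional[List[Any]]:
    """Return the pages as a list, or None if the payload cannot be recovered."""
    if isinstance(json_result, list):
        return json_result  # already structured
    if isinstance(json_result, str):
        try:
            parsed = json.loads(json_result)
        except json.JSONDecodeError:
            return None  # malformed: caller should log the raw string
        return parsed if isinstance(parsed, list) else None
    return None


print(coerce_json_result('[[{"label": "text", "content": "hi"}]]'))
# prints [[{'label': 'text', 'content': 'hi'}]]
print(coerce_json_result('[[{"label": "text", '))
# prints None
```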