Hi @IrisNeko,

Thanks for pushing this forward so actively!

On the questions you raised:

- Official documentation and repo list: the committee is already working on confirming the latest, most essential official documentation links and will compile and send them to you as soon as possible.
- Historical data snapshot: unfortunately, the committee has no ready-made packaged dataset for you to use directly; you will still need to fetch the initial data yourself via the API.

Please keep moving in the direction you planned; we will share the documentation links with you the moment they are ready.
Following last week’s plan, this week’s progress was driven by two main tracks: data acquisition and database construction. We completed the implementation of the dual-layer database architecture for the retrieval layer, the engineering of the multi-way retrieval fusion pipeline, and the development and integration of the Nervos Talk forum crawler. The minimum viable pipeline of “data acquisition → data ingestion” is basically operational. However, because the official documentation has not yet been fully organized, we have temporarily paused unified testing on the complete dataset.
This week, we completed the engineering implementation of the “shallow index + deep original document” dual-layer storage model:
- ArchiveStore (SQLAlchemy ORM): stores the complete original text; fields include raw_text, raw_format, content_hash, etc., supporting deduplication and incremental updates.
- DualLayerWriter: writes to both layers synchronously, using content_hash (SHA-256) as the idempotency key so that repeated ingestion produces no redundant records.

This architecture avoids maintaining a full knowledge graph for a large, frequently changing corpus, while preserving the ability to fetch the complete original text via anchors at any time.
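The idempotent write path can be sketched as follows. This is illustrative only: the real DualLayerWriter uses SQLAlchemy ORM and also populates the shallow index, while here a single sqlite3 table stands in for the archive layer, and the class name is hypothetical.

```python
import hashlib
import sqlite3

def content_hash(raw_text: str) -> str:
    """SHA-256 over the raw text, used as the idempotency key."""
    return hashlib.sha256(raw_text.encode("utf-8")).hexdigest()

class DualLayerWriterSketch:
    """Simplified stand-in for the report's DualLayerWriter."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS archive ("
            "  content_hash TEXT PRIMARY KEY,"   # idempotency key
            "  raw_text     TEXT NOT NULL,"
            "  raw_format   TEXT NOT NULL)"
        )

    def write(self, raw_text: str, raw_format: str = "markdown") -> bool:
        """Insert a document; return False if it was already stored."""
        h = content_hash(raw_text)
        cur = self.conn.execute(
            "INSERT OR IGNORE INTO archive VALUES (?, ?, ?)",
            (h, raw_text, raw_format),
        )
        return cur.rowcount == 1  # 1 row inserted, 0 if duplicate

writer = DualLayerWriterSketch(sqlite3.connect(":memory:"))
assert writer.write("hello CKB") is True    # first ingest
assert writer.write("hello CKB") is False   # re-ingest is a no-op
```

The same hash-as-primary-key trick is what makes repeated crawls safe: re-running ingestion over unchanged sources leaves the archive untouched.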
Building upon the dual-layer storage, we simultaneously implemented a four-way retrieval fusion pipeline this week:
query
├─ 1. Vector Retrieval (Qdrant) — Semantic recall
├─ 2. BM25 Retrieval (rank_bm25) — Keyword/term recall, supports mixed English-Chinese tokenization
├─ 3. Fuzzy Matching (difflib) — Naming variants, abbreviations, minor spelling differences
└─ 4. Exact Match (Qdrant filter) — Hard anchors like function names, class names, post titles
↓
RRF Fusion (Reciprocal Rank Fusion)
↓
Evidence (includes score tracing from each source)
The MultiRetriever centrally manages these four paths, each of which can be toggled independently via configuration. The payload field of the RRF results retains the original scores from each path to facilitate subsequent diagnostics and tuning.
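The RRF step can be sketched as follows. The function name and result shape are illustrative, but the scoring is standard Reciprocal Rank Fusion (1 / (k + rank), with k = 60 as a common default), keeping per-path contributions for the kind of score tracing described above:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: dict, k: int = 60):
    """Reciprocal Rank Fusion over several retrieval paths.

    ranked_lists maps a path name (e.g. "vector", "bm25") to its
    ranked document ids. Each document accumulates 1 / (k + rank)
    per path; the per-path contributions are retained so results
    can be diagnosed path by path.
    """
    fused = defaultdict(lambda: {"score": 0.0, "sources": {}})
    for path, docs in ranked_lists.items():
        for rank, doc_id in enumerate(docs, start=1):
            contrib = 1.0 / (k + rank)
            fused[doc_id]["score"] += contrib
            fused[doc_id]["sources"][path] = contrib
    return sorted(fused.items(), key=lambda kv: kv[1]["score"], reverse=True)

results = rrf_fuse({
    "vector": ["doc_a", "doc_b", "doc_c"],
    "bm25":   ["doc_b", "doc_a"],
    "exact":  ["doc_b"],
})
assert results[0][0] == "doc_b"   # ranked by all three paths
```

Because RRF only consumes ranks, not raw scores, the four paths need no score normalization against each other, which is one reason it suits heterogeneous retrievers like these.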
Developed a complete crawler module for Nervos Talk (a Discourse forum):
- API client for the Discourse topic endpoints (/t/{id}.json, /t/{id}/posts.json), supporting paginated batch fetching.
- html_to_text converts the cooked HTML returned by Discourse into clean plain text, preserving code block formatting.
- scripts/run_discourse_crawl.py CLI, supporting modes such as single-topic crawling, scanning for latest topics, and category-based crawling.

The complete pipeline of “data acquisition → cleaning → dual-layer ingestion → multi-way retrieval” has been verified via integration testing (including actual API requests to https://talk.nervos.org). We are waiting for the official Nervos team to finalize the latest documentation before conducting a unified ingestion test on the complete dataset.
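A rough, stdlib-only sketch of what the html_to_text step does; the project's actual implementation may differ, and this version handles only a few common tags:

```python
from html.parser import HTMLParser

class CookedHTMLToText(HTMLParser):
    """Rough stand-in for the report's html_to_text: flattens the
    "cooked" HTML that Discourse returns into plain text, wrapping
    <pre> contents in ``` fences so code blocks stay recognizable."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.in_pre = False

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.in_pre = True
            self.parts.append("\n```\n")
        elif tag in ("p", "br", "li"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "pre":
            self.in_pre = False
            self.parts.append("\n```\n")

    def handle_data(self, data):
        # Inside <pre>, keep whitespace verbatim; elsewhere, trim it.
        self.parts.append(data if self.in_pre else data.strip())

def html_to_text(cooked: str) -> str:
    parser = CookedHTMLToText()
    parser.feed(cooked)
    return "".join(parser.parts).strip()

text = html_to_text("<p>Run:</p><pre><code>ckb run</code></pre>")
assert "ckb run" in text and "```" in text
```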
This week marks the leap from “database architecture design” to a “runnable data pipeline.” Currently, we can completely ingest forum posts into the dual-layer database and retrieve results via the multi-way fusion retrieval using a single command. This lays the groundwork for integrating more data sources (RFC documents, GitHub code) and allowing LangGraph workflows to call real retrieval mechanisms.
The official documentation data source is not yet ready. Verification of the current retrieval effectiveness still relies on a small amount of sample data, and a systematic retrieval quality assessment on a complete corpus has not yet been conducted. The BM25 index is currently built in memory and needs to be reloaded after process restarts; a persistence solution needs to be considered as the data volume grows.
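One simple persistence option for the BM25 index is to serialize the built index to disk and reload it on startup, instead of rebuilding from the corpus after every restart. The sketch below uses a minimal hand-rolled BM25 scorer so it stays self-contained; a similar pickle-based approach could apply to the project's rank_bm25 index. Everything here is illustrative:

```python
import math
import pickle
from collections import Counter

class TinyBM25:
    """Minimal BM25 (Okapi) index that can be pickled to disk."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(d) for d in docs]          # per-doc term counts
        self.doc_len = [len(d) for d in docs]
        self.avgdl = sum(self.doc_len) / len(docs)
        df = Counter(t for d in docs for t in set(d))  # document frequency
        n = len(docs)
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1)
                    for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of document i for a tokenized query."""
        s = 0.0
        for t in query:
            f = self.tf[i].get(t, 0)
            if f:
                norm = f * (self.k1 + 1) / (
                    f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl))
                s += self.idf[t] * norm
        return s

    def save(self, path):
        with open(path, "wb") as fh:
            pickle.dump(self, fh)

    @staticmethod
    def load(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)

docs = [["nervos", "ckb", "cell"], ["fiber", "payment", "channel"]]
index = TinyBM25(docs)
assert index.score(["ckb"], 0) > index.score(["ckb"], 1)
```

Rebuild-and-reserialize on each crawl run would keep the on-disk index consistent with the archive; at larger scale, an incremental or database-backed index would be the next step.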
The main pipeline of the graph engine has taken shape, but the retrieval nodes still use Mock data. Next week’s focus will be connecting the workflow with the real data layer:
- Replace the Mock implementation in the RetrievalExecutor node with the real MultiRetriever, enabling the graph engine to recall actual evidence from the Qdrant + SQLite dual-layer database.
- Wire up MemoryService with the graph state, allowing the Agent to read the user's historical memory and channel context in multi-turn dialogues, thereby enriching retrieval filter conditions.
- Begin advancing the Bot integration layer, starting with Telegram as the first platform, using MCP (Model Context Protocol) to encapsulate the Telegram messaging channel into standard tool interfaces:
- On top of the MCPTransportAdapter abstraction, add a TelegramTransportAdapter that encapsulates receiving and replying to Telegram messages as a tool interface directly callable by the graph engine.
- Add a ResponseNormalizer to convert the FinalResponse output by the graph engine into Telegram-supported Markdown, handling segmented sending for long replies.

@xingtianchunyan Hello, may I ask by when, at the latest, the official links to these latest core documents can be provided? It would help me plan the development schedule.
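The long-reply segmentation planned for the ResponseNormalizer could look roughly like this. The 4096-character limit is Telegram's Bot API cap on message text; the newline-preferring split is an assumption for this sketch, not the project's actual logic:

```python
TELEGRAM_LIMIT = 4096  # Telegram Bot API cap on message text length

def split_for_telegram(text: str, limit: int = TELEGRAM_LIMIT) -> list:
    """Split a long reply into Telegram-sized chunks, preferring to
    break at newlines so Markdown blocks are less likely to be cut
    mid-line. Illustrative sketch only."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)  # prefer a newline boundary
        if cut <= 0:
            cut = limit                   # no newline found: hard cut
        chunks.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        chunks.append(text)
    return chunks

parts = split_for_telegram("a" * 5000)
assert len(parts) == 2 and all(len(p) <= 4096 for p in parts)
```

A production normalizer would also need to avoid splitting inside Markdown entities (e.g. an unclosed code fence), which this sketch does not handle.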
Hi @IrisNeko,

Great to see your latest progress; your steady delivery deserves praise!

Here are the core documentation links you can use:

Nervos Network: Nervos Network · GitHub
Web5: web5fans · GitHub
DevRel: CKB DevRel · GitHub
RGB++: RGB++ Protocol · GitHub
Fiber: GitHub - nervosnetwork/fiber: A scalable, privacy-by-default payment & swap network · GitHub
App5: appfi5 · GitHub

Regarding the documentation, I owe you an apology for the delay. I hope the core documents above are enough for you to keep your current work moving. If anything is unclear, feel free to check with us.

Best,
Xingtian (行天)
This week focused on real data integration + real model answering. We completed multi-source GitHub documentation crawling and ingestion, scaled up retrieval database construction, and built a runnable end-to-end Agent demo. The system now retrieves evidence from the real database and generates citation-grounded answers with glm-4.7, forming a practical "crawl → ingest → retrieve → answer" loop.

- Documentation matching rules (README, docs/, guide/, *.md, *.rst, etc.), skipping clearly non-documentation directories.
- 1232 documents ingested (seen=1232, written=1232).
- 163 repository topics covered.
- data/sources/github_docs.jsonl generated for reproducibility and reprocessing.
- Corresponding records written to the archive database (archive.db).
- Retrieval scope filterable by source/topic/type.
- Answering backed by glm-4.7, producing evidence-grounded, citation-backed answers.
- Regression tests pass (pytest -m 'not integration'), confirming the new capabilities did not break existing pipelines.

The project has moved from a demo-oriented retrieval framework to a practical, testable QA system, with end-to-end answering (glm-4.7 + citations) as its core value.

Planned next:

- TelegramTransportAdapter: align with existing MCP interfaces and complete message input/output adaptation.
- Convert FinalResponse to Telegram Markdown and handle segmented sending of long messages.

Due to a personal matter, this week's report (Week 4) was published one day later than originally planned. Thanks for your understanding.
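The documentation matching rules listed in this week's report could be sketched as follows; the file patterns and directory names come from the report, while the skip list and helper name are assumptions:

```python
import fnmatch
import posixpath

# Patterns mirroring the report's documentation matching rules.
DOC_PATTERNS = ["README*", "*.md", "*.rst"]
DOC_DIRS = ["docs", "guide"]
# Assumed skip list: directories that are clearly not documentation.
SKIP_DIRS = {"node_modules", "target", "dist", ".git"}

def is_doc_path(path: str) -> bool:
    """Decide whether a repo file path looks like documentation."""
    parts = path.split("/")
    if any(p in SKIP_DIRS for p in parts[:-1]):
        return False
    name = posixpath.basename(path)
    if any(fnmatch.fnmatch(name, pat) for pat in DOC_PATTERNS):
        return True
    # Anything under a known docs directory also counts.
    return any(d in parts[:-1] for d in DOC_DIRS)

assert is_doc_path("README.md")
assert is_doc_path("docs/overview.html")
assert not is_doc_path("src/main.rs")
assert not is_doc_path("node_modules/pkg/README.md")
```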