Spark Program | Nervos Brain - A Global Developer Onboarding Engine and Cross-Language Hub Powered by Agentic RAG

Hi @IrisNeko,

Thanks for pushing this forward so actively!

On the questions you raised:

  1. Official documentation and repo list: The committee is already confirming the latest, most essential official documentation links, and will organize them and send them to you as soon as possible.

  2. Historical data snapshot: Unfortunately, the committee has no ready-made packaged dataset for you to use directly; you will still need to fetch the initial data yourself via the APIs.

Please keep moving in the direction you have planned; we will share the documentation links with you as soon as they are ready.


Week 3 Weekly Report

1. Overview of This Week’s Work

Following last week’s plan, this week’s progress ran along two main tracks: data acquisition and database construction. We completed the implementation of the dual-layer database architecture for the retrieval layer, the engineering of the multi-way retrieval fusion pipeline, and the development and integration of the Nervos Talk forum crawler. The minimum viable pipeline of “data acquisition → data ingestion” is essentially operational; however, because the official documentation has not yet been fully organized, unified testing on the complete dataset has not yet been carried out.

2. Completed Tasks

1. Implementation of Dual-Layer Database Architecture

This week, we completed the engineering implementation of the “shallow index + deep original document” dual-layer storage model:

  • Shallow Index Layer (Qdrant): Stores only vectors of metadata such as titles, summaries, and keywords, keeping the index size compact and semantic retrieval highly efficient.
  • Deep Original Document Layer (SQLite): Uses ArchiveStore (SQLAlchemy ORM) to store the complete original text. Fields include raw_text, raw_format, content_hash, etc., supporting deduplication and incremental updates.
  • DualLayerWriter: Synchronously writes to both layers, using content_hash (SHA-256) as the idempotent key to prevent redundant records during repeated ingestions.

This architecture avoids the need to maintain a full knowledge graph for a frequently changing large database, while preserving the ability to fetch the complete original text via anchors at any time.
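A minimal sketch of how such a writer can behave (the names DualLayerWriter, raw_text, raw_format, and content_hash follow the report; the stdlib sqlite3 table and the dict standing in for the Qdrant index are illustrative assumptions, not the project’s actual schema):

```python
import hashlib
import sqlite3

class DualLayerWriter:
    """Write each document to both layers, keyed by a SHA-256 content hash
    so that re-ingesting the same text is a no-op (idempotent writes)."""

    def __init__(self, db_path=":memory:"):
        # Deep layer: full originals (stands in for the SQLAlchemy ArchiveStore).
        self.archive = sqlite3.connect(db_path)
        self.archive.execute(
            "CREATE TABLE IF NOT EXISTS docs "
            "(content_hash TEXT PRIMARY KEY, raw_text TEXT, raw_format TEXT)"
        )
        # Shallow layer: metadata only (stands in for the Qdrant index).
        self.index = {}

    def write(self, raw_text, raw_format="markdown", title=""):
        content_hash = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
        seen = self.archive.execute(
            "SELECT 1 FROM docs WHERE content_hash = ?", (content_hash,)
        ).fetchone()
        if seen:
            return False  # duplicate content: skip both layers
        self.archive.execute(
            "INSERT INTO docs VALUES (?, ?, ?)",
            (content_hash, raw_text, raw_format),
        )
        self.archive.commit()
        # The shallow record carries only metadata plus an anchor back to the archive.
        self.index[content_hash] = {"title": title, "anchor": content_hash}
        return True
```

Calling `write` twice with the same text leaves exactly one record in each layer, which is the idempotency property the report describes.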

2. Multi-Way Retrieval Fusion Pipeline

Building upon the dual-layer storage, we simultaneously implemented a four-way retrieval fusion pipeline this week:

query
├─ 1. Vector Retrieval (Qdrant)  — Semantic recall
├─ 2. BM25 Retrieval (rank_bm25) — Keyword/term recall, supports mixed English-Chinese tokenization
├─ 3. Fuzzy Matching (difflib)   — Naming variants, abbreviations, minor spelling differences
└─ 4. Exact Match (Qdrant filter) — Hard anchors like function names, class names, post titles
↓
RRF Fusion (Reciprocal Rank Fusion)
↓
Evidence (includes score tracing from each source)

The MultiRetriever centrally manages these four paths, each of which can be toggled independently via configuration. The payload field of the RRF results retains the original scores from each path to facilitate subsequent diagnostics and tuning.
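The fusion step itself is small; a sketch of it might look like this (the payload layout is illustrative, and k = 60 is the commonly used RRF constant rather than a confirmed project setting):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over paths of 1 / (k + rank_path(d)).

    `rankings` maps a path name ("vector", "bm25", "fuzzy", "exact") to an
    ordered list of doc ids; per-path contributions are kept alongside the
    fused score so every result stays traceable to its sources."""
    scores = {}
    for path, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            entry = scores.setdefault(doc_id, {"score": 0.0, "sources": {}})
            contribution = 1.0 / (k + rank)
            entry["score"] += contribution
            entry["sources"][path] = {"rank": rank, "rrf": contribution}
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: -item[1]["score"])
```

Because RRF only consumes ranks, it needs no score normalization across the four heterogeneous paths, which is why it is a common choice for this kind of fusion.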

3. Nervos Talk Forum Crawler

Developed a complete crawler module for Nervos Talk (a Discourse forum):

  • Based on the Discourse REST API (/t/{id}.json, /t/{id}/posts.json), supporting paginated batch fetching.
  • Incremental Updates: After the initial full crawl, subsequent runs automatically skip ingested posts by comparing anchors, writing only new content.
  • html_to_text converts the cooked HTML returned by Discourse into clean plain text, preserving code block formatting.
  • Provided the scripts/run_discourse_crawl.py CLI, supporting modes such as single-topic crawling, scanning for latest topics, and category-based crawling.
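As an illustration of the incremental logic, a simplified version of the post-extraction step might look like this (the field names `id`, `post_stream`, `post_number`, and `cooked` come from the Discourse API; this bare-bones `html_to_text` drops all tags and, unlike the report’s version, does not preserve code-block markers):

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes only; all tags are discarded."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(cooked_html):
    parser = _TextExtractor()
    parser.feed(cooked_html)
    return "".join(parser.parts).strip()

def extract_new_posts(topic_json, seen_anchors):
    """From a /t/{id}.json payload, keep only posts whose anchor
    (topic_id/post_number) is not in the already-ingested set."""
    topic_id = topic_json["id"]
    new_posts = []
    for post in topic_json["post_stream"]["posts"]:
        anchor = f"{topic_id}/{post['post_number']}"
        if anchor in seen_anchors:
            continue  # incremental mode: skip anything already stored
        new_posts.append({"anchor": anchor, "text": html_to_text(post["cooked"])})
    return new_posts
```

On a second crawl, the anchors already written to the archive become `seen_anchors`, so only genuinely new posts flow into the dual-layer writer.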

4. Minimum Viable Pipeline Testing

The complete pipeline of “data acquisition → cleaning → dual-layer ingestion → multi-way retrieval” has been verified via integration testing (including actual API requests to https://talk.nervos.org). We are waiting for the official Nervos team to finalize the latest documentation before conducting a unified ingestion test on the complete dataset.

3. Milestones Achieved This Week

This week marks the leap from “database architecture design” to a “runnable data pipeline.” A single command can now ingest forum posts in full into the dual-layer database and retrieve them via multi-way fusion retrieval. This lays the groundwork for integrating more data sources (RFC documents, GitHub code) and for LangGraph workflows to call real retrieval.

4. Known Issues

The official documentation data source is not yet ready. Verification of the current retrieval effectiveness still relies on a small amount of sample data, and a systematic retrieval quality assessment on a complete corpus has not yet been conducted. The BM25 index is currently built in memory and needs to be reloaded after process restarts; a persistence solution needs to be considered as the data volume grows.
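One low-effort direction for the persistence problem, assuming the index object is picklable (rank_bm25’s index classes are plain Python objects), is to cache it on disk keyed by a corpus version. The helper below is a hedged sketch, not the project’s code:

```python
import pickle
from pathlib import Path

def load_or_build_index(cache_path, build_fn, corpus_version):
    """Unpickle a cached index if its recorded corpus version still matches;
    otherwise rebuild via `build_fn()` and refresh the cache."""
    path = Path(cache_path)
    if path.exists():
        cached_version, index = pickle.loads(path.read_bytes())
        if cached_version == corpus_version:
            return index  # warm start: no rebuild after a process restart
    index = build_fn()
    path.write_bytes(pickle.dumps((corpus_version, index)))
    return index
```

Any cheap fingerprint of the corpus works as `corpus_version`, for example a digest over the stored content hashes, so a restart only pays the rebuild cost when documents actually changed.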

5. Plan for Next Week

1. Continue Refining the Agent Core

The main pipeline of the graph engine has taken shape, but the retrieval nodes still use mock data. Next week’s focus is connecting the workflow to the real data layer:

  • Retrieval Node Replacement: Replace the mock implementation in the RetrievalExecutor node with the real MultiRetriever, enabling the graph engine to recall actual evidence from the Qdrant + SQLite dual-layer database.
  • Context Injection: Implement the linkage between MemoryService and the graph state, allowing the Agent to read the user’s historical memory and channel context in multi-turn dialogues, thereby enriching retrieval filter conditions.
  • End-to-End Smoke Testing: Run the complete pipeline of “user query → retrieval → evidence scoring → answer generation” with real data to confirm the fundamental stability of each node under real inputs.

2. Develop Telegram MCP

Begin advancing the Bot integration layer, starting with Telegram as the first platform. We will use MCP (Model Context Protocol) to encapsulate the Telegram messaging channel into standard tool interfaces:

  • Basic TG Bot Setup: Implement message sending and receiving based on the Telegram Bot API, handling message format differences between private chats and group scenarios.
  • MCP Transport Implementation: Build upon the existing MCPTransportAdapter abstraction by adding TelegramTransportAdapter, encapsulating the receiving and replying of Telegram messages into a tool interface directly callable by the graph engine.
  • Message Format Adaptation: Connect with ResponseNormalizer to convert the FinalResponse output by the graph engine into the Markdown format supported by Telegram, and handle the segmented sending of long replies.
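On the last point, Telegram’s Bot API caps a single message at 4096 characters, so the normalizer will need a chunker. A simple newline-preferring split might look like this (the boundary handling is an assumption; it does not yet keep Markdown code fences intact across chunks):

```python
TELEGRAM_MAX_LEN = 4096  # Bot API hard limit for one message's text

def chunk_message(text, limit=TELEGRAM_MAX_LEN):
    """Split a long reply into Telegram-sized chunks, preferring to break at
    the last newline before the limit so lines are less likely to be cut."""
    chunks = []
    while len(text) > limit:
        cut = text.rfind("\n", 0, limit)
        if cut <= 0:  # no usable newline in range: hard cut at the limit
            cut = limit
        chunks.append(text[:cut])
        text = text[cut:].lstrip("\n")
    if text:
        chunks.append(text)
    return chunks
```

A production version would additionally track open ``` fences and re-open them at the start of the next chunk, otherwise a split inside a code block corrupts the Markdown rendering.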

@xingtianchunyan Hello, by when at the latest can the official links to these latest core documents be provided? Knowing this will help me plan the development schedule.

Hi @IrisNeko

Great to see your latest progress; your steady delivery deserves praise!

Here are the core documentation links you can use:
Nervos Network: Nervos Network · GitHub
Web5: web5fans · GitHub
DevRel: CKB DevRel · GitHub
RGB++: RGB++ Protocol · GitHub
Fiber: GitHub - nervosnetwork/fiber: A scalable, privacy-by-default payment & swap network · GitHub
App5: appfi5 · GitHub

I owe you an apology about the documentation; it took some extra time. I hope the core documents above are enough to keep your current work moving. If anything is unclear, feel free to check with us again.

Best regards,
Xingtian



Week 4 Report (English)

1. Weekly Overview

This week focused on real data integration + real model answering. We completed multi-source GitHub documentation crawling and ingestion, scaled retrieval database construction, and a runnable end-to-end Agent demo. The system now retrieves evidence from the real database and generates citation-grounded answers with glm-4.7, forming a practical pipeline from crawling to answering.

2. Completed Work

1. GitHub crawler and unified ingestion

  • Added a GitHub docs crawler supporting both owner-level and repo-level targets.
  • Implemented automatic doc-file filtering (README, docs/, guide/, *.md, *.rst, etc.) and exclusion of non-doc directories.
  • Added a one-command workflow for crawl + JSONL export + dual-layer DB ingestion + BM25 rebuild.
  • Improved resilience for network failures, rate limits, and empty repositories.
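The doc-file filter might be approximated like this (the hint and skip sets are illustrative guesses, not the crawler’s actual lists):

```python
from pathlib import PurePosixPath

DOC_SUFFIXES = {".md", ".rst"}
DOC_DIR_HINTS = {"docs", "doc", "guide", "guides"}
SKIP_DIRS = {"node_modules", "vendor", "dist", "build", "target"}

def is_doc_path(repo_path: str) -> bool:
    """Keep READMEs, Markdown/reST files, and anything under a docs-like
    directory; drop paths inside obvious non-documentation trees."""
    p = PurePosixPath(repo_path.lower())
    if any(part in SKIP_DIRS for part in p.parts):
        return False
    return (
        p.stem == "readme"
        or p.suffix in DOC_SUFFIXES
        or any(part in DOC_DIR_HINTS for part in p.parts)
    )
```

Filtering on paths alone keeps the crawl cheap: only files that pass this check need their contents fetched from the GitHub API.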

2. Scaled data build outcomes

  • Completed batch crawling and ingestion for Nervos-related public sources.
  • Output scale:
    • 1232 documents ingested (seen=1232, written=1232)
    • Coverage of 163 repository topics
    • data/sources/github_docs.jsonl generated for reproducibility and reprocessing
  • Storage status:
    • SQLite archive layer (archive.db) is populated and queryable
    • Qdrant shallow index is available for retrieval

3. Real Agent path online (retrieval + model answer)

  • Added a practical runnable script for “retrieve DB evidence → call model → output citations”.
  • Included built-in samples (Fiber / Cell / CCC) and filtering by source/topic/type.
  • Verified in real runs with glm-4.7, producing evidence-grounded answers.

4. Engineering quality and verification

  • Added unit tests for target parsing, doc-path filtering, and normalized crawler output.
  • Non-integration regression tests passed (pytest -m 'not integration'), confirming no breakage.

3. Milestone Achievements

The project has moved from a demo-oriented retrieval framework to a practical, testable QA system:

  1. Scalable data onboarding through unified ingestion.
  2. Real retrieval over a production-like corpus.
  3. Evidence-constrained answer generation with glm-4.7 citations.

4. Current Issues

  1. Some repositories are empty or have unusual default-branch states, requiring skip logic; this does not affect the overall task but produces noisy logs.
  2. Retrieval is usable, but answer quality still depends on evidence coverage; for some questions the system can still only return directional answers.

5. Plan for Next Week

A. Improvements to the current system

  1. Build a measurable evaluation set (hit rate, citation validity, answer actionability).
  2. Improve multi-turn agent flow: ask missing info → retrieve again → finalize answer.
  3. Add incremental update scheduling and change tracking for data sources.
  4. Tighten answer constraints to improve citation coverage and evidence consistency.
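As a starting point for item 1, the hit-rate metric can be very small; the sketch below assumes an eval-set shape and a `retrieve(query)` callable that do not exist yet:

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of eval queries whose expected document id appears in the
    top-k results of `retrieve(query)` (an ordered list of doc ids)."""
    hits = sum(
        1
        for case in eval_set
        if case["expected_id"] in retrieve(case["query"])[:k]
    )
    return hits / len(eval_set)
```

Citation validity and answer actionability need human or model-graded labels, but a hit-rate harness like this can already run on every corpus rebuild to catch retrieval regressions.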

B. TG MCP development

  1. Build Telegram Bot basics for private/group message handling.
  2. Implement TelegramTransportAdapter aligned with existing MCP interfaces.
  3. Connect graph-engine output to Telegram Markdown and long-message chunking.
  4. Validate the end-to-end path: Telegram message → MCP → Agent → response.

Due to a personal matter, this report (Week 4) was published one day later than originally planned; thanks for your understanding.
