Week 3 Weekly Report
1. Overview of This Week’s Work
Following last week’s plan, this week’s work advanced along two main tracks: data acquisition and database construction. We completed the implementation of the dual-layer database architecture for the retrieval layer, the engineering of the multi-way retrieval fusion pipeline, and the development and integration of the Nervos Talk forum crawler. The minimum viable pipeline of “data acquisition → data ingestion” is now essentially operational; however, because the official documentation has not yet been fully organized, unified testing on the complete dataset has not yet been carried out.
2. Completed Tasks
1. Implementation of Dual-Layer Database Architecture
This week, we completed the engineering implementation of the “shallow index + deep original document” dual-layer storage model:
- Shallow Index Layer (Qdrant): Stores only vectors of metadata such as titles, summaries, and keywords, keeping the index size compact and semantic retrieval highly efficient.
- Deep Original Document Layer (SQLite): Uses ArchiveStore (SQLAlchemy ORM) to store the complete original text. Fields include raw_text, raw_format, and content_hash, supporting deduplication and incremental updates.
- DualLayerWriter: Writes to both layers synchronously, using content_hash (SHA-256) as an idempotency key so that repeated ingestion does not create redundant records.
This architecture avoids the need to maintain a full knowledge graph for a frequently changing large database, while preserving the ability to fetch the complete original text via anchors at any time.
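The idempotent write path can be sketched as follows. This is a minimal illustration, using dict-backed stand-ins for the two real backends (the Qdrant index and the SQLite ArchiveStore); field names mirror those listed above, but the project’s actual API may differ:

```python
import hashlib

class DualLayerWriter:
    """Sketch of the dual-layer write path. In-memory dicts stand in
    for the real backends (Qdrant index, SQLite ArchiveStore)."""

    def __init__(self):
        self.index = {}    # shallow layer: metadata only
        self.archive = {}  # deep layer: full original text

    def write(self, title: str, summary: str, raw_text: str,
              raw_format: str = "md") -> bool:
        """Write one document to both layers; returns False for duplicates."""
        content_hash = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
        if content_hash in self.archive:  # idempotency key: skip re-ingestion
            return False
        self.index[content_hash] = {"title": title, "summary": summary}
        self.archive[content_hash] = {
            "raw_text": raw_text,
            "raw_format": raw_format,
            "content_hash": content_hash,
        }
        return True

writer = DualLayerWriter()
writer.write("t1", "s1", "hello world")   # first ingestion: written
writer.write("t1", "s1", "hello world")   # repeat: skipped, no redundant record
```

Because the hash is computed over the raw text, re-running an unchanged crawl is a no-op, while an edited post produces a new hash and is written as new content.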
2. Multi-Way Retrieval Fusion Pipeline
Building upon the dual-layer storage, we simultaneously implemented a four-way retrieval fusion pipeline this week:
query
├─ 1. Vector Retrieval (Qdrant) — Semantic recall
├─ 2. BM25 Retrieval (rank_bm25) — Keyword/term recall, supports mixed English-Chinese tokenization
├─ 3. Fuzzy Matching (difflib) — Naming variants, abbreviations, minor spelling differences
└─ 4. Exact Match (Qdrant filter) — Hard anchors such as function names, class names, post titles
↓
RRF Fusion (Reciprocal Rank Fusion)
↓
Evidence (includes score tracing from each source)
The MultiRetriever centrally manages these four paths, each of which can be toggled independently via configuration. The payload field of the RRF results retains the original scores from each path to facilitate subsequent diagnostics and tuning.
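The RRF step can be illustrated with a small self-contained sketch; the path names and result shape below are illustrative, not the actual MultiRetriever interface:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: dict[str, list[str]], k: int = 60) -> list[dict]:
    """Fuse several ranked result lists with Reciprocal Rank Fusion.

    `ranked_lists` maps a path name (e.g. "vector", "bm25") to doc ids
    ordered best-first. Each fused result keeps a per-path score payload
    for diagnostics, mirroring the provenance described above.
    """
    fused = defaultdict(float)
    provenance = defaultdict(dict)
    for path, docs in ranked_lists.items():
        for rank, doc_id in enumerate(docs, start=1):
            contribution = 1.0 / (k + rank)   # standard RRF term
            fused[doc_id] += contribution
            provenance[doc_id][path] = contribution
    ranked = sorted(fused, key=fused.get, reverse=True)
    return [{"id": d, "score": fused[d], "sources": provenance[d]} for d in ranked]

results = rrf_fuse({
    "vector": ["doc_a", "doc_b", "doc_c"],
    "bm25":   ["doc_b", "doc_a"],
    "exact":  ["doc_b"],
})
# doc_b ranks first: it appears in all three paths.
```

Note that RRF only needs ranks, not raw scores, which is what makes it robust when the four paths produce scores on incomparable scales.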
3. Nervos Talk Forum Crawler
Developed a complete crawler module for Nervos Talk (a Discourse forum):
- Based on the Discourse REST API (/t/{id}.json, /t/{id}/posts.json), with support for paginated batch fetching.
- Incremental Updates: After the initial full crawl, subsequent runs automatically skip already-ingested posts by comparing anchors, writing only new content.
- html_to_text converts the cooked HTML returned by Discourse into clean plain text, preserving code block markers.
- Provided a scripts/run_discourse_crawl.py CLI, supporting modes such as single-topic crawling, latest-topic scanning, and category-based crawling.
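As a rough illustration of the cooked-HTML conversion, here is a minimal stdlib-only sketch in the spirit of html_to_text (the project’s real implementation is not shown here and presumably handles more tags and edge cases):

```python
from html.parser import HTMLParser

class CookedToText(HTMLParser):
    """Minimal converter from Discourse 'cooked' HTML to plain text.

    Wraps <pre> blocks in ``` markers so code survives the conversion;
    strips all other tags, inserting line breaks at block boundaries.
    """
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.parts.append("\n```\n")
        elif tag in ("p", "br", "li"):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "pre":
            self.parts.append("\n```\n")

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(cooked: str) -> str:
    parser = CookedToText()
    parser.feed(cooked)
    return "".join(parser.parts).strip()

print(html_to_text("<p>hello</p><pre><code>x = 1</code></pre>"))
```

Keeping the ``` markers means downstream chunking can still treat code blocks as atomic units rather than splitting them mid-snippet.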
4. Minimum Viable Pipeline Testing
The complete pipeline of “data acquisition → cleaning → dual-layer ingestion → multi-way retrieval” has been verified via integration testing (including actual API requests to https://talk.nervos.org). We are waiting for the official Nervos team to finalize the latest documentation before conducting a unified ingestion test on the complete dataset.
3. Milestones Achieved This Week
This week marks the leap from “database architecture design” to a “runnable data pipeline.” With a single command, we can now ingest forum posts in full into the dual-layer database and retrieve them via multi-way fusion retrieval. This lays the groundwork for integrating more data sources (RFC documents, GitHub code) and for allowing LangGraph workflows to invoke real retrieval.
4. Known Issues
The official documentation data source is not yet ready. Verification of the current retrieval effectiveness still relies on a small amount of sample data, and a systematic retrieval quality assessment on a complete corpus has not yet been conducted. The BM25 index is currently built in memory and needs to be reloaded after process restarts; a persistence solution needs to be considered as the data volume grows.
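One low-effort direction for the persistence concern is to serialize the prebuilt index to disk on shutdown and load it on startup instead of rebuilding. A sketch with stdlib pickle, using a plain dict as a stand-in for the real in-memory index object (whether the actual rank_bm25 index pickles cleanly is an assumption to verify):

```python
import pickle
import pathlib
import tempfile

def save_index(index, path):
    """Serialize the in-memory index to disk."""
    pathlib.Path(path).write_bytes(pickle.dumps(index))

def load_index(path):
    """Restore the index after a process restart."""
    return pickle.loads(pathlib.Path(path).read_bytes())

# Stand-in for the tokenized corpus / index built at ingestion time.
index = {"doc1": ["nervos", "ckb"], "doc2": ["cell", "model"]}
path = pathlib.Path(tempfile.gettempdir()) / "bm25_index.pkl"
save_index(index, path)
restored = load_index(path)
```

The persisted file would need invalidation whenever the corpus changes, e.g. by keying the filename on a corpus-level content hash.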
5. Plan for Next Week
1. Continue Refining the Agent Core
The main pipeline of the graph engine has taken shape, but the retrieval nodes still use Mock data. Next week’s focus will be connecting the workflow with the real data layer:
- Retrieval Node Replacement: Replace the Mock implementation in the RetrievalExecutor node with the real MultiRetriever, enabling the graph engine to recall actual evidence from the Qdrant + SQLite dual-layer database.
- Context Injection: Implement the linkage between MemoryService and the graph state, allowing the Agent to read the user’s historical memory and channel context in multi-turn dialogues, thereby enriching retrieval filter conditions.
- End-to-End Smoke Testing: Run the complete pipeline of “user query → retrieval → evidence scoring → answer generation” with real data to confirm the fundamental stability of each node under real inputs.
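One way to keep the mock-to-real swap low-risk is to inject the retrieval backend into a node factory, so the node logic itself never changes. A hypothetical sketch (the real RetrievalExecutor and graph-state interfaces may well differ):

```python
from typing import Any, Callable, Dict, List

# Hypothetical shapes; the project's actual state and retriever types differ.
State = Dict[str, Any]
Retriever = Callable[[str], List[dict]]

def make_retrieval_node(retriever: Retriever) -> Callable[[State], State]:
    """Build a graph node whose backend is injected, so the mock can be
    swapped for the real MultiRetriever without touching node logic."""
    def node(state: State) -> State:
        evidence = retriever(state["query"])
        return {**state, "evidence": evidence}
    return node

# Mock backend used while wiring up the graph:
def mock_retriever(query: str) -> List[dict]:
    return [{"text": f"stub evidence for: {query}", "score": 1.0}]

node = make_retrieval_node(mock_retriever)
result = node({"query": "what is CKB?"})

# Later, the same factory takes the real retriever, e.g.:
# node = make_retrieval_node(multi_retriever.search)
```

With this shape, the smoke test can run the identical graph twice, once per backend, and diff only the evidence payloads.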
2. Develop Telegram MCP
Begin advancing the Bot integration layer, starting with Telegram as the first platform. We will use MCP (Model Context Protocol) to encapsulate the Telegram messaging channel into standard tool interfaces:
- Basic TG Bot Setup: Implement message sending and receiving based on the Telegram Bot API, handling message format differences between private chats and group scenarios.
- MCP Transport Implementation: Build on the existing MCPTransportAdapter abstraction by adding a TelegramTransportAdapter, encapsulating the receiving of and replying to Telegram messages as a tool interface directly callable by the graph engine.
- Message Format Adaptation: Connect with ResponseNormalizer to convert the FinalResponse output by the graph engine into the Markdown format supported by Telegram, handling segmented sending for long replies.
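The segmented-sending logic for long replies can be sketched independently of the Bot framework. Telegram caps a single text message at 4,096 characters; the splitter below prefers paragraph boundaries and hard-wraps only oversized paragraphs (a sketch, not the planned implementation):

```python
TELEGRAM_MAX_LEN = 4096  # Telegram's limit for one text message

def split_for_telegram(text: str, limit: int = TELEGRAM_MAX_LEN) -> list[str]:
    """Split a long reply into chunks that fit Telegram's message limit,
    breaking at paragraph boundaries where possible."""
    if len(text) <= limit:
        return [text]
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single paragraph longer than the limit is hard-wrapped.
            while len(para) > limit:
                chunks.append(para[:limit])
                para = para[limit:]
            current = para
    if current:
        chunks.append(current)
    return chunks
```

A further refinement, left out here, is to avoid splitting inside Markdown code fences, since an unterminated fence breaks Telegram’s Markdown parsing for that message.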