link消息入库完成;本地有文件的前提下 图片及各类文本 均能正常入库

This commit is contained in:
astro 2026-04-22 16:56:27 +08:00
parent a3789232d4
commit 96f31bb549
12 changed files with 2009 additions and 0 deletions

15
.env.sample Normal file
View File

@ -0,0 +1,15 @@
# MySQL
MYSQL_HOST=localhost
MYSQL_PORT=3306
MYSQL_USER=root
MYSQL_PASSWORD=your_password
MYSQL_DATABASE=your_database
MYSQL_TABLE=wechat_group_message
# Tencent COS
COS_SECRET_ID=your_cos_secret_id
COS_SECRET_KEY=your_cos_secret_key
COS_BUCKET=your_bucket_name
COS_REGION=ap-beijing
COS_DOWNLOAD_DOMAIN=your_domain.com
COS_BASE_PATH=your/cos/path

1
.gitignore vendored
View File

@ -35,6 +35,7 @@ Thumbs.db
*.db-shm
# Sensitive data — NEVER commit
.env
*.json
!pyproject.toml
!npm/**/package.json

45
CLAUDE.md Normal file
View File

@ -0,0 +1,45 @@
# wechat-cli 项目
## 项目概述
基于本地微信数据库的 CLI 查询工具支持会话、消息、联系人、群成员等查询。AI-first 设计,输出结构化 JSON。
## 技术栈
- Python 3.10+, Click, pycryptodome, zstandard
- SQLCipher 4 解密AES-256-CBC
- npm 分发 + PyInstaller 打包
## 目录结构
- `wechat_cli/` — 主 CLI 包commands/, core/, keys/, output/
- `skills/` — Claude Code skillsexport-chat, tencent-cos-upload
- `collector/` — 群聊消息自动采集器
- `collect_chats.py` — 采集器入口脚本
- `npm/` — npm 分发包
## 群聊采集器collector/
自动采集微信群聊消息的子系统:
- `collector/config.py` — 配置管理MySQL、COS、扫描策略、回溯补录
- `collector/storage.py` — MySQL 存储层wechat_group_message 表)
- `collector/wechat_adapter.py` — 封装 wechat_cli.core 的适配层,包含 XML 元信息提取
- `collector/scanner.py` — 优先级队列扫描调度 + COS 媒体上传
- `collector/backfill.py` — 媒体回溯补录watchdog 实时监听 + 定时兜底扫描)
## 关键约定
- wechat-cli 是只读工具,不修改微信数据
- 群聊通过 `@chatroom` 后缀识别
- 消息可跨多个 `message_N.db` 文件
- DBCache 使用 mtime 机制缓存解密副本到 `/tmp/wechat_cli_cache/`
- 采集器独立维护状态,不干扰 `~/.wechat-cli/last_check.json`
- 非文本消息(文件/视频/链接等)从 XML 中提取元信息存入 content 字段,不依赖本地文件下载
- 群聊消息 content_raw 格式为 `wxid:\n<xml>`,解析 XML 时需先剥离发送者前缀
- 聊天记录合并转发app_type=19解析 recorditem XML格式化为多行纯文本
- 引用/回复消息app_type=57提取 refermsg 中的 svrid 建立消息关联content 格式化为 "正文\n ↳ 回复 发送者: 被引用内容"
- 数据库通过 `svr_msg_id`(微信 server_id`refer_msg_svrid`refermsg/svrid实现引用消息的关联查询
- 采集器扫描策略: hot=15s, warm=30s, cold 退避至 120sbackoff 1.2),保证 ≤2min 入库
- 媒体文件只采集原始可读文件,不做 .dat 解密
- 图片: 从 `temp/RWTemp/YYYY-MM/{md5}.ext` 获取(微信查看图片后自动生成)
- 文件: 从 `msg/file/YYYY-MM/filename` 获取
- 视频: 从 `msg/video/YYYY-MM/{rawmd5}.mp4` 获取
- 图片消息 content 格式为 `[图片] {md5}`md5 是图片原始内容的 hash来自消息 XML 的 md5 属性)
- **重要**: RWTemp 中的图片文件名是派生 hash与消息 XML 中的 md5/aeskey 等属性均不对应。匹配图片必须通过 `hashlib.md5(文件内容)` 计算后与 XML md5 属性比对,不能用文件名匹配
- 回溯补录: watchdog 监听 temp/RWTemp + msg/file + msg/video 目录 + 每 10 分钟定时扫描 media_url=NULL 记录7天内
- 默认日志级别 DEBUG可通过配置文件 `~/.wechat-cli/collect-chats/config.json` 修改

93
COLLECT_README.md Normal file
View File

@ -0,0 +1,93 @@
# 微信群聊消息采集器
自动扫描所有微信群聊,增量采集新消息,存入 MySQL媒体文件上传腾讯 COS。
## 前置条件
0. 安装 wechat-cli
cd wechat-cli
pip install -e .
微信版本建议 4.1.6 或 4.1.8
1. **微信已登录**`wechat-cli` 已初始化:
```bash
sudo wechat-cli init
```
2. **安装依赖**
```bash
pip3 install pymysql cos-python-sdk-v5
```
## 快速启动
```bash
cd /Users/makee/Documents/cris/workspace/wechat-cli
# 查看所有群聊
python3 collect_chats.py groups
# 一次性扫描(测试用)
python3 collect_chats.py scan
# 守护模式(持续运行)
python3 collect_chats.py daemon
# 查看采集统计
python3 collect_chats.py status
# 停止采集某个群
python3 collect_chats.py disable "广告群"
# 恢复采集
python3 collect_chats.py enable "广告群"
```
## 扫描策略
采集器使用自适应频率扫描:
| 活跃度 | 条件 | 扫描间隔 |
|--------|------|----------|
| HOT | 最新消息 < 5 分钟 | 60 |
| WARM | 最新消息 < 1 小时 | 5 分钟 |
| COLD | 无近期消息 | 逐步退避至 30 分钟 |
- 每轮最多扫描 5 个群,群间随机延迟 0~3 秒
- 每 5 分钟重新发现新群聊
- 支持白名单/黑名单过滤(正则匹配)
## 自定义配置
创建 `~/.wechat-cli/collect-chats/config.json`(可选,所有字段都有默认值):
```json
{
"min_interval": 60,
"base_interval": 300,
"max_interval": 1800,
"batch_size": 5,
"messages_per_scan": 200,
"whitelist": [],
"blacklist": ["广告群", ".*测试.*"]
}
```
## 数据存储
- **消息** → MySQL `vala_test.wechat_group_message`
- **媒体文件**(语音/视频/文件)→ 腾讯 COS `vala_llm/user_feedback/wechat/类型/YYYY-MM/`
- **图片** → 暂不采集(微信 V2 AES 加密限制)
## 后台运行
```bash
# nohup 方式
nohup python3 collect_chats.py daemon > collector.log 2>&1 &
# 查看日志
tail -f collector.log
# 停止
kill $(pgrep -f "collect_chats.py daemon")
```

249
collect_chats.py Normal file
View File

@ -0,0 +1,249 @@
#!/usr/bin/env python3
"""
微信群聊消息自动采集器
用法:
python3 collect_chats.py scan # 一次性扫描所有群聊
python3 collect_chats.py daemon # 持续运行(自适应频率)
python3 collect_chats.py status # 查看采集统计
python3 collect_chats.py groups # 列出所有群聊
python3 collect_chats.py enable "群名" # 启用某群
python3 collect_chats.py disable "群名" # 停止某群
"""
import argparse
import json
import logging
import os
import signal
import sys
import time
from datetime import datetime
from collector.config import CollectorConfig
from collector.storage import MessageStorage
from collector.wechat_adapter import WeChatAdapter
from collector.scanner import GroupScanner, _init_cos_uploader
from collector.backfill import MediaBackfiller
log = logging.getLogger("collector")
def setup_logging(level="INFO"):
fmt = "%(asctime)s [%(levelname)s] %(message)s"
logging.basicConfig(level=getattr(logging, level, logging.INFO),
format=fmt, datefmt="%Y-%m-%d %H:%M:%S")
def cmd_scan(args):
"""一次性扫描"""
config = CollectorConfig.load(args.config)
setup_logging(config.log_level)
log.info("初始化微信数据适配器...")
adapter = WeChatAdapter()
log.info("连接 MySQL...")
storage = MessageStorage(config)
storage.ensure_table()
scanner = GroupScanner(adapter, storage, config)
log.info("发现群聊...")
scanner.discover_groups()
log.info("开始扫描...")
total = 0
# 循环扫描直到所有群都扫完一遍
rounds = 0
while True:
n = scanner.scan_next_batch()
total += n
rounds += 1
# 如果没有到期的群了,说明本轮扫完
if scanner.time_to_next() > 0:
break
log.info("扫描完成: 共 %d 轮, 新增 %d 条消息", rounds, total)
# 打印统计
status = scanner.get_status()
print(json.dumps(status, ensure_ascii=False, indent=2))
storage.close()
def cmd_daemon(args):
"""守护模式 — 持续运行"""
config = CollectorConfig.load(args.config)
setup_logging(config.log_level)
running = True
def on_signal(signum, _):
nonlocal running
log.info("收到信号 %d, 准备退出...", signum)
running = False
signal.signal(signal.SIGINT, on_signal)
signal.signal(signal.SIGTERM, on_signal)
log.info("=" * 50)
log.info("微信群聊采集器 - 守护模式启动")
log.info("扫描策略: hot=%ds, warm=%ds, cold 退避至 %ds",
int(config.min_interval), int(config.base_interval), int(config.max_interval))
log.info("批次=%d, 每群最多=%d条/次", config.batch_size, config.messages_per_scan)
log.info("=" * 50)
adapter = WeChatAdapter()
storage = MessageStorage(config)
storage.ensure_table()
scanner = GroupScanner(adapter, storage, config)
# 启动回溯补录器(使用独立的 MySQL 连接,避免多线程冲突)
backfiller = None
if config.backfill_enabled:
wechat_base = os.path.dirname(adapter.db_dir) if adapter.db_dir else ""
backfill_storage = MessageStorage(config)
backfill_storage.ensure_table()
backfiller = MediaBackfiller(
storage=backfill_storage, config=config,
wechat_base_dir=wechat_base
)
try:
cos = _init_cos_uploader(config)
backfiller.set_cos_uploader(cos)
except Exception as e:
log.warning("COS 初始化失败,回溯补录上传功能不可用: %s", e)
backfiller.start()
# 初始发现
scanner.discover_groups()
cycle = 0
while running:
cycle += 1
# 定期重新发现群聊
if scanner.should_discover():
try:
adapter.refresh_names()
scanner.discover_groups()
except Exception as e:
log.error("群聊发现失败: %s", e)
# 扫描批次
try:
n = scanner.scan_next_batch()
if n > 0:
log.info("--- 第 %d 轮: 新增 %d 条消息 ---", cycle, n)
except Exception as e:
log.error("扫描异常: %s", e)
# 智能休眠
sleep_time = scanner.time_to_next()
if sleep_time > 0 and running:
time.sleep(sleep_time)
log.info("采集器已停止")
if backfiller:
backfiller.stop()
storage.close()
def cmd_status(args):
"""查看采集统计"""
config = CollectorConfig.load(args.config)
storage = MessageStorage(config)
storage.ensure_table()
total = storage.get_total_count()
groups = storage.get_group_stats()
print(f"\n消息总数: {total}")
print(f"群聊数量: {len(groups)}")
print("-" * 60)
for g in groups:
last_dt = datetime.fromtimestamp(g["last_ts"]).strftime("%m-%d %H:%M") if g["last_ts"] else ""
print(f" {g['group_name']:20s} {g['total']:>6d} 条 最新: {last_dt}")
storage.close()
def cmd_groups(args):
"""列出所有群聊"""
config = CollectorConfig.load(args.config)
setup_logging(config.log_level)
adapter = WeChatAdapter()
sessions = adapter.list_group_sessions(limit=500)
print(f"\n共发现 {len(sessions)} 个群聊:")
print("-" * 60)
for s in sessions:
ts = datetime.fromtimestamp(s["last_timestamp"]).strftime("%m-%d %H:%M")
unread_tag = f" ({s['unread']}条未读)" if s["unread"] else ""
print(f" [{ts}] {s['display_name']}{unread_tag}")
print(f" {s['username']}")
def cmd_enable(args):
"""启用/停用群聊(通过修改配置黑名单)"""
config = CollectorConfig.load(args.config)
group_name = args.group_name
if group_name in config.blacklist:
config.blacklist.remove(group_name)
config.save(args.config)
print(f"已从黑名单移除: {group_name}")
else:
print(f"{group_name} 不在黑名单中")
def cmd_disable(args):
config = CollectorConfig.load(args.config)
group_name = args.group_name
if group_name not in config.blacklist:
config.blacklist.append(group_name)
config.save(args.config)
print(f"已加入黑名单: {group_name}")
else:
print(f"{group_name} 已在黑名单中")
def main():
parser = argparse.ArgumentParser(
description="微信群聊消息自动采集器",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument("--config", default=None, help="配置文件路径")
sub = parser.add_subparsers(dest="command")
sub.add_parser("scan", help="一次性扫描所有群聊")
sub.add_parser("daemon", help="启动守护模式持续采集")
sub.add_parser("status", help="查看采集统计")
sub.add_parser("groups", help="列出所有群聊")
p_enable = sub.add_parser("enable", help="启用某群采集")
p_enable.add_argument("group_name", help="群聊名称")
p_disable = sub.add_parser("disable", help="停止某群采集")
p_disable.add_argument("group_name", help="群聊名称")
args = parser.parse_args()
if not args.command:
parser.print_help()
sys.exit(1)
commands = {
"scan": cmd_scan,
"daemon": cmd_daemon,
"status": cmd_status,
"groups": cmd_groups,
"enable": cmd_enable,
"disable": cmd_disable,
}
commands[args.command](args)
if __name__ == "__main__":
main()

0
collector/__init__.py Normal file
View File

399
collector/backfill.py Normal file
View File

@ -0,0 +1,399 @@
"""媒体回溯补录模块 — watchdog 实时监听 + 定时兜底扫描
当用户点开微信中的图片/视频/文件后本地会生成可读文件
本模块检测这些新文件上传到 COS并回填 media_url 到数据库中之前为 NULL 的记录
图片: temp/RWTemp/YYYY-MM/{md5}.ext微信下载/查看后自动生成
文件/视频/音频: msg/file/YYYY-MM/filename微信接收后自动生成
"""
import hashlib
import logging
import os
import re
import threading
import time
from datetime import datetime
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler, FileCreatedEvent
from .config import CollectorConfig
from .storage import MessageStorage
log = logging.getLogger(__name__)
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"}
VIDEO_EXTS = {".mp4", ".mov", ".avi"}
FILE_EXTS_SKIP = {".tmp", ".downloading"}
def _is_readable_media(filepath: str) -> str | None:
"""判断文件是否为可读媒体文件,返回 'image'/'video'/'file' 或 None"""
ext = os.path.splitext(filepath)[1].lower()
if ext in FILE_EXTS_SKIP:
return None
basename = os.path.basename(filepath)
if "_thumb" in basename:
return None
if ext in IMAGE_EXTS:
return "image"
if ext in VIDEO_EXTS:
return "video"
if ext and ext not in FILE_EXTS_SKIP:
return "file"
return None
def _is_in_rwtemp(filepath: str) -> bool:
"""判断文件是否在 RWTemp 目录内"""
return "/temp/RWTemp/" in filepath
def _file_md5(filepath: str) -> str:
"""计算文件内容的 md5"""
h = hashlib.md5()
with open(filepath, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
h.update(chunk)
return h.hexdigest()
def _extract_md5_from_content(content: str) -> str | None:
"""从 DB content 字段提取图片 md5: '[图片] abc123...'"""
if not content:
return None
m = re.search(r'\[图片\]\s+([a-f0-9]{32})', content)
return m.group(1) if m else None
class MediaFileHandler(FileSystemEventHandler):
"""watchdog 事件处理器 — 新文件出现时触发回溯匹配"""
def __init__(self, backfiller: "MediaBackfiller"):
super().__init__()
self._backfiller = backfiller
def on_created(self, event):
if event.is_directory:
return
if not isinstance(event, FileCreatedEvent):
return
threading.Timer(2.0, self._handle_new_file, args=[event.src_path]).start()
def _handle_new_file(self, filepath: str):
try:
if not os.path.isfile(filepath):
return
if os.path.getsize(filepath) < 100:
return
media_type = _is_readable_media(filepath)
if not media_type:
return
# RWTemp 中的都是图片
if _is_in_rwtemp(filepath):
media_type = "image"
log.info("[watchdog] 检测到新媒体文件: %s (type=%s)", filepath, media_type)
self._backfiller.try_backfill_file(filepath, media_type)
except Exception as e:
log.warning("[watchdog] 处理文件异常 %s: %s", filepath, e)
class MediaBackfiller:
"""媒体回溯补录器
两层保障:
1. watchdog 实时监听 RWTemp(图片) msg/file(文件)新文件立即匹配上传
2. 定时扫描 DB media_url=NULL 的记录重新尝试路径解析
"""
def __init__(self, storage: MessageStorage, config: CollectorConfig,
cos_uploader=None, wechat_base_dir: str = ""):
self._storage = storage
self._cfg = config
self._cos = cos_uploader
self._wechat_base = wechat_base_dir
self._observer: Observer | None = None
self._scan_thread: threading.Thread | None = None
self._running = False
self._upload_lock = threading.Lock()
def start(self):
"""启动 watchdog 监听 + 定时扫描线程"""
if self._running:
return
self._running = True
watch_dirs = self._get_watch_dirs()
if watch_dirs:
self._observer = Observer()
handler = MediaFileHandler(self)
for d in watch_dirs:
self._observer.schedule(handler, d, recursive=True)
log.info("[backfill] 监听目录: %s", d)
self._observer.start()
log.info("[backfill] watchdog 已启动, 监听 %d 个目录", len(watch_dirs))
else:
log.warning("[backfill] 未找到可监听的微信媒体目录")
self._scan_thread = threading.Thread(
target=self._periodic_scan_loop, daemon=True, name="backfill-scan"
)
self._scan_thread.start()
log.info("[backfill] 定时扫描已启动, 间隔 %ds", int(self._cfg.backfill_interval))
def stop(self):
"""停止所有后台线程"""
self._running = False
if self._observer:
self._observer.stop()
self._observer.join(timeout=5)
self._observer = None
if self._scan_thread:
self._scan_thread.join(timeout=10)
self._scan_thread = None
log.info("[backfill] 已停止")
def set_cos_uploader(self, cos):
"""延迟设置 COS 上传器"""
self._cos = cos
def _get_watch_dirs(self) -> list[str]:
"""获取需要监听的目录: temp/RWTemp(图片) + msg/file(文件) + msg/video(视频)"""
if not self._wechat_base:
return []
dirs = []
rwtemp = os.path.join(self._wechat_base, "temp", "RWTemp")
if os.path.isdir(rwtemp):
dirs.append(rwtemp)
file_dir = os.path.join(self._wechat_base, "msg", "file")
if os.path.isdir(file_dir):
dirs.append(file_dir)
video_dir = os.path.join(self._wechat_base, "msg", "video")
if os.path.isdir(video_dir):
dirs.append(video_dir)
return dirs
def try_backfill_file(self, filepath: str, media_type: str):
"""尝试将新文件匹配到 DB 中的 pending 记录并上传"""
if not self._cos:
log.debug("[backfill] COS 未配置, 跳过上传")
return
with self._upload_lock:
pending = self._storage.get_pending_media_messages(
lookback_days=self._cfg.backfill_lookback_days, limit=500
)
matched = self._match_file_to_record(filepath, media_type, pending)
if matched:
self._do_upload_and_update(filepath, matched)
def _match_file_to_record(self, filepath: str, media_type: str,
records: list[dict]) -> dict | None:
"""将文件路径匹配到数据库记录"""
if media_type == "image":
return self._match_image(filepath, records)
elif media_type == "video":
return self._match_video(filepath, records)
else:
return self._match_file(filepath, records)
def _match_image(self, filepath: str, records: list[dict]) -> dict | None:
"""图片匹配: 计算文件内容 md5匹配 DB content 中的 md5"""
try:
file_hash = _file_md5(filepath)
except Exception:
return None
for rec in records:
if rec["msg_type"] != "image":
continue
rec_md5 = _extract_md5_from_content(rec.get("content", ""))
if rec_md5 and rec_md5 == file_hash:
log.info("[backfill] 图片内容md5匹配: %s → record id=%d", file_hash, rec["id"])
return rec
return None
def _match_file(self, filepath: str, records: list[dict]) -> dict | None:
"""文件匹配: 通过文件名匹配 DB content 中的标题"""
filename = os.path.basename(filepath)
date_prefix = self._extract_date_from_path(filepath)
for rec in records:
if rec["msg_type"] != "file":
continue
if date_prefix:
rec_date = datetime.fromtimestamp(rec["msg_timestamp"]).strftime("%Y-%m")
if rec_date != date_prefix:
continue
content = rec.get("content", "") or ""
title = content.split(" (")[0].strip() if content else ""
if title and (title in filename or filename in title):
log.info("[backfill] 文件匹配: %s → record id=%d", filename, rec["id"])
return rec
return None
def _match_video(self, filepath: str, records: list[dict]) -> dict | None:
"""视频匹配: 通过文件名中的 rawmd5 匹配同月的 video 记录"""
filename = os.path.splitext(os.path.basename(filepath))[0].lower()
date_prefix = self._extract_date_from_path(filepath)
for rec in records:
if rec["msg_type"] != "video":
continue
if date_prefix:
rec_date = datetime.fromtimestamp(rec["msg_timestamp"]).strftime("%Y-%m")
if rec_date != date_prefix:
continue
if re.fullmatch(r'[a-f0-9]{32}', filename):
log.info("[backfill] 视频匹配: %s → record id=%d", filename, rec["id"])
return rec
return None
def _do_upload_and_update(self, filepath: str, record: dict):
"""上传文件到 COS 并更新 DB需在锁内调用"""
if not self._cos or not os.path.isfile(filepath):
return
try:
ext = os.path.splitext(filepath)[1]
filename = os.path.basename(filepath)
safe_name = re.sub(r'[^\w.\-]', '_', filename)
dt = datetime.fromtimestamp(record["msg_timestamp"])
date_prefix = dt.strftime("%Y-%m")
cos_key = f"{self._cfg.cos_base_path}/{record['msg_type']}/{date_prefix}/{safe_name}"
url = self._cos.upload(filepath, cos_key)
if url:
updated = self._storage.update_media_url(record["id"], url)
if updated:
log.info("[backfill] 回溯成功: id=%d, type=%s%s",
record["id"], record["msg_type"], url)
else:
log.debug("[backfill] 记录已被更新(可能重复): id=%d", record["id"])
except Exception as e:
log.warning("[backfill] 上传失败: %s%s", filepath, e)
def _periodic_scan_loop(self):
"""定时扫描兜底线程"""
while self._running:
try:
self._do_periodic_scan()
except Exception as e:
log.error("[backfill] 定时扫描异常: %s", e)
wait = self._cfg.backfill_interval
elapsed = 0
while elapsed < wait and self._running:
time.sleep(min(5.0, wait - elapsed))
elapsed += 5.0
def _do_periodic_scan(self):
"""执行一次全量回溯扫描"""
if not self._cos:
return
with self._upload_lock:
pending = self._storage.get_pending_media_messages(
lookback_days=self._cfg.backfill_lookback_days
)
if not pending:
return
log.info("[backfill] 定时扫描: %d 条待补录记录", len(pending))
filled = 0
for rec in pending:
if not self._running:
break
filepath = self._try_resolve_path(rec)
if filepath and os.path.isfile(filepath):
with self._upload_lock:
self._do_upload_and_update(filepath, rec)
filled += 1
if filled > 0:
log.info("[backfill] 定时扫描完成: 成功补录 %d", filled)
def _try_resolve_path(self, record: dict) -> str | None:
"""尝试为一条 pending 记录解析本地文件路径"""
if not self._wechat_base:
return None
msg_type = record["msg_type"]
msg_ts = record["msg_timestamp"]
dt = datetime.fromtimestamp(msg_ts)
date_prefix = dt.strftime("%Y-%m")
if msg_type == "image":
return self._resolve_image_path(date_prefix, record)
elif msg_type == "video":
return self._resolve_video_path(date_prefix)
elif msg_type == "file":
return self._resolve_file_path(date_prefix, record)
return None
def _resolve_image_path(self, date_prefix: str, record: dict) -> str | None:
"""在 temp/RWTemp/YYYY-MM/ 中按文件内容 md5 查找图片"""
img_md5 = _extract_md5_from_content(record.get("content", ""))
if not img_md5:
return None
rwtemp_dir = os.path.join(self._wechat_base, "temp", "RWTemp", date_prefix)
if not os.path.isdir(rwtemp_dir):
return None
for f in os.listdir(rwtemp_dir):
fp = os.path.join(rwtemp_dir, f)
ext = os.path.splitext(f)[1].lower()
if ext not in (".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"):
continue
if not os.path.isfile(fp):
continue
try:
if _file_md5(fp) == img_md5:
return fp
except Exception:
continue
return None
def _resolve_video_path(self, date_prefix: str) -> str | None:
"""在 msg/video/YYYY-MM/ 中查找 mp4"""
video_dir = os.path.join(self._wechat_base, "msg", "video", date_prefix)
if not os.path.isdir(video_dir):
return None
for f in os.listdir(video_dir):
if f.endswith(".mp4") and os.path.isfile(os.path.join(video_dir, f)):
return os.path.join(video_dir, f)
return None
def _resolve_file_path(self, date_prefix: str, record: dict) -> str | None:
"""在 msg/file/YYYY-MM/ 中按文件名查找"""
content = record.get("content", "") or ""
title = content.split(" (")[0].strip() if content else ""
if not title:
return None
file_dir = os.path.join(self._wechat_base, "msg", "file", date_prefix)
if not os.path.isdir(file_dir):
return None
target = os.path.join(file_dir, title)
if os.path.isfile(target):
return target
for f in os.listdir(file_dir):
if title in f or f in title:
fp = os.path.join(file_dir, f)
if os.path.isfile(fp):
return fp
return None
@staticmethod
def _extract_date_from_path(filepath: str) -> str | None:
m = re.search(r'(\d{4}-\d{2})', filepath)
return m.group(1) if m else None

69
collector/config.py Normal file
View File

@ -0,0 +1,69 @@
"""采集器配置"""
import json
import os
from dataclasses import dataclass, field, asdict
from dotenv import load_dotenv
load_dotenv()
DEFAULT_CONFIG_PATH = os.path.expanduser("~/.wechat-cli/collect-chats/config.json")
@dataclass
class CollectorConfig:
# ---- MySQL ----
mysql_host: str = os.environ.get("MYSQL_HOST", "localhost")
mysql_port: int = int(os.environ.get("MYSQL_PORT", "3306"))
mysql_user: str = os.environ.get("MYSQL_USER", "root")
mysql_password: str = os.environ.get("MYSQL_PASSWORD", "")
mysql_database: str = os.environ.get("MYSQL_DATABASE", "")
mysql_table: str = os.environ.get("MYSQL_TABLE", "wechat_group_message")
# ---- 腾讯 COS ----
cos_secret_id: str = os.environ.get("COS_SECRET_ID", "")
cos_secret_key: str = os.environ.get("COS_SECRET_KEY", "")
cos_bucket: str = os.environ.get("COS_BUCKET", "")
cos_region: str = os.environ.get("COS_REGION", "ap-beijing")
cos_download_domain: str = os.environ.get("COS_DOWNLOAD_DOMAIN", "")
cos_base_path: str = os.environ.get("COS_BASE_PATH", "")
# ---- 扫描策略 ----
min_interval: float = 15.0 # hot 群扫描间隔(秒)
base_interval: float = 30.0 # warm 群扫描间隔
max_interval: float = 120.0 # cold 群最大间隔(保证 ≤2min 入库)
backoff_factor: float = 1.2 # cold 退避系数
batch_size: int = 10 # 每轮最多扫描群数
messages_per_scan: int = 200 # 每群每次最多拉取消息数
jitter_max: float = 1.0 # 群间随机延迟上限(秒)
cycle_sleep: float = 5.0 # 轮次间休眠(秒)
discovery_interval: float = 180.0 # 群聊发现间隔(秒)
hot_threshold: int = 300 # 最新消息 < N秒 算 hot
warm_threshold: int = 3600 # 最新消息 < N秒 算 warm
# ---- 过滤 ----
whitelist: list = field(default_factory=list) # 空=全部采集
blacklist: list = field(default_factory=list) # 跳过的群(支持正则)
# ---- 回溯补录 ----
backfill_interval: float = 600.0 # 定时扫描间隔(秒), 默认 10 分钟
backfill_lookback_days: int = 7 # 回溯天数
backfill_enabled: bool = True # 是否启用回溯补录
# ---- 其他 ----
log_level: str = "DEBUG"
@classmethod
def load(cls, path=None):
path = path or DEFAULT_CONFIG_PATH
if os.path.isfile(path):
with open(path, encoding="utf-8") as f:
data = json.load(f)
return cls(**{k: v for k, v in data.items() if k in cls.__dataclass_fields__})
return cls()
def save(self, path=None):
path = path or DEFAULT_CONFIG_PATH
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w", encoding="utf-8") as f:
json.dump(asdict(self), f, ensure_ascii=False, indent=2)

312
collector/scanner.py Normal file
View File

@ -0,0 +1,312 @@
"""扫描调度器 — 群聊发现 + 优先级队列 + 自适应频率 + COS 上传"""
import heapq
import logging
import os
import re
import sys
import time
import random
from datetime import datetime
from .config import CollectorConfig
from .storage import MessageStorage
from .wechat_adapter import WeChatAdapter, MEDIA_TYPES
log = logging.getLogger(__name__)
def _init_cos_uploader(config: CollectorConfig):
"""延迟初始化 COS 上传器(避免未安装 SDK 时导入失败)"""
cos_script = os.path.join(
os.path.dirname(os.path.dirname(__file__)),
"skills", "tencent-cos-upload.xiaokui", "scripts",
)
if cos_script not in sys.path:
sys.path.insert(0, cos_script)
from cos_upload import CosUploader
return CosUploader(
secret_id=config.cos_secret_id,
secret_key=config.cos_secret_key,
region=config.cos_region,
bucket=config.cos_bucket,
domain=config.cos_download_domain,
)
def _build_cos_key(config: CollectorConfig, msg_type: str, create_time: int, filename: str) -> str:
"""构建 COS 存储路径: {base}/类型/YYYY-MM/文件名"""
dt = datetime.fromtimestamp(create_time)
date_prefix = dt.strftime("%Y-%m")
# 文件名清理:去除中文和特殊字符,保留 ASCII
safe_name = re.sub(r'[^\w.\-]', '_', filename)
return f"{config.cos_base_path}/{msg_type}/{date_prefix}/{safe_name}"
class GroupState:
"""单个群聊的扫描状态"""
__slots__ = (
"username", "display_name", "next_scan_time",
"scan_interval", "activity_level", "last_message_at",
"consecutive_empty",
)
def __init__(self, username, display_name, next_scan_time=0,
scan_interval=300.0, activity_level="cold",
last_message_at=0, consecutive_empty=0):
self.username = username
self.display_name = display_name
self.next_scan_time = next_scan_time
self.scan_interval = scan_interval
self.activity_level = activity_level
self.last_message_at = last_message_at
self.consecutive_empty = consecutive_empty
def __lt__(self, other):
return self.next_scan_time < other.next_scan_time
class GroupScanner:
def __init__(self, adapter: WeChatAdapter, storage: MessageStorage, config: CollectorConfig):
self._adapter = adapter
self._storage = storage
self._cfg = config
self._groups: dict[str, GroupState] = {} # username -> GroupState
self._queue: list[GroupState] = [] # heapq
self._cos = None
self._last_discovery = 0
def _get_cos(self):
if self._cos is None:
try:
self._cos = _init_cos_uploader(self._cfg)
log.info("COS 上传器初始化成功")
except Exception as e:
log.warning("COS 上传器初始化失败,媒体文件将不会上传: %s", e)
return self._cos
def _matches_filter(self, display_name: str, username: str, patterns: list) -> bool:
"""检查群名或 username 是否匹配过滤模式列表"""
for pat in patterns:
if pat == display_name or pat == username:
return True
try:
if re.search(pat, display_name) or re.search(pat, username):
return True
except re.error:
pass
return False
def discover_groups(self) -> list[str]:
"""发现所有群聊,应用过滤规则,返回新增群列表"""
sessions = self._adapter.list_group_sessions(limit=500)
new_groups = []
for s in sessions:
username = s["username"]
display = s["display_name"]
# 白名单过滤
if self._cfg.whitelist:
if not self._matches_filter(display, username, self._cfg.whitelist):
continue
# 黑名单过滤
if self._cfg.blacklist:
if self._matches_filter(display, username, self._cfg.blacklist):
continue
if username not in self._groups:
# 新发现的群
last_ts = self._storage.get_last_msg_timestamp(username)
state = GroupState(
username=username,
display_name=display,
next_scan_time=time.time(), # 立即扫描
scan_interval=self._cfg.base_interval,
last_message_at=last_ts,
)
self._groups[username] = state
heapq.heappush(self._queue, state)
new_groups.append(username)
log.info("发现群聊: %s (%s)", display, username)
else:
# 已知群,更新名称
self._groups[username].display_name = display
self._last_discovery = time.time()
log.info("群聊发现完成: 共 %d 个群, 新增 %d", len(self._groups), len(new_groups))
return new_groups
def scan_next_batch(self) -> int:
"""扫描到期的群聊批次,返回本批新增消息总数"""
now = time.time()
total_new = 0
scanned = 0
while scanned < self._cfg.batch_size and self._queue:
# peek
if self._queue[0].next_scan_time > now:
break
state = heapq.heappop(self._queue)
# 可能已经被移除
if state.username not in self._groups:
continue
start_t = time.time()
try:
new_count = self._scan_single_group(state)
duration_ms = int((time.time() - start_t) * 1000)
total_new += new_count
self._update_activity(state, new_count)
log.info(
"[%s] %s: +%d 条, 耗时 %dms, 下次 %ds 后 (%s)",
state.activity_level.upper(), state.display_name,
new_count, duration_ms, int(state.scan_interval),
state.username,
)
except Exception as e:
duration_ms = int((time.time() - start_t) * 1000)
log.error("[ERROR] %s 扫描失败: %s", state.display_name, e)
# 出错后延长间隔
state.scan_interval = min(
state.scan_interval * 2, self._cfg.max_interval
)
# 推回队列
state.next_scan_time = time.time() + state.scan_interval
heapq.heappush(self._queue, state)
scanned += 1
# 群间 jitter
if scanned < self._cfg.batch_size and self._queue:
jitter = random.uniform(0, self._cfg.jitter_max)
time.sleep(jitter)
return total_new
def _scan_single_group(self, state: GroupState) -> int:
"""扫描单个群聊,返回新增消息数"""
messages = self._adapter.query_new_messages(
state.username,
after_ts=state.last_message_at,
limit=self._cfg.messages_per_scan,
)
if not messages:
return 0
# 处理媒体上传 + 详细日志
media_stats = {"uploaded": 0, "no_file": 0, "failed": 0}
for msg in messages:
log.info(
" 新消息: [%s] %s: %s (local_id=%d)",
msg["msg_type"], msg.get("sender_name", "?"),
(msg.get("content") or "")[:80],
msg["local_id"],
)
if msg["msg_type"] in MEDIA_TYPES:
if msg.get("media_path"):
url = self._upload_media(msg)
msg["media_url"] = url
if url:
media_stats["uploaded"] += 1
else:
media_stats["failed"] += 1
else:
msg["media_url"] = None
media_stats["no_file"] += 1
log.info(" → 媒体文件未在本地找到, 跳过上传")
else:
msg["media_url"] = None
# 批量入库
inserted = self._storage.insert_messages(messages)
# 媒体统计日志
if any(v > 0 for v in media_stats.values()):
log.info(
" 媒体统计: 上传成功=%d, 本地无文件=%d, 上传失败=%d",
media_stats["uploaded"], media_stats["no_file"], media_stats["failed"],
)
# 更新状态
max_ts = max(m["create_time"] for m in messages)
state.last_message_at = max_ts
return inserted
def _upload_media(self, msg: dict) -> str | None:
"""上传媒体文件到 COS返回 URL"""
cos = self._get_cos()
if not cos:
return None
local_path = msg["media_path"]
if not local_path or not os.path.isfile(local_path):
return None
try:
filename = os.path.basename(local_path)
cos_key = _build_cos_key(
self._cfg, msg["msg_type"], msg["create_time"], filename
)
url = cos.upload(local_path, cos_key)
log.info(" → COS 上传成功: %s%s", local_path, url)
return url
except Exception as e:
log.warning("上传失败 %s: %s", local_path, e)
return None
def _update_activity(self, state: GroupState, new_count: int):
"""根据扫描结果更新活跃度和下次扫描间隔"""
now = time.time()
if new_count > 0:
state.consecutive_empty = 0
age = now - state.last_message_at if state.last_message_at else float("inf")
if age < self._cfg.hot_threshold:
state.activity_level = "hot"
state.scan_interval = self._cfg.min_interval
elif age < self._cfg.warm_threshold:
state.activity_level = "warm"
state.scan_interval = self._cfg.base_interval
else:
state.activity_level = "warm"
state.scan_interval = self._cfg.base_interval
else:
state.consecutive_empty += 1
state.activity_level = "cold"
state.scan_interval = min(
state.scan_interval * self._cfg.backoff_factor,
self._cfg.max_interval,
)
def should_discover(self) -> bool:
return time.time() - self._last_discovery >= self._cfg.discovery_interval
def get_status(self) -> dict:
"""获取扫描器当前状态"""
total_msgs = self._storage.get_total_count()
group_stats = {}
for username, state in self._groups.items():
group_stats[state.display_name] = {
"username": username,
"activity": state.activity_level,
"interval": f"{int(state.scan_interval)}s",
"last_msg_at": state.last_message_at,
}
return {
"total_groups": len(self._groups),
"total_messages": total_msgs,
"groups": group_stats,
}
def time_to_next(self) -> float:
"""返回距离下一个群到期的秒数"""
if not self._queue:
return self._cfg.cycle_sleep
wait = self._queue[0].next_scan_time - time.time()
return max(0, min(wait, self._cfg.cycle_sleep))

194
collector/storage.py Normal file
View File

@ -0,0 +1,194 @@
"""MySQL 存储层 — 只操作 wechat_group_message 表"""
import logging
import pymysql
log = logging.getLogger(__name__)
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS `{table}` (
`id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
`group_username` VARCHAR(128) NOT NULL COMMENT '群聊 username, 如 xxx@chatroom',
`group_name` VARCHAR(255) NOT NULL DEFAULT '' COMMENT '群聊名称',
`sender_username` VARCHAR(128) NOT NULL DEFAULT '' COMMENT '发送者 wxid',
`sender_name` VARCHAR(255) NOT NULL DEFAULT '' COMMENT '发送者显示名称',
`msg_type` VARCHAR(32) NOT NULL DEFAULT 'text' COMMENT '消息类型: text/voice/video/file/sticker/link/location/call/system',
`content` TEXT COMMENT '消息文本内容或描述',
`media_url` VARCHAR(1024) DEFAULT NULL COMMENT '媒体文件 COS URL',
`local_id` BIGINT NOT NULL DEFAULT 0 COMMENT '微信消息 local_id',
`local_type` BIGINT NOT NULL DEFAULT 0 COMMENT '微信消息原始类型',
`source_db` VARCHAR(255) NOT NULL DEFAULT '' COMMENT '来源 message_N.db 路径',
`msg_time` DATETIME NOT NULL COMMENT '消息发送时间(微信 create_time)',
`msg_timestamp` BIGINT NOT NULL DEFAULT 0 COMMENT '消息时间戳(unix)',
`svr_msg_id` BIGINT UNSIGNED DEFAULT NULL COMMENT '微信服务端消息ID(server_id)',
`refer_msg_svrid` BIGINT UNSIGNED DEFAULT NULL COMMENT '引用消息的服务端ID(refermsg/svrid)',
`collected_at` DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '采集入库时间',
PRIMARY KEY (`id`),
UNIQUE KEY `uk_group_local` (`group_username`, `local_id`, `source_db`, `msg_timestamp`),
KEY `idx_group_time` (`group_username`, `msg_timestamp`),
KEY `idx_msg_time` (`msg_timestamp`),
KEY `idx_sender` (`sender_username`),
KEY `idx_svr_msg_id` (`svr_msg_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
COMMENT='微信群聊消息采集表';
"""
INSERT_SQL = """
INSERT IGNORE INTO `{table}`
(group_username, group_name, sender_username, sender_name,
msg_type, content, media_url, local_id, local_type, source_db,
msg_time, msg_timestamp, svr_msg_id, refer_msg_svrid)
VALUES
(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, FROM_UNIXTIME(%s), %s, %s, %s)
"""
class MessageStorage:
def __init__(self, config):
self._cfg = config
self._table = config.mysql_table
self._conn = None
def _get_conn(self):
if self._conn is not None:
try:
self._conn.ping(reconnect=True)
return self._conn
except Exception:
self._conn = None
self._conn = pymysql.connect(
host=self._cfg.mysql_host,
port=self._cfg.mysql_port,
user=self._cfg.mysql_user,
password=self._cfg.mysql_password,
database=self._cfg.mysql_database,
charset="utf8mb4",
connect_timeout=10,
read_timeout=30,
write_timeout=30,
autocommit=False,
)
return self._conn
def ensure_table(self):
conn = self._get_conn()
with conn.cursor() as cur:
cur.execute(CREATE_TABLE_SQL.format(table=self._table))
self._ensure_columns(cur)
conn.commit()
log.info("%s 已就绪", self._table)
def _ensure_columns(self, cur):
"""为已有表补充新增列(兼容旧表结构)"""
cur.execute(f"SHOW COLUMNS FROM `{self._table}`")
existing = {row[0] for row in cur.fetchall()}
migrations = [
("svr_msg_id", "BIGINT UNSIGNED DEFAULT NULL COMMENT '微信服务端消息ID(server_id)' AFTER `msg_timestamp`"),
("refer_msg_svrid", "BIGINT UNSIGNED DEFAULT NULL COMMENT '引用消息的服务端ID(refermsg/svrid)' AFTER `svr_msg_id`"),
]
for col, definition in migrations:
if col not in existing:
cur.execute(f"ALTER TABLE `{self._table}` ADD COLUMN `{col}` {definition}")
log.info("已为表 %s 添加列 %s", self._table, col)
if "svr_msg_id" not in existing:
cur.execute(f"ALTER TABLE `{self._table}` ADD INDEX `idx_svr_msg_id` (`svr_msg_id`)")
log.info("已为表 %s 添加索引 idx_svr_msg_id", self._table)
def insert_messages(self, messages: list[dict]) -> int:
"""批量插入消息,返回实际插入条数"""
if not messages:
return 0
conn = self._get_conn()
sql = INSERT_SQL.format(table=self._table)
rows = []
for m in messages:
rows.append((
m["group_username"],
m.get("group_name", ""),
m.get("sender_username", ""),
m.get("sender_name", ""),
m.get("msg_type", "text"),
m.get("content", ""),
m.get("media_url"),
m["local_id"],
m.get("local_type", 0),
m.get("source_db", ""),
m["create_time"],
m["create_time"],
m.get("svr_msg_id"),
m.get("refer_msg_svrid"),
))
try:
with conn.cursor() as cur:
cur.executemany(sql, rows)
conn.commit()
return cur.rowcount
except Exception:
conn.rollback()
raise
def get_last_msg_timestamp(self, group_username: str) -> int:
"""获取某群最后已入库消息的时间戳"""
conn = self._get_conn()
sql = f"SELECT MAX(msg_timestamp) FROM `{self._table}` WHERE group_username = %s"
with conn.cursor() as cur:
cur.execute(sql, (group_username,))
row = cur.fetchone()
return row[0] or 0
def get_group_stats(self) -> list[dict]:
"""获取各群采集统计"""
conn = self._get_conn()
sql = f"""
SELECT group_username, group_name,
COUNT(*) AS total,
MAX(msg_timestamp) AS last_ts,
MIN(msg_timestamp) AS first_ts
FROM `{self._table}`
GROUP BY group_username, group_name
ORDER BY last_ts DESC
"""
with conn.cursor(pymysql.cursors.DictCursor) as cur:
cur.execute(sql)
return cur.fetchall()
def get_total_count(self) -> int:
conn = self._get_conn()
with conn.cursor() as cur:
cur.execute(f"SELECT COUNT(*) FROM `{self._table}`")
return cur.fetchone()[0]
def get_pending_media_messages(self, lookback_days: int = 7, limit: int = 500) -> list[dict]:
"""查询 media_url 为空的媒体消息(用于回溯补录)"""
conn = self._get_conn()
sql = f"""
SELECT id, group_username, msg_type, content, local_id,
local_type, source_db, msg_timestamp
FROM `{self._table}`
WHERE media_url IS NULL
AND msg_type IN ('image', 'voice', 'video', 'file')
AND msg_time > DATE_SUB(NOW(), INTERVAL %s DAY)
ORDER BY msg_timestamp DESC
LIMIT %s
"""
with conn.cursor(pymysql.cursors.DictCursor) as cur:
cur.execute(sql, (lookback_days, limit))
return cur.fetchall()
def update_media_url(self, record_id: int, media_url: str) -> bool:
"""更新单条记录的 media_url"""
conn = self._get_conn()
sql = f"UPDATE `{self._table}` SET media_url = %s WHERE id = %s AND media_url IS NULL"
try:
with conn.cursor() as cur:
cur.execute(sql, (media_url, record_id))
conn.commit()
return cur.rowcount > 0
except Exception:
conn.rollback()
raise
def close(self):
if self._conn and self._conn.open:
self._conn.close()
self._conn = None

583
collector/wechat_adapter.py Normal file
View File

@ -0,0 +1,583 @@
"""微信数据适配层 — 封装 wechat_cli.core提供群聊发现和增量消息查询"""
import hashlib
import logging
import os
import re
import sqlite3
import xml.etree.ElementTree as ET
from contextlib import closing
from datetime import datetime
from wechat_cli.core.context import AppContext
from wechat_cli.core.contacts import (
get_contact_names,
resolve_username,
display_name_for_username,
)
from wechat_cli.core.messages import (
find_msg_db_keys,
_find_msg_tables_for_user,
_is_safe_msg_table_name,
_build_message_filters,
decompress_content,
_parse_message_content,
_split_msg_type,
format_msg_type,
_load_name2id_maps,
)
log = logging.getLogger(__name__)
# 消息 base_type → 简化类型名
_TYPE_MAP = {
1: "text", 3: "image", 34: "voice", 42: "contact_card",
43: "video", 47: "sticker", 48: "location", 49: "link",
50: "call", 10000: "system", 10002: "revoked",
}
# 需要上传 COS 的媒体类型image/voice 只上传已解密的可读文件,跳过 .dat
MEDIA_TYPES = {"video", "file", "image", "voice"}
def _table_has_column(conn, table_name: str, column_name: str) -> bool:
"""检查 SQLite 表是否包含指定列"""
try:
cols = conn.execute(f"PRAGMA table_info([{table_name}])").fetchall()
return any(row[1] == column_name for row in cols)
except Exception:
return False
def _extract_appmsg_meta(content_xml: str) -> dict | None:
"""从 type=49 消息的 XML 中提取 appmsg 元信息(文件名、大小、类型等)
返回 dict: {title, des, file_size, file_ext, app_type,
refer_svrid, refer_displayname, refer_content} None
"""
if not content_xml:
return None
try:
root = ET.fromstring(content_xml) if content_xml.lstrip().startswith("<") else None
if root is None:
return None
appmsg = root.find(".//appmsg")
if appmsg is None:
return None
app_type = int((appmsg.findtext("type") or "0").strip())
title = (appmsg.findtext("title") or "").strip()
des = (appmsg.findtext("des") or "").strip()
# 附件信息
attach = appmsg.find("appattach")
file_size = 0
file_ext = ""
if attach is not None:
file_size = int((attach.findtext("totallen") or "0").strip())
file_ext = (attach.findtext("fileext") or "").strip()
result = {
"app_type": app_type,
"title": title,
"des": des,
"file_size": file_size,
"file_ext": file_ext,
}
# 引用消息 (app_type=57): 提取 refermsg 中的 svrid 和被引用内容
if app_type == 57:
ref = appmsg.find(".//refermsg")
if ref is not None:
svrid_text = (ref.findtext("svrid") or "").strip()
if svrid_text:
try:
result["refer_svrid"] = int(svrid_text)
except ValueError:
pass
result["refer_displayname"] = (ref.findtext("displayname") or "").strip()
result["refer_content"] = (ref.findtext("content") or "").strip()
return result
except Exception:
return None
def _format_file_content(meta: dict) -> str:
"""将文件元信息格式化为可读的 content 字符串"""
parts = []
if meta.get("title"):
parts.append(meta["title"])
size = meta.get("file_size", 0)
if size > 0:
if size >= 1024 * 1024:
parts.append(f"({size / 1024 / 1024:.1f}MB)")
elif size >= 1024:
parts.append(f"({size / 1024:.1f}KB)")
else:
parts.append(f"({size}B)")
return " ".join(parts)
def _extract_video_meta(content_xml: str) -> str:
"""从视频消息 XML 中提取描述信息"""
if not content_xml:
return "[视频]"
try:
root = ET.fromstring(content_xml) if content_xml.lstrip().startswith("<") else None
if root is None:
return "[视频]"
video = root.find(".//videomsg")
if video is not None:
length = video.get("length", "")
raw_length = video.get("rawlength", "")
dur = length or raw_length
if dur:
return f"[视频] {dur}"
return "[视频]"
except Exception:
return "[视频]"
def _extract_chat_record(content_xml: str) -> str | None:
"""从 type=49, app_type=19 的聊天记录消息中提取纯文本
格式:
[聊天记录] 标题
发送者A: 消息内容
发送者B: 消息内容
...
"""
if not content_xml:
return None
try:
root = ET.fromstring(content_xml) if content_xml.lstrip().startswith("<") else None
if root is None:
return None
appmsg = root.find(".//appmsg")
if appmsg is None:
return None
title = (appmsg.findtext("title") or "聊天记录").strip()
lines = [f"[聊天记录] {title}"]
# recorditem 内嵌了一段 XML 字符串
recorditem_text = appmsg.findtext("recorditem") or ""
if not recorditem_text.strip():
return lines[0] if lines else None
rec_root = ET.fromstring(recorditem_text)
for item in rec_root.findall(".//datalist/dataitem"):
sender = (item.findtext("sourcename") or "").strip()
# datatitle 是聊天内容datadesc 是附加描述
msg_text = (item.findtext("datatitle") or "").strip()
if not msg_text:
msg_text = (item.findtext("datadesc") or "").strip()
if not msg_text:
# 可能是图片/视频等非文本
data_type = item.get("datatype", "")
if data_type == "2":
msg_text = "[图片]"
elif data_type == "4":
msg_text = "[视频]"
elif data_type == "6":
msg_text = "[文件]"
else:
msg_text = "[其他]"
if sender:
lines.append(f"{sender}: {msg_text}")
else:
lines.append(msg_text)
return "\n".join(lines)
except Exception:
return None
def _extract_image_md5(content_xml: str) -> str | None:
"""从图片消息 XML 中提取 md5 属性(图片内容 hash"""
if not content_xml:
return None
m = re.search(r'\bmd5="([a-fA-F0-9]{32})"', content_xml)
return m.group(1).lower() if m else None
def _file_md5(filepath: str) -> str:
"""计算文件内容的 md5"""
h = hashlib.md5()
with open(filepath, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
h.update(chunk)
return h.hexdigest()
class WeChatAdapter:
def __init__(self):
self._app = AppContext()
self._names = get_contact_names(self._app.cache, self._app.decrypted_dir)
@property
def db_dir(self):
return self._app.db_dir
def refresh_names(self):
"""刷新联系人名称缓存(全局单例需要重置)"""
import wechat_cli.core.contacts as _c
_c._contact_names = None
_c._contact_full = None
self._names = get_contact_names(self._app.cache, self._app.decrypted_dir)
def list_group_sessions(self, limit=500) -> list[dict]:
"""列出所有群聊会话"""
path = self._app.cache.get(os.path.join("session", "session.db"))
if not path:
log.error("无法解密 session.db")
return []
with closing(sqlite3.connect(path)) as conn:
rows = conn.execute("""
SELECT username, unread_count, summary, last_timestamp,
last_msg_type, last_msg_sender, last_sender_display_name
FROM SessionTable
WHERE last_timestamp > 0
ORDER BY last_timestamp DESC
LIMIT ?
""", (limit,)).fetchall()
groups = []
for r in rows:
username, unread, summary, ts, msg_type, sender, sender_name = r
if "@chatroom" not in username:
continue
display = self._names.get(username, username)
groups.append({
"username": username,
"display_name": display,
"last_timestamp": ts,
"unread": unread or 0,
})
return groups
def resolve_group_username(self, group_name: str) -> str | None:
return resolve_username(group_name, self._app.cache, self._app.decrypted_dir)
def display_name(self, username: str) -> str:
return self._names.get(username, username)
def query_new_messages(self, username: str, after_ts: int, limit: int = 200) -> list[dict]:
"""增量查询某群新消息create_time > after_ts按时间升序返回结构化数据"""
tables = _find_msg_tables_for_user(
username, self._app.msg_db_keys, self._app.cache
)
if not tables:
return []
group_name = self._names.get(username, username)
is_group = "@chatroom" in username
all_messages = []
for table_info in tables:
# 优化:跳过 max_create_time <= after_ts 的表
if after_ts and table_info["max_create_time"] <= after_ts:
continue
db_path = table_info["db_path"]
table_name = table_info["table_name"]
if not _is_safe_msg_table_name(table_name):
continue
try:
conn = sqlite3.connect(db_path, timeout=5)
try:
id_to_username = _load_name2id_maps(conn)
has_server_id = _table_has_column(conn, table_name, "server_id")
# 自定义查询ORDER BY create_time ASC
clauses, params = _build_message_filters(start_ts=after_ts)
# 改为严格大于(排除已入库的那条)
if clauses:
clauses[0] = "create_time > ?"
where_sql = f"WHERE {' AND '.join(clauses)}" if clauses else ""
extra_col = ", server_id" if has_server_id else ""
sql = f"""
SELECT local_id, local_type, create_time, real_sender_id,
message_content, WCDB_CT_message_content{extra_col}
FROM [{table_name}]
{where_sql}
ORDER BY create_time ASC
LIMIT ?
"""
rows = conn.execute(sql, (*params, limit)).fetchall()
for row in rows:
msg = self._parse_row(
row, username, group_name, is_group,
id_to_username, db_path, has_server_id
)
if msg:
all_messages.append(msg)
finally:
conn.close()
except Exception as e:
log.warning("查询 %s%s 失败: %s", username, db_path, e)
# 跨表合并后按时间排序,截断到 limit
all_messages.sort(key=lambda m: m["create_time"])
return all_messages[:limit]
def _parse_row(self, row, username, group_name, is_group, id_to_username, db_path, has_server_id=False):
"""解析单条消息原始行为结构化 dict"""
local_id, local_type, create_time, real_sender_id, content_raw, ct = row[:6]
server_id = row[6] if has_server_id and len(row) > 6 else None
content_raw = decompress_content(content_raw, ct)
if content_raw is None:
content_raw = ""
# 解析发送者和消息内容
sender_from_content, text = _parse_message_content(content_raw, local_type, is_group)
# 解析发送者
sender_username = id_to_username.get(real_sender_id, "")
if not sender_username and sender_from_content:
sender_username = sender_from_content
sender_name = self._names.get(sender_username, sender_username)
# 解析消息类型
base_type, sub_type = _split_msg_type(local_type)
msg_type = _TYPE_MAP.get(base_type, "other")
if base_type == 49 and sub_type == 6:
msg_type = "file"
# content 处理:文本消息存原文,非文本消息提取元信息
# 注意:群聊 content_raw 格式为 "wxid:\n<msg>...",用 text剥离发送者后的部分解析 XML
xml_content = text if text else content_raw
refer_msg_svrid = None
if base_type == 1:
final_content = text
elif base_type == 49:
# appmsg 类型:文件(6)、链接(5)、小程序(33/36)、聊天记录(19)、引用(57) 等
meta = _extract_appmsg_meta(xml_content)
if meta and meta["app_type"] == 19:
# 聊天记录合并转发
final_content = _extract_chat_record(xml_content) or meta.get("title", "")
elif meta and meta["app_type"] == 57:
# 引用/回复消息
refer_msg_svrid = meta.get("refer_svrid")
quote_text = meta.get("title") or "[引用消息]"
ref_name = meta.get("refer_displayname", "")
ref_content = meta.get("refer_content", "")
if len(ref_content) > 160:
ref_content = ref_content[:160] + "..."
if ref_content:
prefix = f"回复 {ref_name}: " if ref_name else "回复: "
quote_text += f"\n{prefix}{ref_content}"
final_content = quote_text
log.debug("引用消息: refer_svrid=%s, title=%s", refer_msg_svrid, meta.get("title", ""))
elif meta:
if meta["app_type"] == 6:
final_content = _format_file_content(meta)
elif meta.get("title"):
final_content = meta["title"]
if meta.get("des"):
final_content += f" - {meta['des']}"
else:
final_content = ""
log.debug("appmsg 元信息: type=%d, title=%s", meta["app_type"], meta.get("title", ""))
else:
final_content = ""
elif base_type == 43:
final_content = _extract_video_meta(xml_content)
elif base_type == 34:
final_content = "[语音]"
elif base_type == 3:
img_md5 = _extract_image_md5(xml_content)
final_content = f"[图片] {img_md5}" if img_md5 else "[图片]"
elif base_type == 47:
final_content = "[表情]"
elif base_type == 48:
final_content = "[位置]"
elif base_type == 42:
final_content = "[名片]"
else:
final_content = text if text else ""
# 解析媒体路径
media_path = None
if msg_type in MEDIA_TYPES and self._app.db_dir:
try:
if msg_type == "image":
media_path = self._resolve_readable_media(
base_type, content_raw, create_time, username
)
if media_path:
log.debug("图片路径解析成功: %s", media_path)
else:
log.info("图片文件未找到 (local_id=%d, ts=%d), 可能未点开",
local_id, create_time)
elif msg_type == "video":
media_path = self._resolve_video_path(content_raw, create_time)
if media_path:
log.debug("视频路径解析成功: %s", media_path)
else:
log.info("视频文件未找到 (local_id=%d, ts=%d), 可能未下载",
local_id, create_time)
elif msg_type == "file":
media_path = self._resolve_msg_file(final_content, create_time)
if media_path:
log.debug("文件路径解析成功: %s", media_path)
else:
log.info("文件未找到 (local_id=%d, ts=%d), 可能未下载",
local_id, create_time)
elif msg_type == "voice":
media_path = self._resolve_readable_media(
base_type, content_raw, create_time, username
)
if media_path:
log.debug("语音路径解析成功: %s", media_path)
else:
log.info("语音文件未找到 (local_id=%d, ts=%d)",
local_id, create_time)
except Exception as e:
log.warning("媒体路径解析异常 (local_id=%d, type=%s): %s",
local_id, msg_type, e)
return {
"group_username": username,
"group_name": group_name,
"local_id": local_id,
"local_type": local_type,
"create_time": create_time,
"sender_username": sender_username,
"sender_name": sender_name,
"msg_type": msg_type,
"content": final_content,
"media_path": media_path,
"source_db": os.path.basename(db_path),
"svr_msg_id": server_id,
"refer_msg_svrid": refer_msg_svrid,
}
def _resolve_msg_file(self, content: str, create_time: int) -> str | None:
"""在 msg/file/YYYY-MM/ 中按文件名查找(文件/视频/音频等都在此目录)"""
wechat_base = os.path.dirname(self._app.db_dir)
dt = datetime.fromtimestamp(create_time)
date_prefix = dt.strftime("%Y-%m")
file_dir = os.path.join(wechat_base, "msg", "file", date_prefix)
if not os.path.isdir(file_dir):
return None
# 从 content 提取文件名: "filename.ext (1.2MB)" 或 "[视频] 30秒" 等
title = (content or "").split(" (")[0].strip()
# 去掉前缀标记如 [视频]、[语音]
for prefix in ("[视频]", "[语音]"):
if title.startswith(prefix):
title = title[len(prefix):].strip()
break
if not title:
return None
# 精确匹配
target = os.path.join(file_dir, title)
if os.path.isfile(target):
return target
# 模糊匹配
for f in os.listdir(file_dir):
fp = os.path.join(file_dir, f)
if not os.path.isfile(fp):
continue
if title in f or f in title:
return fp
return None
def _resolve_readable_media(self, base_type: int, content: str, create_time: int, chat_username: str) -> str | None:
"""解析图片/语音的本地文件路径。
图片: temp/RWTemp/YYYY-MM/ 中按 md5 查找原始图片
语音: 只返回已解密的可读文件
"""
wechat_base = os.path.dirname(self._app.db_dir)
dt = datetime.fromtimestamp(create_time)
date_prefix = dt.strftime("%Y-%m")
if base_type == 3:
img_md5 = _extract_image_md5(content)
if not img_md5:
return None
rwtemp_dir = os.path.join(wechat_base, "temp", "RWTemp", date_prefix)
if not os.path.isdir(rwtemp_dir):
return None
for f in os.listdir(rwtemp_dir):
fp = os.path.join(rwtemp_dir, f)
ext = os.path.splitext(f)[1].lower()
if ext not in (".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"):
continue
if not os.path.isfile(fp):
continue
if _file_md5(fp) == img_md5:
return fp
return None
else:
# 语音:只返回已解密的可读文件
msg_dir = os.path.join(wechat_base, "msg")
attach_dir = os.path.join(msg_dir, "attach")
readable_exts = (".mp3", ".wav", ".amr", ".silk", ".m4a", ".ogg")
search_hashes = []
if chat_username:
h = hashlib.md5(chat_username.encode()).hexdigest()
candidate = os.path.join(attach_dir, h)
if os.path.isdir(candidate):
search_hashes.append(h)
if not search_hashes and os.path.isdir(attach_dir):
search_hashes = [
d for d in os.listdir(attach_dir)
if os.path.isdir(os.path.join(attach_dir, d))
]
for h in search_hashes:
sub = os.path.join(attach_dir, h, date_prefix, "Voice")
if not os.path.isdir(sub):
continue
for f in os.listdir(sub):
fp = os.path.join(sub, f)
if not os.path.isfile(fp):
continue
if f.endswith(".dat"):
continue
if f.lower().endswith(readable_exts):
return fp
return None
def _resolve_video_path(self, content: str, create_time: int) -> str | None:
"""在 msg/video/YYYY-MM/ 中按 rawmd5 查找 mp4"""
wechat_base = os.path.dirname(self._app.db_dir)
video_dir = os.path.join(wechat_base, "msg", "video")
if not os.path.isdir(video_dir):
return None
rawmd5 = None
md5_m = re.search(r'rawmd5="([a-f0-9]+)"', content or "")
if md5_m:
rawmd5 = md5_m.group(1)
if not rawmd5:
return None
dt = datetime.fromtimestamp(create_time)
date_prefix = dt.strftime("%Y-%m")
month_dir = os.path.join(video_dir, date_prefix)
if not os.path.isdir(month_dir):
return None
for f in os.listdir(month_dir):
if rawmd5 not in f:
continue
fp = os.path.join(month_dir, f)
if f.endswith(".mp4") and os.path.isfile(fp):
return fp
return None

49
project.md Normal file
View File

@ -0,0 +1,49 @@
# wechat-cli 项目规划
## 项目目标
构建基于微信本地数据库的 AI-first CLI 工具生态,支持消息查询、数据导出、自动化采集。
## 当前版本
v0.2.4
## 已完成功能
### 核心 CLIwechat_cli/
- [x] 密钥提取macOS/Linux/Windows 进程内存扫描)
- [x] 会话列表、消息历史、全局搜索
- [x] 联系人查询、群成员列表
- [x] 聊天统计、收藏夹查询
- [x] 未读消息、增量消息检测
- [x] 导出聊天记录Markdown/TXT
- [x] 媒体文件路径解析(--media 标志)
### Skills
- [x] export-chat — 单个聊天完整导出(消息+视频+音频+文件)
- [x] tencent-cos-upload — 腾讯云 COS 文件上传工具
### 群聊消息采集器collector/)— 2026-04-13 新增
- [x] 自动发现所有群聊(定期扫描 session.db
- [x] 增量消息采集(基于 create_time 水位线)
- [x] MySQL 持久化wechat_group_message 表)
- [x] 媒体文件上传腾讯 COS只采集原始可读文件不做 .dat 解密)
- 图片: temp/RWTemp/YYYY-MM/{md5}.ext查看后自动生成
- 文件/视频/音频: msg/file/YYYY-MM/filename
- [x] 自适应扫描频率hot/warm/cold 三级,优先级队列调度)
- [x] 白名单/黑名单过滤(支持正则)
- [x] 守护模式 + 一次性扫描模式
- [x] 非文本消息元信息提取(文件名/大小/视频时长等,不依赖本地下载)
- [x] 详细扫描日志(每条消息类型/发送者/媒体状态)
- [x] 聊天记录合并转发解析app_type=19 → 多行纯文本)
- [x] 引用/回复消息关联app_type=57 → svr_msg_id + refer_msg_svrid支持消息间关联查询
- [x] 扫描频率优化hot=15s, warm=30s, cold≤120s保证 ≤2min 入库)
- [x] 默认 DEBUG 日志级别
- [x] 媒体回溯补录watchdog 监听 RWTemp + msg/file + 10min 定时兜底扫描,补录 7 天内记录)
## 待规划功能
- [ ] 采集数据 Web 查询界面
- [ ] 消息内容分析/统计 dashboard
## 技术约束
- macOS 需要 Full Disk Access 权限
- 微信版本 ≤ 4.1.8.100
- 图片原文件需用户在微信中查看后才会出现在 temp/RWTemp 目录