auto backup 2026-06-25 08:10:30

ai_member_only 2026-06-25 08:10:30 +08:00
parent fe2c100193
commit c81e1b532f
10 changed files with 618 additions and 0 deletions


@@ -0,0 +1,58 @@
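# Data Knowledge Base Index
> Documentation of the company's data structures, supporting data analysis, business queries, report generation, and similar work.
> All credentials are stored in `~/.hermes/.env`; this document contains only redacted references.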
## Data Infrastructure Overview
| Type | Environment | Credential prefix | Purpose | Status |
|------|------|----------|------|------|
| MySQL 8.0 | Production | `VALA_MYSQL_ONLINE_*` | Users / orders / configuration | ✅ |
| MySQL 8.0 | Test | `VALA_MYSQL_TEST_*` | Latest configuration / development testing | ✅ |
| PostgreSQL 17 | Production | `VALA_PG_ONLINE_*` | User behavior data | ✅ |
| PostgreSQL 17 | Test | `VALA_PG_TEST_*` | Test behavior data | ✅ |
| Elasticsearch 7.10 | Production | `VALA_ES_ONLINE_*` | Service logs | ✅ |
| Elasticsearch | Test | `VALA_ES_TEST_*` | Service logs | ⚠️ Restricted by IP allowlist |
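
The credential prefixes in the table above can be resolved into connection settings by reading `~/.hermes/.env`. The following is a minimal sketch, assuming the file holds plain `KEY=value` lines and that each prefix is followed by a suffix such as `HOST` or `PORT`. The helper names `parse_env` and `settings_for_prefix` and the sample values are hypothetical.

```python
import re


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=value lines; skip blanks and comments, strip optional quotes."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*(.*?)\s*$", line)
        if match:
            values[match.group(1)] = match.group(2).strip('"').strip("'")
    return values


def settings_for_prefix(env: dict[str, str], prefix: str) -> dict[str, str]:
    """Return the entries under one credential prefix, keyed by lower-cased suffix."""
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        if key.startswith(prefix)
    }


# Placeholder values for illustration only; real values live in ~/.hermes/.env.
sample = """
# comment line
VALA_PG_ONLINE_HOST=pg.example.internal
VALA_PG_ONLINE_PORT="5432"
VALA_MYSQL_TEST_HOST=mysql-test.example.internal
"""
pg = settings_for_prefix(parse_env(sample), "VALA_PG_ONLINE_")
print(pg)
```

Grouping by prefix keeps each environment's settings separate, so a script can switch between production and test by changing a single string.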
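## Data Domains
### MySQL: Business Databases
| Database | Environment | Description |
|------|------|------|
| `vala_user` | Production / Test | [Accounts & characters](data_dict/vala_user.md) |
| `vala` | Production / Test | Main business database (configuration, content) |
| `vala_order` | Production / Test | Order data |
| `vala_gray` | Production | Canary (gray) release configuration |
| `vala_dev` | Test | Development configuration |
| `vala_bak` | Test | Backup data |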
### PostgreSQL: Behavior Databases
| Database | Environment | Main tables (partial list) |
|------|------|---------------|
| `vala` | Production | `user_chapter_play_record_*`, `user_lesson_handbook`, `gashapon_config`, `vala_pilot_explain_*` |
| `vala_test` | Test | `user_chapter_play_record_*`, `user_component_play_record_*`, `user_lesson_handbook`, `account_event_count` |
### Elasticsearch: Log Search
| Environment | Version | Purpose |
|------|------|------|
| Production | 7.10.1 | Production service logs |
| Test | - | ⚠️ The current machine's IP is not on the allowlist |
## Utility Scripts
| Script | Path | Description |
|------|------|------|
| phone_encrypt | `scripts/phone_encrypt.py` | Phone number XXTEA encryption/decryption and MD5 |
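## Reference Documents
| Document | Description |
|------|------|
| [references/手机号查询角色ID方法.md](references/手机号查询角色ID方法.md) | Original document on looking up characters by phone number |
---
*Work in progress. Update this index whenever a new data source is added.*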


@@ -0,0 +1,87 @@
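# vala_user: User & Character Data Dictionary
> MySQL production database that stores account and character information.
## Connection Information
All credentials are stored in `~/.hermes/.env` (MySQL production → `VALA_MYSQL_ONLINE_*`; MySQL test → `VALA_MYSQL_TEST_*`).
## Entity Relationship Diagram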
```
vala_app_account (account)       vala_app_character (character)
┌─────────────────────┐          ┌──────────────────────────┐
│ id (PK)             │◄─────────│ account_id (FK)          │
│ tel                 │   1:N    │ id (PK)                  │
│ tel_encrypt         │          │ nickname                 │
└─────────────────────┘          │ gender                   │
                                 │ birthday                 │
                                 │ purchase_season_package  │
                                 │ created_at               │
                                 └──────────────────────────┘
```
- One account can have multiple characters (for example, one character per child).
- Join condition: `vala_app_character.account_id = vala_app_account.id`
## Table Structure
### vala_app_account (account table)
| Column | Type | Description |
|------|------|------|
| `id` | bigint | Account ID (primary key) |
| `tel` | varchar(20) | Phone number (stored masked, e.g. `158****7007`) |
| `tel_encrypt` | varchar(100) | Phone number ciphertext (XXTEA + Base64 URL-safe) |
### vala_app_character (character table)
| Column | Type | Description |
|------|------|------|
| `id` | bigint | Character ID (primary key) |
| `account_id` | bigint | Owning account ID (FK → `vala_app_account.id`) |
| `nickname` | varchar(20) | Character nickname |
| `gender` | tinyint(1) | Gender |
| `birthday` | varchar(50) | Birthday |
| `purchase_season_package` | text | Purchased season packages |
| `created_at` | datetime | Creation time |
## Phone Number Encryption
- **Algorithm**: XXTEA
- **Key**: stored in `~/.hermes/.env` (`VALA_PHONE_XXTEA_KEY`)
- **Encoding**: Base64 URL-safe (`+`→`-`, `/`→`_`, `=`→`.`)
- **Utility script**: `business_knowledge/scripts/phone_encrypt.py`
### Encryption Flow
```
plaintext phone number → XXTEA encryption → Base64 → URL-safe substitution → tel_encrypt ciphertext
```
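
The last two stages of the flow above (Base64, then URL-safe substitution) can be illustrated without the XXTEA step. This is a minimal sketch using only the standard library. The function names `to_db_encoding` and `from_db_encoding` are hypothetical, and the input bytes are arbitrary stand-ins for XXTEA output, not a real ciphertext.

```python
import base64


def to_db_encoding(raw: bytes) -> str:
    """Base64-encode bytes, then substitute + with -, / with _, and = with a dot."""
    text = base64.b64encode(raw).decode("ascii")
    return text.replace("+", "-").replace("/", "_").replace("=", ".")


def from_db_encoding(text: str) -> bytes:
    """Reverse the substitutions and Base64-decode back to the raw bytes."""
    restored = text.replace("-", "+").replace("_", "/").replace(".", "=")
    return base64.b64decode(restored)


# Arbitrary bytes chosen so that every substituted character appears.
raw = bytes([0xFB, 0xEF, 0xFF, 0x00])
encoded = to_db_encoding(raw)
print(encoded)  # prints --__AA..
assert from_db_encoding(encoded) == raw
```

Note that the padding character becomes `.`, so the result differs from what `base64.urlsafe_b64encode` produces, which keeps `=` as padding.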
### Query Flow
1. Use `phone_encrypt.py` to encrypt the phone number into ciphertext.
2. Match the ciphertext exactly against `vala_app_account.tel_encrypt`.
3. JOIN `vala_app_character` to get the list of characters.
```sql
SELECT
    a.id AS account_id,
    a.tel,
    c.id AS character_id,
    c.nickname,
    c.gender,
    c.birthday,
    c.purchase_season_package
FROM vala_app_account a
LEFT JOIN vala_app_character c ON c.account_id = a.id
WHERE a.tel_encrypt = '<ciphertext>';
```
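
The shape of the rows this query returns can be explored locally. The sketch below uses an in-memory SQLite database as a stand-in for MySQL, with a reduced column set and made-up data, and binds the ciphertext as a query parameter rather than formatting it into the SQL string. The function name `characters_for` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE vala_app_account (id INTEGER PRIMARY KEY, tel TEXT, tel_encrypt TEXT);
    CREATE TABLE vala_app_character (id INTEGER PRIMARY KEY, account_id INTEGER, nickname TEXT);
    INSERT INTO vala_app_account VALUES (1, '138****8000', 'CIPHER_A'), (2, '139****9000', 'CIPHER_B');
    INSERT INTO vala_app_character VALUES (10, 1, 'Alpha'), (11, 1, 'Beta');
    """
)

QUERY = """
    SELECT a.id AS account_id, a.tel, c.id AS character_id, c.nickname
    FROM vala_app_account a
    LEFT JOIN vala_app_character c ON c.account_id = a.id
    WHERE a.tel_encrypt = ?
    ORDER BY c.id
"""


def characters_for(cipher: str) -> list[tuple]:
    """Run the lookup with the ciphertext bound as a parameter."""
    return conn.execute(QUERY, (cipher,)).fetchall()


print(characters_for("CIPHER_A"))  # two rows, one per character
print(characters_for("CIPHER_B"))  # one row whose character columns are NULL
print(characters_for("UNKNOWN"))   # no rows, because no account matches
```

Because of the `LEFT JOIN`, an account without characters still yields one row, which distinguishes "account exists but has no characters" from "no such account". SQLite uses `?` as the placeholder; MySQL drivers typically use `%s`.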
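## Notes
1. **The `tel` column is masked** (e.g. `158****7007`) and cannot be used for exact matching.
2. **Always match on the `tel_encrypt` ciphertext.**
3. **One account can have multiple characters**, so a query may return multiple rows.
4. For the same phone number, the `tel_encrypt` value is identical in the test and production environments (the encryption algorithm is the same).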
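
To see why the masked `tel` value cannot identify an account, consider a masking rule that keeps the first three and last four digits. That rule is an assumption inferred from the `158****7007` example, and `mask_phone` is a hypothetical helper, not part of the repository.

```python
def mask_phone(phone: str) -> str:
    """Keep the first 3 and last 4 digits, masking the middle with asterisks."""
    if len(phone) < 7:
        raise ValueError("phone number too short to mask")
    return phone[:3] + "*" * (len(phone) - 7) + phone[-4:]


# Made-up numbers: two different phones collapse to the same masked value.
first = mask_phone("13800138000")
second = mask_phone("13899998000")
print(first, second, first == second)
```

Since many distinct numbers share one masked form, only the ciphertext in `tel_encrypt` is suitable for an exact lookup.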


@@ -0,0 +1,100 @@
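# Lookup Method: Phone Number → Account ID → Character ID
> ⚠️ This is a redacted reference version. Credentials have been moved to `~/.hermes/.env`; for the data dictionary, see `data_dict/vala_user.md`.
## Data Relationships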
```
phone number (plaintext)
        │ XXTEA encryption
tel_encrypt (ciphertext)        account_id
        │                           │
        ▼                           ▼
vala_app_account ──────────► vala_app_character
(account table)  1:N link    (character table)
```
- **One account** (`vala_app_account`) can have **multiple characters** (`vala_app_character`).
- Join condition: `vala_app_character.account_id = vala_app_account.id`
## Database
| Item | Value |
|------|-----|
| Database | MySQL production environment |
| Database name | `vala_user` |
| User | `read_only` |
> The actual connection details are read from `~/.hermes/.env`.
## Table Structure
### vala_app_account (account table)
| Column | Type | Description |
|------|------|------|
| `id` | bigint | Account ID (primary key) |
| `tel` | varchar(20) | Phone number (stored masked, e.g. `158****7007`) |
| `tel_encrypt` | varchar(100) | Phone number ciphertext (used for exact matching) |
### vala_app_character (character table)
| Column | Type | Description |
|------|------|------|
| `id` | bigint | Character ID (primary key) |
| `account_id` | bigint | Owning account ID |
| `nickname` | varchar(20) | Character nickname |
| `gender` | tinyint(1) | Gender |
| `birthday` | varchar(50) | Birthday |
| `purchase_season_package` | text | Purchased season packages |
## Phone Number Encryption Method
Phone numbers are stored in the database as ciphertext, encrypted with **XXTEA + Base64 URL-safe**.
The key is read from `~/.hermes/.env` (`VALA_PHONE_XXTEA_KEY`).
## Query Steps
### Step 1: Encrypt the phone number
```bash
python3 business_knowledge/scripts/phone_encrypt.py encrypt 15849377007
```
### Step 2: Query the account and characters by ciphertext
```sql
SELECT
    a.id AS account_id,
    a.tel,
    c.id AS character_id,
    c.nickname,
    c.gender,
    c.birthday,
    c.purchase_season_package,
    c.created_at
FROM vala_app_account a
LEFT JOIN vala_app_character c ON c.account_id = a.id
WHERE a.tel_encrypt = '<ciphertext>';
```
### Step 3: Interpret the results
```
account_id  tel          character_id  nickname  gender  birthday    purchase_season_package
18279       158****7007  23600         Morris    1       2021-09-09  [16,17,18,19,20]
18279       158****7007  23686         Nathan    1       2018-03-13  [16]
```
- **Account ID**: 18279
- **Characters**: 23600 (Morris), 23686 (Nathan)
- An account may have multiple characters (one character per child).
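
Since the result set repeats the account columns on every row, it is often convenient to group the rows by account. The sketch below assumes that `purchase_season_package` holds a JSON array string, as the sample output suggests. The function name `group_characters` and all row values are made up for illustration.

```python
import json

# Rows shaped like a reduced version of the query result; values are made up.
rows = [
    (1001, "138****8000", 501, "Alpha", "[16,17]"),
    (1001, "138****8000", 502, "Beta", "[16]"),
    (1002, "139****9000", None, None, None),  # account without characters
]


def group_characters(result_rows):
    """Map each account_id to a list of its characters with parsed season packages."""
    grouped = {}
    for account_id, _tel, character_id, nickname, packages in result_rows:
        characters = grouped.setdefault(account_id, [])
        if character_id is None:
            continue  # LEFT JOIN row for an account with no characters
        characters.append(
            {
                "character_id": character_id,
                "nickname": nickname,
                "season_packages": json.loads(packages) if packages else [],
            }
        )
    return grouped


by_account = group_characters(rows)
print(by_account)
```

An account that has no characters still appears in the result with an empty list, mirroring the `LEFT JOIN` semantics of the query.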
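## Notes
1. **The `tel` column is masked** (e.g. `158****7007`) and cannot be used directly for exact matching.
2. **Always match on the `tel_encrypt` ciphertext**, which is produced by XXTEA encryption.
3. **One account can have multiple characters**, so a query may return multiple rows.
4. For the same phone number, the `tel_encrypt` value is identical in the test and production environments (the encryption algorithm is the same).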


@@ -0,0 +1,84 @@
#!/usr/bin/env python3
"""
Phone number encryption/decryption tool; reads the XXTEA key from ~/.hermes/.env.

Usage:
    from phone_encrypt import encrypt_phone, decrypt_phone, phone_md5
    cipher = encrypt_phone("13800138000")
    phone = decrypt_phone(cipher)
    md5 = phone_md5("13800138000")

Command line:
    python phone_encrypt.py encrypt 13800138000
    python phone_encrypt.py decrypt CxMOc6z56aYjE73r8OSAog..
"""
import os
import re
import hashlib
import base64

try:
    import xxtea
except ImportError:
    raise ImportError(
        "Install xxtea first: pip install xxtea-py"
    )


def _load_key() -> str:
    """Load the XXTEA key from ~/.hermes/.env."""
    env_path = os.path.expanduser("~/.hermes/.env")
    if not os.path.exists(env_path):
        raise FileNotFoundError(f".env file not found: {env_path}")
    with open(env_path, "r") as f:
        content = f.read()
    match = re.search(r"VALA_PHONE_XXTEA_KEY=(.+)", content)
    if not match:
        raise ValueError("VALA_PHONE_XXTEA_KEY not found in .env")
    return match.group(1).strip().strip('"')


KEY = _load_key()


def encrypt_phone(phone: str) -> str:
    """Encrypt a plaintext phone number into the ciphertext format of the tel_encrypt column."""
    encrypted = xxtea.encrypt(phone.encode(), KEY.encode())
    result = base64.b64encode(encrypted).decode()
    # URL-safe substitution, with "." replacing the "=" padding
    result = result.replace("+", "-").replace("/", "_").replace("=", ".")
    return result


def decrypt_phone(encrypted: str) -> str:
    """Decrypt a tel_encrypt value back into the plaintext phone number."""
    restored = encrypted.replace("-", "+").replace("_", "/").replace(".", "=")
    decrypted = xxtea.decrypt(base64.b64decode(restored), KEY.encode())
    return decrypted.decode()


def phone_md5(phone: str) -> str:
    """MD5 of the phone number (used for cross-system joins)."""
    return hashlib.md5(phone.encode()).hexdigest()


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 3:
        print("Usage: python phone_encrypt.py <encrypt|decrypt> <phone|ciphertext>")
        sys.exit(1)
    action, value = sys.argv[1], sys.argv[2]
    if action == "encrypt":
        print(encrypt_phone(value))
    elif action == "decrypt":
        print(decrypt_phone(value))
    else:
        print(f"Unknown action: {action}; supported actions are encrypt / decrypt")
        sys.exit(1)


@@ -0,0 +1,289 @@
#!/usr/bin/env python3
"""
Export lesson review data and raw audio for the given characters.

Usage: python3 export_review_audio.py <character_id_1> [character_id_2] ...
       python3 export_review_audio.py 23600 23686
"""
import re, json, sys, os, subprocess
from datetime import datetime

# ── Load .env ───────────────────────────────────────
def load_env():
    """Read ~/.hermes/.env once and return a getter for individual keys."""
    env_path = os.path.expanduser("~/.hermes/.env")
    with open(env_path) as f:
        content = f.read()

    def g(k):
        m = re.search(rf"{k}=(.+)", content)
        return m.group(1).strip() if m else None

    return g


g = load_env()

# ── Arguments ───────────────────────────────────────
if len(sys.argv) < 2:
    print("Usage: python3 export_review_audio.py <character_id_1> [character_id_2] ...")
    sys.exit(1)

user_ids = [int(x) for x in sys.argv[1:]]
output_dir = os.path.expanduser("~/.hermes/workspace/output")
os.makedirs(output_dir, exist_ok=True)
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
uid_str = "_".join(str(u) for u in user_ids)
output_path = f"{output_dir}/knowledge_review_audio_{uid_str}_{ts}.xlsx"
print(f"Exporting characters: {user_ids}")
print(f"Output file: {output_path}")

# ── 1. Query PG: lesson review records ──────────────
print("\n[1/3] Querying PostgreSQL lesson review records...")
import psycopg2
from psycopg2.extras import RealDictCursor

pg_conn = psycopg2.connect(
    host=g("VALA_PG_ONLINE_HOST"), port=int(g("VALA_PG_ONLINE_PORT")),
    user=g("VALA_PG_ONLINE_USER"), password=g("VALA_PG_ONLINE_PASSWORD"),
    dbname=g("VALA_PG_ONLINE_DB"), connect_timeout=10,
)
with pg_conn.cursor(cursor_factory=RealDictCursor) as cur:
    cur.execute("""
        SELECT user_id, story_id, chapter_id, unique_id,
               score, score_text, sp_value, exp, level,
               question_list, play_time, created_at, updated_at
        FROM user_unit_review_question_result
        WHERE user_id = ANY(%s) AND deleted_at IS NULL
        ORDER BY user_id, updated_at DESC
    """, (user_ids,))
    review_rows = cur.fetchall()

# Parse question_list JSON for readable summary
for row in review_rows:
    ql = row["question_list"]
    if isinstance(ql, str):
        try:
            ql = json.loads(ql)
        except json.JSONDecodeError:
            pass  # leave unparseable values as-is; they count as zero questions
    questions = []
    if isinstance(ql, list):
        for item in ql:
            if isinstance(item, dict):
                q = item.get("question", {})
                qtype = q.get("type", "")
                qtitle = q.get("title", "")
                user_answer = item.get("userAnswer", "")
                score = item.get("score", "")
                questions.append(f"[{qtype}] {qtitle} | Answer: {user_answer} | Score: {score}")
    row["question_summary"] = "\n".join(questions)
    row["question_count"] = len(ql) if isinstance(ql, list) else 0
pg_conn.close()
print(f" → Found {len(review_rows)} lesson review records")

# ── 2. Query ES: audio data ─────────────────────────
print("\n[2/3] Querying Elasticsearch audio data...")
es_url = f"{g('VALA_ES_ONLINE_SCHEME')}://{g('VALA_ES_ONLINE_HOST')}:{g('VALA_ES_ONLINE_PORT')}"
auth = f"{g('VALA_ES_ONLINE_USER')}:{g('VALA_ES_ONLINE_PASSWORD')}"
audio_rows = []
scroll_id = None
page_size = 500

# First page
query = {
    "query": {"terms": {"userId": user_ids}},
    "sort": [{"timeInt": {"order": "desc"}}],
    "size": page_size,
}
r = subprocess.run([
    "curl", "-sk", "-u", auth,
    "-H", "Content-Type: application/json",
    "--connect-timeout", "10", "--max-time", "30",
    "-X", "POST", "-d", json.dumps(query),
    f"{es_url}/user-audio/_search?scroll=2m"
], capture_output=True, text=True, timeout=35)
resp = json.loads(r.stdout)
scroll_id = resp.get("_scroll_id")
total = resp.get("hits", {}).get("total", {}).get("value", 0)
print(f" → ES reports {total} audio records in total; reading in batches...")
hits = resp.get("hits", {}).get("hits", [])
for h in hits:
    audio_rows.append(h["_source"])

# Scroll remaining
batch = 1
while len(audio_rows) < total:
    r = subprocess.run([
        "curl", "-sk", "-u", auth,
        "-H", "Content-Type: application/json",
        "--connect-timeout", "10", "--max-time", "30",
        "-X", "POST", "-d", json.dumps({"scroll": "2m", "scroll_id": scroll_id}),
        f"{es_url}/_search/scroll"
    ], capture_output=True, text=True, timeout=35)
    resp = json.loads(r.stdout)
    # Keep the previous scroll ID if the response does not carry one
    scroll_id = resp.get("_scroll_id", scroll_id)
    hits = resp.get("hits", {}).get("hits", [])
    if not hits:
        break
    for h in hits:
        audio_rows.append(h["_source"])
    batch += 1
    print(f" → Batch {batch}: read {len(audio_rows)}/{total}")

# Clean up scroll (the JSON body needs a Content-Type header to be accepted)
if scroll_id:
    subprocess.run([
        "curl", "-sk", "-u", auth, "--connect-timeout", "5",
        "-H", "Content-Type: application/json",
        "-X", "DELETE", "-d", json.dumps({"scroll_id": scroll_id}),
        f"{es_url}/_search/scroll"
    ], capture_output=True, timeout=10)
print(f" → Read {len(audio_rows)} audio records in total")

# ── 3. Export to Excel ──────────────────────────────
print("\n[3/3] Generating Excel...")
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl.styles import Font, Alignment, PatternFill

wb = Workbook()

# Sheet 1: lesson review records
ws1 = wb.active
ws1.title = "Lesson review records"
review_data = []
for row in review_rows:
    review_data.append({
        "Character ID": row["user_id"],
        "Level": row["level"],
        "Story ID": row["story_id"],
        "Chapter ID": row["chapter_id"],
        "Unique ID": row["unique_id"],
        "Score": row["score"],
        "Rating": row["score_text"],
        "SP value": row["sp_value"],
        "Experience": row["exp"],
        "Question count": row["question_count"],
        "Duration (s)": row["play_time"],
        "Question details": row["question_summary"],
        "Updated at": str(row["updated_at"]),
        "Created at": str(row["created_at"]),
    })
df1 = pd.DataFrame(review_data)
for r_idx, row in enumerate(dataframe_to_rows(df1, index=False, header=True), 1):
    for c_idx, value in enumerate(row, 1):
        ws1.cell(row=r_idx, column=c_idx, value=value)

# Style header
header_font = Font(bold=True, color="FFFFFF")
header_fill = PatternFill(start_color="4472C4", end_color="4472C4", fill_type="solid")
for cell in ws1[1]:
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = Alignment(horizontal="center")

# Column widths
ws1.column_dimensions["A"].width = 10
ws1.column_dimensions["L"].width = 12
ws1.column_dimensions["M"].width = 60

# Sheet 2: audio data
ws2 = wb.create_sheet("Audio data")
audio_data = []
for a in audio_rows:
    # Extract makee_id from userMsg if present
    makee_id = ""
    user_msg = a.get("userMsg", "")
    if isinstance(user_msg, str) and "makee_id" in user_msg:
        try:
            um = json.loads(user_msg)
            makee_id = um.get("makee_id", "")
        except (json.JSONDecodeError, AttributeError):
            pass  # userMsg is not a JSON object; leave makee_id empty
    audio_data.append({
        "Character ID": a.get("userId"),
        "Character name": a.get("userName"),
        "Session ID": a.get("sessionId"),
        "Component ID": a.get("componentId"),
        "Component type": a.get("componentType"),
        "Audio URL": a.get("audioUrl"),
        "LLM audio URL": a.get("llmAudioUrl"),
        "ASR status": a.get("asrStatus"),
        "Pronunciation score (SOE)": json.dumps(a.get("soeData")) if a.get("soeData") else "",
        "Round": a.get("roundNum"),
        "Makee ID": makee_id,
        "Time": a.get("timeStr"),
        "Timestamp": a.get("timeInt"),
        "Data version": a.get("dataVersion"),
    })
df2 = pd.DataFrame(audio_data)
for r_idx, row in enumerate(dataframe_to_rows(df2, index=False, header=True), 1):
    for c_idx, value in enumerate(row, 1):
        ws2.cell(row=r_idx, column=c_idx, value=value)
for cell in ws2[1]:
    cell.font = header_font
    cell.fill = header_fill
    cell.alignment = Alignment(horizontal="center")
ws2.column_dimensions["G"].width = 50
ws2.column_dimensions["H"].width = 50
ws2.column_dimensions["I"].width = 15
ws2.column_dimensions["K"].width = 40
ws2.column_dimensions["M"].width = 22

# Sheet 3: summary
ws3 = wb.create_sheet("Summary")
ws3["A1"] = "Export information"
ws3["A1"].font = Font(bold=True, size=14)
ws3["A3"] = "Export time"
ws3["B3"] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
ws3["A4"] = "Character IDs"
ws3["B4"] = ", ".join(str(u) for u in user_ids)
ws3["A5"] = "Lesson review record count"
ws3["B5"] = len(review_rows)
ws3["A6"] = "Audio record count"
ws3["B6"] = len(audio_rows)

# Per-user breakdown
row_offset = 8
ws3[f"A{row_offset}"] = "Per-character statistics"
ws3[f"A{row_offset}"].font = Font(bold=True)
row_offset += 1
ws3[f"A{row_offset}"] = "Character ID"
ws3[f"B{row_offset}"] = "Review records"
ws3[f"C{row_offset}"] = "Audio records"
ws3[f"D{row_offset}"] = "Latest review time"
for cell in ws3[row_offset]:
    cell.fill = header_fill
    cell.font = Font(bold=True, color="FFFFFF")
row_offset += 1
for uid in user_ids:
    r_cnt = sum(1 for r in review_rows if r["user_id"] == uid)
    a_cnt = sum(1 for a in audio_rows if a.get("userId") == uid)
    latest = max(
        (str(r["updated_at"]) for r in review_rows if r["user_id"] == uid),
        default=""
    )
    ws3[f"A{row_offset}"] = uid
    ws3[f"B{row_offset}"] = r_cnt
    ws3[f"C{row_offset}"] = a_cnt
    ws3[f"D{row_offset}"] = latest
    row_offset += 1
ws3.column_dimensions["A"].width = 18
ws3.column_dimensions["B"].width = 22
ws3.column_dimensions["C"].width = 18
ws3.column_dimensions["D"].width = 28

wb.save(output_path)
print(f"\n✅ Export complete: {output_path}")
print(f" Sheet 1 - Lesson review records: {len(review_rows)}")
print(f" Sheet 2 - Audio data: {len(audio_rows)}")
print(f" Sheet 3 - Summary")