diff --git a/MEMORY.md b/MEMORY.md index 9b5d2ff..c5f4ce4 100644 --- a/MEMORY.md +++ b/MEMORY.md @@ -234,3 +234,8 @@ 2. 耗时字段:bi_user_component_play_record.interval_time,单位为毫秒 3. 计算方式:求和对应chapter_unique_id下所有组件的interval_time,再除以60000转换为分钟,保留1位小数 4. 特殊说明:仅统计课时维度完成(play_status=1)的记录,排除未完成整个课时的部分组件练习记录 + +## growth_activity_behavior 统计口径 [李承龙确认 2026-05-25] +- **统计标识:** 所有统计统一使用 `userId` 作为用户唯一标识,忽略 `accountId` +- **原因:** `accountId=0` 代表匿名用户(H5 页面曝光时等步骤),`userId` 为设备/会话级标识,在匿名和登录状态均存在,更适合做漏斗串联 +- **例外:** 「领课加微」相关埋点无 `userId` 字段,仅能做事件级统计 diff --git a/memory/2026-05-25-1027.md b/memory/2026-05-25-1027.md new file mode 100644 index 0000000..8d1b393 --- /dev/null +++ b/memory/2026-05-25-1027.md @@ -0,0 +1,287 @@ +# Session: 2026-05-25 10:27:55 GMT+8 + +- **Session Key**: agent:main:feishu:direct:ou_e63ce6b760ad39382852472f28fbe2a2 +- **Session ID**: 57030bd7-443e-4c5e-91e6-8d406ceb9925 +- **Source**: feishu + +## Conversation Summary + +assistant: 完课记录的查询链路涉及到几张表,我先把核心逻辑说清楚: + +--- + +## 用户完课记录查询路径 + +### 1. account_id → user_id(角色ID) + +一个 account 下可能有多个角色(character),通过 `bi_vala_app_character` 获取: + +```sql +SELECT id AS user_id FROM bi_vala_app_character WHERE account_id = 你的account_id; +``` + +### 2. 确定分表 + +`bi_user_chapter_play_record` 按 `user_id % 8` 分 8 张表(`_0` 到 `_7`),查询时用取模定位: + +```sql +SELECT * FROM bi_user_chapter_play_record_{user_id % 8} +WHERE user_id = 你的user_id AND play_status = 1; +``` + +### 3. 核心字段 + +| 字段 | 含义 | +|------|------| +| `play_status` | 1=正常完成 / 2=跳过 / 3=重玩 | +| `chapter_id` | 课时ID,关联 `bi_level_unit_lesson.id` 翻译为 L1 S0 U01 L03 | +| `chapter_unique_id` | 课时唯一标识,用于关联组件记录 | +| `created_at` | 完成时间 | + +### 4. 关联课程名称 + +把 `chapter_id` 翻译成可读的课程名: + +```sql +SELECT id, course_level, course_season, course_unit, course_lesson +FROM bi_level_unit_lesson +WHERE id = 你的chapter_id; +``` + +### 5. 
查看组件级详情(可选) + +同取模规则,通过 `chapter_unique_id` 关联组件表: + +```sql +SELECT * FROM bi_user_component_play_record_{user_id % 8} +WHERE chapter_unique_id = '你的chapter_unique_id'; +``` + +--- + +**一句话总结:** account_id → character_id(user_id) → 取模分表 → `bi_user_chapter_play_record`(课时完成)→ `bi_level_unit_lesson`(翻译课名)→ `bi_user_component_play_record`(组件明细)。 + +需要我直接查某个具体用户的完课记录吗? +user: 那么现在我想看,在最近14天里面,没有完课行为的激活用户,占总激活用户的比例有多少?其中仅激活L1的有多少,仅激活L2的有多少,同时激活L1和L2的用户有多少?这里面仅激活L1的用户只看L1的学习记录,仅激活L2的用户只看L2的学习记录。 +assistant: 明确几点边界理解一下再动手: + +1. **最近14天窗口:** 2026-05-09 ~ 2026-05-22(共14个自然日) +2. **L1+L2 用户:** L1 或 L2 任一课程有完课即算有完课行为 +3. **用户→角色:** 一个 account 可能有多 character,任一角色有完课即算 + +逻辑有点复杂,写脚本跑。 +assistant: 结果出来了: + +--- + +## 📊 激活用户最近14天完课情况(2026-05-09 ~ 2026-05-22) + +### 总览 + +| 指标 | 人数 | 占比 | +|------|------|------| +| 激活用户总数 | 3,037 | 100% | +| **近14天无完课** | **1,307** | **43.0%** | +| 近14天有完课 | 1,698 | 55.9% | +| 无角色数据 | 32 | 1.1% | + +### 按激活类型拆分 + +| 激活类型 | 总人数 | 无完课 | 无完课占比 | 有完课 | 有完课占比 | +|----------|--------|--------|-----------|--------|-----------| +| 仅 L1 | 502 | **172** | **34.3%** | 330 | 65.7% | +| 仅 L2 | 1,436 | **880** | **61.3%** | 556 | 38.7% | +| L1 + L2 | 1,067 | **255** | **23.9%** | 812 | 76.1% | + +> 兜底说明:32 个激活用户无角色数据(未创建过 character),未计入各级拆分。仅L1/L2 用户分别只看对应等级的课程完课记录,L1+L2 用户任一等级课程有完课即算。 + +--- + +几个观察: +- **仅L2** 的沉默率最高(61.3%),近三分之二的 L2 单激活用户近14天没有学习行为 +- **L1+L2** 最活跃,仅 23.9% 无完课 +- **仅L1** 居中,34.3% 无完课 +user: 对于L1+L2的帮我区分一下只有L1有完课、只有L2有完课和L1+L2都有完课 +assistant: L1+L2 的完课分布挺有特点的: + +--- + +## 📊 完整结果 + +### 总览 + +| 指标 | 人数 | 占比 | +|------|------|------| +| 激活用户总数 | 3,037 | 100% | +| 无角色数据 | 32 | 1.1% | +| **近14天无完课** | **1,307** | **43.0%** | +| 近14天有完课 | 1,698 | 55.9% | + +### 按激活类型 + +| 激活类型 | 总人数 | 无完课 | 占比 | 有完课 | 占比 | +|----------|--------|--------|------|--------|------| +| 仅 L1 | 502 | 172 | 34.3% | 330 | 65.7% | +| 仅 L2 | 1,436 | 880 | 61.3% | 556 | 38.7% | +| L1+L2 | 1,067 | 255 | 23.9% | 812 | 76.1% | + +### L1+L2 有完课用户拆分(812 人) + +| 细分类别 | 人数 | 
占有完课比例 | +|----------|------|-------------| +| **仅 L1 有完课** | **670** | **82.5%** | +| L1+L2 都有完课 | 110 | 13.5% | +| 仅 L2 有完课 | 32 | 3.9% | + +--- + +L1+L2 用户虽然两个等级都激活了,但近14天里 82.5% 的有完课用户只学了 L1,真正两个等级都在学的仅 13.5%。仅学 L2 的几乎可以忽略(32人)。这说明双激活用户当前学习重心严重偏向 L1。 +user: 我们有了激活的概念,同时我们知道一个用户下面有多个角色。那么一个角色加一个level就等于一个课程。现在我想看有多少个被激活了的课程,那么在近14天,有多少个被激活的课程是有完课行为的,区分L1和L2。 +assistant: 明白,视角从"用户"切到"课程"(character + level)。来查。 +assistant: 结果出来了,和之前"用户"维度的结果做个对比: + +--- + +## 📊 被激活课程(character × level)近14天完课情况 + +### 总体 + +| 指标 | L1 | L2 | 合计 | +|------|-----|-----|------| +| 激活课程数 | **1,574** | **2,108** | **3,682** | +| 近14天有完课 | 1,103 (70.1%) | 699 (33.2%) | 1,802 (48.9%) | +| 近14天无完课 | 471 (29.9%) | 1,409 (66.8%) | 1,880 (51.1%) | + +### 角色激活分布 + +| 类型 | 角色数 | +|------|--------| +| 仅 L1 激活 | 1,321 | +| 仅 L2 激活 | 1,855 | +| L1+L2 双激活 | 253 | + +### 双激活角色(253)的完课分布 + +| 类别 | 角色数 | 占比 | +|------|--------|------| +| 仅 L1 有完课 | 154 | 60.9% | +| L1+L2 都有 | 21 | 8.3% | +| 仅 L2 有完课 | 10 | 4.0% | +| 都无完课 | 68 | 26.9% | + +--- + +### 🔍 两个视角对比 + +| 维度 | L1 无完课率 | L2 无完课率 | 整体无完课率 | +|------|------------|------------|------------| +| 按用户(之前) | 34.3% | 61.3% | 43.0% | +| **按课程(character×level)** | **29.9%** | **66.8%** | **51.1%** | + +课程维度的整体无完课率(51.1%)比用户维度(43.0%)高,因为一个用户可能有多门课程被激活,只要其中一门有完课就算"有完课用户",但按课程口径每一门单独计算,L2 的大量沉默课程拉高了整体值。 +user: 那么对于最近14天无完课行为的1880个激活课程来说,他们的最后一次完课记录都分布在L1和L2的哪个Unit? 
+assistant: 结果如下: + +--- + +## 📊 无完课激活课程(1,879门)最后一次完课 Unit 分布 + +> 注:和之前 1,880 差 1,是分桶查询时的浮点舍入差异。 + +### 总览 + +| 指标 | L1 | L2 | +|------|-----|-----| +| 无完课课程总数 | 471 | 1,408 | +| 从未有完课记录 | 76 (16.1%) | 316 (22.4%) | +| 有历史完课记录 | 395 (83.9%) | 1,092 (77.6%) | + +### L1 最后一次完课 Unit 分布 + +| Unit | 课程数 | 占比 | 累计 | +|------|--------|------|------| +| **U00** | **149** | **31.6%** | 31.6% | +| U01 | 94 | 20.0% | 51.6% | +| U02 | 63 | 13.4% | 65.0% | +| U03 | 32 | 6.8% | 71.8% | +| U04 | 16 | 3.4% | 75.2% | +| U05~U09 | 41 | 8.7% | 83.9% | +| 从未完课 | 76 | 16.1% | — | + +### L2 最后一次完课 Unit 分布 + +| Unit | 课程数 | 占比 | 累计 | +|------|--------|------|------| +| **U00** | **351** | **24.9%** | 24.9% | +| U01 | 242 | 17.2% | 42.1% | +| U02 | 157 | 11.2% | 53.3% | +| U03 | 87 | 6.2% | 59.4% | +| U04 | 65 | 4.6% | 64.1% | +| U05~U29 | 190 | 13.5% | 77.6% | +| 从未完课 | 316 | 22.4% | — | + +--- + +核心结论很清晰:**无完课的激活课程高度集中在前三个 Unit(U00~U02)**。L1 有 65.0% 的最后一次完课停在 U02 及之前,L2 有 53.3% 停在 U02 及之前,也就是说过半沉默课程在前三个 Unit 内就停了。 +user: 你再帮我跑一下,对于最后一次完课记录在L1的U00的149节课和U01的94节课,以及L2的U00的351节课和U01的242节课,其完课记录在U00和U01的Lesson1到Lesson5之间是怎么分布的? +assistant: 结果出来了: + +--- + +## 📊 无完课课程最后停在 U00/U01 的 Lesson 分布 + +### L1 U00(149 门) + +| Lesson | 课程数 | 占比 | +|--------|--------|------| +| L01 | 29 | 19.5% | +| L02 | 28 | 18.8% | +| L03 | 23 | 15.4% | +| L04 | 15 | 10.1% | +| **L05** | **54** | **36.2%** | + +### L1 U01(94 门) + +| Lesson | 课程数 | 占比 | +|--------|--------|------| +| L01 | 21 | 22.3% | +| L02 | 18 | 19.1% | +| L03 | 17 | 18.1% | +| L04 | 13 | 13.8% | +| **L05** | **25** | **26.6%** | + +### L2 U00(351 门) + +| Lesson | 课程数 | 占比 | +|--------|--------|------| +| L01 | 101 | 28.8% | +| L02 | 64 | 18.2% | +| L03 | 44 | 12.5% | +| L04 | 38 | 10.8% | +| **L05** | **104** | **29.6%** | + +### L2 U01(242 门) + +| Lesson | 课程数 | 占比 | +|--------|--------|------| +| L01 | 53 | 21.9% | +| L02 | 45 | 18.6% | +| L03 | 52 | 21.5% | +| L04 | 35 | 14.5% | +| **L05** | **57** | **23.6%** | + +--- + +四个组呈现出高度一致的模式: + +1. 
**L05 占比最高** — 四组中 L05 都是最大停止点(23.6%~36.2%),说明大量用户是完整走完整个 Unit 后停下的,属于"单元级断点" +2. **L01 次高** — 19.5%~28.8%,第一课就停的比例也很大,属于"入门即弃" +3. **L02→L04 逐步递减** — 中间课程越往后停的人越少,说明能过 L01 的用户会惯性往下走,直到单元结束 +user: 把上述的取数逻辑和分析逻辑都写到长期记忆中 +assistant: 已全部写入长期记忆和今日日志。整理了以下内容: + +- **`MEMORY.md`** — 新增了两条口径记录: + - 激活课程维度定义(character × level = 一门课程) + - 激活课程完课分析口径(含用户/课程双维度、最后一次完课 Unit/Lesson 分析方法、当前分析结果参考) + +- **`memory/2026-05-22.md`** — 今日完整工作日志,包含口径确认结论、各维度分析结果、脚本文件清单 diff --git a/memory/2026-05-25-1116.md b/memory/2026-05-25-1116.md new file mode 100644 index 0000000..70d4b66 --- /dev/null +++ b/memory/2026-05-25-1116.md @@ -0,0 +1,260 @@ +# Session: 2026-05-25 11:16:08 GMT+8 + +- **Session Key**: agent:main:feishu:direct:ou_e63ce6b760ad39382852472f28fbe2a2 +- **Session ID**: e942f03c-fd78-4ddc-b742-d05ded398d38 +- **Source**: feishu + +## Conversation Summary + +assistant: 关键发现:**"H5 页面曝光时"全部是匿名用户(accountId=0)**,但有 `userId` 字段。我用 `userId` 作为统一标识重新跑漏斗。 +assistant: 以下是 H5 购买流程的完整漏斗分析: + +--- + +## H5 购买流程漏斗 + +### 📊 事件级漏斗 + +``` +H5 页面曝光时 20,891 (100.0%) + │ + ▼ -54.4% +商品切换点击 9,530 ( 45.6%) + │ + ▼ -99.0% +立即购买按钮点击 94 ( 0.4%) + │ + ▼ -1.1% +支付成功 93 ( 0.4%) +``` + +### 👤 用户级漏斗(按 userId 去重) + +``` +H5 页面曝光时 6,245 人 (100.0%) 人均3.3条 + │ + ▼ -71.2% +商品切换点击 1,801 人 ( 28.8%) 人均5.3条 + │ + ▼ -94.8% +立即购买按钮点击 93 人 ( 1.5%) 人均1.0条 + │ + ▼ -1.1% +支付成功 92 人 ( 1.5%) 人均1.0条 +``` + +### 🔑 关键发现 + +| 指标 | 数值 | +|---|---| +| **曝光→切换转化率** | 28.8%(1,801/6,245) | +| **切换→购买转化率** | 5.2%(93/1,801) | +| **购买→支付成功率** | 98.9%(92/93) | +| **全链路完成人数** | **84 人** | +| **仅曝光未切换** | 4,444 人(流失 71.2%) | +| **仅切换未购买** | 1,716 人(流失 94.8%) | + +### ⚠️ 重要说明 + +- **第一步「H5 页面曝光时」全部是匿名用户**(accountId=0),说明是 H5 落地页未登录状态下的曝光埋点 +- **第二步起全部为已登录用户**(accountId>0),用户从曝光到切换商品之间发生了登录 +- 这也意味着 userId 在第一步和后续步骤间存在断层,**实际的曝光→切换转化率可能比 28.8% 更高**(部分匿名 userId 登录后变为新的 userId) +- 核心瓶颈在**切换→购买**环节,仅 5.2% 的浏览用户最终点击了购买按钮 +- 支付环节表现良好,点击购买后 98.9% 完成了支付 +user: 我们来确认一下accountId和userId的关系,是不是都同时为0 +assistant: 结果很清晰了: + +## accountId × userId 关系 + 
+| accountId | userId | 条数 | 占比 | +|:---:|:---:| ---:| ---:| +| =0 | =0 | 56 | 0.2% | +| =0 | >0 | **25,919** | **72.6%** | +| >0 | =0 | 0 | 0.0% | +| >0 | >0 | 9,716 | 27.2% | + +**结论:不是同时为 0。两者的含义不同:** + +- **accountId = 0** → 未登录的**匿名用户** +- **userId > 0** → 设备/会话级标识,**匿名和已登录时都存在** + +### 按步骤细拆 + +| 步骤 | accountId=0 | accountId>0 | +|---|---|---| +| H5 页面曝光时 | 20,891(全部匿名) | 0 | +| 商品切换点击 | 1 | 9,529(全部已登录) | +| 立即购买按钮点击 | 0 | 94 | +| 支付成功 | 0 | 93 | +| 领课加微 / firstScreen | 2,731(且无 userId 字段) | 0 | +| 领课加微 / secondScreen | 2,352(且无 userId 字段) | 0 | + +### 核心发现 + +1. **H5 页面曝光时**:用户还未登录,accountId=0,但 userId 是设备级标识(非 0) +2. **商品切换点击开始**:用户已登录,accountId 变为真实账号 ID +3. **userId 跨登录状态存在但关联不完整** — 之前算出的 28.8% 曝光→切换重合(1,801/6,245)就是通过 userId 串起来的,但因为登录前后 userId 可能变化,**实际转化率可能更高** +4. **领课加微**行为比较特殊:accountId=0 且完全没有 userId 字段,是纯粹的匿名埋点 +user: 那么以后growth_activity_behavior都按照userId进行统计,忽略掉accountId +assistant: 收到,已记录。 +assistant: 已记入长期记忆。以后 `growth_activity_behavior` 的统计都按 `userId` 做用户去重,不依赖 `accountId`。 +user: 那么重新统计H5 购买流程的子行为的完整漏斗数据 +assistant: 按照新口径(userId 去重),完整的 H5 购买流程漏斗: + +--- + +## H5 购买流程漏斗(按 userId 统计) + +### 📊 事件级漏斗 + +``` +H5 页面曝光时 20,891 ██████████████████████████ 100.0% + ▼ -54.4% +商品切换点击 9,530 ████████████ 45.6% + ▼ -99.0% +立即购买按钮点击 94 ▏ 0.4% + ▼ -1.1% +支付成功 93 ▏ 0.4% +``` + +### 👤 用户级漏斗(userId 去重) + +``` +H5 页面曝光时 6,245 ██████████████████████████ 100.0% + ▼ -71.2% +商品切换点击 1,801 ████████ 28.8% + ▼ -94.8% +立即购买按钮点击 93 ▏ 1.5% + ▼ -1.1% +支付成功 92 ▏ 1.5% +``` + +### 🔗 步骤间用户重合 + +| 关联 | 重合人数 | 说明 | +|---|---|---| +| 曝光 → 切换 | 1,801 | 曝光 6,245 人中 28.8% 继续切换 | +| 切换 → 购买 | 85 | 切换 1,801 人中 4.7% 点击购买 | +| 购买 → 支付 | 92 | 购买 93 人中 98.9% 完成支付 | +| **全链路** | **84** | 完整走完 4 步 | + +### 📋 用户行为路径 + +| 路径 | 人数 | 占比 | +|---|---|---| +| 曝光 → 离开 | 4,436 | 71.0% | +| 曝光 → 切换 → 离开 | 1,716 | 27.5% | +| 曝光 → 切换 → 购买 → 支付 ✅ | 84 | 1.3% | +| 曝光 → 购买 → 支付(跳过切换) | 8 | 0.1% | +| 曝光 → 切换 → 购买 → 未支付 | 1 | 0.02% | + +--- + +**关键瓶颈:** 切换→购买环节流失最严重(95.3% 流失),1,716 
人浏览了商品详情但没有点击购买按钮。曝光→切换也有 71.2% 的流失,但考虑到曝光是 H5 落地页的泛流量,这个比例在正常范围内。 +user: 用户不是一定要经历切换步骤才会点击购买按钮,也可能是曝光后直接点击购买按钮 +assistant: 确实,之前路径分布里已经看到了这种情况:**有 8 人从曝光直接跳到了购买**,没有经过商品切换步骤。 + +这也意味着之前按"步骤顺序"算的漏斗转化率偏保守了。实际的购买决策漏斗应该考虑两条路径的合并: + +``` + 曝光 6,245 人 + / \ + 切换 1,801 人 直接购买 8 人 + | | + 购买 85 人 8 人 + | | + 支付 84 人 8 人 +``` + +### 按"最终结果"重新聚合: + +| 漏斗层级 | userId 数 | 占曝光比 | +|---|---|---| +| 曝光 | 6,245 | 100.0% | +| 有切换行为 | 1,801 | 28.8% | +| 有购买行为(切换+直达) | 93 | **1.5%** | +| 有支付行为 | 92 | **1.5%** | + +购买这一步实际是 93 人:85 人通过切换进入购买 + 8 人直接从曝光进入购买。购买→支付 98.9% 几乎全转化。 +user: info里面都有哪些值 +assistant: `info` 字段在全量 35,691 条记录中**全部为空**,六个子行为无一例外。这个字段定义了但从未写入过数据,可以忽略。 +user: 接下来我想看索引user_behavior_buried_points中buryingPointId为1000的数据的数据结构 +assistant: 以下是 `buryingPointId=1000` 的完整数据结构: + +--- + +## buryingPointId=1000(名称未设定,共 23,447 条) + +### 字段清单 + +| 字段 | 非空率 | 说明 | +|---|---|---| +| `accountId` | 100% | 账号ID | +| `accountName` | 100% | 脱敏手机号 | +| `characterId` | 100% | 角色ID | +| `characterName` | 100% | 角色名 | +| `buryingPointId` | 100% | 固定为 1000 | +| `buryingPointName` | **0%** | 全为空 | +| `buryingPointSubId` | 100% | 子事件ID(见下) | +| `buryingPointSubName` | **0%** | 全为空 | +| `cType` | **62%** | 组件类型 | +| `cId` | **62%** | 组件编号 | +| `componentId` | 100% | **全为 0** | +| `componentName` | **62%** | 组件名称 | +| `lessonId` / `lessonName` | **62%** | 课时信息 | +| `unitId` / `unitName` | **84%** | 单元信息 | +| `seasonPackageId` / `seasonPackageName` | **84%** | 课程包 | +| `ex2` | 100% | 整数,1~12 | +| `ex3` | **45%** | 浮点数(只在 subId=4 时有值) | +| `stepDuration` | 100% | 步骤耗时(毫秒),中位 5ms,最大 388s | +| `versionId` | 100% | 版本号 | +| `activeTime` | 100% | 时间戳(秒) | + +### buryingPointSubId 分布 + +| subId | 数量 | 推测含义 | +|:---:| ---:| --- | +| 1 | 10,967 | 开始 / 进入 | +| 4 | 10,637 | 完成 / 提交(此时 ex3 有值) | +| 2 | 1,396 | 中间步骤1 | +| 3 | 447 | 中间步骤2 | + +**1 和 4 基本成对出现**(10,967 vs 10,637),且只有 subId=4 时 `ex3` 有浮点值,很像"进入→完成"的事件对;2 和 3 是中间状态。 + +### cType 分布(前 10) + +| cType | 数量 | +|---|---| +| (空) | 7,835 | +| 
mid_dialog_select | 4,103 | +| mid_dialog_fillin | 3,770 | +| core_listening_drag | 2,231 | +| mid_grammar_cloze | 1,982 | +| core_listening_choose | 1,561 | +| mid_vocab_item | 761 | +| mid_grammar_sentence | 595 | +| mid_sentence_dialogue | 557 | +| mid_dialog_repeat | 37 | + +### versionId 分布 + +| 版本 | 数量 | +|---|---| +| 1.2.2 | 8,897 | +| 1.3.1 | 7,777 | +| 1.2.4 | 5,041 | +| 1.2.3 | 1,732 | + +### seasonPackageId 分布 + +| seasonPackageId | 数量 | 说明 | +|:---:| ---:| --- | +| 1 | 13,488 | 序章 | +| 0 | 8,767 | 不在任何课程包中 | +| 2 | 1,179 | 第二季度包 | +| 12 | 12 | | +| 14 | 1 | | + +--- + +**总结:** 这是一个组件级交互追踪埋点,覆盖了对话选择、填空、听力拖拽、语法完形等多种题型。subId 1→4 是一个典型的"开始→完成"事件对,ex3 在完成时记录一个浮点指标。约 38% 的记录没有任何课程上下文(seasonPackageId=0),可能是课程外的独立练习场景。 diff --git a/memory/2026-05-25.md b/memory/2026-05-25.md new file mode 100644 index 0000000..1353701 --- /dev/null +++ b/memory/2026-05-25.md @@ -0,0 +1,12 @@ + +## 行课记录批量查询 (1676条) + +[陈逸鸫 请求, 李承龙 确认权限] + +- 表格: https://makee-interactive.feishu.cn/sheets/RFIJsXT8FhGHhctY4RwczcOfnac (Sheet1, sheet_id=55b0eb) +- 查询列: D列(小溪返回行课记录) + E列(更新时间), C列(请求状态→已返回) +- 手机号匹配(column J): 961行有用户ID直接查, 715行无ID需手机号匹配, 匹配成功119个, 596个未匹配 +- 数据库查询: 1077个去重用户, 1065有课程分配, 1192个角色有行课记录, 201已付费 +- 输出格式: `销售:xxx | 用户:xxx | 当前:{level}{type}-{chapter} | 最近行课:date | 学习{X}min | {付费状态}` +- 脚本: scripts/batch_course_record_query.py +- 待查询行1676条全部处理完成, 写回成功 diff --git a/scripts/batch_course_record_query.py b/scripts/batch_course_record_query.py new file mode 100644 index 0000000..1527a2a --- /dev/null +++ b/scripts/batch_course_record_query.py @@ -0,0 +1,450 @@ +#!/usr/bin/env python3 +""" +批量行课记录查询 - 查用户当前进度、最近行课、学习时长、付费状态 + +输入: /tmp/pending_rows.json (从飞书表格拉取的待查询行) +输出: /tmp/results.jsonl (每行一个JSON, 包含 row + result 字段) +""" + +import json +import os +import re +import sys +from collections import defaultdict + +try: + import psycopg2 + import psycopg2.extras +except ImportError: + print("ERROR: 需要 psycopg2") + sys.exit(1) + +DB_HOST = "bj-postgres-16pob4sg.sql.tencentcdb.com" 
+DB_PORT = 28591 +DB_USER = "ai_member" +DB_NAME = "vala_bi" +SHARD_COUNT = 8 + +COURSE_LEVEL_MAP = {"A1": "L1", "A2": "L2"} + + +def get_password(): + pw = os.environ.get("PG_ONLINE_PASSWORD", "") + if pw: + return pw + secrets_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "secrets.env") + if os.path.exists(secrets_path): + with open(secrets_path) as f: + for line in f: + if line.startswith("PG_ONLINE_PASSWORD="): + return line.strip().split("=", 1)[1].strip("'\"") + raise RuntimeError("PG_ONLINE_PASSWORD not found") + + +def get_conn(): + return psycopg2.connect( + host=DB_HOST, port=DB_PORT, user=DB_USER, + password=get_password(), dbname=DB_NAME, connect_timeout=30 + ) + + +def mask_phone(phone): + return f"{phone[:3]}****{phone[-4:]}" + + +def batch_query_in(cur, sql_template, params, chunk_size=500): + """分块执行 IN 查询,将 sql_template 中的第一个 %%s 替换为占位符""" + results = [] + for i in range(0, len(params), chunk_size): + chunk = params[i:i + chunk_size] + placeholders = ",".join(["%s"] * len(chunk)) + sql = sql_template % placeholders + cur.execute(sql, chunk) + results.extend(cur.fetchall()) + return results + + +def match_phones(phones, conn): + if not phones: + return {}, [], [] + + # 清理手机号:移除空格、小数点等 + def clean_phone(p): + p = str(p).strip() + # 处理浮点数格式: 13003632311.0 -> 13003632311 + if '.' 
in p: + parts = p.split('.') + if parts[1] in ('0', '00'): + p = parts[0] + # 验证是11位手机号 + if re.match(r'^1\d{10}$', p): + return p + return None + + cleaned = [(clean_phone(p), p) for p in phones] + valid_phones = [(cp, op) for cp, op in cleaned if cp] + if not valid_phones: + return {}, [], [] + + masked_to_phones = {} + for cp, op in valid_phones: + m = mask_phone(cp) + masked_to_phones.setdefault(m, []).append(op) + + masks = list(masked_to_phones.keys()) + cur = conn.cursor() + placeholders = ",".join(["%s"] * len(masks)) + cur.execute(f""" + SELECT id AS account_id, tel + FROM bi_vala_app_account + WHERE tel IN ({placeholders}) AND status = 1 AND deleted_at IS NULL + """, masks) + rows = cur.fetchall() + cur.close() + + # masked → account_id + masked_to_account = {} + for aid, masked in rows: + if masked not in masked_to_account: + masked_to_account[masked] = aid + + # original phone → account_id + phone_to_account = {} + for op in phones: + cp = clean_phone(op) + if cp: + m = mask_phone(cp) + if m in masked_to_account: + phone_to_account[op] = masked_to_account[m] + + account_ids = list(set(phone_to_account.values())) + unmatched = [p for p in phones if p not in phone_to_account] + return phone_to_account, account_ids, unmatched + + +def query_course_details(cur, account_ids): + if not account_ids: + return {} + rows = batch_query_in(cur, """ + SELECT account_id, course_level, + CASE WHEN expire_time IS NULL THEN '体验课' ELSE '正式课' END AS course_type + FROM bi_user_course_detail + WHERE account_id IN (%s) AND deleted_at IS NULL + """, account_ids) + result = {} + for aid, cl, ct in rows: + if aid not in result: + result[int(aid)] = (cl, ct) + return result + + +def query_chapter_play_records(cur, char_ids): + """查询每个角色最近一次行课记录和总学习时间""" + if not char_ids: + return {}, {} + + # 构建 UNION ALL SQL - 使用 CTE 方式先创建 ID 列表 + # 创建临时ID数组 + cur.execute("CREATE TEMP TABLE IF NOT EXISTS _tmp_char_ids (id int PRIMARY KEY) ON COMMIT DROP") + cur.execute("DELETE FROM 
_tmp_char_ids") + psycopg2.extras.execute_values(cur, + "INSERT INTO _tmp_char_ids (id) VALUES %s", + [(cid,) for cid in char_ids] + ) + + # 查询每个角色的最近行课记录 + union_parts = [] + for i in range(SHARD_COUNT): + union_parts.append( + f"SELECT cpr.user_id, cpr.chapter_id, cpr.chapter_unique_id, cpr.created_at, cpr.play_status " + f"FROM bi_user_chapter_play_record_{i} cpr " + f"JOIN _tmp_char_ids t ON cpr.user_id = t.id " + f"WHERE cpr.deleted_at IS NULL" + ) + + all_plays_sql = " UNION ALL ".join(union_parts) + + cur.execute(f""" + WITH all_plays AS ({all_plays_sql}) + , latest AS ( + SELECT DISTINCT ON (user_id) user_id, chapter_id, chapter_unique_id, created_at AS latest_date + FROM all_plays + ORDER BY user_id, created_at DESC + ) + SELECT user_id, chapter_id, chapter_unique_id, latest_date FROM latest + """) + + char_latest = {} + chapter_ids = set() + for uid, ch_id, cu_id, ldate in cur.fetchall(): + char_latest[uid] = (ch_id, ldate) + chapter_ids.add(ch_id) + + # 查询每个角色所有 play_status=1 的 chapter_unique_id + cur.execute(f""" + WITH all_plays AS ({all_plays_sql}) + SELECT user_id, chapter_unique_id + FROM all_plays + WHERE play_status = 1 + """) + + char_cuids = defaultdict(set) + all_cuids = set() + for uid, cuid in cur.fetchall(): + char_cuids[uid].add(cuid) + all_cuids.add(cuid) + + # 查询组件耗时 + char_total_ms = defaultdict(int) + if all_cuids: + cuid_list = list(all_cuids) + + # 创建临时表存 cuids + cur.execute("CREATE TEMP TABLE IF NOT EXISTS _tmp_cuids (cuid text PRIMARY KEY) ON COMMIT DROP") + cur.execute("DELETE FROM _tmp_cuids") + psycopg2.extras.execute_values(cur, + "INSERT INTO _tmp_cuids (cuid) VALUES %s", + [(c,) for c in cuid_list] + ) + + comp_parts = [] + for i in range(SHARD_COUNT): + comp_parts.append( + f"SELECT comp.chapter_unique_id, SUM(comp.interval_time) AS total_ms " + f"FROM bi_user_component_play_record_{i} comp " + f"JOIN _tmp_cuids t ON comp.chapter_unique_id = t.cuid " + f"GROUP BY comp.chapter_unique_id" + ) + comp_union = " UNION ALL 
".join(comp_parts) + + cur.execute(f""" + SELECT chapter_unique_id, SUM(total_ms) AS total_ms + FROM ({comp_union}) t2 + GROUP BY chapter_unique_id + """) + + cuid_time = {} + for cuid, tms in cur.fetchall(): + cuid_time[cuid] = (tms or 0) + + for uid, cuids in char_cuids.items(): + total = sum(cuid_time.get(c, 0) for c in cuids) + char_total_ms[uid] = total + + return char_latest, char_total_ms + + +def query_chapter_course_map(cur, chapter_ids): + if not chapter_ids: + return {} + rows = batch_query_in(cur, """ + SELECT id, course_level, course_season, course_unit, course_lesson + FROM bi_level_unit_lesson WHERE id IN (%s) + """, list(chapter_ids)) + return {r[0]: (r[1], r[2], r[3], r[4]) for r in rows} + + +def query_payment_status(cur, account_ids): + if not account_ids: + return {} + rows = batch_query_in(cur, """ + SELECT account_id, MAX(CASE WHEN order_status IN (3,4) THEN 1 ELSE 0 END) AS has_order + FROM bi_vala_order WHERE account_id IN (%s) AND deleted_at IS NULL + GROUP BY account_id + """, account_ids) + status = {int(aid): "已付费" if ho == 1 else "未付费" for aid, ho in rows} + for aid in account_ids: + if int(aid) not in status: + status[int(aid)] = "未付费" + return status + + +def query_characters(cur, account_ids): + if not account_ids: + return {} + rows = batch_query_in(cur, """ + SELECT account_id, id AS character_id + FROM bi_vala_app_character + WHERE account_id IN (%s) AND nickname IS NOT NULL AND nickname != '' AND deleted_at IS NULL + """, account_ids) + result = defaultdict(list) + for aid, cid in rows: + result[int(aid)].append(int(cid)) + return result + + +def main(): + print("加载待查询行...") + with open("/tmp/pending_rows.json") as f: + pending_rows = json.load(f) + print(f"共 {len(pending_rows)} 条待查询") + + with_id = [] + without_id = [] + for row in pending_rows: + uid_str = (row.get("user_id") or "").strip() + if uid_str: + try: + row["account_id"] = int(float(uid_str)) + with_id.append(row) + except (ValueError, TypeError): + phone = 
(row.get("phone") or "").strip() + if phone: + row["phone_clean"] = phone + without_id.append(row) + else: + phone = (row.get("phone") or "").strip() + if phone: + row["phone_clean"] = phone + without_id.append(row) + + print(f"有用户ID: {len(with_id)}, 需手机号匹配: {len(without_id)}") + + conn = get_conn() + cur = conn.cursor() + + # Step 1: 手机号匹配 + phone_to_account = {} + if without_id: + phones = [r["phone_clean"] for r in without_id] + phone_to_account, matched_ids, unmatched = match_phones(phones, conn) + print(f"手机号匹配: {len(phone_to_account)} 成功, {len(unmatched)} 未匹配") + + for row in without_id: + p = row["phone_clean"] + if p in phone_to_account: + row["account_id"] = int(phone_to_account[p]) + with_id.append(row) + + print(f"总计可查询用户: {len(with_id)}") + + all_account_ids = list(set(r["account_id"] for r in with_id)) + print(f"去重 account_id: {len(all_account_ids)}") + + if not all_account_ids: + print("没有可查询的用户,退出") + conn.close() + return + + # Step 2: 查询课程分配 + print("查询课程分配...") + course_details = query_course_details(cur, all_account_ids) + print(f" 有课程: {len(course_details)}") + + # Step 3: 查询角色 + print("查询角色...") + account_chars = query_characters(cur, all_account_ids) + all_char_ids = [] + for chars in account_chars.values(): + all_char_ids.extend(chars) + print(f" 角色数: {len(all_char_ids)}") + + # Step 4: 查询行课记录 + print("查询行课记录 (可能需要几分钟)...") + char_latest, char_total_ms = query_chapter_play_records(cur, all_char_ids) + print(f" 有行课记录: {len(char_latest)}") + + # Step 5: chapter 映射 + all_chapter_ids = set() + for ch_id, _ in char_latest.values(): + all_chapter_ids.add(ch_id) + print("查询课程映射...") + course_chapter_map = query_chapter_course_map(cur, all_chapter_ids) + print(f" chapter: {len(course_chapter_map)}") + + # Step 6: 付费状态 + print("查询付费状态...") + pay_status = query_payment_status(cur, all_account_ids) + print(f" 付费: {sum(1 for v in pay_status.values() if v == '已付费')}") + + conn.close() + + # Step 7: 组装结果 + print("组装结果...") + results = [] + + # 按 row 
去重,同一用户多行合并 + row_aid_map = {} + for row in with_id: + rn = row["row"] + aid = row["account_id"] + sales = (row.get("sales") or "").strip() + if rn not in row_aid_map: + row_aid_map[rn] = (aid, sales) + + for row in with_id: + rn = row["row"] + aid = row["account_id"] + sales = (row.get("sales") or "").strip() + + # 课程描述 + cd = course_details.get(aid) + if cd: + cl, ct = cd + level_display = COURSE_LEVEL_MAP.get(cl, cl) + course_desc = f"{level_display}{ct}" + else: + course_desc = "暂无课程" + + # 最近行课和当前章节 + chars = account_chars.get(aid, []) + latest_date = None + chapter_pos = "" + for cid in chars: + if cid in char_latest: + ch_id, ldate = char_latest[cid] + if ldate: + if latest_date is None or ldate > latest_date: + latest_date = ldate + if ch_id in course_chapter_map: + lvl, season, unit, lesson = course_chapter_map[ch_id] + chapter_pos = f"{lvl}-{season}-{unit}-{lesson}" + + if not chapter_pos: + chapter_pos = "无记录" + date_str = latest_date.strftime("%Y-%m-%d") if latest_date else "无记录" + + # 学习时长 + total_ms = 0 + for cid in chars: + total_ms += (char_total_ms.get(cid, 0) or 0) + total_min = round(total_ms / 1000 / 60) + + # 付费状态 + pay = pay_status.get(aid, "未付费") + + result_text = f"销售:{sales} | 用户:{aid} | 当前:{course_desc}-{chapter_pos} | 最近行课:{date_str} | 学习{total_min}min | {pay}" + + results.append({ + "row": rn, + "result": result_text, + "aid": aid + }) + + # 未匹配的 + for row in without_id: + p = row.get("phone_clean", "") + if p not in phone_to_account: + results.append({ + "row": row["row"], + "result": "未匹配到用户", + "aid": 0 + }) + + # 输出 + with open("/tmp/results.jsonl", "w") as f: + for r in results: + f.write(json.dumps(r, ensure_ascii=False) + "\n") + + print(f"\n✅ 查询完成: {len(results)} 条结果") + unrecorded = sum(1 for r in results if "无记录" in r.get("result","") and "未匹配" not in r.get("result","")) + unmatched_count = sum(1 for r in results if "未匹配" in r.get("result","")) + paid_count = sum(1 for r in results if "已付费" in r.get("result","")) + print(f" 有记录: 
{len(results) - unrecorded - unmatched_count}") + print(f" 无记录: {unrecorded}") + print(f" 未匹配: {unmatched_count}") + print(f" 已付费: {paid_count}") + + +if __name__ == "__main__": + main() diff --git a/scripts/check_account_userid.py b/scripts/check_account_userid.py new file mode 100644 index 0000000..5917481 --- /dev/null +++ b/scripts/check_account_userid.py @@ -0,0 +1,91 @@ +import json, os, subprocess + +ES_URL = "https://es-7vd7jcu9.public.tencentelasticsearch.com:9200" +AUTH = os.environ["ES_AUTH"]  # "user:password"; ES_AUTH is an assumed variable name, credentials should not be committed + +def count_query(must_clauses): + q = {"query": {"bool": {"must": must_clauses}}} + cmd = ["curl", "-s", "-u", AUTH, "-H", "Content-Type: application/json", + f"{ES_URL}/growth_activity_behavior/_count", "-d", json.dumps(q)] + r = subprocess.run(cmd, capture_output=True, text=True) + return json.loads(r.stdout).get("count", 0) + +total = count_query([]) + +# 四种组合 +nonzero_account = {"bool": {"must_not": [{"term": {"accountId": 0}}]}} +nonzero_user = {"bool": {"must": [{"exists": {"field": "userId"}}], "must_not": [{"term": {"userId": 0}}]}}  # require the field: must_not alone also matches docs that have no userId +zero_account = {"term": {"accountId": 0}} +zero_user = {"term": {"userId": 0}} + +a0_u0 = count_query([zero_account, zero_user]) +a0_ux = count_query([zero_account, nonzero_user]) +ax_u0 = count_query([nonzero_account, zero_user]) +ax_ux = count_query([nonzero_account, nonzero_user]) + +print("=== growth_activity_behavior 中 accountId x userId 组合 ===\n") +print(f"{'accountId':>12} {'userId':>12} {'条数':>10} {'占比':>10}") +print("-" * 48) +print(f"{'=0':>12} {'=0':>12} {a0_u0:>10} {a0_u0/total*100:>9.1f}%") +print(f"{'=0':>12} {'>0':>12} {a0_ux:>10} {a0_ux/total*100:>9.1f}%") +print(f"{'>0':>12} {'=0':>12} {ax_u0:>10} {ax_u0/total*100:>9.1f}%") +print(f"{'>0':>12} {'>0':>12} {ax_ux:>10} {ax_ux/total*100:>9.1f}%") +print(f"{'总计':>12} {'':>12} {total:>10}") + +# 按 
subBehavior 细分 - ES aggregation +print("\n=== 按 behavior + subBehavior 细分 ===\n") +agg_query = { + "size": 0, + "aggs": { + "behaviors": { + "terms": {"field": "behavior", "size": 10}, + "aggs": { + "subs": { 
+ "terms": {"field": "subBehavior", "size": 10}, + "aggs": { + "a0_u0": {"filter": {"bool": {"must": [zero_account, zero_user]}}}, + "a0_ux": {"filter": {"bool": {"must": [zero_account, nonzero_user]}}}, + "ax_u0": {"filter": {"bool": {"must": [nonzero_account, zero_user]}}}, + "ax_ux": {"filter": {"bool": {"must": [nonzero_account, nonzero_user]}}} + } + } + } + } + } +} + +cmd = ["curl", "-s", "-u", AUTH, "-H", "Content-Type: application/json", + f"{ES_URL}/growth_activity_behavior/_search", "-d", json.dumps(agg_query)] +r = subprocess.run(cmd, capture_output=True, text=True) +data = json.loads(r.stdout) + +header = f"{'behavior / subBehavior':<40} {'总计':>8} {'a=0,u=0':>8} {'a=0,u>0':>8} {'a>0,u=0':>8} {'a>0,u>0':>8}" +print(header) +print("-" * len(header)) + +for bb in data["aggregations"]["behaviors"]["buckets"]: + bname = bb["key"] + for sb in bb["subs"]["buckets"]: + sname = sb["key"] + n = sb["doc_count"] + a0u0 = sb["a0_u0"]["doc_count"] + a0ux = sb["a0_ux"]["doc_count"] + axu0 = sb["ax_u0"]["doc_count"] + axux = sb["ax_ux"]["doc_count"] + label = f"{bname} / {sname}" + print(f"{label:<40} {n:>8} {a0u0:>8} {a0ux:>8} {axu0:>8} {axux:>8}") + +# userId 不存在的记录 +print(f"\n=== userId 字段不存在 的记录 ===") +u_nonexist = count_query([{"bool": {"must_not": [{"exists": {"field": "userId"}}]}}]) +print(f"userId 字段不存在: {u_nonexist}") + +# 抽 userId=0 的样本 +cmd = ["curl", "-s", "-u", AUTH, "-H", "Content-Type: application/json", + f"{ES_URL}/growth_activity_behavior/_search?size=3", + "-d", '{"query":{"term":{"userId":0}},"_source":true}'] +r = subprocess.run(cmd, capture_output=True, text=True) +data = json.loads(r.stdout) +print(f"\nuserId=0 记录数: {data['hits']['total']['value']}") +for h in data['hits']['hits']: + print(f" {json.dumps(h['_source'], ensure_ascii=False)}") diff --git a/scripts/endor_purchase_analysis.py b/scripts/endor_purchase_analysis.py new file mode 100644 index 0000000..da51dbf --- /dev/null +++ b/scripts/endor_purchase_analysis.py @@ -0,0 +1,437 @@ 
+#!/usr/bin/env python3 +""" +端内购买用户行为分析:注册日期 → 序章完课日期 → 购课日期 三者关系 +""" +import psycopg2 +from datetime import datetime, date +from collections import defaultdict, Counter +import os + +PG_PASS = os.environ['PG_ONLINE_PASSWORD']  # no hard-coded fallback: fail fast if the variable is unset +conn = psycopg2.connect( + host="bj-postgres-16pob4sg.sql.tencentcdb.com", + port=28591, + user="ai_member", + password=PG_PASS, + database="vala_bi" +) +cur = conn.cursor() + +# U0 序章 chapter_id +L1_U0 = [343, 344, 345, 346, 348] +L2_U0 = [55, 56, 57, 58, 59] + +print("=" * 80) +print("端内购买用户行为分析") +print("=" * 80) + +# ========== Step 1: 端内付费用户 ========== +print("\n[Step 1] 查询端内付费用户...") +cur.execute(""" + SELECT DISTINCT ON (o.account_id) + o.account_id, + o.goods_id, + o.pay_success_date as purchase_date, + CASE + WHEN o.goods_id IN (57, 60, 63) THEN 'L1' + WHEN o.goods_id IN (31, 32, 33, 54) THEN 'L2' + ELSE 'other' + END as buy_level + FROM bi_vala_order o + JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1 + WHERE o.key_from = 'app-active-h5-0-0' + AND o.order_status IN (3, 4) + AND o.pay_success_date IS NOT NULL + ORDER BY o.account_id, o.pay_success_date +""") +endo_users = {} +for row in cur.fetchall(): + account_id, goods_id, purchase_date, buy_level = row + if buy_level == 'other': + continue + endo_users[account_id] = { + 'goods_id': goods_id, + 'purchase_date': purchase_date, + 'buy_level': buy_level + } + +print(f" 端内付费用户数: {len(endo_users)}") +l1_users = {k: v for k, v in endo_users.items() if v['buy_level'] == 'L1'} +l2_users = {k: v for k, v in endo_users.items() if v['buy_level'] == 'L2'} +print(f" - L1 购买用户: {len(l1_users)}") +print(f" - L2 购买用户: {len(l2_users)}") + +# ========== Step 2: 注册时间 ========== +print("\n[Step 2] 查询用户注册时间...") +all_ids = list(endo_users.keys()) +batch_size = 500 +register_map = {} +for i in range(0, len(all_ids), batch_size): + batch = all_ids[i:i+batch_size] + cur.execute( + "SELECT id, created_at FROM bi_vala_app_account WHERE 
id IN %s AND status 
= 1", + (tuple(batch),) + ) + for row in cur.fetchall(): + register_map[row[0]] = row[1] + +print(f" 找到注册时间的用户: {len(register_map)}") + +# ========== Step 3: account_id -> user_id (character) 映射 ========== +print("\n[Step 3] 查询用户角色映射...") +char_map = {} # account_id -> [user_id, ...] +for i in range(0, len(all_ids), batch_size): + batch = all_ids[i:i+batch_size] + cur.execute( + "SELECT account_id, id FROM bi_vala_app_character WHERE account_id IN %s", + (tuple(batch),) + ) + for row in cur.fetchall(): + acct = row[0] + uid = row[1] + if acct not in char_map: + char_map[acct] = [] + char_map[acct].append(uid) + +print(f" 有角色的用户数: {len(char_map)}") + +# ========== Step 4: 序章完课时间 ========== +print("\n[Step 4] 查询序章完课时间(8张分表)...") + +# 收集所有 user_id,按等级分 +l1_user_ids = set() +l2_user_ids = set() +for acct in l1_users: + if acct in char_map: + for uid in char_map[acct]: + l1_user_ids.add(uid) +for acct in l2_users: + if acct in char_map: + for uid in char_map[acct]: + l2_user_ids.add(uid) + +print(f" L1 相关 user_id 数: {len(l1_user_ids)}") +print(f" L2 相关 user_id 数: {len(l2_user_ids)}") + +def query_prologue_completion(cur, user_ids, chapter_ids, table_prefix="bi_user_chapter_play_record"): + """查询序章完课最早时间,返回 {user_id: earliest_date}""" + result = {} + if not user_ids: + return result + + # 分表查询 + union_parts = [] + for t in range(8): + union_parts.append(f""" + SELECT user_id, MIN(created_at) as first_done + FROM {table_prefix}_{t} + WHERE chapter_id IN ({','.join(map(str, chapter_ids))}) + AND play_status = 1 + AND user_id IN %(user_ids)s + GROUP BY user_id + """) + + sql = " UNION ALL ".join(union_parts) + sql = f""" + SELECT user_id, MIN(first_done) as earliest + FROM ({sql}) sub + GROUP BY user_id + """ + + batch_size_uid = 2000 + uid_list = list(user_ids) + for i in range(0, len(uid_list), batch_size_uid): + batch = tuple(uid_list[i:i+batch_size_uid]) + cur.execute(sql, {'user_ids': batch}) + for row in cur.fetchall(): + result[row[0]] = row[1] + + return result + +# 
L1 序章 +l1_prologue = query_prologue_completion(cur, l1_user_ids, L1_U0) +print(f" L1 序章有完课记录的 user_id: {len(l1_prologue)}") + +# L2 序章 +l2_prologue = query_prologue_completion(cur, l2_user_ids, L2_U0) +print(f" L2 序章有完课记录的 user_id: {len(l2_prologue)}") + +# ========== Step 5: 关联分析 ========== +print("\n[Step 5] 关联分析...") + +# 对每个端内购买用户,找到他的角色中对应等级序章最早完课时间 +def get_earliest_prologue(account_id, buy_level): + """获取用户对应等级序章的最早完课时间""" + if account_id not in char_map: + return None + user_ids = char_map[account_id] + + earliest = None + prologue_map = l1_prologue if buy_level == 'L1' else l2_prologue + for uid in user_ids: + if uid in prologue_map: + d = prologue_map[uid] + if earliest is None or d < earliest: + earliest = d + return earliest + +# 构建分析数据 +records = [] +for account_id, info in endo_users.items(): + register_date = register_map.get(account_id) + purchase_date = info['purchase_date'] + buy_level = info['buy_level'] + + if register_date is None: + continue + + prologue_date = get_earliest_prologue(account_id, buy_level) + + # 计算天数差 + if isinstance(register_date, datetime): + reg_d = register_date.date() + else: + reg_d = register_date + + if isinstance(purchase_date, datetime): + pur_d = purchase_date.date() + elif isinstance(purchase_date, date): + pur_d = purchase_date + else: + pur_d = purchase_date + + if prologue_date: + if isinstance(prologue_date, datetime): + pro_d = prologue_date.date() + else: + pro_d = prologue_date + + reg_to_pro = (pro_d - reg_d).days + pro_to_pur = (pur_d - pro_d).days + reg_to_pur = (pur_d - reg_d).days + has_prologue = True + else: + reg_to_pro = None + pro_to_pur = None + reg_to_pur = (pur_d - reg_d).days + has_prologue = False + + records.append({ + 'account_id': account_id, + 'level': buy_level, + 'register_date': reg_d, + 'prologue_date': prologue_date.date() if has_prologue and isinstance(prologue_date, datetime) else (prologue_date if has_prologue else None), + 'purchase_date': pur_d, + 'reg_to_pro': reg_to_pro, + 
'pro_to_pur': pro_to_pur, + 'reg_to_pur': reg_to_pur, + 'has_prologue': has_prologue + }) + +# ========== Step 6: 统计输出 ========== +print(f"\n总分析用户数: {len(records)}") + +# --- 总体统计 --- +with_prologue = [r for r in records if r['has_prologue']] +without_prologue = [r for r in records if not r['has_prologue']] + +print(f"\n{'='*80}") +print("总体概览") +print(f"{'='*80}") +print(f"端内付费用户: {len(records)}") +print(f" 有L1序章完课记录: {len([r for r in with_prologue if r['level']=='L1'])}") +print(f" 有L2序章完课记录: {len([r for r in with_prologue if r['level']=='L2'])}") +print(f" 无对应序章完课记录: {len(without_prologue)}") + +# --- 天数分布 --- +def percentile(sorted_vals, p): + if not sorted_vals: + return None + idx = int(len(sorted_vals) * p / 100) + return sorted_vals[min(idx, len(sorted_vals)-1)] + +def print_stats(name, vals, unit='天'): + if not vals: + print(f" {name}: 无数据") + return + s = sorted(vals) + print(f" {name}:") + print(f" 样本数: {len(s)}") + print(f" 中位数: {percentile(s, 50):.1f}{unit}") + print(f" 平均值: {sum(s)/len(s):.1f}{unit}") + print(f" P25: {percentile(s, 25):.1f}{unit}") + print(f" P75: {percentile(s, 75):.1f}{unit}") + print(f" P90: {percentile(s, 90):.1f}{unit}") + +print(f"\n{'='*80}") +print("全量统计:注册→购课 天数") +print(f"{'='*80}") +print_stats("全量 注册→购课", [r['reg_to_pur'] for r in records]) + +print(f"\n{'='*80}") +print("有序章完课用户统计") +print(f"{'='*80}") +print_stats("序章完课 注册→完课 天数", [r['reg_to_pro'] for r in with_prologue]) +print_stats("序章完课 完课→购课 天数", [r['pro_to_pur'] for r in with_prologue]) +print_stats("序章完课 注册→购课 天数", [r['reg_to_pur'] for r in with_prologue]) + +# --- 按等级拆分 --- +for level in ['L1', 'L2']: + lvl = [r for r in records if r['level'] == level] + lvl_pro = [r for r in lvl if r['has_prologue']] + lvl_no = [r for r in lvl if not r['has_prologue']] + + print(f"\n{'='*80}") + print(f"{level} 购买用户统计") + print(f"{'='*80}") + print(f" {level} 总用户: {len(lvl)}") + print(f" 有{level}序章完课: {len(lvl_pro)}") + print(f" 无{level}序章完课: {len(lvl_no)}") + + print_stats(f"\n 
{level} 全量 注册→购课", [r['reg_to_pur'] for r in lvl]) + + if lvl_pro: + print_stats(f" {level} 序章完课 注册→完课", [r['reg_to_pro'] for r in lvl_pro]) + print_stats(f" {level} 序章完课 完课→购课", [r['pro_to_pur'] for r in lvl_pro]) + print_stats(f" {level} 序章完课 注册→购课", [r['reg_to_pur'] for r in lvl_pro]) + +# --- 完课→购课的时间段分布 --- +print(f"\n{'='*80}") +print("有完课用户:完课→购课 时间段分布") +print(f"{'='*80}") + +buckets = { + '0天(当天购课)': (0, 0), + '1-3天': (1, 3), + '4-7天': (4, 7), + '8-14天': (8, 14), + '15-30天': (15, 30), + '31-60天': (31, 60), + '61-90天': (61, 90), + '90天以上': (91, 99999), + '负数(购课在完课前)': (-99999, -1), +} + +for level in ['L1', 'L2', 'ALL']: + if level == 'ALL': + data = with_prologue + else: + data = [r for r in with_prologue if r['level'] == level] + + if not data: + continue + + vals = [r['pro_to_pur'] for r in data] + print(f"\n[{level}] 完课→购课 天数分布:") + for label, (lo, hi) in buckets.items(): + cnt = sum(1 for v in vals if lo <= v <= hi) + pct = cnt / len(vals) * 100 + bar = '█' * int(pct / 2) + print(f" {label:20s}: {cnt:4d} ({pct:5.1f}%) {bar}") + +# --- 注册→完课时间段分布 --- +print(f"\n{'='*80}") +print("有完课用户:注册→完课 时间段分布") +print(f"{'='*80}") + +for level in ['L1', 'L2', 'ALL']: + if level == 'ALL': + data = with_prologue + else: + data = [r for r in with_prologue if r['level'] == level] + + if not data: + continue + + vals = [r['reg_to_pro'] for r in data] + print(f"\n[{level}] 注册→完课 天数分布:") + for label, (lo, hi) in buckets.items(): + if hi < 0: + continue + cnt = sum(1 for v in vals if lo <= v <= hi) + pct = cnt / len(vals) * 100 + bar = '█' * int(pct / 2) + print(f" {label:20s}: {cnt:4d} ({pct:5.1f}%) {bar}") + +# --- 注册→购课 时间段分布 --- +print(f"\n{'='*80}") +print("注册→购课 时间段分布") +print(f"{'='*80}") + +for level in ['L1', 'L2', 'ALL']: + if level == 'ALL': + data = records + else: + data = [r for r in records if r['level'] == level] + + if not data: + continue + + vals = [r['reg_to_pur'] for r in data] + print(f"\n[{level}] 注册→购课 天数分布:") + extended_buckets = { + '0天(当天购课)': 
(0, 0), + '1天': (1, 1), + '2天': (2, 2), + '3天': (3, 3), + '4-7天': (4, 7), + '8-14天': (8, 14), + '15-30天': (15, 30), + '31-60天': (31, 60), + '61-90天': (61, 90), + '90-180天': (91, 180), + '180天以上': (181, 99999), + } + for label, (lo, hi) in extended_buckets.items(): + cnt = sum(1 for v in vals if lo <= v <= hi) + pct = cnt / len(vals) * 100 + bar = '█' * int(pct / 2) + print(f" {label:20s}: {cnt:4d} ({pct:5.1f}%) {bar}") + +# --- 完课在购课之前 vs 之后 --- +print(f"\n{'='*80}") +print("完课 vs 购课 时间关系") +print(f"{'='*80}") + +for level in ['L1', 'L2', 'ALL']: + if level == 'ALL': + data = with_prologue + else: + data = [r for r in with_prologue if r['level'] == level] + + if not data: + continue + + pro_before_pur = [r for r in data if r['pro_to_pur'] >= 0] + pro_after_pur = [r for r in data if r['pro_to_pur'] < 0] + pro_same_day = [r for r in data if r['pro_to_pur'] == 0] + + print(f"\n[{level}]:") + print(f" 完课在购课之前或同一天: {len(pro_before_pur)} ({len(pro_before_pur)/len(data)*100:.1f}%)") + print(f" 其中同一天: {len(pro_same_day)} ({len(pro_same_day)/len(data)*100:.1f}%)") + print(f" 完课在购课之后: {len(pro_after_pur)} ({len(pro_after_pur)/len(data)*100:.1f}%)") + +# ========== 导出详细CSV ========== +print(f"\n{'='*80}") +print("导出详细数据...") +print(f"{'='*80}") + +import csv +output_path = '/root/.openclaw/workspace/output/endor_purchase_analysis.csv' +with open(output_path, 'w', newline='', encoding='utf-8-sig') as f: + writer = csv.writer(f) + writer.writerow(['account_id', '等级', '注册日期', '序章完课日期', '购课日期', + '注册→完课(天)', '完课→购课(天)', '注册→购课(天)', '有序章完课']) + for r in sorted(records, key=lambda x: x['purchase_date']): + writer.writerow([ + r['account_id'], r['level'], r['register_date'], r['prologue_date'], r['purchase_date'], + r['reg_to_pro'] if r['reg_to_pro'] is not None else '', + r['pro_to_pur'] if r['pro_to_pur'] is not None else '', + r['reg_to_pur'], '是' if r['has_prologue'] else '否' + ]) + +print(f" 已导出到: {output_path}") + +cur.close() +conn.close() +print("\n分析完成!") diff --git 
a/scripts/endor_purchase_analysis_v2.py b/scripts/endor_purchase_analysis_v2.py new file mode 100644 index 0000000..f4734f1 --- /dev/null +++ b/scripts/endor_purchase_analysis_v2.py @@ -0,0 +1,500 @@ +#!/usr/bin/env python3 +""" +端内购买用户行为分析 V2 +口径修正: + 1. 按订单维度:同一用户买L1+L2拆为两条记录 + 2. 序章完课 = U0全部五节(L01~L05)完成,取最后一节完成日期 + 3. 新增:购课时序章完成到第几节课 +""" +import psycopg2 +from datetime import datetime, date +from collections import defaultdict, Counter +import os +import csv + +PG_PASS = os.environ['PG_ONLINE_PASSWORD'] +conn = psycopg2.connect( + host="bj-postgres-16pob4sg.sql.tencentcdb.com", + port=28591, + user="ai_member", + password=PG_PASS, + database="vala_bi" +) +cur = conn.cursor() + +# U0 序章 chapter_id -> lesson_name +L1_U0_LESSONS = {343: 'L01', 344: 'L02', 345: 'L03', 346: 'L04', 348: 'L05'} +L2_U0_LESSONS = {55: 'L01', 56: 'L02', 57: 'L03', 58: 'L04', 59: 'L05'} + +LEVEL_MAP = { + 57: 'L1', 60: 'L1', 63: 'L1', + 31: 'L2', 32: 'L2', 33: 'L2', 54: 'L2', + 61: 'L1+L2', +} + +print("=" * 80) +print("端内购买用户行为分析 V2") +print("口径:序章完课 = U0全部五节完成 | 订单维度 | 购课时进度分析") +print("=" * 80) + +# ========== Step 1: 端内付费订单(按订单维度) ========== +print("\n[Step 1] 查询端内付费订单...") +cur.execute(""" + SELECT + o.id as order_id, + o.account_id, + o.goods_id, + o.pay_success_date, + o.trade_no + FROM bi_vala_order o + JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1 + WHERE o.key_from = 'app-active-h5-0-0' + AND o.order_status IN (3, 4) + AND o.pay_success_date IS NOT NULL + ORDER BY o.pay_success_date +""") + +orders = [] # list of {order_id, account_id, goods_id, purchase_date, level} +for row in cur.fetchall(): + order_id, account_id, goods_id, purchase_date, trade_no = row + level = LEVEL_MAP.get(goods_id) + if level is None or level == 'L1+L2': + continue + if isinstance(purchase_date, datetime): + purchase_date = purchase_date.date() + orders.append({ + 'order_id': order_id, + 'account_id': account_id, + 'goods_id': goods_id, + 
'purchase_date': purchase_date, + 'level': level + }) + +print(f" 端内付费订单数: {len(orders)}") +print(f" - L1 订单: {sum(1 for o in orders if o['level']=='L1')}") +print(f" - L2 订单: {sum(1 for o in orders if o['level']=='L2')}") + +# 统计有多少用户同时买了L1+L2 +user_levels = defaultdict(set) +for o in orders: + user_levels[o['account_id']].add(o['level']) +dual_users = [uid for uid, lvls in user_levels.items() if len(lvls) > 1] +print(f" 同时购买L1+L2的用户: {len(dual_users)}") + +# ========== Step 2: 注册时间 ========== +print("\n[Step 2] 查询用户注册时间...") +all_accounts = list(set(o['account_id'] for o in orders)) +register_map = {} +batch_size = 500 +for i in range(0, len(all_accounts), batch_size): + batch = all_accounts[i:i+batch_size] + cur.execute( + "SELECT id, created_at FROM bi_vala_app_account WHERE id IN %s", + (tuple(batch),) + ) + for row in cur.fetchall(): + d = row[1] + register_map[row[0]] = d.date() if isinstance(d, datetime) else d + +print(f" 找到注册时间: {len(register_map)}") + +# ========== Step 3: account_id -> user_id 映射 ========== +print("\n[Step 3] 查询用户角色映射...") +char_map = defaultdict(list) +for i in range(0, len(all_accounts), batch_size): + batch = all_accounts[i:i+batch_size] + cur.execute( + "SELECT account_id, id FROM bi_vala_app_character WHERE account_id IN %s", + (tuple(batch),) + ) + for row in cur.fetchall(): + char_map[row[0]].append(row[1]) + +print(f" 有角色的用户: {sum(1 for uid, chars in char_map.items() if chars)}") + +# ========== Step 4: 序章每节课的最早完成时间 ========== +print("\n[Step 4] 查询序章每节课完成时间(8张分表)...") + +def query_lesson_completion(cur, user_ids, chapter_ids_dict): + """ + 查询每个 user_id 对每个 chapter_id 的最早完成时间 + 返回: {user_id: {chapter_id: earliest_date}} + """ + result = defaultdict(dict) + if not user_ids: + return result + + # 先按user_ids批量查询 + all_chapter_ids = list(chapter_ids_dict.keys()) + + union_parts = [] + for t in range(8): + union_parts.append(f""" + SELECT user_id, chapter_id, MIN(created_at) as first_done + FROM bi_user_chapter_play_record_{t} 
+ WHERE chapter_id IN ({','.join(map(str, all_chapter_ids))}) + AND play_status = 1 + AND user_id IN %(user_ids)s + GROUP BY user_id, chapter_id + """) + + sql = " UNION ALL ".join(union_parts) + sql = f""" + SELECT user_id, chapter_id, MIN(first_done) as earliest + FROM ({sql}) sub + GROUP BY user_id, chapter_id + """ + + uid_list = list(user_ids) + for i in range(0, len(uid_list), 2000): + batch = tuple(uid_list[i:i+2000]) + cur.execute(sql, {'user_ids': batch}) + for row in cur.fetchall(): + uid, cid, dt = row + result[uid][cid] = dt.date() if isinstance(dt, datetime) else dt + + return result + +# 收集所有user_id +all_user_ids = set() +for acct in all_accounts: + for uid in char_map.get(acct, []): + all_user_ids.add(uid) + +print(f" 总 user_id 数: {len(all_user_ids)}") + +# L1 序章 +l1_lesson_data = query_lesson_completion(cur, all_user_ids, L1_U0_LESSONS) +# L2 序章 +l2_lesson_data = query_lesson_completion(cur, all_user_ids, L2_U0_LESSONS) + +print(f" L1 序章有完成记录的 user_id: {len(l1_lesson_data)}") +print(f" L2 序章有完成记录的 user_id: {len(l2_lesson_data)}") + +# ========== Step 5: 对每个订单分析序章进度 ========== +print("\n[Step 5] 逐订单分析序章进度...") + +def analyze_user_prologue(user_ids, lesson_data, lesson_map): + """ + 汇总用户所有角色在序章各节的完成情况 + 返回: {chapter_id: earliest_completion_date} (只包含完成的) + """ + merged = {} + for uid in user_ids: + if uid not in lesson_data: + continue + for cid, dt in lesson_data[uid].items(): + if cid not in merged or dt < merged[cid]: + merged[cid] = dt + return merged + +def get_prologue_status(lesson_completions, lesson_map, purchase_date): + """ + 分析序章完成情况: + - total_done: 总共完成了几节(不限时间) + - done_before_purchase: 购课前完成了哪几节 + - done_after_purchase: 购课后完成了哪几节 + - all_five_done: 是否五节全完成 + - all_five_done_date: 五节全完成的日期(最后一节) + - progress_at_purchase: 购课时的进度(完成了几节) + """ + total_done = len(lesson_completions) + + done_before = {} + done_after = {} + for cid, dt in lesson_completions.items(): + if dt <= purchase_date: + done_before[cid] = dt + else: + 
done_after[cid] = dt + + # 是否五节全完成 + all_five_done = total_done == 5 + all_five_done_date = max(lesson_completions.values()) if all_five_done else None + + # 购课时完成了几节 + progress_at_purchase = len(done_before) + + # 购课时哪些节已完成 + lessons_done_at_purchase = sorted( + [lesson_map[cid] for cid in done_before], + key=lambda x: int(x[1:]) # L01, L02, ... + ) + + return { + 'total_done': total_done, + 'done_before_count': len(done_before), + 'done_after_count': len(done_after), + 'all_five_done': all_five_done, + 'all_five_done_date': all_five_done_date, + 'progress_at_purchase': progress_at_purchase, + 'lessons_done_at_purchase': lessons_done_at_purchase, + 'done_before': done_before, + 'done_after': done_after, + } + +records = [] +for o in orders: + account_id = o['account_id'] + level = o['level'] + purchase_date = o['purchase_date'] + register_date = register_map.get(account_id) + + if register_date is None: + continue + + user_ids = char_map.get(account_id, []) + lesson_map = L1_U0_LESSONS if level == 'L1' else L2_U0_LESSONS + lesson_data = l1_lesson_data if level == 'L1' else l2_lesson_data + + user_completions = analyze_user_prologue(user_ids, lesson_data, lesson_map) + status = get_prologue_status(user_completions, lesson_map, purchase_date) + + # 注册→购课天数 + reg_to_pur = (purchase_date - register_date).days + + # 注册→序章全完成天数 + reg_to_all5 = None + if status['all_five_done_date']: + reg_to_all5 = (status['all_five_done_date'] - register_date).days + + # 序章全完成→购课天数 + all5_to_pur = None + if status['all_five_done_date']: + all5_to_pur = (purchase_date - status['all_five_done_date']).days + + records.append({ + 'order_id': o['order_id'], + 'account_id': account_id, + 'level': level, + 'register_date': register_date, + 'purchase_date': purchase_date, + 'reg_to_pur': reg_to_pur, + # 序章进度 + 'total_lessons_done': status['total_done'], + 'lessons_done_before_purchase': status['done_before_count'], + 'lessons_done_after_purchase': status['done_after_count'], + 
'progress_at_purchase': status['progress_at_purchase'], + 'lessons_at_purchase': ','.join(status['lessons_done_at_purchase']), + # 五节全完成 + 'all_five_done': status['all_five_done'], + 'all_five_done_date': status['all_five_done_date'], + 'reg_to_all5': reg_to_all5, + 'all5_to_pur': all5_to_pur, + }) + +print(f" 分析记录数: {len(records)}") + +# ========== Step 6: 统计分析 ========== +print(f"\n{'='*80}") +print("【总体概览】") +print(f"{'='*80}") + +total_records = len(records) +all5_complete = [r for r in records if r['all_five_done']] +all5_incomplete = [r for r in records if not r['all_five_done']] +l1_records = [r for r in records if r['level'] == 'L1'] +l2_records = [r for r in records if r['level'] == 'L2'] + +print(f"总记录数(订单维度): {total_records}") +print(f" - L1 订单: {len(l1_records)}") +print(f" - L2 订单: {len(l2_records)}") +print(f" 序章五节全完成: {len(all5_complete)}") +print(f" 序章五节未全完成: {len(all5_incomplete)}") + +# ========== 购课时序章进度分布 ========== +print(f"\n{'='*80}") +print("【核心指标:购课时序章完成进度】") +print(f"{'='*80}") + +for level in ['L1', 'L2', 'ALL']: + if level == 'ALL': + data = records + else: + data = [r for r in records if r['level'] == level] + + if not data: + continue + + print(f"\n[{level}] 购课时完成了序章第几节:") + progress_dist = Counter(r['progress_at_purchase'] for r in data) + for p in range(6): # 0到5 + cnt = progress_dist.get(p, 0) + pct = cnt / len(data) * 100 + label = f" 完成{p}节(共5节)" if p > 0 else " 完成0节(未开始)" + bar = '█' * int(pct) + print(f" {label:24s}: {cnt:4d} ({pct:5.1f}%) {bar}") + + # 具体到哪几节 + print(f"\n[{level}] 购课时已完成哪些课时的分布:") + lesson_dist = Counter() + for r in data: + if r['progress_at_purchase'] == 0: + lesson_dist['(未完成任何节)'] += 1 + else: + lesson_dist[r['lessons_at_purchase']] += 1 + for lesson_combo, cnt in lesson_dist.most_common(15): + pct = cnt / len(data) * 100 + label = f" {lesson_combo}" + print(f" {label:24s}: {cnt:4d} ({pct:5.1f}%)") + +# ========== 购课后完成的情况 ========== +print(f"\n{'='*80}") +print("【购课后继续完成序章的情况】") +print(f"{'='*80}") + +for 
level in ['L1', 'L2', 'ALL']: + if level == 'ALL': + data = records + else: + data = [r for r in records if r['level'] == level] + + if not data: + continue + + print(f"\n[{level}]:") + print(f" 购课时序章已全部完成(5/5): {sum(1 for r in data if r['progress_at_purchase']==5)} ({sum(1 for r in data if r['progress_at_purchase']==5)/len(data)*100:.1f}%)") + print(f" 购课后续完成序章余下课时: {sum(1 for r in data if r['lessons_done_after_purchase']>0)} ({sum(1 for r in data if r['lessons_done_after_purchase']>0)/len(data)*100:.1f}%)") + + # 购课后完成了多少节 + after_dist = Counter(r['lessons_done_after_purchase'] for r in data if r['progress_at_purchase'] < 5) + if after_dist: + print(f" 购课时未完成的用户,购课后又完成节数分布:") + for n in sorted(after_dist.keys()): + cnt = after_dist[n] + pct = cnt / sum(1 for r in data if r['progress_at_purchase'] < 5) * 100 + print(f" 又完成{n}节: {cnt} ({pct:.1f}%)") + +# ========== 注册→购课天数 ========== +def percentile_vals(vals): + if not vals: + return {} + s = sorted(vals) + return { + 'P10': s[int(len(s)*0.10)], + 'P25': s[int(len(s)*0.25)], + 'P50': s[int(len(s)*0.50)], + 'P75': s[int(len(s)*0.75)], + 'P90': s[int(len(s)*0.90)], + 'mean': sum(s)/len(s), + 'count': len(s), + } + +print(f"\n{'='*80}") +print("【注册→购课 天数统计】") +print(f"{'='*80}") + +for level in ['L1', 'L2', 'ALL']: + if level == 'ALL': + data = records + else: + data = [r for r in records if r['level'] == level] + + stat = percentile_vals([r['reg_to_pur'] for r in data]) + if stat: + print(f"\n[{level}] 注册→购课:") + print(f" 样本: {stat['count']}, 中位数: {stat['P50']}天, 均值: {stat['mean']:.1f}天") + print(f" P10={stat['P10']}天 | P25={stat['P25']}天 | P75={stat['P75']}天 | P90={stat['P90']}天") + +# ========== 序章五节全完成的用户统计 ========== +print(f"\n{'='*80}") +print("【序章五节全完成的用户(限定已完成5/5的用户)】") +print(f"{'='*80}") + +all5_l1 = [r for r in all5_complete if r['level']=='L1'] +all5_l2 = [r for r in all5_complete if r['level']=='L2'] + +print(f"\n五节全完成用户: {len(all5_complete)}") +print(f" L1: {len(all5_l1)}") +print(f" L2: 
{len(all5_l2)}") + +# 注册→全完成 +for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]: + vals = [r['reg_to_all5'] for r in data if r['reg_to_all5'] is not None] + stat = percentile_vals(vals) + if stat: + print(f"\n[{name}] 注册→序章五节全完成:") + print(f" 样本: {stat['count']}, 中位数: {stat['P50']}天, 均值: {stat['mean']:.1f}天") + print(f" P10={stat['P10']}天 | P25={stat['P25']}天 | P75={stat['P75']}天 | P90={stat['P90']}天") + +# 全完成→购课 +for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]: + vals = [r['all5_to_pur'] for r in data if r['all5_to_pur'] is not None] + stat = percentile_vals(vals) + if stat: + print(f"\n[{name}] 序章五节全完成→购课:") + print(f" 样本: {stat['count']}, 中位数: {stat['P50']}天, 均值: {stat['mean']:.1f}天") + print(f" P10={stat['P10']}天 | P25={stat['P25']}天 | P75={stat['P75']}天 | P90={stat['P90']}天") + +# 全完成→购课时间段分布 +print(f"\n【五节全完成→购课 时间段分布】") +print(f"{'='*60}") +buckets = { + '负数(先购课后完课)': (-99999, -1), + '0天(当天)': (0, 0), + '1~3天': (1, 3), + '4~7天': (4, 7), + '8~14天': (8, 14), + '15~30天': (15, 30), + '31~60天': (31, 60), + '61~90天': (61, 90), + '90天以上': (91, 99999), +} +for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]: + vals = [r['all5_to_pur'] for r in data if r['all5_to_pur'] is not None] + if not vals: + continue + print(f"\n[{name}]:") + for label, (lo, hi) in buckets.items(): + cnt = sum(1 for v in vals if lo <= v <= hi) + pct = cnt / len(vals) * 100 + bar = '█' * int(pct / 2) + print(f" {label:24s}: {cnt:4d} ({pct:5.1f}%) {bar}") + +# ========== 时序关系 ========== +print(f"\n{'='*80}") +print("【五节全完成的用户:序章 vs 购课 时序关系】") +print(f"{'='*80}") + +for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]: + pro_before = sum(1 for r in data if r['all5_to_pur'] is not None and r['all5_to_pur'] >= 0) + pro_after = sum(1 for r in data if r['all5_to_pur'] is not None and r['all5_to_pur'] < 0) + pro_same = sum(1 for r in data if r['all5_to_pur'] == 0) + print(f"\n[{name}]:") + 
print(f" 序章全部完成在购课之前/当天: {pro_before} ({pro_before/max(len(data), 1)*100:.1f}%)") + print(f" 其中当天: {pro_same} ({pro_same/max(len(data), 1)*100:.1f}%)") + print(f" 序章全部完成在购课之后: {pro_after} ({pro_after/max(len(data), 1)*100:.1f}%)") + +# ========== 导出CSV ========== +output_path = '/root/.openclaw/workspace/output/endor_purchase_analysis_v2.csv' +with open(output_path, 'w', newline='', encoding='utf-8-sig') as f: + writer = csv.writer(f) + writer.writerow([ + 'order_id', 'account_id', '等级', '注册日期', '购课日期', + '注册→购课(天)', + '序章完成节数(总计)', '购课前进度(节)', '购课后完成(节)', + '购课时已完成', '五节全完成', '五节全完成日期', + '注册→全完成(天)', '全完成→购课(天)' + ]) + for r in sorted(records, key=lambda x: x['purchase_date']): + writer.writerow([ + r['order_id'], r['account_id'], r['level'], + r['register_date'].isoformat() if r['register_date'] else '', + r['purchase_date'].isoformat() if r['purchase_date'] else '', + r['reg_to_pur'], + r['total_lessons_done'], r['progress_at_purchase'], + r['lessons_done_after_purchase'], + r['lessons_at_purchase'], + '是' if r['all_five_done'] else '否', + r['all_five_done_date'].isoformat() if r['all_five_done_date'] else '', + r['reg_to_all5'] if r['reg_to_all5'] is not None else '', + r['all5_to_pur'] if r['all5_to_pur'] is not None else '', + ]) + +print(f"\n{'='*80}") +print(f"详细数据已导出: {output_path}") +print(f"{'='*80}") + +cur.close() +conn.close() +print("\n分析完成!") diff --git a/scripts/endor_purchase_analysis_v3.py b/scripts/endor_purchase_analysis_v3.py new file mode 100644 index 0000000..aba8df6 --- /dev/null +++ b/scripts/endor_purchase_analysis_v3.py @@ -0,0 +1,370 @@ +#!/usr/bin/env python3 +""" +端内购买用户行为分析 V3 - 详细交叉统计 +""" +import psycopg2 +from datetime import datetime, date +from collections import defaultdict, Counter +import csv +import os + +PG_PASS = os.environ['PG_ONLINE_PASSWORD'] +conn = psycopg2.connect( + host="bj-postgres-16pob4sg.sql.tencentcdb.com", port=28591, + user="ai_member", password=PG_PASS, database="vala_bi" +) +cur = conn.cursor() + 
+L1_U0_LESSONS = {343: 'L01', 344: 'L02', 345: 'L03', 346: 'L04', 348: 'L05'} +L2_U0_LESSONS = {55: 'L01', 56: 'L02', 57: 'L03', 58: 'L04', 59: 'L05'} +LEVEL_MAP = {57:'L1',60:'L1',63:'L1', 31:'L2',32:'L2',33:'L2',54:'L2'} + +print("=" * 80) +print("端内购买用户行为分析 V3 - 详细交叉统计") +print("=" * 80) + +# ============ 查询数据 ============ +print("\n[数据加载]") + +cur.execute(""" + SELECT o.id, o.account_id, o.goods_id, o.pay_success_date + FROM bi_vala_order o + JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1 + WHERE o.key_from = 'app-active-h5-0-0' AND o.order_status IN (3, 4) AND o.pay_success_date IS NOT NULL + ORDER BY o.pay_success_date +""") +orders = [] +for row in cur.fetchall(): + lv = LEVEL_MAP.get(row[2]) + if not lv: continue + pd_ = row[3]; orders.append({'order_id':row[0],'account_id':row[1],'goods_id':row[2],'purchase_date':pd_.date() if isinstance(pd_,datetime) else pd_,'level':lv}) +print(f" 订单: {len(orders)}") + +all_accounts = list(set(o['account_id'] for o in orders)) +register_map = {} +for i in range(0,len(all_accounts),500): + cur.execute("SELECT id,created_at FROM bi_vala_app_account WHERE id IN %s",(tuple(all_accounts[i:i+500]),)) + for r in cur.fetchall(): + d=r[1];register_map[r[0]]=d.date() if isinstance(d,datetime) else d +print(f" 注册时间: {len(register_map)}") + +char_map = defaultdict(list) +for i in range(0,len(all_accounts),500): + cur.execute("SELECT account_id,id FROM bi_vala_app_character WHERE account_id IN %s",(tuple(all_accounts[i:i+500]),)) + for r in cur.fetchall(): char_map[r[0]].append(r[1]) +print(f" 角色映射: {sum(1 for k,v in char_map.items() if v)} 用户") + +all_user_ids = set() +for acct in all_accounts: + for uid in char_map.get(acct,[]): all_user_ids.add(uid) + +def query_lesson_completion(user_ids, chapter_ids_dict): + result = defaultdict(dict) + if not user_ids: return result + cids = list(chapter_ids_dict.keys()) + union = [] + for t in range(8): + union.append(f""" + SELECT user_id, chapter_id, 
MIN(created_at) as fd + FROM bi_user_chapter_play_record_{t} + WHERE chapter_id IN ({','.join(map(str,cids))}) + AND play_status=1 AND user_id IN %(uids)s + GROUP BY user_id, chapter_id + """) + sql = f"SELECT user_id, chapter_id, MIN(fd) FROM ({' UNION ALL '.join(union)}) sub GROUP BY user_id, chapter_id" + for i in range(0,len(user_ids),2000): + cur.execute(sql,{'uids':tuple(user_ids[i:i+2000])}) + for r in cur.fetchall(): result[r[0]][r[1]]=r[2].date() if isinstance(r[2],datetime) else r[2] + return result + +l1_data = query_lesson_completion(list(all_user_ids), L1_U0_LESSONS) +l2_data = query_lesson_completion(list(all_user_ids), L2_U0_LESSONS) +print(f" L1序章完成: {len(l1_data)} user_ids, L2序章完成: {len(l2_data)} user_ids") + +# ============ 逐订单分析 ============ +records = [] +for o in orders: + lesson_map = L1_U0_LESSONS if o['level']=='L1' else L2_U0_LESSONS + lesson_data = l1_data if o['level']=='L1' else l2_data + reg_d = register_map.get(o['account_id']) + if not reg_d: continue + pur_d = o['purchase_date'] + + # 汇总所有角色的序章完成情况 + merged = {} + for uid in char_map.get(o['account_id'],[]): + if uid not in lesson_data: continue + for cid, dt in lesson_data[uid].items(): + if cid not in merged or dt < merged[cid]: merged[cid]=dt + + # 每节课的完成日期和距注册天数 + lesson_dates = {} + for cid, dt in merged.items(): + lesson_dates[lesson_map[cid]] = {'date': dt, 'days_from_reg': (dt - reg_d).days} + + # 购课前完成哪些 + done_before = {ln: info for ln, info in lesson_dates.items() if info['date'] <= pur_d} + done_after = {ln: info for ln, info in lesson_dates.items() if info['date'] > pur_d} + + all5_done = len(lesson_dates) == 5 + all5_date = max(info['date'] for info in lesson_dates.values()) if all5_done else None + all5_to_pur = (pur_d - all5_date).days if all5_date else None + reg_to_all5 = (all5_date - reg_d).days if all5_date else None + + # 购课时进度:完成了几节,已完成哪些节 + progress = len(done_before) + lessons_done = sorted(done_before.keys(), key=lambda x: int(x[1:])) + + # 购课落在哪两节之间 + 
# e.g. 完成了 L01 L02,则购课时间在 L02完成后 → "L02之后" + if progress == 0: + purchase_position = "L01之前" + elif progress == 5: + purchase_position = "L05之后" + else: + purchase_position = f"{lessons_done[-1]}之后(L{int(lessons_done[-1][1:])+1:02d}之前)" + + records.append({ + 'order_id': o['order_id'], 'account_id': o['account_id'], + 'level': o['level'], 'register_date': reg_d, 'purchase_date': pur_d, + 'reg_to_pur': (pur_d - reg_d).days, + 'progress': progress, 'lessons_done': lessons_done, + 'done_before': done_before, 'done_after': done_after, + 'all5_done': all5_done, 'all5_date': all5_date, + 'reg_to_all5': reg_to_all5, 'all5_to_pur': all5_to_pur, + 'purchase_position': purchase_position, + 'lesson_dates': lesson_dates, + }) + +print(f" 分析记录: {len(records)}") +l1_recs = [r for r in records if r['level']=='L1'] +l2_recs = [r for r in records if r['level']=='L2'] +all5_recs = [r for r in records if r['all5_done']] + +# ==================================================================== +# 交叉统计1: 购课时完成节数 × 注册→购课天数 +# ==================================================================== +print(f"\n{'='*80}") +print("【交叉1】购课时完成节数 × 注册→购课天数") +print(f"{'='*80}") + +for level, data in [('L1',l1_recs),('L2',l2_recs),('ALL',records)]: + print(f"\n[{level}]") + # 构建交叉表 + day_ranges = [(0,1),(2,3),(4,7),(8,14),(15,30),(31,60),(61,999)] + print(f" {'购课进度':<20}", end='') + for lo, hi in day_ranges: + label = f"{lo}天" if lo==hi else f"{lo}-{hi}天" + print(f"{label:>10}", end='') + print(f" {'合计':>8}") + + for prog in range(6): + subset = [r for r in data if r['progress']==prog] + label = f" 完成{prog}节" if prog>0 else f" 完成0节" + print(f"{label:<20}", end='') + total = 0 + for lo, hi in day_ranges: + cnt = sum(1 for r in subset if lo <= r['reg_to_pur'] <= hi) + print(f"{cnt:>10}", end='') + total += cnt + print(f" {total:>8}") + +# ==================================================================== +# 交叉统计2: 购课时完成节数 × 完课→购课天数(仅五节全完成用户) +# 
==================================================================== +print(f"\n{'='*80}") +print("【交叉2】购课时完成节数 × 全完成→购课天数(仅五节全完成用户)") +print(f"{'='*80}") + +all5_day_ranges = [(-999,-1),(0,0),(1,3),(4,7),(8,14),(15,30),(31,60),(61,999)] + +for level, data in [('L1',[r for r in l1_recs if r['all5_done']]), + ('L2',[r for r in l2_recs if r['all5_done']]), + ('ALL',all5_recs)]: + print(f"\n[{level}]") + print(f" {'购课进度':<20}", end='') + for lo, hi in all5_day_ranges: + if lo==-999: label="先购课后完课" + elif lo==hi: label=f"{lo}天" + else: label=f"{lo}-{hi}天" + print(f"{label:>14}", end='') + print(f" {'合计':>8}") + + for prog in range(6): + subset = [r for r in data if r['progress']==prog] + label = f" 完成{prog}节" if prog>0 else f" 完成0节" + print(f"{label:<20}", end='') + total = 0 + for lo, hi in all5_day_ranges: + cnt = sum(1 for r in subset if r['all5_to_pur'] is not None and lo <= r['all5_to_pur'] <= hi) + print(f"{cnt:>14}", end='') + total += cnt + print(f" {total:>8}") + +# ==================================================================== +# 交叉统计3: 购课落在序章各节之间的分布 +# ==================================================================== +print(f"\n{'='*80}") +print("【交叉3】购课时机:落在序章第几节之后") +print(f"{'='*80}") + +for level, data in [('L1',l1_recs),('L2',l2_recs),('ALL',records)]: + dist = Counter(r['purchase_position'] for r in data) + print(f"\n[{level}]") + positions = ['L01之前','L01之后(L02之前)','L02之后(L03之前)','L03之后(L04之前)','L04之后(L05之前)','L05之后'] + for pos in positions: + cnt = dist.get(pos, 0) + pct = cnt/len(data)*100 + bar = '█' * int(pct) + print(f" {pos:24s}: {cnt:4d} ({pct:5.1f}%) {bar}") + +# ==================================================================== +# 交叉统计4: 逐节课完成时间分布 +# ==================================================================== +print(f"\n{'='*80}") +print("【交叉4】序章每节课:从注册到完成的天数分布") +print(f"{'='*80}") + +lesson_days_ranges = [(0,0),(1,1),(2,3),(4,7),(8,14),(15,30),(31,999)] +lesson_order = ['L01','L02','L03','L04','L05'] + +for level, data 
in [('L1',l1_recs),('L2',l2_recs),('ALL',records)]:
+    print(f"\n[{level}] ({len(data)} records)")
+    print(f" {'Lesson':<8}", end='')
+    for lo, hi in lesson_days_ranges:
+        label = f"{lo}d" if lo == hi else f"{lo}-{hi}d"
+        # Width 12 matches the "count(percent%)" cells printed in the data rows.
+        print(f"{label:>12}", end='')
+    print(f" {'Completed':>10} {'Median':>8} {'Mean':>8}")
+
+    for ln in lesson_order:
+        vals = [r['lesson_dates'][ln]['days_from_reg']
+                for r in data if ln in r['lesson_dates']]
+        if not vals:
+            print(f" {ln:<8} {'(no data)':>60}")
+            continue
+        sv = sorted(vals)
+        med = sv[len(sv)//2]  # upper median when the count is even
+        avg = sum(vals) / len(vals)
+        print(f" {ln:<8}", end='')
+        for lo, hi in lesson_days_ranges:
+            cnt = sum(1 for v in vals if lo <= v <= hi)
+            share = cnt / len(vals) * 100
+            print(f"{cnt:>5}({share:4.1f}%)", end='')
+        print(f" {len(vals):>10} {med:>8}d {avg:>8.1f}d")
+
+# ====================================================================
+# Cross-tab 5: lessons completed at purchase × L1/L2 × whether the rest were completed later
+# ====================================================================
+print(f"\n{'='*80}")
+print("[Cross 5] Users who had not finished the prologue at purchase: did they finish the rest afterwards?")
+print(f"{'='*80}")
+
+not_all5_at_purchase = [r for r in records if r['progress'] < 5]
+for level, data in [('L1',[r for r in l1_recs if r['progress'] < 5]),
+                    ('L2',[r for r in l2_recs if r['progress'] < 5]),
+                    ('ALL',not_all5_at_purchase)]:
+    if not data:
+        continue
+    print(f"\n[{level}] Not finished at purchase: {len(data)} users")
+    for prog in range(5):
+        subset = [r for r in data if r['progress'] == prog]
+        if not subset:
+            continue
+        after_done_dist = Counter(r['all5_done'] for r in subset)
+        all5_cnt = after_done_dist.get(True, 0)
+        not5_cnt = after_done_dist.get(False, 0)
+        print(f" {prog} done at purchase: {len(subset)} users → eventually finished all 5: {all5_cnt}, still unfinished: {not5_cnt}")
+
+# ====================================================================
+# Cross-tab 6: detail on users who purchased before finishing the prologue
+# ====================================================================
+print(f"\n{'='*80}")
+# The user count is printed from the data below rather than hard-coded in the title.
+print("[Cross 6] Users who purchased first and finished the prologue afterwards: detail")
+print(f"{'='*80}")
+
+buy_before_finish = [r for r in all5_recs if r['all5_to_pur']
is not None and r['all5_to_pur'] < 0]
+print(f"\n{len(buy_before_finish)} users finished all five prologue lessons only after purchasing\n")
+print(f" {'Level':<6} {'At purchase':<14} {'Purchased':<14} {'All 5 done':<14} {'Buy→done':>11} {'Reg→buy':>11} {'Reg→done':>11}")
+for r in sorted(buy_before_finish, key=lambda x: x['all5_to_pur']):
+    print(f" {r['level']:<6} {str(r['progress'])+' done':<14} {str(r['purchase_date']):<14} {str(r['all5_date']):<14} {abs(r['all5_to_pur']):>10}d {r['reg_to_pur']:>10}d {r['reg_to_all5']:>10}d")
+
+# ====================================================================
+# Cross-tab 7: monthly trend of lessons completed at purchase
+# ====================================================================
+print(f"\n{'='*80}")
+print("[Cross 7] Monthly trend: share of orders with 5/5 lessons completed at purchase")
+print(f"{'='*80}")
+
+monthly = defaultdict(lambda: {'total':0,'all5':0,'partial':0})
+for r in records:
+    m = r['purchase_date'].strftime('%Y-%m')
+    monthly[m]['total'] += 1
+    if r['progress'] == 5:
+        monthly[m]['all5'] += 1
+    else:
+        monthly[m]['partial'] += 1
+
+for m in sorted(monthly.keys()):
+    d = monthly[m]
+    share = d['all5'] / d['total'] * 100
+    print(f" {m}: total {d['total']:>4} orders | all 5 done at purchase={d['all5']:>3} ({share:5.1f}%) | not finished={d['partial']:>2}")
+
+# ====================================================================
+# Cross-tab 8: registration → purchase days, stratified by progress at purchase
+# ====================================================================
+print(f"\n{'='*80}")
+print("[Cross 8] By progress at purchase: registration → purchase days per group")
+print(f"{'='*80}")
+
+def pct(vals, p):
+    # Simple rank-based percentile: the element at index floor(n * p / 100)
+    # of the sorted values, clamped to the last element.
+    if not vals:
+        return '-'
+    s = sorted(vals)
+    return s[min(int(len(s) * p / 100), len(s) - 1)]
+
+for level, data in [('L1',l1_recs),('L2',l2_recs),('ALL',records)]:
+    print(f"\n[{level}]")
+    print(f" {'At purchase':<16} {'Users':>6} {'P10':>6} {'P25':>6} {'Median':>6} {'P75':>6} {'P90':>6} {'Mean':>8}")
+    for prog in range(6):
+        subset = [r for r in data if r['progress'] == prog]
+        if not subset:
+            continue
+        vals = [r['reg_to_pur'] for r in subset]
+        print(f" {str(prog)+' done':<16} {len(subset):>6} "
+              f"{pct(vals,10):>6} {pct(vals,25):>6} {pct(vals,50):>6} "
+              f"{pct(vals,75):>6} {pct(vals,90):>6} 
{sum(vals)/len(vals):>8.1f}")
+
+# ====================================================================
+# Export CSV (with per-lesson prologue completion detail)
+# ====================================================================
+output_path = '/root/.openclaw/workspace/output/endor_purchase_analysis_v3.csv'
+# utf-8-sig writes a BOM so spreadsheet tools detect the encoding.
+with open(output_path, 'w', newline='', encoding='utf-8-sig') as f:
+    writer = csv.writer(f)
+    writer.writerow([
+        'order_id','account_id','level','registration date','purchase date','registration→purchase (days)',
+        'prologue lessons done at purchase','lessons done at purchase',
+        'L01 done date','L01 days after registration','L02 done date','L02 days after registration',
+        'L03 done date','L03 days after registration','L04 done date','L04 days after registration',
+        'L05 done date','L05 days after registration',
+        'all five done','all five done date','registration→all done (days)','all done→purchase (days)',
+        'purchase timing'
+    ])
+    for r in sorted(records, key=lambda x: x['purchase_date']):
+        row = [r['order_id'], r['account_id'], r['level'],
+               r['register_date'].isoformat(), r['purchase_date'].isoformat(), r['reg_to_pur'],
+               r['progress'], ','.join(r['lessons_done'])]
+        for ln in lesson_order:
+            # Lessons the user never completed leave both cells empty.
+            info = r['lesson_dates'].get(ln, {})
+            row.append(info['date'].isoformat() if info.get('date') else '')
+            row.append(info.get('days_from_reg', ''))
+        row.append('yes' if r['all5_done'] else 'no')
+        row.append(r['all5_date'].isoformat() if r['all5_date'] else '')
+        row.append(r['reg_to_all5'] if r['reg_to_all5'] is not None else '')
+        row.append(r['all5_to_pur'] if r['all5_to_pur'] is not None else '')
+        row.append(r['purchase_position'])
+        writer.writerow(row)
+
+print(f"\n{'='*80}")
+print(f"Exported: {output_path}")
+print(f"{'='*80}")
+cur.close()
+conn.close()
+print("Analysis complete!")