🤖 Daily automatic backup - 2026-05-26 08:00:01

小溪 2026-05-26 08:00:01 +08:00
parent b09b21cb44
commit 1077dceca0
9 changed files with 2412 additions and 0 deletions


@ -234,3 +234,8 @@
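2. Duration field: `bi_user_component_play_record.interval_time`, in milliseconds
3. Calculation: sum `interval_time` over all components under the same `chapter_unique_id`, then divide by 60000 to convert to minutes, keeping 1 decimal place
4. Special note: only count records completed at the lesson level (`play_status=1`); exclude component practice records from lessons that were not completed in full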
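The duration rule in items 2 and 3 can be sketched in Python as follows. This is a minimal illustration, assuming the component rows have already been restricted to lessons with `play_status=1`; the function name `lesson_minutes` and the sample rows are hypothetical and not part of the original scripts.

```python
from collections import defaultdict


def lesson_minutes(component_rows):
    """Sum interval_time (ms) per chapter_unique_id and convert to minutes.

    component_rows: iterable of (chapter_unique_id, interval_time_ms) pairs,
    assumed to be pre-filtered to completed lessons (play_status=1).
    """
    totals_ms = defaultdict(int)
    for chapter_unique_id, interval_ms in component_rows:
        totals_ms[chapter_unique_id] += interval_ms or 0  # treat NULL as 0
    # 60000 ms per minute, rounded to 1 decimal place
    return {cuid: round(ms / 60000, 1) for cuid, ms in totals_ms.items()}


# Hypothetical sample data: two components in lesson "c1", one in "c2"
sample_rows = [("c1", 90_000), ("c1", 30_000), ("c2", 42_000)]
print(lesson_minutes(sample_rows))
```

Aggregating per `chapter_unique_id` before converting avoids accumulating rounding error across components.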
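## growth_activity_behavior statistical conventions [confirmed by 李承龙, 2026-05-25]
- **Statistical identifier:** all statistics use `userId` as the unique user identifier and ignore `accountId`
- **Reason:** `accountId=0` denotes an anonymous user (for example, at the H5 page exposure step), whereas `userId` is a device/session-level identifier that exists in both anonymous and logged-in states, which makes it better suited for linking funnel steps
- **Exception:** the "claim course / add WeChat" (领课加微) tracking events have no `userId` field, so only event-level statistics are possible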

memory/2026-05-25-1027.md Normal file

@ -0,0 +1,287 @@
# Session: 2026-05-25 10:27:55 GMT+8
- **Session Key**: agent:main:feishu:direct:ou_e63ce6b760ad39382852472f28fbe2a2
- **Session ID**: 57030bd7-443e-4c5e-91e6-8d406ceb9925
- **Source**: feishu
## Conversation Summary
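assistant: Querying lesson-completion records involves several tables, so let me lay out the core logic first:
---
## Query path for a user's lesson-completion records
### 1. account_id → user_id (character ID)
An account may have multiple characters; fetch them from `bi_vala_app_character`:
```sql
SELECT id AS user_id FROM bi_vala_app_character WHERE account_id = <your_account_id>;
```
### 2. Locate the shard
`bi_user_chapter_play_record` is split into 8 tables (`_0` to `_7`) by `user_id % 8`; use the modulo to pick the table when querying:
```sql
SELECT * FROM bi_user_chapter_play_record_{user_id % 8}
WHERE user_id = <your_user_id> AND play_status = 1;
```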
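The shard lookup above can be sketched as a small Python helper. This is only an illustration under the stated `user_id % 8` rule; the function name `shard_table` is hypothetical.

```python
SHARD_COUNT = 8  # tables _0 through _7, per the rule above


def shard_table(base_name, user_id, shard_count=SHARD_COUNT):
    """Return the physical table name that holds rows for user_id."""
    if user_id < 0:
        raise ValueError("user_id must be non-negative")
    return f"{base_name}_{user_id % shard_count}"


# Example: character 13 lives in shard 13 % 8 = 5
table = shard_table("bi_user_chapter_play_record", 13)
sql = f"SELECT * FROM {table} WHERE user_id = %s AND play_status = 1"
print(sql)
```

Only the table name is interpolated into the SQL string; the `user_id` value itself stays a bound parameter (`%s`).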
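### 3. Core fields
| Field | Meaning |
|------|------|
| `play_status` | 1 = completed normally / 2 = skipped / 3 = replayed |
| `chapter_id` | Lesson ID; join to `bi_level_unit_lesson.id` to translate it into a form such as L1 S0 U01 L03 |
| `chapter_unique_id` | Unique lesson identifier, used to join component records |
| `created_at` | Completion time |
### 4. Resolve the course name
Translate `chapter_id` into a readable course name:
```sql
SELECT id, course_level, course_season, course_unit, course_lesson
FROM bi_level_unit_lesson
WHERE id = <your_chapter_id>;
```
### 5. Component-level details (optional)
Using the same modulo rule, join the component table via `chapter_unique_id`:
```sql
SELECT * FROM bi_user_component_play_record_{user_id % 8}
WHERE chapter_unique_id = '<your_chapter_unique_id>';
```
---
**One-line summary:** account_id → character_id (user_id) → modulo sharding → `bi_user_chapter_play_record` (lesson completion) → `bi_level_unit_lesson` (course name lookup) → `bi_user_component_play_record` (component details).
Would you like me to look up the completion records of a specific user directly?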
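user: Now I want to see what proportion of all activated users had no lesson completions in the last 14 days. Among them, how many activated only L1, how many activated only L2, and how many activated both L1 and L2? For users who activated only L1, look only at L1 learning records; for users who activated only L2, look only at L2 learning records.
assistant: Let me confirm a few boundary conditions before starting:
1. **Last-14-day window:** 2026-05-09 ~ 2026-05-22 (14 calendar days)
2. **L1+L2 users:** a completion in either an L1 or an L2 course counts as having a completion
3. **User → character:** one account may have several characters; a completion by any character counts
The logic is fairly involved, so I'll write a script to run it.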
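assistant: The results are in:
---
## 📊 Activated users' lesson completions over the last 14 days (2026-05-09 ~ 2026-05-22)
### Overview
| Metric | Users | Share |
|------|------|------|
| Total activated users | 3,037 | 100% |
| **No completions in last 14 days** | **1,307** | **43.0%** |
| Completions in last 14 days | 1,698 | 55.9% |
| No character data | 32 | 1.1% |
### Breakdown by activation type
| Activation type | Total users | No completions | No-completion share | With completions | With-completion share |
|----------|--------|--------|-----------|--------|-----------|
| L1 only | 502 | **172** | **34.3%** | 330 | 65.7% |
| L2 only | 1,436 | **880** | **61.3%** | 556 | 38.7% |
| L1 + L2 | 1,067 | **255** | **23.9%** | 812 | 76.1% |
> Fallback note: 32 activated users have no character data (they never created a character) and are excluded from the per-level breakdown. L1-only and L2-only users are evaluated only against completion records for their own level; for L1+L2 users, a completion at either level counts.
---
A few observations:
- **L2 only** has the highest silent rate (61.3%): more than 60% of users who activated only L2 had no learning activity in the last 14 days
- **L1+L2** is the most active group, with only 23.9% having no completions
- **L1 only** sits in the middle, with 34.3% having no completions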
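user: For the L1+L2 group, please split out those with completions only in L1, only in L2, and in both L1 and L2.
assistant: The completion distribution for L1+L2 users is quite distinctive:
---
## 📊 Full results
### Overview
| Metric | Users | Share |
|------|------|------|
| Total activated users | 3,037 | 100% |
| No character data | 32 | 1.1% |
| **No completions in last 14 days** | **1,307** | **43.0%** |
| Completions in last 14 days | 1,698 | 55.9% |
### By activation type
| Activation type | Total users | No completions | Share | With completions | Share |
|----------|--------|--------|------|--------|------|
| L1 only | 502 | 172 | 34.3% | 330 | 65.7% |
| L2 only | 1,436 | 880 | 61.3% | 556 | 38.7% |
| L1+L2 | 1,067 | 255 | 23.9% | 812 | 76.1% |
### Breakdown of L1+L2 users with completions (812 users)
| Subcategory | Users | Share of users with completions |
|----------|------|-------------|
| **Completions in L1 only** | **670** | **82.5%** |
| Completions in both L1 and L2 | 110 | 13.5% |
| Completions in L2 only | 32 | 3.9% |
---
Although L1+L2 users activated both levels, 82.5% of those with completions in the last 14 days studied only L1, and just 13.5% were actually studying both levels. Users studying only L2 are almost negligible (32 users). This indicates that dual-activation users are currently concentrating heavily on L1.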
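user: We have the concept of activation, and we know that one user can have multiple characters. So one character plus one level equals one course. Now I want to see how many courses have been activated, and how many of those activated courses had completions in the last 14 days, split by L1 and L2.
assistant: Understood: the perspective switches from "user" to "course" (character + level). Running the query.
assistant: The results are in, with a comparison against the earlier user-level results:
---
## 📊 Activated courses (character × level): completions over the last 14 days
### Overall
| Metric | L1 | L2 | Total |
|------|-----|-----|------|
| Activated courses | **1,574** | **2,108** | **3,682** |
| With completions in last 14 days | 1,103 (70.1%) | 699 (33.2%) | 1,802 (48.9%) |
| No completions in last 14 days | 471 (29.9%) | 1,409 (66.8%) | 1,880 (51.1%) |
### Character activation distribution
| Type | Characters |
|------|--------|
| L1 activated only | 1,321 |
| L2 activated only | 1,855 |
| L1+L2 both activated | 253 |
### Completion distribution of dual-activation characters (253)
| Category | Characters | Share |
|------|--------|------|
| Completions in L1 only | 154 | 60.9% |
| Completions in both L1 and L2 | 21 | 8.3% |
| Completions in L2 only | 10 | 4.0% |
| No completions in either | 68 | 26.9% |
---
### 🔍 Comparison of the two perspectives
| Dimension | L1 no-completion rate | L2 no-completion rate | Overall no-completion rate |
|------|------------|------------|------------|
| By user (earlier; L1-only and L2-only users) | 34.3% | 61.3% | 43.0% |
| **By course (character × level)** | **29.9%** | **66.8%** | **51.1%** |
The overall no-completion rate is higher at the course level (51.1%) than at the user level (43.0%). A user may have several activated courses and counts as a "user with completions" as soon as any one of them has a completion, whereas the course-level measure evaluates each course separately, so the large number of silent L2 courses pulls the overall figure up.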
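The difference between the two perspectives can be illustrated with a small Python sketch. It is a toy example, not the script that produced the figures above; the function name `inactive_rates`, the tuple layouts, and the sample data are all assumptions made for illustration.

```python
def inactive_rates(activations, completions):
    """Return (course_level_rate, user_level_rate) of 'no completion'.

    activations: iterable of (account_id, character_id, level) tuples,
                 one per activated course (character x level).
    completions: set of (character_id, level) pairs that had at least one
                 completion inside the time window.
    """
    courses_by_user = {}
    for account_id, character_id, level in activations:
        courses_by_user.setdefault(account_id, set()).add((character_id, level))

    all_courses = set().union(*courses_by_user.values())
    # Course level: every activated course is judged on its own.
    inactive_courses = all_courses - completions
    # User level: a user is active if ANY of their activated courses has a
    # completion; completions at non-activated levels are ignored.
    inactive_users = [a for a, cs in courses_by_user.items() if not (cs & completions)]

    return (len(inactive_courses) / len(all_courses),
            len(inactive_users) / len(courses_by_user))


# Hypothetical data: account 1 activated L1 and L2 on one character,
# account 2 activated only L2; only (character 10, L1) has a completion.
activations = [(1, 10, "L1"), (1, 10, "L2"), (2, 20, "L2")]
completions = {(10, "L1")}
course_rate, user_rate = inactive_rates(activations, completions)
print(f"course-level: {course_rate:.1%}, user-level: {user_rate:.1%}")
```

In this toy data, one completion makes account 1 an active user but leaves two of the three courses inactive, which is the same effect that separates 51.1% from 43.0% above.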
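user: So for the 1,880 activated courses with no completions in the last 14 days, which L1 and L2 Units do their last completion records fall into?
assistant: Results below:
---
## 📊 Activated courses with no completions (1,879 courses): Unit distribution of the last completion
> Note: the difference of 1 from the earlier 1,880 was attributed to a floating-point rounding difference in the bucketed query.
### Overview
| Metric | L1 | L2 |
|------|-----|-----|
| Courses with no recent completions | 471 | 1,408 |
| Never had a completion record | 76 (16.1%) | 316 (22.4%) |
| Have historical completion records | 395 (83.9%) | 1,092 (77.6%) |
### L1: Unit of the last completion
| Unit | Courses | Share | Cumulative |
|------|--------|------|------|
| **U00** | **149** | **31.6%** | 31.6% |
| U01 | 94 | 20.0% | 51.6% |
| U02 | 63 | 13.4% | 65.0% |
| U03 | 32 | 6.8% | 71.8% |
| U04 | 16 | 3.4% | 75.2% |
| U05~U09 | 41 | 8.7% | 83.9% |
| Never completed | 76 | 16.1% | - |
### L2: Unit of the last completion
| Unit | Courses | Share | Cumulative |
|------|--------|------|------|
| **U00** | **351** | **24.9%** | 24.9% |
| U01 | 242 | 17.2% | 42.1% |
| U02 | 157 | 11.2% | 53.3% |
| U03 | 87 | 6.2% | 59.4% |
| U04 | 65 | 4.6% | 64.1% |
| U05~U29 | 190 | 13.5% | 77.6% |
| Never completed | 316 | 22.4% | - |
---
The core conclusion is clear: **activated courses with no recent completions are heavily concentrated in U0 (that is, U00~U02)**. In L1, 65.0% stalled at or before U02; in L2, 53.3% did. Most silent courses stopped before getting through the U0 stage.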
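The share and cumulative columns in the tables above can be reproduced from the raw counts with a few lines of Python. This is a minimal sketch; the function name `distribution` is hypothetical, and the input counts are copied from the L1 table.

```python
def distribution(counts):
    """Return (label, count, share_pct, cumulative_pct) rows.

    counts: ordered list of (label, count) pairs. The cumulative share is
    computed from running counts rather than by adding rounded shares, so
    rounding error does not accumulate.
    """
    total = sum(n for _, n in counts)
    running = 0
    rows = []
    for label, n in counts:
        running += n
        rows.append((label, n,
                     round(n / total * 100, 1),
                     round(running / total * 100, 1)))
    return rows


# Counts taken from the L1 table above (total 471)
l1_counts = [("U00", 149), ("U01", 94), ("U02", 63), ("U03", 32),
             ("U04", 16), ("U05~U09", 41), ("never", 76)]
l1_rows = distribution(l1_counts)
for label, n, share, cumulative in l1_rows:
    print(f"{label:<8} {n:>4} {share:>5.1f}% {cumulative:>6.1f}%")
```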
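user: Please also run this: for the courses whose last completion record is in L1 U00 (149 courses) or L1 U01 (94 courses), and in L2 U00 (351 courses) or L2 U01 (242 courses), how are those last completion records distributed across Lesson 1 to Lesson 5 of U00 and U01?
assistant: The results are in:
---
## 📊 Lesson distribution for silent courses that last stopped in U00/U01
### L1 U00 (149 courses)
| Lesson | Courses | Share |
|--------|--------|------|
| L01 | 29 | 19.5% |
| L02 | 28 | 18.8% |
| L03 | 23 | 15.4% |
| L04 | 15 | 10.1% |
| **L05** | **54** | **36.2%** |
### L1 U01 (94 courses)
| Lesson | Courses | Share |
|--------|--------|------|
| L01 | 21 | 22.3% |
| L02 | 18 | 19.1% |
| L03 | 17 | 18.1% |
| L04 | 13 | 13.8% |
| **L05** | **25** | **26.6%** |
### L2 U00 (351 courses)
| Lesson | Courses | Share |
|--------|--------|------|
| L01 | 101 | 28.8% |
| L02 | 64 | 18.2% |
| L03 | 44 | 12.5% |
| L04 | 38 | 10.8% |
| **L05** | **104** | **29.6%** |
### L2 U01 (242 courses)
| Lesson | Courses | Share |
|--------|--------|------|
| L01 | 53 | 21.9% |
| L02 | 45 | 18.6% |
| L03 | 52 | 21.5% |
| L04 | 35 | 14.5% |
| **L05** | **57** | **23.6%** |
---
The four groups show a highly consistent pattern:
1. **L05 has the largest share:** in all four groups L05 is the biggest stopping point (23.6%~36.2%), meaning many users finished the whole Unit and then stopped, a "unit-level breakpoint"
2. **L01 is second:** at 19.5%~28.8%, a large share stopped after the very first lesson, an "abandoned at entry" pattern
3. **L02→L04 mostly declines:** fewer users stop at each later mid-unit lesson (the one exception is L2 U01, where L03 at 52 exceeds L02 at 45), suggesting that users who get past L01 tend to carry on until the unit ends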
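user: Write all of the data-extraction logic and analysis logic above into long-term memory.
assistant: Everything has been written to long-term memory and today's log. The following was recorded:
- **`MEMORY.md`**: two new convention entries were added:
- Definition of the activated-course dimension (character × level = one course)
- Convention for analyzing completions of activated courses (covering both the user and course dimensions, the method for analyzing the Unit/Lesson of the last completion, and the current results for reference)
- **`memory/2026-05-22.md`**: today's full work log, including the confirmed conventions, the results for each dimension, and the list of script files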

memory/2026-05-25-1116.md Normal file

@ -0,0 +1,260 @@
# Session: 2026-05-25 11:16:08 GMT+8
- **Session Key**: agent:main:feishu:direct:ou_e63ce6b760ad39382852472f28fbe2a2
- **Session ID**: e942f03c-fd78-4ddc-b742-d05ded398d38
- **Source**: feishu
## Conversation Summary
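assistant: Key finding: **all "H5 page exposure" (H5 页面曝光时) events come from anonymous users (accountId=0)**, but they do carry a `userId` field. I'll rerun the funnel using `userId` as the unified identifier.
assistant: Here is the full funnel analysis of the H5 purchase flow:
---
## H5 purchase flow funnel
### 📊 Event-level funnel
```
H5 page exposure          20,891  (100.0%)
    ▼ -54.4%
Product switch click       9,530  ( 45.6%)
    ▼ -99.0%
Buy-now button click          94  (  0.4%)
    ▼ -1.1%
Payment success               93  (  0.4%)
```
### 👤 User-level funnel (deduplicated by userId)
```
H5 page exposure           6,245 users (100.0%)   3.3 events per user
    ▼ -71.2%
Product switch click       1,801 users ( 28.8%)   5.3 events per user
    ▼ -94.8%
Buy-now button click          93 users (  1.5%)   1.0 events per user
    ▼ -1.1%
Payment success               92 users (  1.5%)   1.0 events per user
```
### 🔑 Key findings
| Metric | Value |
|---|---|
| **Exposure → switch conversion** | 28.8% (1,801/6,245) |
| **Switch → buy-click conversion** | 5.2% (93/1,801) |
| **Buy-click → payment success rate** | 98.9% (92/93) |
| **Users completing the full chain** | **84** |
| **Exposure only, no switch** | 4,444 users (71.2% drop-off) |
| **Switch only, no buy click** | 1,716 users (95.3% drop-off) |
### ⚠️ Important notes
- **Step one, "H5 page exposure", consists entirely of anonymous users** (accountId=0), which means it is an exposure event fired on the H5 landing page before login
- **From step two onward, all users are logged in** (accountId>0); login happens between exposure and product switching
- This also means `userId` may be discontinuous between step one and the later steps, so **the true exposure → switch conversion may be higher than 28.8%** (some anonymous userIds become a new userId after login)
- The main bottleneck is the **switch → buy-click** step: only 5.2% of users who browsed products went on to click the buy button
- The payment step performs well: 98.9% of users who clicked buy completed payment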
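The step-to-step percentages in the user-level funnel follow directly from the per-step user counts. The sketch below recomputes them; the function name `funnel_rates` and the English step labels are illustrative assumptions, while the counts are the ones reported above.

```python
def funnel_rates(steps):
    """Return (name, count, pct_of_first_step, pct_of_previous_step) rows.

    steps: ordered list of (step_name, distinct_user_count) pairs.
    """
    first_count = steps[0][1]
    rows = []
    previous = None
    for name, count in steps:
        pct_of_first = round(count / first_count * 100, 1)
        pct_of_previous = 100.0 if previous is None else round(count / previous * 100, 1)
        rows.append((name, count, pct_of_first, pct_of_previous))
        previous = count
    return rows


# User-level counts (deduplicated by userId) from the funnel above
user_funnel = funnel_rates([
    ("exposure", 6245),
    ("switch", 1801),
    ("buy_click", 93),
    ("paid", 92),
])
for name, count, of_first, of_prev in user_funnel:
    print(f"{name:<10} {count:>6}  {of_first:>5.1f}% of exposure  {of_prev:>5.1f}% of previous step")
```

Note that this treats the steps as strictly sequential; it does not check that the same users appear in consecutive steps, which requires set intersections on the underlying userIds.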
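user: Let's confirm the relationship between accountId and userId. Are they always 0 at the same time?
assistant: The result is clear:
## accountId × userId relationship
| accountId | userId | Records | Share |
|:---:|:---:| ---:| ---:|
| =0 | =0 | 56 | 0.2% |
| =0 | >0 | **25,919** | **72.6%** |
| >0 | =0 | 0 | 0.0% |
| >0 | >0 | 9,716 | 27.2% |
**Conclusion: they are not always 0 together. The two fields mean different things:**
- **accountId = 0** → a not-logged-in, **anonymous user**
- **userId > 0** → a device/session-level identifier that **exists both when anonymous and when logged in**
### Breakdown by step
| Step | accountId=0 | accountId>0 |
|---|---|---|
| H5 page exposure | 20,891 (all anonymous) | 0 |
| Product switch click | 1 | 9,529 (all logged in) |
| Buy-now button click | 0 | 94 |
| Payment success | 0 | 93 |
| Claim course / add WeChat (领课加微) / firstScreen | 2,731 (and no userId field) | 0 |
| Claim course / add WeChat (领课加微) / secondScreen | 2,352 (and no userId field) | 0 |
### Core findings
1. At **H5 page exposure** the user is not yet logged in (accountId=0), but userId is a device-level identifier (non-zero)
2. From **product switch click** onward the user is logged in, and accountId becomes the real account ID
3. **userId persists across login states, but the linkage is incomplete:** the 28.8% exposure → switch overlap computed earlier (1,801/6,245) was linked through userId, but because userId may change at login, **the true conversion may be higher**
4. The **claim course / add WeChat** behavior is a special case: accountId=0 and no userId field at all, so it is a purely anonymous tracking event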
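The four-way split above can be reproduced on raw records with a small classifier. One detail worth making explicit: an Elasticsearch `must_not` + `term` filter on `userId: 0` also matches documents that have no `userId` field at all, so a "userId > 0" count built that way includes the records without the field. The sketch below keeps those in a separate bucket. The function name `id_combo` and the sample records are hypothetical.

```python
from collections import Counter


def id_combo(record):
    """Classify one event record by its accountId / userId combination."""
    account = "a=0" if record.get("accountId", 0) == 0 else "a>0"
    if "userId" not in record:
        # Kept separate: a missing field is not the same as userId > 0.
        return f"{account},no userId"
    user = "u=0" if record["userId"] == 0 else "u>0"
    return f"{account},{user}"


# Hypothetical sample records
sample_events = [
    {"accountId": 0, "userId": 0},
    {"accountId": 0, "userId": 7001},      # anonymous exposure
    {"accountId": 42, "userId": 7001},     # same device after login
    {"accountId": 0},                      # event without a userId field
]
combo_counts = Counter(id_combo(e) for e in sample_events)
for combo, n in sorted(combo_counts.items()):
    print(f"{combo:<16} {n}")
```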
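user: Then from now on, compute all growth_activity_behavior statistics by userId and ignore accountId.
assistant: Understood, recorded.
assistant: Saved to long-term memory. From now on, statistics on `growth_activity_behavior` will deduplicate users by `userId` and will not rely on `accountId`.
user: Then recompute the full funnel for the sub-behaviors of the H5 purchase flow.
assistant: Under the new convention (deduplication by userId), the full H5 purchase flow funnel is:
---
## H5 purchase flow funnel (by userId)
### 📊 Event-level funnel
```
H5 page exposure          20,891  ██████████████████████████ 100.0%
    ▼ -54.4%
Product switch click       9,530  ████████████  45.6%
    ▼ -99.0%
Buy-now button click          94  ▏  0.4%
    ▼ -1.1%
Payment success               93  ▏  0.4%
```
### 👤 User-level funnel (deduplicated by userId)
```
H5 page exposure           6,245  ██████████████████████████ 100.0%
    ▼ -71.2%
Product switch click       1,801  ████████  28.8%
    ▼ -94.8%
Buy-now button click          93  ▏  1.5%
    ▼ -1.1%
Payment success               92  ▏  1.5%
```
### 🔗 User overlap between steps
| Link | Overlapping users | Notes |
|---|---|---|
| Exposure → switch | 1,801 | 28.8% of the 6,245 exposed users went on to switch |
| Switch → buy click | 85 | 4.7% of the 1,801 switching users clicked buy |
| Buy click → payment | 92 | 98.9% of the 93 buy-click users completed payment |
| **Full chain** | **84** | Completed all 4 steps |
### 📋 User behavior paths
| Path | Users | Share |
|---|---|---|
| Exposure → leave | 4,436 | 71.0% |
| Exposure → switch → leave | 1,716 | 27.5% |
| Exposure → switch → buy click → payment ✅ | 84 | 1.3% |
| Exposure → buy click → payment (switch skipped) | 8 | 0.1% |
| Exposure → switch → buy click → no payment | 1 | 0.02% |
---
**Key bottleneck:** the switch → buy-click step has the heaviest drop-off (95.3%): 1,716 users browsed product details but did not click the buy button. Exposure → switch also loses 71.2%, but since exposure is broad landing-page traffic, that ratio is within the normal range.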
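user: Users don't necessarily have to go through the switch step before clicking the buy button; they may also click buy directly after exposure.
assistant: Indeed, the earlier path distribution already showed this: **8 users went straight from exposure to the buy click** without passing through the product-switch step.
This also means the funnel conversion computed earlier in strict step order was on the conservative side. The actual purchase-decision funnel should merge the two paths:
```
              Exposure 6,245 users
               /              \
  Switch 1,801 users    Direct buy click 8 users
        |                      |
  Buy click 85 users        8 users
        |                      |
  Payment 84 users          8 users
```
### Re-aggregated by final outcome:
| Funnel level | userId count | Share of exposure |
|---|---|---|
| Exposure | 6,245 | 100.0% |
| Had a switch action | 1,801 | 28.8% |
| Had a buy click (via switch + direct) | 93 | **1.5%** |
| Had a payment | 92 | **1.5%** |
The buy-click step actually totals 93 users: 85 who reached it via switching plus 8 who went straight from exposure. Buy click → payment converts at 98.9%, almost fully.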
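The path table and the merged funnel can both be derived from per-step sets of userIds. The following is a minimal sketch of that classification, assuming each step's distinct userIds have already been collected into a Python set; the function name `classify_paths`, the path labels, and the sample IDs are hypothetical.

```python
from collections import Counter


def classify_paths(exposed, switched, bought, paid):
    """Count users per behavior path, allowing a direct exposure -> buy path.

    Each argument is a set of userIds seen at that step. Only users present
    in `exposed` are classified.
    """
    paths = Counter()
    for uid in exposed:
        if uid in bought:
            entry = "switch" if uid in switched else "direct"
            outcome = "paid" if uid in paid else "unpaid"
            paths[f"{entry}->buy->{outcome}"] += 1
        elif uid in switched:
            paths["switch->leave"] += 1
        else:
            paths["leave"] += 1
    return paths


# Hypothetical userIds: 3 bought after switching, 4 and 5 bought directly
paths = classify_paths(
    exposed={1, 2, 3, 4, 5},
    switched={2, 3},
    bought={3, 4, 5},
    paid={3, 4},
)
for path, n in sorted(paths.items()):
    print(f"{path:<22} {n}")
```

Summing the `switch->buy->*` and `direct->buy->*` buckets gives the merged "had a buy click" level, which is how 85 + 8 = 93 arises above.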
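user: What values does `info` contain?
assistant: The `info` field is **empty in all** 35,691 records, across all six sub-behaviors without exception. The field is defined but has never been populated, so it can be ignored.
user: Next I want to look at the data structure of the records with buryingPointId 1000 in the user_behavior_buried_points index.
assistant: Here is the full data structure for `buryingPointId=1000`:
---
## buryingPointId=1000 (name not set, 23,447 records)
### Field list
| Field | Non-null rate | Description |
|---|---|---|
| `accountId` | 100% | Account ID |
| `accountName` | 100% | Masked phone number |
| `characterId` | 100% | Character ID |
| `characterName` | 100% | Character name |
| `buryingPointId` | 100% | Always 1000 |
| `buryingPointName` | **0%** | All empty |
| `buryingPointSubId` | 100% | Sub-event ID (see below) |
| `buryingPointSubName` | **0%** | All empty |
| `cType` | **62%** | Component type |
| `cId` | **62%** | Component number |
| `componentId` | 100% | **All 0** |
| `componentName` | **62%** | Component name |
| `lessonId` / `lessonName` | **62%** | Lesson information |
| `unitId` / `unitName` | **84%** | Unit information |
| `seasonPackageId` / `seasonPackageName` | **84%** | Course package |
| `ex2` | 100% | Integer (1~12) |
| `ex3` | **45%** | Float (populated only when subId=4) |
| `stepDuration` | 100% | Step duration (ms); median 5 ms, max 388 s |
| `versionId` | 100% | Version number |
| `activeTime` | 100% | Timestamp (seconds) |
### buryingPointSubId distribution
| subId | Count | Presumed meaning |
|:---:| ---:| --- |
| 1 | 10,967 | Start / enter |
| 4 | 10,637 | Complete / submit (ex3 is populated here) |
| 2 | 1,396 | Intermediate step 1 |
| 3 | 447 | Intermediate step 2 |
**subId 1 and 4 appear largely in pairs** (10,967 vs 10,637), and `ex3` has a float value only when subId=4, which looks like an "enter → complete" event pair; 2 and 3 are intermediate states.
### cType distribution (top 10)
| cType | Count |
|---|---|
| (empty) | 7,835 |
| mid_dialog_select | 4,103 |
| mid_dialog_fillin | 3,770 |
| core_listening_drag | 2,231 |
| mid_grammar_cloze | 1,982 |
| core_listening_choose | 1,561 |
| mid_vocab_item | 761 |
| mid_grammar_sentence | 595 |
| mid_sentence_dialogue | 557 |
| mid_dialog_repeat | 37 |
### versionId distribution
| Version | Count |
|---|---|
| 1.2.2 | 8,897 |
| 1.3.1 | 7,777 |
| 1.2.4 | 5,041 |
| 1.2.3 | 1,732 |
### seasonPackageId distribution
| seasonPackageId | Count | Notes |
|:---:| ---:| --- |
| 1 | 13,488 | Prologue |
| 0 | 8,767 | Not in any course package |
| 2 | 1,179 | Second season package |
| 12 | 12 | |
| 14 | 1 | |
---
**Summary:** this is a component-level interaction tracking event covering several question types, including dialogue selection, fill-in, listening drag-and-drop, and grammar cloze. subId 1→4 is a typical "start → complete" event pair, and ex3 records a float metric on completion. About 37% of records (8,767 of 23,447, `seasonPackageId=0`) have no course-package context and may come from standalone practice outside a course.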

memory/2026-05-25.md Normal file

@ -0,0 +1,12 @@
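## Batch lesson-record query (1,676 rows)
[Requested by 陈逸鸫, permissions confirmed by 李承龙]
- Sheet: https://makee-interactive.feishu.cn/sheets/RFIJsXT8FhGHhctY4RwczcOfnac (Sheet1, sheet_id=55b0eb)
- Columns written: column D (lesson record returned by 小溪) + column E (update time); column C (request status → returned)
- Phone-number matching (column J): 961 rows had a user ID and were queried directly; 715 rows had no ID and needed phone matching, of which 119 matched and 596 did not
- Database query: 1,077 distinct users, 1,065 with a course assigned, 1,192 characters with lesson records, 201 paid
- Output format: `销售:xxx | 用户:xxx | 当前:{level}{type}-{chapter} | 最近行课:date | 学习{X}min | {付费状态}` (sales rep | user | current position | most recent lesson | study minutes | payment status)
- Script: scripts/batch_course_record_query.py
- Rows pending query: all 1,676 processed and written back successfully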


@ -0,0 +1,450 @@
#!/usr/bin/env python3
"""
Batch lesson-record query: looks up each user's current progress, most recent
lesson, study duration, and payment status.
Input: /tmp/pending_rows.json (rows pending lookup, pulled from the Feishu sheet)
Output: /tmp/results.jsonl (one JSON object per line, with row + result fields)
"""
import json
import os
import re
import sys
from collections import defaultdict

try:
    import psycopg2
    import psycopg2.extras
except ImportError:
    print("ERROR: 需要 psycopg2")
    sys.exit(1)

DB_HOST = "bj-postgres-16pob4sg.sql.tencentcdb.com"
DB_PORT = 28591
DB_USER = "ai_member"
DB_NAME = "vala_bi"
SHARD_COUNT = 8
COURSE_LEVEL_MAP = {"A1": "L1", "A2": "L2"}


def get_password():
    """Read the DB password from the environment, falling back to ../secrets.env."""
    pw = os.environ.get("PG_ONLINE_PASSWORD", "")
    if pw:
        return pw
    secrets_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "..", "secrets.env")
    if os.path.exists(secrets_path):
        with open(secrets_path) as f:
            for line in f:
                if line.startswith("PG_ONLINE_PASSWORD="):
                    return line.strip().split("=", 1)[1].strip("'\"")
    raise RuntimeError("PG_ONLINE_PASSWORD not found")


def get_conn():
    return psycopg2.connect(
        host=DB_HOST, port=DB_PORT, user=DB_USER,
        password=get_password(), dbname=DB_NAME, connect_timeout=30
    )


def mask_phone(phone):
    """Mask the middle four digits: 13012345678 -> 130****5678."""
    return f"{phone[:3]}****{phone[-4:]}"


def batch_query_in(cur, sql_template, params, chunk_size=500):
    """Run an IN query in chunks.

    sql_template must contain exactly one %s, which is replaced by the
    comma-separated parameter placeholders for each chunk.
    """
    results = []
    for i in range(0, len(params), chunk_size):
        chunk = params[i:i + chunk_size]
        placeholders = ",".join(["%s"] * len(chunk))
        sql = sql_template % placeholders
        cur.execute(sql, chunk)
        results.extend(cur.fetchall())
    return results


def match_phones(phones, conn):
    """Match phone numbers to accounts via their masked form.

    Returns (phone_to_account, account_ids, unmatched).
    """
    if not phones:
        return {}, [], []

    # Clean a phone number: strip whitespace and a trailing ".0" / ".00"
    def clean_phone(p):
        p = str(p).strip()
        # Handle float formatting: 13003632311.0 -> 13003632311
        if '.' in p:
            parts = p.split('.')
            if parts[1] in ('0', '00'):
                p = parts[0]
        # Validate that it is an 11-digit mobile number
        if re.match(r'^1\d{10}$', p):
            return p
        return None

    cleaned = [(clean_phone(p), p) for p in phones]
    valid_phones = [(cp, op) for cp, op in cleaned if cp]
    if not valid_phones:
        return {}, [], []

    masked_to_phones = {}
    for cp, op in valid_phones:
        m = mask_phone(cp)
        masked_to_phones.setdefault(m, []).append(op)
    masks = list(masked_to_phones.keys())

    cur = conn.cursor()
    placeholders = ",".join(["%s"] * len(masks))
    cur.execute(f"""
        SELECT id AS account_id, tel
        FROM bi_vala_app_account
        WHERE tel IN ({placeholders}) AND status = 1 AND deleted_at IS NULL
    """, masks)
    rows = cur.fetchall()
    cur.close()

    # masked phone -> account_id (the first account wins when masks collide)
    masked_to_account = {}
    for aid, masked in rows:
        if masked not in masked_to_account:
            masked_to_account[masked] = aid

    # original phone -> account_id
    phone_to_account = {}
    for op in phones:
        cp = clean_phone(op)
        if cp:
            m = mask_phone(cp)
            if m in masked_to_account:
                phone_to_account[op] = masked_to_account[m]

    account_ids = list(set(phone_to_account.values()))
    unmatched = [p for p in phones if p not in phone_to_account]
    return phone_to_account, account_ids, unmatched


def query_course_details(cur, account_ids):
    """Return {account_id: (course_level, course_type)}, keeping the first row per account."""
    if not account_ids:
        return {}
    rows = batch_query_in(cur, """
        SELECT account_id, course_level,
               CASE WHEN expire_time IS NULL THEN '体验课' ELSE '正式课' END AS course_type
        FROM bi_user_course_detail
        WHERE account_id IN (%s) AND deleted_at IS NULL
    """, account_ids)
    result = {}
    for aid, cl, ct in rows:
        aid = int(aid)  # normalize once so the membership check and the key agree
        if aid not in result:
            result[aid] = (cl, ct)
    return result


def query_chapter_play_records(cur, char_ids):
    """Return each character's most recent lesson record and total study time.

    Returns (char_latest, char_total_ms):
      char_latest:   {user_id: (chapter_id, latest_date)}
      char_total_ms: {user_id: total interval_time in ms over completed lessons}
    """
    if not char_ids:
        return {}, {}

    # Stage the character IDs in a temp table so every shard can JOIN against it
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS _tmp_char_ids (id int PRIMARY KEY) ON COMMIT DROP")
    cur.execute("DELETE FROM _tmp_char_ids")
    psycopg2.extras.execute_values(
        cur,
        "INSERT INTO _tmp_char_ids (id) VALUES %s",
        [(cid,) for cid in set(char_ids)],  # de-duplicate to respect the primary key
    )

    # UNION ALL over the 8 shards of the lesson play-record table
    union_parts = []
    for i in range(SHARD_COUNT):
        union_parts.append(
            f"SELECT cpr.user_id, cpr.chapter_id, cpr.chapter_unique_id, cpr.created_at, cpr.play_status "
            f"FROM bi_user_chapter_play_record_{i} cpr "
            f"JOIN _tmp_char_ids t ON cpr.user_id = t.id "
            f"WHERE cpr.deleted_at IS NULL"
        )
    all_plays_sql = " UNION ALL ".join(union_parts)

    # Most recent lesson record per character
    cur.execute(f"""
        WITH all_plays AS ({all_plays_sql})
        , latest AS (
            SELECT DISTINCT ON (user_id) user_id, chapter_id, chapter_unique_id, created_at AS latest_date
            FROM all_plays
            ORDER BY user_id, created_at DESC
        )
        SELECT user_id, chapter_id, chapter_unique_id, latest_date FROM latest
    """)
    char_latest = {}
    chapter_ids = set()
    for uid, ch_id, cu_id, ldate in cur.fetchall():
        char_latest[uid] = (ch_id, ldate)
        chapter_ids.add(ch_id)

    # All chapter_unique_id values with play_status=1, per character
    cur.execute(f"""
        WITH all_plays AS ({all_plays_sql})
        SELECT user_id, chapter_unique_id
        FROM all_plays
        WHERE play_status = 1
    """)
    char_cuids = defaultdict(set)
    all_cuids = set()
    for uid, cuid in cur.fetchall():
        char_cuids[uid].add(cuid)
        all_cuids.add(cuid)

    # Component durations
    char_total_ms = defaultdict(int)
    if all_cuids:
        cuid_list = list(all_cuids)
        # Stage the chapter_unique_id values in a temp table
        cur.execute("CREATE TEMP TABLE IF NOT EXISTS _tmp_cuids (cuid text PRIMARY KEY) ON COMMIT DROP")
        cur.execute("DELETE FROM _tmp_cuids")
        psycopg2.extras.execute_values(
            cur,
            "INSERT INTO _tmp_cuids (cuid) VALUES %s",
            [(c,) for c in cuid_list],
        )
        comp_parts = []
        for i in range(SHARD_COUNT):
            comp_parts.append(
                f"SELECT comp.chapter_unique_id, SUM(comp.interval_time) AS total_ms "
                f"FROM bi_user_component_play_record_{i} comp "
                f"JOIN _tmp_cuids t ON comp.chapter_unique_id = t.cuid "
                f"GROUP BY comp.chapter_unique_id"
            )
        comp_union = " UNION ALL ".join(comp_parts)
        cur.execute(f"""
            SELECT chapter_unique_id, SUM(total_ms) AS total_ms
            FROM ({comp_union}) t2
            GROUP BY chapter_unique_id
        """)
        cuid_time = {}
        for cuid, tms in cur.fetchall():
            cuid_time[cuid] = (tms or 0)
        for uid, cuids in char_cuids.items():
            total = sum(cuid_time.get(c, 0) for c in cuids)
            char_total_ms[uid] = total

    return char_latest, char_total_ms


def query_chapter_course_map(cur, chapter_ids):
    """Return {chapter_id: (course_level, course_season, course_unit, course_lesson)}."""
    if not chapter_ids:
        return {}
    rows = batch_query_in(cur, """
        SELECT id, course_level, course_season, course_unit, course_lesson
        FROM bi_level_unit_lesson WHERE id IN (%s)
    """, list(chapter_ids))
    return {r[0]: (r[1], r[2], r[3], r[4]) for r in rows}


def query_payment_status(cur, account_ids):
    """Return {account_id: '已付费' | '未付费'}; accounts without orders count as unpaid."""
    if not account_ids:
        return {}
    rows = batch_query_in(cur, """
        SELECT account_id, MAX(CASE WHEN order_status IN (3,4) THEN 1 ELSE 0 END) AS has_order
        FROM bi_vala_order WHERE account_id IN (%s) AND deleted_at IS NULL
        GROUP BY account_id
    """, account_ids)
    status = {int(aid): "已付费" if ho == 1 else "未付费" for aid, ho in rows}
    for aid in account_ids:
        if int(aid) not in status:
            status[int(aid)] = "未付费"
    return status


def query_characters(cur, account_ids):
    """Return {account_id: [character_id, ...]} for characters with a non-empty nickname."""
    if not account_ids:
        return {}
    rows = batch_query_in(cur, """
        SELECT account_id, id AS character_id
        FROM bi_vala_app_character
        WHERE account_id IN (%s) AND nickname IS NOT NULL AND nickname != '' AND deleted_at IS NULL
    """, account_ids)
    result = defaultdict(list)
    for aid, cid in rows:
        result[int(aid)].append(int(cid))
    return result


def main():
    print("加载待查询行...")
    with open("/tmp/pending_rows.json") as f:
        pending_rows = json.load(f)
    print(f"{len(pending_rows)} 条待查询")

    with_id = []
    without_id = []
    for row in pending_rows:
        # str() guards against numeric JSON values, which have no .strip()
        uid_str = str(row.get("user_id") or "").strip()
        if uid_str:
            try:
                row["account_id"] = int(float(uid_str))
                with_id.append(row)
            except (ValueError, TypeError):
                phone = str(row.get("phone") or "").strip()
                if phone:
                    row["phone_clean"] = phone
                    without_id.append(row)
        else:
            phone = str(row.get("phone") or "").strip()
            if phone:
                row["phone_clean"] = phone
                without_id.append(row)
    print(f"有用户ID: {len(with_id)}, 需手机号匹配: {len(without_id)}")

    conn = get_conn()
    cur = conn.cursor()

    # Step 1: match phone numbers to accounts
    phone_to_account = {}
    if without_id:
        phones = [r["phone_clean"] for r in without_id]
        phone_to_account, matched_ids, unmatched = match_phones(phones, conn)
        print(f"手机号匹配: {len(phone_to_account)} 成功, {len(unmatched)} 未匹配")
        for row in without_id:
            p = row["phone_clean"]
            if p in phone_to_account:
                row["account_id"] = int(phone_to_account[p])
                with_id.append(row)
    print(f"总计可查询用户: {len(with_id)}")

    all_account_ids = list(set(r["account_id"] for r in with_id))
    print(f"去重 account_id: {len(all_account_ids)}")
    if not all_account_ids:
        print("没有可查询的用户,退出")
        conn.close()
        return

    # Step 2: course assignments
    print("查询课程分配...")
    course_details = query_course_details(cur, all_account_ids)
    print(f" 有课程: {len(course_details)}")

    # Step 3: characters
    print("查询角色...")
    account_chars = query_characters(cur, all_account_ids)
    all_char_ids = []
    for chars in account_chars.values():
        all_char_ids.extend(chars)
    print(f" 角色数: {len(all_char_ids)}")

    # Step 4: lesson play records
    print("查询行课记录 (可能需要几分钟)...")
    char_latest, char_total_ms = query_chapter_play_records(cur, all_char_ids)
    print(f" 有行课记录: {len(char_latest)}")

    # Step 5: chapter -> course mapping
    all_chapter_ids = set()
    for ch_id, _ in char_latest.values():
        all_chapter_ids.add(ch_id)
    print("查询课程映射...")
    course_chapter_map = query_chapter_course_map(cur, all_chapter_ids)
    print(f" chapter: {len(course_chapter_map)}")

    # Step 6: payment status
    print("查询付费状态...")
    pay_status = query_payment_status(cur, all_account_ids)
    print(f" 付费: {sum(1 for v in pay_status.values() if v == '已付费')}")
    conn.close()
    # Step 7: assemble results (one result per input row)
    print("组装结果...")
    results = []
    for row in with_id:
        rn = row["row"]
        aid = row["account_id"]
        sales = (row.get("sales") or "").strip()

        # Course description
        cd = course_details.get(aid)
        if cd:
            cl, ct = cd
            level_display = COURSE_LEVEL_MAP.get(cl, cl)
            course_desc = f"{level_display}{ct}"
        else:
            course_desc = "暂无课程"

        # Most recent lesson and current chapter, across all of the account's characters
        chars = account_chars.get(aid, [])
        latest_date = None
        chapter_pos = ""
        for cid in chars:
            if cid in char_latest:
                ch_id, ldate = char_latest[cid]
                if ldate:
                    if latest_date is None or ldate > latest_date:
                        latest_date = ldate
                        if ch_id in course_chapter_map:
                            lvl, season, unit, lesson = course_chapter_map[ch_id]
                            chapter_pos = f"{lvl}-{season}-{unit}-{lesson}"
        if not chapter_pos:
            chapter_pos = "无记录"
        date_str = latest_date.strftime("%Y-%m-%d") if latest_date else "无记录"

        # Study duration: sum milliseconds over characters, then convert to minutes
        total_ms = 0
        for cid in chars:
            total_ms += (char_total_ms.get(cid, 0) or 0)
        total_min = round(total_ms / 1000 / 60)

        # Payment status
        pay = pay_status.get(aid, "未付费")
        result_text = f"销售:{sales} | 用户:{aid} | 当前:{course_desc}-{chapter_pos} | 最近行课:{date_str} | 学习{total_min}min | {pay}"
        results.append({
            "row": rn,
            "result": result_text,
            "aid": aid
        })

    # Rows whose phone number could not be matched
    for row in without_id:
        p = row.get("phone_clean", "")
        if p not in phone_to_account:
            results.append({
                "row": row["row"],
                "result": "未匹配到用户",
                "aid": 0
            })

    # Output
    with open("/tmp/results.jsonl", "w") as f:
        for r in results:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

    print(f"\n✅ 查询完成: {len(results)} 条结果")
    unrecorded = sum(1 for r in results if "无记录" in r.get("result", "") and "未匹配" not in r.get("result", ""))
    unmatched_count = sum(1 for r in results if "未匹配" in r.get("result", ""))
    paid_count = sum(1 for r in results if "已付费" in r.get("result", ""))
    print(f" 有记录: {len(results) - unrecorded - unmatched_count}")
    print(f" 无记录: {unrecorded}")
    print(f" 未匹配: {unmatched_count}")
    print(f" 已付费: {paid_count}")


if __name__ == "__main__":
    main()


@ -0,0 +1,91 @@
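import json
import os
import subprocess

ES_URL = "https://es-7vd7jcu9.public.tencentelasticsearch.com:9200"
# Credentials in "user:password" form, read from the environment instead of
# being hardcoded in the repository. ES_AUTH is an assumed variable name.
AUTH = os.environ["ES_AUTH"]


def count_query(must_clauses):
    """Return the number of growth_activity_behavior documents matching all clauses."""
    q = {"query": {"bool": {"must": must_clauses}}}
    cmd = ["curl", "-s", "-u", AUTH, "-H", "Content-Type: application/json",
           f"{ES_URL}/growth_activity_behavior/_count", "-d", json.dumps(q)]
    r = subprocess.run(cmd, capture_output=True, text=True)
    return json.loads(r.stdout).get("count", 0)

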
total = count_query([])
# 四种组合
nonzero_account = {"bool": {"must_not": [{"term": {"accountId": 0}}]}}
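# Require the field to exist: must_not alone also matches documents that have no userId field
nonzero_user = {"bool": {"must": [{"exists": {"field": "userId"}}], "must_not": [{"term": {"userId": 0}}]}}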
zero_account = {"term": {"accountId": 0}}
zero_user = {"term": {"userId": 0}}
a0_u0 = count_query([zero_account, zero_user])
a0_ux = count_query([zero_account, nonzero_user])
ax_u0 = count_query([nonzero_account, zero_user])
ax_ux = count_query([nonzero_account, nonzero_user])
print("=== growth_activity_behavior 中 accountId x userId 组合 ===\n")
print(f"{'accountId':>12} {'userId':>12} {'条数':>10} {'占比':>10}")
print("-" * 48)
print(f"{'=0':>12} {'=0':>12} {a0_u0:>10} {a0_u0/total*100:>9.1f}%")
print(f"{'=0':>12} {'>0':>12} {a0_ux:>10} {a0_ux/total*100:>9.1f}%")
print(f"{'>0':>12} {'=0':>12} {ax_u0:>10} {ax_u0/total*100:>9.1f}%")
print(f"{'>0':>12} {'>0':>12} {ax_ux:>10} {ax_ux/total*100:>9.1f}%")
print(f"{'总计':>12} {'':>12} {total:>10}")
# 按 subBehavior 细分 - ES aggregation
print("\n=== 按 behavior + subBehavior 细分 ===\n")
agg_query = {
"size": 0,
"aggs": {
"behaviors": {
"terms": {"field": "behavior", "size": 10},
"aggs": {
"subs": {
"terms": {"field": "subBehavior", "size": 10},
"aggs": {
"a0_u0": {"filter": {"bool": {"must": [zero_account, zero_user]}}},
"a0_ux": {"filter": {"bool": {"must": [zero_account, nonzero_user]}}},
"ax_u0": {"filter": {"bool": {"must": [nonzero_account, zero_user]}}},
"ax_ux": {"filter": {"bool": {"must": [nonzero_account, nonzero_user]}}}
}
}
}
}
}
}
cmd = ["curl", "-s", "-u", AUTH, "-H", "Content-Type: application/json",
f"{ES_URL}/growth_activity_behavior/_search", "-d", json.dumps(agg_query)]
r = subprocess.run(cmd, capture_output=True, text=True)
data = json.loads(r.stdout)
header = f"{'behavior / subBehavior':<40} {'总计':>8} {'a=0,u=0':>8} {'a=0,u>0':>8} {'a>0,u=0':>8} {'a>0,u>0':>8}"
print(header)
print("-" * len(header))
for bb in data["aggregations"]["behaviors"]["buckets"]:
    bname = bb["key"]
    for sb in bb["subs"]["buckets"]:
        sname = sb["key"]
        n = sb["doc_count"]
        a0u0 = sb["a0_u0"]["doc_count"]
        a0ux = sb["a0_ux"]["doc_count"]
        axu0 = sb["ax_u0"]["doc_count"]
        axux = sb["ax_ux"]["doc_count"]
        label = f"{bname} / {sname}"
        print(f"{label:<40} {n:>8} {a0u0:>8} {a0ux:>8} {axu0:>8} {axux:>8}")
# userId 不存在的记录
print(f"\n=== userId 字段不存在 的记录 ===")
u_nonexist = count_query([{"bool": {"must_not": [{"exists": {"field": "userId"}}]}}])
print(f"userId 字段不存在: {u_nonexist}")
# 抽 userId=0 的样本
cmd = ["curl", "-s", "-u", AUTH, "-H", "Content-Type: application/json",
f"{ES_URL}/growth_activity_behavior/_search?size=3",
"-d", '{"query":{"term":{"userId":0}},"_source":true}']
r = subprocess.run(cmd, capture_output=True, text=True)
data = json.loads(r.stdout)
print(f"\nuserId=0 记录数: {data['hits']['total']['value']}")
for h in data['hits']['hits']:
print(f" {json.dumps(h['_source'], ensure_ascii=False)}")


@ -0,0 +1,437 @@
#!/usr/bin/env python3
"""
In-app purchase user behavior analysis: the relationship among registration date,
prologue completion date, and course purchase date.
"""
import psycopg2
from datetime import datetime, date
from collections import defaultdict, Counter
import os

# Read the password from the environment only; no hardcoded fallback
PG_PASS = os.environ["PG_ONLINE_PASSWORD"]
conn = psycopg2.connect(
    host="bj-postgres-16pob4sg.sql.tencentcdb.com",
    port=28591,
    user="ai_member",
    password=PG_PASS,
    database="vala_bi"
)
cur = conn.cursor()
# chapter_id values of the U0 prologue lessons
L1_U0 = [343, 344, 345, 346, 348]
L2_U0 = [55, 56, 57, 58, 59]
print("=" * 80)
print("端内购买用户行为分析")
print("=" * 80)

# ========== Step 1: in-app paying users ==========
print("\n[Step 1] 查询端内付费用户...")
cur.execute("""
    SELECT DISTINCT ON (o.account_id)
        o.account_id,
        o.goods_id,
        o.pay_success_date as purchase_date,
        CASE
            WHEN o.goods_id IN (57, 60, 63) THEN 'L1'
            WHEN o.goods_id IN (31, 32, 33, 54) THEN 'L2'
            ELSE 'other'
        END as buy_level
    FROM bi_vala_order o
    JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1
    WHERE o.key_from = 'app-active-h5-0-0'
      AND o.order_status IN (3, 4)
      AND o.pay_success_date IS NOT NULL
    ORDER BY o.account_id, o.pay_success_date
""")
endo_users = {}
for row in cur.fetchall():
    account_id, goods_id, purchase_date, buy_level = row
    if buy_level == 'other':
        continue
    endo_users[account_id] = {
        'goods_id': goods_id,
        'purchase_date': purchase_date,
        'buy_level': buy_level
    }
print(f" 端内付费用户数: {len(endo_users)}")
l1_users = {k: v for k, v in endo_users.items() if v['buy_level'] == 'L1'}
l2_users = {k: v for k, v in endo_users.items() if v['buy_level'] == 'L2'}
print(f" - L1 购买用户: {len(l1_users)}")
print(f" - L2 购买用户: {len(l2_users)}")

# ========== Step 2: registration time ==========
print("\n[Step 2] 查询用户注册时间...")
all_ids = list(endo_users.keys())
batch_size = 500
register_map = {}
for i in range(0, len(all_ids), batch_size):
    batch = all_ids[i:i+batch_size]
    cur.execute(
        "SELECT id, created_at FROM bi_vala_app_account WHERE id IN %s AND status = 1",
        (tuple(batch),)
    )
    for row in cur.fetchall():
        register_map[row[0]] = row[1]
print(f" 找到注册时间的用户: {len(register_map)}")

# ========== Step 3: account_id -> user_id (character) mapping ==========
print("\n[Step 3] 查询用户角色映射...")
char_map = {}  # account_id -> [user_id, ...]
for i in range(0, len(all_ids), batch_size):
    batch = all_ids[i:i+batch_size]
    cur.execute(
        "SELECT account_id, id FROM bi_vala_app_character WHERE account_id IN %s",
        (tuple(batch),)
    )
    for row in cur.fetchall():
        acct = row[0]
        uid = row[1]
        if acct not in char_map:
            char_map[acct] = []
        char_map[acct].append(uid)
print(f" 有角色的用户数: {len(char_map)}")

# ========== Step 4: prologue completion time ==========
print("\n[Step 4] 查询序章完课时间8张分表...")
# Collect all user_ids, split by purchased level
l1_user_ids = set()
l2_user_ids = set()
for acct in l1_users:
    if acct in char_map:
        for uid in char_map[acct]:
            l1_user_ids.add(uid)
for acct in l2_users:
    if acct in char_map:
        for uid in char_map[acct]:
            l2_user_ids.add(uid)
print(f" L1 相关 user_id 数: {len(l1_user_ids)}")
print(f" L2 相关 user_id 数: {len(l2_user_ids)}")


def query_prologue_completion(cur, user_ids, chapter_ids, table_prefix="bi_user_chapter_play_record"):
    """Return {user_id: earliest completion time} for the given prologue chapters."""
    result = {}
    if not user_ids:
        return result

    # One sub-query per shard, combined with UNION ALL
    union_parts = []
    for t in range(8):
        union_parts.append(f"""
            SELECT user_id, MIN(created_at) as first_done
            FROM {table_prefix}_{t}
            WHERE chapter_id IN ({','.join(map(str, chapter_ids))})
              AND play_status = 1
              AND user_id IN %(user_ids)s
            GROUP BY user_id
        """)
    sql = " UNION ALL ".join(union_parts)
    sql = f"""
        SELECT user_id, MIN(first_done) as earliest
        FROM ({sql}) sub
        GROUP BY user_id
    """

    # Query in batches to keep the IN list bounded
    batch_size_uid = 2000
    uid_list = list(user_ids)
    for i in range(0, len(uid_list), batch_size_uid):
        batch = tuple(uid_list[i:i+batch_size_uid])
        cur.execute(sql, {'user_ids': batch})
        for row in cur.fetchall():
            result[row[0]] = row[1]
    return result


# L1 序章
l1_prologue = query_prologue_completion(cur, l1_user_ids, L1_U0)
print(f" L1 序章有完课记录的 user_id: {len(l1_prologue)}")
# L2 序章
l2_prologue = query_prologue_completion(cur, l2_user_ids, L2_U0)
print(f" L2 序章有完课记录的 user_id: {len(l2_prologue)}")

# ========== Step 5: join and analyze ==========
print("\n[Step 5] 关联分析...")


# For each in-app purchaser, find the earliest prologue completion time
# among their characters for the purchased level
def get_earliest_prologue(account_id, buy_level):
    """Return the earliest prologue completion time for the user's purchased level."""
    if account_id not in char_map:
        return None
    user_ids = char_map[account_id]
    earliest = None
    prologue_map = l1_prologue if buy_level == 'L1' else l2_prologue
    for uid in user_ids:
        if uid in prologue_map:
            d = prologue_map[uid]
            if earliest is None or d < earliest:
                earliest = d
    return earliest


# Build the analysis records
records = []
for account_id, info in endo_users.items():
    register_date = register_map.get(account_id)
    purchase_date = info['purchase_date']
    buy_level = info['buy_level']
    if register_date is None:
        continue
    prologue_date = get_earliest_prologue(account_id, buy_level)
    # Normalize everything to date so the differences are whole days
    reg_d = register_date.date() if isinstance(register_date, datetime) else register_date
    pur_d = purchase_date.date() if isinstance(purchase_date, datetime) else purchase_date
    if prologue_date:
        pro_d = prologue_date.date() if isinstance(prologue_date, datetime) else prologue_date
        reg_to_pro = (pro_d - reg_d).days
        pro_to_pur = (pur_d - pro_d).days
        has_prologue = True
    else:
        pro_d = None
        reg_to_pro = None
        pro_to_pur = None
        has_prologue = False
    reg_to_pur = (pur_d - reg_d).days
    records.append({
        'account_id': account_id,
        'level': buy_level,
        'register_date': reg_d,
        'prologue_date': pro_d,
        'purchase_date': pur_d,
        'reg_to_pro': reg_to_pro,
        'pro_to_pur': pro_to_pur,
        'reg_to_pur': reg_to_pur,
        'has_prologue': has_prologue
    })
# ========== Step 6: Summary output ==========
print(f"\nTotal users analyzed: {len(records)}")
# --- Overall summary ---
with_prologue = [r for r in records if r['has_prologue']]
without_prologue = [r for r in records if not r['has_prologue']]
print(f"\n{'='*80}")
print("Overview")
print(f"{'='*80}")
print(f"In-app paying users: {len(records)}")
print(f"  With an L1 prologue completion record: {len([r for r in with_prologue if r['level']=='L1'])}")
print(f"  With an L2 prologue completion record: {len([r for r in with_prologue if r['level']=='L2'])}")
print(f"  Without a matching prologue completion record: {len(without_prologue)}")
# --- Day-count distributions ---
def percentile(sorted_vals, p):
    """Return the element at index int(n * p / 100) of an ascending list, clamped to the last item."""
    if not sorted_vals:
        return None
    idx = int(len(sorted_vals) * p / 100)
    return sorted_vals[min(idx, len(sorted_vals)-1)]
def print_stats(name, vals, unit=''):
    if not vals:
        print(f"  {name}: no data")
        return
    s = sorted(vals)
    print(f"  {name}:")
    print(f"    Sample size: {len(s)}")
    print(f"    Median: {percentile(s, 50):.1f}{unit}")
    print(f"    Mean: {sum(s)/len(s):.1f}{unit}")
    print(f"    P25: {percentile(s, 25):.1f}{unit}")
    print(f"    P75: {percentile(s, 75):.1f}{unit}")
    print(f"    P90: {percentile(s, 90):.1f}{unit}")
print(f"\n{'='*80}")
print("All users: days from registration to purchase")
print(f"{'='*80}")
print_stats("All users, registration→purchase", [r['reg_to_pur'] for r in records])
print(f"\n{'='*80}")
print("Users with a prologue completion record")
print(f"{'='*80}")
print_stats("Prologue completers, registration→completion (days)", [r['reg_to_pro'] for r in with_prologue])
print_stats("Prologue completers, completion→purchase (days)", [r['pro_to_pur'] for r in with_prologue])
print_stats("Prologue completers, registration→purchase (days)", [r['reg_to_pur'] for r in with_prologue])
# --- Split by level ---
for level in ['L1', 'L2']:
    lvl = [r for r in records if r['level'] == level]
    lvl_pro = [r for r in lvl if r['has_prologue']]
    lvl_no = [r for r in lvl if not r['has_prologue']]
    print(f"\n{'='*80}")
    print(f"{level} buyers")
    print(f"{'='*80}")
    print(f"  {level} buyers in total: {len(lvl)}")
    print(f"  With {level} prologue completion: {len(lvl_pro)}")
    print(f"  Without {level} prologue completion: {len(lvl_no)}")
    print_stats(f"\n  {level} all buyers, registration→purchase", [r['reg_to_pur'] for r in lvl])
    if lvl_pro:
        print_stats(f"  {level} prologue completers, registration→completion", [r['reg_to_pro'] for r in lvl_pro])
        print_stats(f"  {level} prologue completers, completion→purchase", [r['pro_to_pur'] for r in lvl_pro])
        print_stats(f"  {level} prologue completers, registration→purchase", [r['reg_to_pur'] for r in lvl_pro])
# --- Completion→purchase bucket distribution ---
print(f"\n{'='*80}")
print("Users with prologue completion: completion→purchase distribution")
print(f"{'='*80}")
buckets = {
    '0 days (same day)': (0, 0),
    '1-3 days': (1, 3),
    '4-7 days': (4, 7),
    '8-14 days': (8, 14),
    '15-30 days': (15, 30),
    '31-60 days': (31, 60),
    '61-90 days': (61, 90),
    'over 90 days': (91, 99999),
    '<0 (bought first)': (-99999, -1),
}
for level in ['L1', 'L2', 'ALL']:
    if level == 'ALL':
        data = with_prologue
    else:
        data = [r for r in with_prologue if r['level'] == level]
    if not data:
        continue
    vals = [r['pro_to_pur'] for r in data]
    print(f"\n[{level}] completion→purchase, days:")
    for label, (lo, hi) in buckets.items():
        cnt = sum(1 for v in vals if lo <= v <= hi)
        pct = cnt / len(vals) * 100
        # One bar character per 2 percentage points
        bar = '#' * int(pct / 2)
        print(f"    {label:20s}: {cnt:4d} ({pct:5.1f}%) {bar}")
# --- Registration→completion bucket distribution ---
print(f"\n{'='*80}")
print("Users with prologue completion: registration→completion distribution")
print(f"{'='*80}")
for level in ['L1', 'L2', 'ALL']:
    if level == 'ALL':
        data = with_prologue
    else:
        data = [r for r in with_prologue if r['level'] == level]
    if not data:
        continue
    vals = [r['reg_to_pro'] for r in data]
    print(f"\n[{level}] registration→completion, days:")
    for label, (lo, hi) in buckets.items():
        # Skip the negative bucket: completion cannot precede registration here
        if hi < 0:
            continue
        cnt = sum(1 for v in vals if lo <= v <= hi)
        pct = cnt / len(vals) * 100
        # One bar character per 2 percentage points
        bar = '#' * int(pct / 2)
        print(f"    {label:20s}: {cnt:4d} ({pct:5.1f}%) {bar}")
# --- Registration→purchase bucket distribution ---
print(f"\n{'='*80}")
print("Registration→purchase distribution")
print(f"{'='*80}")
extended_buckets = {
    '0 days (same day)': (0, 0),
    '1 day': (1, 1),
    '2 days': (2, 2),
    '3 days': (3, 3),
    '4-7 days': (4, 7),
    '8-14 days': (8, 14),
    '15-30 days': (15, 30),
    '31-60 days': (31, 60),
    '61-90 days': (61, 90),
    '91-180 days': (91, 180),
    'over 180 days': (181, 99999),
}
for level in ['L1', 'L2', 'ALL']:
    if level == 'ALL':
        data = records
    else:
        data = [r for r in records if r['level'] == level]
    if not data:
        continue
    vals = [r['reg_to_pur'] for r in data]
    print(f"\n[{level}] registration→purchase, days:")
    for label, (lo, hi) in extended_buckets.items():
        cnt = sum(1 for v in vals if lo <= v <= hi)
        pct = cnt / len(vals) * 100
        # One bar character per 2 percentage points
        bar = '#' * int(pct / 2)
        print(f"    {label:20s}: {cnt:4d} ({pct:5.1f}%) {bar}")
# --- Completion before vs. after purchase ---
print(f"\n{'='*80}")
print("Prologue completion vs. purchase timing")
print(f"{'='*80}")
for level in ['L1', 'L2', 'ALL']:
    if level == 'ALL':
        data = with_prologue
    else:
        data = [r for r in with_prologue if r['level'] == level]
    if not data:
        continue
    pro_before_pur = [r for r in data if r['pro_to_pur'] >= 0]
    pro_after_pur = [r for r in data if r['pro_to_pur'] < 0]
    pro_same_day = [r for r in data if r['pro_to_pur'] == 0]
    print(f"\n[{level}]:")
    print(f"  Completed before or on the purchase day: {len(pro_before_pur)} ({len(pro_before_pur)/len(data)*100:.1f}%)")
    print(f"    Of which on the same day: {len(pro_same_day)} ({len(pro_same_day)/len(data)*100:.1f}%)")
    print(f"  Completed after purchase: {len(pro_after_pur)} ({len(pro_after_pur)/len(data)*100:.1f}%)")
# ========== Export detailed CSV ==========
print(f"\n{'='*80}")
print("Exporting detailed data...")
print(f"{'='*80}")
import csv
output_path = '/root/.openclaw/workspace/output/endor_purchase_analysis.csv'
with open(output_path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    # Column headers are left in Chinese so the exported schema is unchanged
    writer.writerow(['account_id', '等级', '注册日期', '序章完课日期', '购课日期',
                     '注册→完课(天)', '完课→购课(天)', '注册→购课(天)', '有序章完课'])
    for r in sorted(records, key=lambda x: x['purchase_date']):
        writer.writerow([
            r['account_id'], r['level'], r['register_date'], r['prologue_date'], r['purchase_date'],
            r['reg_to_pro'] if r['reg_to_pro'] is not None else '',
            r['pro_to_pur'] if r['pro_to_pur'] is not None else '',
            # The yes/no markers were lost in extraction; 是/否 is an assumed replacement
            r['reg_to_pur'], '是' if r['has_prologue'] else '否'
        ])
print(f"  Exported to: {output_path}")
cur.close()
conn.close()
print("\nAnalysis complete!")
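The `percentile` helper in the script above picks the element at index `int(n * p / 100)`, so for an even-sized sample its "median" is the upper of the two middle values rather than their mean. The sketch below contrasts that rule with `statistics.median` from the standard library. The day counts are made up, and `percentile_floor_index` is a hypothetical name that mirrors the script's logic; it is not part of the script.

```python
import statistics


def percentile_floor_index(sorted_vals, p):
    """Mirror of the script's rule: element at index int(n * p / 100), clamped."""
    if not sorted_vals:
        return None
    idx = int(len(sorted_vals) * p / 100)
    return sorted_vals[min(idx, len(sorted_vals) - 1)]


# Made-up registration-to-purchase day counts, already sorted
days = [0, 1, 3, 10]

print(percentile_floor_index(days, 50))  # 3: the upper of the two middle values
print(statistics.median(days))           # 2.0: the mean of the two middle values
```

For skewed day-count data the two conventions can differ noticeably on small groups, so it may be worth stating which one a reported median uses.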


@ -0,0 +1,500 @@
#!/usr/bin/env python3
"""
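In-app purchase user behavior analysis, V2.
Definition changes:
1. Analysis is per order: a user with separate L1 and L2 orders yields two records
   (orders whose goods map to 'L1+L2' are skipped in Step 1).
2. Prologue completion = all five U0 lessons (L01 to L05) completed; its date is that of the last lesson finished.
3. New: how many prologue lessons were completed at purchase time.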
"""
import psycopg2
from datetime import datetime, date
from collections import defaultdict, Counter
import os
import csv
# Read the password from the environment only; a fallback credential must not be committed
PG_PASS = os.environ['PG_ONLINE_PASSWORD']
conn = psycopg2.connect(
    host="bj-postgres-16pob4sg.sql.tencentcdb.com",
    port=28591,
    user="ai_member",
    password=PG_PASS,
    database="vala_bi"
)
cur = conn.cursor()
# U0 prologue: chapter_id -> lesson name
L1_U0_LESSONS = {343: 'L01', 344: 'L02', 345: 'L03', 346: 'L04', 348: 'L05'}
L2_U0_LESSONS = {55: 'L01', 56: 'L02', 57: 'L03', 58: 'L04', 59: 'L05'}
# goods_id -> course level
LEVEL_MAP = {
    57: 'L1', 60: 'L1', 63: 'L1',
    31: 'L2', 32: 'L2', 33: 'L2', 54: 'L2',
    61: 'L1+L2',
}
print("=" * 80)
print("In-app purchase user behavior analysis V2")
print("Definitions: prologue completion = all five U0 lessons done | per order | progress at purchase")
print("=" * 80)
# ========== Step 1: In-app paid orders (per order) ==========
print("\n[Step 1] Querying in-app paid orders...")
cur.execute("""
    SELECT
        o.id as order_id,
        o.account_id,
        o.goods_id,
        o.pay_success_date,
        o.trade_no
    FROM bi_vala_order o
    JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1
    WHERE o.key_from = 'app-active-h5-0-0'
      AND o.order_status IN (3, 4)
      AND o.pay_success_date IS NOT NULL
    ORDER BY o.pay_success_date
""")
orders = []  # list of {order_id, account_id, goods_id, purchase_date, level}
for row in cur.fetchall():
    order_id, account_id, goods_id, purchase_date, trade_no = row
    level = LEVEL_MAP.get(goods_id)
    # Skip unmapped goods and L1+L2 bundle orders
    if level is None or level == 'L1+L2':
        continue
    if isinstance(purchase_date, datetime):
        purchase_date = purchase_date.date()
    orders.append({
        'order_id': order_id,
        'account_id': account_id,
        'goods_id': goods_id,
        'purchase_date': purchase_date,
        'level': level
    })
print(f"  In-app paid orders: {len(orders)}")
print(f"    - L1 orders: {sum(1 for o in orders if o['level']=='L1')}")
print(f"    - L2 orders: {sum(1 for o in orders if o['level']=='L2')}")
# Count accounts that bought both L1 and L2 as separate orders
user_levels = defaultdict(set)
for o in orders:
    user_levels[o['account_id']].add(o['level'])
dual_users = [uid for uid, lvls in user_levels.items() if len(lvls) > 1]
print(f"  Accounts that bought both L1 and L2: {len(dual_users)}")
# ========== Step 2: Registration time ==========
print("\n[Step 2] Querying registration times...")
all_accounts = list(set(o['account_id'] for o in orders))
register_map = {}
batch_size = 500
for i in range(0, len(all_accounts), batch_size):
    batch = all_accounts[i:i+batch_size]
    cur.execute(
        "SELECT id, created_at FROM bi_vala_app_account WHERE id IN %s",
        (tuple(batch),)
    )
    for row in cur.fetchall():
        d = row[1]
        # Only datetime has .date(); a plain date is stored as-is
        register_map[row[0]] = d.date() if isinstance(d, datetime) else d
print(f"  Registration times found: {len(register_map)}")
# ========== Step 3: account_id -> user_id mapping ==========
print("\n[Step 3] Querying account-to-character mapping...")
char_map = defaultdict(list)
for i in range(0, len(all_accounts), batch_size):
    batch = all_accounts[i:i+batch_size]
    cur.execute(
        "SELECT account_id, id FROM bi_vala_app_character WHERE account_id IN %s",
        (tuple(batch),)
    )
    for row in cur.fetchall():
        char_map[row[0]].append(row[1])
print(f"  Accounts with at least one character: {sum(1 for uid, chars in char_map.items() if chars)}")
# ========== Step 4: Earliest completion time per prologue lesson ==========
print("\n[Step 4] Querying completion time per prologue lesson (8 sharded tables)...")
def query_lesson_completion(cur, user_ids, chapter_ids_dict):
    """
    Query the earliest completion date of each chapter_id for each user_id.
    Returns: {user_id: {chapter_id: earliest_date}}
    """
    result = defaultdict(dict)
    if not user_ids:
        return result
    # Build one UNION ALL over the 8 shards, then run it per batch of user_ids
    all_chapter_ids = list(chapter_ids_dict.keys())
    union_parts = []
    for t in range(8):
        union_parts.append(f"""
            SELECT user_id, chapter_id, MIN(created_at) as first_done
            FROM bi_user_chapter_play_record_{t}
            WHERE chapter_id IN ({','.join(map(str, all_chapter_ids))})
              AND play_status = 1
              AND user_id IN %(user_ids)s
            GROUP BY user_id, chapter_id
        """)
    sql = " UNION ALL ".join(union_parts)
    sql = f"""
        SELECT user_id, chapter_id, MIN(first_done) as earliest
        FROM ({sql}) sub
        GROUP BY user_id, chapter_id
    """
    uid_list = list(user_ids)
    for i in range(0, len(uid_list), 2000):
        batch = tuple(uid_list[i:i+2000])
        cur.execute(sql, {'user_ids': batch})
        for row in cur.fetchall():
            uid, cid, dt = row
            # Only datetime has .date(); a plain date is stored as-is
            result[uid][cid] = dt.date() if isinstance(dt, datetime) else dt
    return result
# Collect all user_ids
all_user_ids = set()
for acct in all_accounts:
    for uid in char_map.get(acct, []):
        all_user_ids.add(uid)
print(f"  Total user_ids: {len(all_user_ids)}")
# L1 prologue
l1_lesson_data = query_lesson_completion(cur, all_user_ids, L1_U0_LESSONS)
# L2 prologue
l2_lesson_data = query_lesson_completion(cur, all_user_ids, L2_U0_LESSONS)
print(f"  user_ids with L1 prologue completion records: {len(l1_lesson_data)}")
print(f"  user_ids with L2 prologue completion records: {len(l2_lesson_data)}")
# ========== Step 5: Prologue progress per order ==========
print("\n[Step 5] Analyzing prologue progress order by order...")
def analyze_user_prologue(user_ids, lesson_data, lesson_map):
    """
    Merge prologue completions across all characters of one account.
    Returns: {chapter_id: earliest_completion_date} (completed lessons only)
    """
    merged = {}
    for uid in user_ids:
        if uid not in lesson_data:
            continue
        for cid, dt in lesson_data[uid].items():
            if cid not in merged or dt < merged[cid]:
                merged[cid] = dt
    return merged
def get_prologue_status(lesson_completions, lesson_map, purchase_date):
    """
    Summarize prologue completion relative to the purchase date:
    - total_done: lessons completed at any time
    - done_before: lessons completed on or before the purchase date
    - done_after: lessons completed after the purchase date
    - all_five_done: whether all five lessons are completed
    - all_five_done_date: date of the last lesson completed, if all five are done
    - progress_at_purchase: number of lessons completed at purchase time
    """
    total_done = len(lesson_completions)
    done_before = {}
    done_after = {}
    for cid, dt in lesson_completions.items():
        if dt <= purchase_date:
            done_before[cid] = dt
        else:
            done_after[cid] = dt
    # Whether all five lessons are completed
    all_five_done = total_done == 5
    all_five_done_date = max(lesson_completions.values()) if all_five_done else None
    # Number of lessons completed at purchase time
    progress_at_purchase = len(done_before)
    # Names of the lessons completed at purchase time
    lessons_done_at_purchase = sorted(
        [lesson_map[cid] for cid in done_before],
        key=lambda x: int(x[1:])  # L01, L02, ...
    )
    return {
        'total_done': total_done,
        'done_before_count': len(done_before),
        'done_after_count': len(done_after),
        'all_five_done': all_five_done,
        'all_five_done_date': all_five_done_date,
        'progress_at_purchase': progress_at_purchase,
        'lessons_done_at_purchase': lessons_done_at_purchase,
        'done_before': done_before,
        'done_after': done_after,
    }
records = []
for o in orders:
    account_id = o['account_id']
    level = o['level']
    purchase_date = o['purchase_date']
    register_date = register_map.get(account_id)
    if register_date is None:
        continue
    user_ids = char_map.get(account_id, [])
    lesson_map = L1_U0_LESSONS if level == 'L1' else L2_U0_LESSONS
    lesson_data = l1_lesson_data if level == 'L1' else l2_lesson_data
    user_completions = analyze_user_prologue(user_ids, lesson_data, lesson_map)
    status = get_prologue_status(user_completions, lesson_map, purchase_date)
    # Days from registration to purchase
    reg_to_pur = (purchase_date - register_date).days
    # Days from registration to full prologue completion
    reg_to_all5 = None
    if status['all_five_done_date']:
        reg_to_all5 = (status['all_five_done_date'] - register_date).days
    # Days from full prologue completion to purchase (negative if purchased first)
    all5_to_pur = None
    if status['all_five_done_date']:
        all5_to_pur = (purchase_date - status['all_five_done_date']).days
    records.append({
        'order_id': o['order_id'],
        'account_id': account_id,
        'level': level,
        'register_date': register_date,
        'purchase_date': purchase_date,
        'reg_to_pur': reg_to_pur,
        # Prologue progress
        'total_lessons_done': status['total_done'],
        'lessons_done_before_purchase': status['done_before_count'],
        'lessons_done_after_purchase': status['done_after_count'],
        'progress_at_purchase': status['progress_at_purchase'],
        'lessons_at_purchase': ','.join(status['lessons_done_at_purchase']),
        # All five lessons completed
        'all_five_done': status['all_five_done'],
        'all_five_done_date': status['all_five_done_date'],
        'reg_to_all5': reg_to_all5,
        'all5_to_pur': all5_to_pur,
    })
print(f"  Records analyzed: {len(records)}")
# ========== Step 6: Statistics ==========
print(f"\n{'='*80}")
print("[Overview]")
print(f"{'='*80}")
total_records = len(records)
all5_complete = [r for r in records if r['all_five_done']]
all5_incomplete = [r for r in records if not r['all_five_done']]
l1_records = [r for r in records if r['level'] == 'L1']
l2_records = [r for r in records if r['level'] == 'L2']
print(f"Total records (per order): {total_records}")
print(f"  - L1 orders: {len(l1_records)}")
print(f"  - L2 orders: {len(l2_records)}")
print(f"  All five prologue lessons completed: {len(all5_complete)}")
print(f"  Not all five prologue lessons completed: {len(all5_incomplete)}")
# ========== Prologue progress at purchase ==========
print(f"\n{'='*80}")
print("[Key metric: prologue progress at purchase]")
print(f"{'='*80}")
for level in ['L1', 'L2', 'ALL']:
    if level == 'ALL':
        data = records
    else:
        data = [r for r in records if r['level'] == level]
    if not data:
        continue
    print(f"\n[{level}] Prologue lessons completed at purchase:")
    progress_dist = Counter(r['progress_at_purchase'] for r in data)
    for p in range(6):  # 0 to 5
        cnt = progress_dist.get(p, 0)
        pct = cnt / len(data) * 100
        label = f"  {p} of 5 done" if p > 0 else "  0 done (not started)"
        # One bar character per percentage point
        bar = '#' * int(pct)
        print(f"  {label:24s}: {cnt:4d} ({pct:5.1f}%) {bar}")
    # Which specific lessons were done
    print(f"\n[{level}] Lesson combinations completed at purchase:")
    lesson_dist = Counter()
    for r in data:
        if r['progress_at_purchase'] == 0:
            lesson_dist['(no lesson completed)'] += 1
        else:
            lesson_dist[r['lessons_at_purchase']] += 1
    for lesson_combo, cnt in lesson_dist.most_common(15):
        pct = cnt / len(data) * 100
        label = f"  {lesson_combo}"
        print(f"  {label:24s}: {cnt:4d} ({pct:5.1f}%)")
# ========== Prologue lessons completed after purchase ==========
print(f"\n{'='*80}")
print("[Prologue progress after purchase]")
print(f"{'='*80}")
for level in ['L1', 'L2', 'ALL']:
    if level == 'ALL':
        data = records
    else:
        data = [r for r in records if r['level'] == level]
    if not data:
        continue
    done_at_purchase = sum(1 for r in data if r['progress_at_purchase'] == 5)
    continued_after = sum(1 for r in data if r['lessons_done_after_purchase'] > 0)
    print(f"\n[{level}]:")
    print(f"  Prologue fully completed at purchase (5/5): {done_at_purchase} ({done_at_purchase/len(data)*100:.1f}%)")
    print(f"  Completed further prologue lessons after purchase: {continued_after} ({continued_after/len(data)*100:.1f}%)")
    # Lessons completed after purchase, among orders below 5/5 when purchased
    unfinished = [r for r in data if r['progress_at_purchase'] < 5]
    after_dist = Counter(r['lessons_done_after_purchase'] for r in unfinished)
    if after_dist:
        print("  Orders below 5/5 at purchase, lessons completed afterwards:")
        for n in sorted(after_dist.keys()):
            cnt = after_dist[n]
            pct = cnt / len(unfinished) * 100
            print(f"    {n} more lessons: {cnt} ({pct:.1f}%)")
# ========== Days from registration to purchase ==========
def percentile_vals(vals):
    """Return P10/P25/P50/P75/P90 (element at index int(n * p)), mean and count; {} if empty."""
    if not vals:
        return {}
    s = sorted(vals)
    return {
        'P10': s[int(len(s)*0.10)],
        'P25': s[int(len(s)*0.25)],
        'P50': s[int(len(s)*0.50)],
        'P75': s[int(len(s)*0.75)],
        'P90': s[int(len(s)*0.90)],
        'mean': sum(s)/len(s),
        'count': len(s),
    }
print(f"\n{'='*80}")
print("[Registration→purchase, days]")
print(f"{'='*80}")
for level in ['L1', 'L2', 'ALL']:
    if level == 'ALL':
        data = records
    else:
        data = [r for r in records if r['level'] == level]
    stat = percentile_vals([r['reg_to_pur'] for r in data])
    if stat:
        print(f"\n[{level}] registration→purchase:")
        print(f"  Sample: {stat['count']}, median: {stat['P50']} days, mean: {stat['mean']:.1f} days")
        print(f"  P10={stat['P10']}d | P25={stat['P25']}d | P75={stat['P75']}d | P90={stat['P90']}d")
# ========== Orders with all five prologue lessons completed ==========
print(f"\n{'='*80}")
print("[All five prologue lessons completed (restricted to 5/5)]")
print(f"{'='*80}")
all5_l1 = [r for r in all5_complete if r['level']=='L1']
all5_l2 = [r for r in all5_complete if r['level']=='L2']
print(f"\nAll five completed: {len(all5_complete)}")
print(f"  L1: {len(all5_l1)}")
print(f"  L2: {len(all5_l2)}")
# Registration → full completion
for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]:
    vals = [r['reg_to_all5'] for r in data if r['reg_to_all5'] is not None]
    stat = percentile_vals(vals)
    if stat:
        print(f"\n[{name}] registration→all five prologue lessons completed:")
        print(f"  Sample: {stat['count']}, median: {stat['P50']} days, mean: {stat['mean']:.1f} days")
        print(f"  P10={stat['P10']}d | P25={stat['P25']}d | P75={stat['P75']}d | P90={stat['P90']}d")
# Full completion → purchase
for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]:
    vals = [r['all5_to_pur'] for r in data if r['all5_to_pur'] is not None]
    stat = percentile_vals(vals)
    if stat:
        print(f"\n[{name}] all five prologue lessons completed→purchase:")
        print(f"  Sample: {stat['count']}, median: {stat['P50']} days, mean: {stat['mean']:.1f} days")
        print(f"  P10={stat['P10']}d | P25={stat['P25']}d | P75={stat['P75']}d | P90={stat['P90']}d")
# Full completion → purchase bucket distribution
print("\n[All five completed→purchase distribution]")
print(f"{'='*60}")
buckets = {
    '<0 (bought first)': (-99999, -1),
    '0 days (same day)': (0, 0),
    '1-3 days': (1, 3),
    '4-7 days': (4, 7),
    '8-14 days': (8, 14),
    '15-30 days': (15, 30),
    '31-60 days': (31, 60),
    '61-90 days': (61, 90),
    'over 90 days': (91, 99999),
}
for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]:
    vals = [r['all5_to_pur'] for r in data if r['all5_to_pur'] is not None]
    if not vals:
        continue
    print(f"\n[{name}]:")
    for label, (lo, hi) in buckets.items():
        cnt = sum(1 for v in vals if lo <= v <= hi)
        pct = cnt / len(vals) * 100
        # One bar character per 2 percentage points
        bar = '#' * int(pct / 2)
        print(f"  {label:24s}: {cnt:4d} ({pct:5.1f}%) {bar}")
# ========== Timing relation ==========
print(f"\n{'='*80}")
print("[All five completed: prologue vs. purchase timing]")
print(f"{'='*80}")
for name, data in [('L1', all5_l1), ('L2', all5_l2), ('ALL', all5_complete)]:
    # Guard against an empty group, which would otherwise divide by zero below
    if not data:
        continue
    pro_before = sum(1 for r in data if r['all5_to_pur'] is not None and r['all5_to_pur'] >= 0)
    pro_after = sum(1 for r in data if r['all5_to_pur'] is not None and r['all5_to_pur'] < 0)
    pro_same = sum(1 for r in data if r['all5_to_pur'] == 0)
    print(f"\n[{name}]:")
    print(f"  Prologue fully completed before or on the purchase day: {pro_before} ({pro_before/len(data)*100:.1f}%)")
    print(f"    Of which on the same day: {pro_same} ({pro_same/len(data)*100:.1f}%)")
    print(f"  Prologue fully completed after purchase: {pro_after} ({pro_after/len(data)*100:.1f}%)")
# ========== Export CSV ==========
output_path = '/root/.openclaw/workspace/output/endor_purchase_analysis_v2.csv'
with open(output_path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    # Column headers are left in Chinese so the exported schema is unchanged
    writer.writerow([
        'order_id', 'account_id', '等级', '注册日期', '购课日期',
        '注册→购课(天)',
        '序章完成节数(总计)', '购课前进度(节)', '购课后完成(节)',
        '购课时已完成', '五节全完成', '五节全完成日期',
        '注册→全完成(天)', '全完成→购课(天)'
    ])
    for r in sorted(records, key=lambda x: x['purchase_date']):
        writer.writerow([
            r['order_id'], r['account_id'], r['level'],
            r['register_date'].isoformat() if r['register_date'] else '',
            r['purchase_date'].isoformat() if r['purchase_date'] else '',
            r['reg_to_pur'],
            r['total_lessons_done'], r['progress_at_purchase'],
            r['lessons_done_after_purchase'],
            r['lessons_at_purchase'],
            # The yes/no markers were lost in extraction; 是/否 is an assumed replacement
            '是' if r['all_five_done'] else '否',
            r['all_five_done_date'].isoformat() if r['all_five_done_date'] else '',
            r['reg_to_all5'] if r['reg_to_all5'] is not None else '',
            r['all5_to_pur'] if r['all5_to_pur'] is not None else '',
        ])
print(f"\n{'='*80}")
print(f"Detailed data exported: {output_path}")
print(f"{'='*80}")
cur.close()
conn.close()
print("\nAnalysis complete!")
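The V2 definition ("prologue completion = all five U0 lessons done, dated by the last lesson; progress at purchase = lessons done on or before the purchase date") can be checked on a small hand-made case. The sketch below is a simplified stand-in for `get_prologue_status`, not the script's function: `prologue_progress` and the dates are hypothetical, chosen only to show how a negative "completion→purchase" value arises when the purchase comes first.

```python
from datetime import date


def prologue_progress(completions, purchase_date, total_lessons=5):
    """completions: {lesson_name: first completion date}.

    Returns (lessons done on or before purchase_date, date all lessons were done or None).
    """
    done_at_purchase = sum(1 for d in completions.values() if d <= purchase_date)
    all_done = len(completions) == total_lessons
    all_done_date = max(completions.values()) if all_done else None
    return done_at_purchase, all_done_date


# Hypothetical account: three lessons before the purchase, two after it
completions = {
    'L01': date(2026, 5, 1),
    'L02': date(2026, 5, 1),
    'L03': date(2026, 5, 2),
    'L04': date(2026, 5, 6),
    'L05': date(2026, 5, 7),
}
purchase = date(2026, 5, 3)

progress, finished_on = prologue_progress(completions, purchase)
days_from_finish_to_purchase = (purchase - finished_on).days
print(progress, finished_on, days_from_finish_to_purchase)  # 3 2026-05-07 -4
```

Under this definition the same order counts as "3 lessons done at purchase" and also as "all five completed", which is why the report tracks both figures separately.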


@ -0,0 +1,370 @@
#!/usr/bin/env python3
"""
In-app purchase user behavior analysis, V3: detailed cross-tabulations
"""
import psycopg2
from datetime import datetime, date
from collections import defaultdict, Counter
import csv
import os
# Read the password from the environment only; a fallback credential must not be committed
PG_PASS = os.environ['PG_ONLINE_PASSWORD']
conn = psycopg2.connect(
    host="bj-postgres-16pob4sg.sql.tencentcdb.com", port=28591,
    user="ai_member", password=PG_PASS, database="vala_bi"
)
cur = conn.cursor()
L1_U0_LESSONS = {343: 'L01', 344: 'L02', 345: 'L03', 346: 'L04', 348: 'L05'}
L2_U0_LESSONS = {55: 'L01', 56: 'L02', 57: 'L03', 58: 'L04', 59: 'L05'}
# goods_id -> course level
LEVEL_MAP = {57: 'L1', 60: 'L1', 63: 'L1', 31: 'L2', 32: 'L2', 33: 'L2', 54: 'L2'}
print("=" * 80)
print("In-app purchase user behavior analysis V3: detailed cross-tabulations")
print("=" * 80)
# ============ Load data ============
print("\n[Loading data]")
cur.execute("""
    SELECT o.id, o.account_id, o.goods_id, o.pay_success_date
    FROM bi_vala_order o
    JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1
    WHERE o.key_from = 'app-active-h5-0-0' AND o.order_status IN (3, 4) AND o.pay_success_date IS NOT NULL
    ORDER BY o.pay_success_date
""")
orders = []
for row in cur.fetchall():
    lv = LEVEL_MAP.get(row[2])
    if not lv:
        continue
    pd_ = row[3]
    orders.append({
        'order_id': row[0], 'account_id': row[1], 'goods_id': row[2],
        'purchase_date': pd_.date() if isinstance(pd_, datetime) else pd_,
        'level': lv,
    })
print(f"  Orders: {len(orders)}")
all_accounts = list(set(o['account_id'] for o in orders))
register_map = {}
for i in range(0, len(all_accounts), 500):
    cur.execute("SELECT id,created_at FROM bi_vala_app_account WHERE id IN %s", (tuple(all_accounts[i:i+500]),))
    for r in cur.fetchall():
        d = r[1]
        # Only datetime has .date(); a plain date is stored as-is
        register_map[r[0]] = d.date() if isinstance(d, datetime) else d
print(f"  Registration times: {len(register_map)}")
char_map = defaultdict(list)
for i in range(0, len(all_accounts), 500):
    cur.execute("SELECT account_id,id FROM bi_vala_app_character WHERE account_id IN %s", (tuple(all_accounts[i:i+500]),))
    for r in cur.fetchall():
        char_map[r[0]].append(r[1])
print(f"  Character mapping: {sum(1 for k,v in char_map.items() if v)} accounts")
all_user_ids = set()
for acct in all_accounts:
    for uid in char_map.get(acct, []):
        all_user_ids.add(uid)
def query_lesson_completion(user_ids, chapter_ids_dict):
    """Return {user_id: {chapter_id: earliest completion date}} across the 8 sharded tables."""
    result = defaultdict(dict)
    if not user_ids:
        return result
    cids = list(chapter_ids_dict.keys())
    union = []
    for t in range(8):
        union.append(f"""
            SELECT user_id, chapter_id, MIN(created_at) as fd
            FROM bi_user_chapter_play_record_{t}
            WHERE chapter_id IN ({','.join(map(str,cids))})
              AND play_status=1 AND user_id IN %(uids)s
            GROUP BY user_id, chapter_id
        """)
    sql = f"SELECT user_id, chapter_id, MIN(fd) FROM ({' UNION ALL '.join(union)}) sub GROUP BY user_id, chapter_id"
    for i in range(0, len(user_ids), 2000):
        cur.execute(sql, {'uids': tuple(user_ids[i:i+2000])})
        for r in cur.fetchall():
            # Only datetime has .date(); a plain date is stored as-is
            result[r[0]][r[1]] = r[2].date() if isinstance(r[2], datetime) else r[2]
    return result
l1_data = query_lesson_completion(list(all_user_ids), L1_U0_LESSONS)
l2_data = query_lesson_completion(list(all_user_ids), L2_U0_LESSONS)
print(f"  L1 prologue completions: {len(l1_data)} user_ids, L2 prologue completions: {len(l2_data)} user_ids")
# ============ Per-order analysis ============
records = []
for o in orders:
    lesson_map = L1_U0_LESSONS if o['level']=='L1' else L2_U0_LESSONS
    lesson_data = l1_data if o['level']=='L1' else l2_data
    reg_d = register_map.get(o['account_id'])
    if not reg_d:
        continue
    pur_d = o['purchase_date']
    # Merge prologue completions across all characters of the account
    merged = {}
    for uid in char_map.get(o['account_id'], []):
        if uid not in lesson_data:
            continue
        for cid, dt in lesson_data[uid].items():
            if cid not in merged or dt < merged[cid]:
                merged[cid] = dt
    # Completion date of each lesson and days since registration
    lesson_dates = {}
    for cid, dt in merged.items():
        lesson_dates[lesson_map[cid]] = {'date': dt, 'days_from_reg': (dt - reg_d).days}
    # Lessons completed on or before the purchase date vs. after it
    done_before = {ln: info for ln, info in lesson_dates.items() if info['date'] <= pur_d}
    done_after = {ln: info for ln, info in lesson_dates.items() if info['date'] > pur_d}
    all5_done = len(lesson_dates) == 5
    all5_date = max(info['date'] for info in lesson_dates.values()) if all5_done else None
    all5_to_pur = (pur_d - all5_date).days if all5_date else None
    reg_to_all5 = (all5_date - reg_d).days if all5_date else None
    # Progress at purchase: how many lessons, and which ones
    progress = len(done_before)
    lessons_done = sorted(done_before.keys(), key=lambda x: int(x[1:]))
    # Where the purchase falls between lessons, e.g. L01 and L02 done means "after L02".
    # These labels are kept in Chinese because they must match the `positions` list in cross-tab 3.
    if progress == 0:
        purchase_position = "L01之前"
    elif progress == 5:
        purchase_position = "L05之后"
    else:
        purchase_position = f"{lessons_done[-1]}之后(L{int(lessons_done[-1][1:])+1:02d}之前)"
    records.append({
        'order_id': o['order_id'], 'account_id': o['account_id'],
        'level': o['level'], 'register_date': reg_d, 'purchase_date': pur_d,
        'reg_to_pur': (pur_d - reg_d).days,
        'progress': progress, 'lessons_done': lessons_done,
        'done_before': done_before, 'done_after': done_after,
        'all5_done': all5_done, 'all5_date': all5_date,
        'reg_to_all5': reg_to_all5, 'all5_to_pur': all5_to_pur,
        'purchase_position': purchase_position,
        'lesson_dates': lesson_dates,
    })
print(f"  Records analyzed: {len(records)}")
l1_recs = [r for r in records if r['level']=='L1']
l2_recs = [r for r in records if r['level']=='L2']
all5_recs = [r for r in records if r['all5_done']]
# ====================================================================
# Cross-tab 1: lessons completed at purchase × days from registration to purchase
# ====================================================================
print(f"\n{'='*80}")
print("[Cross-tab 1] Lessons completed at purchase × days from registration to purchase")
print(f"{'='*80}")
for level, data in [('L1',l1_recs),('L2',l2_recs),('ALL',records)]:
    print(f"\n[{level}]")
    # Build the cross table
    day_ranges = [(0,1),(2,3),(4,7),(8,14),(15,30),(31,60),(61,999)]
    print(f"  {'Progress':<20}", end='')
    for lo, hi in day_ranges:
        label = f"{lo}d" if lo==hi else f"{lo}-{hi}d"
        print(f"{label:>10}", end='')
    print(f" {'Total':>8}")
    for prog in range(6):
        subset = [r for r in data if r['progress']==prog]
        label = f"{prog} done"
        print(f"  {label:<20}", end='')
        total = 0
        for lo, hi in day_ranges:
            cnt = sum(1 for r in subset if lo <= r['reg_to_pur'] <= hi)
            print(f"{cnt:>10}", end='')
            total += cnt
        print(f" {total:>8}")
# ====================================================================
# Cross-tab 2: lessons completed at purchase × days from full completion to purchase (5/5 only)
# ====================================================================
print(f"\n{'='*80}")
print("[Cross-tab 2] Lessons completed at purchase × days from full completion to purchase (5/5 only)")
print(f"{'='*80}")
all5_day_ranges = [(-999,-1),(0,0),(1,3),(4,7),(8,14),(15,30),(31,60),(61,999)]
for level, data in [('L1',[r for r in l1_recs if r['all5_done']]),
                    ('L2',[r for r in l2_recs if r['all5_done']]),
                    ('ALL',all5_recs)]:
    print(f"\n[{level}]")
    print(f"  {'Progress':<20}", end='')
    for lo, hi in all5_day_ranges:
        if lo==-999:
            label = "bought first"
        elif lo==hi:
            label = f"{lo}d"
        else:
            label = f"{lo}-{hi}d"
        print(f"{label:>14}", end='')
    print(f" {'Total':>8}")
    for prog in range(6):
        subset = [r for r in data if r['progress']==prog]
        label = f"{prog} done"
        print(f"  {label:<20}", end='')
        total = 0
        for lo, hi in all5_day_ranges:
            cnt = sum(1 for r in subset if r['all5_to_pur'] is not None and lo <= r['all5_to_pur'] <= hi)
            print(f"{cnt:>14}", end='')
            total += cnt
        print(f" {total:>8}")
# ====================================================================
# Cross-tab 3: where the purchase falls between prologue lessons
# ====================================================================
print(f"\n{'='*80}")
print("[Cross-tab 3] Which prologue lesson the purchase follows")
print(f"{'='*80}")
for level, data in [('L1',l1_recs),('L2',l2_recs),('ALL',records)]:
    dist = Counter(r['purchase_position'] for r in data)
    print(f"\n[{level}]")
    # Labels stay in Chinese because they must match the purchase_position values built per order
    positions = ['L01之前','L01之后(L02之前)','L02之后(L03之前)','L03之后(L04之前)','L04之后(L05之前)','L05之后']
    for pos in positions:
        cnt = dist.get(pos, 0)
        pct = cnt/len(data)*100
        # One bar character per percentage point
        bar = '#' * int(pct)
        print(f"  {pos:24s}: {cnt:4d} ({pct:5.1f}%) {bar}")
# ====================================================================
# Cross-tab 4: days from registration to completion, per prologue lesson
# ====================================================================
print(f"\n{'='*80}")
print("[Cross-tab 4] Days from registration to completion for each prologue lesson")
print(f"{'='*80}")
lesson_days_ranges = [(0,0),(1,1),(2,3),(4,7),(8,14),(15,30),(31,999)]
lesson_order = ['L01','L02','L03','L04','L05']
for level, data in [('L1',l1_recs),('L2',l2_recs),('ALL',records)]:
    print(f"\n[{level}] ({len(data)} records)")
    print(f"  {'Lesson':<8}", end='')
    for lo, hi in lesson_days_ranges:
        label = f"{lo}d" if lo==hi else f"{lo}-{hi}d"
        # Width 12 matches the "count(pct%)" cells printed for each lesson
        print(f"{label:>12}", end='')
    print(f" {'Completed':>10} {'Median':>8} {'Mean':>8}")
    for ln in lesson_order:
        vals = []
        for r in data:
            if ln in r['lesson_dates']:
                vals.append(r['lesson_dates'][ln]['days_from_reg'])
        if not vals:
            print(f"  {ln:<8} {'(no data)':>60}")
            continue
        sv = sorted(vals)
        med = sv[len(sv)//2]
        avg = sum(vals)/len(vals)
        print(f"  {ln:<8}", end='')
        for lo, hi in lesson_days_ranges:
            cnt = sum(1 for v in vals if lo <= v <= hi)
            pct = cnt / len(vals) * 100
            print(f"{cnt:>5}({pct:4.1f}%)", end='')
        print(f" {len(vals):>10} {med:>8} {avg:>8.1f}")
# ====================================================================
# Cross-tab 5: lessons completed at purchase × L1/L2 × whether the rest was completed later
# ====================================================================
print(f"\n{'='*80}")
print("[Cross-tab 5] Orders below 5/5 at purchase: completion of the remaining prologue lessons")
print(f"{'='*80}")
not_all5_at_purchase = [r for r in records if r['progress'] < 5]
for level, data in [('L1',[r for r in l1_recs if r['progress']<5]),
                    ('L2',[r for r in l2_recs if r['progress']<5]),
                    ('ALL',not_all5_at_purchase)]:
    if not data:
        continue
    print(f"\n[{level}] below 5/5 at purchase: {len(data)}")
    for prog in range(5):
        subset = [r for r in data if r['progress']==prog]
        if not subset:
            continue
        after_done_dist = Counter(r['all5_done'] for r in subset)
        all5_cnt = after_done_dist.get(True, 0)
        not5_cnt = after_done_dist.get(False, 0)
        print(f"  {prog} done at purchase: {len(subset)} → eventually completed all 5: {all5_cnt}, still incomplete: {not5_cnt}")
# ====================================================================
# Cross-tab 6: records where the purchase came before full prologue completion
# ====================================================================
print(f"\n{'='*80}")
print("[Cross-tab 6] Purchases made before all five prologue lessons were completed")
print(f"{'='*80}")
buy_before_finish = [r for r in all5_recs if r['all5_to_pur'] is not None and r['all5_to_pur'] < 0]
print(f"\n{len(buy_before_finish)} records completed all five prologue lessons only after purchase\n")
print(f"  {'Level':<6} {'Progress':<14} {'Purchased':<14} {'All 5 done':<14} {'Buy→done':>10} {'Reg→buy':>10} {'Reg→done':>10}")
for r in sorted(buy_before_finish, key=lambda x: x['all5_to_pur']):
    progress_label = f"{r['progress']} done"
    print(f"  {r['level']:<6} {progress_label:<14} {str(r['purchase_date']):<14} {str(r['all5_date']):<14} "
          f"{abs(r['all5_to_pur']):>10} {r['reg_to_pur']:>10} {r['reg_to_all5']:>10}")
# ====================================================================
# Cross-tab 7: monthly trend of lessons completed at purchase
# ====================================================================
print(f"\n{'='*80}")
print("[Cross-tab 7] Monthly trend: share of orders at 5/5 when purchased")
print(f"{'='*80}")
monthly = defaultdict(lambda: {'total':0,'all5':0,'partial':0})
for r in records:
    m = r['purchase_date'].strftime('%Y-%m')
    monthly[m]['total'] += 1
    if r['progress'] == 5:
        monthly[m]['all5'] += 1
    else:
        monthly[m]['partial'] += 1
for m in sorted(monthly.keys()):
    d = monthly[m]
    pct = d['all5']/d['total']*100
    print(f"  {m}: total {d['total']:>4} orders | 5/5 at purchase={d['all5']:>3} ({pct:5.1f}%) | below 5/5={d['partial']:>2}")
# ====================================================================
# Cross-tab 8: registration→purchase, stratified by progress at purchase
# ====================================================================
print(f"\n{'='*80}")
print("[Cross 8] Registration→purchase days, stratified by progress at purchase")
print(f"{'='*80}")
def pct(vals, p):
    """Return the p-th percentile of vals by sorted index (no interpolation), or '-' if empty."""
    if not vals:
        return '-'
    s = sorted(vals)
    # Clamp the index so that p=100 cannot run past the end of the list
    return s[min(len(s) - 1, len(s) * p // 100)]
for level, data in [('L1', l1_recs), ('L2', l2_recs), ('ALL', records)]:
    print(f"\n[{level}]")
    print(f"  {'Progress at purchase':<22} {'Users':>6} {'P10':>6} {'P25':>6} {'Median':>6} {'P75':>6} {'P90':>6} {'Mean':>8}")
    for prog in range(6):
        subset = [r for r in data if r['progress'] == prog]
        if not subset:
            continue
        vals = [r['reg_to_pur'] for r in subset]
        print(f"  {str(prog)+' done':<22} {len(subset):>6} "
              f"{pct(vals,10):>6} {pct(vals,25):>6} {pct(vals,50):>6} "
              f"{pct(vals,75):>6} {pct(vals,90):>6} {sum(vals)/len(vals):>8.1f}")
# ====================================================================
# Export CSV (with per-lesson prologue completion details)
# ====================================================================
output_path = '/root/.openclaw/workspace/output/endor_purchase_analysis_v3.csv'
with open(output_path, 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow([
        'order_id', 'account_id', 'Level', 'Registration date', 'Purchase date', 'Reg→purchase (days)',
        'Prologue progress at purchase (lessons)', 'Lessons completed at purchase',
        'L01 completion date', 'L01 days after registration', 'L02 completion date', 'L02 days after registration',
        'L03 completion date', 'L03 days after registration', 'L04 completion date', 'L04 days after registration',
        'L05 completion date', 'L05 days after registration',
        'All five completed', 'All-five completion date', 'Reg→all five (days)', 'All five→purchase (days)',
        'Purchase timing'
    ])
    for r in sorted(records, key=lambda x: x['purchase_date']):
        row = [r['order_id'], r['account_id'], r['level'],
               r['register_date'].isoformat(), r['purchase_date'].isoformat(), r['reg_to_pur'],
               r['progress'], ','.join(r['lessons_done'])]
        for ln in lesson_order:
            # Lessons the user never completed have no entry; leave both cells blank
            info = r['lesson_dates'].get(ln, {})
            done_date = info.get('date')
            row.append(done_date.isoformat() if done_date else '')
            row.append(info.get('days_from_reg', ''))
        # The Y/N markers are an assumption: both literals are blank in the source
        row.append('Y' if r['all5_done'] else 'N')
        row.append(r['all5_date'].isoformat() if r['all5_date'] else '')
        row.append(r['reg_to_all5'] if r['reg_to_all5'] is not None else '')
        row.append(r['all5_to_pur'] if r['all5_to_pur'] is not None else '')
        row.append(r['purchase_position'])
        writer.writerow(row)
print(f"\n{'='*80}")
print(f"Exported: {output_path}")
print(f"{'='*80}")
cur.close()
conn.close()
print("Analysis complete!")