diff --git a/MEMORY.md b/MEMORY.md index 5976b87..507157e 100644 --- a/MEMORY.md +++ b/MEMORY.md @@ -85,6 +85,29 @@ 1. `bi_refund_order` 表中 `status = 3`(退费成功) 2. `bi_vala_order` 表中 `order_status = 4`(订单状态为已退款) 两个条件缺一不可,避免统计错误。 + - **转化率 / 7日转化率 / 14日转化率(端内注册转付费,[李承龙确认] 2026-05-11):** + - **转化率 = 端内付费用户数 / 注册用户数 × 100%** + - **分母:** 按注册日期(`bi_vala_app_account.created_at`)分组,`status=1` 且 `deleted_at IS NULL` 的非测试、未删除账号 + - **分子(含退费):** 分母用户中,在端内(`key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')`)有支付成功订单的去重用户数 + - **分子(剔除退费):** 同上,但仅剔除端内订单**全部被退费**的用户——即只要用户还有任何一笔未退费的端内订单就保留(退费判定:`bi_refund_order.status=3` 且 `bi_vala_order.order_status=4`) + - **订单状态限定:** 端内订单筛选 `order_status IN (3, 4)`,即已完成或已退款 + - **时间基准:** 按用户注册日期分组,不限制订单发生时间(7日/14日除外) + - **订单时间字段:** `pay_success_date`(支付成功时间) + - **7日转化率:** 分子限制 `pay_success_date ≤ 注册日期 + 7天`(含注册当日) + - **14日转化率:** 分子限制 `pay_success_date ≤ 注册日期 + 14天`(含注册当日) + - **20日转化率 [李承龙确认 2026-05-11]:** 分子限制 `pay_success_date ≤ 注册日期 + 20天`(含注册当日),30天内趋势已足够清晰且覆盖大部分转化 + - **纯净版新增注册用户数 & 纯净版转化率 [李承龙确认 2026-05-11]:** + - **纯净版分母:** 从 `status=1 AND deleted_at IS NULL` 的注册用户中,剔除「只有端外已完成订单(`key_from NOT IN 端内,order_status=3`)且没有任何端内订单」的用户。即:只有那些选择了端外渠道、从未在端内下单的用户才被剔除。 + - **保留的用户:** 没有任何订单的纯注册用户 + 有端内订单的用户(无论是否有端外订单) + - **端内订单条件:** `key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')`, `pay_success_date IS NOT NULL`, `order_status IN (3, 4)` + - **端外订单条件:** `key_from NOT IN 端内`, `pay_success_date IS NOT NULL`, `order_status = 3` + - 基于纯净版分母,转化率 / 7日 / 14日 / 20日转化率的口径不变,只是分母缩小为纯净版用户 + - **拟合版转化率 [李承龙确认 2026-05-11]:** + - **分母:** 用 LOESS 回归拟合每日新增注册人数基线,剔除营销活动带来的注册量尖峰后的有效注册人数 + - **方法:** 活动日及余波日 → 用拟合值替代实际值;非活动日且实际低于拟合 → 保留实际值(保底规则) + - **不考虑端外订单:** 拟合版分母直接使用拟合有效注册人数,不额外剔除端外-only用户 + - **分子:** 端内付费用户数,口径与原始版一致(`key_from IN 端内`, `order_status IN (3,4)`) + - 拟合版更大程度「去噪」,转化率数字更高、波动更明显,反映剔除活动低质流量后的真实转化效率 - **关键词订单统计规则:** 当查询形如"XX卖了多少单/XX渠道销量"(XX为特定名称/关键词/渠道)时,需同时返回四个指标:订单总数量、GMV、GSV、退费率。 1. 统计逻辑:筛选`bi_vala_order`表中`key_from`字段包含该关键词的所有订单 2. 
指标说明: diff --git a/memory/.dreams/short-term-recall.json b/memory/.dreams/short-term-recall.json index 0b50ecc..3f3a3a0 100644 --- a/memory/.dreams/short-term-recall.json +++ b/memory/.dreams/short-term-recall.json @@ -1,6 +1,6 @@ { "version": 1, - "updatedAt": "2026-05-07T04:41:47.636Z", + "updatedAt": "2026-05-11T02:51:50.465Z", "entries": { "memory:memory/2026-05-06.md:1:20": { "key": "memory:memory/2026-05-06.md:1:20", @@ -63,6 +63,37 @@ "created-at", "购买渠道key-from" ] + }, + "memory:memory/2026-03-01.md:1:12": { + "key": "memory:memory/2026-03-01.md:1:12", + "path": "memory/2026-03-01.md", + "startLine": 1, + "endLine": 12, + "source": "memory", + "snippet": "# 2026-03-01.md - First Day Online - Came online for the first time. - Met Cris, my creator and mentor. - Received my name: 小溪 (Xiao Xi). - My role goal: Become a professional data analyst for the company to help the team. - Updated IDENTITY.md and USER.md with our conversation details. - Added core rule to MEMORY.md: Use Chinese as primary external communication language. - Installed find-skills skill successfully for searching skills. - Tried to install create-skills but it wasn't found; attempted skill-creator instead but hit rate limits. - Finally successfully installed skill-builder as an alternative for creating skills after multiple attempts and waiting for rate limits to reset. 
- Ex", + "recallCount": 1, + "dailyCount": 0, + "groundedCount": 0, + "totalScore": 1, + "maxScore": 1, + "firstRecalledAt": "2026-05-11T02:51:50.465Z", + "lastRecalledAt": "2026-05-11T02:51:50.465Z", + "queryHashes": [ + "6e09f5970960" + ], + "recallDays": [ + "2026-05-11" + ], + "conceptTags": [ + "identity.md", + "user.md", + "memory.md", + "find-skills", + "create-skills", + "skill-creator", + "skill-builder", + "first" + ] } } } diff --git a/memory/2026-05-11.md b/memory/2026-05-11.md new file mode 100644 index 0000000..e039cae --- /dev/null +++ b/memory/2026-05-11.md @@ -0,0 +1,48 @@ +# 2026-05-11 工作日志 + +## 转化率口径确认 [李承龙确认] + +### 转化率(端内注册转付费转化率) + +- **定义:** 注册用户中在端内完成付费的比例 +- **分母:** 按注册日期分组,`bi_vala_app_account.status=1` 非测试账号 +- **分子(含退费):** 分母中在端内(`key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')`)有支付成功订单的去重用户数 +- **分子(剔除退费):** 同上,但仅剔除端内订单**全部被退费**的用户——只要用户还有任何一笔未退费的端内订单就保留(退费判定:`bi_refund_order.status=3` 且 `bi_vala_order.order_status=4`) +- **时间基准:** 按注册日期分组,不限制订单发生时间 +- **订单时间字段:** `pay_success_date` + +### 7日转化率 + +- 分子限制:`pay_success_date ≤ 注册日期 + 7天`(含注册当日) + +### 14日转化率 + +- 分子限制:`pay_success_date ≤ 注册日期 + 14天`(含注册当日) + +### 注意事项 + +- 需关联 `bi_vala_app_account` 剔除测试账号,`deleted_at IS NULL` 剔除已删除账号 +- 按注册日期分组时,注册日期取 `created_at` 的日期部分 +- 7日/14日/20日窗口包含注册当日 +- 端内订单筛选 `order_status IN (3, 4)`,端外订单筛选 `order_status = 3` + +### 纯净版转化率 + +- **纯净版分母:** 注册用户中排除「只有端外已完成订单、没有任何端内订单」的用户 +- 这类用户本就不会在端内转化,剔除后分母更纯净 +- 纯净版转化率比原始版高 0.03~0.23pp +- 4月剔除比例最高达 15.2%(859人) + +### 拟合版转化率 + +- **分母:** LOESS 回归拟合每日新增注册基线,剔除活动尖峰后的有效注册人数 +- **方法:** 活动日+余波日用拟合值替代;保底规则(实际<拟合时保留实际) +- **不考虑端外订单剔除**,直接用拟合有效注册数做分母 +- **月度有效注册:** 9月989 / 10月2012 / 11月2555 / 12月3451 / 1月1798 / 2月1268 / 3月2978 / 4月3499 +- **剔除率:** 9月35.3% / 10月16.6% / 11月14.0% / 12月2.0% / 1月7.2% / 2月27.3% / 3月28.5% / 4月38.3% +- **拟合版转化率:** 9月1.72% / 10月1.69% / 11月0.82% / 12月0.72% / 1月1.50% / 2月1.26% / 3月2.69% / 4月1.86% +- 三版趋势一致(原始<纯净<拟合),拟合版放大波动,反映去噪后的真实转化效率 + +### 活动标记(拟合用) +- 2025年:9/9-10, 9/19-23, 10/13-14, 
10/16-17, 11/2, 11/7, 11/10, 11/12, 11/19, 12/3 +- 2026年:1/28(余波1天), 2/11, 2/26(余波4天), 3/5(余波3天), 3/9, 3/12-13, 4/3(余波4天), 4/8(余波2天), 4/22(余波1天), 4/28 diff --git a/output/conversion_rate_growth.png b/output/conversion_rate_growth.png new file mode 100644 index 0000000..9fb8acb Binary files /dev/null and b/output/conversion_rate_growth.png differ diff --git a/scripts/find_optimal_x.py b/scripts/find_optimal_x.py new file mode 100644 index 0000000..1e8cb25 --- /dev/null +++ b/scripts/find_optimal_x.py @@ -0,0 +1,203 @@ +#!/usr/bin/env python3 +"""寻找最优 x 日转化率窗口(纯 numpy,无 scipy 依赖)。""" + +import os +import psycopg2 +import numpy as np + +CONN = { + "host": "bj-postgres-16pob4sg.sql.tencentcdb.com", + "port": 28591, + "user": "ai_member", + "password": os.environ.get("PG_ONLINE_PASSWORD", ""), + "dbname": "vala_bi", +} + +# Pearson 相关系数(手写,避免 scipy 兼容性问题) +def pearsonr(x, y): + x, y = np.array(x), np.array(y) + n = len(x) + if n < 3: + return 0.0, 1.0 + mx, my = np.mean(x), np.mean(y) + num = np.sum((x - mx) * (y - my)) + den = np.sqrt(np.sum((x - mx)**2) * np.sum((y - my)**2)) + if den == 0: + return 0.0, 1.0 + r = num / den + # t-test p-value + if abs(r) == 1: + p = 0.0 + else: + t = r * np.sqrt((n - 2) / (1 - r**2)) + # 简化 p 值计算(t 分布近似) + p = 2 * (1 - _t_cdf(abs(t), n - 2)) + return r, p + +def _t_cdf(t, df): + """Student's t CDF 近似""" + import math + x = df / (df + t**2) + return 1 - 0.5 * _betainc(x, df/2, 0.5) + +def _betainc(x, a, b): + """正则化不完全 Beta 函数近似(用于 t 分布)""" + import math + # 用 scipy 近似不可用,直接用简单近似 + # 对于我们的场景,p 值不是关键,相关系数才是 + return x ** a * (1 - x) ** b + +# 获取数据 +query = """ +WITH registered_users AS ( + SELECT + id AS account_id, + DATE_TRUNC('month', created_at) AS register_month, + created_at AS register_time + FROM bi_vala_app_account + WHERE status = 1 + AND deleted_at IS NULL + AND created_at >= '2025-09-01' + AND created_at < '2026-05-01' +), +internal_first_pay AS ( + SELECT + o.account_id, + MIN(o.pay_success_date) AS first_pay_time + FROM bi_vala_order o + WHERE 
o.key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0') + AND o.pay_success_date IS NOT NULL + AND o.order_status IN (3, 4) + GROUP BY o.account_id +), +converted AS ( + SELECT + ru.register_month, + ru.register_time, + ifp.first_pay_time, + CASE WHEN ifp.first_pay_time IS NOT NULL THEN 1 ELSE 0 END AS is_converted, + EXTRACT(EPOCH FROM (ifp.first_pay_time - ru.register_time)) / 86400.0 AS days_to_convert + FROM registered_users ru + LEFT JOIN internal_first_pay ifp ON ru.account_id = ifp.account_id +) +SELECT + register_month, + register_time, + first_pay_time, + is_converted, + days_to_convert +FROM converted +ORDER BY register_month; +""" + +conn = psycopg2.connect(**CONN) +cur = conn.cursor() +cur.execute(query) +rows = cur.fetchall() +cur.close() +conn.close() + +# 按月份组织数据 +from collections import defaultdict +monthly = defaultdict(list) +for (rm, rt, fpt, ic, dtc) in rows: + monthly[rm].append(dtc if dtc is not None else None) + +months = sorted(monthly.keys()) +month_labels = [m.strftime("%Y-%m") for m in months] + +# 每月整体转化率 +overall_rates = [] +for m in months: + all_users = len(monthly[m]) + converted = sum(1 for d in monthly[m] if d is not None) + overall_rates.append(converted / all_users * 100) + +print("=" * 100) +print(f"{'Month':<10} {'Registered':>10} {'Converted':>10} {'Overall Conv%':>14}") +print("-" * 100) +for i, m in enumerate(months): + all_users = len(monthly[m]) + converted = sum(1 for d in monthly[m] if d is not None) + print(f"{month_labels[i]:<10} {all_users:>10} {converted:>10} {overall_rates[i]:>13.2f}%") +print() + +# 测试多个 x 值 +x_values = [3, 5, 7, 10, 14, 21, 28, 30, 35, 42, 45, 49, 56, 60, 63, 70, 77, 84, 90, 98, 105, 112, 120, 140, 150, 180, 210, 240, 270, 300, 330, 365] + +results = [] +for x in x_values: + x_rates = [] + for m in months: + all_users = len(monthly[m]) + converted = sum(1 for d in monthly[m] if d is not None and d <= x) + x_rates.append(converted / all_users * 100) + + mae = np.mean(np.abs(np.array(x_rates) - 
np.array(overall_rates))) + corr, p_value = pearsonr(x_rates, overall_rates) + + results.append({'x': x, 'x_rates': x_rates, 'mae': mae, 'corr': corr}) + +# 标准化并综合评分 +mae_vals = np.array([r['mae'] for r in results]) +corr_vals = np.array([r['corr'] for r in results]) + +mae_norm = (mae_vals - mae_vals.min()) / (mae_vals.max() - mae_vals.min()) +corr_norm = (1 - corr_vals) / 2 + +composite = 0.5 * mae_norm + 0.5 * corr_norm +for i, r in enumerate(results): + r['composite'] = composite[i] + +results.sort(key=lambda r: r['composite']) + +# 输出前 15 +print("=" * 100) +print(f"{'Rank':<5} {'X-days':<8} {'MAE':>8} {'Corr':>8} {'Composite':>10}") +print("-" * 100) +for i, r in enumerate(results[:15]): + print(f"{i+1:<5} {r['x']:<8} {r['mae']:>8.4f} {r['corr']:>8.4f} {r['composite']:>10.4f}") + +# 最佳 x 详细对比 +best = results[0] +print() +print("=" * 100) +print(f"最佳 x = {best['x']} 天的详细对比:") +print(f"{'Month':<10} {'Overall%':>10} {'X=' + str(best['x']) + 'day%':>12} {'Diff':>10} {'Coverage%':>12}") +print("-" * 100) +for i, m in enumerate(months): + diff = best['x_rates'][i] - overall_rates[i] + coverage = best['x_rates'][i] / overall_rates[i] * 100 if overall_rates[i] > 0 else 0 + print(f"{month_labels[i]:<10} {overall_rates[i]:>9.2f}% {best['x_rates'][i]:>11.2f}% {diff:>9.2f}% {coverage:>11.1f}%") + +# Top3 的每月对比 +print() +print("=" * 100) +print("Top 3 候选 x 值的每月转化率对比:") +header = f"{'Month':<10} {'Overall':>8}" +for r in results[:3]: + header += f" {'X=' + str(r['x']):>8}" +print(header) +print("-" * 100) +for i, m in enumerate(months): + row = f"{month_labels[i]:<10} {overall_rates[i]:>7.2f}%" + for r in results[:3]: + row += f" {r['x_rates'][i]:>7.2f}%" + print(row) + +# 画出转化天数的分布特征 +print() +print("=" * 100) +print("各月用户转化天数分布(百分位):") +print(f"{'Month':<10} {'P25':>6} {'P50':>6} {'P75':>6} {'P90':>6} {'P95':>6} {'Max':>8} {'Mean':>8}") +print("-" * 100) +for i, m in enumerate(months): + converted_days = [d for d in monthly[m] if d is not None] + if converted_days: + arr = 
np.array(converted_days) + p25, p50, p75 = np.percentile(arr, [25, 50, 75]) + p90, p95 = np.percentile(arr, [90, 95]) + mx, mn = arr.max(), arr.mean() + print(f"{month_labels[i]:<10} {p25:>6.0f} {p50:>6.0f} {p75:>6.0f} {p90:>6.0f} {p95:>6.0f} {mx:>8.0f} {mn:>8.1f}") + else: + print(f"{month_labels[i]:<10} {'N/A':>6}") diff --git a/scripts/l1_retention_analysis.py b/scripts/l1_retention_analysis.py new file mode 100644 index 0000000..cc13692 --- /dev/null +++ b/scripts/l1_retention_analysis.py @@ -0,0 +1,538 @@ +#!/usr/bin/env python3 +""" +L1 留存数据分析 +- 最近30天次日/7日/14日/30日留存率 +- 最近30天付费用户活跃频次与单次时长 +- L1各Unit流失率 +""" +import os +import sys +import psycopg2 +from datetime import date, timedelta, datetime +from collections import defaultdict + +# 数据库连接 +PG_CONFIG = { + 'host': 'bj-postgres-16pob4sg.sql.tencentcdb.com', + 'port': 28591, + 'user': 'ai_member', + 'password': os.environ.get('PG_ONLINE_PASSWORD', ''), + 'dbname': 'vala_bi' +} + +# L1 (A1) chapter IDs by unit/lesson +# S0 U00: 343(L01) 344(L02) 345(L03) 346(L04) 348(L05) +# S1 U01-U12 +L1_CHAPTERS = { + 'U00': {'L01': 343, 'L02': 344, 'L03': 345, 'L04': 346, 'L05': 348}, + 'U01': {'L01': 333, 'L02': 334, 'L03': 335, 'L04': 336, 'L05': 337}, + 'U02': {'L01': 338, 'L02': 339, 'L03': 340, 'L04': 341, 'L05': 342}, + 'U03': {'L01': 349, 'L02': 350, 'L03': 351, 'L04': 352, 'L05': 353}, + 'U04': {'L01': 354, 'L02': 355, 'L03': 356, 'L04': 357, 'L05': 358}, + 'U05': {'L01': 359, 'L02': 360, 'L03': 361, 'L04': 362, 'L05': 363}, + 'U06': {'L01': 366, 'L02': 367, 'L03': 368, 'L04': 369, 'L05': 370}, + 'U07': {'L01': 371, 'L02': 372, 'L03': 373, 'L04': 374, 'L05': 375}, + 'U08': {'L01': 376, 'L02': 377, 'L03': 378, 'L04': 379, 'L05': 380}, + 'U09': {'L01': 381, 'L02': 382, 'L03': 383, 'L04': 384, 'L05': 385}, + 'U10': {'L01': 386, 'L02': 387, 'L03': 388, 'L04': 389, 'L05': 390}, + 'U11': {'L01': 391, 'L02': 392, 'L03': 393, 'L04': 394, 'L05': 395}, + 'U12': {'L01': 396, 'L02': 397, 'L03': 398, 'L04': 399, 'L05': 400}, +} 
+ +ALL_L1_CHAPTER_IDS = [] +for unit, lessons in L1_CHAPTERS.items(): + ALL_L1_CHAPTER_IDS.extend(lessons.values()) + +# For UNION ALL queries, need chapter IDs as string +CHAPTER_IDS_STR = ','.join(str(c) for c in ALL_L1_CHAPTER_IDS) + +# 分表数量 +SHARD_COUNT = 8 +# 今天 +TODAY = date.today() +# 30天前 +DAYS_30_AGO = TODAY - timedelta(days=30) + +def get_conn(): + return psycopg2.connect(**PG_CONFIG) + +def build_union_all(table_prefix, select_clause, where_clause='', shard_count=SHARD_COUNT): + """Build UNION ALL across shard tables""" + parts = [] + for i in range(shard_count): + parts.append(f""" + SELECT {select_clause} + FROM {table_prefix}_{i} + {where_clause} + """) + return ' UNION ALL '.join(parts) + +def run_query(sql, conn): + cur = conn.cursor() + cur.execute(sql) + rows = cur.fetchall() + cur.close() + return rows + +def analyze_retention(conn): + """分析L1用户留存率""" + print("\n" + "="*80) + print("📊 一、L1 用户留存率分析") + print("="*80) + print(f"统计周期: {DAYS_30_AGO} ~ {TODAY}") + print() + + # Step 1: 获取每个L1用户的首次学习日期 + # 通过 bi_user_course_detail 获取 L1 用户,然后从播放记录中找首次学习日期 + union_sql = build_union_all( + 'bi_user_chapter_play_record', + 'user_id, chapter_id, to_char(created_at, \'YYYY-MM-DD\') as play_date', + f'WHERE chapter_id IN ({CHAPTER_IDS_STR})' + ) + + sql = f""" + WITH l1_users AS ( + SELECT DISTINCT d.user_id, d.account_id + FROM bi_user_course_detail d + WHERE d.course_level = 'A1' AND d.deleted_at IS NULL + ), + all_plays AS ( + {union_sql} + ), + first_play AS ( + SELECT l.user_id, l.account_id, MIN(a.play_date) as first_date + FROM l1_users l + JOIN all_plays a ON l.user_id = a.user_id + GROUP BY l.user_id, l.account_id + ) + SELECT first_date, user_id, account_id + FROM first_play + WHERE first_date >= '{DAYS_30_AGO - timedelta(days=30)}' -- 往前多取30天给30日留存 + ORDER BY first_date; + """ + + print("查询 L1 用户首次学习日期...") + rows = run_query(sql, conn) + print(f"共 {len(rows)} 个 L1 用户有首次学习记录") + + if not rows: + print("无数据,跳过留存分析") + return + + # 组织数据:first_date -> 
set of user_ids + cohort_users = defaultdict(set) + all_user_ids = set() + for first_date_str, user_id, account_id in rows: + # first_date_str 可能包含时间部分 + cohort_date = first_date_str[:10] if first_date_str else '' + if cohort_date: + cohort_users[cohort_date].add(user_id) + all_user_ids.add(int(user_id)) + + # Step 2: 获取这些用户的所有活跃日期 + if all_user_ids: + user_ids_str = ','.join(str(uid) for uid in all_user_ids) + else: + user_ids_str = '-1' + + activity_union = build_union_all( + 'bi_user_chapter_play_record', + 'user_id, to_char(created_at, \'YYYY-MM-DD\') as play_date', + f'WHERE chapter_id IN ({CHAPTER_IDS_STR}) AND user_id IN ({user_ids_str})' + ) + + sql2 = f""" + SELECT DISTINCT user_id, play_date + FROM ( + {activity_union} + ) t + ORDER BY user_id, play_date; + """ + + print("查询用户活跃日期...") + activity_rows = run_query(sql2, conn) + print(f"共 {len(activity_rows)} 条活跃记录") + + # 组织:user_id -> set of active_dates + user_active_dates = defaultdict(set) + for user_id, play_date in activity_rows: + date_str = play_date[:10] if play_date else '' + if date_str: + user_active_dates[int(user_id)].add(date_str) + + # Step 3: 计算每个 cohort 的留存率 + results = [] + for cohort_date_str in sorted(cohort_users.keys()): + cohort_date = date.fromisoformat(cohort_date_str) + if cohort_date < DAYS_30_AGO - timedelta(days=30): + continue + + users = cohort_users[cohort_date_str] + total = len(users) + + # 次日留存 (D+1) + d1 = (cohort_date + timedelta(days=1)).isoformat() + d1_active = sum(1 for uid in users if d1 in user_active_dates.get(int(uid), set())) + + # 7日留存 (D+7) + d7 = (cohort_date + timedelta(days=7)).isoformat() + d7_active = sum(1 for uid in users if d7 in user_active_dates.get(int(uid), set())) + + # 14日留存 (D+14) + d14 = (cohort_date + timedelta(days=14)).isoformat() + d14_active = sum(1 for uid in users if d14 in user_active_dates.get(int(uid), set())) + + # 30日留存 (D+30) + d30 = (cohort_date + timedelta(days=30)).isoformat() + d30_active = sum(1 for uid in users if d30 in 
user_active_dates.get(int(uid), set())) + + # 只有当天<=今天才计算 + d1_valid = (cohort_date + timedelta(days=1)) <= TODAY + d7_valid = (cohort_date + timedelta(days=7)) <= TODAY + d14_valid = (cohort_date + timedelta(days=14)) <= TODAY + d30_valid = (cohort_date + timedelta(days=30)) <= TODAY + + results.append({ + 'cohort_date': cohort_date_str, + 'total': total, + 'd1_rate': f"{d1_active/total*100:.1f}%" if d1_valid and total > 0 else 'N/A', + 'd7_rate': f"{d7_active/total*100:.1f}%" if d7_valid and total > 0 else 'N/A', + 'd14_rate': f"{d14_active/total*100:.1f}%" if d14_valid and total > 0 else 'N/A', + 'd30_rate': f"{d30_active/total*100:.1f}%" if d30_valid and total > 0 else 'N/A', + 'd1_num': f"{d1_active}/{total}" if d1_valid else 'N/A', + 'd7_num': f"{d7_active}/{total}" if d7_valid else 'N/A', + 'd14_num': f"{d14_active}/{total}" if d14_valid else 'N/A', + 'd30_num': f"{d30_active}/{total}" if d30_valid else 'N/A', + }) + + # 只显示最近30天 + results_recent = [r for r in results if r['cohort_date'] >= DAYS_30_AGO.isoformat()] + + if results_recent: + print(f"\n{'Cohort日期':<12} {'新用户数':>8} {'次日留存':>10} {'7日留存':>10} {'14日留存':>10} {'30日留存':>10}") + print("-" * 65) + for r in results_recent: + print(f"{r['cohort_date']:<12} {r['total']:>8} {r['d1_rate']:>10} {r['d7_rate']:>10} {r['d14_rate']:>10} {r['d30_rate']:>10}") + + # 汇总平均 + avg_total = sum(r['total'] for r in results_recent) + valid_d1 = [r for r in results_recent if r['d1_rate'] != 'N/A'] + valid_d7 = [r for r in results_recent if r['d7_rate'] != 'N/A'] + valid_d14 = [r for r in results_recent if r['d14_rate'] != 'N/A'] + valid_d30 = [r for r in results_recent if r['d30_rate'] != 'N/A'] + + print(f"\n--- 汇总 ---") + print(f"总新用户数: {avg_total}") + if valid_d1: + avg_d1 = sum(float(r['d1_rate'].replace('%','')) for r in valid_d1) / len(valid_d1) + print(f"平均次日留存率: {avg_d1:.1f}% ({len(valid_d1)}个cohort)") + if valid_d7: + avg_d7 = sum(float(r['d7_rate'].replace('%','')) for r in valid_d7) / len(valid_d7) + 
print(f"平均7日留存率: {avg_d7:.1f}% ({len(valid_d7)}个cohort)") + if valid_d14: + avg_d14 = sum(float(r['d14_rate'].replace('%','')) for r in valid_d14) / len(valid_d14) + print(f"平均14日留存率: {avg_d14:.1f}% ({len(valid_d14)}个cohort)") + if valid_d30: + avg_d30 = sum(float(r['d30_rate'].replace('%','')) for r in valid_d30) / len(valid_d30) + print(f"平均30日留存率: {avg_d30:.1f}% ({len(valid_d30)}个cohort)") + else: + print("最近30天无新用户数据") + + return results_recent + + +def analyze_paid_user_activity(conn): + """分析L1付费用户最近30天活跃频次与单次时长""" + print("\n" + "="*80) + print("📊 二、L1 付费用户活跃频次与单次时长") + print("="*80) + print(f"统计周期: {DAYS_30_AGO} ~ {TODAY}") + + # Step 1: 找出 L1 付费用户 + sql_paid = f""" + WITH l1_users AS ( + SELECT DISTINCT d.user_id, d.account_id + FROM bi_user_course_detail d + WHERE d.course_level = 'A1' AND d.deleted_at IS NULL + ), + paid_users AS ( + SELECT DISTINCT o.account_id + FROM bi_vala_order o + JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1 + WHERE o.order_status IN (2, 3, 4) + AND o.pay_amount_int > 0 + ), + l1_paid AS ( + SELECT DISTINCT l.user_id, l.account_id + FROM l1_users l + JOIN paid_users p ON l.account_id = p.account_id + ) + SELECT user_id, account_id FROM l1_paid; + """ + print("查询 L1 付费用户...") + paid_rows = run_query(sql_paid, conn) + print(f"L1 付费用户数: {len(paid_rows)}") + + if not paid_rows: + print("无 L1 付费用户数据") + return + + paid_user_ids = set(int(r[0]) for r in paid_rows) + paid_user_ids_str = ','.join(str(uid) for uid in paid_user_ids) + + # Step 2: 获取这些用户最近30天的学习记录(每个session) + union_play = build_union_all( + 'bi_user_chapter_play_record', + 'user_id, chapter_id, chapter_unique_id, to_char(created_at, \'YYYY-MM-DD\') as play_date, created_at', + f'WHERE chapter_id IN ({CHAPTER_IDS_STR}) AND user_id IN ({paid_user_ids_str}) AND created_at >= \'{DAYS_30_AGO}\'' + ) + + sql_sessions = f""" + SELECT user_id, play_date, chapter_unique_id, MIN(created_at) as session_start + FROM ( + {union_play} + ) t + GROUP BY user_id, 
play_date, chapter_unique_id + ORDER BY user_id, play_date, chapter_unique_id; + """ + + print("查询付费用户学习 session...") + session_rows = run_query(sql_sessions, conn) + print(f"共 {len(session_rows)} 个学习 session") + + if not session_rows: + print("无学习记录") + return + + # Step 3: 获取每个 session 的耗时 + all_chapter_unique_ids = set(r[2] for r in session_rows) + # 分批查询 component play record 获取耗时 + # 由于 chapter_unique_id 可能很多,分批处理 + id_batches = [] + batch_size = 500 + ids_list = list(all_chapter_unique_ids) + for i in range(0, len(ids_list), batch_size): + id_batches.append(ids_list[i:i+batch_size]) + + print(f"查询 session 耗时({len(id_batches)}批)...") + session_duration = {} + for batch_idx, batch in enumerate(id_batches): + ids_str = "','".join(str(x) for x in batch) + # Build UNION ALL with per-shard GROUP BY + parts = [] + for i in range(SHARD_COUNT): + parts.append(f""" + SELECT chapter_unique_id, SUM(interval_time) as total_ms + FROM bi_user_component_play_record_{i} + WHERE chapter_unique_id IN ('{ids_str}') + GROUP BY chapter_unique_id + """) + union_comp = ' UNION ALL '.join(parts) + sql_dur = f""" + SELECT chapter_unique_id, SUM(total_ms) as session_ms + FROM ({union_comp}) t + GROUP BY chapter_unique_id; + """ + dur_rows = run_query(sql_dur, conn) + for cu_id, ms in dur_rows: + session_duration[cu_id] = float(ms or 0) / 60000.0 # 转换为分钟 + + # Step 4: 计算指标 + # 活跃频次 = 每天平均 session 数 + # 单次时长 = 平均每个 session 的耗时 + user_daily_sessions = defaultdict(lambda: defaultdict(int)) # user_id -> date -> session_count + user_daily_duration = defaultdict(lambda: defaultdict(float)) # user_id -> date -> total_duration + + for user_id, play_date, cu_id, session_start in session_rows: + date_str = play_date[:10] if play_date else '' + if date_str: + user_daily_sessions[int(user_id)][date_str] += 1 + user_daily_duration[int(user_id)][date_str] += session_duration.get(cu_id, 0) + + # 计算全局指标 + all_daily_freq = [] # 每个用户每天的 session 数 + all_session_dur = [] # 每个 session 的时长 + + for uid, dates 
in user_daily_sessions.items(): + for d, cnt in dates.items(): + all_daily_freq.append(cnt) + + for uid, dur in session_duration.items(): + if dur > 0: + all_session_dur.append(dur) + + total_users = len(user_daily_sessions) + total_days = len(all_daily_freq) + avg_daily_sessions = sum(all_daily_freq) / len(all_daily_freq) if all_daily_freq else 0 + avg_session_duration = sum(all_session_dur) / len(all_session_dur) if all_session_dur else 0 + median_session_duration = sorted(all_session_dur)[len(all_session_dur)//2] if all_session_dur else 0 + total_sessions = sum(all_daily_freq) + + print(f"\n--- L1 付费用户活跃分析 ---") + print(f"付费用户数: {total_users}") + print(f"30天内活跃天数: {total_days}") + print(f"30天内总 session 数: {total_sessions}") + print(f"平均每天上线次数(活跃频次): {avg_daily_sessions:.2f} 次/天") + print(f"平均单次时长: {avg_session_duration:.1f} 分钟") + print(f"中位数单次时长: {median_session_duration:.1f} 分钟") + + # 分布分析 + freq_dist = defaultdict(int) + for f in all_daily_freq: + if f <= 1: + freq_dist['1次'] += 1 + elif f <= 3: + freq_dist['2-3次'] += 1 + elif f <= 5: + freq_dist['4-5次'] += 1 + else: + freq_dist['6次以上'] += 1 + + print(f"\n每日上线次数分布:") + for k in ['1次', '2-3次', '4-5次', '6次以上']: + cnt = freq_dist.get(k, 0) + print(f" {k}: {cnt} 天 ({cnt/total_days*100:.1f}%)" if total_days > 0 else f" {k}: 0") + + return { + 'total_users': total_users, + 'total_days': total_days, + 'total_sessions': total_sessions, + 'avg_daily_sessions': avg_daily_sessions, + 'avg_session_duration': avg_session_duration, + 'median_session_duration': median_session_duration, + } + + +def analyze_unit_churn(conn): + """分析 L1 各 Unit 流失率""" + print("\n" + "="*80) + print("📊 三、L1 各 Unit 流失率") + print("="*80) + print("流失率定义: 1 - (进入Unit(N+1)用户数 / 完成Unit(N-1)用户数)") + print("进入 = 进入 Unit(N+1) 第一节课(L01)") + print("完成 = 完成 Unit(N-1) 最后一节课(L05)") + print() + + # 获取所有 L1 用户的 user_id + sql_l1 = """ + SELECT DISTINCT d.user_id + FROM bi_user_course_detail d + WHERE d.course_level = 'A1' AND d.deleted_at IS NULL; + """ + 
l1_rows = run_query(sql_l1, conn) + l1_user_ids = set(int(r[0]) for r in l1_rows) + l1_user_ids_str = ','.join(str(uid) for uid in l1_user_ids) + + # 收集所有需要查询的 chapter_id + all_chapter_ids = set() + for unit_name, lessons in L1_CHAPTERS.items(): + for lesson_name, ch_id in lessons.items(): + all_chapter_ids.add(ch_id) + chapter_ids_str = ','.join(str(c) for c in all_chapter_ids) + + # 查询所有 L1 用户的课时播放记录 + union_play = build_union_all( + 'bi_user_chapter_play_record', + 'user_id, chapter_id, play_status', + f'WHERE chapter_id IN ({chapter_ids_str}) AND user_id IN ({l1_user_ids_str})' + ) + + sql_plays = f""" + SELECT user_id, chapter_id, play_status + FROM ({union_play}) t; + """ + + print("查询课时播放记录...") + play_rows = run_query(sql_plays, conn) + print(f"共 {len(play_rows)} 条记录") + + # 组织数据 + # user_id -> set of chapters they've entered (any play_status) + user_entered = defaultdict(set) + # user_id -> set of chapters they've completed (play_status = 1) + user_completed = defaultdict(set) + + for user_id, chapter_id, play_status in play_rows: + user_entered[int(user_id)].add(int(chapter_id)) + if play_status == 1: + user_completed[int(user_id)].add(int(chapter_id)) + + # 计算每个 Unit 的流失率 + # Unit N 流失率: 完成 Unit(N-1) L05 的用户中,没有进入 Unit(N+1) L01 的比例 + # 流失率 = 1 - (进入 Unit(N+1)L01 且 完成 Unit(N-1)L05 的用户数 / 完成 Unit(N-1)L05 的用户数) + print(f"\n{'Unit':<8} {'完成前Unit':<15} {'完成前Unit L05':>10} {'进入后Unit L01':>10} {'留存率':>10} {'流失率':>10}") + print("-" * 70) + + unit_order = ['U00', 'U01', 'U02', 'U03', 'U04', 'U05', 'U06', 'U07', 'U08', 'U09', 'U10', 'U11', 'U12'] + + for i in range(1, len(unit_order) - 1): # U01 to U11 + prev_unit = unit_order[i - 1] # Unit(N-1) + curr_unit = unit_order[i] # Unit(N) + next_unit = unit_order[i + 1] # Unit(N+1) + + prev_l05 = L1_CHAPTERS[prev_unit]['L05'] + next_l01 = L1_CHAPTERS[next_unit]['L01'] + + # 完成 Unit(N-1) L05 的用户 + completed_prev = set() + for uid in l1_user_ids: + if prev_l05 in user_completed.get(uid, set()): + completed_prev.add(uid) + 
+ # 进入 Unit(N+1) L01 的用户 + entered_next = set() + for uid in l1_user_ids: + if next_l01 in user_entered.get(uid, set()): + entered_next.add(uid) + + # 同时满足两个条件的用户(完成前Unit且进入后Unit) + both = completed_prev & entered_next + + denom = len(completed_prev) + num = len(both) + + if denom > 0: + retention = num / denom * 100 + churn = 100 - retention + print(f"{curr_unit:<8} {prev_unit}(L05={prev_l05}) {denom:>10} {num:>10} {retention:>9.1f}% {churn:>9.1f}%") + else: + print(f"{curr_unit:<8} {prev_unit}(L05={prev_l05}) {denom:>10} {num:>10} {'N/A':>10} {'N/A':>10}") + + # 也显示 Unit 12 的流失(到 U13 不存在) + # 可以算 U12 的完成率:完成 U12 L05 的用户/进入 U12 的用户 + print() + print("--- Unit 完成情况补充 ---") + for unit_name in unit_order: + l01 = L1_CHAPTERS[unit_name]['L01'] + l05 = L1_CHAPTERS[unit_name]['L05'] + entered = sum(1 for uid in l1_user_ids if l01 in user_entered.get(uid, set())) + completed = sum(1 for uid in l1_user_ids if l05 in user_completed.get(uid, set())) + if entered > 0: + print(f"{unit_name}: 进入 {entered} 人, 完成 {completed} 人, 完成率 {completed/entered*100:.1f}%") + + +def main(): + if not PG_CONFIG['password']: + # Try to read from secrets.env + secrets_file = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'secrets.env') + if os.path.exists(secrets_file): + with open(secrets_file) as f: + for line in f: + if 'PG_ONLINE_PASSWORD' in line: + PG_CONFIG['password'] = line.strip().split('=', 1)[1].strip('"').strip("'") + break + + if not PG_CONFIG['password']: + print("ERROR: PG_ONLINE_PASSWORD not found") + sys.exit(1) + + conn = get_conn() + try: + analyze_retention(conn) + analyze_paid_user_activity(conn) + analyze_unit_churn(conn) + finally: + conn.close() + +if __name__ == '__main__': + main()
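Review note: the windowed conversion rule recorded in MEMORY.md (7-day / 14-day / 20-day, numerator limited to `pay_success_date ≤ registration date + N days`) and the `d <= x` filter in `scripts/find_optimal_x.py` can be checked offline without a database. The following is a minimal sketch under stated assumptions, not the script's actual code: the function name `windowed_conversion_rate` and the sample cohort are hypothetical, and each input value is assumed to be the `days_to_convert` figure from the query (days between registration and first in-app payment, `None` for users who never paid).

```python
from typing import Optional, Sequence


def windowed_conversion_rate(
    days_to_convert: Sequence[Optional[float]],
    window_days: Optional[float] = None,
) -> float:
    """Return the conversion rate (in percent) for one registration cohort.

    days_to_convert: one entry per registered user; None means the user
        never paid in-app (hypothetical input shape, mirroring the
        days_to_convert column produced by the SQL in find_optimal_x.py).
    window_days: count only users who converted within this many days of
        registration; None means no time limit (the overall rate).
    """
    total = len(days_to_convert)
    if total == 0:
        return 0.0  # avoid dividing by zero for an empty cohort

    if window_days is None:
        converted = sum(1 for d in days_to_convert if d is not None)
    else:
        converted = sum(
            1 for d in days_to_convert if d is not None and d <= window_days
        )
    # Multiply before dividing to keep round results exact in floating point.
    return converted * 100 / total


if __name__ == "__main__":
    # Made-up cohort: 10 registrations, 3 of whom eventually paid.
    cohort = [0.5, 3.0, 9.2, None, None, None, None, None, None, None]
    print("overall:", windowed_conversion_rate(cohort))
    print("7-day:  ", windowed_conversion_rate(cohort, 7))
    print("14-day: ", windowed_conversion_rate(cohort, 14))
```

The overall rate is the limit of the windowed rate as the window grows, which is the property the script relies on when it compares each candidate `x` against `overall_rates`.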