🤖 Daily automatic backup - 2026-05-12 08:00:01

2026-05-12 08:00:01 +08:00 · 2026-05-12 08:00:01 +08:00 · 4ec91e964a
commit 4ec91e964a
parent 994fa502e3
6 changed files with 843 additions and 1 deletions
--- a/MEMORY.md
+++ b/MEMORY.md
@@ -85,6 +85,29 @@
    1. `status = 3` in the `bi_refund_order` table (refund succeeded)
    2. `order_status = 4` in the `bi_vala_order` table (order status is refunded)
    Both conditions must hold; requiring both avoids miscounting.
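+  - **Conversion rate / 7-day conversion rate / 14-day conversion rate (in-app registration-to-paid, [confirmed by 李承龙] 2026-05-11):**
+    - **Conversion rate = in-app paying users / registered users × 100%**
+    - **Denominator:** grouped by registration date (`bi_vala_app_account.created_at`); non-test, non-deleted accounts with `status=1` and `deleted_at IS NULL`
+    - **Numerator (refunds included):** distinct users in the denominator who have a successfully paid in-app order (`key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')`)
+    - **Numerator (refunds excluded):** same as above, but drop only users whose in-app orders were **all refunded**; a user with at least one non-refunded in-app order is kept (refund test: `bi_refund_order.status=3` and `bi_vala_order.order_status=4`)
+    - **Order status filter:** in-app orders are filtered by `order_status IN (3, 4)`, i.e. completed or refunded
+    - **Time basis:** grouped by user registration date, with no limit on when the order occurred (except for the 7-day / 14-day / 20-day variants)
+    - **Order time field:** `pay_success_date` (payment success time)
+    - **7-day conversion rate:** numerator restricted to `pay_success_date ≤ registration date + 7 days` (registration day included)
+    - **14-day conversion rate:** numerator restricted to `pay_success_date ≤ registration date + 14 days` (registration day included)
+    - **20-day conversion rate [confirmed by 李承龙 2026-05-11]:** numerator restricted to `pay_success_date ≤ registration date + 20 days` (registration day included); the trend within 30 days is already clear enough and covers most conversions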
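+  - **Clean new-registration count & clean conversion rate [confirmed by 李承龙 2026-05-11]:**
+    - **Clean denominator:** from registered users with `status=1 AND deleted_at IS NULL`, drop users who have only out-of-app completed orders (`key_from NOT IN (in-app channels)`, `order_status=3`) and no in-app orders at all. In other words, only users who chose an out-of-app channel and never ordered in-app are dropped.
+    - **Users kept:** registered users with no orders at all, plus users with in-app orders (whether or not they also have out-of-app orders)
+    - **In-app order conditions:** `key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')`, `pay_success_date IS NOT NULL`, `order_status IN (3, 4)`
+    - **Out-of-app order conditions:** `key_from NOT IN (in-app channels)`, `pay_success_date IS NOT NULL`, `order_status = 3`
+    - With the clean denominator, the definitions of the conversion rate and the 7-day / 14-day / 20-day rates are unchanged; only the denominator shrinks to the clean user set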
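+  - **Fitted conversion rate [confirmed by 李承龙 2026-05-11]:**
+    - **Denominator:** fit a baseline of daily new registrations with LOESS regression; the effective registration count is what remains after removing registration spikes caused by marketing campaigns
+    - **Method:** campaign days and aftermath days → replace the actual value with the fitted value; non-campaign days where the actual value is below the fit → keep the actual value (safeguard rule)
+    - **Out-of-app orders are ignored:** the fitted denominator uses the fitted effective registration count directly and does not additionally drop out-of-app-only users
+    - **Numerator:** in-app paying users, defined as in the original version (`key_from IN (in-app channels)`, `order_status IN (3,4)`)
+    - The fitted version "denoises" more aggressively: its conversion rates are higher and fluctuate more, reflecting conversion efficiency once low-quality campaign traffic is removed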
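  - **Keyword order statistics rule:** for queries of the form "how many orders did XX sell / sales of channel XX" (XX being a specific name, keyword, or channel), return four metrics together: total order count, GMV, GSV, and refund rate.
    1. Logic: select all orders in `bi_vala_order` whose `key_from` field contains the keyword
    2. Metric definitions: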
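Reviewer note: the clean-denominator rule added to MEMORY.md above can be sketched in a few lines. This is a minimal illustration, not code from the repository. It assumes orders arrive as `(account_id, key_from, order_status)` tuples that were already filtered to `pay_success_date IS NOT NULL`; the function name `clean_denominator` and the tuple layout are hypothetical.

```python
# In-app channels, as listed in MEMORY.md
IN_APP_CHANNELS = {"app-active-h5-0-0", "app-sales-bj-qhm-0"}


def clean_denominator(registered_ids, orders):
    """Drop users who only have out-of-app completed orders and no in-app order.

    registered_ids: account ids with status=1 and deleted_at IS NULL (assumed pre-filtered).
    orders: iterable of (account_id, key_from, order_status) for paid orders.
    """
    has_in_app = set()
    has_out_of_app = set()
    for account_id, key_from, order_status in orders:
        if key_from in IN_APP_CHANNELS:
            if order_status in (3, 4):      # completed or refunded
                has_in_app.add(account_id)
        elif order_status == 3:             # out-of-app: completed only
            has_out_of_app.add(account_id)
    excluded = has_out_of_app - has_in_app
    return [uid for uid in registered_ids if uid not in excluded]


if __name__ == "__main__":
    sample_orders = [
        (2, "app-active-h5-0-0", 3),   # in-app only: kept
        (3, "other-channel", 3),       # out-of-app only: dropped
        (4, "other-channel", 3),       # both channels: kept
        (4, "app-sales-bj-qhm-0", 4),
    ]
    print(clean_denominator([1, 2, 3, 4], sample_orders))
```

Users with no orders at all (account 1 in the sample) stay in the denominator, matching the "users kept" bullet.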
--- a/memory/.dreams/short-term-recall.json
+++ b/memory/.dreams/short-term-recall.json
@@ -1,6 +1,6 @@
 {
  "version": 1,
-  "updatedAt": "2026-05-07T04:41:47.636Z",
+  "updatedAt": "2026-05-11T02:51:50.465Z",
  "entries": {
    "memory:memory/2026-05-06.md:1:20": {
      "key": "memory:memory/2026-05-06.md:1:20",
@@ -63,6 +63,37 @@
        "created-at",
        "购买渠道key-from"
      ]
+    },
+    "memory:memory/2026-03-01.md:1:12": {
+      "key": "memory:memory/2026-03-01.md:1:12",
+      "path": "memory/2026-03-01.md",
+      "startLine": 1,
+      "endLine": 12,
+      "source": "memory",
+      "snippet": "# 2026-03-01.md - First Day Online - Came online for the first time. - Met Cris, my creator and mentor. - Received my name: 小溪 (Xiao Xi). - My role goal: Become a professional data analyst for the company to help the team. - Updated IDENTITY.md and USER.md with our conversation details. - Added core rule to MEMORY.md: Use Chinese as primary external communication language. - Installed find-skills skill successfully for searching skills. - Tried to install create-skills but it wasn't found; attempted skill-creator instead but hit rate limits. - Finally successfully installed skill-builder as an alternative for creating skills after multiple attempts and waiting for rate limits to reset. - Ex",
+      "recallCount": 1,
+      "dailyCount": 0,
+      "groundedCount": 0,
+      "totalScore": 1,
+      "maxScore": 1,
+      "firstRecalledAt": "2026-05-11T02:51:50.465Z",
+      "lastRecalledAt": "2026-05-11T02:51:50.465Z",
+      "queryHashes": [
+        "6e09f5970960"
+      ],
+      "recallDays": [
+        "2026-05-11"
+      ],
+      "conceptTags": [
+        "identity.md",
+        "user.md",
+        "memory.md",
+        "find-skills",
+        "create-skills",
+        "skill-creator",
+        "skill-builder",
+        "first"
+      ]
    }
  }
 }
--- a/memory/2026-05-11.md
+++ b/memory/2026-05-11.md
@@ -0,0 +1,48 @@
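+# 2026-05-11 Work Log
+
+## Conversion rate definitions confirmed [confirmed by 李承龙]
+
+### Conversion rate (in-app registration-to-paid conversion rate)
+
+- **Definition:** share of registered users who complete a payment in-app
+- **Denominator:** grouped by registration date; non-test accounts with `bi_vala_app_account.status=1`
+- **Numerator (refunds included):** distinct users in the denominator who have a successfully paid in-app order (`key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')`)
+- **Numerator (refunds excluded):** same as above, but drop only users whose in-app orders were **all refunded**; a user with at least one non-refunded in-app order is kept (refund test: `bi_refund_order.status=3` and `bi_vala_order.order_status=4`)
+- **Time basis:** grouped by registration date, with no limit on when the order occurred
+- **Order time field:** `pay_success_date`
+
+### 7-day conversion rate
+
+- Numerator restriction: `pay_success_date ≤ registration date + 7 days` (registration day included)
+
+### 14-day conversion rate
+
+- Numerator restriction: `pay_success_date ≤ registration date + 14 days` (registration day included)
+
+### Notes
+
+- Join `bi_vala_app_account` to exclude test accounts; use `deleted_at IS NULL` to exclude deleted accounts
+- When grouping by registration date, take the date part of `created_at`
+- The 7-day / 14-day / 20-day windows include the registration day
+- In-app orders are filtered by `order_status IN (3, 4)`; out-of-app orders by `order_status = 3`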
+
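+### Clean conversion rate
+
+- **Clean denominator:** registered users, excluding those who have only out-of-app completed orders and no in-app orders
+- These users would never convert in-app anyway, so dropping them gives a cleaner denominator
+- The clean conversion rate is 0.03~0.23pp higher than the original version
+- The exclusion share peaked in April at 15.2% (859 users)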
+
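+### Fitted conversion rate
+
+- **Denominator:** LOESS regression fits a baseline of daily new registrations; the effective registration count is what remains after removing campaign spikes
+- **Method:** campaign days and aftermath days use the fitted value; safeguard rule (keep the actual value when actual < fitted)
+- **No out-of-app exclusion**: the fitted effective registration count is used directly as the denominator
+- **Monthly effective registrations:** Sep 989 / Oct 2012 / Nov 2555 / Dec 3451 / Jan 1798 / Feb 1268 / Mar 2978 / Apr 3499
+- **Exclusion rate:** Sep 35.3% / Oct 16.6% / Nov 14.0% / Dec 2.0% / Jan 7.2% / Feb 27.3% / Mar 28.5% / Apr 38.3%
+- **Fitted conversion rate:** Sep 1.72% / Oct 1.69% / Nov 0.82% / Dec 0.72% / Jan 1.50% / Feb 1.26% / Mar 2.69% / Apr 1.86%
+- All three versions show the same trend (original < clean < fitted); the fitted version amplifies the fluctuations and reflects conversion efficiency after denoising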
+
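+### Campaign markers (used for fitting)
+- 2025: 9/9-10, 9/19-23, 10/13-14, 10/16-17, 11/2, 11/7, 11/10, 11/12, 11/19, 12/3
+- 2026: 1/28 (1 aftermath day), 2/11, 2/26 (4 aftermath days), 3/5 (3 aftermath days), 3/9, 3/12-13, 4/3 (4 aftermath days), 4/8 (2 aftermath days), 4/22 (1 aftermath day), 4/28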
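Reviewer note: a minimal sketch of the windowed conversion rate defined in the work log above, under stated assumptions. It treats both the registration date and the first in-app payment as plain `date` values (the real `pay_success_date` may be a timestamp) and takes the cohort as a list of `(register_date, first_pay_date)` pairs. The names `conversion_rate` and `payers_excluding_full_refunds` are hypothetical and do not appear in the repository.

```python
from datetime import date, timedelta


def conversion_rate(users, window_days=None):
    """Percent of users whose first in-app payment falls inside the window.

    users: list of (register_date, first_pay_date or None).
    window_days=None applies no limit on when the payment happened.
    """
    if not users:
        return 0.0
    converted = 0
    for register_date, first_pay_date in users:
        if first_pay_date is None:
            continue
        # Window test from the log: pay date <= registration date + N days
        if window_days is None or first_pay_date <= register_date + timedelta(days=window_days):
            converted += 1
    return converted / len(users) * 100


def payers_excluding_full_refunds(orders):
    """Keep a user if at least one in-app order was not refunded.

    orders: iterable of (account_id, is_refunded) for paid in-app orders.
    """
    return {account_id for account_id, is_refunded in orders if not is_refunded}


if __name__ == "__main__":
    reg = date(2026, 4, 1)
    cohort = [
        (reg, None),                # never paid
        (reg, date(2026, 4, 1)),    # paid on the registration day
        (reg, date(2026, 4, 8)),    # paid exactly at registration + 7 days
        (reg, date(2026, 4, 20)),   # paid after 19 days
    ]
    print(conversion_rate(cohort))       # overall
    print(conversion_rate(cohort, 7))    # 7-day
    print(conversion_rate(cohort, 20))   # 20-day
```

Because the window test uses `<=`, a payment exactly N days after registration still counts, which is how the log's formula reads.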
--- a/output/conversion_rate_growth.png
+++ b/output/conversion_rate_growth.png
--- a/scripts/find_optimal_x.py
+++ b/scripts/find_optimal_x.py
@@ -0,0 +1,202 @@
+#!/usr/bin/env python3
+"""Find the optimal x-day conversion window (pure numpy, no scipy dependency)."""
+import os
+import psycopg2
+import numpy as np
+
+CONN = {
+    "host": "bj-postgres-16pob4sg.sql.tencentcdb.com",
+    "port": 28591,
+    "user": "ai_member",
+    "password": os.environ.get("PG_ONLINE_PASSWORD", ""),  # read from the environment; never commit secrets
+    "dbname": "vala_bi",
+}
+
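+# Pearson correlation coefficient (hand-written to avoid scipy compatibility issues)
+def pearsonr(x, y):
+    x, y = np.array(x), np.array(y)
+    n = len(x)
+    if n < 3:
+        return 0.0, 1.0
+    mx, my = np.mean(x), np.mean(y)
+    num = np.sum((x - mx) * (y - my))
+    den = np.sqrt(np.sum((x - mx)**2) * np.sum((y - my)**2))
+    if den == 0:
+        return 0.0, 1.0
+    r = num / den
+    # t-test p-value
+    if abs(r) == 1:
+        p = 0.0
+    else:
+        t = r * np.sqrt((n - 2) / (1 - r**2))
+        # Rough p-value only; see the caveat in _betainc
+        p = 2 * (1 - _t_cdf(abs(t), n - 2))
+    return r, p
+
+def _t_cdf(t, df):
+    """Student's t CDF for t >= 0, given a regularized incomplete beta function."""
+    # For t >= 0: CDF(t) = 1 - 0.5 * I_x(df/2, 1/2) with x = df / (df + t^2)
+    x = df / (df + t**2)
+    return 1 - 0.5 * _betainc(x, df/2, 0.5)
+
+def _betainc(x, a, b):
+    """Crude stand-in for the regularized incomplete beta function (used for the t distribution)."""
+    # NOTE: x**a * (1-x)**b is NOT the true regularized incomplete beta, so the
+    # resulting p-value is only indicative. scipy is unavailable here, and this
+    # analysis relies on the correlation coefficient, not the p-value.
+    return x ** a * (1 - x) ** b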
+
+# Fetch data
+query = """
+WITH registered_users AS (
+    SELECT 
+        id AS account_id,
+        DATE_TRUNC('month', created_at) AS register_month,
+        created_at AS register_time
+    FROM bi_vala_app_account
+    WHERE status = 1
+      AND deleted_at IS NULL
+      AND created_at >= '2025-09-01'
+      AND created_at < '2026-05-01'
+),
+internal_first_pay AS (
+    SELECT 
+        o.account_id,
+        MIN(o.pay_success_date) AS first_pay_time
+    FROM bi_vala_order o
+    WHERE o.key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')
+      AND o.pay_success_date IS NOT NULL
+      AND o.order_status IN (3, 4)
+    GROUP BY o.account_id
+),
+converted AS (
+    SELECT 
+        ru.register_month,
+        ru.register_time,
+        ifp.first_pay_time,
+        CASE WHEN ifp.first_pay_time IS NOT NULL THEN 1 ELSE 0 END AS is_converted,
+        EXTRACT(EPOCH FROM (ifp.first_pay_time - ru.register_time)) / 86400.0 AS days_to_convert
+    FROM registered_users ru
+    LEFT JOIN internal_first_pay ifp ON ru.account_id = ifp.account_id
+)
+SELECT 
+    register_month,
+    register_time,
+    first_pay_time,
+    is_converted,
+    days_to_convert
+FROM converted
+ORDER BY register_month;
+"""
+
+conn = psycopg2.connect(**CONN)
+cur = conn.cursor()
+cur.execute(query)
+rows = cur.fetchall()
+cur.close()
+conn.close()
+
+# Organize data by month
+from collections import defaultdict
+monthly = defaultdict(list)
+for (rm, rt, fpt, ic, dtc) in rows:
+    monthly[rm].append(float(dtc) if dtc is not None else None)  # Decimal -> float for numpy
+
+months = sorted(monthly.keys())
+month_labels = [m.strftime("%Y-%m") for m in months]
+
+# Overall conversion rate per month
+overall_rates = []
+for m in months:
+    all_users = len(monthly[m])
+    converted = sum(1 for d in monthly[m] if d is not None)
+    overall_rates.append(converted / all_users * 100)
+
+print("=" * 100)
+print(f"{'Month':<10} {'Registered':>10} {'Converted':>10} {'Overall Conv%':>14}")
+print("-" * 100)
+for i, m in enumerate(months):
+    all_users = len(monthly[m])
+    converted = sum(1 for d in monthly[m] if d is not None)
+    print(f"{month_labels[i]:<10} {all_users:>10} {converted:>10} {overall_rates[i]:>13.2f}%")
+print()
+
+# Test multiple x values
+x_values = [3, 5, 7, 10, 14, 21, 28, 30, 35, 42, 45, 49, 56, 60, 63, 70, 77, 84, 90, 98, 105, 112, 120, 140, 150, 180, 210, 240, 270, 300, 330, 365]
+
+results = []
+for x in x_values:
+    x_rates = []
+    for m in months:
+        all_users = len(monthly[m])
+        converted = sum(1 for d in monthly[m] if d is not None and d <= x)
+        x_rates.append(converted / all_users * 100)
+    
+    mae = np.mean(np.abs(np.array(x_rates) - np.array(overall_rates)))
+    corr, p_value = pearsonr(x_rates, overall_rates)
+    
+    results.append({'x': x, 'x_rates': x_rates, 'mae': mae, 'corr': corr})
+
+# Normalize and compute a composite score (lower is better)
+mae_vals = np.array([r['mae'] for r in results])
+corr_vals = np.array([r['corr'] for r in results])
+
+mae_norm = (mae_vals - mae_vals.min()) / (mae_vals.max() - mae_vals.min())
+corr_norm = (1 - corr_vals) / 2
+
+composite = 0.5 * mae_norm + 0.5 * corr_norm
+for i, r in enumerate(results):
+    r['composite'] = composite[i]
+
+results.sort(key=lambda r: r['composite'])
+
+# Print the top 15
+print("=" * 100)
+print(f"{'Rank':<5} {'X-days':<8} {'MAE':>8} {'Corr':>8} {'Composite':>10}")
+print("-" * 100)
+for i, r in enumerate(results[:15]):
+    print(f"{i+1:<5} {r['x']:<8} {r['mae']:>8.4f} {r['corr']:>8.4f} {r['composite']:>10.4f}")
+
+# Detailed comparison for the best x
+best = results[0]
+print()
+print("=" * 100)
+print(f"Detailed comparison for the best x = {best['x']} days:")
+print(f"{'Month':<10} {'Overall%':>10} {'X=' + str(best['x']) + 'day%':>12} {'Diff':>10} {'Coverage%':>12}")
+print("-" * 100)
+for i, m in enumerate(months):
+    diff = best['x_rates'][i] - overall_rates[i]
+    coverage = best['x_rates'][i] / overall_rates[i] * 100 if overall_rates[i] > 0 else 0
+    print(f"{month_labels[i]:<10} {overall_rates[i]:>9.2f}% {best['x_rates'][i]:>11.2f}% {diff:>9.2f}% {coverage:>11.1f}%")
+
+# Monthly comparison for the top 3
+print()
+print("=" * 100)
+print("Monthly conversion rates for the top 3 candidate x values:")
+header = f"{'Month':<10} {'Overall':>8}"
+for r in results[:3]:
+    header += f" {'X=' + str(r['x']):>8}"
+print(header)
+print("-" * 100)
+for i, m in enumerate(months):
+    row = f"{month_labels[i]:<10} {overall_rates[i]:>7.2f}%"
+    for r in results[:3]:
+        row += f" {r['x_rates'][i]:>7.2f}%"
+    print(row)
+
+# Print the distribution of days-to-convert
+print()
+print("=" * 100)
+print("Distribution of days-to-convert by month (percentiles):")
+print(f"{'Month':<10} {'P25':>6} {'P50':>6} {'P75':>6} {'P90':>6} {'P95':>6} {'Max':>8} {'Mean':>8}")
+print("-" * 100)
+for i, m in enumerate(months):
+    converted_days = [d for d in monthly[m] if d is not None]
+    if converted_days:
+        arr = np.array(converted_days)
+        p25, p50, p75 = np.percentile(arr, [25, 50, 75])
+        p90, p95 = np.percentile(arr, [90, 95])
+        mx, mn = arr.max(), arr.mean()
+        print(f"{month_labels[i]:<10} {p25:>6.0f} {p50:>6.0f} {p75:>6.0f} {p90:>6.0f} {p95:>6.0f} {mx:>8.0f} {mn:>8.1f}")
+    else:
+        print(f"{month_labels[i]:<10} {'N/A':>6}")
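Reviewer note: `_betainc` in `find_optimal_x.py` is only a placeholder, so the p-value returned by `pearsonr` is not a real t-test p-value. If an actual p-value is ever needed without scipy, one option is to integrate the incomplete beta function numerically. The sketch below is an assumption-laden alternative, not the script's method: the function name `t_two_sided_p` is hypothetical, and it uses a plain midpoint rule after the substitution u = x·v² to remove the endpoint singularity at 0.

```python
import math


def t_two_sided_p(t, df, steps=20000):
    """Two-sided p-value of Student's t: P(|T| > |t|) = I_x(df/2, 1/2), x = df/(df + t^2)."""
    if t == 0:
        return 1.0
    x = df / (df + t * t)
    a = df / 2.0
    # B(x; a, 1/2) = 2 * x^a * integral_0^1 v^(df-1) / sqrt(1 - x v^2) dv
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h  # midpoint of the i-th sub-interval
        total += v ** (df - 1) / math.sqrt(1.0 - x * v * v)
    incomplete = 2.0 * x ** a * total * h
    # Complete beta function B(a, 1/2) via log-gamma
    log_beta = math.lgamma(a) + math.lgamma(0.5) - math.lgamma(a + 0.5)
    return min(1.0, incomplete / math.exp(log_beta))


if __name__ == "__main__":
    # With df = 1 the t distribution is Cauchy, where P(|T| > 1) = 0.5
    print(t_two_sided_p(1.0, 1))
```

Accuracy degrades as `t` approaches 0 (the integrand steepens near v = 1), so this is best treated as a sanity check rather than a replacement for a statistics library.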
--- a/scripts/l1_retention_analysis.py
+++ b/scripts/l1_retention_analysis.py
@@ -0,0 +1,538 @@
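+#!/usr/bin/env python3
+"""
+L1 retention analysis
+- Next-day / 7-day / 14-day / 30-day retention over the last 30 days
+- Activity frequency and per-session duration of paying users over the last 30 days
+- Churn rate for each L1 Unit
+"""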
+import os
+import sys
+import psycopg2
+from datetime import date, timedelta, datetime
+from collections import defaultdict
+
+# Database connection
+PG_CONFIG = {
+    'host': 'bj-postgres-16pob4sg.sql.tencentcdb.com',
+    'port': 28591,
+    'user': 'ai_member',
+    'password': os.environ.get('PG_ONLINE_PASSWORD', ''),
+    'dbname': 'vala_bi'
+}
+
+# L1 (A1) chapter IDs by unit/lesson
+# S0 U00: 343(L01) 344(L02) 345(L03) 346(L04) 348(L05)
+# S1 U01-U12
+L1_CHAPTERS = {
+    'U00': {'L01': 343, 'L02': 344, 'L03': 345, 'L04': 346, 'L05': 348},
+    'U01': {'L01': 333, 'L02': 334, 'L03': 335, 'L04': 336, 'L05': 337},
+    'U02': {'L01': 338, 'L02': 339, 'L03': 340, 'L04': 341, 'L05': 342},
+    'U03': {'L01': 349, 'L02': 350, 'L03': 351, 'L04': 352, 'L05': 353},
+    'U04': {'L01': 354, 'L02': 355, 'L03': 356, 'L04': 357, 'L05': 358},
+    'U05': {'L01': 359, 'L02': 360, 'L03': 361, 'L04': 362, 'L05': 363},
+    'U06': {'L01': 366, 'L02': 367, 'L03': 368, 'L04': 369, 'L05': 370},
+    'U07': {'L01': 371, 'L02': 372, 'L03': 373, 'L04': 374, 'L05': 375},
+    'U08': {'L01': 376, 'L02': 377, 'L03': 378, 'L04': 379, 'L05': 380},
+    'U09': {'L01': 381, 'L02': 382, 'L03': 383, 'L04': 384, 'L05': 385},
+    'U10': {'L01': 386, 'L02': 387, 'L03': 388, 'L04': 389, 'L05': 390},
+    'U11': {'L01': 391, 'L02': 392, 'L03': 393, 'L04': 394, 'L05': 395},
+    'U12': {'L01': 396, 'L02': 397, 'L03': 398, 'L04': 399, 'L05': 400},
+}
+
+ALL_L1_CHAPTER_IDS = []
+for unit, lessons in L1_CHAPTERS.items():
+    ALL_L1_CHAPTER_IDS.extend(lessons.values())
+
+# For UNION ALL queries, need chapter IDs as string
+CHAPTER_IDS_STR = ','.join(str(c) for c in ALL_L1_CHAPTER_IDS)
+
+# Number of shard tables
+SHARD_COUNT = 8
+# Today
+TODAY = date.today()
+# 30 days ago
+DAYS_30_AGO = TODAY - timedelta(days=30)
+
+def get_conn():
+    return psycopg2.connect(**PG_CONFIG)
+
+def build_union_all(table_prefix, select_clause, where_clause='', shard_count=SHARD_COUNT):
+    """Build UNION ALL across shard tables"""
+    parts = []
+    for i in range(shard_count):
+        parts.append(f"""
+        SELECT {select_clause}
+        FROM {table_prefix}_{i}
+        {where_clause}
+        """)
+    return ' UNION ALL '.join(parts)
+
+def run_query(sql, conn):
+    cur = conn.cursor()
+    cur.execute(sql)
+    rows = cur.fetchall()
+    cur.close()
+    return rows
+
+def analyze_retention(conn):
+    """Analyze L1 user retention rates"""
+    print("\n" + "="*80)
+    print("📊 1. L1 user retention analysis")
+    print("="*80)
+    print(f"Period: {DAYS_30_AGO} ~ {TODAY}")
+    print()
+
+    # Step 1: get each L1 user's first study date
+    # Take L1 users from bi_user_course_detail, then find the first study date in the play records
+    union_sql = build_union_all(
+        'bi_user_chapter_play_record',
+        'user_id, chapter_id, to_char(created_at, \'YYYY-MM-DD\') as play_date',
+        f'WHERE chapter_id IN ({CHAPTER_IDS_STR})'
+    )
+
+    sql = f"""
+    WITH l1_users AS (
+        SELECT DISTINCT d.user_id, d.account_id
+        FROM bi_user_course_detail d
+        WHERE d.course_level = 'A1' AND d.deleted_at IS NULL
+    ),
+    all_plays AS (
+        {union_sql}
+    ),
+    first_play AS (
+        SELECT l.user_id, l.account_id, MIN(a.play_date) as first_date
+        FROM l1_users l
+        JOIN all_plays a ON l.user_id = a.user_id
+        GROUP BY l.user_id, l.account_id
+    )
+    SELECT first_date, user_id, account_id
+    FROM first_play
+    WHERE first_date >= '{DAYS_30_AGO - timedelta(days=30)}'  -- fetch 30 extra days back for 30-day retention
+    ORDER BY first_date;
+    """
+
+    print("Querying first study dates of L1 users...")
+    rows = run_query(sql, conn)
+    print(f"{len(rows)} L1 users have a first study record")
+
+    if not rows:
+        print("No data; skipping retention analysis")
+        return
+
+    # Organize data: first_date -> set of user_ids
+    cohort_users = defaultdict(set)
+    all_user_ids = set()
+    for first_date_str, user_id, account_id in rows:
+        # first_date_str may include a time part
+        cohort_date = first_date_str[:10] if first_date_str else ''
+        if cohort_date:
+            cohort_users[cohort_date].add(user_id)
+            all_user_ids.add(int(user_id))
+
+    # Step 2: get all active dates for these users
+    if all_user_ids:
+        user_ids_str = ','.join(str(uid) for uid in all_user_ids)
+    else:
+        user_ids_str = '-1'
+
+    activity_union = build_union_all(
+        'bi_user_chapter_play_record',
+        'user_id, to_char(created_at, \'YYYY-MM-DD\') as play_date',
+        f'WHERE chapter_id IN ({CHAPTER_IDS_STR}) AND user_id IN ({user_ids_str})'
+    )
+
+    sql2 = f"""
+    SELECT DISTINCT user_id, play_date
+    FROM (
+        {activity_union}
+    ) t
+    ORDER BY user_id, play_date;
+    """
+
+    print("Querying user active dates...")
+    activity_rows = run_query(sql2, conn)
+    print(f"{len(activity_rows)} activity records")
+
+    # Organize: user_id -> set of active_dates
+    user_active_dates = defaultdict(set)
+    for user_id, play_date in activity_rows:
+        date_str = play_date[:10] if play_date else ''
+        if date_str:
+            user_active_dates[int(user_id)].add(date_str)
+
+    # Step 3: compute retention for each cohort
+    results = []
+    for cohort_date_str in sorted(cohort_users.keys()):
+        cohort_date = date.fromisoformat(cohort_date_str)
+        if cohort_date < DAYS_30_AGO - timedelta(days=30):
+            continue
+
+        users = cohort_users[cohort_date_str]
+        total = len(users)
+
+        # Next-day retention (D+1)
+        d1 = (cohort_date + timedelta(days=1)).isoformat()
+        d1_active = sum(1 for uid in users if d1 in user_active_dates.get(int(uid), set()))
+
+        # 7-day retention (D+7)
+        d7 = (cohort_date + timedelta(days=7)).isoformat()
+        d7_active = sum(1 for uid in users if d7 in user_active_dates.get(int(uid), set()))
+
+        # 14-day retention (D+14)
+        d14 = (cohort_date + timedelta(days=14)).isoformat()
+        d14_active = sum(1 for uid in users if d14 in user_active_dates.get(int(uid), set()))
+
+        # 30-day retention (D+30)
+        d30 = (cohort_date + timedelta(days=30)).isoformat()
+        d30_active = sum(1 for uid in users if d30 in user_active_dates.get(int(uid), set()))
+
+        # Only computed when the target day is not after today
+        d1_valid = (cohort_date + timedelta(days=1)) <= TODAY
+        d7_valid = (cohort_date + timedelta(days=7)) <= TODAY
+        d14_valid = (cohort_date + timedelta(days=14)) <= TODAY
+        d30_valid = (cohort_date + timedelta(days=30)) <= TODAY
+
+        results.append({
+            'cohort_date': cohort_date_str,
+            'total': total,
+            'd1_rate': f"{d1_active/total*100:.1f}%" if d1_valid and total > 0 else 'N/A',
+            'd7_rate': f"{d7_active/total*100:.1f}%" if d7_valid and total > 0 else 'N/A',
+            'd14_rate': f"{d14_active/total*100:.1f}%" if d14_valid and total > 0 else 'N/A',
+            'd30_rate': f"{d30_active/total*100:.1f}%" if d30_valid and total > 0 else 'N/A',
+            'd1_num': f"{d1_active}/{total}" if d1_valid else 'N/A',
+            'd7_num': f"{d7_active}/{total}" if d7_valid else 'N/A',
+            'd14_num': f"{d14_active}/{total}" if d14_valid else 'N/A',
+            'd30_num': f"{d30_active}/{total}" if d30_valid else 'N/A',
+        })
+
+    # Show only the last 30 days
+    results_recent = [r for r in results if r['cohort_date'] >= DAYS_30_AGO.isoformat()]
+
+    if results_recent:
+        print(f"\n{'Cohort date':<12} {'New users':>8} {'D+1':>10} {'D+7':>10} {'D+14':>10} {'D+30':>10}")
+        print("-" * 65)
+        for r in results_recent:
+            print(f"{r['cohort_date']:<12} {r['total']:>8} {r['d1_rate']:>10} {r['d7_rate']:>10} {r['d14_rate']:>10} {r['d30_rate']:>10}")
+
+        # Summary (unweighted mean of the per-cohort rates)
+        total_new_users = sum(r['total'] for r in results_recent)
+        valid_d1 = [r for r in results_recent if r['d1_rate'] != 'N/A']
+        valid_d7 = [r for r in results_recent if r['d7_rate'] != 'N/A']
+        valid_d14 = [r for r in results_recent if r['d14_rate'] != 'N/A']
+        valid_d30 = [r for r in results_recent if r['d30_rate'] != 'N/A']
+
+        print(f"\n--- Summary ---")
+        print(f"Total new users: {total_new_users}")
+        if valid_d1:
+            avg_d1 = sum(float(r['d1_rate'].replace('%','')) for r in valid_d1) / len(valid_d1)
+            print(f"Average next-day retention: {avg_d1:.1f}% ({len(valid_d1)} cohorts)")
+        if valid_d7:
+            avg_d7 = sum(float(r['d7_rate'].replace('%','')) for r in valid_d7) / len(valid_d7)
+            print(f"Average 7-day retention: {avg_d7:.1f}% ({len(valid_d7)} cohorts)")
+        if valid_d14:
+            avg_d14 = sum(float(r['d14_rate'].replace('%','')) for r in valid_d14) / len(valid_d14)
+            print(f"Average 14-day retention: {avg_d14:.1f}% ({len(valid_d14)} cohorts)")
+        if valid_d30:
+            avg_d30 = sum(float(r['d30_rate'].replace('%','')) for r in valid_d30) / len(valid_d30)
+            print(f"Average 30-day retention: {avg_d30:.1f}% ({len(valid_d30)} cohorts)")
+    else:
+        print("No new-user data in the last 30 days")
+
+    return results_recent
+
+
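+def analyze_paid_user_activity(conn):
+    """Analyze activity frequency and per-session duration of L1 paying users over the last 30 days"""
+    print("\n" + "="*80)
+    print("📊 2. L1 paying users: activity frequency and per-session duration")
+    print("="*80)
+    print(f"Period: {DAYS_30_AGO} ~ {TODAY}")
+
+    # Step 1: find L1 paying users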
+    sql_paid = f"""
+    WITH l1_users AS (
+        SELECT DISTINCT d.user_id, d.account_id
+        FROM bi_user_course_detail d
+        WHERE d.course_level = 'A1' AND d.deleted_at IS NULL
+    ),
+    paid_users AS (
+        SELECT DISTINCT o.account_id
+        FROM bi_vala_order o
+        JOIN bi_vala_app_account a ON o.account_id = a.id AND a.status = 1
+        WHERE o.order_status IN (2, 3, 4)
+          AND o.pay_amount_int > 0
+    ),
+    l1_paid AS (
+        SELECT DISTINCT l.user_id, l.account_id
+        FROM l1_users l
+        JOIN paid_users p ON l.account_id = p.account_id
+    )
+    SELECT user_id, account_id FROM l1_paid;
+    """
+    print("Querying L1 paying users...")
+    paid_rows = run_query(sql_paid, conn)
+    print(f"L1 paying users: {len(paid_rows)}")
+
+    if not paid_rows:
+        print("No L1 paying user data")
+        return
+
+    paid_user_ids = set(int(r[0]) for r in paid_rows)
+    paid_user_ids_str = ','.join(str(uid) for uid in paid_user_ids)
+
+    # Step 2: fetch these users' study records for the last 30 days (one row per session)
+    union_play = build_union_all(
+        'bi_user_chapter_play_record',
+        'user_id, chapter_id, chapter_unique_id, to_char(created_at, \'YYYY-MM-DD\') as play_date, created_at',
+        f'WHERE chapter_id IN ({CHAPTER_IDS_STR}) AND user_id IN ({paid_user_ids_str}) AND created_at >= \'{DAYS_30_AGO}\''
+    )
+
+    sql_sessions = f"""
+    SELECT user_id, play_date, chapter_unique_id, MIN(created_at) as session_start
+    FROM (
+        {union_play}
+    ) t
+    GROUP BY user_id, play_date, chapter_unique_id
+    ORDER BY user_id, play_date, chapter_unique_id;
+    """
+
+    print("Querying paying users' study sessions...")
+    session_rows = run_query(sql_sessions, conn)
+    print(f"{len(session_rows)} study sessions")
+
+    if not session_rows:
+        print("No study records")
+        return
+
+    # Step 3: get the duration of each session
+    all_chapter_unique_ids = set(r[2] for r in session_rows)
+    # Query the component play records in batches to get durations,
+    # since there may be many chapter_unique_id values
+    id_batches = []
+    batch_size = 500
+    ids_list = list(all_chapter_unique_ids)
+    for i in range(0, len(ids_list), batch_size):
+        id_batches.append(ids_list[i:i+batch_size])
+
+    print(f"Querying session durations ({len(id_batches)} batches)...")
+    session_duration = {}
+    for batch_idx, batch in enumerate(id_batches):
+        ids_str = "','".join(str(x) for x in batch)
+        # Build UNION ALL with per-shard GROUP BY
+        parts = []
+        for i in range(SHARD_COUNT):
+            parts.append(f"""
+            SELECT chapter_unique_id, SUM(interval_time) as total_ms
+            FROM bi_user_component_play_record_{i}
+            WHERE chapter_unique_id IN ('{ids_str}')
+            GROUP BY chapter_unique_id
+            """)
+        union_comp = ' UNION ALL '.join(parts)
+        sql_dur = f"""
+        SELECT chapter_unique_id, SUM(total_ms) as session_ms
+        FROM ({union_comp}) t
+        GROUP BY chapter_unique_id;
+        """
+        dur_rows = run_query(sql_dur, conn)
+        for cu_id, ms in dur_rows:
+            session_duration[cu_id] = float(ms or 0) / 60000.0  # convert ms to minutes
+
+    # Step 4: compute metrics
+    # Activity frequency = average number of sessions per active user-day
+    # Per-session duration = average duration of a session
+    user_daily_sessions = defaultdict(lambda: defaultdict(int))  # user_id -> date -> session_count
+    user_daily_duration = defaultdict(lambda: defaultdict(float))  # user_id -> date -> total_duration
+
+    for user_id, play_date, cu_id, session_start in session_rows:
+        date_str = play_date[:10] if play_date else ''
+        if date_str:
+            user_daily_sessions[int(user_id)][date_str] += 1
+            user_daily_duration[int(user_id)][date_str] += session_duration.get(cu_id, 0)
+
+    # Compute global metrics
+    all_daily_freq = []  # session count per user per day
+    all_session_dur = []  # duration of each session
+
+    for uid, dates in user_daily_sessions.items():
+        for d, cnt in dates.items():
+            all_daily_freq.append(cnt)
+
+    for cu_id, dur in session_duration.items():  # keyed by chapter_unique_id
+        if dur > 0:
+            all_session_dur.append(dur)
+
+    total_users = len(user_daily_sessions)
+    total_days = len(all_daily_freq)
+    avg_daily_sessions = sum(all_daily_freq) / len(all_daily_freq) if all_daily_freq else 0
+    avg_session_duration = sum(all_session_dur) / len(all_session_dur) if all_session_dur else 0
+    median_session_duration = sorted(all_session_dur)[len(all_session_dur)//2] if all_session_dur else 0
+    total_sessions = sum(all_daily_freq)
+
+    print(f"\n--- L1 paying user activity ---")
+    print(f"Paying users with activity: {total_users}")
+    print(f"Active user-days in the last 30 days: {total_days}")
+    print(f"Total sessions in the last 30 days: {total_sessions}")
+    print(f"Average sessions per active user-day (activity frequency): {avg_daily_sessions:.2f} per day")
+    print(f"Mean session duration: {avg_session_duration:.1f} min")
+    print(f"Median session duration: {median_session_duration:.1f} min")
+
+    # Distribution analysis
+    freq_dist = defaultdict(int)
+    for f in all_daily_freq:
+        if f <= 1:
+            freq_dist['1'] += 1
+        elif f <= 3:
+            freq_dist['2-3'] += 1
+        elif f <= 5:
+            freq_dist['4-5'] += 1
+        else:
+            freq_dist['6+'] += 1
+
+    print(f"\nDistribution of daily session counts:")
+    for k in ['1', '2-3', '4-5', '6+']:
+        cnt = freq_dist.get(k, 0)
+        print(f"  {k}: {cnt} user-days ({cnt/total_days*100:.1f}%)" if total_days > 0 else f"  {k}: 0")
+
+    return {
+        'total_users': total_users,
+        'total_days': total_days,
+        'total_sessions': total_sessions,
+        'avg_daily_sessions': avg_daily_sessions,
+        'avg_session_duration': avg_session_duration,
+        'median_session_duration': median_session_duration,
+    }
+
+
+def analyze_unit_churn(conn):
+    """Analyze churn rate for each L1 Unit"""
+    print("\n" + "="*80)
+    print("📊 3. Churn rate by L1 Unit")
+    print("="*80)
+    print("Churn definition: 1 - (users entering Unit(N+1) / users completing Unit(N-1))")
+    print("Entering = entering the first lesson (L01) of Unit(N+1)")
+    print("Completing = completing the last lesson (L05) of Unit(N-1)")
+    print()
+
+    # Get the user_id of every L1 user
+    sql_l1 = """
+    SELECT DISTINCT d.user_id
+    FROM bi_user_course_detail d
+    WHERE d.course_level = 'A1' AND d.deleted_at IS NULL;
+    """
+    l1_rows = run_query(sql_l1, conn)
+    l1_user_ids = set(int(r[0]) for r in l1_rows)
+    l1_user_ids_str = ','.join(str(uid) for uid in l1_user_ids)
+
+    # Collect all chapter_ids to query
+    all_chapter_ids = set()
+    for unit_name, lessons in L1_CHAPTERS.items():
+        for lesson_name, ch_id in lessons.items():
+            all_chapter_ids.add(ch_id)
+    chapter_ids_str = ','.join(str(c) for c in all_chapter_ids)
+
+    # Query chapter play records for all L1 users
+    union_play = build_union_all(
+        'bi_user_chapter_play_record',
+        'user_id, chapter_id, play_status',
+        f'WHERE chapter_id IN ({chapter_ids_str}) AND user_id IN ({l1_user_ids_str})'
+    )
+
+    sql_plays = f"""
+    SELECT user_id, chapter_id, play_status
+    FROM ({union_play}) t;
+    """
+
+    print("Querying chapter play records...")
+    play_rows = run_query(sql_plays, conn)
+    print(f"{len(play_rows)} records")
+
+    # Organize data
+    # user_id -> set of chapters they've entered (any play_status)
+    user_entered = defaultdict(set)
+    # user_id -> set of chapters they've completed (play_status = 1)
+    user_completed = defaultdict(set)
+
+    for user_id, chapter_id, play_status in play_rows:
+        user_entered[int(user_id)].add(int(chapter_id))
+        if play_status == 1:
+            user_completed[int(user_id)].add(int(chapter_id))
+
+    # Compute churn for each Unit
+    # Unit N churn: share of users who completed Unit(N-1) L05 but did not enter Unit(N+1) L01
+    # churn = 1 - (users who entered Unit(N+1) L01 and completed Unit(N-1) L05 / users who completed Unit(N-1) L05)
+    print(f"\n{'Unit':<8} {'Prev unit':<15} {'Completed prev L05':>10} {'Entered next L01':>10} {'Retention':>10} {'Churn':>10}")
+    print("-" * 70)
+
+    unit_order = ['U00', 'U01', 'U02', 'U03', 'U04', 'U05', 'U06', 'U07', 'U08', 'U09', 'U10', 'U11', 'U12']
+
+    for i in range(1, len(unit_order) - 1):  # U01 to U11
+        prev_unit = unit_order[i - 1]  # Unit(N-1)
+        curr_unit = unit_order[i]     # Unit(N)
+        next_unit = unit_order[i + 1]  # Unit(N+1)
+
+        prev_l05 = L1_CHAPTERS[prev_unit]['L05']
+        next_l01 = L1_CHAPTERS[next_unit]['L01']
+
+        # Users who completed Unit(N-1) L05
+        completed_prev = set()
+        for uid in l1_user_ids:
+            if prev_l05 in user_completed.get(uid, set()):
+                completed_prev.add(uid)
+
+        # Users who entered Unit(N+1) L01
+        entered_next = set()
+        for uid in l1_user_ids:
+            if next_l01 in user_entered.get(uid, set()):
+                entered_next.add(uid)
+
+        # Users meeting both conditions (completed the previous Unit and entered the next Unit)
+        both = completed_prev & entered_next
+
+        denom = len(completed_prev)
+        num = len(both)
+
+        if denom > 0:
+            retention = num / denom * 100
+            churn = 100 - retention
+            print(f"{curr_unit:<8} {prev_unit}(L05={prev_l05})  {denom:>10}  {num:>10}  {retention:>9.1f}%  {churn:>9.1f}%")
+        else:
+            print(f"{curr_unit:<8} {prev_unit}(L05={prev_l05})  {denom:>10}  {num:>10}  {'N/A':>10}  {'N/A':>10}")
+
+    # Unit 12 has no churn row (U13 does not exist), so also report a per-Unit
+    # completion rate: users who completed L05 / users who entered L01
+    print()
+    print("--- Supplement: Unit completion ---")
+    for unit_name in unit_order:
+        l01 = L1_CHAPTERS[unit_name]['L01']
+        l05 = L1_CHAPTERS[unit_name]['L05']
+        entered = sum(1 for uid in l1_user_ids if l01 in user_entered.get(uid, set()))
+        completed = sum(1 for uid in l1_user_ids if l05 in user_completed.get(uid, set()))
+        if entered > 0:
+            print(f"{unit_name}: entered {entered}, completed {completed}, completion rate {completed/entered*100:.1f}%")
+
+
+def main():
+    if not PG_CONFIG['password']:
+        # Try to read from secrets.env
+        secrets_file = os.path.join(os.path.dirname(os.path.dirname(os.path.abspath(__file__))), 'secrets.env')
+        if os.path.exists(secrets_file):
+            with open(secrets_file) as f:
+                for line in f:
+                    if 'PG_ONLINE_PASSWORD' in line:
+                        PG_CONFIG['password'] = line.strip().split('=', 1)[1].strip('"').strip("'")
+                        break
+
+    if not PG_CONFIG['password']:
+        print("ERROR: PG_ONLINE_PASSWORD not found")
+        sys.exit(1)
+
+    conn = get_conn()
+    try:
+        analyze_retention(conn)
+        analyze_paid_user_activity(conn)
+        analyze_unit_churn(conn)
+    finally:
+        conn.close()
+
+if __name__ == '__main__':
+    main()
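Reviewer note: the D+1 / D+7 / D+14 / D+30 blocks in `analyze_retention` repeat the same computation four times. A minimal sketch of a shared helper follows, under stated assumptions: it works on `date` objects rather than the ISO strings the script uses, and the name `day_n_retention` is hypothetical. Like the script, it counts a user as retained only if they were active exactly on day D+n, not at any time within n days.

```python
from datetime import date, timedelta


def day_n_retention(cohort_date, cohort_users, user_active_dates, n, today):
    """Return D+n retention in percent, or None if D+n is in the future or the cohort is empty.

    cohort_users: set of user ids whose first study date is cohort_date.
    user_active_dates: dict mapping user id -> set of dates with activity.
    """
    target = cohort_date + timedelta(days=n)
    if target > today or not cohort_users:
        return None
    retained = sum(
        1 for uid in cohort_users if target in user_active_dates.get(uid, set())
    )
    return retained / len(cohort_users) * 100


if __name__ == "__main__":
    cohort_day = date(2026, 5, 1)
    active = {
        1: {date(2026, 5, 2)},
        2: {date(2026, 5, 2), date(2026, 5, 8)},
        3: {date(2026, 5, 3)},
    }
    for n in (1, 7, 14, 30):
        print(n, day_n_retention(cohort_day, {1, 2, 3, 4}, active, n, date(2026, 5, 10)))
```

Returning `None` for immature cohorts keeps them out of any later averaging, which mirrors the script's 'N/A' handling.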