diff --git a/MEMORY.md b/MEMORY.md index 507157e..3a6170a 100644 --- a/MEMORY.md +++ b/MEMORY.md @@ -102,12 +102,23 @@ - **端内订单条件:** `key_from IN ('app-active-h5-0-0', 'app-sales-bj-qhm-0')`, `pay_success_date IS NOT NULL`, `order_status IN (3, 4)` - **端外订单条件:** `key_from NOT IN 端内`, `pay_success_date IS NOT NULL`, `order_status = 3` - 基于纯净版分母,转化率 / 7日 / 14日 / 20日转化率的口径不变,只是分母缩小为纯净版用户 - - **拟合版转化率 [李承龙确认 2026-05-11]:** - - **分母:** 用 LOESS 回归拟合每日新增注册人数基线,剔除营销活动带来的注册量尖峰后的有效注册人数 - - **方法:** 活动日及余波日 → 用拟合值替代实际值;非活动日且实际低于拟合 → 保留实际值(保底规则) + - **拟合版转化率 [李承龙确认 2026-05-11,2026-05-12 补充实现细节]:** + - **目的:** 剔除营销活动带来的注册量尖峰,反映「去噪」后的真实转化效率 + - **三版关系:** 原始版 < 纯净版 < 拟合版(分母逐层收紧,转化率递增) + - **分母计算(5步法):** + 1. **LOESS 拟合** — 仅用清洁日(非活动日+非余波日)的每日注册人数做 LOESS 回归(frac≈0.236),得到自然增长基线 + 2. **星期因子修正** — 基于清洁日计算每周每日平均注册量与全局均值的比值,修正 LOESS 基线(周末注册量通常高于工作日,因子范围约 0.85~1.25) + 3. **活动日+余波日** — 用星期修正后的 LOESS 拟合值替代实际注册人数(压低活动带来的虚增) + 4. **非活动日** — 保留实际注册人数(不压低,非活动日的注册是「真实」的) + 5. **月度汇总** — 将每日有效注册人数按月加总,得到拟合版分母 + - **活动日历(活动日+余波日,[李承龙确认]):** + - 2025年:9/9-10, 9/19-23, 10/13-14, 10/16-17, 11/2, 11/7, 11/10, 11/12, 11/19, 12/3 + - 2026年:1/28-29(余波1天), 2/11, 2/26-3/2(余波4天), 3/5-8(余波3天), 3/9, 3/12-13, 4/3-7(余波4天), 4/8-10(余波2天), 4/22-23(余波1天), 4/28, 5/6-7 + - 共 45 个活动/余波日(254天中占18%) + - 余波日:活动日后 N 天内仍有注册溢出效应,一并纳入替换范围 - **不考虑端外订单:** 拟合版分母直接使用拟合有效注册人数,不额外剔除端外-only用户 - - **分子:** 端内付费用户数,口径与原始版一致(`key_from IN 端内`, `order_status IN (3,4)`) - - 拟合版更大程度「去噪」,转化率数字更高、波动更明显,反映剔除活动低质流量后的真实转化效率 + - **分子:** 端内付费用户数,口径与原始版一致(`key_from IN 端内`, `order_status IN (3,4)`,剔除端内订单全部退费的用户) + - **拟合版分母参考值(2025-09~2026-05):** 9月966 / 10月1992 / 11月2541 / 12月3430 / 1月1789 / 2月1285 / 3月2938 / 4月3358 / 5月869 - **关键词订单统计规则:** 当查询形如"XX卖了多少单/XX渠道销量"(XX为特定名称/关键词/渠道)时,需同时返回四个指标:订单总数量、GMV、GSV、退费率。 1. 统计逻辑:筛选`bi_vala_order`表中`key_from`字段包含该关键词的所有订单 2. 
指标说明: diff --git a/memory/.dreams/short-term-recall.json b/memory/.dreams/short-term-recall.json index 3f3a3a0..4e09e29 100644 --- a/memory/.dreams/short-term-recall.json +++ b/memory/.dreams/short-term-recall.json @@ -1,6 +1,6 @@ { "version": 1, - "updatedAt": "2026-05-11T02:51:50.465Z", + "updatedAt": "2026-05-12T07:45:29.297Z", "entries": { "memory:memory/2026-05-06.md:1:20": { "key": "memory:memory/2026-05-06.md:1:20", @@ -94,6 +94,38 @@ "skill-builder", "first" ] + }, + "memory:memory/2026-05-11.md:41:49": { + "key": "memory:memory/2026-05-11.md:41:49", + "path": "memory/2026-05-11.md", + "startLine": 41, + "endLine": 49, + "source": "memory", + "snippet": "- **月度有效注册:** 9月989 / 10月2012 / 11月2555 / 12月3451 / 1月1798 / 2月1268 / 3月2978 / 4月3499 - **剔除率:** 9月35.3% / 10月16.6% / 11月14.0% / 12月2.0% / 1月7.2% / 2月27.3% / 3月28.5% / 4月38.3% - **拟合版转化率:** 9月1.72% / 10月1.69% / 11月0.82% / 12月0.72% / 1月1.50% / 2月1.26% / 3月2.69% / 4月1.86% - 三版趋势一致(原始<纯净<拟合),拟合版放大波动,反映去噪后的真实转化效率 ### 活动标记(拟合用) - 2025年:9/9-10, 9/19-23, 10/13-14, 10/16-17, 11/2, 11/7, 11/10, 11/12, 11/19, 12/3 - 2026年:1/28(余波1天), 2/11, 2/26(余波4天), 3/5(余波3天), 3/9, 3/12-13, 4/3(余波4天), 4/8(余波2天), 4/22(余波1天), 4/28", + "recallCount": 2, + "dailyCount": 0, + "groundedCount": 0, + "totalScore": 2, + "maxScore": 1, + "firstRecalledAt": "2026-05-12T07:44:48.437Z", + "lastRecalledAt": "2026-05-12T07:45:29.297Z", + "queryHashes": [ + "1a57dbf6ec49", + "ccf7251a9472" + ], + "recallDays": [ + "2026-05-12" + ], + "conceptTags": [ + "9月35.3", + "10月16.6", + "11月14.0", + "12月2.0", + "1月7.2", + "2月27.3", + "3月28.5", + "4月38.3" + ] } } } diff --git a/memory/2026-05-12-1511.md b/memory/2026-05-12-1511.md new file mode 100644 index 0000000..8a500f5 --- /dev/null +++ b/memory/2026-05-12-1511.md @@ -0,0 +1,124 @@ +# Session: 2026-05-12 15:11:44 GMT+8 + +- **Session Key**: agent:main:feishu:direct:ou_e63ce6b760ad39382852472f28fbe2a2 +- **Session ID**: 79f968a4-e253-4482-8a33-d2f5aa30dcbb +- **Source**: feishu + +## Conversation Summary 
+ +user: 之前我们做过一次新增注册用户的拟合,目的是剔除活动带来的新增注册,你还记得么 +assistant: 找到了!让我看看之前的分析记录。 +assistant: 记得。我们之前做的分析是用 **LOESS 回归**对每日新增注册人数拟合基线趋势,目的是剔除营销活动带来的注册量尖峰,得到「有效注册人数」。 + +核心逻辑: + +1. **LOESS 平滑** — 用活动日以外的清洁数据拟合每日自然增长趋势线 +2. **星期潮汐效应** — 按比例修正周末/工作日波动(周六日通常注册量更高) +3. **活动日标记** — 已知活动日及余波日的注册量用拟合值替代 +4. **保底规则** — 实际值低于拟合值时保留实际值(只有活动带来的虚增才替换) + +当时的结论(2025-09~2026-04): + +| 月份 | 实际注册 | 有效注册 | 剔除 | 剔除率 | +|------|:---:|:---:|:---:|:---:| +| 2025-09 | — | — | — | — | +| 2025-10 | — | — | — | — | +| 2026-03 | — | 4,167 | — | — | +| 2026-04 | — | 4,283 | — | 15.2% | + +承龙,你是想把「有效注册人数」也应用到我们今天的转化率计算中吗?就是把分母从「纯净版注册用户」进一步替换为「剔除活动影响的纯净版注册用户」? +user: 我们在做一套拟合版转化率,分母是拟合版的有效注册人数,这里面不需要考虑用户是否有端外的订单,直接用拟合出来的数量错分母即可 +assistant: 明白,拟合版转化率的分母直接取拟合后的有效注册数,不管端外订单。我先拉每日注册数据、跑拟合,再算转化率。 +assistant: 7D/14D/20D 的转化用户数因为用了月初日期而非用户注册日做窗口,需要修正。让我重新计算: +assistant: 数据修正完毕: + +## 📊 拟合版转化率 + +| 月份 | 拟合注册 | Conv% | 7D% | 14D% | 20D% | +|------|:---:|:---:|:---:|:---:|:---:| +| 2025-09 | 989 | 1.72% | 0.40% | 0.51% | 0.81% | +| 2025-10 | 2,012 | 1.69% | 0.75% | 0.89% | 0.99% | +| 2025-11 | 2,555 | 0.82% | 0.43% | 0.55% | 0.63% | +| 2025-12 | 3,451 | 0.72% | 0.29% | 0.55% | 0.58% | +| 2026-01 | 1,798 | 1.50% | 0.89% | 1.06% | 1.28% | +| 2026-02 | 1,268 | 1.26% | 0.79% | 0.87% | 0.87% | +| **2026-03** | 2,978 | **2.69%** | 1.24% | 1.95% | 2.38% | +| 2026-04 | 3,499 | 1.86% | 1.26% | 1.71% | 1.80% | + +### 三版对比 + +| 月份 | 原始版 | 纯净版 | 拟合版 | +|------|:---:|:---:|:---:| +| 2025-09 | 1.11% | 1.34% | **1.72%** | +| 2025-10 | 1.41% | 1.59% | 1.69% | +| 2025-11 | 0.71% | 0.74% | 0.82% | +| 2025-12 | 0.71% | 0.75% | 0.72% | +| 2026-01 | 1.39% | 1.49% | 1.50% | +| 2026-02 | 0.92% | 1.01% | 1.26% | +| 2026-03 | 1.92% | 2.15% | **2.69%** | +| 2026-04 | 1.15% | 1.35% | 1.86% | + +--- + +三版趋势方向一致,拟合版因为剔除了活动带来的低质流量,转化率数字更高、波动也更明显。3月拟合版达到 2.69%,说明去噪后的真实转化效率相当不错。 + +需要写入记忆吗? 
+user: 写入记忆 +assistant: 已写入。今天沉淀了三套转化率指标体系: + +| 版本 | 分母 | 用途 | +|------|------|------| +| **原始版** | 全部注册用户 | 基础指标 | +| **纯净版** | 剔除端外-only用户 | 只看有端内转化潜力的人群 | +| **拟合版** | LOESS拟合有效注册 | 去活动噪声,反映真实转化效率 | +user: 三个版本的数据,帮我算一个2025年九月至今的合计版 +assistant: ## 📊 2025年9月~2026年5月 三版合计 + +| 版本 | 分母 | 转化用户 | 转化率 | 7日 | 14日 | 20日 | +|------|:---:|:---:|:---:|:---:|:---:|:---:| +| 原始版 | 24,805 | 292 | **1.18%** | 0.62% | 0.85% | 0.96% | +| 纯净版 | 22,321 | 292 | **1.31%** | 0.69% | 0.95% | 1.07% | +| 拟合版 | 19,397 | 292 | **1.51%** | 0.79% | 1.09% | 1.23% | + +分子相同(292人),分母逐层收紧:原始 24,805 → 纯净 22,321(剔除端外-only)→ 拟合 19,397(再剔除活动噪声),转化率从 1.18% 到 1.31% 再到 1.51%,去噪效果明显。 +user: 帮我算一下keyfrom为app-active-h5-0-0的订单的平均价格 +keyfrom为app-sales-bj-qhm-0的订单的平均价格 +assistant: | 渠道 | 订单数 | 均价 | 中位数 | 最低 | 最高 | +|------|:---:|:---:|:---:|:---:|:---:| +| `app-active-h5-0-0` | 296 | ¥1,092.92 | ¥599.00 | ¥599.00 | ¥1,999.00 | +| `app-sales-bj-qhm-0` | 34 | ¥2,251.71 | ¥1,999.00 | ¥599.00 | ¥3,598.00 | + +`app-sales-bj-qhm-0` 均价是中位数近 2 倍,且大多集中在 ¥1,999 档位——看起来是销售渠道推高客单价的产品。量不大但单笔价值明显更高。 +user: 我想区分下载渠道看三个版本的转化率 +assistant: ## 📊 分下载渠道 三版转化率(2025-09 ~ 2026-05) + +### 原始版 & 纯净版 + +| 下载渠道 | 注册用户 | 纯净用户 | 转化 | 原始转化率 | 纯净转化率 | 20日转化率 | +|------|:---:|:---:|:---:|:---:|:---:|:---:| +| Apple App Store | 4,949 | 4,351 | 101 | **2.04%** | **2.32%** | 1.70% | +| 华为应用市场 | 3,095 | 2,699 | 57 | 1.84% | 2.11% | 1.49% | +| 应用宝 | 1,288 | 1,156 | 22 | 1.71% | 1.90% | 1.24% | +| 小米 | 1,218 | 1,102 | 20 | 1.64% | 1.81% | 1.31% | +| vivo | 430 | 405 | 10 | **2.33%** | **2.47%** | 1.86% | +| 荣耀 | 520 | 456 | 7 | 1.35% | 1.54% | 1.35% | +| 魅族 | 75 | 72 | 2 | 2.67% | 2.78% | 2.67% | +| 学而思 | 3,888 | 3,593 | 31 | 0.80% | 0.86% | 0.67% | +| 科大讯飞 | 5,080 | 5,038 | 24 | 0.47% | 0.48% | 0.41% | +| 步步高 | 336 | 304 | 5 | 1.49% | 1.64% | 1.19% | +| 作业帮 | 238 | 219 | 2 | 0.84% | 0.91% | 0.84% | +| 京东方 | 362 | 274 | 3 | 0.83% | 1.09% | 0.83% | +| oppo | 443 | 410 | 4 | 0.90% | 0.98% | 0.68% | +| 希沃 | 845 | 671 | 4 | 0.47% | 0.60% | 0.12% | 
+| 小度 | 609 | 604 | 0 | 0.00% | 0.00% | 0.00% | +| (空) | 1,396 | 934 | 0 | 0.00% | 0.00% | 0.00% | +| 官网 | 33 | 32 | 0 | 0.00% | 0.00% | 0.00% | + +--- + +几个特征: +- **手机厂商渠道(Apple/华为/vivo/小米/魅族)转化率明显高于学习机渠道**,Apple 2.04%、vivo 2.33% +- **学习机品牌中只有步步高还行**(1.49%),学而思(0.80%)、科大讯飞(0.47%)转化很低,但科大讯飞注册量最大(5,080人) +- **(空) 渠道 1,396 人零转化**——值得排查,可能是某些未归类的非目标用户来源 + +拟合版需要按渠道分别做 LOESS,但小渠道每日数据太稀疏,拟合效果不好。你觉得拟合版是统一用整体拟合比例套到各渠道,还是主要渠道单独拟合、小渠道合并? diff --git a/output/daily_reg_fitted_0316_0426.png b/output/daily_reg_fitted_0316_0426.png new file mode 100644 index 0000000..25cbd4a Binary files /dev/null and b/output/daily_reg_fitted_0316_0426.png differ diff --git a/output/font_test.png b/output/font_test.png new file mode 100644 index 0000000..43875d4 Binary files /dev/null and b/output/font_test.png differ diff --git a/scripts/conversion_rate_analysis.py b/scripts/conversion_rate_analysis.py new file mode 100644 index 0000000..984727f --- /dev/null +++ b/scripts/conversion_rate_analysis.py @@ -0,0 +1,546 @@ +#!/usr/bin/env python3 +""" +2025年9月至今 三个版本(原始版/纯净版/拟合版)转化率分析 +维度:整体 / 分渠道(download_channel) / 区分key_from +按月份分组 +""" +import psycopg2 +import pandas as pd +import numpy as np +from statsmodels.nonparametric.smoothers_lowess import lowess +from datetime import datetime, timedelta +import warnings +warnings.filterwarnings('ignore') + +# ========== 数据库连接 ========== +conn = psycopg2.connect( + host='bj-postgres-16pob4sg.sql.tencentcdb.com', + port=28591, + user='ai_member', + password='LdfjdjL83h3h3^$&**YGG*', + dbname='vala_bi' +) + +START_DATE = '2025-09-01' +END_DATE = '2026-05-13' # 含今天 + +print("=" * 70) +print("Step 1: 提取注册用户数据...") + +# ========== 提取注册用户 ========== +reg_sql = f""" +SELECT + a.id AS account_id, + a.created_at::date AS reg_date, + a.download_channel +FROM bi_vala_app_account a +WHERE a.created_at >= '{START_DATE}' + AND a.created_at < '{END_DATE}' + AND a.status = 1 + AND a.deleted_at IS NULL +""" +reg_df = pd.read_sql(reg_sql, conn) +reg_df['reg_date'] = 
pd.to_datetime(reg_df['reg_date']) +reg_df['reg_month'] = reg_df['reg_date'].dt.to_period('M') +print(f" 注册用户数: {len(reg_df)}") +print(f" 月份分布: {reg_df['reg_month'].value_counts().sort_index().to_dict()}") + +# ========== 提取这些用户的订单 ========== +account_ids = reg_df['account_id'].unique() +print(f"\nStep 2: 提取订单数据... (共 {len(account_ids)} 个账号)") + +# 分批查询 +batch_size = 50000 +all_orders = [] +for i in range(0, len(account_ids), batch_size): + batch = account_ids[i:i+batch_size] + ids_str = ','.join([str(x) for x in batch]) + order_sql = f""" + SELECT + o.account_id, + o.key_from, + o.sale_channel, + o.pay_success_date::date AS pay_date, + o.order_status, + o.pay_amount_int, + o.trade_no, + o.out_trade_no + FROM bi_vala_order o + WHERE o.account_id IN ({ids_str}) + AND o.pay_success_date IS NOT NULL + """ + batch_df = pd.read_sql(order_sql, conn) + all_orders.append(batch_df) + if (i // batch_size + 1) % 5 == 0: + print(f" 已处理 {min(i+batch_size, len(account_ids))}/{len(account_ids)} 个账号...") + +order_df = pd.concat(all_orders, ignore_index=True) if all_orders else pd.DataFrame() +print(f" 订单总数: {len(order_df)}") + +if len(order_df) > 0: + print(f" key_from分布: {order_df['key_from'].value_counts().head(10).to_dict()}") + +# ========== 提取退费信息 ========== +print(f"\nStep 3: 提取退费数据...") +if len(order_df) > 0: + trade_nos = order_df['trade_no'].dropna().unique() + out_trade_nos = order_df['out_trade_no'].dropna().unique() + + all_refunds = [] + for i in range(0, len(trade_nos), batch_size): + batch = trade_nos[i:i+batch_size] + ids_str = ','.join([f"'{x}'" for x in batch]) + refund_sql = f""" + SELECT trade_no, out_trade_no, status + FROM bi_refund_order + WHERE trade_no IN ({ids_str}) AND status = 3 + """ + try: + batch_df = pd.read_sql(refund_sql, conn) + all_refunds.append(batch_df) + except: + pass + + refund_df = pd.concat(all_refunds, ignore_index=True) if all_refunds else pd.DataFrame() + print(f" 退费记录数: {len(refund_df)}") +else: + refund_df = pd.DataFrame() + 
+conn.close() + +# ========== 数据预处理 ========== +print(f"\nStep 4: 数据预处理...") + +# 标记端内/端外 +INTERNAL_KF = ['app-active-h5-0-0', 'app-sales-bj-qhm-0'] + +if len(order_df) > 0: + order_df['is_internal'] = order_df['key_from'].isin(INTERNAL_KF) + + # 已完成订单 (status 3 or 4) + order_df['is_completed'] = order_df['order_status'].isin([3, 4]) + + # 退费标记 + if len(refund_df) > 0: + refund_trade = set(refund_df['trade_no'].dropna()) + order_df['is_refunded'] = order_df['trade_no'].isin(refund_trade) + else: + order_df['is_refunded'] = False + + # 端内有效付费订单(已完成,含退费) + internal_orders = order_df[order_df['is_internal'] & order_df['is_completed']] + + # 端外有效付费订单(已完成) + external_orders = order_df[~order_df['is_internal'] & (order_df['order_status'] == 3)] + + # 按用户汇总 + # 端内付费用户 + internal_paid_users = set(internal_orders['account_id'].unique()) + + # 端外付费用户 + external_paid_users = set(external_orders['account_id'].unique()) + + # 端内付费用户(剔除全部退费) + # 统计每个用户的端内订单退费情况 + user_internal_orders = internal_orders.groupby('account_id').agg( + total_orders=('trade_no', 'count'), + refunded_orders=('is_refunded', 'sum') + ).reset_index() + user_internal_orders['all_refunded'] = user_internal_orders['total_orders'] == user_internal_orders['refunded_orders'] + internal_paid_users_norefund = set(user_internal_orders[~user_internal_orders['all_refunded']]['account_id']) + + print(f" 端内付费用户数: {len(internal_paid_users)}") + print(f" 端内付费用户数(剔除全部退费): {len(internal_paid_users_norefund)}") + print(f" 端外付费用户数: {len(external_paid_users)}") +else: + internal_paid_users = set() + internal_paid_users_norefund = set() + external_paid_users = set() + +# 给注册用户打标签 +reg_df['has_internal'] = reg_df['account_id'].isin(internal_paid_users) +reg_df['has_internal_norefund'] = reg_df['account_id'].isin(internal_paid_users_norefund) +reg_df['has_external'] = reg_df['account_id'].isin(external_paid_users) +reg_df['has_no_order'] = ~reg_df['has_internal'] & ~reg_df['has_external'] + +# 纯净版:剔除"只有端外订单且没有端内订单"的用户 
+reg_df['is_clean_user'] = ~( + reg_df['has_external'] & ~reg_df['has_internal'] & ~reg_df['has_no_order'] +) +# 简化:剔除只有端外订单的用户 +reg_df['only_external'] = reg_df['has_external'] & ~reg_df['has_internal'] & ~reg_df['has_no_order'] + +# ========== LOESS 拟合 ========== +print(f"\nStep 5: LOESS拟合每日注册人数(含活动日历)...") + +# ===== 活动日历 ===== +# 2025年:9/9-10, 9/19-23, 10/13-14, 10/16-17, 11/2, 11/7, 11/10, 11/12, 11/19, 12/3 +# 2026年:1/28(余波1天), 2/11, 2/26(余波4天), 3/5(余波3天), 3/9, 3/12-13, +# 4/3(余波4天), 4/8(余波2天), 4/22(余波1天), 4/28, 5/6-7 + +activity_ranges = [ + ('2025-09-09', '2025-09-10'), + ('2025-09-19', '2025-09-23'), + ('2025-10-13', '2025-10-14'), + ('2025-10-16', '2025-10-17'), + ('2025-11-02', '2025-11-02'), + ('2025-11-07', '2025-11-07'), + ('2025-11-10', '2025-11-10'), + ('2025-11-12', '2025-11-12'), + ('2025-11-19', '2025-11-19'), + ('2025-12-03', '2025-12-03'), + # 2026 + ('2026-01-28', '2026-01-29'), # 余波1天 + ('2026-02-11', '2026-02-11'), + ('2026-02-26', '2026-03-02'), # 余波4天 + ('2026-03-05', '2026-03-08'), # 余波3天 + ('2026-03-09', '2026-03-09'), + ('2026-03-12', '2026-03-13'), + ('2026-04-03', '2026-04-07'), # 余波4天 + ('2026-04-08', '2026-04-10'), # 余波2天 + ('2026-04-22', '2026-04-23'), # 余波1天 + ('2026-04-28', '2026-04-28'), + ('2026-05-06', '2026-05-07'), +] + +activity_dates = set() +for start_s, end_s in activity_ranges: + s = pd.Timestamp(start_s) + e = pd.Timestamp(end_s) + for d in pd.date_range(s, e): + activity_dates.add(d.date()) + +print(f" 活动+余波天数: {len(activity_dates)}") + +daily_reg = reg_df.groupby('reg_date')['account_id'].count().reset_index() +daily_reg.columns = ['reg_date', 'reg_count'] +daily_reg = daily_reg.sort_values('reg_date') +daily_reg['is_activity'] = daily_reg['reg_date'].apply(lambda d: d.date() in activity_dates) + +# 星期因子(基于清洁日计算) +daily_reg['weekday'] = daily_reg['reg_date'].dt.dayofweek # 0=Mon +daily_reg['is_weekend'] = daily_reg['weekday'] >= 5 + +# 计算day_num(LOESS拟合用) +start_dt = pd.Timestamp(START_DATE) +daily_reg['day_num'] = 
(daily_reg['reg_date'] - start_dt).dt.days + +# 清洁日 = 非活动+非余波日 +clean_days = daily_reg[~daily_reg['is_activity']] +if len(clean_days) > 0: + overall_avg = clean_days['reg_count'].mean() + weekday_avg = clean_days.groupby('weekday')['reg_count'].mean() + daily_reg['weekday_factor'] = daily_reg['weekday'].map( + weekday_avg / overall_avg + ).fillna(1.0) +else: + daily_reg['weekday_factor'] = 1.0 + +# LOESS拟合(仅用清洁日) +if len(clean_days) >= 5: + frac_val = min(0.3, 60.0 / len(daily_reg)) + frac_val = max(frac_val, 0.1) + + loess_result = lowess( + clean_days['reg_count'].values, + clean_days['day_num'].values, + frac=frac_val, + it=3 + ) + # 取出清洁日上的LOESS拟合值 + loess_days = clean_days['day_num'].values + loess_vals = loess_result[:, 1] + # 对所有天用线性插值获取LOESS基线 + all_loess = np.interp(daily_reg['day_num'].values, loess_days, loess_vals) + daily_reg['loess_baseline'] = all_loess +else: + daily_reg['loess_baseline'] = daily_reg['reg_count'] + +# 星期修正后的拟合值 +daily_reg['corrected_fitted'] = daily_reg['loess_baseline'] * daily_reg['weekday_factor'] + +# 应用规则:活动日/余波日→星期修正后的拟合值;非活动日→保留实际值 +daily_reg['fitted_reg'] = np.where( + daily_reg['is_activity'], + daily_reg['corrected_fitted'], # 活动日:用拟合值替换 + daily_reg['reg_count'] # 非活动日:保留实际值 +) + +# 保底规则:拟合值不能低于0 +# daily_reg['fitted_reg'] = np.maximum(daily_reg['fitted_reg'], 0) + +act_count = daily_reg['is_activity'].sum() +print(f" 清洁日: {len(clean_days)}, 活动/余波日: {act_count}") +print(f" LOESS frac: {frac_val:.4f}") +print(f" 星期因子范围: {daily_reg['weekday_factor'].min():.2f} ~ {daily_reg['weekday_factor'].max():.2f}") + +# 打印拟合对比 +print(f" {'月份':<8} {'原始':>6} {'拟合':>6} {'剔除':>6} {'剔除率':>7}") +from collections import defaultdict +month_fit = defaultdict(lambda: {'orig': 0, 'fit': 0}) +for _, row in daily_reg.iterrows(): + m = str(row['reg_date'].to_period('M')) + month_fit[m]['orig'] += row['reg_count'] + month_fit[m]['fit'] += row['fitted_reg'] +for m in sorted(month_fit.keys()): + o = month_fit[m]['orig'] + f = month_fit[m]['fit'] 
+ d = o - f + r = d/o*100 if o > 0 else 0 + print(f" {m:<8} {int(o):>6} {int(f):>6} {int(d):>6} {r:>6.1f}%") + +# 月度汇总拟合值 +daily_reg['reg_month'] = daily_reg['reg_date'].dt.to_period('M') +monthly_fitted = daily_reg.groupby('reg_month')['fitted_reg'].sum().reset_index() +monthly_fitted.columns = ['reg_month', 'fitted_total'] + +# ========== 计算转化率 ========== +print(f"\nStep 6: 计算各版本各维度转化率...") + +results = [] + +# 月份列表 +months = sorted(reg_df['reg_month'].unique()) + +for month in months: + month_users = reg_df[reg_df['reg_month'] == month] + + # ---- 原始版 ---- + denom_orig = len(month_users) + num_orig = month_users['has_internal_norefund'].sum() + + # ---- 纯净版 ---- + clean_users = month_users[~month_users['only_external']] + denom_clean = len(clean_users) + num_clean = clean_users['has_internal_norefund'].sum() + + # ---- 拟合版 ---- + fitted_row = monthly_fitted[monthly_fitted['reg_month'] == month] + denom_fitted = fitted_row['fitted_total'].values[0] if len(fitted_row) > 0 else denom_orig + num_fitted = month_users['has_internal_norefund'].sum() # 分子不变 + + # 整体维度 + results.append({ + '月份': str(month), + '维度': '整体', + '渠道': '全部', + '版本': '原始版', + '注册用户数': denom_orig, + '付费用户数': num_orig, + '转化率': f"{num_orig/denom_orig*100:.2f}%" if denom_orig > 0 else '0%' + }) + results.append({ + '月份': str(month), + '维度': '整体', + '渠道': '全部', + '版本': '纯净版', + '注册用户数': denom_clean if isinstance(denom_clean, (int, np.integer)) else int(denom_clean), + '付费用户数': num_clean, + '转化率': f"{num_clean/denom_clean*100:.2f}%" if denom_clean > 0 else '0%' + }) + results.append({ + '月份': str(month), + '维度': '整体', + '渠道': '全部', + '版本': '拟合版', + '注册用户数': int(round(denom_fitted)), + '付费用户数': num_fitted, + '转化率': f"{num_fitted/denom_fitted*100:.2f}%" if denom_fitted > 0 else '0%' + }) + + # ---- 分渠道(download_channel) ---- + # 计算拟合版缩放比例 + fitted_ratio = denom_fitted / denom_orig if denom_orig > 0 else 1.0 + + for ch, ch_users in month_users.groupby('download_channel'): + ch_label = ch if ch else '未知' 
+ d_orig = len(ch_users) + n_orig = ch_users['has_internal_norefund'].sum() + + # 纯净版 + ch_clean = ch_users[~ch_users['only_external']] + d_clean = len(ch_clean) + n_clean = ch_clean['has_internal_norefund'].sum() + + # 拟合版(按整体拟合比例缩放) + d_fitted = int(round(d_orig * fitted_ratio)) + + results.append({ + '月份': str(month), + '维度': 'download_channel', + '渠道': ch_label, + '版本': '原始版', + '注册用户数': d_orig, + '付费用户数': n_orig, + '转化率': f"{n_orig/d_orig*100:.2f}%" if d_orig > 0 else '0%' + }) + results.append({ + '月份': str(month), + '维度': 'download_channel', + '渠道': ch_label, + '版本': '纯净版', + '注册用户数': d_clean, + '付费用户数': n_clean, + '转化率': f"{n_clean/d_clean*100:.2f}%" if d_clean > 0 else '0%' + }) + results.append({ + '月份': str(month), + '维度': 'download_channel', + '渠道': ch_label, + '版本': '拟合版', + '注册用户数': d_fitted, + '付费用户数': n_orig, + '转化率': f"{n_orig/d_fitted*100:.2f}%" if d_fitted > 0 else '0%' + }) + +print(f" 整体维度结果: {len([r for r in results if r['维度']=='整体'])} 条") +print(f" download_channel维度结果: {len([r for r in results if r['维度']=='download_channel'])} 条") + +# ---- 区分 key_from ---- +# 按注册月份 × key_from 统计付费用户 +if len(order_df) > 0: + # 端内已完成的订单 + internal_completed = order_df[order_df['is_internal'] & order_df['is_completed']] + + # 关联用户注册月份 + user_month_map = reg_df[['account_id', 'reg_month']].drop_duplicates('account_id') + internal_completed = internal_completed.merge(user_month_map, on='account_id', how='inner') + + # 剔除全部退费的用户 + internal_completed['is_refunded_order'] = internal_completed['is_refunded'] + user_refund_stat = internal_completed.groupby('account_id').agg( + total=('trade_no', 'count'), + refunded=('is_refunded_order', 'sum') + ).reset_index() + user_refund_stat['all_refunded'] = user_refund_stat['total'] == user_refund_stat['refunded'] + all_refunded_users = set(user_refund_stat[user_refund_stat['all_refunded']]['account_id']) + internal_completed_norefund = internal_completed[~internal_completed['account_id'].isin(all_refunded_users)] + + # 按 
reg_month × key_from 统计付费用户数 + kf_stats = internal_completed_norefund.groupby(['reg_month', 'key_from'])['account_id'].nunique().reset_index() + kf_stats.columns = ['reg_month', 'key_from', 'paid_users'] + + # 各月份注册人数(原始+纯净+拟合) + month_denom = reg_df.groupby('reg_month')['account_id'].nunique().reset_index() + month_denom.columns = ['reg_month', 'total_users'] + + # 纯净版每月分母 + clean_reg = reg_df[~reg_df['only_external']] + month_clean = clean_reg.groupby('reg_month')['account_id'].nunique().reset_index() + month_clean.columns = ['reg_month', 'clean_users'] + month_denom = month_denom.merge(month_clean, on='reg_month', how='left') + month_denom['clean_users'] = month_denom['clean_users'].fillna(0).astype(int) + + # 拟合版每月分母 + month_denom = month_denom.merge(monthly_fitted, on='reg_month', how='left') + month_denom['fitted_total'] = month_denom['fitted_total'].fillna(month_denom['total_users']).round(0).astype(int) + + kf_stats = kf_stats.merge(month_denom, on='reg_month', how='left') + + for _, row in kf_stats.iterrows(): + denom_orig = int(row['total_users']) + paid = int(row['paid_users']) + denom_clean = int(row['clean_users']) + denom_fitted = int(row['fitted_total']) + + results.append({ + '月份': str(row['reg_month']), + '维度': 'key_from', + '渠道': row['key_from'], + '版本': '原始版', + '注册用户数': denom_orig, + '付费用户数': paid, + '转化率': f"{paid/denom_orig*100:.2f}%" if denom_orig > 0 else '0%' + }) + results.append({ + '月份': str(row['reg_month']), + '维度': 'key_from', + '渠道': row['key_from'], + '版本': '纯净版', + '注册用户数': denom_clean, + '付费用户数': paid, + '转化率': f"{paid/denom_clean*100:.2f}%" if denom_clean > 0 else '0%' + }) + results.append({ + '月份': str(row['reg_month']), + '维度': 'key_from', + '渠道': row['key_from'], + '版本': '拟合版', + '注册用户数': denom_fitted, + '付费用户数': paid, + '转化率': f"{paid/denom_fitted*100:.2f}%" if denom_fitted > 0 else '0%' + }) + + # 端外汇总 + external_completed = order_df[(~order_df['is_internal']) & (order_df['order_status'] == 3)] + external_completed = 
external_completed.merge(user_month_map, on='account_id', how='inner') + ext_stats = external_completed.groupby('reg_month')['account_id'].nunique().reset_index() + ext_stats.columns = ['reg_month', 'paid_users'] + ext_stats = ext_stats.merge(month_denom, on='reg_month', how='left') + for _, row in ext_stats.iterrows(): + denom_orig = int(row['total_users']) + paid = int(row['paid_users']) + denom_clean = int(row['clean_users']) + denom_fitted = int(row['fitted_total']) + + results.append({ + '月份': str(row['reg_month']), + '维度': 'key_from', + '渠道': '端外合计', + '版本': '原始版', + '注册用户数': denom_orig, + '付费用户数': paid, + '转化率': f"{paid/denom_orig*100:.2f}%" if denom_orig > 0 else '0%' + }) + results.append({ + '月份': str(row['reg_month']), + '维度': 'key_from', + '渠道': '端外合计', + '版本': '纯净版', + '注册用户数': denom_clean, + '付费用户数': paid, + '转化率': f"{paid/denom_clean*100:.2f}%" if denom_clean > 0 else '0%' + }) + results.append({ + '月份': str(row['reg_month']), + '维度': 'key_from', + '渠道': '端外合计', + '版本': '拟合版', + '注册用户数': denom_fitted, + '付费用户数': paid, + '转化率': f"{paid/denom_fitted*100:.2f}%" if denom_fitted > 0 else '0%' + }) + + print(f" key_from维度结果: {len([r for r in results if r['维度']=='key_from'])} 条") + +# ========== 输出到Excel ========== +result_df = pd.DataFrame(results) +print(f"\n总结果数: {len(result_df)}") + +# 按维度排序 +dim_order = {'整体': 0, 'download_channel': 1, 'key_from': 2} +result_df['dim_sort'] = result_df['维度'].map(dim_order) +result_df = result_df.sort_values(['月份', 'dim_sort', '版本', '渠道']).drop(columns=['dim_sort']) + +output_path = '/root/.openclaw/workspace/output/conversion_rate_202509_202605.xlsx' + +with pd.ExcelWriter(output_path, engine='openpyxl') as writer: + # Sheet1: 全部数据 + result_df.to_excel(writer, sheet_name='全部数据', index=False) + + # 各维度分sheet + for dim in ['整体', 'download_channel', 'key_from']: + dim_df = result_df[result_df['维度'] == dim].copy() + dim_df = dim_df.drop(columns=['维度']) + sheet_name = dim[:31] # Excel sheet name limit + 
dim_df.to_excel(writer, sheet_name=sheet_name, index=False) + + # 透视表:整体(原始版 vs 纯净版 vs 拟合版) + overall_df = result_df[result_df['维度'] == '整体'].copy() + pivot = overall_df.pivot_table( + index='月份', columns='版本', values='转化率', aggfunc='first' + ) + pivot.to_excel(writer, sheet_name='整体对比') + +print(f"\n输出文件: {output_path}") +print("Done!")
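Review note: the fitted-version denominator rule in this diff (replace activity and aftermath days with the weekday-corrected baseline, keep actual counts on all other days, then sum by month) can be checked in isolation from the database and from the LOESS fit. The block below is a minimal standard-library sketch of steps 3 to 5 only, assuming the baseline has already been computed elsewhere. The helper names `expand_ranges` and `fitted_monthly_denominator` and all `toy_*` values are hypothetical and are not part of `scripts/conversion_rate_analysis.py`; the date ranges are copied from the activity calendar above.

```python
from collections import defaultdict
from datetime import date, timedelta

# Activity + aftermath ranges (inclusive), copied from the calendar in this diff.
ACTIVITY_RANGES = [
    ("2025-09-09", "2025-09-10"), ("2025-09-19", "2025-09-23"),
    ("2025-10-13", "2025-10-14"), ("2025-10-16", "2025-10-17"),
    ("2025-11-02", "2025-11-02"), ("2025-11-07", "2025-11-07"),
    ("2025-11-10", "2025-11-10"), ("2025-11-12", "2025-11-12"),
    ("2025-11-19", "2025-11-19"), ("2025-12-03", "2025-12-03"),
    ("2026-01-28", "2026-01-29"), ("2026-02-11", "2026-02-11"),
    ("2026-02-26", "2026-03-02"), ("2026-03-05", "2026-03-08"),
    ("2026-03-09", "2026-03-09"), ("2026-03-12", "2026-03-13"),
    ("2026-04-03", "2026-04-07"), ("2026-04-08", "2026-04-10"),
    ("2026-04-22", "2026-04-23"), ("2026-04-28", "2026-04-28"),
    ("2026-05-06", "2026-05-07"),
]


def expand_ranges(ranges):
    """Expand inclusive (start, end) ISO-date pairs into a set of dates."""
    days = set()
    for start_s, end_s in ranges:
        day, end = date.fromisoformat(start_s), date.fromisoformat(end_s)
        while day <= end:
            days.add(day)
            day += timedelta(days=1)
    return days


def fitted_monthly_denominator(actual, baseline, weekday_factor, activity_days):
    """Steps 3-5 of the method: substitute on activity days, then sum by month.

    actual, baseline: dict mapping date -> daily registrations.
    weekday_factor: dict mapping date.weekday() (0 = Monday) -> multiplier;
    weekdays missing from the dict default to 1.0.
    """
    monthly = defaultdict(float)
    for day, count in actual.items():
        if day in activity_days:
            # Activity/aftermath day: use the weekday-corrected baseline.
            effective = baseline[day] * weekday_factor.get(day.weekday(), 1.0)
        else:
            # Clean day: keep the actual registration count unchanged.
            effective = count
        monthly[day.strftime("%Y-%m")] += effective
    return dict(monthly)


activity_days = expand_ranges(ACTIVITY_RANGES)
window_days = (date(2026, 5, 12) - date(2025, 9, 1)).days + 1

# Hypothetical toy numbers, not real registration data.
toy_actual = {date(2026, 4, 2): 100, date(2026, 4, 3): 400, date(2026, 4, 4): 300}
toy_baseline = {day: 100.0 for day in toy_actual}
toy_factor = {4: 1.0, 5: 1.25}  # Friday and Saturday multipliers
toy_result = fitted_monthly_denominator(
    toy_actual, toy_baseline, toy_factor, activity_days
)

print(len(activity_days), window_days)  # activity/aftermath days, window length
print(toy_result)
```

With the calendar as written, the expansion yields 45 distinct days in a 254-day window (2025-09-01 through 2026-05-12), which matches the count recorded in `MEMORY.md`. In the toy case only 2026-04-02 is a clean day, so its actual value is kept while the two activity days are pulled down to the corrected baseline.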