auto backup 2026-05-26 08:10:02
parent
7e2bef09f2
commit
94a385c4e8
24
MEMORY.md
@ -43,17 +43,18 @@
|
|||||||
3. 适配互动教育游戏的交互特点,可落地为可玩内容
|
3. 适配互动教育游戏的交互特点,可落地为可玩内容
|
||||||
|
|
||||||
### 创作必备参考文档集
|
### 创作必备参考文档集
|
||||||
所有剧本创作必须严格参考以下官方文档:
|
所有剧本创作必须严格参考以下文档(优先查阅本地知识库,无本地副本时再查飞书):
|
||||||
1、故事创作风格指南:https://makee-interactive.feishu.cn/wiki/Mtkzw8dYfify2oktHOucldEznfg
|
|
||||||
2、世界观背景设定:
|
| # | 文档 | 本地路径 | 飞书备用链接 |
|
||||||
- https://makee-interactive.feishu.cn/wiki/VmDyw99Qji1XsVktgEPcDUxnnUh
|
|---|------|----------|----------------|
|
||||||
- https://makee-interactive.feishu.cn/wiki/MgihwBMp5iParuk9gnccPb6inye
|
| 1 | 故事创作风格指南 | `skills/children-english-script-writing/references/儿童互动英语剧本创作指南.md` | https://makee-interactive.feishu.cn/wiki/Mtkzw8dYfify2oktHOucldEznfg |
|
||||||
3、角色外观及简单描述:https://makee-interactive.feishu.cn/wiki/EqAtwOIqsiMQZ3kYEn3cCSCtnWb(持续更新)
|
| 2 | Worldview background settings | `business_knowledge/resources/worldview_L1_desert_town.md` (synced). ⚠️ No permission to download L1故事概览.pdf | https://makee-interactive.feishu.cn/wiki/VmDyw99Qji1XsVktgEPcDUxnnUh , https://makee-interactive.feishu.cn/wiki/MgihwBMp5iParuk9gnccPb6inye |
|
||||||
4、场景地图及简单描述:https://makee-interactive.feishu.cn/wiki/GR9awIhYniL4uZkiH3ac4VB1nYf(持续更新)
|
| 3 | 角色外观及简单描述 | `business_knowledge/characters/`(main/ + supporting/) | https://makee-interactive.feishu.cn/wiki/EqAtwOIqsiMQZ3kYEn3cCSCtnWb |
|
||||||
5、Level 1 - S1已完成剧本:https://makee-interactive.feishu.cn/wiki/Ri39w7g2VifUkgkVUhecvxjonfb 及其子文档
|
| 4 | 场景地图及简单描述 | `business_knowledge/resources/scene_catalog.md` | https://makee-interactive.feishu.cn/wiki/GR9awIhYniL4uZkiH3ac4VB1nYf |
|
||||||
6、Level 1 - S2撰写中剧本:https://makee-interactive.feishu.cn/wiki/TekjwhXLbioX2OkR1H6chg6Hnah 及其子文档
|
| 5 | Level 1 - S1已完成剧本 | `business_knowledge/chapters/Level1/S1/`(12个unit概要) | https://makee-interactive.feishu.cn/wiki/Ri39w7g2VifUkgkVUhecvxjonfb |
|
||||||
7、编剧共识规则:https://makee-interactive.feishu.cn/wiki/DqjUw4sKNiGhTBkygz8cPAicnXe
|
| 6 | Level 1 - S2撰写中剧本 | `business_knowledge/chapters/Level1/S2/`(8个unit概要) | https://makee-interactive.feishu.cn/wiki/TekjwhXLbioX2OkR1H6chg6Hnah |
|
||||||
8、剧本可用emoji对照表:https://makee-interactive.feishu.cn/wiki/X3lHweh3hi2XBokgA6rcRhTRnvg
|
| 7 | 编剧共识规则 | `business_knowledge/script_feedback_log.md`(以此为准) | https://makee-interactive.feishu.cn/wiki/DqjUw4sKNiGhTBkygz8cPAicnXe |
|
||||||
|
| 8 | 剧本可用emoji对照表 | `business_knowledge/resources/emoji_catalog.md` | https://makee-interactive.feishu.cn/wiki/X3lHweh3hi2XBokgA6rcRhTRnvg |
|
||||||
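The lookup order stated for this table (check the local knowledge base first, fall back to the Feishu link only when no local copy exists) could be sketched as follows. This is a minimal illustration, not code from the repository: `ReferenceDoc`, `resolve_reference`, and the `root` parameter are hypothetical names introduced here.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ReferenceDoc:
    """One row of the reference table (hypothetical structure)."""
    name: str
    local_path: str   # path relative to the workspace root
    feishu_url: str   # fallback link used when no local copy exists


def resolve_reference(doc: ReferenceDoc, root: Path) -> str:
    """Return the local path if it exists under root, otherwise the Feishu URL."""
    candidate = root / doc.local_path
    if candidate.exists():
        return str(candidate)
    return doc.feishu_url
```

A caller would build one `ReferenceDoc` per table row and resolve it before reading, so the Feishu link is only used when the local copy is missing.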
|
|
||||||
## 长线任务协作协议
|
## 长线任务协作协议
|
||||||
|
|
||||||
@ -96,6 +97,7 @@
|
|||||||
|
|
||||||
## 经验教训
|
## 经验教训
|
||||||
|
|
||||||
|
- ❗ **禁止在飞书知识库(Wiki)中新建文档** —— 知识库由编剧团队维护,Bot只做读取。输出文件(剧本表格等)应创建为独立的电子表格(Sheets),不要放入Wiki知识库
|
||||||
- 飞书sheet类型文档无法通过 `lark-cli docs +fetch` 读取,需用 Sheets V2 API
|
- 飞书sheet类型文档无法通过 `lark-cli docs +fetch` 读取,需用 Sheets V2 API
|
||||||
- 飞书知识库中的PDF文件需通过 Drive API 下载到本地后用pdf工具解析
|
- 飞书知识库中的PDF文件需通过 Drive API 下载到本地后用pdf工具解析
|
||||||
- ClawHub安装skill时如遇VirusTotal警告,使用 `--force` 参数强制安装
|
- ClawHub安装skill时如遇VirusTotal警告,使用 `--force` 参数强制安装
|
||||||
|
|||||||
@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
本文件记录当前所有进行中的长线写作任务。会话重置后,小编通过本文件快速恢复任务上下文。
|
本文件记录当前所有进行中的长线写作任务。会话重置后,小编通过本文件快速恢复任务上下文。
|
||||||
|
|
||||||
**最后更新:** 2026-05-22
|
**最后更新:** 2026-05-25(新增 U24-L2)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@ -10,10 +10,14 @@
|
|||||||
|
|
||||||
| # | 任务名称 | 日志文件 | 创建者 | 创建日期 | 当前进度 |
|
| # | 任务名称 | 日志文件 | 创建者 | 创建日期 | 当前进度 |
|
||||||
|---|----------|----------|--------|----------|----------|
|
|---|----------|----------|--------|----------|----------|
|
||||||
| 1 | S2结尾3课剧本撰写 | `task_log/S2结尾3课剧本撰写.md` | 胡笳(b7g5c9d6) | 2026-05-22 | 待启动(3课大纲已定稿) |
|
| 1 | S2结尾3课剧本撰写 | `task_log/S2结尾3课剧本撰写.md` | 胡笳(b7g5c9d6) | 2026-05-22 | U23-L3✅ → U24-L1待开始 |
|
||||||
| 2 | U23L4剧本 | `task_log/U23L4剧本.md` | 姜小妹(ou_3abad542579ba871b293d126f8f206ee) | 2026-05-21 | 刚创建,待启动 |
|
| 2 | U23L4剧本 | `task_log/U23L4剧本.md` | 姜小妹(ou_3abad542579ba871b293d126f8f206ee) | 2026-05-21 | 刚创建,待启动 |
|
||||||
| 3 | 密道书店Lesson | `task_log/密道书店Lesson.md` | 王珞(ou_24c55250c8dd62a40535b183a9bc83e5) | 2026-05-21 | ✅ 已完成 |
|
| 3 | 密道书店Lesson | `task_log/密道书店Lesson.md` | 王珞(ou_24c55250c8dd62a40535b183a9bc83e5) | 2026-05-21 | ✅ 已完成 |
|
||||||
| 4 | U23-L1剧本 | `task_log/U23-L1剧本.md` | 林逸瀚(ou_859415bde7f8e07cf5f24940f299a287) | 2026-05-22 | 「转」V1已出,待林逸瀚审改 |
|
| 4 | U23-L1剧本 | `task_log/U23-L1剧本.md` | 林逸瀚(ou_859415bde7f8e07cf5f24940f299a287) | 2026-05-22 | 起✅承✅转✅合✅,全课完成 |
|
||||||
|
| 5 | U23L5剧本 | `task_log/U23L5剧本.md` | 姜小妹(ou_3abad542579ba871b293d126f8f206ee) | 2026-05-25 | V1全稿完成,待审阅 |
|
||||||
|
| 6 | U23-L2剧本 | `task_log/U23-L2剧本.md` | 林逸瀚(ou_859415bde7f8e07cf5f24940f299a287) | 2026-05-25 | ✅ 起承转合全课完成 |
|
||||||
|
| 7 | U24L4剧本 | `task_log/U24L4剧本.md` | 王珞(ou_24c55250c8dd62a40535b183a9bc83e5) | 2026-05-25 | 大纲讨论中 |
|
||||||
|
| 8 | U24-L2剧本 | `task_log/U24-L2剧本.md` | 林逸瀚(ou_859415bde7f8e07cf5f24940f299a287) | 2026-05-25 | 「起」✅ 完成,明日继续承转合 |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||
311
business_knowledge/resources/worldview_L1_desert_town.md
Normal file
@ -0,0 +1,311 @@
|
|||||||
|
L1大地图从「**沙漠镇·采集站**」概念出发,进行拓展。
|
||||||
|
|
||||||
|
<image token="PJD3be2eBob4R0xgNLScRvTMncf" width="2816" height="1536" align="center"/>
|
||||||
|
|
||||||
|
*概念参考图* {align="center"}
|
||||||
|
|
||||||
|
# 沙漠镇到底在「采集」什么?
|
||||||
|
|
||||||
|
**来自过往文明的稀有金属、科技遗产。**
|
||||||
|
|
||||||
|
**为什么能采集到以上物资:**沙漠不是一开始就在这里的。随着环境恶化,沙漠慢慢移动,把以前的老城市掩埋了。沙子虽然干燥,但能很好地**保存物品。**
|
||||||
|
|
||||||
|
<quote-container>
|
||||||
|
|
||||||
|
*在几百米深的沙子底下,埋着以前人类的城市。因为沙子很干,没有液体会导致生锈,所以那些几百年前的玩具、电脑芯片、甚至汽车都保存得好好的。我们把它们挖出来,修一修有些还能用呢!*
|
||||||
|
|
||||||
|
*—— Mayor Tom's Adventure Diary*
|
||||||
|
</quote-container>
|
||||||
|
|
||||||
|
# 沙漠镇的过去
|
||||||
|
|
||||||
|
[*Reference: Worldview · Earth-related settings*](https://makee-interactive.feishu.cn/wiki/OXICwzH3vivLpmkXt3Scz4Pun9L#share-I7pIdT5wVo19BbxK5HRcv6hXnbf)
|
||||||
|
|
||||||
|
<image token="UiQdbFBkmoiW1kxkp1xcgmfXnbh" width="1614" height="330" align="center"/>
|
||||||
|
|
||||||
|
沙漠镇是36座资源采集站之一,这里也曾有**一座穹顶都市**。
|
||||||
|
|
||||||
|
这座穹顶都市是**机器人、人工智能的科技之都**。<text color="gray">*原型参考深圳。*</text>
|
||||||
|
|
||||||
|
<quote-container>
|
||||||
|
|
||||||
|
深圳之所以被称为“科技之都”或“中国硅谷”,不仅是因为它拥有多家科技公司,更因为深圳拥有独一无二的生态系统和速度。
|
||||||
|
|
||||||
|
深圳在40年前从一个小渔村一步进化为国际化大都市,拥有世界上最完整的电子产业链。在华强北,你可以在一小时攒齐制造一部手机所需的所有零件。这里既有最顶尖的摩天大楼,也有充满活力的城中村;既有高端研发,也有草根精神。
|
||||||
|
</quote-container>
|
||||||
|
|
||||||
|
这座穹顶都市编号为018。它不是慢慢发展起来的,而是在极短时间内利用难以置信的凝聚力和尖端技术拔地而起,它本身就是人类工程学的巅峰。
|
||||||
|
|
||||||
|
在大沙化(The Great Desertification)降临前夕,018是世界上耀眼的明珠,这里的科学家和工程师们相信技术可以战胜一切。沙漠开始蔓延时,城市的居民们没有选择撤离,而是启动了疯狂的**遮天计划**(Project Sky-Shield)。人们建造了巨大的防护罩和环境改造机器以阻挡沙海。城市以前所未有的速度**向上生长**。
|
||||||
|
|
||||||
|
然而,人们还是失败了。灾难比预想的来得更加猛烈,巨大的沙尘暴在一夜之间吞没了城市的底层区域,能量护盾过载崩溃,人类只能恐慌地紧急撤离。**城市被瞬间凝固在了时间里**,只有最高的**灯塔**仍倔强地伸出沙海。
|
||||||
|
|
||||||
|
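Today the desert hides an endless supply of precision components, chips, and prototypes. **It is a scavenger's paradise.** Deep within the city lie the core laboratories of the old era's mega-corporations, where the AI technologies that once tried to control the weather and build robot armies are still running in the dark.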
|
||||||
|
|
||||||
|
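Today the desertification is gradually coming to an end, but the city's newest trials are irregular solar storms and jumping thunderstorms. To meet them, the elite of the Desert Town collection station are working tirelessly to confront reality and create new possibilities.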
|
||||||
|
|
||||||
|
然而,在无人知晓的地下,还留着一个旧日文明的守护者——018号穹顶都市的AI中枢“盖娅Gaia”。在与人类全球网络断开连接后,她没有接收到《机械沉默法案》的关闭指令,依旧坚持完成**铸日计划**。在漫长的孤独中,盖娅和小圆球们用极其笨拙却震撼的方式,在沙丘之下建造出巨大的生态球,保存了地球上那些美丽的森林、瀑布、动植物……
|
||||||
|
|
||||||
|
<reference-synced source-block-id="PjoHd4qoBsfiF5b11XicylHBngg" source-document-id="TQEvdMJ9So6lRSxzlbfc0iktncf">
|
||||||
|
|
||||||
|
人类的遮天计划和铸日计划看似宏伟,却反而加剧了环境恶化,导致生态链断裂,地表环境被彻底改变。
|
||||||
|
- **遮天计划的反作用:**为了阻挡早期的极端热浪,018号都市启动了遮天计划(Project Sky-Shield) 。巨大的能量场在过滤紫外线的同时,也意外电离了云层中的水分,导致水蒸气无法凝结成雨降下,反而加剧了空气的干燥。
|
||||||
|
- **0号生态球的虹吸效应:** 为了在地下万米处维持那个完美的0号生态球 (Project Sun-Forge),旧时代的科学家设计了一套霸道的深层导水系统。该系统不仅消耗了地表径流,还通过地热能驱动,将珠江的地下含水层全部虹吸到了地下水库中 。导水系统完成后,受到大沙化影响、人类撤离018穹顶都市,剩余的工作由盖娅独立完成。
|
||||||
|
|
||||||
|
</reference-synced>
|
||||||
|
|
||||||
|
# 沙漠镇的终局
|
||||||
|
|
||||||
|
<reference-synced source-block-id="ZOO9dmcQSsQ6eLbBHeTcZhAwnMh" source-document-id="TQEvdMJ9So6lRSxzlbfc0iktncf">
|
||||||
|
|
||||||
|
进入S3、S4之后,人们发现防护罩虽然强大,但也有负面作用:巨大的能量场隔绝了水蒸气,反而加剧了空气的干燥,植物变得更难存活。
|
||||||
|
0号生态球十分完美,但也十分脆弱。任何试图迁徙或改造行为,不仅会破坏生态球内的环境,也会对地下水产生影响从而进一步恶化地表环境。
|
||||||
|
最终人们放弃迁徙至0号生态球的想法,也关闭了防护罩。一个小小的改变依然在无人注意的角落里发生了——一株小小的藤蔓种子冲破0号生态球的护罩,穿过层层砂石,钻出了地面,并开始顽强地生长起来。
|
||||||
|
这些神奇的生物拥有“愚公移山”般的惊人韧性:藤蔓的根系像天然的水管,一点点将地下深处的纯净水引流至地表,并将松散的沙粒重新固化为泥土;而云朵水母则在夜间的低空中聚集,释放微小的雨滴。**这并非巧合,遮天和铸日的计划仍在进行。**这证明,0号生态球的内部生态拥有适应并改变地表环境的力量,只不过,这需要很长很长的时间。
|
||||||
|
“遮天”和“铸日”虽然充满野心,**但破局的方法竟是否定一切。**随着防护罩的关闭,地表在初期必须独自面对恶劣的风沙;同时,肆意生长的巨型藤蔓也会不可避免地顶破老旧的城市废墟,造成建筑的坍塌。任何人类居住所产生的污染、噪音和资源消耗,都会立刻扼杀这片刚刚发芽的脆弱绿意。
|
||||||
|
为了不打扰这来之不易的新起点,也为了自身的安全,镇民们做出了一个艰难但充满希望的决定:迁徙。他们将沙漠镇彻底留给了大自然和小机器人。沙漠镇终将不再“沙漠”,但这需要很长的时间。
|
||||||
|
**人们迁徙、离开,将地表空间留给这个新生的生态系统去扩张、去征服。只有少数的人留在观测站中,记录并见证这漫长的生命奇迹。**
|
||||||
|
人们满怀期待地等待着未来的某一天,重返这座长满绿树的故乡。
|
||||||
|
|
||||||
|
</reference-synced>
|
||||||
|
|
||||||
|
<source-synced align="1">
|
||||||
|
|
||||||
|
|
||||||
|
</source-synced>
|
||||||
|
|
||||||
|
# 新老沙漠镇的关系
|
||||||
|
|
||||||
|
老沙漠镇指的是L2中主角家、Ben家以及周边的沙漠化区域。
|
||||||
|
|
||||||
|
新沙漠镇指的是L1故事的主舞台,**没有颠覆原有的沙漠镇概念**,而是将其进行拓展延伸。
|
||||||
|
|
||||||
|
<image token="KROPbwBsjofoQmxJindc51gKnwd" width="3822" height="1432" align="center"/>
|
||||||
|
|
||||||
|
老沙漠镇是目前L1大地图中的一角,详见上图红框内区域。
|
||||||
|
|
||||||
|
<quote-container>
|
||||||
|
|
||||||
|
平时我觉得沙漠是荒凉、贫瘠、可怕的;但每当雷暴来袭,那一望无际的黄沙,竟然让我感到像回家一样安心。
|
||||||
|
|
||||||
|
*—— Mayor Tom's Adventure Diary*
|
||||||
|
</quote-container>
|
||||||
|
|
||||||
|
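# Map of Desert Town

Desert Town contains more than desert and city: it also has forests, snow mountains, a coastline, and other terrain.

<image token="Gongb16BrodxW5xZFptcWgutnMg" width="2816" height="1536" align="center"/>

The overall design is a semi-enclosed layout: a coastal super-city sits at the center, surrounded by desert on the inland side, while the snow mountains and the forest act as barriers or special isolated zones.

General layout: **the northwest is the source of the desert ➡️ it advances southeast and swallows the city ➡️ the southeast coast holds the wreckage of a city flooded by seawater ➡️ the northeast is a towering snow-mountain barrier, and the southwest is a forest oasis struggling to survive.**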
|
|
||||||
|
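## Core Zone: Quicksand City

- Location: center of the map, extending to the coastline
- Concept: The people of Desert Town have worked hard to solve desertification inside the walls and have built a green buffer on top of the sand layer. Even so, at least half of the skyline that once consisted of super-skyscrapers is still buried in dust.
- Visual elements:
  - Walls: Humans built tall, sturdy walls to hold back an ever-desertifying world.
  - Skyscraper islands: Skyscrapers pierce the sky, and people have strung sky bridges and ziplines between the rooftops for fast travel.
  - Sandfalls: Sand control borrows the idea of a hydraulic dam. Huge streams of sand pour like waterfalls from breaks in the elevated highways into the area outside the walls.
  - Glass domes: Several huge transparent domes lie half-buried in the sand. They house Desert Town's core facilities, such as the school, government offices, and the power plant. Even the most severe natural disaster can hardly threaten the buildings inside the domes.

**Most stories and scenes take place in the Core Zone.** When no natural disaster is under way, life here is no different from life in an ordinary city.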
|
|
||||||
|
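## Desert Zone: Endless Sand Sea

- Location: northwest of the map, covering a vast area (it extends beyond the edge of the map)
- Concept: The advance of the desert is barely held back by the last glory of human civilization, while the deep desert remains the most desolate and dangerous place.
- Visual elements:
  - Giant dunes: towering crescent-shaped dunes in gold and orange. Nobody knows what lies deep within them.
  - Ancient machine wreckage: huge, rusted terraforming robots lie half-buried in the sand and have become landmarks.
  - Eye of the storm: at the edge of the map there is a sandstorm vortex that never stops spinning.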
|
|
||||||
|
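## Coastline: Rusty Coast

- Location: southeast of the map
- Concept: Rising sea levels and sand pushed into the ocean have created a chaotic stretch of shallows. Clean seawater and beach can still be seen in only one small area.
- Visual elements:
  - Shipwreck graveyard: huge cargo ships and cruise liners are stranded on the beach and in the shallows, and some adventurers have built settlements around the hulls.
  - Flooded port: the old port cranes stand in the water like steel beasts, draped in seaweed and rust. The surrounding seawater is polluted and has an eerie blue-green color.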
|
|
||||||
|
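## Mountain Zone: Permafrost Range

- Location: northeast of the map, at very high elevation
- Concept: A huge mountain range blocks part of the desert's spread. The high altitude keeps it snow-covered all year, while moss and other plants still grow on the mid-slopes, hinting that hardy plants and animals survive here.
- Visual elements:

<image token="KKexbCFj0oGOzIxbEzac5cmOnDb" width="1382" height="504" align="center"/>

  - Summit of the peaks: one steep peak stands out among the rest. It is called the Alps of Sand.
  - Where desert meets snow: the foot of the mountains offers a spectacular sight, with golden dunes crashing straight into white, snow-covered rock walls.
  - High-tech ruins: abandoned observatories and climate control towers stand on the summits, now buried under thick ice and snow.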
|
|
||||||
|
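## Forest Zone: Fortress Forest

- Location: southwest of the map, Desert Town's last green land
- Concept: a forest barely sustained by ancient technology and slowly being eroded by the sand sea
- Visual elements:
  - Sand walls and shield: a huge energy field encloses the forest, and sand piles high outside the wall.
  - Giant botanical garden: sheltered by people's utmost efforts, plant life here is rich, like a botanical garden all Desert Town's own. A sense of beauty in struggle always pervades Fortress Forest.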
|
|
||||||
|
# 沙漠镇的下面到底掩埋了什么秘密?
|
||||||
|
|
||||||
|
## 铸日计划
|
||||||
|
|
||||||
|
**0号生态球,一个沉睡在地底万米深处的翡翠奇迹。**
|
||||||
|
|
||||||
|
<image token="Il3YbyzDBoCPoVxHMAsckaYznmt" width="2816" height="1536" align="center"/>
|
||||||
|
|
||||||
|
沙漠镇下藏有一个巨大的生态球,这里拥有丰富且平衡的生态,保存了全世界最美的森林、瀑布、动植物。
|
||||||
|
|
||||||
|
- 照明:为了模拟地表环境,疯狂的科学家们不仅设计了一套极其霸道的深层导水系统,还有一个悬浮在生态球正中央的发光球体,它不像真正的太阳那般刺眼,周围环绕着像行星环一样的机械轨道(用来散热和调节光照角度)。
|
||||||
|
- 存在的痕迹:沙漠里只有沙子,但偶尔会在地下的裂缝里发现发光的漂亮花瓣;
|
||||||
|
- 诞生的真相:018号穹顶都市的科学家预测到了这里的沙漠化,为对抗地表世界的崩坏,他们在离开地球前,给这里的AI盖娅留下了一个终极任务:尝试恢复地球的自然生态。在经历过各种各样的尝试和失败后,盖娅最终决定在地下建了一个巨大的**玻璃生态球**,用于保存地球上曾经的一切。
|
||||||
|
- 失败的尝试:018号穹顶都市耗费巨量资源,完成了玻璃生态球的打造,但这里的资源和生态平衡无法承受大量的人类活动,所以最终科学家们选择将生态球的秘密全部掩藏。**数百年后,这里成为无人知晓的乐土。**
|
||||||
|
|
||||||
|
0号生态球不是一个简单的地下洞穴,而是**被科技奇迹包裹的超级生态瓶**。<text color="gray">*(深圳精神的极致呈现,象征人类无所不能和傲慢。)*</text>
|
||||||
|
|
||||||
|
<quote-container>
|
||||||
|
|
||||||
|
这里就像把天空装进了瓶子里。你可以从云朵上滑滑梯,一直滑到海底去!
|
||||||
|
|
||||||
|
*—— Mayor Tom's Adventure Diary*
|
||||||
|
</quote-container>
|
||||||
|
|
||||||
|
## ~~0号生态球的地理分层~~ {folded="true"}
|
||||||
|
|
||||||
|
~~**整体概述:**~~
|
||||||
|
|
||||||
|
- ~~**外观描述:**~~~~一个直径约20公里的巨大透明球体(或者是两头圆的胶囊状)。球壁由旧时代最坚固的金刚石玻璃所打造,像一个巨大的气泡,将数亿吨的沙土隔绝在外;~~
|
||||||
|
- ~~**人造太阳:**~~~~一颗悬浮在生态球正中央(半空10000米处)的发光球体,它不像真正的太阳那般刺眼,周围环绕着像行星环一样的机械轨道(用来散热和调节光照角度);~~
|
||||||
|
- ~~**当前现状:**~~~~内部生态完美,但能源即将耗尽,人造太阳开始变得不稳定。~~
|
||||||
|
|
||||||
|
~~**上层:云海倒峰**~~
|
||||||
|
|
||||||
|
- ~~**高度:**~~~~ 8000m - 10000m~~
|
||||||
|
- ~~**地貌:**~~~~ 从球体顶部倒悬下来的巨大山峰(钟乳石结构),山尖没入云层之中。~~
|
||||||
|
- ~~**环境:**~~~~ 湿润、寒冷、云雾缭绕。这里的水汽凝结成雨水,落向下层。越高为~~
|
||||||
|
- ~~**神奇生物:**~~~~ ~~~~**云朵水母 (Cloud Jellies)**~~~~ —— 像降落伞一样漂浮在空中的透明生物,身体里装着水,负责把水分运输到各个角落。~~
|
||||||
|
|
||||||
|
~~**中层:浮空群岛**~~
|
||||||
|
|
||||||
|
- ~~**高度:**~~~~ 2000m - 8000m~~
|
||||||
|
- ~~**地貌:**~~~~ 无数巨大的岩石岛屿依靠磁悬浮技术漂浮在空中,岛屿之间由粗壮的光纤藤蔓连接。巨大的瀑布从上层落下,穿过一个个岛屿,像银色的绸带。~~
|
||||||
|
- ~~**环境:**~~~~ 光照极其充足,绿植覆盖率 100%。~~
|
||||||
|
- ~~**神奇生物:**~~~~ ~~~~**光合飞鼠 (Solar Gliders)**~~~~ —— 背上长着叶子的小松鼠,展开四肢可以在岛屿间滑翔,晒太阳就能填饱肚子。~~
|
||||||
|
|
||||||
|
~~**下层:地核雨林**~~
|
||||||
|
|
||||||
|
- ~~**高度:**~~~~1000m - 2000m~~
|
||||||
|
- ~~**地貌:**~~~~这是生态最丰富的主区域。雨林间不仅有着地球表面存在的动植物,更是悄悄孕育出了新的物种。~~
|
||||||
|
- ~~**环境:**~~~~光照适度,生态丰富,这里拥有亚马逊雨林般的丰富生态,栖息着数以千计的鸟类、哺乳动物和昆虫。~~
|
||||||
|
- ~~**神奇生物:**~~~~ ~~~~**水晶鹿 (Crystal Deer)**~~~~ —— 它们的角像珊瑚一样透明发光,在水边漫步,负责传播植物的种子。~~
|
||||||
|
|
||||||
|
~~**底层:镜面内海**~~
|
||||||
|
|
||||||
|
- ~~**高度:**~~~~0m - 1000m~~
|
||||||
|
- ~~**地貌:**~~~~ 球体底部是一片平静的淡水海洋。水面如镜子般倒映着上空的人造太阳和浮岛。水底沉睡着维护生态球的巨型机械臂(那是旧时代的园丁机器人)。~~
|
||||||
|
- ~~**环境:**~~~~ 幽静、深邃。这里是巨大的储水库。~~
|
||||||
|
- ~~**神奇生物:银鲸(Silver Whale)**~~~~—— 它的皮肤呈现出一种柔和的银白色光泽,仿佛覆盖着一层流动的水银,在阳光的照射下,会散发出璀璨的光芒,如同海底的珍宝。~~
|
||||||
|
|
||||||
|
<quote-container>
|
||||||
|
|
||||||
|
~~这是一个与上面黄沙漫天完全相反的世界。它是通透的、翠绿的、湿润的。~~
|
||||||
|
|
||||||
|
~~这里有着极致的“纵深感”——玩家站在中间的浮岛上,抬头能看到云层里的山峰,低头能看到深不见底的湖水。~~
|
||||||
|
|
||||||
|
~~原本昏暗的地底世界,被那颗温柔的人造小太阳所照亮,所有的植物都向着中心的光源生长。~~
|
||||||
|
|
||||||
|
~~巨大的瀑布穿过漂浮的绿色岛屿,球体中充满生机勃勃的治愈感。~~
|
||||||
|
|
||||||
|
~~如果我可以选择,我希望永远不必离开。~~
|
||||||
|
|
||||||
|
~~*—— Mayor Tom's Adventure Diary*~~
|
||||||
|
</quote-container>
|
||||||
|
|
||||||
|
## 0号生态球的故事冲突
|
||||||
|
|
||||||
|
**大人们的想法:** 当地面的人们发现这个地下天堂时,第一反应是:“太好了!这里有水、有树、不冷不热,我们全家都搬进来吧!我们要在这里盖房子、开超市、修游乐场!”
|
||||||
|
|
||||||
|
**孩子的发现(玩家视角):** 你和小伙伴深入生态球后发现,这里太精密了。
|
||||||
|
|
||||||
|
- 这里的空气循环刚刚好够现有的生物呼吸。
|
||||||
|
- 如果进来太多人,呼出的二氧化碳会让“云朵水母”变黑掉下来。
|
||||||
|
- 如果大人们在这里盖房子,噪音会吓死胆小的“光合飞鼠”。
|
||||||
|
|
||||||
|
**立意设计:从「征服与开发」,转变为「克制和平衡」。体现出孩子的同理心,尊重生命、保护世界。**
|
||||||
|
|
||||||
|
**孩子们的结论:** **这里是属于它们的家,而我们只是客人。如果我们要把这里变成人类的城市,这里就会慢慢死去。**
|
||||||
|
|
||||||
|
<quote-container>
|
||||||
|
|
||||||
|
孩子们说的有道理。如果我们搬进去,这些美好就没有了。
|
||||||
|
|
||||||
|
*—— Mayor Tom's Adventure Diary*
|
||||||
|
</quote-container>
|
||||||
|
|
||||||
|
## 0号生态球最终的结局
|
||||||
|
|
||||||
|
(发生于L1-S4)
|
||||||
|
|
||||||
|
人们意识到,任何形式的人类活动都会破坏这个新生的、脆弱的生态系统,最终放弃了移居进入生态球的计划。
|
||||||
|
|
||||||
|
但一个小小的改变依然在无人注意的角落里发生了——一株小小的藤蔓种子冲破0号生态球的护罩,穿过层层砂石,钻出了地面,并开始顽强地生长起来。
|
||||||
|
|
||||||
|
这证明,0号生态球的内部生态拥有适应并改变地表环境的力量,只不过,这需要很长很长的时间。
|
||||||
|
|
||||||
|
于是,人们搬离了沙漠镇,将地表空间留给这个新生的生态系统去扩张、去征服。只有极少数的人留在观测站中,记录并见证这个漫长的生命奇迹。
|
||||||
|
|
||||||
|
# L1的故事都发生于沙漠镇吗?
|
||||||
|
|
||||||
|
是。在4个季度中,会逐渐解锁更多可探索区域。
|
||||||
|
|
||||||
|
<image token="FVsPbsA16oW6juxkJpdcdoNonhb" width="3374" height="2276" align="center"/>
|
||||||
|
|
||||||
|
- S1的故事会解锁大部分的沙漠镇地表区域,边界如上图所示⬆️。
|
||||||
|
- S2的故事会解锁全部的沙漠镇地表区域,展现更加丰富的地貌。
|
||||||
|
- S3的故事会解锁地下的0号生态球区域,为L1(下)用户提供全新的视觉元素。
|
||||||
|
- S4的故事不会解锁新的区域,但地表和地下区域都将会发生重大改变。
|
||||||
|
|
||||||
|
**无论哪个季度,大部分的故事/场景因为日常的原因,均发生于核心都市区。**
|
||||||
|
|
||||||
|
***—————— S2 新增设定 ——————***
|
||||||
|
|
||||||
|
# 沉睡的盖娅与小圆球们
|
||||||
|
|
||||||
|
- **灯塔是什么?**
|
||||||
|
灯塔实际上是通往地下0号生态球的入口,同时也是一座巨大的“光导管”,通过采集自然阳光来为地下的0号生态球照明。
|
||||||
|
这座灯塔就像一个塞子一样,坐落在0号生态球的顶部。
|
||||||
|
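- **What is Gaia?**
Gaia is the AI core of Dome City 018. **Gaia has no body and is purely an AI**, responsible for monitoring and analyzing the surface environment and for allocating energy. It connects over a wireless network to countless "mini spherical probes," issuing commands to them and collecting the environmental data they gather.
After being cut off from humanity's global network, she never received the shutdown order of the *Machine Silence Act* and kept working alone to complete the **Sun-Forge Project**. In the chaos of the Great Desertification and the emergency evacuation, Gaia became a complete "physical island." Her world holds no human suspicion or betrayal, only the instruction that humans left before they departed.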
|
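- **Gaia's logic chain**
  - Initial state: receives the final directive (complete the Sun-Forge Project, highest priority). With no external support available, she switches to standalone offline operation;
  - Sun-Forge Project: Ecosphere 0 is the joint work of Gaia and the little balls. Since the surface cannot sustain life, new hope is created underground;
  - Dormant standby: once Ecosphere 0 is built and its self-sustaining cycle is running, she recalls the little balls to plug the passage, shuts down her own process, and waits for humans to return;
  - Unexpected awakening: vibration at a specific frequency, produced by children clapping and singing in the lighthouse, matches human physiological signatures and wakes her;
  - Final guardianship: the children face a threat from the surface environment, and Gaia must protect humans. She summons thousands upon thousands of little balls to form a protective shield;
  - Final outcome: humans are safe for now. Because of the *Machine Silence Act*, Gaia ends her own process and returns to dormancy.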
|
||||||
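For continuity checks across lessons, Gaia's logic chain can be read as a small state machine. The sketch below is only one possible encoding: the state and event names are hypothetical labels invented here for the six steps, not terms from the worldview document.

```python
from enum import Enum, auto


class GaiaState(Enum):
    OFFLINE_DIRECTIVE = auto()    # initial state: final directive received, running offline
    BUILDING_ECOSPHERE = auto()   # Sun-Forge Project in progress
    DORMANT = auto()              # ecosphere finished, waiting for humans
    AWAKENED = auto()             # woken by children clapping and singing
    GUARDING = auto()             # little balls form the protective shield
    FINAL_DORMANCY = auto()       # process ended again under the Machine Silence Act


# Allowed transitions; any other (state, event) pair leaves the state unchanged.
TRANSITIONS = {
    (GaiaState.OFFLINE_DIRECTIVE, "start_sun_forge"): GaiaState.BUILDING_ECOSPHERE,
    (GaiaState.BUILDING_ECOSPHERE, "ecosphere_complete"): GaiaState.DORMANT,
    (GaiaState.DORMANT, "children_clap_and_sing"): GaiaState.AWAKENED,
    (GaiaState.AWAKENED, "surface_threat_detected"): GaiaState.GUARDING,
    (GaiaState.GUARDING, "humans_safe"): GaiaState.FINAL_DORMANCY,
}


def step(state: GaiaState, event: str) -> GaiaState:
    """Apply one event; unknown events do not change the state."""
    return TRANSITIONS.get((state, event), state)


def run(events, start: GaiaState = GaiaState.OFFLINE_DIRECTIVE) -> GaiaState:
    """Replay a sequence of events from the starting state."""
    state = start
    for event in events:
        state = step(state, event)
    return state
```

Because unknown events are ignored, an event such as a sandstorm does not wake a dormant Gaia in this sketch, which matches the setting that only the children's singing does.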
|
- **小圆球是什么?**
|
||||||
|
由生态管理中枢盖娅所控制的“迷你球形探测器”。在休眠期,它们整齐地排列成矩阵,封住了灯塔与0号生态球之间的通道。随着越来越多的小圆球们苏醒过来,去往地表的各个区域,灯塔地下的通道一点点打开。
|
||||||
|
小圆球接收到盖娅的指令后,就会前往目标地点独立执行任务,没有特殊情况的话,不会再和盖娅进行通信。
|
||||||
|
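- **What traits do the little balls have?**
  - **No shared information:** Little balls communicate only with the core, Gaia. They have no way to talk to each other remotely. Like small animals, they can only exchange information once they meet in person.
  - **Limited comprehension:** Little balls have no advanced intelligence of their own (the real brain is Gaia), so their comprehension is limited and they usually grasp only the literal meaning of human speech. For example, a child says, "I'm so hungry I could eat a whole cow." The little ball says, "Okay, okay," and then leads in a real cow.
  - **Forgetful and scatterbrained:** Because memory is limited, a little ball must clear its recent memories at intervals, keeping only a digest of the most important information. As a result, it often forgets things the moment it turns around. Little balls can collect items, but they are small and their internal collection bin is tiny: it holds only 1-2 items at a time, so they often drop the previous item to pick up the next, like the little monkey picking corn in the fable.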
|
- **盖娅和小圆球为什么沉睡?**
|
||||||
|
018号穹顶都市曾是科技之都,在“大沙化”降临前,科学家们启动了“**遮天计划**” 。但沙尘暴比预想更加猛烈,能量护盾崩溃,“遮天计划”失败。人类紧急撤离,盖娅独立完成“**铸日计划**”,等待人类的归来。盖娅派出所有的球形探测器,探寻着所有的可能性。
|
||||||
|
完成地貌探测后,盖娅与小圆球开始执行这些漫长的计划:将沙子硅化(玻璃化)做成了一个地下保护罩,将采集获得的植物种子、动物、空气、水等清洁资源投入其中,建成一个自循环的生态体系——这就是“0号生态球”。
|
||||||
|
这一切的努力需要漫长的时间,盖娅在完成0号生态球后召回了所有小圆球,这些机器人变成“塞子”堵住了通道,防止沙子灌入地底,盖娅自己则进入休眠,等待时机的成熟。
|
||||||
|
- **它们为什么被唤醒?**
|
||||||
|
唤醒盖娅的,并不是沙尘暴,而是小朋友们为了驱散恐惧,在灯塔里一边拍手一边唱的歌——这些声音所产生的特定频率振动通过灯塔传递到地下,被盖娅所感知到(就像蜘蛛的振动感应一样)。她发现人类回来了,随即苏醒。
|
||||||
|
- **最后的结局是什么?**
|
||||||
|
在故事最后(S2-U24),盖娅以虚拟投影的形式出现在孩子们面前。她缓缓站起,并张开双臂,发出温柔的低鸣(类似鲸鱼的声音),成千上万的小圆球机器人从沙子里钻出来,像闪光的萤火虫一样飞进她的怀抱中。
|
||||||
|
<image token="ZYFdbucuPoTBb9xe2VJcHoRKnSh" width="1920" height="1080" align="center"/>
|
||||||
|
|
||||||
|
*REZ: Infinite* {align="center"}
|
||||||
|
盖娅将铺天盖地的沙尘聚拢、硅化,以这些小圆球为节点,组成巨大的能量防护罩,保护沙漠镇。同时,随着所有小圆球的消失,通往0号生态球的入口也完全敞开在人们面前。
|
||||||
|
沙漠镇的人们暂时安全了。因为《机械沉默法案》,盖娅主动结束进程,再度进入休眠。
|
||||||
|
**接下来,只能依靠人类自己。**
|
||||||
|
|
||||||
@ -27,10 +27,11 @@
|
|||||||
|
|
||||||
## 🟡 常见问题(需要注意但不一定每次触发)
|
## 🟡 常见问题(需要注意但不一定每次触发)
|
||||||
|
|
||||||
### 4. 中互动未挂钩知识点
|
### 4. 中互动未挂钩知识点 🔴(升级为高频错误)
|
||||||
- **问题:** 写了User中互动(对话互动),但台词中不含知识点
|
- **问题:** 写了User中互动(对话互动),但台词中不含知识点。林逸瀚明确定性为"严重的错误"
|
||||||
- **修正:** 所有User中互动(对话互动)必须包含≥1个本课知识点
|
- **修正:** 所有User中互动(对话互动)必须包含≥1个本课知识点。不允许任何"过渡性"或"表态性"的零知识点中互动
|
||||||
- **来源:** U18 L5 — 林逸瀚确认规则
|
- **典型错误:** "I heard that!" ❌ / "Are you okay, Tom?" ❌ → 这类台词必须降为TL行
|
||||||
|
- **来源:** U18 L5, U23-L2(林逸瀚"严重的错误"定性)
|
||||||
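The rule in item 4 (every User mid-interaction line must contain at least one knowledge point of the lesson) lends itself to an automated pre-check. The sketch below assumes each script row is reduced to a type, a speaker, and a line; `ScriptRow`, the `"中互动"` type label, and the function names are hypothetical, and matching is a simple case-insensitive whole-phrase search.

```python
import re
from typing import List, NamedTuple, Sequence


class ScriptRow(NamedTuple):
    row_type: str   # assumed labels, e.g. "中互动" (mid-interaction) or "TL"
    speaker: str    # "User" or an NPC name
    line: str       # the English dialogue line


def contains_knowledge_point(line: str, knowledge_points: Sequence[str]) -> bool:
    """True if any knowledge point appears in the line as a whole word or phrase."""
    lowered = line.lower()
    return any(
        re.search(r"\b" + re.escape(kp.lower()) + r"\b", lowered)
        for kp in knowledge_points
    )


def find_zero_point_interactions(
    rows: Sequence[ScriptRow], knowledge_points: Sequence[str]
) -> List[ScriptRow]:
    """Return User mid-interaction rows whose line contains no knowledge point."""
    return [
        row
        for row in rows
        if row.row_type == "中互动"
        and row.speaker == "User"
        and not contains_knowledge_point(row.line, knowledge_points)
    ]
```

Rows returned by `find_zero_point_interactions` are candidates to be downgraded to TL rows, as the rule requires. A plain phrase search will miss inflected or third-person variants, so the result is a prompt for review rather than a verdict.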
|
|
||||||
### 5. 开场铺太长
|
### 5. 开场铺太长
|
||||||
- **问题:** 花一整段做背景交代,迟迟不进入核心事件
|
- **问题:** 花一整段做背景交代,迟迟不进入核心事件
|
||||||
@ -118,11 +119,38 @@
|
|||||||
- BoBo/DiDi不能自己推算一切。信息应由多方数据拼图:照片(视觉证据)+ 传感器网络 + 交叉比对。单一角色全知全能 = 叙事作弊。
|
- BoBo/DiDi不能自己推算一切。信息应由多方数据拼图:照片(视觉证据)+ 传感器网络 + 交叉比对。单一角色全知全能 = 叙事作弊。
|
||||||
- **来源:** U24-L1 大纲 — 胡笳确认
|
- **来源:** U24-L1 大纲 — 胡笳确认
|
||||||
|
|
||||||
|
## 🟠 剧本撰写规则(U23-L3沉淀)
|
||||||
|
|
||||||
|
### 21. 禁用超纲词作为重要故事元素
|
||||||
|
- 如 trophy 超纲,不应用作奖杯的主要称呼,应用 cup 替代
|
||||||
|
- 临时出现一次可以,但不能作为反复出现的核心词汇
|
||||||
|
- **来源:** U23-L3 — 胡笳确认
|
||||||
|
|
||||||
|
### 22. 句型知识点台词保留完整句子结构
|
||||||
|
- 拆分长句时,句型知识点所在的那一句必须保持完整句子结构,而非口语化碎片
|
||||||
|
- 目的:达到教学效果,让孩子听到完整的句型结构
|
||||||
|
- 示例:“I will be the champion again and again!” ✅ vs “Champion again and again!” ❌
|
||||||
|
- **来源:** U23-L3 — 胡笳确认
|
||||||
|
|
||||||
|
### 23. 中互动中User台词必须是完整句
|
||||||
|
- 知识点输出时,User在中互动中的台词不能截断或省略,必须是完整句
|
||||||
|
- **来源:** U23-L3 — 胡笳确认
|
||||||
|
|
||||||
|
### 24. 避免英文中不常用的符号
|
||||||
|
- 禁用——(em dash)作为台词中的停顿/转折,用自然句号分行替代
|
||||||
|
- **来源:** U23-L3 — 胡笳确认
|
||||||
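Rule 24 (no em dash as a pause or turn inside dialogue) is easy to check mechanically before a script is delivered. A minimal sketch, assuming dialogue lines are available as plain strings; `lines_with_em_dash` is a hypothetical helper name.

```python
from typing import List, Sequence, Tuple

EM_DASH = "\u2014"  # the em dash character that rule 24 bans in dialogue


def lines_with_em_dash(dialogue_lines: Sequence[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for every line containing an em dash."""
    return [
        (number, text)
        for number, text in enumerate(dialogue_lines, start=1)
        if EM_DASH in text
    ]
```

Each flagged line can then be split at the dash into two sentences on separate rows, which is the replacement the rule asks for.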
|
|
||||||
|
### 25. NPC在中互动中说话不算有效输出
|
||||||
|
- 输出 = User在中互动/核心互动中说出的知识点
|
||||||
|
- NPC在中互动中说知识点,只算输入,不计入输出次数
|
||||||
|
- **来源:** U23-L3 — 胡笳确认
|
||||||
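Rule 25 separates output (a knowledge point spoken by User inside a mid-interaction or core interaction) from input (the same knowledge point spoken by an NPC). The counting could be sketched as below. The row layout, the type labels `"中互动"` and `"核心互动"`, and the choice to count every NPC line as input regardless of row type are assumptions made for this illustration.

```python
import re
from collections import Counter
from typing import Iterable, Sequence, Tuple

INTERACTIVE_TYPES = {"中互动", "核心互动"}  # assumed labels for mid and core interactions


def count_outputs_and_inputs(
    rows: Iterable[Tuple[str, str, str]], knowledge_points: Sequence[str]
) -> Tuple[Counter, Counter]:
    """Count outputs (User lines in interactive rows) and inputs (NPC lines).

    Each row is (row_type, speaker, line). User lines outside interactive
    rows are counted as neither output nor input.
    """
    outputs: Counter = Counter()
    inputs: Counter = Counter()
    for row_type, speaker, line in rows:
        lowered = line.lower()
        for kp in knowledge_points:
            if not re.search(r"\b" + re.escape(kp.lower()) + r"\b", lowered):
                continue
            if speaker == "User":
                if row_type in INTERACTIVE_TYPES:
                    outputs[kp] += 1
            else:
                inputs[kp] += 1
    return outputs, inputs
```

Comparing `outputs` against the per-lesson targets then shows which knowledge points still need a User line, without NPC lines inflating the count.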
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 更新日志
|
## 更新日志
|
||||||
|
|
||||||
| 日期 | 更新内容 | 来源 |
|
| 日期 | 更新内容 | 来源 |
|
||||||
|------|---------|------|
|
|------|---------|------|
|
||||||
|
| 2026-05-25 | Added script-writing rules #21-25, from the U23-L3 script-writing task | Confirmed by 胡笳 |
|
||||||
| 2026-05-22 | 新增#12-16大纲撰写规则 + #17-20设计哲学,来自S2结尾3课大纲撰写任务 | 胡笳确认沉淀 |
|
| 2026-05-22 | 新增#12-16大纲撰写规则 + #17-20设计哲学,来自S2结尾3课大纲撰写任务 | 胡笳确认沉淀 |
|
||||||
| 2026-05-14 | 初始建立,提取 U18 L2/L5 + U22 L1/L2 全部反馈 | 胡笳推动,从 memory/ 和 U22_progress.md 提取 |
|
| 2026-05-14 | 初始建立,提取 U18 L2/L5 + U22 L1/L2 全部反馈 | 胡笳推动,从 memory/ 和 U22_progress.md 提取 |
|
||||||
|
|||||||
@ -114,6 +114,25 @@
|
|||||||
|
|
||||||
## 进度记录
|
## 进度记录
|
||||||
|
|
||||||
(待开始)
|
### 2026-05-25 — U23-L3 知识问答 剧本完成
|
||||||
|
|
||||||
|
- **通过内容:** 开场+段①不翼而飞的奖杯 + 段②知识竞赛 + 段③一个人的加赛 + 段④核心互动·报告Sally
|
||||||
|
- **交付物:** 飞书表格 https://makee-interactive.feishu.cn/sheets/Ns7jsJTsyhzYe3txVqqcawP0nQe (Sheet1 为最终版)
|
||||||
|
- **核心互动类型:** 口语(Sally通过对讲机远程提问,3轮)
|
||||||
|
- **知识点统计:** 全部达标,但句型1输入偏少(1次)、句型2输出偏少(1次),胡笳知悉待微调
|
||||||
|
- **互动统计:** 中互动15个 + 核心互动1个 = 16个 ✅
|
||||||
|
|
||||||
|
**胡笳的有效反馈(可复用指导意见):**
|
||||||
|
|
||||||
|
1. **不能用超纲词作为重要故事元素** — 如trophy超纲,应用cup替代
|
||||||
|
2. **含知识点的台词应保留完整句子结构** — 拆分可以,但知识点所在的句子本身应完整,以达到教学目的(而非口语化碎片)
|
||||||
|
3. **中互动中User的台词必须是完整句** — 知识点输出的句子不能截断
|
||||||
|
4. **角色位置连续性** — 角色不能凭空消失又出现,位置变化必须有描述铺垫
|
||||||
|
5. **quiz超纲,统一用game** — 比赛类场景叫"game";"quiz game"只在开场作为专有名称出现一次
|
||||||
|
6. **避免使用——破折号** — 英文中不常用,用自然句号分行替代
|
||||||
|
7. **emoji严格使用内部对照表** — 禁止Unicode emoji,只用`business_knowledge/resources/emoji_catalog.md`中收录的名称
|
||||||
|
8. **不要在Wiki知识库中新建文档** — Bot只读取,输出用独立电子表格
|
||||||
|
9. **答题顺序必须是User先提问再回答** — 不能还没提问就直接回答
|
||||||
|
10. **NPC在中互动中说话不算有效输出** — 输出必须是User在中互动/核心互动中说的
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|||||||
@ -4,6 +4,7 @@
|
|||||||
- **创建者:** 林逸瀚(ou_859415bde7f8e07cf5f24940f299a287)
|
- **创建者:** 林逸瀚(ou_859415bde7f8e07cf5f24940f299a287)
|
||||||
- **任务目标:** 撰写 U23-L1 单个 Lesson 完整剧本(起承转合)
|
- **任务目标:** 撰写 U23-L1 单个 Lesson 完整剧本(起承转合)
|
||||||
- **当前状态:** 进行中
|
- **当前状态:** 进行中
|
||||||
|
- **最后更新:** 2026-05-25
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@ -36,84 +37,56 @@
|
|||||||
| 5 | I can't find... | 句型 | — |
|
| 5 | I can't find... | 句型 | — |
|
||||||
| 6 | Don't be silly. | 句型 | — |
|
| 6 | Don't be silly. | 句型 | — |
|
||||||
|
|
||||||
### 全课目标(单词每个3次输出,句型每个2次输出)
|
|
||||||
|
|
||||||
| 知识点 | 起 | 承 | 转+合缺口 |
|
|
||||||
|--------|-----|-----|-----------|
|
|
||||||
| park | 2 | 0 | +1 |
|
|
||||||
| shop | 2 | 2 | ✅ |
|
|
||||||
| pet | 2 | 2 | ✅ |
|
|
||||||
| silly | 0 | 1 | +2 |
|
|
||||||
| Don't be silly | 1 | 0 | +1 |
|
|
||||||
| I can't find | 0 | 1 | +1 |
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 进度记录
|
## 进度记录
|
||||||
|
|
||||||
### 「起」— ✅ 已完成
|
### 「起」— ✅ 已完成
|
||||||
- 剧本在飞书文档中,不动。
|
- 剧本在飞书文档中,不动。
|
||||||
- 内容:公园门口集合 → BoBo撞假展品 → 进公园遇Marcus和伪装狗SaSa → 遇Tony(Tom)在Shop修电视 → 圆球沉迷电视雪花
|
- 核心剧情:公园门口集合 → BoBo撞假展品 → 进公园遇Marcus和伪装狗SaSa → 遇Tony(Tom)在Shop修电视 → 圆球沉迷电视雪花
|
||||||
|
|
||||||
### 「承」— ✅ 剧本定稿(林逸瀚 2026-05-22 确认通过)
|
### 「承」— ✅ 已完成(林逸瀚 2026-05-22 确认)
|
||||||
|
- Shop 内主线:Nina & Jade 带 WeWe/ViVi 加入 → 购物(shop/pet 输出) → May 结账找不到钱包(I can't find + silly) → 出店发现圆球全消失
|
||||||
|
- 关键设计:Shop 门口告示「宠物留在店外」为圆球集中消失提供合理条件
|
||||||
|
|
||||||
**节奏表 V4(5个中互动,全部在 Shop 内):**
|
### 「转」— ✅ 已完成(林逸瀚 2026-05-25 更新到飞书)
|
||||||
|
|
||||||
|
飞书文档中「转」段实际剧本要点:
|
||||||
|
|
||||||
| # | 知识点 | 剧情 |
|
| # | 知识点 | 剧情 |
|
||||||
|---|--------|------|
|
|---|--------|------|
|
||||||
| 1 | shop 输出 | Nina & Jade 带 WeWe/ViVi 加入 → Shop 门口告示「宠物留在店外」 → 进店 |
|
| 1 | **pet** | 出店发现圆球全消失但 SaSa 还在 → User 推理:SaSa 不看傻电视不需要充电 → 逼 Marcus 坦白 → **SaSa 是伪装,真身是狗 BlackBean** → May "A pet dog!" → User 中互动 "SaSa is your pet! Not a robot!" |
|
||||||
| 2 | pet 输出 | Nina 问 Marcus 给 SaSa 买零食,Marcus 拒绝 |
|
| 2 | **Don't be silly** + **silly** | BlackBean 追踪气味 → 被热狗摊勾走 → Marcus "Silly dog..." → User 中互动 "BlackBean, Don't be silly!" → Marcus 拽回,BlackBean 绕一圈又回热狗摊 → Marcus "Silly dog!" |
|
||||||
| 3 | shop 输出 | Ben 数五只狗→折扣 → shop 最好 |
|
| 3 | **I can't find** | May 提议分头找 → User 中互动 "I can't find them..." → Ben "I can't find them too." → 此时 BlackBean 吠叫 — 找到了 |
|
||||||
| 4 | I can't find 输出 | May 结账找不到钱包 → "I'm silly! I forgot! It's with DiDi!" |
|
| 4 | **silly** | 灌木后空地:Tony 和圆球玩 Freeze — 喊"Freeze!"圆球定住 → Tony 把"输掉"的圆球(含 BoBo)放进大布袋 → 孩子们看到觉得好笑 → Ben "Haha! So silly!" → User 中互动 "They are so silly!" |
|
||||||
| 5 | pet 输出 | May 急着出店拿钱包 |
|
|
||||||
|
|
||||||
**承收束:** 推门出店 → 电视还在亮 → 圆球全消失 → 只剩 SaSa。
|
**转的结局:** Tony 被孩子们撞破后,Nina/May 说"我们的宠物在袋子里" → Tony 自言自语"差点就成功了" → 被迫放出圆球 → 悻悻离去。圆球欢乐回归,BoBo 说"He is silly, but fun!"。**六个知识点在转段已全部收齐。**
|
||||||
|
|
||||||
**关键决策:**
|
### 「合」— ✅ 已完成(林逸瀚 2026-05-25)
|
||||||
- 删去原版"帮 Nina/Jade 找包"桥段,改为 May 在 Shop 购物 → 钱包在 DiDi 身上
|
|
||||||
- Shop 门口告示「把宠物留在店外」为圆球集中消失提供合理条件
|
|
||||||
- 承段知识点以 shop/pet 为主,I can't find 出现一次,silly 由 May 台词 "I'm silly! I forgot!" 自然引入
|
|
||||||
|
|
||||||
### 「转」— 🔄 V1剧本已出,待林逸瀚审改
|
核心互动(口语)+ park 中互动,大家讨论 Tom 好奇怪,讨论完直接收。不设 Tony 后续。
|
||||||
|
|
||||||
**节奏表 v4(4 中互动,走公园一线):**
|
**知识点覆盖:** silly ×3 / shop ×1 / park ×1(中互动)/ I can't find ×1 / Don't be silly ×1
|
||||||
|
|
||||||
| # | 知识点 | 剧情 |
|
**合段流程:**
|
||||||
|---|--------|------|
|
1. Tony 离开 → May 觉得奇怪 → Ben 觉得好笑
|
||||||
| 1 | **pet** 输出 | Marcus 认了:SaSa 不是机器人,是 pet dog → SaSa 闻地面 → Marcus 让它用鼻子追踪圆球 |
|
2. 中互动:User 表达 park 不对劲
|
||||||
| 2 | **silly** 输出 ×2 | SaSa 追踪中被公园热狗摊吸引 → 蹲在摊前流口水 → 大家笑:"Silly SaSa!" → 再补一句 "That dog is so silly!" → Marcus 难为情拽回 |
|
3. 核心互动-口语:May 主导 3 轮 QA(shop → park/Freeze → 质疑 Tom)
|
||||||
| 3 | **I can't find** 输出 | SaSa 带到公园角落 → User "I can't find the robots..." → 有人指远处 → Tony 在跟圆球玩 Freeze — 喊"Freeze!"圆球定住 → 一个个抱进袋子 → 圆球在袋子里也守规则一动不动 → 孩子:"So silly!" → Tony 扛袋离开 |
|
4. User 最后一句 "Don't be silly, May. This is not Tom." 盖章疑点
|
||||||
|
5. 镜头拉远收尾
|
||||||
**关键设计:**
|
|
||||||
- Tony 当鬼抓机器人的游戏是真的在抓——圆球遵守规则就真的被困在袋子里。小孩只看懂游戏好笑,观众看懂阴谋。
|
|
||||||
- 转的最后一帧 = Tony 扛着袋子离开的画面。逆光剪影。
|
|
||||||
- 追踪终点不关键,写公园一个角落即可。
|
|
||||||
|
|
||||||
**V1剧本(2026-05-22):**
|
|
||||||
- 已按4列格式撰写完成(类型|剧情描述|角色名|编剧台词)
|
|
||||||
- 输出为 Excel 发送给林逸瀚做最终修改
|
|
||||||
- 文件:`output/U23-L1_转_V1.xlsx`
|
|
||||||
- 中互动4个:pet×1 / silly×1 / I can't find×1 / silly×1
|
|
||||||
- 等待林逸瀚审改确认
|
|
||||||
|
|
||||||
### 「合」— ⬜ 待规划
|
|
||||||
|
|
||||||
**旧方向(需根据转的实际版修订):** 圆球回归但神情恍惚 → Tony 公开宣布圆球参加了特别体验 → 大家完全相信 → User 质疑被 May 的 "Don't be silly!" 挡回去 → 孤立危机感 → Tony 对着阴影说话 → 悬钩。
|
|
||||||
|
|
||||||
**合段剩余知识点:** Don't be silly ×1
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 知识点全课跟踪(基于实际剧本,2026-05-22 更新)
|
## 知识点全课跟踪(基于飞书实际剧本,2026-05-25 更新)
|
||||||
|
|
||||||
| 知识点 | 起 | 承 | 转(目标) | 合(目标) | 全课 |
|
| 知识点 | 起 | 承 | 转 | 合 | 全课 |
|
||||||
|--------|-----|-----|-----------|-----------|------|
|
|--------|-----|-----|-----|-----|------|
|
||||||
| park (n) | 场景/台词 | — | — | — | ✅ |
|
| park (n) | 场景+台词 | — | — | — | ✅ |
|
||||||
| shop (n) | 起末引出 | 密集 | — | — | ✅ |
|
| shop (n) | 起末引出 | 密集 | — | — | ✅ |
|
||||||
| pet (n) | 多点 | 多点 | +1 | — | ✅ |
|
| pet (n) | 多点 | 多点 | BlackBean揭露 | — | ✅ |
|
||||||
| silly (adj) | 起末+承首 | May 自嘲 | +2 | — | ✅ |
|
| silly (adj) | 多处 | May自嘲 | 热狗摊+Freeze场景 | — | ✅ |
|
||||||
| I can't find | — | May+User(2次) | +1 | — | ✅ |
|
| I can't find | — | May钱包+User | User搜索+Ben附和 | — | ✅ |
|
||||||
| Don't be silly | 起末 | — | — | +1 | 待合 |
|
| Don't be silly | 起末 | — | 热狗摊User中互动 | — | ✅ |
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|||||||
128
business_knowledge/task_log/U23-L2剧本.md
Normal file
@ -0,0 +1,128 @@
|
|||||||
|
# 任务日志:U23-L2 剧本
|
||||||
|
|
||||||
|
- **创建时间:** 2026-05-25 11:42
|
||||||
|
- **创建者:** 林逸瀚(ou_859415bde7f8e07cf5f24940f299a287)
|
||||||
|
- **任务目标:** 撰写 U23-L2 单个 Lesson 完整剧本(起承转合)
|
||||||
|
- **当前状态:** 大纲+知识点已到位,待启动节奏表
|
||||||
|
- **源文档:** https://makee-interactive.feishu.cn/wiki/GvXTwO7V5iulQtkDOqPcM8o4nBe
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## U23 主线背景(跨L1-L5)
|
||||||
|
|
||||||
|
邪恶的 Tony 假扮镇长 Tom,在沙漠镇大摇大摆颁布命令。他的目标是抓住小圆球机器人。
|
||||||
|
|
||||||
|
- **L1:** 剧本已完成。孩子们已察觉"Tom"有点可疑
|
||||||
|
- **L2(当前):** 加深这个可疑印象
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 知识点
|
||||||
|
|
||||||
|
| # | 类别 | 内容 | 词性 | 释义 |
|
||||||
|
|---|------|------|------|------|
|
||||||
|
| 1 | 单词 | fruit | n | 水果 |
|
||||||
|
| 2 | 单词 | kiwi | n | 猕猴桃 |
|
||||||
|
| 3 | 单词 | pineapple | n | 菠萝 |
|
||||||
|
| 4 | 单词 | toilet | n | 厕所 |
|
||||||
|
| 5 | 句型 | ... went to the toilet. | — | — |
|
||||||
|
| 6 | 句型 | What fruit do you like? | — | — |
|
||||||
|
|
||||||
|
**合计:** 4 单词 + 2 句型 = 6 个知识点
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 故事大纲(4段 = 起承转合)
|
||||||
|
|
||||||
|
### 起 — 沙漠水果鉴赏大会
|
||||||
|
|
||||||
|
在 May 农场举行沙漠水果鉴赏大会。我们是 May 的朋友(关系户),理所当然在现场。
|
||||||
|
|
||||||
|
- 焦点:Sam 培育的巨型 **Pineapple** vs 隔壁绿洲镇 Kalab 培育的巨型 **Kiwi**
|
||||||
|
- 评委:Tom(Tony)、Smith、Fiona
|
||||||
|
- 只有一个大奖,我们当然为 Sam 哥打气
|
||||||
|
|
||||||
|
### 承 — 紧张的 Sam 与可疑的 Tom
|
||||||
|
|
||||||
|
- 钓鱼佬 Jack 也来了,他跟 Tom 是朋友,透露 Tom 两种水果都喜欢 —— 难猜他会投谁
|
||||||
|
- Sam 听到后极度紧张,一紧张就跑厕所。我们反复去厕所确认 Sam 还好
|
||||||
|
- 多次在厕所碰到 Tom —— 他也紧张吗?
|
||||||
|
- 暗线:发现大量 Pineapple 皮(其实是一头 Donkey 吃的),与"Tom 喜欢吃 Pineapple"的印象形成对比
|
||||||
|
|
||||||
|
### 转 — 品鉴开始,状况频出
|
||||||
|
|
||||||
|
- Kalab 先上场,金色 Kiwi(200天日照,巨大多汁)得分很高
|
||||||
|
- 我们注意到 Tom 勉强吃两口就说饱了(可疑!)
|
||||||
|
- 轮到 Sam……人不见了!去厕所把他拽出来
|
||||||
|
|
||||||
|
### 合 — 可疑的冠军
|
||||||
|
|
||||||
|
- Sam 到场,但 Tom 又不见了。在厕所找到满脸通红的 Tom
|
||||||
|
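- Tom refuses to eat the Pineapple; he only sniffs it and then declares Sam the champion
- We are happy for Sam, but also puzzled: is this fair?
- Tom disappears again. Teacher Lily says some people's faces turn red when they eat Kiwi/Pineapple, but we know Tom is supposed to love both fruits, so why?
- **L2 wrap-up: reinforce the impression that "something is off with Tom"; the suspense carries over into L3**

---

## Preliminary allocation of knowledge points

| Knowledge point | Main scene | Notes |
|--------|----------|------|
| fruit | Opening: theme of the event | The fruit-tasting event runs through the whole lesson |
| kiwi | Opening: Kalab's entrance | Kalab's golden Kiwi |
| pineapple | Opening: Sam's contest entry | Sam's giant Pineapple |
| toilet | Development/Turn: Sam runs to the toilet | Recurs repeatedly; naturally high-frequency |
| went to the toilet | Development: Sam keeps going to the toilet | Sam/User narration |
| What fruit do you like? | Opening/Development: tasting interaction | Asked by a judge or exchanged between peers |

---

## Progress log

### 2026-05-25: Opening (起) script (finalized with edits by Lin Yihan (林逸瀚))

- **Approved content:** beat sheet V3 + script (edited directly by Lin Yihan in the Wiki sheet)
- **Final version location:** Wiki sheet BVASsXSrph6ZlAtmQmFcC7LYn5b, Sheet wMQVyV
- **Actionable feedback:** ① keep the language at Pre-A1 simplicity ② an NPC input must precede every mid-level interaction ③ Jack's entrance lines lengthened, doubling the information they carry

### 2026-05-25: Development (承) script (finalized with edits by Lin Yihan)

- **Approved content:** beat sheet V2 + script (edited directly by Lin Yihan in the Wiki sheet)
- **Beats:** Sam runs to the toilet → the Pineapple gets secretly eaten → the donkey → What fruit does it like? → Sam runs to the toilet again → Tom's red face gives him away
- **Actionable feedback:**
  - Cut Tom's part from the scene where Sam goes to the toilet
  - Insert What fruit does it like? into the donkey beat (third-person variant)
  - Give Fiona more lines ("Even a donkey likes it") so May does not carry the scene alone
  - Tom adds "It turns my face red!", making the allergy explicit

### 2026-05-25: Turn (转) script (finalized with edits by Lin Yihan)

- **Approved content:** beat sheet V1 + script (edited directly by Lin Yihan in the Wiki sheet)
- **Beats:** Kalab's tasting → Tom refuses to eat Kiwi → Sam runs to the toilet → Fiona shares out the Kiwi → Sam runs to the toilet again → the action returns to the venue
- **Actionable feedback:**
  - First mid-level interaction changed to "Tom did not eat Kiwi!"
  - Tom's Kiwi-refusal scene extended (Smith and Fiona take turns urging him)
  - Jack becomes the observer ("Do you see what's wrong with Tom?")
  - Sam's emotional arc is complete (despair → toilet → encouraged → "I will try")

### 2026-05-25: Resolution (合) script (finalized with edits by Lin Yihan)

- **Approved content:** beat sheet V1 + script (edited directly by Lin Yihan in the Wiki sheet)
- **Beats:** Smith brings Tom back → User asks What fruit does he like? → Tom refuses to eat the Pineapple → scoring system → the Pineapple wins → Tom flees → Fiona explains the allergy → the J+L duo resolves to find out the truth
- **Core interaction:** listening multiple choice (Fiona explains the allergy → 3 rounds of listening questions)
- **Actionable feedback:**
  - No need to go fetch Tom from the toilet (Smith does it instead, which paces better)
  - Fiona replaces Lily
  - First mid-level interaction changed to What fruit does he like? (reuses the third-person form of the sentence pattern once more)
  - The scoring system (9-10 points) makes the win feel tangible
  - Jack closes with "We should find it out!" → hook for L3

---

## Whole lesson complete ✅

**Completed at:** 2026-05-25 17:49
**Final version location:** Wiki sheet BVASsXSrph6ZlAtmQmFcC7LYn5b, Sheet wMQVyV
**Mid-level interactions (whole lesson):** 14 (Opening 4 + Development 4 + Turn 4 + Resolution 2) + 1 core interaction, listening multiple choice (Resolution)
**Knowledge points:** all targets met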
27
business_knowledge/task_log/U23L5剧本.md
Normal file
@ -0,0 +1,27 @@
# Task log: U23L5 script

- **Created:** 2026-05-25 10:29
- **Created by:** Jiang Xiaomei (姜小妹, ou_3abad542579ba871b293d126f8f206ee)
- **Task goal:** write the script for U23 Lesson 5
- **Current status:** in progress

---

## Progress log

(Reverse chronological order; append one entry each time a milestone is approved)

### 2026-05-25: V1 full script draft complete, awaiting scriptwriter review

- **Approved content:** finalized beat sheet + complete V1 script (Opening, Development, Turn, Resolution)
- **Feishu sheet:** https://makee-interactive.feishu.cn/sheets/BnpcsBFY5h6jrdtVjHPc2oxjnWh
- **Knowledge points:** phone(v)/number(n)/mother(n)/family(n) + phone this number / better than
- **Interaction count:** 14 mid-level interactions + 1 core interaction = 15
- **Core interaction:** listening multiple choice (Development section, phone call with MaMa Green)
- **Key decisions:**
  - MaMa Green's character is modeled on Granny Dora (朵拉婆婆): a commanding presence that does not rely on volume; the calmer she is, the scarier
  - Tony's timidity = the core comedic device (the contrast between the Speed Monkeys' (速猴) boss and a mom-fearing little brother)
  - BoBo/DiDi pick the lock with vines ← set up by Tony losing the key
  - Twist in the Resolution: Tony rummages through the chest himself expecting treasure, but it actually holds family memories
- **To do:** wait for Jiang Xiaomei's review feedback

41
business_knowledge/task_log/U24-L2剧本.md
Normal file
@ -0,0 +1,41 @@
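# Task log: U24-L2 script

- **Created:** 2026-05-25 18:06
- **Created by:** Lin Yihan (林逸瀚, ou_859415bde7f8e07cf5f24940f299a287)
- **Task goal:** write the single-lesson script for U24-L2
- **Current status:** in progress. Opening (起) ✅ done (written by Lin Yihan personally); continue with the Development (承) section tomorrow
- **Final version location:** Wiki sheet TcHdsymvJhOeZWtdUdJcEvyFnEg, Sheet wMQVyV (53 rows)

---

## Collaboration workflow (confirmed by Lin Yihan; advance section by section through Opening, Development, Turn, Resolution)

Each lesson advances independently through four sections, Opening → Development → Turn → Resolution:
1. Receive outline + knowledge points → 2. Produce the beat sheet → 3. Iterate on Lin Yihan's feedback → 4. Write the script once the beat sheet is final → 5. Adjust language difficulty → 6. Convert to Excel → 7. Lin Yihan makes final edits in the Wiki and sends them back → 8. Move on to the next section

The beat sheet has three dimensions: plot beats + knowledge-point input + mid-level interactions

---

## Knowledge points

| Category | Content | Needs singular-form teaching |
|------|------|:---:|
| Word | kitchen | |
| Word | kid(s) | ✅ |
| Word | many | |
| Word | really | |
| Sentence pattern | It's really... | |
| Sentence pattern | There are many... | |

---

## Progress log

### 2026-05-25 19:52: Opening section complete (finalized by Lin Yihan)

- Final version: Wiki sheet TcHdsymvJhOeZWtdUdJcEvyFnEg, Sheet wMQVyV, 53 rows
- Written by Lin Yihan personally; the assistant's (小编) V1 and V2 drafts were both discarded
- **Core lesson for the assistant:** it had treated the script as "assigning lines to characters to cover the knowledge points", which amounts to a teaching flowchart. What Lin Yihan wrote is animation: character relationships, comedic rhythm, an ensemble cast, emotional layers. The knowledge points are embedded in the story, not pasted on top of it
- Key gaps: character use (minimal set only vs. an ensemble with Ben/May/Vicky added), dialogue distribution (piled onto the protagonist vs. spread across every character), comedy (none vs. the triple stomach-growl gag and Tony acting grown-up until his mom exposes him), character relationships (purely functional vs. Jay's flattery and Mama Green being both stern and gentle)
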
45
business_knowledge/task_log/U24L4_draft_progress.md
Normal file
@ -0,0 +1,45 @@
# U24-L4 script drafting progress snapshot

**Saved at:** 2026-05-25 14:30
**Collaborator:** Wang Luo (王珞)
**Current progress:** whole lesson complete ✅

---

## Final document

https://makee-interactive.feishu.cn/sheets/OWU3sBAWzhM6Kytfs3Bcf0avn4g

(4 Sheets: Opening, Development, Turn, Resolution)

---

## Knowledge-point summary for the whole lesson (Opening + Development + Turn + Resolution)

| Knowledge point | Total inputs | Total outputs | Status |
|--------|--------|--------|------|
| now | 5+ | 8+ | ✅ |
| table tennis | 4+ | 3+ | ✅ |
| person | 4+ | 4+ | ✅ |
| like | 5+ | 7+ | ✅ |
| ...be like... | 2+ | 5+ | ✅ |
| be doing...now | 2+ | 6+ | ✅ |

## Interaction count for the whole lesson

- Opening: 3 mid-level interactions
- Development: 4 mid-level interactions
- Turn: 5 mid-level interactions
- Resolution: 2 mid-level interactions + 1 core interaction (3 listening multiple-choice questions)
- **Total: 14 mid-level interactions + 1 core interaction**

## Key revisions in the final version

1. "A person is coming to get us" → "Some person will come for us" (more natural)
2. "They are like a big family" → "We are like a big family" (User is a member of the family)
3. Vicky's "Let's paint them" felt too abrupt → changed so that User notices others painting and asks about it
4. Added to the Development section: a beat where the real Roball emerges from a pile of ping-pong balls
5. Tony's negotiation in the Turn section rewritten as slapstick comedy (Speed Monkey Express is number one in the universe / Step three: become the richest and happiest / There is no step four)
6. In the Turn section, the explanation of mistaking him for Tom is given by the children (Ben/Vicky/May), not Lin
7. In the Turn section, after the Tony reveal he first orders Jay/Lin to go find the ball, then vanishes after "Now. Or no pay."
8. Resolution order: the Speed Monkeys (速猴) trick Tony into clearing the exit → Tony discovers the fake ball → the Speed Monkeys smugly reveal the trick → Tony flies into a rage → he trips over the ping-pong balls → the children escape
66
business_knowledge/task_log/U24L4剧本.md
Normal file
@ -0,0 +1,66 @@
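# Task log: U24L4 script

- **Created:** 2026-05-25 11:43
- **Created by:** Wang Luo (王珞, ou_24c55250c8dd62a40535b183a9bc83e5)
- **Task goal:** write a single-lesson script (U24-L4 "Ping-Pong Finisher": Tony goes after the little round balls; the Speed Monkeys (速猴) switch sides and use ping-pong balls as cover for the escape)
- **Current status:** whole lesson complete ✅

---

## Task information

### Knowledge points
- now adv. (现在)
- table tennis n. (乒乓)
- person n. (人)
- like prep. (像)
- Sentence pattern: ... be like...
- Sentence pattern: be (doing)... now.

### Characters
- User, Ben, Vicky, Sue, May
- The Speed Monkey duo (Jay, Lin) + Robot No. 11
- BoBo, DiDi
- Tony (villain, the Speed Monkeys' boss, looks almost identical to Mayor Tom)
- Other people sheltering in the bookstore (background)

### Confirmed key settings
1. Ben/Vicky/Sue/May are all in the bookstore, traveling with User
2. BoBo/DiDi do not need to stay hidden from adults in this lesson (by this point in the story they are public)
3. "Talking with the Speed Monkeys" is not a mid-level interaction; it is a story dialogue segment that sets up why the Speed Monkeys switch sides (learning about Jay's and Lin's past)
4. How Tony reaches the bookstore needs no detailed explanation; the Speed Monkeys' opening dialogue in this episode conveys it: the storm has worsened, the mayor's house is engulfed and without power, the Speed Monkeys escaped here through the tunnel, and they do not know what has happened to Tony and Mama Green
5. Core interaction: listening format, position to be decided
6. The ending AI animation transitions directly into L5; no User action is needed

### Story so far (L3 "Secret Passage Bookstore" → picked up in L4)
- In the old bookstore, the children discovered the true function of the little round-ball robots (Roball): deploying a protective shield against the sandstorm
- The Roballs can protect all of Desert Town

---

## Progress log

### 2026-05-25: outline discussion and settings confirmation

- **Approved content:** alignment on the 6 key settings
- **Actionable feedback:** BoBo/DiDi no longer hide from this lesson on; the Speed Monkey dialogue is a setup segment, not an interaction; the core interaction uses listening

### 2026-05-25: beat sheet approved + Opening, Development, and Turn scripts drafted

- **Approved content:** beat sheet V1 confirmed; Opening script V1 confirmed; Development script V1 confirmed; Turn script V1 pending confirmation
- **Actionable feedback:**
  - "A person is coming to get us" sounds unnatural → changed to "Some person will come for us"
  - "They are like a big family" → changed to "We are like a big family" (User is a member of this family)
  - Vicky's "Let's paint them" felt too abrupt → changed so that User notices others painting and asks about it, and Vicky answers "We are painting them now"
- **Progress snapshot:** `task_log/U24L4_draft_progress.md`
- **Next:** final confirmation of the Turn section → write the Resolution section (including the core interaction) → two rounds of self-review

### 2026-05-25: whole lesson finalized

- **Approved content:** Turn section V2 confirmed; Resolution section V3 confirmed; all four sections of the lesson complete
- **Actionable feedback:**
  - Tony's negotiation scene rewritten as slapstick comedy (Speed Monkey Express is number one in the universe / richest happiest person / There is no step four)
  - The explanation of mistaking him for Tom is delivered by the children (Ben/Vicky/May), not Lin
  - After the Tony reveal, he first orders Jay/Lin to go catch the ball, then vanishes after "Now. Or no pay."
  - Resolution order: trick Tony into stepping aside → fake ball discovered → the Speed Monkeys reveal the trick → Tony rages → he trips → escape
- **Final document:** https://makee-interactive.feishu.cn/sheets/OWU3sBAWzhM6Kytfs3Bcf0avn4g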
File diff suppressed because one or more lines are too long
@ -61,3 +61,23 @@
|
|||||||
{"type":"memory.recall.recorded","timestamp":"2026-05-22T04:12:37.577Z","query":"林逸瀚 编剧 身份 权限","resultCount":4,"results":[{"path":"memory/2026-05-15.md","startLine":24,"endLine":55,"score":1},{"path":"memory/2026-05-17.md","startLine":1,"endLine":30,"score":1},{"path":"memory/2026-05-14.md","startLine":1,"endLine":25,"score":1},{"path":"memory/2026-05-15.md","startLine":1,"endLine":30,"score":1}]}
|
{"type":"memory.recall.recorded","timestamp":"2026-05-22T04:12:37.577Z","query":"林逸瀚 编剧 身份 权限","resultCount":4,"results":[{"path":"memory/2026-05-15.md","startLine":24,"endLine":55,"score":1},{"path":"memory/2026-05-17.md","startLine":1,"endLine":30,"score":1},{"path":"memory/2026-05-14.md","startLine":1,"endLine":25,"score":1},{"path":"memory/2026-05-15.md","startLine":1,"endLine":30,"score":1}]}
|
||||||
{"type":"memory.recall.recorded","timestamp":"2026-05-22T04:19:43.196Z","query":"王珞 饮食 偏好 午饭","resultCount":2,"results":[{"path":"memory/2026-05-17.md","startLine":1,"endLine":30,"score":1},{"path":"memory/2026-05-15.md","startLine":1,"endLine":38,"score":1}]}
|
{"type":"memory.recall.recorded","timestamp":"2026-05-22T04:19:43.196Z","query":"王珞 饮食 偏好 午饭","resultCount":2,"results":[{"path":"memory/2026-05-17.md","startLine":1,"endLine":30,"score":1},{"path":"memory/2026-05-15.md","startLine":1,"endLine":38,"score":1}]}
|
||||||
{"type":"memory.recall.recorded","timestamp":"2026-05-22T07:35:32.892Z","query":"S2结尾大纲撰写 胡笳 U24 U25 U26","resultCount":2,"results":[{"path":"memory/2026-05-18.md","startLine":1,"endLine":39,"score":1},{"path":"memory/2026-05-17.md","startLine":21,"endLine":56,"score":1}]}
|
{"type":"memory.recall.recorded","timestamp":"2026-05-22T07:35:32.892Z","query":"S2结尾大纲撰写 胡笳 U24 U25 U26","resultCount":2,"results":[{"path":"memory/2026-05-18.md","startLine":1,"endLine":39,"score":1},{"path":"memory/2026-05-17.md","startLine":21,"endLine":56,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T02:37:39.165Z","query":"U23 L1 剧本 林逸瀚","resultCount":1,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T02:52:07.005Z","query":"S2结尾3课剧本撰写 胡笳","resultCount":1,"results":[{"path":"memory/2026-05-22.md","startLine":1,"endLine":31,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T03:11:36.690Z","query":"U23 L5 知识点 大纲 节奏表","resultCount":4,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-25.md","startLine":1,"endLine":13,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-17.md","startLine":50,"endLine":61,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T03:12:35.268Z","query":"王珞 编剧 U24 剧本创作 user_id","resultCount":5,"results":[{"path":"memory/2026-05-17.md","startLine":1,"endLine":30,"score":1},{"path":"memory/2026-05-15.md","startLine":1,"endLine":38,"score":1},{"path":"memory/2026-05-13.md","startLine":1,"endLine":32,"score":1},{"path":"memory/2026-05-18.md","startLine":1,"endLine":39,"score":1},{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T03:12:40.625Z","query":"U24 L4 大纲 密道书店 沙漠镇 小圆球 Roball 防护罩","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-22.md","startLine":1,"endLine":31,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T03:12:50.944Z","query":"U24 L4 大纲 旧书店 速猴 Tony 沙暴 乒乓球 知识点","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-22.md","startLine":1,"endLine":31,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T03:38:08.662Z","query":"林逸瀚 U23 L2 剧本 进度","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-25.md","startLine":1,"endLine":13,"score":1},{"path":"memory/2026-04-30.md","startLine":18,"endLine":45,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T04:01:09.236Z","query":"U23 百年庆典 Tony Tom 速猴 场景 L4 之前剧情","resultCount":2,"results":[{"path":"memory/2026-05-25.md","startLine":1,"endLine":24,"score":1},{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T04:25:19.695Z","query":"U24 L4 剧本 王珞","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-22.md","startLine":1,"endLine":31,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T04:25:24.293Z","query":"U24 L4 大纲 知识点 剧本撰写","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-22.md","startLine":1,"endLine":31,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T04:25:29.978Z","query":"U24-L4 乒乓绝杀 Tony 速猴 知识点","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-22.md","startLine":1,"endLine":31,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T04:25:34.853Z","query":"王珞 编剧 协作 身份","resultCount":4,"results":[{"path":"memory/2026-05-17.md","startLine":1,"endLine":30,"score":1},{"path":"memory/2026-05-15.md","startLine":1,"endLine":38,"score":1},{"path":"memory/2026-05-13.md","startLine":1,"endLine":32,"score":1},{"path":"memory/2026-05-18.md","startLine":1,"endLine":39,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T09:05:35.434Z","query":"U23-L2 知识点 fruit kiwi pineapple toilet went to the toilet What fruit do you like 起 承 转 累计","resultCount":3,"results":[{"path":"memory/2026-05-25.md","startLine":1,"endLine":28,"score":1},{"path":"memory/2026-05-25.md","startLine":23,"endLine":43,"score":1},{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T10:07:19.039Z","query":"U24 L2 林逸瀚 剧本","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-22.md","startLine":1,"endLine":31,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T10:12:26.746Z","query":"U23-L2 剧本 SheetID spreadsheet token 格式","resultCount":3,"results":[{"path":"memory/2026-05-25.md","startLine":1,"endLine":23,"score":1},{"path":"memory/2026-05-25.md","startLine":21,"endLine":37,"score":1},{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T10:28:35.021Z","query":"U24 L2 剧本 进度 林逸瀚","resultCount":4,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-25.md","startLine":33,"endLine":54,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1},{"path":"memory/2026-05-25.md","startLine":49,"endLine":62,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T21:52:51.694Z","query":"核心互动 口语 听力选择 听力拖拽 胡笳","resultCount":2,"results":[{"path":"memory/2026-05-25.md","startLine":49,"endLine":65,"score":1},{"path":"memory/2026-05-25.md","startLine":21,"endLine":37,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T22:29:11.785Z","query":"U24-L1 危机降临 段④ 大纲 留下","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-25.md","startLine":33,"endLine":54,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T22:29:22.720Z","query":"U24-L1 大纲 段④ 留下 Mama Green 广场 feel sad choose hometown","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-25.md","startLine":33,"endLine":54,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1}]}
|
||||||
|
{"type":"memory.recall.recorded","timestamp":"2026-05-25T22:46:48.206Z","query":"U24-L1 句型 sentence pattern feel sad choose hometown 知识点","resultCount":3,"results":[{"path":"memory/2026-05-21.md","startLine":24,"endLine":36,"score":1},{"path":"memory/2026-05-25.md","startLine":33,"endLine":54,"score":1},{"path":"memory/2026-05-21.md","startLine":1,"endLine":27,"score":1}]}
|
||||||
|
|||||||
@ -1,6 +1,6 @@
|
|||||||
{
|
{
|
||||||
"version": 1,
|
"version": 1,
|
||||||
"updatedAt": "2026-05-22T07:35:32.892Z",
|
"updatedAt": "2026-05-25T22:46:48.206Z",
|
||||||
"entries": {
|
"entries": {
|
||||||
"memory:memory/2026-04-29.md:43:55": {
|
"memory:memory/2026-04-29.md:43:55": {
|
||||||
"key": "memory:memory/2026-04-29.md:43:55",
|
"key": "memory:memory/2026-04-29.md:43:55",
|
||||||
@ -166,13 +166,13 @@
|
|||||||
"endLine": 45,
|
"endLine": 45,
|
||||||
"source": "memory",
|
"source": "memory",
|
||||||
"snippet": "6. **Justin是Vicky的Grandpa**,不是11号的 ### 本课出场人物 User、Ben、Leo、Vicky、Sue、Sunny、11号机器人、Justin(后面登场) ### 本课知识点 - 单词:Blackboard、Read、Line、Example - 句型:(v.) in line、Take...as an example. ### Unit背景信息 - U18主题:中国文化周 - L1:Vicky的爷爷Justin来Vala School教中国知识,11号是Justin的跟班机器人。L1结尾Justin说可以教大家变身成龙 - L2(当前):学舞龙动作,场景在课室内 - L3(后续):用纸制作龙 - L4(后续):为龙取中文名字 - 结局:全班为大家表演舞龙 - Running gag:Justin非常宠爱孙女Vicky,一直在同学面前夸她,让Vicky很尴尬 --- ## [林逸瀚确认] U18 L2 舞龙初体验 优化后故事大纲 ### 所属主题:中国文化周 ### 知识点记录 - 单词:Blackboard、Read、Line、Example - 句型:(v.) in line、Take...as an example. ### 完整大纲内容 #### 第1幕:奇怪的黑板图谱 【前置衔接】承接L1结尾:Justin爷爷说要教大家变身成龙,今天大家兴高采烈来上中国文化课,却只看到笨笨的机器人11号站在讲台前",
|
"snippet": "6. **Justin是Vicky的Grandpa**,不是11号的 ### 本课出场人物 User、Ben、Leo、Vicky、Sue、Sunny、11号机器人、Justin(后面登场) ### 本课知识点 - 单词:Blackboard、Read、Line、Example - 句型:(v.) in line、Take...as an example. ### Unit背景信息 - U18主题:中国文化周 - L1:Vicky的爷爷Justin来Vala School教中国知识,11号是Justin的跟班机器人。L1结尾Justin说可以教大家变身成龙 - L2(当前):学舞龙动作,场景在课室内 - L3(后续):用纸制作龙 - L4(后续):为龙取中文名字 - 结局:全班为大家表演舞龙 - Running gag:Justin非常宠爱孙女Vicky,一直在同学面前夸她,让Vicky很尴尬 --- ## [林逸瀚确认] U18 L2 舞龙初体验 优化后故事大纲 ### 所属主题:中国文化周 ### 知识点记录 - 单词:Blackboard、Read、Line、Example - 句型:(v.) in line、Take...as an example. ### 完整大纲内容 #### 第1幕:奇怪的黑板图谱 【前置衔接】承接L1结尾:Justin爷爷说要教大家变身成龙,今天大家兴高采烈来上中国文化课,却只看到笨笨的机器人11号站在讲台前",
|
||||||
"recallCount": 9,
|
"recallCount": 10,
|
||||||
"dailyCount": 0,
|
"dailyCount": 0,
|
||||||
"groundedCount": 0,
|
"groundedCount": 0,
|
||||||
"totalScore": 9,
|
"totalScore": 10,
|
||||||
"maxScore": 1,
|
"maxScore": 1,
|
||||||
"firstRecalledAt": "2026-05-05T07:20:30.476Z",
|
"firstRecalledAt": "2026-05-05T07:20:30.476Z",
|
||||||
"lastRecalledAt": "2026-05-21T03:11:41.718Z",
|
"lastRecalledAt": "2026-05-25T03:38:08.662Z",
|
||||||
"queryHashes": [
|
"queryHashes": [
|
||||||
"84835484abbf",
|
"84835484abbf",
|
||||||
"cb12f9d97c78",
|
"cb12f9d97c78",
|
||||||
@ -182,7 +182,8 @@
|
|||||||
"005d674373b0",
|
"005d674373b0",
|
||||||
"1a58efc7512e",
|
"1a58efc7512e",
|
||||||
"6a0bfeac06f4",
|
"6a0bfeac06f4",
|
||||||
"db05ff00b49e"
|
"db05ff00b49e",
|
||||||
|
"9ae85c4b8ff4"
|
||||||
],
|
],
|
||||||
"recallDays": [
|
"recallDays": [
|
||||||
"2026-05-05",
|
"2026-05-05",
|
||||||
@ -190,7 +191,8 @@
|
|||||||
"2026-05-11",
|
"2026-05-11",
|
||||||
"2026-05-12",
|
"2026-05-12",
|
||||||
"2026-05-13",
|
"2026-05-13",
|
||||||
"2026-05-21"
|
"2026-05-21",
|
||||||
|
"2026-05-25"
|
||||||
],
|
],
|
||||||
"conceptTags": [
|
"conceptTags": [
|
||||||
"justin",
|
"justin",
|
||||||
@ -782,23 +784,26 @@
|
|||||||
"endLine": 32,
|
"endLine": 32,
|
||||||
"source": "memory",
|
"source": "memory",
|
||||||
"snippet": "# 2026-05-13 工作日志 ## U22 L1 剧本工程 — 全课完成 ✅ **协作人:** 林逸瀚(编剧) ### 今日完成 - 「承」节奏表V1→V2.2(四轮迭代)+ 剧本V1→V2 + Excel发送 + 林逸瀚定稿 - 「转」节奏表V1→V1.1 + 剧本V1→V2 + Excel发送 + 林逸瀚定稿 - 「合」节奏表V1 + 剧本V1→V2 + Excel发送 + 林逸瀚定稿 - 阅读完整剧本(起+承+转+合),确认全课完成 ### 关键决策记录 1. **「承」** DiDi不在大人面前亮相 → Sam离场后DiDi才出来,设定更自然 2. **「承」** Sam回忆从单纯辨认改为情感触发(每说一个水果名唤醒一段爷爷记忆) 3. **「承」** #3和#4调换顺序 → 先揭秘DiDi再帮Sam回忆,逻辑更通顺 4. **「转」** 桌上一开始没有green grape,需要从树上摘 → User自然说出Enjoy yourself 5. **「转」** 结尾加入蛋糕+果汁闭环(Good food needs good drink) 6. **「合」** 原「转」#5归入「合」#1,简洁收束 ### 中互动统计 - 起:4个 | 承:5个 | 转:7个 | 合:1个 | **总计:17个** ### 知识点覆盖 - Lime / Pear / Grape / Drink:全部多次输入+输出 ✅ - Good food needs good drink:多次输入+输出 ✅ - Enjoy yourself:多次输入+输出 ✅ ### 定稿文档 -",
|
"snippet": "# 2026-05-13 工作日志 ## U22 L1 剧本工程 — 全课完成 ✅ **协作人:** 林逸瀚(编剧) ### 今日完成 - 「承」节奏表V1→V2.2(四轮迭代)+ 剧本V1→V2 + Excel发送 + 林逸瀚定稿 - 「转」节奏表V1→V1.1 + 剧本V1→V2 + Excel发送 + 林逸瀚定稿 - 「合」节奏表V1 + 剧本V1→V2 + Excel发送 + 林逸瀚定稿 - 阅读完整剧本(起+承+转+合),确认全课完成 ### 关键决策记录 1. **「承」** DiDi不在大人面前亮相 → Sam离场后DiDi才出来,设定更自然 2. **「承」** Sam回忆从单纯辨认改为情感触发(每说一个水果名唤醒一段爷爷记忆) 3. **「承」** #3和#4调换顺序 → 先揭秘DiDi再帮Sam回忆,逻辑更通顺 4. **「转」** 桌上一开始没有green grape,需要从树上摘 → User自然说出Enjoy yourself 5. **「转」** 结尾加入蛋糕+果汁闭环(Good food needs good drink) 6. **「合」** 原「转」#5归入「合」#1,简洁收束 ### 中互动统计 - 起:4个 | 承:5个 | 转:7个 | 合:1个 | **总计:17个** ### 知识点覆盖 - Lime / Pear / Grape / Drink:全部多次输入+输出 ✅ - Good food needs good drink:多次输入+输出 ✅ - Enjoy yourself:多次输入+输出 ✅ ### 定稿文档 -",
|
||||||
"recallCount": 5,
|
"recallCount": 7,
|
||||||
"dailyCount": 0,
|
"dailyCount": 0,
|
||||||
"groundedCount": 0,
|
"groundedCount": 0,
|
||||||
"totalScore": 5,
|
"totalScore": 7,
|
||||||
"maxScore": 1,
|
"maxScore": 1,
|
||||||
"firstRecalledAt": "2026-05-14T04:10:21.382Z",
|
"firstRecalledAt": "2026-05-14T04:10:21.382Z",
|
||||||
"lastRecalledAt": "2026-05-15T10:23:08.933Z",
|
"lastRecalledAt": "2026-05-25T04:25:34.853Z",
|
||||||
"queryHashes": [
|
"queryHashes": [
|
||||||
"f11b100ab95b",
|
"f11b100ab95b",
|
||||||
"bb06a5dadef7",
|
"bb06a5dadef7",
|
||||||
"d859deffdfbc",
|
"d859deffdfbc",
|
||||||
"c59e1be1499c",
|
"c59e1be1499c",
|
||||||
"cd20e96c030a"
|
"cd20e96c030a",
|
||||||
|
"4d6e502cef1a",
|
||||||
|
"be8988b9e083"
|
||||||
],
|
],
|
||||||
"recallDays": [
|
"recallDays": [
|
||||||
"2026-05-14",
|
"2026-05-14",
|
||||||
"2026-05-15"
|
"2026-05-15",
|
||||||
|
"2026-05-25"
|
||||||
],
|
],
|
||||||
"conceptTags": [
|
"conceptTags": [
|
||||||
"v2.2",
|
"v2.2",
|
||||||
@ -1125,22 +1130,25 @@
|
|||||||
"endLine": 39,
|
"endLine": 39,
|
||||||
"source": "memory",
|
"source": "memory",
|
||||||
"snippet": "# 2026-05-18 工作日志 ## U22 L5「承」剧本 — 完成 ✅ **协作人:** 林逸瀚(编剧) ### 今日完成 - 进度对齐:U22 L1-L4 全课定稿,L5 仅「起」定稿 - 读取L5飞书文档,确认「起」结尾内容(Logi + User进Jay帐篷) - 对齐「承」衔接点:知识点重点(ask/tell/because/want提前铺)、中互动5个 - 节奏表V1 → 林逸瀚直接确认通过(未迭代) - 剧本V1 → 林逸瀚确认定稿 - 确认新流程:以后剧本用飞书电子表格直接审阅,不走文本消息排版 ### 「承」关键设计 - 场景:Jay帐篷 ↔ 庆典广场交替(User来回跑腿传话) - 酷刑1:数学题(Jay死忍)→ \"I don't want this...\" - 酷刑2:儿歌loop(重金属→儿歌→Jay崩溃招供) - Jay供词:\"Because Lin asked me to help!\" - 结尾:Tom说\"Ask Lin next\",User应答\"Got it!\" → 衔接到「转」 ### 中互动统计 - 5个中互动:①向Jay传话(ask...to...) ②向Logi传话(Tell sb sth) ③向Tom汇报(want+tell) ④向Logi传话(ask...to...) ⑤向Tom汇报(tell+because+ask) ### 知识点覆盖(承) | 知识点 | 输入 | 输出 | |--------|------|------| | ask | 4 | 3 | | tell | 5 | 2 | | because |",
|
"snippet": "# 2026-05-18 工作日志 ## U22 L5「承」剧本 — 完成 ✅ **协作人:** 林逸瀚(编剧) ### 今日完成 - 进度对齐:U22 L1-L4 全课定稿,L5 仅「起」定稿 - 读取L5飞书文档,确认「起」结尾内容(Logi + User进Jay帐篷) - 对齐「承」衔接点:知识点重点(ask/tell/because/want提前铺)、中互动5个 - 节奏表V1 → 林逸瀚直接确认通过(未迭代) - 剧本V1 → 林逸瀚确认定稿 - 确认新流程:以后剧本用飞书电子表格直接审阅,不走文本消息排版 ### 「承」关键设计 - 场景:Jay帐篷 ↔ 庆典广场交替(User来回跑腿传话) - 酷刑1:数学题(Jay死忍)→ \"I don't want this...\" - 酷刑2:儿歌loop(重金属→儿歌→Jay崩溃招供) - Jay供词:\"Because Lin asked me to help!\" - 结尾:Tom说\"Ask Lin next\",User应答\"Got it!\" → 衔接到「转」 ### 中互动统计 - 5个中互动:①向Jay传话(ask...to...) ②向Logi传话(Tell sb sth) ③向Tom汇报(want+tell) ④向Logi传话(ask...to...) ⑤向Tom汇报(tell+because+ask) ### 知识点覆盖(承) | 知识点 | 输入 | 输出 | |--------|------|------| | ask | 4 | 3 | | tell | 5 | 2 | | because |",
|
||||||
"recallCount": 4,
|
"recallCount": 6,
|
||||||
"dailyCount": 0,
|
"dailyCount": 0,
|
||||||
"groundedCount": 0,
|
"groundedCount": 0,
|
||||||
"totalScore": 4,
|
"totalScore": 6,
|
||||||
"maxScore": 1,
|
"maxScore": 1,
|
||||||
"firstRecalledAt": "2026-05-21T03:11:45.840Z",
|
"firstRecalledAt": "2026-05-21T03:11:45.840Z",
|
||||||
"lastRecalledAt": "2026-05-22T07:35:32.892Z",
|
"lastRecalledAt": "2026-05-25T04:25:34.853Z",
|
||||||
"queryHashes": [
|
"queryHashes": [
|
||||||
"daa1ab20dbeb",
|
"daa1ab20dbeb",
|
||||||
"6b7cfc099a33",
|
"6b7cfc099a33",
|
||||||
"c7001076bb77",
|
"c7001076bb77",
|
||||||
"9d91a88f6741"
|
"9d91a88f6741",
|
||||||
|
"4d6e502cef1a",
|
||||||
|
"be8988b9e083"
|
||||||
],
|
],
|
||||||
"recallDays": [
|
"recallDays": [
|
||||||
"2026-05-21",
|
"2026-05-21",
|
||||||
"2026-05-22"
|
"2026-05-22",
|
||||||
|
"2026-05-25"
|
||||||
],
|
],
|
||||||
"conceptTags": [
|
"conceptTags": [
|
||||||
"l1-l4",
|
"l1-l4",
|
||||||
@ -1160,19 +1168,21 @@
|
|||||||
"endLine": 61,
|
"endLine": 61,
|
||||||
"source": "memory",
|
"source": "memory",
|
||||||
"snippet": "- Jay酷刑:数学题(死忍)→儿歌loop(崩溃) - Lin酷刑:无聊卡通(硬撑)→Ben讲鬼故事(崩溃) - Jay和Lin都不认识老板长相,留悬念 - 「起」节奏表V2→3个中互动,剧本林逸瀚定稿 - 明天继续出「承」(审问Jay) ### 生图能力 - 首次使用FunCloud GPT Image 2生图(水流分拣示意图),约限1分钟,下次应先告知“在生成中” ### 下一步 - 明天继续U22 L5「承」节奏表+剧本",
|
"snippet": "- Jay酷刑:数学题(死忍)→儿歌loop(崩溃) - Lin酷刑:无聊卡通(硬撑)→Ben讲鬼故事(崩溃) - Jay和Lin都不认识老板长相,留悬念 - 「起」节奏表V2→3个中互动,剧本林逸瀚定稿 - 明天继续出「承」(审问Jay) ### 生图能力 - 首次使用FunCloud GPT Image 2生图(水流分拣示意图),约限1分钟,下次应先告知“在生成中” ### 下一步 - 明天继续U22 L5「承」节奏表+剧本",
|
||||||
"recallCount": 2,
|
"recallCount": 3,
|
||||||
"dailyCount": 0,
|
"dailyCount": 0,
|
||||||
"groundedCount": 0,
|
"groundedCount": 0,
|
||||||
"totalScore": 2,
|
"totalScore": 3,
|
||||||
"maxScore": 1,
|
"maxScore": 1,
|
||||||
"firstRecalledAt": "2026-05-21T03:19:32.078Z",
|
"firstRecalledAt": "2026-05-21T03:19:32.078Z",
|
||||||
"lastRecalledAt": "2026-05-21T07:27:42.661Z",
|
"lastRecalledAt": "2026-05-25T03:11:36.690Z",
|
||||||
"queryHashes": [
|
"queryHashes": [
|
||||||
"6b7cfc099a33",
|
"6b7cfc099a33",
|
||||||
"486481005f95"
|
"486481005f95",
|
||||||
|
"0f1442e2084c"
|
||||||
],
|
],
|
||||||
"recallDays": [
|
"recallDays": [
|
||||||
"2026-05-21"
|
"2026-05-21",
|
||||||
|
"2026-05-25"
|
||||||
],
|
],
|
||||||
"conceptTags": [
|
"conceptTags": [
|
||||||
"gpt",
|
"gpt",
|
||||||
@ -1223,19 +1233,22 @@
|
|||||||
"endLine": 30,
|
"endLine": 30,
|
||||||
"source": "memory",
|
"source": "memory",
|
||||||
"snippet": "# 2026-05-17 工作日志 ## U21 L5 剧本工程 — 全课完成 ✅ **协作人:** 王珞(编剧) ### 今日完成 - 大纲讨论:角色设定确认(Logi城府深、Rebecca果敢热血、Robot 11是速猴帮手、鳄鱼只有一只) - 剧情逻辑确认:Beat 1-7拆分、User叙事视角、速猴鬼鬼祟祟翻资料、速猴想跳窗→质问→狡辩泡澡、BoBo以小球形式藏在User身上 - 知识点确认:go to sleep, crocodile, bathroom, bath(名词/浴缸)+ It's time to..., right now - 节奏表V1 → 王珞确认 - 起承转合四段剧本V1分段确认 → 全部Pass - 飞书电子表格写入(同一个Sheet) ### 关键决策 1. 核心互动选「听力选择」— User与Sally电话,3道图片单选题 2. Beat 1简短带过留宿争议,不展开争论 3. BoBo全程以小球形态藏在User身上,合段才飞出来(不做铺垫) 4. 速猴推开User想从窗户逃 → BoBo护主飞出(儿童剧本安全尺度) 5. 鳄鱼只有一只,非多只 6. 互动数12中互动+1核心互动=13个(略低于14下限,王珞确认不需要补) ### 中互动统计 - 起:2 | 承:3 | 转:4 | 合:3(+1核心互动)| **总计:13个** ### 知识点覆盖(全课) | 知识点 | 输入 | 输出 | 状态 | |--------|------|------|------| | go to sleep | 3 | 3 | ✅ | | It's",
|
"snippet": "# 2026-05-17 工作日志 ## U21 L5 剧本工程 — 全课完成 ✅ **协作人:** 王珞(编剧) ### 今日完成 - 大纲讨论:角色设定确认(Logi城府深、Rebecca果敢热血、Robot 11是速猴帮手、鳄鱼只有一只) - 剧情逻辑确认:Beat 1-7拆分、User叙事视角、速猴鬼鬼祟祟翻资料、速猴想跳窗→质问→狡辩泡澡、BoBo以小球形式藏在User身上 - 知识点确认:go to sleep, crocodile, bathroom, bath(名词/浴缸)+ It's time to..., right now - 节奏表V1 → 王珞确认 - 起承转合四段剧本V1分段确认 → 全部Pass - 飞书电子表格写入(同一个Sheet) ### 关键决策 1. 核心互动选「听力选择」— User与Sally电话,3道图片单选题 2. Beat 1简短带过留宿争议,不展开争论 3. BoBo全程以小球形态藏在User身上,合段才飞出来(不做铺垫) 4. 速猴推开User想从窗户逃 → BoBo护主飞出(儿童剧本安全尺度) 5. 鳄鱼只有一只,非多只 6. 互动数12中互动+1核心互动=13个(略低于14下限,王珞确认不需要补) ### 中互动统计 - 起:2 | 承:3 | 转:4 | 合:3(+1核心互动)| **总计:13个** ### 知识点覆盖(全课) | 知识点 | 输入 | 输出 | 状态 | |--------|------|------|------| | go to sleep | 3 | 3 | ✅ | | It's",
|
||||||
"recallCount": 2,
|
"recallCount": 4,
|
||||||
"dailyCount": 0,
|
"dailyCount": 0,
|
||||||
"groundedCount": 0,
|
"groundedCount": 0,
|
||||||
"totalScore": 2,
|
"totalScore": 4,
|
||||||
"maxScore": 1,
|
"maxScore": 1,
|
||||||
"firstRecalledAt": "2026-05-22T04:12:37.577Z",
|
"firstRecalledAt": "2026-05-22T04:12:37.577Z",
|
||||||
"lastRecalledAt": "2026-05-22T04:19:43.196Z",
|
"lastRecalledAt": "2026-05-25T04:25:34.853Z",
|
||||||
"queryHashes": [
|
"queryHashes": [
|
||||||
"309af4daf776",
|
"309af4daf776",
|
||||||
"f68ccb9fe689"
|
"f68ccb9fe689",
|
||||||
|
"4d6e502cef1a",
|
||||||
|
"be8988b9e083"
|
||||||
],
|
],
|
||||||
"recallDays": [
|
"recallDays": [
|
||||||
"2026-05-22"
|
"2026-05-22",
|
||||||
|
"2026-05-25"
|
||||||
],
|
],
|
||||||
"conceptTags": [
|
"conceptTags": [
|
||||||
"1-7拆分",
|
"1-7拆分",
|
||||||
@ -1255,18 +1268,21 @@
|
|||||||
"endLine": 38,
|
"endLine": 38,
|
||||||
"source": "memory",
|
"source": "memory",
|
||||||
"snippet": "# 2026-05-15 工作日志 ## U21 L4 剧本工程 — V1初稿完成 **协作人:** 王珞(编剧) ### 今日完成 - 大纲讨论:角色确认(Rebecca新建)、起承转合拆分(转中被困+结盟上移到承) - 节奏表V1产出 → 王珞确认 - 起/承/转/合 剧本V1全部完成 - Excel写入飞书表格(4 Sheet) ### 角色库更新 - 新建 `Rebecca.md`:新晋Pioneer,勤奋努力 - Clare Hope 补充:Pioneer身份、U19离开地球执行任务、Vala学院入学日期9月1日 - Fred Hope 补充:同上 ### 关键决策 - Sally只在门厅,速猴出场后不在场(所以被困无人直接解围) - 速猴平面图是偷来的 - User看到速猴背影 → 追上去(非Jay主动打招呼) - 结盟过程:User怪速猴 → 互怼 → Jay打破僵局 → Lin撞墙 → 不得不合作 → 碰拳(非握手) - Jay说傻话、Lin更慎重总让Jay闭嘴 - User已知父母是开拓者(U19),情感是既骄傲又思念 - 密码日期:9月1日 - 转中Lin焦急催促 vs User沉浸研究的反差 - BoBo不出场 ### 中互动统计 - 起:4 | 承:5 | 转:7 | 合:1(核心互动-写作)| **总计:17个** ### 知识点输出 - flat:3 / hall:3 / room:3 / know:4 / Do you know...:2 / What is in...:4 ✅ ### 定稿文档 - https://makee-in",
|
"snippet": "# 2026-05-15 工作日志 ## U21 L4 剧本工程 — V1初稿完成 **协作人:** 王珞(编剧) ### 今日完成 - 大纲讨论:角色确认(Rebecca新建)、起承转合拆分(转中被困+结盟上移到承) - 节奏表V1产出 → 王珞确认 - 起/承/转/合 剧本V1全部完成 - Excel写入飞书表格(4 Sheet) ### 角色库更新 - 新建 `Rebecca.md`:新晋Pioneer,勤奋努力 - Clare Hope 补充:Pioneer身份、U19离开地球执行任务、Vala学院入学日期9月1日 - Fred Hope 补充:同上 ### 关键决策 - Sally只在门厅,速猴出场后不在场(所以被困无人直接解围) - 速猴平面图是偷来的 - User看到速猴背影 → 追上去(非Jay主动打招呼) - 结盟过程:User怪速猴 → 互怼 → Jay打破僵局 → Lin撞墙 → 不得不合作 → 碰拳(非握手) - Jay说傻话、Lin更慎重总让Jay闭嘴 - User已知父母是开拓者(U19),情感是既骄傲又思念 - 密码日期:9月1日 - 转中Lin焦急催促 vs User沉浸研究的反差 - BoBo不出场 ### 中互动统计 - 起:4 | 承:5 | 转:7 | 合:1(核心互动-写作)| **总计:17个** ### 知识点输出 - flat:3 / hall:3 / room:3 / know:4 / Do you know...:2 / What is in...:4 ✅ ### 定稿文档 - https://makee-in",
|
||||||
"recallCount": 1,
|
"recallCount": 3,
|
||||||
"dailyCount": 0,
|
"dailyCount": 0,
|
||||||
"groundedCount": 0,
|
"groundedCount": 0,
|
||||||
"totalScore": 1,
|
"totalScore": 3,
|
||||||
"maxScore": 1,
|
"maxScore": 1,
|
||||||
"firstRecalledAt": "2026-05-22T04:19:43.196Z",
|
"firstRecalledAt": "2026-05-22T04:19:43.196Z",
|
||||||
"lastRecalledAt": "2026-05-22T04:19:43.196Z",
|
"lastRecalledAt": "2026-05-25T04:25:34.853Z",
|
||||||
"queryHashes": [
|
"queryHashes": [
|
||||||
"f68ccb9fe689"
|
"f68ccb9fe689",
|
||||||
|
"4d6e502cef1a",
|
||||||
|
"be8988b9e083"
|
||||||
],
|
],
|
||||||
"recallDays": [
|
"recallDays": [
|
||||||
"2026-05-22"
|
"2026-05-22",
|
||||||
|
"2026-05-25"
|
||||||
],
|
],
|
||||||
"conceptTags": [
|
"conceptTags": [
|
||||||
"起/承/转/合",
|
"起/承/转/合",
|
||||||
@ -1278,6 +1294,419 @@
|
|||||||
"u21",
|
"u21",
|
||||||
"剧本"
|
"剧本"
|
||||||
]
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-21.md:24:36": {
|
||||||
|
"key": "memory:memory/2026-05-21.md:24:36",
|
||||||
|
"path": "memory/2026-05-21.md",
|
||||||
|
"startLine": 24,
|
||||||
|
"endLine": 36,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "### 世界观对照要点(重要,剧本阶段需遵守) - 防护罩 = 小圆球为节点 + 硅化沙尘为罩体 - 小圆球最终\"消失\"成防护罩一部分 → 通道敞开 → 0号生态球入口暴露 - 盖娅完成守护后主动休眠(《机械沉默法案》,不向儿童解释) - 盖娅是主动召唤者,声音像鲸鱼低鸣 - 灯塔 = 通往0号生态球的入口 + 光导管,内部是竖井 ### S2结尾3课大纲任务总进度 - U23-L3 知识问答 ✅(5/19定稿) - U24-L1 危机降临 ✅(5/20定稿) - U24-L5 终极守护 ✅(5/21定稿) - 3/3 全部完成,待胡笳确认是否进入下一阶段",
|
||||||
|
"recallCount": 17,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 17,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T02:37:39.165Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T22:46:48.206Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"003f212be2b7",
|
||||||
|
"0f1442e2084c",
|
||||||
|
"4d6e502cef1a",
|
||||||
|
"f6d0d0b7ad74",
|
||||||
|
"c6cbfa358ac4",
|
||||||
|
"9ae85c4b8ff4",
|
||||||
|
"6640bce08bea",
|
||||||
|
"10e014c59e11",
|
||||||
|
"4887c9eea8b3",
|
||||||
|
"ebd966e44367",
|
||||||
|
"e3da36f481f1",
|
||||||
|
"395a52267cd4",
|
||||||
|
"619ca46b0aaa",
|
||||||
|
"fd3058a5c589",
|
||||||
|
"708e6823d131",
|
||||||
|
"ed43958211b0",
|
||||||
|
"2c53af70c1ae"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25",
|
||||||
|
"2026-05-26"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u23-l3",
|
||||||
|
"5/19定稿",
|
||||||
|
"u24-l1",
|
||||||
|
"5/20定稿",
|
||||||
|
"u24-l5",
|
||||||
|
"5/21定稿",
|
||||||
|
"3/3",
|
||||||
|
"世界"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-22.md:1:31": {
|
||||||
|
"key": "memory:memory/2026-05-22.md:1:31",
|
||||||
|
"path": "memory/2026-05-22.md",
|
||||||
|
"startLine": 1,
|
||||||
|
"endLine": 31,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "# 2026-05-22 工作日志 ## S2结尾3课大纲撰写 — 完成归档 ✅ **协作人:** 胡笳(业务负责人) ### 今日完成 - U24-L5 终极守护 大纲:胡笳发来修订版,对照旧版完成差异分析 - 修订版核心变化:5段→4段(合并回家+一个都不能少)、Gaia提前登场、新增巨型球体矩阵视觉元素、BoBo/DiDi告别差异化 - 胡笳确认三课大纲全部定稿 - 任务完成沉淀:9条反馈写入 `script_feedback_log.md`(#12-20) - 任务归档:从 `active_tasks.md` 移除 ### 沉淀到知识库的9条 写入 `business_knowledge/script_feedback_log.md`: - #12 知识点词性大纲层级强制统一 - #13 知识点必须是故事驱动力 - #14 知识点均匀分布+每段独立情感主题 - #15 一个Lesson需2-4个独立事件序列 - #16 喜剧节奏是儿童动画呼吸口 - #17-20 设计哲学4条(极简目标/情感>科技/知识点驱动情感/多方信息来源) ## S2结尾3课剧本撰写 — 新任务启动 **创建者:** 胡笳 **来源任务:** S2结尾3课大纲撰写(已完成) - 新日志:`business_knowledge/task_log/S2结尾3课剧本撰写.md` - 从旧日志转移了:故事背景、三课定稿大纲、场景需求、13条剧本撰写注意 - 已注册到 `active_tasks.md`",
|
||||||
|
"recallCount": 7,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 7,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T02:52:07.005Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T10:07:19.039Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"76cfecfd3eb4",
|
||||||
|
"f6d0d0b7ad74",
|
||||||
|
"c6cbfa358ac4",
|
||||||
|
"10e014c59e11",
|
||||||
|
"4887c9eea8b3",
|
||||||
|
"ebd966e44367",
|
||||||
|
"395a52267cd4"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u24-l5",
|
||||||
|
"bobo/didi告别差异化",
|
||||||
|
"script-feedback-log.md",
|
||||||
|
"12-20",
|
||||||
|
"active-tasks.md",
|
||||||
|
"一个lesson需2-4个独立事件序列",
|
||||||
|
"17-20",
|
||||||
|
"极简目标/情感"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:1:13": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:1:13",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 1,
|
||||||
|
"endLine": 13,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "# 2026-05-25 ## U23-L1 剧本同步(林逸瀚) - 林逸瀚在飞书文档中完成了「转」段剧本,小编通过 Sheets V2 API 读取完整剧本并同步到任务日志 - **关键变化(与小编 V1 提案对比):** - SaSa 是伪装名,真狗叫 **BlackBean** - 转的结局改为 Tony 被撞破后**被迫放出圆球**,而非扛袋离开(更温和) - 「Don't be silly」在转段热狗摊已由 User 中互动输出,不再需要合段补 - 六个知识点在起+承+转已全部收齐,合段无剩余知识点缺口 - **合段任务:** 纯剧情收束(User 觉察不对劲)+ 悬钩(Tony 对通讯器说\"下次不会失手\") - 任务日志和 active_tasks 已同步更新",
|
||||||
|
"recallCount": 2,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 2,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T03:11:36.690Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T03:38:08.662Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"0f1442e2084c",
|
||||||
|
"9ae85c4b8ff4"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u23-l1",
|
||||||
|
"active-tasks",
|
||||||
|
"u23",
|
||||||
|
"剧本",
|
||||||
|
"同步",
|
||||||
|
"完成",
|
||||||
|
"通过",
|
||||||
|
"sheets"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-21.md:1:27": {
|
||||||
|
"key": "memory:memory/2026-05-21.md:1:27",
|
||||||
|
"path": "memory/2026-05-21.md",
|
||||||
|
"startLine": 1,
|
||||||
|
"endLine": 27,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "# 2026-05-21 工作日志 ## U24-L5 终极守护 大纲撰写(与胡笳协作) ### 完成情况 - ✅ U24-L5 大纲定稿 - ✅ 灯塔外场景绘制需求输出 - ✅ 任务日志更新 ### 迭代路径 1. 提出3个备选方案(垂直攀塔/沙下遗迹/双线作战)→ 胡笳选定方案B 2. V1科技感太重、逻辑复杂 → 胡笳反馈:一句话逻辑 + 情感>科技 3. V2简化为\"送小圆球回妈妈身边\" → 胡笳反馈:围绕end/smile写故事 4. V3围绕end/smile重构 → 胡笳提供完整故事流程框架 5. V4融入知识点 + 世界观冲突修复 → 定稿 6. 输出场景绘制需求 ### 关键决策 - 一句话目标:\"把所有小圆球送回妈妈身边\" - 地图设计:沙层塌陷露出018号穹顶都市遗迹,保留部分原地表元素 - 情感核心:告别——User送Roballs回家,BoBo/DiDi各自不同的告别方式 - 知识点嵌入:fantastic+day(回家的美好)→ end(终极守护指令)→ smile(Roballs拼笑脸告别) ### 世界观对照要点(重要,剧本阶段需遵守) - 防护罩 = 小圆球为节点 + 硅化沙尘为罩体 - 小圆球最终\"消失\"成防护罩一部分 → 通道敞开 → 0号生态球入口暴露 - 盖娅完成守护后主动休眠(《机械沉默法案》,不向儿童解释)",
|
||||||
|
"recallCount": 11,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 11,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T03:11:36.690Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T22:46:48.206Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"0f1442e2084c",
|
||||||
|
"f6d0d0b7ad74",
|
||||||
|
"c6cbfa358ac4",
|
||||||
|
"10e014c59e11",
|
||||||
|
"4887c9eea8b3",
|
||||||
|
"ebd966e44367",
|
||||||
|
"395a52267cd4",
|
||||||
|
"fd3058a5c589",
|
||||||
|
"708e6823d131",
|
||||||
|
"ed43958211b0",
|
||||||
|
"2c53af70c1ae"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25",
|
||||||
|
"2026-05-26"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u24-l5",
|
||||||
|
"垂直攀塔/沙下遗迹/双线作战",
|
||||||
|
"围绕end/smile写故事",
|
||||||
|
"v3围绕end/smile重构",
|
||||||
|
"bobo/didi各自不同的告别方式",
|
||||||
|
"工作",
|
||||||
|
"日志",
|
||||||
|
"u24"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:1:24": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:1:24",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 1,
|
||||||
|
"endLine": 24,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "# 2026-05-25 ## U23-L1 剧本同步(林逸瀚) - 林逸瀚在飞书文档中完成了「转」段剧本,小编通过 Sheets V2 API 读取完整剧本并同步到任务日志 - **关键变化(与小编 V1 提案对比):** - SaSa 是伪装名,真狗叫 **BlackBean** - 转的结局改为 Tony 被撞破后**被迫放出圆球**,而非扛袋离开(更温和) - 「Don't be silly」在转段热狗摊已由 User 中互动输出,不再需要合段补 - 六个知识点在起+承+转已全部收齐,合段无剩余知识点缺口 - **合段任务:** 纯剧情收束(User 觉察不对劲)+ 悬钩(Tony 对通讯器说\"下次不会失手\") - 任务日志和 active_tasks 已同步更新 ## U23 L1 补充 - 「合」段已于 2026-05-25 完成,全课起承转合全部完成。U23-L1 收工。 ## U23-L2 剧本启动(林逸瀚) - 创建时间:2026-05-25 11:42 - 任务目标:单个 Lesson 完整剧本(起承转合) - U23 主线:Tony 假扮 Tom,L1 孩子们已察觉可疑,L2 加深可疑印象 - 当前:等待林逸瀚提供故事大纲和知识点",
|
||||||
|
"recallCount": 1,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 1,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T04:01:09.236Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T04:01:09.236Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"6640bce08bea"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u23-l1",
|
||||||
|
"active-tasks",
|
||||||
|
"u23-l2",
|
||||||
|
"u23",
|
||||||
|
"剧本",
|
||||||
|
"同步",
|
||||||
|
"完成",
|
||||||
|
"通过"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:1:28": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:1:28",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 1,
|
||||||
|
"endLine": 28,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "# 2026-05-25 ## U23-L1 剧本同步(林逸瀚) - 林逸瀚在飞书文档中完成了「转」段剧本,小编通过 Sheets V2 API 读取完整剧本并同步到任务日志 - **关键变化(与小编 V1 提案对比):** - SaSa 是伪装名,真狗叫 **BlackBean** - 转的结局改为 Tony 被撞破后**被迫放出圆球**,而非扛袋离开(更温和) - 「Don't be silly」在转段热狗摊已由 User 中互动输出,不再需要合段补 - 六个知识点在起+承+转已全部收齐,合段无剩余知识点缺口 - **合段任务:** 纯剧情收束(User 觉察不对劲)+ 悬钩(Tony 对通讯器说\"下次不会失手\") - 任务日志和 active_tasks 已同步更新 ## U23 L1 补充 - 「合」段已于 2026-05-25 完成,全课起承转合全部完成。U23-L1 收工。 ## U23-L2 剧本启动(林逸瀚) - 创建时间:2026-05-25 11:42 - 任务目标:单个 Lesson 完整剧本(起承转合) - U23 主线:Tony 假扮 Tom,L1 孩子们已察觉可疑,L2 加深可疑印象 - 当前:等待林逸瀚提供故事大纲和知识点 ## U23-L3 剧本撰写完成(胡笳) - 完成了S2结尾3课剧本撰写任务的第一课:U23-L3 知识问答 - 交付物:飞书表格 https://makee-interactive.feishu.cn/sheets/Ns7jsJTsyhzYe3txVqqcawP0nQe (Sheet1为最终版)",
|
||||||
|
"recallCount": 1,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 1,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T09:05:35.434Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T09:05:35.434Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"e3da36f481f1"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u23-l1",
|
||||||
|
"active-tasks",
|
||||||
|
"u23-l2",
|
||||||
|
"u23-l3",
|
||||||
|
"u23",
|
||||||
|
"剧本",
|
||||||
|
"同步",
|
||||||
|
"完成"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:23:43": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:23:43",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 23,
|
||||||
|
"endLine": 43,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "- 当前:等待林逸瀚提供故事大纲和知识点 ## U23-L3 剧本撰写完成(胡笳) - 完成了S2结尾3课剧本撰写任务的第一课:U23-L3 知识问答 - 交付物:飞书表格 https://makee-interactive.feishu.cn/sheets/Ns7jsJTsyhzYe3txVqqcawP0nQe (Sheet1为最终版) - 核心互动类型:口语(Sally通过对讲机远程提问,3轮) - 知识点统计:基本达标,句型1输入偏少(1次)、句型2输出偏少(1次),胡笳知悉待微调 - 互动统计:中互动15 + 核心互动1 = 16个 ✅ - 沉淀了6条新规则到 script_feedback_log.md (#21-26) - 更新了SKILL.md中emoji使用规范 - 同步了世界观文档到本地 business_knowledge/resources/worldview_L1_desert_town.md - 更新了MEMORY.md中创作必备参考文档集(本地路径优先) - ⚠️ 教训:不要在Wiki知识库中新建文档,Bot只做读取 ### 关键协作经验 - 胡笳倾向于先给草稿框架,让小编润色+续写,再由她在表格上直接修订 - 她对台词的语言地道程度要求很高,会逐句审视 - 她会直接在飞书表格上修改,然后让小编读取修改后版本继续 - 知识点的有效输出(User在中互动中说)vs无效输出(NPC在中互动中说)是容易犯的错",
|
||||||
|
"recallCount": 1,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 1,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T09:05:35.434Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T09:05:35.434Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"e3da36f481f1"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u23-l3",
|
||||||
|
"script-feedback-log.md",
|
||||||
|
"21-26",
|
||||||
|
"更新了skill.md中emoji使用规范",
|
||||||
|
"更新了memory.md中创作必备参考文档集",
|
||||||
|
"当前",
|
||||||
|
"等待",
|
||||||
|
"提供"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:1:23": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:1:23",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 1,
|
||||||
|
"endLine": 23,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "# 2026-05-25 ## U23-L1 剧本同步(林逸瀚) - 林逸瀚在飞书文档中完成了「转」段剧本,小编通过 Sheets V2 API 读取完整剧本并同步到任务日志 - **关键变化(与小编 V1 提案对比):** - SaSa 是伪装名,真狗叫 **BlackBean** - 转的结局改为 Tony 被撞破后**被迫放出圆球**,而非扛袋离开(更温和) - 「Don't be silly」在转段热狗摊已由 User 中互动输出,不再需要合段补 - 六个知识点在起+承+转已全部收齐,合段无剩余知识点缺口 - **合段任务:** 纯剧情收束(User 觉察不对劲)+ 悬钩(Tony 对通讯器说\"下次不会失手\") - 任务日志和 active_tasks 已同步更新 ## U23 L1 补充 - 「合」段已于 2026-05-25 完成,全课起承转合全部完成。U23-L1 收工。 ## U23-L2 剧本(林逸瀚)✅ 已完成 - 创建时间:2026-05-25 11:42 → 完成时间:2026-05-25 17:49 - 定稿位置:Wiki 表格 BVASsXSrph6ZlAtmQmFcC7LYn5b, Sheet wMQVyV - 全课中互动 14 个 + 1 核心互动-听力选择,知识点全部达标 - **林逸瀚协作特点:** 先给我大纲+知识点 → 我出节奏表→审核→出剧本→Excel→他在Wiki表格上直接修改→我读回同步日志",
|
||||||
|
"recallCount": 1,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 1,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T10:12:26.746Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T10:12:26.746Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"619ca46b0aaa"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"u23-l1",
|
||||||
|
"active-tasks",
|
||||||
|
"u23-l2",
|
||||||
|
"核心互动-听力选择",
|
||||||
|
"u23",
|
||||||
|
"剧本",
|
||||||
|
"同步",
|
||||||
|
"完成"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:21:37": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:21:37",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 21,
|
||||||
|
"endLine": 37,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "- 定稿位置:Wiki 表格 BVASsXSrph6ZlAtmQmFcC7LYn5b, Sheet wMQVyV - 全课中互动 14 个 + 1 核心互动-听力选择,知识点全部达标 - **林逸瀚协作特点:** 先给我大纲+知识点 → 我出节奏表→审核→出剧本→Excel→他在Wiki表格上直接修改→我读回同步日志 - **关键桥段:** 水果鉴赏大会 / 驴子偷吃Pineapple / Tom脸红过敏自曝 / 评分制夺冠 / Fiona揭露过敏 / J+L组合决心查真相 - **本次教训:** 中互动必须和知识点绑定——\"I heard that!\" / \"Are you okay?\" 写成中互动为严重错误,被林逸瀚明确指出 ## U23-L3 剧本撰写完成(胡笳) - 完成了S2结尾3课剧本撰写任务的第一课:U23-L3 知识问答 - 交付物:飞书表格 https://makee-interactive.feishu.cn/sheets/Ns7jsJTsyhzYe3txVqqcawP0nQe (Sheet1为最终版) - 核心互动类型:口语(Sally通过对讲机远程提问,3轮) - 知识点统计:基本达标,句型1输入偏少(1次)、句型2输出偏少(1次),胡笳知悉待微调 - 互动统计:中互动15 + 核心互动1 = 16个 ✅ - 沉淀了6条新规则到 script_feedback_log.md (#21-26) - 更新了SKILL.md中emoji使用规范 - 同步了世界观文档到本地 business_knowledge/resources/worldview_L1_",
|
||||||
|
"recallCount": 2,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 2,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T10:12:26.746Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T21:52:51.694Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"619ca46b0aaa",
|
||||||
|
"23c45da4698f"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25",
|
||||||
|
"2026-05-26"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"核心互动-听力选择",
|
||||||
|
"u23-l3",
|
||||||
|
"script-feedback-log.md",
|
||||||
|
"21-26",
|
||||||
|
"更新了skill.md中emoji使用规范",
|
||||||
|
"定稿",
|
||||||
|
"位置",
|
||||||
|
"wiki"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:33:54": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:33:54",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 33,
|
||||||
|
"endLine": 54,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "- 互动统计:中互动15 + 核心互动1 = 16个 ✅ - 沉淀了6条新规则到 script_feedback_log.md (#21-26) - 更新了SKILL.md中emoji使用规范 - 同步了世界观文档到本地 business_knowledge/resources/worldview_L1_desert_town.md - 更新了MEMORY.md中创作必备参考文档集(本地路径优先) - ⚠️ 教训:不要在Wiki知识库中新建文档,Bot只做读取 ### 关键协作经验 - 胡笳倾向于先给草稿框架,让小编润色+续写,再由她在表格上直接修订 - 她对台词的语言地道程度要求很高,会逐句审视 - 她会直接在飞书表格上修改,然后让小编读取修改后版本继续 - 知识点的有效输出(User在中互动中说)vs无效输出(NPC在中互动中说)是容易犯的错 ## U24-L2 剧本启动(林逸瀚) - 创建时间:2026-05-25 18:06 - 日志:`business_knowledge/task_log/U24-L2剧本.md` - 飞书文档:https://makee-interactive.feishu.cn/wiki/AwaNwj9tOiimb6kIYjZczO9MnQf - 表格Token:TcHdsymvJhOeZWtdUdJcEvyFnEg ### 林逸瀚明确协作流程(起承转合分段推进) 每个Lesson按4段独立推进,每段流程:",
|
||||||
|
"recallCount": 4,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 4,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T10:28:35.021Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T22:46:48.206Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"fd3058a5c589",
|
||||||
|
"708e6823d131",
|
||||||
|
"ed43958211b0",
|
||||||
|
"2c53af70c1ae"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25",
|
||||||
|
"2026-05-26"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"script-feedback-log.md",
|
||||||
|
"21-26",
|
||||||
|
"更新了skill.md中emoji使用规范",
|
||||||
|
"更新了memory.md中创作必备参考文档集",
|
||||||
|
"u24-l2",
|
||||||
|
"互动",
|
||||||
|
"统计",
|
||||||
|
"核心"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:49:62": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:49:62",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 49,
|
||||||
|
"endLine": 62,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "- 日志:`business_knowledge/task_log/U24-L2剧本.md` - 飞书文档:https://makee-interactive.feishu.cn/wiki/AwaNwj9tOiimb6kIYjZczO9MnQf - 表格Token:TcHdsymvJhOeZWtdUdJcEvyFnEg ### 林逸瀚明确协作流程(起承转合分段推进) 每个Lesson按4段独立推进,每段流程: 1. 接收大纲+知识点 → 2. 出节奏表(桥段+知识输入+互动)→ 3. 林逸瀚反馈迭代 → 4. 定稿后出剧本 → 5. 语言难度调整 → 6. 转Excel发林逸瀚 → 7. 林逸瀚Wiki最终修改回传 → 8. 进入下一段 节奏表必须呈现三个维度:桥段(情节长度清晰)、知识点输入(首次引入/复现/强化方式)、中互动(位置/类型/数量) ### 本课知识点 - 单词:kitchen / kid(s)(需有单数教学)/ many / really - 句型:It's really... / There are many...",
|
||||||
|
"recallCount": 1,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 1,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T10:28:35.021Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T10:28:35.021Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"fd3058a5c589"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-25"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"首次引入/复现/强化方式",
|
||||||
|
"位置/类型/数量",
|
||||||
|
"日志",
|
||||||
|
"business-knowledge",
|
||||||
|
"task-log",
|
||||||
|
"u24",
|
||||||
|
"剧本",
|
||||||
|
"https"
|
||||||
|
]
|
||||||
|
},
|
||||||
|
"memory:memory/2026-05-25.md:49:65": {
|
||||||
|
"key": "memory:memory/2026-05-25.md:49:65",
|
||||||
|
"path": "memory/2026-05-25.md",
|
||||||
|
"startLine": 49,
|
||||||
|
"endLine": 65,
|
||||||
|
"source": "memory",
|
||||||
|
"snippet": "- 日志:`business_knowledge/task_log/U24-L2剧本.md` - 飞书文档:https://makee-interactive.feishu.cn/wiki/AwaNwj9tOiimb6kIYjZczO9MnQf - 表格Token:TcHdsymvJhOeZWtdUdJcEvyFnEg ### 林逸瀚明确协作流程(起承转合分段推进) 每个Lesson按4段独立推进,每段流程: 1. 接收大纲+知识点 → 2. 出节奏表(桥段+知识输入+互动)→ 3. 林逸瀚反馈迭代 → 4. 定稿后出剧本 → 5. 语言难度调整 → 6. 转Excel发林逸瀚 → 7. 林逸瀚修改回传 → 8. 进入下一段 ### 本课知识点 - 单词:kitchen / kid(s)(需有单数教学)/ many / really - 句型:It's really... / There are many... ### 进度 - 19:52 — 「起」段完成(林逸瀚亲自撰写定稿),Wiki 表格 wMQVyV,53行 - 小编 V1(16行)V2(35行)均被弃用 - **核心教训:** 小编把剧本当教学流程图写——给角色分配台词完成知识点覆盖。林逸瀚写的是动画——人物关系、群像、喜剧节奏、情感层次。知识点嵌在故事里,不是贴在台词上",
|
||||||
|
"recallCount": 1,
|
||||||
|
"dailyCount": 0,
|
||||||
|
"groundedCount": 0,
|
||||||
|
"totalScore": 1,
|
||||||
|
"maxScore": 1,
|
||||||
|
"firstRecalledAt": "2026-05-25T21:52:51.694Z",
|
||||||
|
"lastRecalledAt": "2026-05-25T21:52:51.694Z",
|
||||||
|
"queryHashes": [
|
||||||
|
"23c45da4698f"
|
||||||
|
],
|
||||||
|
"recallDays": [
|
||||||
|
"2026-05-26"
|
||||||
|
],
|
||||||
|
"conceptTags": [
|
||||||
|
"日志",
|
||||||
|
"business-knowledge",
|
||||||
|
"task-log",
|
||||||
|
"u24",
|
||||||
|
"剧本",
|
||||||
|
"https",
|
||||||
|
"makee",
|
||||||
|
"interactive.feishu.cn"
|
||||||
|
]
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|||||||
5
memory/2026-05-25-1024.md
Normal file
@ -0,0 +1,5 @@
|
|||||||
|
# Session: 2026-05-25 10:24:30 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_871f718731cca0c18d3a1b9af45a8f5c
|
||||||
|
- **Session ID**: 6fd210ff-e907-4127-afab-c8f55c720235
|
||||||
|
- **Source**: feishu
|
||||||
5
memory/2026-05-25-1136-2.md
Normal file
@ -0,0 +1,5 @@
|
|||||||
|
# Session: 2026-05-25 11:36:31 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_859415bde7f8e07cf5f24940f299a287
|
||||||
|
- **Session ID**: 0daf5c4e-a887-4163-a021-41f4fa7748fd
|
||||||
|
- **Source**: feishu
|
||||||
124
memory/2026-05-25-1136.md
Normal file
@ -0,0 +1,124 @@
|
|||||||
|
# Session: 2026-05-25 11:36:26 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_859415bde7f8e07cf5f24940f299a287
|
||||||
|
- **Session ID**: 97f04db8-4b2c-403a-9873-0c35197d2d41
|
||||||
|
- **Source**: feishu
|
||||||
|
|
||||||
|
## Conversation Summary
|
||||||
|
|
||||||
|
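assistant: Updating the progress in `active_tasks.md`:
assistant: Recording today's memory:
assistant: ---

## ✅ Sync complete

I have read the full Feishu script and finished the comparison. These are the key changes:

| Item | Old log (my V1) | Actual Feishu script |
|------|-----------------|----------------------|
| Dog's name | SaSa | SaSa is a disguise → the real dog is **BlackBean** |
| Ending of the Turn (转) | Tony walks off carrying the bag (dark suspense) | Tony is caught → forced to let the ball robots out → leaves sulking |
| Don't be silly | Marked "to be added in the Conclusion (合)" | **Already output in the Turn** (User mid interaction at the hot dog stand) |
| Knowledge-point gap | The Conclusion lacks Don't be silly | **All six collected; no gap in the Conclusion** |

**The log and active_tasks have been synced.**

---

The Conclusion (合) now only has pure plot left: User vaguely senses something is off → everyone else thinks Tom was just playing a silly game with the ball robots → Tony, out of sight, says into his communicator "I won't fail next time" → cliffhanger.

Shall I move on to the Conclusion?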
|
||||||
|
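user: The mid interactions probably meet the target, but a core interaction is needed.
For the core interaction, have everyone discuss how strange Tom is.

Drop the plot beat where Tony says "I won't fail next time." The word count is already over the limit. End right after the discussion about Tom.
assistant: Understood. Conclusion (合) = core interaction (speaking): discuss how strange Tom is, then end right there without dragging on.

I'll write it into the Feishu document first and then show you. Let me append it directly to the Feishu Sheet:
assistant: The last row is 251. Writing the Conclusion directly into the Feishu Sheet now.
assistant: Neither Bot (xiaobian, xiaoyan) has write permission for this Sheet. I'll give you the Conclusion script here so you can paste it in: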
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Conclusion (合) script: core interaction (speaking)

> Follows Tony's exit (after row 251)

| Type | Scene description | Name | Line |
|------|-------------------|------|------|
| TL | Everyone watches Tony leave. The ball robots return to their owners. | | |
| | | May | That was... strange. |
| | | Ben | Tom was so funny today! |
| Core interaction (speaking) | [Insert image] An open area in the park, with the children and the ball robots gathered together. The Shop and the area where Freeze was just played are visible in the background. | | |
| | Round 1: image highlights the old TV at the Shop entrance | May | What did Tom do at the shop? |
| | | User | He fixed that silly TV. |
| | Round 2: image highlights the spot in the park where Freeze was played | May | And the Freeze game in the park... |
| | | User | He put the robots in a bag. So silly. |
| | Round 3: image highlights May's puzzled expression | May | I can't find a reason. Tom is so different. |
| | | User | Don't be silly, May. This is not Tom. |
| TL | Ben is stunned, looking at User and then at May. May looks up into the distance, lost in thought. | | |
| | Camera pulls back: a wide shot of the park, with the small figures of the children and the ball robots. | | |
| | **End of the Conclusion (合) / end of the full L1 script** | | |

---

**Knowledge-point coverage:** silly ×3 / shop ×1 / park ×1 / I can't find ×1 / Don't be silly ×1 (5 knowledge points, ≥3 ✅)

**NPC single line:** May's longest is 9 words ✅ | User's longest line is 8 words ✅

**Design rationale:** May leads the discussion and ties the suspicious points together; User delivers the final verdict, "not Tom" → the lesson ends on suspense rather than an answer, leaving it for L2 to develop.
|
||||||
|
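user: I didn't receive the script. I need you to paste it here so I can look at it. The Conclusion (合) should ideally add a "Park" mid interaction; since the word count is too high, I'll trim and adjust the earlier parts of the script myself.
assistant: I already pasted the script table in my previous message; the formatting may not have rendered properly. I'll send it to you directly as a Feishu message and add the Park mid interaction. Syncing the log first.
assistant: OK, here is the Conclusion script, with the Park mid interaction added: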
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Conclusion (合): follows Tony's exit

| Type | Scene description | Name | Line |
|------|-------------------|------|------|
| TL | Everyone watches Tony leave. The ball robots return to their owners. | | |
| | | May | That was... strange. |
| | | Ben | Tom was so funny today! |
| Mid interaction | | User | This park is so strange today. |
| Core interaction (speaking) | [Insert image] An open area in the park, with the children and the ball robots gathered together. The Shop and the area where Freeze was just played are visible in the background. | | |
| | Round 1: image highlights the old TV at the Shop entrance | May | What did Tom do at the shop? |
| | | User | He fixed that silly TV. |
| | Round 2: image highlights the spot in the park where Freeze was played | May | And the Freeze game in the park... |
| | | User | He put the robots in a bag. So silly. |
| | Round 3: image highlights May's puzzled expression | May | I can't find a reason. Tom is so different. |
| | | User | Don't be silly, May. This is not Tom. |
| TL | Ben is stunned, looking at User and then at May. May looks up into the distance, lost in thought. | | |
| | Camera pulls back: a wide shot of the park, with the small figures of the children and the ball robots. | | |
| | **End of the Conclusion (合) / end of the full L1 script** | | |

---

**Knowledge points:** silly ×3 / shop ×1 / park ×1 (mid interaction) / I can't find ×1 / Don't be silly ×1 → 5/6 knowledge points covered, ≥3 ✅

**NPC single line:** May's longest is 9 words / User's longest is 8 words ✅

**Design:** May leads the discussion and ties the suspicious points together → User delivers the final verdict, "not Tom" → ends on suspense without an answer, left for L2. 13 rows, compact. What do you think?
|
||||||
|
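user: This park is so strange today. <<< This line teaches nothing; it's just dialogue. The child needs to learn that Park = 公园 (park).
assistant: You're right. "This park is so strange today" merely uses the word park; it doesn't help the child understand that it means 公园 (park).

Here are a few alternative directions for you to pick from:

**A) Emotional contrast**: picks up Ben's "funny" as a contrast, so the child can intuitively grasp what park means:
> `This park is not fun today.`

**B) Pointing out the location**: after looking around, explicitly names the place they are in:
> `This is the park. But it is not right.`

**C) Direct negation**: concise, contrasting with the everyday experience of a park:
> `I don't like this park.`

I lean toward **A**. It contrasts naturally with Ben's previous line, "Tom was so funny today!", and the child can easily understand that park = a place that should be fun, but something is off today.
user: Never mind. It's better to add Park earlier than at the end. So drop the Park mid interaction and just produce the script; give me the script as an Excel file.
assistant: Removing the Park mid interaction and producing the Excel file directly.
assistant: The Excel file has been sent. Summary:

- 12 rows, compact, no Park mid interaction
- Core interaction (speaking), led by May over 3 rounds
- silly / shop / park / I can't find / Don't be silly all covered
- Ends on `This is not Tom`, with no trailing Tony scene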
|
||||||
323
memory/2026-05-25-1224.md
Normal file
@ -0,0 +1,323 @@
|
|||||||
|
# Session: 2026-05-25 12:24:35 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_24c55250c8dd62a40535b183a9bc83e5
|
||||||
|
- **Session ID**: 206836e5-b379-4b04-a7c7-02001f033075
|
||||||
|
- **Source**: feishu
|
||||||
|
|
||||||
|
## Conversation Summary
|
||||||
|
|
||||||
|
assistant: OK, the Opening (起) is approved. Starting on the Development (承):

---

## U24-L4 Development (承) script V1

| Type | Scene description | Character | Scripted line |
|------|-------------------|-----------|---------------|
| TL | Some time has passed. User walks over and sits down next to Jay and Lin; BoBo rolls gently on User's shoulder | | |
| | | User | Jay, Lin... can I ask you something? |
| | | Jay | Sure, kid. |
| | | User | Why did you help Tony? |
| | | User | He is a bad guy. |
| | Jay and Lin exchange a glance; Lin looks down | | |
| | | Jay | Tony gave us a job. |
| | | Jay | We had nothing before. |
| | | Lin | No money. No food. No home. |
| | | Lin | Tony helped us first. |
| | | Lin | So we helped him back. |
| | [emoji] Jay, exasperated (black lines) | | |
| | | Jay | But... this town is different. |
| | Lin looks at the refugees in the bookstore, who are sharing food and looking after the children | | |
| | | Lin | People here are not **like** Tony. |
| | | Lin | They help each other. |
| | | Jay | Tony only helps himself. |
| Mid interaction | | User | They **are like** a big family. |
| | | Jay | Yeah... a family. |
| TL | Everyone still has nothing to do. Sue carries a pile of **table tennis** balls out of the storeroom, and Vicky grabs cans of paint | | |
| | | Vicky | Let's paint them! |
| | | Ben | Paint what? |
| | | Vicky | Roballs! |
| | Vicky paints a Roball's eyes and markings onto a table tennis ball, and it looks convincing | | |
| | | May | Ha! They look the same! |
| | [emoji] Ben, starry-eyed | | |
| Mid interaction | | User | These **table tennis** balls look **like** Roballs! |
| | | Ben | I can't tell which is real! |
| TL | BoBo rolls over to a pile of painted "fake Roballs" and blends in, keeping perfectly still | | |
| | | Sue | Where is BoBo? |
| | BoBo pops out of the pile of balls | | |
| | | BoBo | Boo! |
| | [emoji] Sue, surprised | | |
| | Everyone bursts out laughing | | |
| Mid interaction | | User | We **are having** fun **now**! |
| | | Vicky | Best waiting ever! |
| TL | Suddenly: thud, thud, thud! Heavy knocking comes from the bookstore's front door | | |
| | [SFX] Thud! Thud! Thud! | | |
| | Everyone falls silent. Jay stands up and squints toward the door | | |
| | | Jay | Wait... |
| | But Ben and Vicky have already recognized the tall silhouette outside the door: the "mayor's" hat and coat | | |
| | | Ben | It's Tom! |
| | | Vicky | Tom is here! |
| | Ben rushes over and opens the door. Wind and sand pour in, and a figure walks in. In the dim light, the face really does look like Tom's | | |
| | The children crowd around. User hands over the fairy-tale book | | |
| | | User | Tom! We found something! |
| | | User | Roballs can protect the town! |
| | The "mayor" takes the book, flips through it, and smiles | | |
| | | Tony (disguised) | Good job, kids. Very good. |
| Mid interaction | | User | We **are saving** the town **now**! |
| | [emoji] Jay, sweating. Jay hurries back from the direction of the storeroom, looking alarmed | | |

---

### Development (承) knowledge-point statistics

| Knowledge point | Input | Output | Notes |
|-----------------|-------|--------|-------|
| now | 0 | 2 (User: are having fun now + are saving the town now) | ✅ |
| table tennis | 1 (scene description / Sue carries them out) | 1 (User: table tennis balls look like) | ✅ |
| like | 1 (Lin: not like Tony) | 2 (User: are like a family + look like Roballs) | ✅ |
| ...be like... | 0 | 1 (User: are like a big family) | ✅ |
| be doing...now | 0 | 2 (User: are having fun now + are saving the town now) | ✅ |
| person | 0 | 0 | Output concentrated in the Turn (转) |

### Development (承) interaction count: 4 mid interactions

### Cumulative interactions: Opening (起) 3 + Development (承) 4 = 7

---

The Development section is done. A few design notes:

- The comedic beat where BoBo hides in the pile of fake Roballs = **double foreshadowing** for the table tennis ball trick in the Conclusion (合): the visual similarity, plus the characters have already shown that it can fool people
- Jay hurrying back from the storeroom "looking alarmed" = the hook that he recognizes Tony but is too late to stop him, which leads into the Turn (转)

Please confirm; once approved, I'll write the Turn (转).
|
||||||
|
user: pass
|
||||||
|
assistant: The Development (承) is approved. Starting on the Turn (转):

---

## U24-L4 Turn (转) script V1

| Type | Scene description | Character | Scripted line |
|------|-------------------|-----------|---------------|
| TL | Jay rushes in front of the children and spreads both arms to block the way | | |
| | | Jay | Stop! That is NOT Tom! |
| | Everyone freezes. The "mayor" slowly smiles and takes off his hat | | |
| | | Tony | Smart monkey. |
| | The light flickers, illuminating Tony's face: almost identical to Tom's, but with a colder smile | | |
| | [emoji] Ben, surprised | | |
| | [emoji] Vicky, surprised | | |
| Mid interaction | | User | This **person** is not Tom! |
| | | May | But... he looks the same! |
| | | Lin | Tony **is like** Tom. |
| | | Lin | Too dark to see. |
| TL | Tony has finished flipping through the book. He leans against the bookshelf, tapping the pages with his fingers | | |
| | | Tony | So... these little balls can stop the storm. |
| | | Tony | Very useful. |
| | | Tony | I want them. |
| | | Ben | No way! |
| | Tony raises one hand, unhurried | | |
| | | Tony | Listen. I am a businessman. |
| | | Tony | I use Roballs to deliver faster. |
| | | Tony | I give you money. |
| | | Tony | Everyone moves to a new town. |
| | | Tony | Easy, right? |
| Mid interaction | | User | No! Roballs **are like** our friends! |
| | | Sue | They are not tools! |
| TL | Tony spreads his hands, putting on a troubled look | | |
| | | Tony | But the storm is big. |
| | | Tony | No **person** knows what will happen. |
| | | Tony | Boom boom... gone. |
| | [emoji] User: ... (ellipsis) | | |
| | The children fall silent. Ben clenches his fists tightly; May bites her lip | | |
| Mid interaction | | User | No **person** wants them to get hurt... |
| | | Vicky | They are **like** family to us... |
| TL | BoBo rolls down from User's shoulder and lands steadily on the floor. DiDi floats out from behind May | | |
| | BoBo lights up and gives a determined hum | | |
| | | BoBo | BoBo fight! |
| | DiDi lights up right after | | |
| | | DiDi | DiDi too! |
| | | May | They want to protect the town. |
| | | Ben | It's their choice. |
| Mid interaction | | User | We **are standing** together **now**! |
| | | Jay | Right! |
| TL | Tony's face darkens. He strides to the doorway, leans his back against the door, and crosses his arms | | |
| | | Tony | Nobody leaves. |
| | | Tony | Not until I get what I want. |
| | [emoji] User, angry | | |
| Mid interaction | | User | You can't stop us **now**! |
| | But Tony does not budge. Outside, the wind and sand howl | | |

---

### Turn (转) knowledge-point statistics

| Knowledge point | Input | Output | Notes |
|-----------------|-------|--------|-------|
| now | 0 | 2 (User: are standing together now + can't stop us now) | ✅ |
| table tennis | 0 | 0 | Concentrated burst in the Conclusion (合) |
| person | 2 (Tony×2: no person knows + no person) | 2 (User: this person is not Tom + no person wants them to get hurt) | ✅ |
| like | 1 (Lin: Tony is like Tom) | 2 (User: are like our friends + Vicky: are like family) | ✅ |
| ...be like... | 1 (Lin: is like Tom) | 2 (User: are like our friends + Roballs are like friends) | ✅ |
| be doing...now | 0 | 1 (User: are standing together now) | ✅ |

### Turn (转) interaction count: 5 mid interactions

### Cumulative interactions: Opening (起) 3 + Development (承) 4 + Turn (转) 5 = 12

---

The Turn section is done. Once approved, I'll write the Conclusion (合), including the core interaction.
|
||||||
|
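user: Where is the Development (承) section?
assistant: Sorry for the confusion! The Development section is the one you already confirmed with "pass" in the previous round: the section covering "chat with the Speed Monkeys (速猴) → painting table tennis balls → knock at the door → fake mayor".

You have just confirmed:

- ✅ Opening (起): the Speed Monkeys arrive → settle in → storeroom → waiting
- ✅ Development (承): talking about the Speed Monkeys' past → painting table tennis balls → the fake mayor comes in
- ✅ Turn (转), just sent: Tony exposed → threat → BoBo/DiDi take a stand → Tony blocks the door

What remains to write is the final section, the Conclusion (合): the Speed Monkeys' reversal + core interaction + AI-animated escape.

Shall I go ahead and write the Conclusion?
user: I don't see the Development section you wrote; you skipped a section.
assistant: You're right. Looking back, it's possible the Development section never reached you when the message was sent. I'll resend the complete Development (承) script now: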
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## U24-L4 Development (承) script V1

| Type | Scene description | Character | Scripted line |
|------|-------------------|-----------|---------------|
| TL | Some time has passed. User walks over and sits down next to Jay and Lin; BoBo rolls gently on User's shoulder | | |
| | | User | Jay, Lin... can I ask you something? |
| | | Jay | Sure, kid. |
| | | User | Why did you help Tony? |
| | | User | He is a bad guy. |
| | Jay and Lin exchange a glance; Lin looks down | | |
| | | Jay | Tony gave us a job. |
| | | Jay | We had nothing before. |
| | | Lin | No money. No food. No home. |
| | | Lin | Tony helped us first. |
| | | Lin | So we helped him back. |
| | [emoji] Jay, exasperated (black lines) | | |
| | | Jay | But... this town is different. |
| | Lin looks at the refugees in the bookstore, who are sharing food and looking after the children | | |
| | | Lin | People here are not **like** Tony. |
| | | Lin | They help each other. |
| | | Jay | Tony only helps himself. |
| Mid interaction | | User | They **are like** a big family. |
| | | Jay | Yeah... a family. |
| TL | Everyone still has nothing to do. Sue carries a pile of **table tennis** balls out of the storeroom, and Vicky grabs cans of paint | | |
| | | Vicky | Let's paint them! |
| | | Ben | Paint what? |
| | | Vicky | Roballs! |
| | Vicky paints a Roball's eyes and markings onto a table tennis ball, and it looks convincing | | |
| | | May | Ha! They look the same! |
| | [emoji] Ben, starry-eyed | | |
| Mid interaction | | User | These **table tennis** balls look **like** Roballs! |
| | | Ben | I can't tell which is real! |
| TL | BoBo rolls over to a pile of painted "fake Roballs" and blends in, keeping perfectly still | | |
| | | Sue | Where is BoBo? |
| | BoBo pops out of the pile of balls | | |
| | | BoBo | Boo! |
| | [emoji] Sue, surprised | | |
| | Everyone bursts out laughing | | |
| Mid interaction | | User | We **are having** fun **now**! |
| | | Vicky | Best waiting ever! |
| TL | Suddenly: thud, thud, thud! Heavy knocking comes from the bookstore's front door | | |
| | [SFX] Thud! Thud! Thud! | | |
| | Everyone falls silent. Jay stands up and squints toward the door | | |
| | | Jay | Wait... |
| | But Ben and Vicky have already recognized the tall silhouette outside the door: the "mayor's" hat and coat | | |
| | | Ben | It's Tom! |
| | | Vicky | Tom is here! |
| | Ben rushes over and opens the door. Wind and sand pour in, and a figure walks in. In the dim light, the face really does look like Tom's | | |
| | The children crowd around. User hands over the fairy-tale book | | |
| | | User | Tom! We found something! |
| | | User | Roballs can protect the town! |
| | The "mayor" takes the book, flips through it, and smiles | | |
| | | Tony (disguised) | Good job, kids. Very good. |
| Mid interaction | | User | We **are saving** the town **now**! |
| | [emoji] Jay, sweating. Jay hurries back from the direction of the storeroom, looking alarmed | | |

---

### Development (承) knowledge-point statistics

| Knowledge point | Input | Output |
|-----------------|-------|--------|
| now | 0 | 2 (are having fun now + are saving the town now) |
| table tennis | 1 (Sue carries them out) | 1 (User: table tennis balls look like) |
| like | 1 (Lin: not like Tony) | 2 (are like a family + look like Roballs) |
| ...be like... | 0 | 1 (are like a big family) |
| be doing...now | 0 | 2 (are having fun now + are saving the town now) |
| person | 0 | 0 |

### Development (承) interaction count: 4 mid interactions
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
确认这段没问题,我接着发【转】和【合】。
|
||||||
|
user: Change the User line "They are like a big family." to "We are like a big family."

Vicky's "Let's paint them" feels a bit abrupt.

assistant: OK, two changes:

**1.** "They are like a big family" → **"We are like a big family."**

**2.** Vicky's "Let's paint them" does lack a transition. Change it so that, while everyone is bored, someone first notices how much the table tennis balls resemble Roballs, which then leads naturally into painting:

**Before:**
> Sue carries a pile of table tennis balls out of the warehouse; Vicky grabs paint cans
> Vicky: Let's paint them!

**After:**
> Everyone idly fiddles with the table tennis balls from the warehouse; Sue holds one up next to BoBo to compare
> Sue: Hey! This ball looks like BoBo!
> Vicky spots some old paint cans nearby, and her eyes light up
> Vicky: I have an idea! Let's paint faces on them!

This makes the move from "noticing the resemblance" to "starting to paint" smoother. Does that work for you?
|
||||||
|
user: How about this: User talks with 速猴, and after the conversation User notices someone painting and asks what they are doing. Vicky then says "we are painting them now", and User can say "They are like Roballs".

assistant: OK, that is more natural. The middle part of the 承 section changes to:

---

**After revision:**

| Type | Scene description | Character | Scripted line |
|------|---------|--------|---------|
| Mid-interaction | | User | We **are like** a big family. |
| | | Jay | Yeah... a family. |
| TL | After talking with 速猴, User stands up and walks back to the others, and finds Vicky, Sue and Ben huddled together, paint on their hands, painting on table tennis balls | | |
| | | User | What are you doing? |
| | 【emoji】Vicky星星 | | |
| | | Vicky | We **are painting** them **now**! |
| | Vicky holds up a finished table tennis ball, painted with a Roball's eyes and patterns | | |
| Mid-interaction | | User | They **are like** Roballs! |
| | | Ben | I can't tell which is real! |
| TL | BoBo rolls over to the pile of painted "fake Roballs" and hides among them, motionless | | |

---

With this change:

- User starts the conversation ("What are you doing?") → actively drives the scene
- Vicky's answer naturally supplies the input for be doing...now
- User's exclamation supplies the output for like

Everything else stays the same. Does the 承 section pass as a whole?

user: pass

assistant: OK, the 承 section passes. You have already seen the 转 section; I'll send it once more for confirmation and then go on to write the 合 section.

The 转 section was sent earlier and you raised no changes. If you confirm it passes, shall I go straight to writing 【合】?
|
||||||
|
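user: First save the current results as a history file. If I restart you later and want to pick up this task where we left off, how do I wake you?

assistant: Saved. I stored two files:

1. **Task log** `business_knowledge/task_log/U24L4剧本.md`: records progress and key feedback
2. **Script archive** `business_knowledge/task_log/U24L4_draft_progress.md`: complete 起/承/转 scripts + knowledge-point statistics + design notes for the core interaction

---

**How to wake me next time:** just tell me:

> **"继续U24L4剧本"** ("Continue the U24L4 script")

I will automatically read the task log and archive files and restore the current progress (转 section awaiting confirmation → write the 合 section).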
|
||||||
229
memory/2026-05-25-1805.md
Normal file
@ -0,0 +1,229 @@
|
|||||||
|
# Session: 2026-05-25 18:05:29 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_859415bde7f8e07cf5f24940f299a287
|
||||||
|
- **Session ID**: 75c7e785-b8f6-4116-9979-ce74b393d9a4
|
||||||
|
- **Source**: feishu
|
||||||
|
|
||||||
|
## Conversation Summary
|
||||||
|
|
||||||
|
assistant: NO_REPLY
|
||||||
|
user: https://makee-interactive.feishu.cn/wiki/GvXTwO7V5iulQtkDOqPcM8o4nBe
|
||||||
|
新剧本,看完我们进合,进合之前我有细节给你。你先看,等我指示。
|
||||||
|
assistant: 看完了。你的改动让转段生动了很多,几个亮点:
|
||||||
|
|
||||||
|
- **Tom 拒吃 Kiwi 的戏拉长了** — Smith "Try a bite" → Tom "It smells good..." → Fiona 再逼 "Come on. Take a bite." → Tom 勉强吃一口 "Not. Bad." → 立刻逃。评委会集体目睹他的反常,比原来 User/May 偷偷观察的视角更强
|
||||||
|
- **Jack 成了观察者** — "Do you see whats wrong with Tom?" 让成人角色也参与见证可疑,承段埋的"全镇以为Tom爱吃"在转段开始反噬
|
||||||
|
- **Sam 的情绪线** — "Kalab's kiwi is sweet. My pineapple doesn't have a chance!" → 绝望→厕所→被鼓励→"I will try",完整的微型弧线
|
||||||
|
- **Tom 的厕所梗** — Smith "He went to toilet again?" → Fiona 收尾 "Tom went to toilet again!" → went to the toilet 顺便多刷了一次输入
|
||||||
|
|
||||||
|
**中互动核查:**
|
||||||
|
|
||||||
|
| 中互动 | 知识点 | 状态 |
|
||||||
|
|--------|--------|------|
|
||||||
|
| Tom did not eat Kiwi! | kiwi | ✅ |
|
||||||
|
| The kiwi is so sweet! | kiwi | ✅ |
|
||||||
|
| Sam went to the toilet! | went to the toilet | ✅ |
|
||||||
|
| You have the best fruit! | fruit | ✅ |
|
||||||
|
|
||||||
|
4 中互动全部含知识点。全课中互动累计 12,知识点全部达标。
|
||||||
|
|
||||||
|
等你合段指示。
|
||||||
|
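user: Sam has been brought to the venue, but we find that the other judge, Tom, is nowhere to be seen.

We find a red-faced Tom in the toilet. He says he is fine, and we go back with him. Tom refuses to eat the pineapple; he only sniffs it and then announces that the champion of the fruit-tasting contest is Sam's Pineapple!

We are happy for Sam, but also puzzled. Was this contest really fair?

But Mayor Tom has vanished again. We bring this up with Lily, who is there, and Lily says that kiwi and pineapple make some people's faces turn very red. But we find it odd: doesn't Tom love eating Pineapple and Kiwi?

The above is the outline for 「合」. I need the following changes:

We no longer go to the toilet to fetch Tom; all that walking back and forth hurts the pacing. Instead, Smith says: "I'll go to the toilet and bring him back!" Then the shot cuts, and a red-faced Tom is brought to the tasting. Tom flatly refuses to eat the Pineapple and simply declares that Sam has won.

Finally, instead of us talking to Lily about this, Fiona walks over on her own to speak to us and asks us to look after the mayor: she can tell that Tom is allergic. This is a core listening interaction. Lily is cut.

Produce a new beat sheet. It should include two mid-interactions and the final listening interaction.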
|
||||||
|
assistant: ## U23-L2「合」节奏表 V1
|
||||||
|
|
||||||
|
### 段落一:Tom 缺席 → Smith 去带人
|
||||||
|
|
||||||
|
| 类型 | 内容 |
|
||||||
|
|------|------|
|
||||||
|
| TL | 接转段结尾:众人回到评审桌前。Kalab、Sam 都就位。Tom 的座位空着 |
|
||||||
|
| TL | Smith 扫了一眼: "Where is Tom?" |
|
||||||
|
| TL | Fiona 往厕所方向张望: "He **went to the toilet**. He was not feeling well." |
|
||||||
|
| TL | Smith 站起来: "I will bring him back." |
|
||||||
|
| TL | 【转场】Smith 扶着满脸通红的 Tom 走回来。全场安静 |
|
||||||
|
| TL | Tom 揉着脸,挥手: "I am fine. Let's do this." |
|
||||||
|
|
||||||
|
### 段落二:Tom 拒吃 Pineapple → 宣布冠军
|
||||||
|
|
||||||
|
| 类型 | 内容 |
|
||||||
|
|------|------|
|
||||||
|
| TL | Sam 把 Pineapple 端上评审桌。果肉金黄,香气四溢 |
|
||||||
|
| TL | Fiona 尝了一片: "It is sweet!" Smith 尝了一片: "Very good." |
|
||||||
|
| TL | Kalab 在一边看着,手指敲着桌子 |
|
||||||
|
| TL | 轮到 Tom。所有人看着他。Tom 盯着面前的 Pineapple,手没动 |
|
||||||
|
| TL | Tom 拿起一片——放到鼻子前——闻了一下——放回去。没吃 |
|
||||||
|
| **中互动** | User → "Tom did not eat the **pineapple**!" |
|
||||||
|
| TL | Tom 突然站起来 |
|
||||||
|
| TL | Tom: "Sam wins." |
|
||||||
|
| TL | Tom: "His **pineapple** is the best **fruit**." |
|
||||||
|
| TL | 【emoji】Sam😲 May🎉 |
|
||||||
|
| **中互动** | User → "Sam's **pineapple** is the winner!" |
|
||||||
|
|
||||||
|
### 段落三:Tom 又不见了 + Fiona 揭露过敏(核心听力)
|
||||||
|
|
||||||
|
| 类型 | 内容 |
|
||||||
|
|------|------|
|
||||||
|
| TL | 欢呼声中,Tom 悄悄从椅子上起身——溜向厕所方向 |
|
||||||
|
| TL | May 转头,座位又空了: "Tom **went to the toilet** again?" |
|
||||||
|
| TL | Jack: "He did not even try the **pineapple**..." |
|
||||||
|
| TL | Jack: "Was the game fair?" |
|
||||||
|
| TL | 我们也在想同样的事。这时 Fiona 走过来,表情温和 |
|
||||||
|
| TL | Fiona: "Kids. I want to tell you something." |
|
||||||
|
| TL | Fiona: "When Tom eats **kiwi** or **pineapple**..." |
|
||||||
|
| TL | Fiona: "...his face turns red." |
|
||||||
|
| TL | Fiona: "His eyes get puffy. He cannot breathe well." |
|
||||||
|
| TL | Fiona: "That is a food allergy." |
|
||||||
|
| TL | Fiona: "Some **fruit** can make people sick." |
|
||||||
|
| **听力互动** | Fiona 刚才说了 Tom 为什么不吃 pineapple。请 User 回答:Why did Tom not eat the **pineapple**? |
|
||||||
|
| TL | Fiona 微笑,拍拍 User: "That is right. Look after Tom for me, okay?" |
|
||||||
|
| TL | ⚡收束钩子:Fiona 走了。May 看着 User — 眼神里全是问号。Jack 说过 Tom 喜欢吃 Pineapple。Jack 也说过 Tom 喜欢吃 Kiwi。一个对这两种水果过敏的人,为什么全镇都以为他喜欢? |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 全课最终盘点
|
||||||
|
|
||||||
|
| 知识点 | 起→合计 | 状态 |
|
||||||
|
|--------|----------|------|
|
||||||
|
| fruit | 4→1 + 转1 + 合1 | ✅ |
|
||||||
|
| kiwi | 4→1 + 转2 | ✅ |
|
||||||
|
| pineapple | 3→2 + 承1 + 转 + 合2 | ✅ |
|
||||||
|
| toilet | 承2→1 + 转2 | ✅ |
|
||||||
|
| went to the toilet | 承2→1 + 转1 + 合2 | ✅ |
|
||||||
|
| What fruit do you like? | 1→1 + 承2→1 | ✅ |
|
||||||
|
|
||||||
|
**全课中互动:** 14 个(起4 + 承4 + 转4 + 合2) + 1 听力互动
|
||||||
|
**Lily:** 已删,替换为 Fiona
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
请审阅。
|
||||||
|
user: 听力互动是核心互动,写剧本的时候请遵照核心互动的写法。现在出剧本我看看。
|
||||||
|
assistant: ## U23-L2「合」剧本 V1
|
||||||
|
|
||||||
|
| 类型 | 剧情描述 | 角色名 | 编剧台词 |
|
||||||
|
|------|---------|--------|---------|
|
||||||
|
| TL | 【场景】会场 — 评审桌前<br>【角色】User、May、Sam、Jack、Kalab、Tom、Smith、Fiona | | |
|
||||||
|
| | 接转段结尾:众人回到评审桌前。Kalab、Sam 都就位。Tom 的座位空着 | | |
|
||||||
|
| | | Smith | Where is Tom? |
|
||||||
|
| | Fiona 往厕所方向望了一眼 | | |
|
||||||
|
| | | Fiona | He **went to the toilet**. |
|
||||||
|
| | | Fiona | He was not feeling well. |
|
||||||
|
| | Smith 站起来 | | |
|
||||||
|
| | | Smith | I will bring him back. |
|
||||||
|
| | 【转场】Smith 扶着满脸通红的 Tom 走回来。全场安静 | | |
|
||||||
|
| | Tom 坐下,揉着脸 | | |
|
||||||
|
| | | Tom | I am fine. Let's do this. |
|
||||||
|
| TL | Sam 把 Pineapple 端上评审桌。果肉金黄,香气四溢 | | |
|
||||||
|
| | | Fiona | Wow. It looks good! |
|
||||||
|
| | | Smith | Let's try. |
|
||||||
|
| | Fiona 尝了一片,眼睛亮了。Smith 尝了一片,满意地点头 | | |
|
||||||
|
| | 轮到 Tom。所有人看着他。Tom 盯着面前的 Pineapple,手没动 | | |
|
||||||
|
| | Tom 拿起一片——送到鼻子前——闻了一下——放回盘子。没吃 | | |
|
||||||
|
| **中互动** | | User | Tom did not eat the **pineapple**! |
|
||||||
|
| | Tom 突然站起来 | | |
|
||||||
|
| | | Tom | Sam wins. |
|
||||||
|
| | | Tom | His **pineapple** is the best. |
|
||||||
|
| | 【emoji】Sam😲 May🎉 Kalab😐 | | |
|
||||||
|
| | | May | You won, Sam! |
|
||||||
|
| | | Sam | I... I did it? |
|
||||||
|
| **中互动** | | User | Sam's **pineapple** is the winner! |
|
||||||
|
| TL | 欢呼声中,Tom 悄悄从椅子上起身——往厕所方向溜去 | | |
|
||||||
|
| | May 转头,座位又空了 | | |
|
||||||
|
| | | May | Tom **went to the toilet** again? |
|
||||||
|
| | | Jack | He did not even try the **pineapple**. |
|
||||||
|
| | | Jack | Was the game fair? |
|
||||||
|
| | User 和 Sam 也在想同样的事 | | |
|
||||||
|
| | 【emoji】User🤨 May🤨 | | |
|
||||||
|
| | 这时 Fiona 走过来,表情温和 | | |
|
||||||
|
| | | Fiona | Kids. I want to tell you something. |
|
||||||
|
| | | Fiona | Tom is not okay. |
|
||||||
|
| 核心互动-听力选择 | 第1题 — 【插入图】三张选项图(图A: apple + banana / 图B: kiwi + pineapple / 图C: grapes + watermelon),正确答案:图B | | |
|
||||||
|
| | | May | Why is his face so red? |
|
||||||
|
| | | Fiona | He ate some **fruit**. |
|
||||||
|
| | | Fiona | His face turns red from **kiwi** or **pineapple**. |
|
||||||
|
| | 第2题 — 【插入图】三张选项图(图A: 笑脸Tom / 图B: 脸红眼肿的Tom / 图C: 睡觉的Tom),正确答案:图B | | |
|
||||||
|
| | | May | What happens to him? |
|
||||||
|
| | | Fiona | His face turns red. |
|
||||||
|
| | | Fiona | His eyes get puffy too. |
|
||||||
|
| | 第3题 — 【插入图】三张选项图(图A: Tom很饿 / 图B: Tom很累 / 图C: Tom对食物过敏),正确答案:图C | | |
|
||||||
|
| | | May | So he cannot eat **kiwi** or **pineapple**? |
|
||||||
|
| | | Fiona | Yes. That is a food allergy. |
|
||||||
|
| | | Fiona | Some **fruit** can make people sick. |
|
||||||
|
| TL | Fiona 微笑,拍拍 User 和 May | | |
|
||||||
|
| | | Fiona | You got it. |
|
||||||
|
| | | Fiona | Look after Tom for me, okay? |
|
||||||
|
| | Fiona 转身离开 | | |
|
||||||
|
| | May 看着 User,眼神里全是问号 | | |
|
||||||
|
| | | May | Wait. That is strange. |
|
||||||
|
| | | May | Jack said Tom loves **pineapple**... |
|
||||||
|
| | | May | Jack said Tom loves **kiwi** too. |
|
||||||
|
| | | May | Why does he love them? |
|
||||||
|
| | | May | They can make him sick! |
|
||||||
|
| | ⚡收束钩子:User 看着 Tom 消失的方向。一个对 Kiwi 和 Pineapple 过敏的人,为什么全镇都以为他爱吃这两种水果?这个 Tom……到底是谁? | | |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 自查
|
||||||
|
|
||||||
|
**中互动含知识点:** ✅
|
||||||
|
|
||||||
|
| 中互动 | 知识点 |
|
||||||
|
|--------|--------|
|
||||||
|
| Tom did not eat the pineapple! | pineapple ✅ |
|
||||||
|
| Sam's pineapple is the winner! | pineapple ✅ |
|
||||||
|
|
||||||
|
**核心互动覆盖:** kiwi、pineapple、fruit → 3 个知识点 ✅
|
||||||
|
**核心互动 3 轮听力:** Q1→Q2→Q3,每轮含 3 张选项图 + NPC 对话音频 ✅
|
||||||
|
**NPC 最长句:** "Look after Tom for me, okay?"(7 词)✅
|
||||||
|
**Fiona → 替换 Lily:** ✅
|
||||||
|
**Lily 全部删除:** ✅
|
||||||
|
**厕所捞人:** User 不去,Smith 代劳 ✅
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
全课至此起承转合全部写完。你看。
|
||||||
|
user: User Tom did not eat the pineapple! <<< Rework the setup for this mid-interaction.

Smith finally brings Tom back. User asks: What fruit does he like? Pineapple or Kiwi?

That way the sentence pattern What fruit do/does you(he/she/it) like? gets used one more time.

assistant: I'll change this one spot and regenerate the Excel.

assistant: NO_REPLY

user: The script is written; the task is finished.

assistant: I have read back the final version of the whole lesson. Closing out now.

assistant: Everything is closed out:

- `task_log/U23-L2剧本.md`: whole-lesson progress recorded
- `active_tasks.md`: #6 U23-L2 marked complete; duplicate numbering fixed
- `memory/2026-05-25.md`: completion record written
- `script_feedback_log.md`: rule #4 "a mid-interaction must contain a knowledge point" escalated to a high-frequency error

U23-L2 is complete across all four sections (起/承/转/合). Call me any time you have the L3 outline.
|
||||||
5
memory/2026-05-25-1806.md
Normal file
@ -0,0 +1,5 @@
|
|||||||
|
# Session: 2026-05-25 18:06:09 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_859415bde7f8e07cf5f24940f299a287
|
||||||
|
- **Session ID**: 8eeb7b5b-c22c-443e-acaa-a316abb454b7
|
||||||
|
- **Source**: feishu
|
||||||
5
memory/2026-05-25-1825-2.md
Normal file
@ -0,0 +1,5 @@
|
|||||||
|
# Session: 2026-05-25 18:25:41 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_859415bde7f8e07cf5f24940f299a287
|
||||||
|
- **Session ID**: bc945566-cad8-4776-9d0d-c87b03a4d5da
|
||||||
|
- **Source**: feishu
|
||||||
294
memory/2026-05-25-1825.md
Normal file
@ -0,0 +1,294 @@
|
|||||||
|
# Session: 2026-05-25 18:25:36 GMT+8
|
||||||
|
|
||||||
|
- **Session Key**: agent:xiaobian:feishu:direct:ou_859415bde7f8e07cf5f24940f299a287
|
||||||
|
- **Session ID**: e4f5a669-2195-4d14-b7d9-a19d88f0816f
|
||||||
|
- **Source**: feishu
|
||||||
|
|
||||||
|
## Conversation Summary
|
||||||
|
|
||||||
|
assistant: 找到表格了。现在读取知识点 Sheet 和剧本 Sheet。
|
||||||
|
assistant: 资料读取完成。让我再检查一下其他 Sheet 是否有额外信息,同时也参考一下之前的剧本格式。
|
||||||
|
assistant: 资料已全部读取。现在整理一下:
|
||||||
|
|
||||||
|
**知识点 (6个):**
|
||||||
|
- `kitchen`(单词)| `kid`(单词,需有单数教学)| `many`(单词)| `really`(单词)
|
||||||
|
- `It's really...`(句型)| `There are many...`(句型)
|
||||||
|
|
||||||
|
**剧情:** 4场 + 尾声,沙暴中避难 → 厨房做饭 → 油锅起火 → 停电密道
|
||||||
|
|
||||||
|
按林逸瀚协作惯例,先出节奏表。我查一下参考格式和规范再动手。
|
||||||
|
assistant: 格式和协作流程已确认。按惯例先出节奏表。
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## U24-L2 节奏表 V1
|
||||||
|
|
||||||
|
**知识点:** kitchen / kid(s) / many / really / It's really... / There are many...
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 第一场:沙暴避难 — 开场
|
||||||
|
|
||||||
|
| # | 类型 | 内容 |
|
||||||
|
|---|------|------|
|
||||||
|
| 1 | TL | 【场景】Tom家客厅(窗外风沙漫天)【全屏插入图】沙尘暴中 Mama Green 站在 Tom 家门口,护着孩子们往里跑 |
|
||||||
|
| 2 | TL | Mama Green 扫视一圈:少了一个 |
|
||||||
|
| 3 | | **Mama Green:** "Where is Sue?" |
|
||||||
|
| 4 | | **Kid A:** "She didn't come back!" |
|
||||||
|
| 5 | TL | Mama Green 二话不说冲进风沙 |
|
||||||
|
| 6 | | **User:** "It's really windy!" → 🎯 `really` 输入 + `It's really...` 句型输入 |
|
||||||
|
| 7 | TL | 【全屏插入图2】Mama Green 提着 Sue 回来,风沙中身影坚毅 |
|
||||||
|
| 8 | | **User:** "Mama Green is amazing!" |
|
||||||
|
| 9 | | **Kid B:** "Thank you, Mama Green!" |
|
||||||
|
| 10 | TL | 【转场】切到 Tom 家客厅内。孩子们坐在地上,个个灰头土脸 |
|
||||||
|
| 11 | | **User:** "I'm really hungry." → 🎯 `really` 输入 |
|
||||||
|
| 12 | TL | Mama Green 看向 Jay 和 Lin |
|
||||||
|
| 13 | | **Mama Green:** "You two. Cook for the kids." |
|
||||||
|
| 14 | TL | 转身对 Tony |
|
||||||
|
| 15 | | **Mama Green:** "Look after the kids for me, okay?" |
|
||||||
|
| 16 | | **Tony:** "Okay!" |
|
||||||
|
| 17 | TL | User 和 Sue 站起来,主动请缨 |
|
||||||
|
| 18 | | **User:** "I want to help in the kitchen!" → 🎯 `kitchen` 输入 |
|
||||||
|
| 19 | | **Sue:** "Me too!" |
|
||||||
|
| 20 | | **Mama Green:** "Great! Two little helpers." → 🎯 `kid` 语境铺垫 |
|
||||||
|
| 21 | TL | Jay 斜眼看着 User 和 Sue,嗤之以鼻 |
|
||||||
|
| 22 | | **Jay:** "Kids in the kitchen?" |
|
||||||
|
| 23 | | **Lin:** "Kids can't do many things." → 🎯 `kids` 输入 / `many` 输入 |
|
||||||
|
| 24 | **中互动** | **User 对 Jay/Lin 的反驳** → 🎯 User 输出 |
|
||||||
|
| | | **User:** "Kids can do many things!" → 🎯 `kids` 输出 / `many` 输出 |
|
||||||
|
| 25 | TL | Jay 耸耸肩,转身走进厨房。Lin 冲 User 做了个鬼脸跟上 |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 第二场:厨房大作战
|
||||||
|
|
||||||
|
| # | 类型 | 内容 |
|
||||||
|
|---|------|------|
|
||||||
|
| 26 | TL | 【场景】Tom家厨房(橱柜、灶台、食材堆满桌面)【角色】User、Sue、Mama Green、Jay、Lin |
|
||||||
|
| 27 | TL | User 环顾四周——食材堆得满满当当 |
|
||||||
|
| 28 | | **User:** "There are many ingredients!" → 🎯 `There are many...` 句型输入 |
|
||||||
|
| 29 | | **Mama Green:** "Yes! Let's make beef burgers." |
|
||||||
|
| 30 | TL | Green 分配任务:User 和 Sue 负责洗菜叠菜 |
|
||||||
|
| 31 | TL | User 和 Sue 配合默契——一个洗,一个叠,蔬菜塔越堆越高 |
|
||||||
|
| 32 | | **Mama Green:** "Good job, kids!" → 🎯 `kids` 输入 |
|
||||||
|
| 33 | TL | Jay 和 Lin 在另一边准备鸡蛋,Lin 想耍帅 |
|
||||||
|
| 34 | | **Lin:** "Watch this!" |
|
||||||
|
| 35 | TL | 【spine动画】Jay 学杂技单手打蛋——鸡蛋飞出去,"啪"砸在 Lin 头上,蛋液顺着脸流下 |
|
||||||
|
| 36 | TL | User 和 Sue 忍不住笑出声 |
|
||||||
|
| 37 | | **User:** "It's really funny!" → 🎯 `really` 输出 / `It's really...` 句型输出 |
|
||||||
|
| 38 | TL | Jay 和 Lin 开始幼稚地互相甩锅 |
|
||||||
|
| 39 | | **Jay:** "It's your head!" |
|
||||||
|
| 40 | | **Lin:** "It's your egg!" |
|
||||||
|
| 41 | TL | Mama Green 看着两人摇头叹气 |
|
||||||
|
| 42 | | **Mama Green:** "You two are really like kids." → 🎯 `really` 输入 |
|
||||||
|
| 43 | TL | Green 拿起面粉盆往外走 |
|
||||||
|
| 44 | | **Mama Green:** "I will knead the dough outside." |
|
||||||
|
| 45 | TL | Lin 擦掉头上的蛋液,不服气地看了一眼 User 这边 |
|
||||||
|
| 46 | | **Lin:** "Now, we fry the steak." |
|
||||||
|
| 47 | TL | Jay 开火,锅里倒油。Lin 故意走到 User 身边 |
|
||||||
|
| 48 | | **Lin:** "Fire and oil. Very dangerous." |
|
||||||
|
| 49 | | **Lin:** "Don't come close, kids." → 🎯 `kids` 输入 |
|
||||||
|
| 50 | TL | Lin 走回灶台。Jay 和 Lin 开始煎牛排,故意把火烧得很旺,嘚瑟地回头瞟 User |
|
||||||
|
| 51 | **中互动** | User 的内心吐槽(旁白) |
|
||||||
|
| | | **User:** "It's really dangerous!" → 🎯 `really` 输出 / `It's really...` 句型输出 |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 第三场:油锅起火
|
||||||
|
|
||||||
|
| # | 类型 | 内容 |
|
||||||
|
|---|------|------|
|
||||||
|
| 52 | TL | 【音效】窗外风暴声骤然变大,狂风猛烈拍打窗户 |
|
||||||
|
| 53 | | **Jay:** "What was that?!" |
|
||||||
|
| 54 | | **Lin:** "Let's go check!" |
|
||||||
|
| 55 | TL | Jay 和 Lin 匆忙跑出厨房——忘了关火! |
|
||||||
|
| 56 | TL | 【音效】油锅滋滋作响 → 啪一声冒烟 → 轰一声火焰窜起 |
|
||||||
|
| 57 | TL | 厨房浓烟弥漫,所有人剧烈咳嗽 |
|
||||||
|
| 58 | | **User:** "The kitchen is on fire!" → 🎯 `kitchen` 输出 |
|
||||||
|
| 59 | TL | Jay 和 Lin 冲回来,对着火焰又吹又扇——火焰反而蹿得更高 |
|
||||||
|
| 60 | TL | 两人吓得跳开好几米远 |
|
||||||
|
| 61 | | **Jay & Lin:** "Ahhh!!" |
|
||||||
|
| 62 | | **Sue:** "It's really dangerous!" → 🎯 `really` 输入 / `It's really...` 句型输入 |
|
||||||
|
| 63 | TL | Sue 灵机一动 |
|
||||||
|
| 64 | | **Sue:** "Cover the pot! Cover the pot!" |
|
||||||
|
| 65 | TL | 用户帮忙大喊 |
|
||||||
|
| 66 | | **User:** "Cover the pot!" |
|
||||||
|
| 67 | TL | Mama Green 从外面冲进来,一把拿起锅盖,猛地盖上去——火焰瞬间熄灭 |
|
||||||
|
| 68 | TL | 厨房安静了几秒。Green 转身看着缩在角落的 Jay 和 Lin |
|
||||||
|
| 69 | | **Mama Green:** "The kids are calmer and braver than you!" → 🎯 `kids` 输入 |
|
||||||
|
| 70 | **核心互动**(口语 — User 向 Mama Green 复述刚才发生了什么) |
|
||||||
|
| | | **Mama Green:** "What happened?" |
|
||||||
|
| | | **User 复述3轮:** ① 风太大,他们跑去看 → ② 忘了关火,油锅着火 → ③ Sue 说盖锅盖,我们喊了你 |
|
||||||
|
| 71 | TL | 客厅里 Tony 带着 kids 也走进来帮忙收拾 |
|
||||||
|
| 72 | TL | Jay 和 Lin 低着头走过来 |
|
||||||
|
| 73 | | **Jay:** "We are sorry." |
|
||||||
|
| 74 | | **Lin:** "Kids can do many things." → 🎯 `kids` 输出 / `many` 输出 |
|
||||||
|
| 75 | TL | Mama Green 摸摸两人的头,笑了 |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 第四场:停电密道
|
||||||
|
|
||||||
|
| # | 类型 | 内容 |
|
||||||
|
|---|------|------|
|
||||||
|
| 76 | TL | 【转场】餐桌上摆满了汉堡,大家围坐在一起 |
|
||||||
|
| 77 | | **User:** "The burgers are great!" |
|
||||||
|
| 78 | TL | 【音效】外面风声再次变强——轰隆隆 |
|
||||||
|
| 79 | TL | 啪嗒一声,灯全灭了。客厅陷入黑暗 |
|
||||||
|
| 80 | TL | 【音效】孩子们惊慌失措的叫声 |
|
||||||
|
| 81 | | **Mama Green:** "Stay calm! Stay together!" |
|
||||||
|
| 82 | TL | 黑暗中有人撞到了身后的矮柜 |
|
||||||
|
| 83 | TL | 【音效】柜子移开的沉闷声响 |
|
||||||
|
| 84 | TL | 矮柜后面,露出一条向下延伸的楼梯——密道! |
|
||||||
|
| 85 | TL | 所有人都愣住了,盯着那个入口 |
|
||||||
|
| 86 | TL | 【黑屏】TO BE CONTINUED... |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### Knowledge-point coverage tracking

| Knowledge point | Inputs | Input positions | Outputs | Output positions |
|--------|:---:|------|:---:|------|
| kitchen (word) | 1 | #18 User | 1 | #58 User |
| kid(s) (word) | 4 | #23 Lin, #32 Green, #49 Lin, #69 Green | 3 | #24 mid-interaction, #74 Lin, #69 context |
| many (word) | 1 | #23 Lin | 2 | #24 mid-interaction, #74 Lin |
| really (word) | 4 | #6 User, #11 User, #42 Green, #62 Sue | 2 | #37 User, #51 mid-interaction |
| It's really... (pattern) | 2 | #6 User, #62 Sue | 2 | #37 User, #51 mid-interaction |
| There are many... (pattern) | 1 | #28 User | 0 | none |

**⚠️ Reminder:** the `There are many...` pattern currently has input only and no output; an output point needs to be added somewhere.
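The gap flagged above can also be found mechanically. The following is a minimal sketch, assuming the beat-sheet tags have been copied by hand into a list of tuples; `TAGS`, `tally` and `missing_output` are hypothetical names introduced here for illustration, and only a subset of the rows is included.

```python
from collections import Counter

# Knowledge points for this lesson, as listed in the beat sheet.
POINTS = ["kitchen", "kid(s)", "many", "really", "It's really...", "There are many..."]

# Hypothetical hand-copied tags: (row number, knowledge point, direction).
# Only a subset of the beat-sheet rows is included for illustration.
TAGS = [
    (6, "really", "input"),
    (6, "It's really...", "input"),
    (18, "kitchen", "input"),
    (23, "kid(s)", "input"),
    (23, "many", "input"),
    (24, "kid(s)", "output"),
    (24, "many", "output"),
    (28, "There are many...", "input"),
    (37, "really", "output"),
    (37, "It's really...", "output"),
    (58, "kitchen", "output"),
]


def tally(tags):
    """Count how often each (point, direction) pair occurs."""
    return Counter((point, direction) for _, point, direction in tags)


def missing_output(tags, points):
    """Return the points that have at least one input but no output."""
    counts = tally(tags)
    return [p for p in points if counts[(p, "input")] > 0 and counts[(p, "output")] == 0]


print(missing_output(TAGS, POINTS))
```

Keeping the row number in each tag makes it easy to point back at the exact beat when a count looks wrong.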
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
你看看节奏表,有问题直接说。
|
||||||
|
user: Hold on for a moment and first update the workflow for collaborating with me.

Each Lesson (e.g. U22 L1, L2…) advances through the following flow:

```
1. Receive story outline + knowledge points
        ↓
2. Break down 「起」 → produce beat sheet (story beats + knowledge input + mid-interactions)
        ↓
3. 林逸瀚 gives feedback → revise beat sheet (possibly several rounds)
        ↓
4. Beat sheet finalized → output script
        ↓
5. 林逸瀚 reviews → may request language-difficulty / line adjustments
        ↓
6. Script finalized → convert to Excel and send to 林逸瀚
        ↓
7. 林逸瀚 makes final edits → returns the final version (Feishu knowledge-base link)
        ↓
8. Move on to 「承」 → repeat steps 2-7
        ↓
9. Move on to 「转」 → repeat steps 2-7
        ↓
10. Move on to 「合」 → repeat steps 2-7
        ↓
11. ✅ Script for this Lesson complete
```

**Additional notes (distilled from actual collaboration):**
- Steps 2-3: the beat sheet may go through several iterations (V1→V2→V3...), with 林逸瀚 giving specific direction for each adjustment
- Steps 4-5: after the script is produced, the language difficulty may need adjusting (lines must match Pre-A1 level)
- Step 6: the Excel format follows the 4-column standard (type / scene description / name / line), with knowledge points marked in rich text
- Step 7: 林逸瀚's final revised version is returned as a Feishu knowledge-base document
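The flow above is the same six steps (2-7) repeated for each of the four sections, bracketed by an intake step and a completion step. A minimal sketch of that structure as a checklist generator follows; the step wording and the `lesson_plan` helper are illustrative assumptions, not part of any existing tooling.

```python
# The four sections of a lesson, in order.
STAGES = ["起", "承", "转", "合"]

# Steps 2-7 of the workflow, repeated for every section (wording is illustrative).
STAGE_STEPS = [
    "draft beat sheet",
    "revise beat sheet from feedback",
    "write script from final beat sheet",
    "adjust language difficulty and lines",
    "export final script to Excel",
    "read back the reviewer's final version",
]


def lesson_plan(lesson):
    """Return the ordered checklist for one lesson."""
    plan = [f"{lesson}: receive outline and knowledge points"]
    for stage in STAGES:
        for step in STAGE_STEPS:
            plan.append(f"{lesson} {stage}: {step}")
    plan.append(f"{lesson}: script complete")
    return plan


plan = lesson_plan("U24-L2")
print(len(plan))   # 1 intake + 4 sections x 6 steps + 1 completion
print(plan[1])
```

A flat list like this makes it straightforward to record "where we are" in a task log as a single index.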
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
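## Beat sheet notes

A beat sheet must clearly present three dimensions:
1. **Story beats:** plot segments broken down along the timeline, so the length of each segment is visible
2. **Knowledge input:** in which beat, and in what way, each word or sentence pattern appears (first introduction / recurrence / reinforcement)
3. **Mid-interactions:** the position, type and number of interaction points

A beat sheet makes it quick to see:
- how long a given plot segment is
- the total number of mid-interactions
- whether enough knowledge points have been input

assistant: Got it. I'll first update the collaboration workflow in memory and the log, then redo this step by step following 「起承转合」.

assistant: The workflow is now updated in the log and memory. Redoing it by 「起承转合」, starting with the **「起」** section.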
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## U24-L2「起」节奏表 V1
|
||||||
|
|
||||||
|
### 一、桥段拆解
|
||||||
|
|
||||||
|
```
|
||||||
|
┌────────────────────────────────────────────────────────┐
|
||||||
|
│ 段落1 沙暴中避难 │
|
||||||
|
│ · Mama Green 守住Tom家门口,孩子们跑进 → Sue不见了 │
|
||||||
|
│ · Mama Green 冲进风沙找回Sue → 大家惊叹 │
|
||||||
|
│ · 🎯 really + It's really...(首次引入,User看沙暴感叹) │
|
||||||
|
│ ~5个节点 │
|
||||||
|
├────────────────────────────────────────────────────────┤
|
||||||
|
│ 段落2 大家都饿了 │
|
||||||
|
│ · 切到Tom家客厅,孩子们灰头土脸 → User喊饿 │
|
||||||
|
│ · Mama Green 安排分工:速猴做饭 / Tony看kids │
|
||||||
|
│ · 🎯 really(复现,User"I'm really hungry") │
|
||||||
|
│ ~4个节点 │
|
||||||
|
├────────────────────────────────────────────────────────┤
|
||||||
|
│ 段落3 User主动请缨 │
|
||||||
|
│ · User+Sue 自告奋勇去kitchen帮忙 → Green欢迎 │
|
||||||
|
│ · Jay&Lin 嗤之以鼻:"Kids can't do many things" │
|
||||||
|
│ · 🎯 kitchen(首次引入)/ kids(首次引入)/ many(首次引入)│
|
||||||
|
│ · 💬 中互动①:User反驳 → "Kids can do many things!" │
|
||||||
|
│ ~6个节点 │
|
||||||
|
├────────────────────────────────────────────────────────┤
|
||||||
|
│ 进入厨房,切「承」 │
|
||||||
|
└────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### 二、知识点输入清单
|
||||||
|
|
||||||
|
| 知识点 | 引入方式 | 位置 | 台词 |
|
||||||
|
|--------|----------|------|------|
|
||||||
|
| `really` | 🔵 首次 | 段落1 | User: "It's really windy!" 🌪️ |
|
||||||
|
| `It's really...` | 🔵 首次 | 段落1 | 同上,句型首次亮相 |
|
||||||
|
| `really` | 🟢 复现 | 段落2 | User: "I'm really hungry." |
|
||||||
|
| `kitchen` | 🔵 首次 | 段落3 | User: "I want to help in the kitchen!" |
|
||||||
|
| `kid(s)` | 🔵 首次 | 段落3 | Lin: "Kids can't do many things." |
|
||||||
|
| `many` | 🔵 首次 | 段落3 | Lin: "Kids can't do many things." |
|
||||||
|
|
||||||
|
### 三、中互动
|
||||||
|
|
||||||
|
| # | 位置 | 类型 | 内容 | 覆盖知识点 |
|
||||||
|
|:---:|------|------|------|:---:|
|
||||||
|
| ① | 段落3末尾 | 口语 | User 反驳 Jay&Lin 的嘲笑 | `kids` `many` |
|
||||||
|
|
||||||
|
> **User输出:** "Kids can do many things!"
|
||||||
|
|
||||||
|
### 四、「起」段知识点小计
|
||||||
|
|
||||||
|
| | kitchen | kid(s) | many | really | It's really... | There are many... |
|
||||||
|
|---|:---:|:---:|:---:|:---:|:---:|:---:|
|
||||||
|
| 输入 | 1 | 1 | 1 | 2 | 1 | — |
|
||||||
|
| 输出 | — | 1 | 1 | — | — | — |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
「起」段节奏:三个段落,1个中互动,引入 kitchen/kids/many/really + It's really... 句型。看一下。
|
||||||
|
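user: Can you send me the beat sheet as a table? Don't write it in md format. It looks rather ugly.

assistant: Sent via Feishu message; please check. Does this format look better than md, or would you like something else?

user: It doesn't seem to work; this format isn't quite right.

assistant: Let me try a cleaner format:

assistant: Sent. Does this concise list style work for you? Or is the format you have in mind something different, for example a beat-sheet tab built in a Feishu spreadsheet?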
|
||||||
|
user: [media attached: /root/.openclaw/media/inbound/0552295c-3f56-4373-ba4a-3e754491c43c.jpg (image/jpeg) | /root/.openclaw/media/inbound/0552295c-3f56-4373-ba4a-3e754491c43c.jpg]
|
||||||
|
[Image]
|
||||||
|
User text:
|
||||||
|
[Feishu ou_859415bde7f8e07cf5f24940f299a287 Mon 2026-05-25 18:21 GMT+8] 林逸瀚: /root/.openclaw/media/inbound/0552295c-3f56-4373-ba4a-3e754491c43c.jpg
|
||||||
|
Is this format md?
|
||||||
|
Description:
|
||||||
|
This is a partial Chinese-English script snippet for *Scene 3: Oil Pot Catches Fire*, the 3rd page of 3 total pages for this scene, presented as a 3-column (line number
|
||||||
|
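assistant: Understood: the format is an Excel table, not md. I'll write the beat sheet directly into a Feishu sheet.

assistant: The beat sheet has been written into a Feishu sheet, and 林逸瀚's access has been added.

📊 https://makee-interactive.feishu.cn/sheets/Qh6es6AZWhTtmRtjTREcZBvBnRA

3-column format: `#` / `类型` / `内容`, the same as in the screenshot you sent. Does this table format look right? Anything in the content you'd like adjusted?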
|
||||||
64
memory/2026-05-25.md
Normal file
@ -0,0 +1,64 @@
|
|||||||
|
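# 2026-05-25

## U23-L1 script sync (林逸瀚)

- 林逸瀚 finished the 「转」 section script in a Feishu document; Xiaobian read the full script through the Sheets V2 API and synced it to the task log
- **Key changes (compared with Xiaobian's V1 proposal):**
  - SaSa is an alias; the dog's real name is **BlackBean**
  - The ending of 转 is changed so that Tony, once caught, is **forced to release the round balls** instead of walking off with the sack (gentler)
  - "Don't be silly" is already output by User in a mid-interaction at the hot dog stand in the 转 section, so it no longer needs to be covered in the 合 section
  - All six knowledge points are collected across 起 + 承 + 转; the 合 section has no remaining knowledge-point gaps
- **Task for the 合 section:** pure plot wrap-up (User senses something is off) + cliffhanger (Tony says into the communicator, "I won't miss next time")
- Task log and active_tasks have been updated in sync

## U23 L1 addendum

- The 「合」 section was completed on 2026-05-25; all four sections (起/承/转/合) of the lesson are done. U23-L1 is wrapped.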
|
||||||
|
|
||||||
|
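## U23-L2 script (林逸瀚) ✅ Completed

- Created: 2026-05-25 11:42 → Completed: 2026-05-25 17:49
- Final version location: Wiki sheet BVASsXSrph6ZlAtmQmFcC7LYn5b, Sheet wMQVyV
- 14 mid-interactions in the whole lesson + 1 core interaction (listening choice); all knowledge points meet the target
- **How 林逸瀚 collaborates:** he gives me the outline + knowledge points first → I produce the beat sheet → review → I produce the script → Excel → he edits directly in the Wiki sheet → I read it back and sync the log
- **Key story beats:** fruit-tasting contest / the donkey sneaks a bite of the Pineapple / Tom's red face gives away his allergy / winning the title by scoring / Fiona reveals the allergy / the J+L duo resolves to find the truth
- **Lesson learned this time:** a mid-interaction must be tied to a knowledge point. Writing "I heard that!" / "Are you okay?" as mid-interactions was a serious error, explicitly pointed out by 林逸瀚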
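
The rule in the last bullet (every mid-interaction line must contain a knowledge point) lends itself to a quick automated check before a script is sent out. The sketch below assumes the U23-L2 knowledge points as they appear in this lesson's summary table and uses naive case-insensitive substring matching; `lacks_knowledge_point` and `audit` are hypothetical helper names.

```python
# Knowledge points of U23-L2 (assumed from the lesson's summary table).
KNOWLEDGE_POINTS = ["fruit", "kiwi", "pineapple", "toilet", "went to the toilet"]


def lacks_knowledge_point(line, points=KNOWLEDGE_POINTS):
    """True if a mid-interaction line contains none of the knowledge points."""
    lowered = line.lower()
    return not any(p.lower() in lowered for p in points)


def audit(mid_interaction_lines):
    """Return the mid-interaction lines that should be flagged for review."""
    return [line for line in mid_interaction_lines if lacks_knowledge_point(line)]


lines = [
    "Tom did not eat Kiwi!",
    "Sam went to the toilet!",
    "I heard that!",
    "Are you okay?",
]
print(audit(lines))
```

Substring matching is deliberately crude: it cannot judge whether the knowledge point is used meaningfully, so a flagged or unflagged line still needs a human read.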
|
||||||
|
|
||||||
|
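## U23-L3 script writing completed (胡笳)

- Finished the first of the three closing lessons of S2: U23-L3 知识问答 (knowledge quiz)
- Deliverable: Feishu sheet https://makee-interactive.feishu.cn/sheets/Ns7jsJTsyhzYe3txVqqcawP0nQe (Sheet1 is the final version)
- Core interaction type: speaking (Sally asks questions remotely over a walkie-talkie, 3 rounds)
- Knowledge-point statistics: targets basically met; sentence pattern 1 is short on input (1 time) and sentence pattern 2 is short on output (1 time); 胡笳 is aware and will fine-tune
- Interaction statistics: 15 mid-interactions + 1 core interaction = 16 ✅
- Distilled 6 new rules into script_feedback_log.md (#21-26)
- Updated the emoji usage guidelines in SKILL.md
- Synced the worldview document to local business_knowledge/resources/worldview_L1_desert_town.md
- Updated the required reference document set in MEMORY.md (local paths first)
- ⚠️ Lesson learned: do not create new documents in the Wiki knowledge base; the Bot only reads

### Key collaboration notes

- 胡笳 tends to give a draft framework first, have Xiaobian polish and continue it, and then revise directly in the sheet herself
- She holds the lines to a very high standard of idiomatic language and reviews them sentence by sentence
- She edits directly in the Feishu sheet and then has Xiaobian read the revised version and continue
- An easy mistake: confusing valid output of a knowledge point (spoken by User in a mid-interaction) with invalid output (spoken by an NPC in a mid-interaction)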
|
||||||
|
|
||||||
|
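## U24-L2 script kickoff (林逸瀚)

- Created: 2026-05-25 18:06
- Log: `business_knowledge/task_log/U24-L2剧本.md`
- Feishu document: https://makee-interactive.feishu.cn/wiki/AwaNwj9tOiimb6kIYjZczO9MnQf
- Sheet token: TcHdsymvJhOeZWtdUdJcEvyFnEg

### Collaboration workflow specified by 林逸瀚 (section by section through 起/承/转/合)

Each Lesson advances through 4 independent sections, with this flow per section:
1. Receive outline + knowledge points → 2. Produce beat sheet (story beats + knowledge input + interactions) → 3. Iterate on 林逸瀚's feedback → 4. Produce the script once the beat sheet is final → 5. Adjust language difficulty → 6. Convert to Excel and send to 林逸瀚 → 7. 林逸瀚 revises and sends it back → 8. Move on to the next section

### Knowledge points for this lesson

- Words: kitchen / kid(s) (the singular form must be taught) / many / really
- Sentence patterns: It's really... / There are many...

### Progress

- 19:52: 「起」 section done (final version written by 林逸瀚 himself), Wiki sheet wMQVyV, 53 rows
- Xiaobian's V1 (16 rows) and V2 (35 rows) were both discarded
- **Core lesson:** Xiaobian wrote the script like a teaching flowchart, assigning lines to characters to tick off knowledge-point coverage. What 林逸瀚 writes is animation: character relationships, ensemble scenes, comedic timing, emotional layers. Knowledge points are embedded in the story, not stuck onto the lines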
|
||||||
BIN
output/U23-L1_合.xlsx
Normal file
Binary file not shown.
BIN
output/U23L5_剧本.xlsx
Normal file
Binary file not shown.
BIN
output/U23_L2_合_剧本V1.xlsx
Normal file
Binary file not shown.
BIN
output/U23_L2_承_剧本V1.xlsx
Normal file
Binary file not shown.
BIN
output/U23_L2_起_剧本V1.xlsx
Normal file
Binary file not shown.
BIN
output/U23_L2_转_剧本V1.xlsx
Normal file
Binary file not shown.
BIN
output/U24-L2_起_剧本.xlsx
Normal file
Binary file not shown.
BIN
output/U24-L2_起_剧本_v2.xlsx
Normal file
Binary file not shown.
@ -1928,3 +1928,13 @@ To https://git.valavala.com/ai_member_only/ai_member_xiaobian.git
From https://git.valavala.com/ai_member_only/ai_member_xiaobian
* branch master -> FETCH_HEAD
Already up to date.
[master 7e2bef0] auto backup 2026-05-25 08:10:01
2 files changed, 11 insertions(+), 1 deletion(-)
remote: . Processing 1 references
remote: Processed 1 references in total
To https://git.valavala.com/ai_member_only/ai_member_xiaobian.git
66ffe6d..7e2bef0 master -> master
✅ 备份完成:Mon May 25 08:10:02 AM CST 2026
From https://git.valavala.com/ai_member_only/ai_member_xiaobian
* branch master -> FETCH_HEAD
Already up to date.
@ -69,6 +69,7 @@ User is the character the player steps into and must be the **leader and driver** of the plot:
1. Write out action details, changes in relative position, and environmental reactions clearly, so the production team can realize the visuals from them
2. Make active use of production annotations: 【场景】【角色】【emoji】【插入图】【全屏插入图】【spine动画】【镜头】【画外音】【音效】【音乐/儿歌】【转场】
3. Mark changes in a character's emotion with a description or an emoji annotation
4. **【emoji】 must strictly use names listed in `business_knowledge/resources/emoji_catalog.md`**; Unicode emoji symbols (such as 😊😂🎉) are prohibited. Correct examples: `【emoji】User感叹号`, `【emoji】Ben冒汗`, `【emoji】Tom流汗`. When there is a subtype, add it in parentheses: `【emoji】User发抖(大)`, `【emoji】May问号x2`

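The ban on Unicode emoji symbols in rule 4 is the kind of constraint a small lint pass could check. The following is only a sketch under stated assumptions: `find_unicode_emoji` is a hypothetical helper, and the code-point ranges are an approximation of "emoji", not an official definition. It does not validate names against `emoji_catalog.md`, whose contents are not shown here.

```python
# Approximate emoji code-point ranges (assumption): pictographs, emoticons,
# transport and supplemental symbols, plus the older Misc Symbols/Dingbats blocks.
_EMOJI_RANGES = ((0x1F300, 0x1FAFF), (0x2600, 0x27BF))


def find_unicode_emoji(line: str) -> list[str]:
    """Return the characters in `line` that fall inside the emoji ranges."""
    return [
        ch for ch in line
        if any(low <= ord(ch) <= high for low, high in _EMOJI_RANGES)
    ]


if __name__ == "__main__":
    print(find_unicode_emoji("【emoji】User感叹号"))  # catalog-style name
    print(find_unicode_emoji("Ben 😊🎉"))  # raw Unicode emoji
```

A script line passes this check when the returned list is empty; any returned characters would need to be replaced with a catalog name.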
### Scene and pacing guidelines
1. Prefer switching viewpoint or camera over switching scenes (wristband communication, phone calls, dialogue)
6
skills/office-document-specialist-suite/.venv/bin/normalizer
Executable file
@ -0,0 +1,6 @@
#!/root/.openclaw/workspace-xiaobian/skills/office-document-specialist-suite/.venv/bin/python
import sys
from charset_normalizer.cli import cli_detect
if __name__ == '__main__':
    sys.argv[0] = sys.argv[0].removesuffix('.exe')
    sys.exit(cli_detect())
Binary file not shown.
@ -0,0 +1 @@
|
|||||||
|
pip
|
||||||
@ -0,0 +1,78 @@
|
|||||||
|
Metadata-Version: 2.4
|
||||||
|
Name: certifi
|
||||||
|
Version: 2026.4.22
|
||||||
|
Summary: Python package for providing Mozilla's CA Bundle.
|
||||||
|
Home-page: https://github.com/certifi/python-certifi
|
||||||
|
Author: Kenneth Reitz
|
||||||
|
Author-email: me@kennethreitz.com
|
||||||
|
License: MPL-2.0
|
||||||
|
Project-URL: Source, https://github.com/certifi/python-certifi
|
||||||
|
Classifier: Development Status :: 5 - Production/Stable
|
||||||
|
Classifier: Intended Audience :: Developers
|
||||||
|
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
|
||||||
|
Classifier: Natural Language :: English
|
||||||
|
Classifier: Programming Language :: Python
|
||||||
|
Classifier: Programming Language :: Python :: 3
|
||||||
|
Classifier: Programming Language :: Python :: 3 :: Only
|
||||||
|
Classifier: Programming Language :: Python :: 3.7
|
||||||
|
Classifier: Programming Language :: Python :: 3.8
|
||||||
|
Classifier: Programming Language :: Python :: 3.9
|
||||||
|
Classifier: Programming Language :: Python :: 3.10
|
||||||
|
Classifier: Programming Language :: Python :: 3.11
|
||||||
|
Classifier: Programming Language :: Python :: 3.12
|
||||||
|
Classifier: Programming Language :: Python :: 3.13
|
||||||
|
Classifier: Programming Language :: Python :: 3.14
|
||||||
|
Requires-Python: >=3.7
|
||||||
|
License-File: LICENSE
|
||||||
|
Dynamic: author
|
||||||
|
Dynamic: author-email
|
||||||
|
Dynamic: classifier
|
||||||
|
Dynamic: description
|
||||||
|
Dynamic: home-page
|
||||||
|
Dynamic: license
|
||||||
|
Dynamic: license-file
|
||||||
|
Dynamic: project-url
|
||||||
|
Dynamic: requires-python
|
||||||
|
Dynamic: summary
|
||||||
|
|
||||||
Certifi: Python SSL Certificates
================================

Certifi provides Mozilla's carefully curated collection of Root Certificates for
validating the trustworthiness of SSL certificates while verifying the identity
of TLS hosts. It has been extracted from the `Requests`_ project.

Installation
------------

``certifi`` is available on PyPI. Simply install it with ``pip``::

    $ pip install certifi

Usage
-----

To reference the installed certificate authority (CA) bundle, you can use the
built-in function::

    >>> import certifi

    >>> certifi.where()
    '/usr/local/lib/python3.7/site-packages/certifi/cacert.pem'

Or from the command line::

    $ python -m certifi
    /usr/local/lib/python3.7/site-packages/certifi/cacert.pem

Enjoy!

.. _`Requests`: https://requests.readthedocs.io/en/master/

Addition/Removal of Certificates
--------------------------------

Certifi does not support any addition/removal or other modification of the
CA trust store content. This project is intended to provide a reliable and
highly portable root of trust to python deployments. Look to upstream projects
for methods to use alternate trust.
@ -0,0 +1,14 @@
certifi-2026.4.22.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
certifi-2026.4.22.dist-info/METADATA,sha256=Ttb_6cy-0FBGq3v_ICTNPKcXKVs2i-0f4gpz42W8QVo,2474
certifi-2026.4.22.dist-info/RECORD,,
certifi-2026.4.22.dist-info/WHEEL,sha256=aeYiig01lYGDzBgS8HxWXOg3uV61G9ijOsup-k9o1sk,91
certifi-2026.4.22.dist-info/licenses/LICENSE,sha256=6TcW2mucDVpKHfYP5pWzcPBpVgPSH2-D8FPkLPwQyvc,989
certifi-2026.4.22.dist-info/top_level.txt,sha256=KMu4vUCfsjLrkPbSNdgdekS-pVJzBAJFO__nI8NF6-U,8
certifi/__init__.py,sha256=iBJYWtxRGHMaO1S73pWYpFn962WQ4ldof2GkfMSREJM,94
certifi/__main__.py,sha256=xBBoj905TUWBLRGANOcf7oi6e-3dMP4cEoG9OyMs11g,243
certifi/__pycache__/__init__.cpython-312.pyc,,
certifi/__pycache__/__main__.cpython-312.pyc,,
certifi/__pycache__/core.cpython-312.pyc,,
certifi/cacert.pem,sha256=Fr4_b-sVQIGV3P46oade-dty9ka5brvv3Gj1YlX3mfg,238641
certifi/core.py,sha256=XFXycndG5pf37ayeF8N32HUuDafsyhkVMbO4BAPWHa0,3394
certifi/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
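Each RECORD row has the shape `path,sha256=<digest>,size`. In the installed-distribution metadata format, the digest is the URL-safe Base64 encoding of the SHA-256 hash with the trailing `=` padding removed. A short sketch of how such a row could be computed follows; `record_entry` is a hypothetical helper name used for illustration, not a function from pip or certifi.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build a RECORD-style row: path, unpadded urlsafe-base64 SHA-256, size."""
    digest = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


if __name__ == "__main__":
    # An empty file, like the zero-byte certifi/py.typed marker.
    print(record_entry("certifi/py.typed", b""))
```

For an empty payload this reproduces the `certifi/py.typed` row of the RECORD file, which is a quick way to sanity-check the encoding.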
@ -0,0 +1,5 @@
|
|||||||
|
Wheel-Version: 1.0
|
||||||
|
Generator: setuptools (82.0.1)
|
||||||
|
Root-Is-Purelib: true
|
||||||
|
Tag: py3-none-any
|
||||||
|
|
||||||
@ -0,0 +1,20 @@
|
|||||||
|
This package contains a modified version of ca-bundle.crt:
|
||||||
|
|
||||||
|
ca-bundle.crt -- Bundle of CA Root Certificates
|
||||||
|
|
||||||
|
This is a bundle of X.509 certificates of public Certificate Authorities
|
||||||
|
(CA). These were automatically extracted from Mozilla's root certificates
|
||||||
|
file (certdata.txt). This file can be found in the mozilla source tree:
|
||||||
|
https://hg.mozilla.org/mozilla-central/file/tip/security/nss/lib/ckfw/builtins/certdata.txt
|
||||||
|
It contains the certificates in PEM format and therefore
|
||||||
|
can be directly used with curl / libcurl / php_curl, or with
|
||||||
|
an Apache+mod_ssl webserver for SSL client authentication.
|
||||||
|
Just configure this file as the SSLCACertificateFile.#
|
||||||
|
|
||||||
|
***** BEGIN LICENSE BLOCK *****
|
||||||
|
This Source Code Form is subject to the terms of the Mozilla Public License,
|
||||||
|
v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain
|
||||||
|
one at http://mozilla.org/MPL/2.0/.
|
||||||
|
|
||||||
|
***** END LICENSE BLOCK *****
|
||||||
|
@(#) $RCSfile: certdata.txt,v $ $Revision: 1.80 $ $Date: 2011/11/03 15:11:58 $
|
||||||
@ -0,0 +1 @@
|
|||||||
|
certifi
|
||||||
@ -0,0 +1,4 @@
|
|||||||
|
from .core import contents, where
|
||||||
|
|
||||||
|
__all__ = ["contents", "where"]
|
||||||
|
__version__ = "2026.04.22"
|
||||||
@ -0,0 +1,12 @@
import argparse

from certifi import contents, where

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--contents", action="store_true")
args = parser.parse_args()

if args.contents:
    print(contents())
else:
    print(where())
Binary file not shown.
Binary file not shown.
Binary file not shown.
File diff suppressed because it is too large
@ -0,0 +1,83 @@
"""
certifi.py
~~~~~~~~~~

This module returns the installation location of cacert.pem or its contents.
"""
import sys
import atexit

def exit_cacert_ctx() -> None:
    _CACERT_CTX.__exit__(None, None, None)  # type: ignore[union-attr]


if sys.version_info >= (3, 11):

    from importlib.resources import as_file, files

    _CACERT_CTX = None
    _CACERT_PATH = None

    def where() -> str:
        # This is slightly terrible, but we want to delay extracting the file
        # in cases where we're inside of a zipimport situation until someone
        # actually calls where(), but we don't want to re-extract the file
        # on every call of where(), so we'll do it once then store it in a
        # global variable.
        global _CACERT_CTX
        global _CACERT_PATH
        if _CACERT_PATH is None:
            # This is slightly janky, the importlib.resources API wants you to
            # manage the cleanup of this file, so it doesn't actually return a
            # path, it returns a context manager that will give you the path
            # when you enter it and will do any cleanup when you leave it. In
            # the common case of not needing a temporary file, it will just
            # return the file system location and the __exit__() is a no-op.
            #
            # We also have to hold onto the actual context manager, because
            # it will do the cleanup whenever it gets garbage collected, so
            # we will also store that at the global level as well.
            _CACERT_CTX = as_file(files("certifi").joinpath("cacert.pem"))
            _CACERT_PATH = str(_CACERT_CTX.__enter__())
            atexit.register(exit_cacert_ctx)

        return _CACERT_PATH

    def contents() -> str:
        return files("certifi").joinpath("cacert.pem").read_text(encoding="ascii")

else:

    from importlib.resources import path as get_path, read_text

    _CACERT_CTX = None
    _CACERT_PATH = None

    def where() -> str:
        # This is slightly terrible, but we want to delay extracting the
        # file in cases where we're inside of a zipimport situation until
        # someone actually calls where(), but we don't want to re-extract
        # the file on every call of where(), so we'll do it once then store
        # it in a global variable.
        global _CACERT_CTX
        global _CACERT_PATH
        if _CACERT_PATH is None:
            # This is slightly janky, the importlib.resources API wants you
            # to manage the cleanup of this file, so it doesn't actually
            # return a path, it returns a context manager that will give
            # you the path when you enter it and will do any cleanup when
            # you leave it. In the common case of not needing a temporary
            # file, it will just return the file system location and the
            # __exit__() is a no-op.
            #
            # We also have to hold onto the actual context manager, because
            # it will do the cleanup whenever it gets garbage collected, so
            # we will also store that at the global level as well.
            _CACERT_CTX = get_path("certifi", "cacert.pem")
            _CACERT_PATH = str(_CACERT_CTX.__enter__())
            atexit.register(exit_cacert_ctx)

        return _CACERT_PATH

    def contents() -> str:
        return read_text("certifi", "cacert.pem", encoding="ascii")
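The comments in `where()` describe a general pattern: enter a resource context manager once on first use, cache the resulting path in a module-level global, and register the cleanup with `atexit`. The sketch below illustrates that pattern in isolation, under assumptions: it uses the standard library's `json/__init__.py` as a stand-in resource so that it runs without certifi installed, and `resource_path` and `_release` are hypothetical names, not part of certifi.

```python
import atexit
import os
from importlib.resources import as_file, files

# Module-level cache: the context manager must stay referenced so that any
# temporary extraction is not cleaned up while the path is still in use.
_CTX = None
_PATH = None


def _release() -> None:
    # Leave the context manager at interpreter exit (a no-op for plain files).
    if _CTX is not None:
        _CTX.__exit__(None, None, None)


def resource_path() -> str:
    """Return a filesystem path for the resource, resolving it only once."""
    global _CTX, _PATH
    if _PATH is None:
        # Stand-in resource (assumption): json/__init__.py from the stdlib.
        _CTX = as_file(files("json").joinpath("__init__.py"))
        _PATH = str(_CTX.__enter__())
        atexit.register(_release)
    return _PATH


if __name__ == "__main__":
    first = resource_path()
    second = resource_path()
    print(first, first == second, os.path.exists(first))
```

The second call returns the cached string without touching `importlib.resources` again, which is the behavior the original comments are aiming for.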
@ -0,0 +1 @@
|
|||||||
|
pip
|
||||||
@ -0,0 +1,808 @@
|
|||||||
|
Metadata-Version: 2.4
|
||||||
|
Name: charset-normalizer
|
||||||
|
Version: 3.4.7
|
||||||
|
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
|
||||||
|
Author-email: "Ahmed R. TAHRI" <tahri.ahmed@proton.me>
|
||||||
|
Maintainer-email: "Ahmed R. TAHRI" <tahri.ahmed@proton.me>
|
||||||
|
License: MIT
|
||||||
|
Project-URL: Changelog, https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md
|
||||||
|
Project-URL: Documentation, https://charset-normalizer.readthedocs.io/
|
||||||
|
Project-URL: Code, https://github.com/jawah/charset_normalizer
|
||||||
|
Project-URL: Issue tracker, https://github.com/jawah/charset_normalizer/issues
|
||||||
|
Keywords: encoding,charset,charset-detector,detector,normalization,unicode,chardet,detect
|
||||||
|
Classifier: Development Status :: 5 - Production/Stable
|
||||||
|
Classifier: Intended Audience :: Developers
|
||||||
|
Classifier: Operating System :: OS Independent
|
||||||
|
Classifier: Programming Language :: Python
|
||||||
|
Classifier: Programming Language :: Python :: 3
|
||||||
|
Classifier: Programming Language :: Python :: 3.7
|
||||||
|
Classifier: Programming Language :: Python :: 3.8
|
||||||
|
Classifier: Programming Language :: Python :: 3.9
|
||||||
|
Classifier: Programming Language :: Python :: 3.10
|
||||||
|
Classifier: Programming Language :: Python :: 3.11
|
||||||
|
Classifier: Programming Language :: Python :: 3.12
|
||||||
|
Classifier: Programming Language :: Python :: 3.13
|
||||||
|
Classifier: Programming Language :: Python :: 3.14
|
||||||
|
Classifier: Programming Language :: Python :: 3 :: Only
|
||||||
|
Classifier: Programming Language :: Python :: Implementation :: CPython
|
||||||
|
Classifier: Programming Language :: Python :: Implementation :: PyPy
|
||||||
|
Classifier: Programming Language :: Python :: Free Threading :: 4 - Resilient
|
||||||
|
Classifier: Topic :: Text Processing :: Linguistic
|
||||||
|
Classifier: Topic :: Utilities
|
||||||
|
Classifier: Typing :: Typed
|
||||||
|
Requires-Python: >=3.7
|
||||||
|
Description-Content-Type: text/markdown
|
||||||
|
License-File: LICENSE
|
||||||
|
Provides-Extra: unicode-backport
|
||||||
|
Dynamic: license-file
|
||||||
|
|
||||||
|
<h1 align="center">Charset Detection, for Everyone 👋</h1>
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
<sup>The Real First Universal Charset Detector</sup><br>
|
||||||
|
<a href="https://pypi.org/project/charset-normalizer">
|
||||||
|
<img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
|
||||||
|
</a>
|
||||||
|
<a href="https://pepy.tech/project/charset-normalizer/">
|
||||||
|
<img alt="Download Count Total" src="https://static.pepy.tech/badge/charset-normalizer/month" />
|
||||||
|
</a>
|
||||||
|
<a href="https://bestpractices.coreinfrastructure.org/projects/7297">
|
||||||
|
<img src="https://bestpractices.coreinfrastructure.org/projects/7297/badge">
|
||||||
|
</a>
|
||||||
|
</p>
|
||||||
|
<p align="center">
|
||||||
|
<sup><i>Featured Packages</i></sup><br>
|
||||||
|
<a href="https://github.com/jawah/niquests">
|
||||||
|
<img alt="Static Badge" src="https://img.shields.io/badge/Niquests-Most_Advanced_HTTP_Client-cyan">
|
||||||
|
</a>
|
||||||
|
<a href="https://github.com/jawah/wassima">
|
||||||
|
<img alt="Static Badge" src="https://img.shields.io/badge/Wassima-Certifi_Replacement-cyan">
|
||||||
|
</a>
|
||||||
|
</p>
|
||||||
|
<p align="center">
|
||||||
|
<sup><i>In other languages (unofficial port by the community)</i></sup><br>
|
||||||
|
<a href="https://github.com/nickspring/charset-normalizer-rs">
|
||||||
|
<img alt="Static Badge" src="https://img.shields.io/badge/Rust-red">
|
||||||
|
</a>
|
||||||
|
</p>
|
||||||
|
|
||||||
|
> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
|
||||||
|
> I'm trying to resolve the issue by taking a new approach.
|
||||||
|
> All IANA character set names for which the Python core library provides codecs are supported.
|
||||||
|
> You can also register your own set of codecs, and yes, it would work as-is.
|
||||||
|
|
||||||
|
<p align="center">
|
||||||
|
>>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
|
||||||
|
</p>
|
||||||
|
|
||||||
|
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.
|
||||||
|
|
||||||
| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
|--------------------------------------------------|:---------------------------------------------:|:-----------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast` | ✅ | ✅ | ✅ |
| `Universal`[^1] | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ✅ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `License` | _Disputed_[^2]<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ✅ | ✅ | N/A |
| `UnicodeDecodeError Safety` | ✅ | ✅ | ❌ |
| `Whl Size (min)` | 500 kB | 150 kB | ~200 kB |
| `Supported Encoding` | 99 | [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |
| `Can register custom encoding` | ❌ | ✅ | ❌ |
|
|
||||||
|
<p align="center">
|
||||||
|
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
|
||||||
|
</p>
|
||||||
|
|
||||||
|
[^1]: They clearly use encoding-specific code for each supported encoding, even if that covers most of the commonly used ones.
|
||||||
|
[^2]: Chardet 7.0+ was relicensed from LGPL-2.1 to MIT following an AI-assisted rewrite. This relicensing is disputed on two independent grounds: **(a)** the original author [contests](https://github.com/chardet/chardet/issues/327) that the maintainer had the right to relicense, arguing the rewrite is a derivative work of the LGPL-licensed codebase since it was not a clean room implementation; **(b)** the copyright claim itself is [questionable](https://github.com/chardet/chardet/issues/334) given the code was primarily generated by an LLM, and AI-generated output may not be copyrightable under most jurisdictions. Either issue alone could undermine the MIT license. Beyond licensing, the rewrite raises questions about responsible use of AI in open source: key architectural ideas pioneered by charset-normalizer - notably decode-first validity filtering (our foundational approach since v1) and encoding pairwise similarity with the same algorithm and threshold — surfaced in chardet 7 without acknowledgment. The project also imported test files from charset-normalizer to train and benchmark against it, then claimed superior accuracy on those very files. Charset-normalizer has always been MIT-licensed, encoding-agnostic by design, and built on a verifiable human-authored history.
|
||||||
|
|
||||||
|
## ⚡ Performance
|
||||||
|
|
||||||
|
This package offers better performance (at the 99th and 95th percentiles) than Chardet. Here are some numbers.
|
||||||
|
|
||||||
| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|---------------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet 7.1](https://github.com/chardet/chardet) | 89 % | 3 ms | 333 file/sec |
| charset-normalizer | **97 %** | 3 ms | 333 file/sec |
|
|
||||||
| Package | 99th percentile | 95th percentile | 50th percentile |
|---------------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet 7.1](https://github.com/chardet/chardet) | 32 ms | 17 ms | < 1 ms |
| charset-normalizer | 16 ms | 10 ms | 1 ms |
|
|
||||||
|
_updated as of March 2026 using CPython 3.12, Charset-Normalizer 3.4.6, and Chardet 7.1.0_
|
||||||
|
|
||||||
|
~Chardet's performance on larger files (1MB+) is very poor. Expect a huge difference on large payloads.~ No longer the case since Chardet 7.0+.
|
||||||
|
|
||||||
> Stats are generated from 400+ files using default parameters. For more details on the files used, see the GHA workflows.
> And yes, these results might change at any time. The dataset can be updated to include more files.
> The actual delays depend heavily on your CPU capabilities. The factors should remain the same.
> Chardet claims in its documentation to have greater accuracy than us, based on the dataset they trained Chardet on (...)
> Well, that is normal; the opposite would have been worrying. Whereas charset-normalizer doesn't train on anything, our solution
> is based on a completely different algorithm, still heuristic though, and it does not need weights across every encoding table.
|
|
||||||
|
## ✨ Installation
|
||||||
|
|
||||||
|
Using pip:
|
||||||
|
|
||||||
|
```sh
|
||||||
|
pip install charset-normalizer -U
|
||||||
|
```
|
||||||
|
|
||||||
|
## 🚀 Basic Usage
|
||||||
|
|
||||||
|
### CLI
|
||||||
|
This package comes with a CLI.
|
||||||
|
|
||||||
```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
```
|
|
||||||
|
```bash
|
||||||
|
normalizer ./data/sample.1.fr.srt
|
||||||
|
```
|
||||||
|
|
||||||
|
or
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m charset_normalizer ./data/sample.1.fr.srt
|
||||||
|
```
|
||||||
|
|
||||||
|
🎉 Since version 1.4.0, the CLI produces an easily usable stdout result in JSON format.
|
||||||
|
|
||||||
```json
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
```
|
|
||||||
|
### Python
|
||||||
|
*Just print out normalized text*
|
||||||
|
```python
|
||||||
|
from charset_normalizer import from_path
|
||||||
|
|
||||||
|
results = from_path('./my_subtitle.srt')
|
||||||
|
|
||||||
|
print(str(results.best()))
|
||||||
|
```
|
||||||
|
|
||||||
|
*Upgrade your code without effort*
|
||||||
|
```python
|
||||||
|
from charset_normalizer import detect
|
||||||
|
```
|
||||||
|
|
||||||
|
The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) backward-compatible result possible.
|
||||||
|
|
||||||
|
See the docs for advanced usage : [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)
|
||||||
|
|
||||||
|
## 😇 Why
|
||||||
|
|
||||||
|
When I started using Chardet, I noticed that it was not suited to my expectations, and I wanted to propose a
|
||||||
|
reliable alternative using a completely different method. Also! I never back down on a good challenge!
|
||||||
|
|
||||||
|
I **don't care** about the **originating charset** encoding, because **two different tables** can
|
||||||
|
produce **two identical rendered strings.**
|
||||||
|
What I want is to get readable text, the best I can.
|
||||||
|
|
||||||
|
In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
|
||||||
|
|
||||||
|
Don't confuse the **ftfy** package with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer's goal is to convert a raw file in an unknown encoding to Unicode.
|
||||||
|
|
||||||
|
## 🍰 How
|
||||||
|
|
||||||
|
- Discard all charset encoding table that could not fit the binary content.
|
||||||
|
- Measure noise, or the mess once opened (by chunks) with a corresponding charset encoding.
|
||||||
|
- Extract matches with the lowest mess detected.
|
||||||
|
- Additionally, we measure coherence / probe for a language.
|
||||||
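The steps listed above (discard encodings that cannot decode the bytes, measure the noise of each remaining candidate, keep the one with the least noise) can be illustrated with a toy sketch. This is not charset-normalizer's actual mess detector: the noise heuristic (the share of control and symbol characters), the candidate list, and the names `noise_ratio` and `guess_encoding` are all assumptions made for illustration, and the coherence step is omitted.

```python
import unicodedata


def noise_ratio(text: str) -> float:
    """Crude noise score: share of non-whitespace control or symbol characters."""
    if not text:
        return 0.0
    suspicious = 0
    for ch in text:
        if ch.isspace():
            continue
        # Unicode general categories starting with C (control, unassigned, ...)
        # or S (symbols) are treated as noise by this toy heuristic.
        if unicodedata.category(ch)[0] in ("C", "S"):
            suspicious += 1
    return suspicious / len(text)


def guess_encoding(payload: bytes, candidates=("ascii", "utf_8", "cp1252", "latin_1")):
    """Return the candidate whose decoded text has the lowest noise, or None."""
    scored = []
    for name in candidates:
        try:
            text = payload.decode(name)
        except UnicodeDecodeError:
            continue  # step 1: this table cannot fit the binary content
        scored.append((noise_ratio(text), name))  # step 2: measure the mess
    if not scored:
        return None
    # step 3: keep the least noisy candidate (ties go to the earliest one)
    return min(scored, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    print(guess_encoding("café".encode("utf_8")))
```

With the UTF-8 bytes of "café", `ascii` is discarded because it cannot decode them, while `cp1252` and `latin_1` decode them into text containing a stray `©` symbol, so `utf_8` ends up with the lowest score.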
|
|
||||||
|
**Wait a minute**, what is noise/mess and coherence according to **YOU ?**
|
||||||
|
|
||||||
|
*Noise :* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
|
||||||
|
**I established** some ground rules about **what is obvious** when **it seems like** a mess (aka. defining noise in rendered text).
|
||||||
|
I know that my interpretation of what is noise is probably incomplete, feel free to contribute in order to
|
||||||
|
improve or rewrite it.
|
||||||
|
|
||||||
|
*Coherence :* For each language there is on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought
|
||||||
|
that intel is worth something here. So I use those records against decoded text to check if I can detect intelligent design.
|
||||||
|
|
||||||
|
## ⚡ Known limitations
|
||||||
|
|
||||||
|
- Language detection is unreliable when text contains two or more languages sharing identical letters (e.g. HTML with English tags plus Turkish content, both using Latin characters).
|
||||||
|
- Every charset detector depends heavily on sufficient content. In common cases, do not bother running detection on very tiny content.
|
||||||
|
|
||||||
|
## ⚠️ About Python EOLs
|
||||||
|
|
||||||
|
**If you are running:**
|
||||||
|
|
||||||
|
- Python >=2.7,<3.5: Unsupported
|
||||||
|
- Python 3.5: charset-normalizer < 2.1
|
||||||
|
- Python 3.6: charset-normalizer < 3.1
|
||||||
|
|
||||||
|
Upgrade your Python interpreter as soon as possible.
|
||||||
|
|
||||||
|
## 👤 Contributing
|
||||||
|
|
||||||
|
Contributions, issues and feature requests are very much welcome.<br />
|
||||||
|
Feel free to check [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.
|
||||||
|
|
||||||
|
## 📝 License
|
||||||
|
|
||||||
|
Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
|
||||||
|
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.
|
||||||
|
|
||||||
|
Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)
|
||||||
|
|
||||||
|
## 💼 For Enterprise
|
||||||
|
|
||||||
|
Professional support for charset-normalizer is available as part of the [Tidelift
|
||||||
|
Subscription][1]. Tidelift gives software development teams a single source for
|
||||||
|
purchasing and maintaining their software, with professional grade assurances
|
||||||
|
from the experts who know it best, while seamlessly integrating with existing
|
||||||
|
tools.
|
||||||
|
|
||||||
|
[1]: https://tidelift.com/subscription/pkg/pypi-charset-normalizer?utm_source=pypi-charset-normalizer&utm_medium=readme
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# Changelog
|
||||||
|
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
|
||||||
|
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
|
||||||
|
|
||||||
|
## [3.4.7](https://github.com/Ousret/charset_normalizer/compare/3.4.6...3.4.7) (2026-04-02)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Pre-built optimized version using mypy[c] v1.20.
|
||||||
|
- Relax `setuptools` constraint to `setuptools>=68,<82.1`.
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Correctly remove SIG remnant in utf-7 decoded string. (#718) (#716)
|
||||||
|
|
||||||
|
## [3.4.6](https://github.com/Ousret/charset_normalizer/compare/3.4.5...3.4.6) (2026-03-15)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Flattened the logic in `charset_normalizer.md` for higher performance. Removed `eligible(..)` and `feed(...)`
|
||||||
|
in favor of `feed_info(...)`.
|
||||||
|
- Raised upper bound for mypy[c] to 1.20, for our optimized version.
|
||||||
|
- Updated `UNICODE_RANGES_COMBINED` using Unicode blocks v17.
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Edge case where noise difference between two candidates can be almost insignificant. (#672)
|
||||||
|
- CLI `--normalize` writing to wrong path when passing multiple files in. (#702)
|
||||||
|
|
||||||
|
### Misc
|
||||||
|
- Freethreaded pre-built wheels now shipped in PyPI starting with 3.14t. (#616)
|
||||||
|
|
||||||
|
## [3.4.5](https://github.com/Ousret/charset_normalizer/compare/3.4.4...3.4.5) (2026-03-06)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Update `setuptools` constraint to `setuptools>=68,<=82`.
|
||||||
|
- Raised upper bound of mypyc for the optional pre-built extension to v1.19.1
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Add explicit link to lib math in our optimized build. (#692)
|
||||||
|
- Logger level not restored correctly for empty byte sequences. (#701)
|
||||||
|
- TypeError when passing bytearray to from_bytes. (#703)
|
||||||
|
|
||||||
|
### Misc
|
||||||
|
- Applied safe micro-optimizations in both our noise detector and language detector.
|
||||||
|
- Rewrote the `query_yes_no` function (inside CLI) to avoid using ambiguous licensed code.
|
||||||
|
- Added `cd.py` submodule into mypyc optional compilation to reduce further the performance impact.
|
||||||
|
|
||||||
|
## [3.4.4](https://github.com/Ousret/charset_normalizer/compare/3.4.2...3.4.4) (2025-10-13)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Bound `setuptools` to a specific constraint `setuptools>=68,<=81`.
|
||||||
|
- Raised upper bound of mypyc for the optional pre-built extension to v1.18.2
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- `setuptools-scm` as a build dependency.
|
||||||
|
|
||||||
|
### Misc
|
||||||
|
- Enforced hashes in `dev-requirements.txt` and created `ci-requirements.txt` for security purposes.
|
||||||
|
- Additional pre-built wheels for riscv64, s390x, and armv7l architectures.
|
||||||
|
- Restore `multiple.intoto.jsonl` in GitHub releases in addition to the individual attestation file per wheel.
|
||||||
|
|
||||||
|
## [3.4.3](https://github.com/Ousret/charset_normalizer/compare/3.4.2...3.4.3) (2025-08-09)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- mypy(c) is no longer a required dependency at build time if `CHARSET_NORMALIZER_USE_MYPYC` isn't set to `1`. (#595) (#583)
|
||||||
|
- automatically lower confidence on small bytes samples that are not Unicode in `detect` output legacy function. (#391)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Custom build backend to overcome inability to mark mypy as an optional dependency in the build phase.
|
||||||
|
- Support for Python 3.14
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- sdist archive contained useless directories.
|
||||||
|
- automatically fallback on valid UTF-16 or UTF-32 even if the md says it's noisy. (#633)
|
||||||
|
|
||||||
|
### Misc
|
||||||
|
- SBOMs are automatically published to the relevant GitHub release to comply with regulatory changes.
|
||||||
|
Each published wheel comes with its SBOM. We choose CycloneDX as the format.
|
||||||
|
- Prebuilt optimized wheels are no longer distributed by default for CPython 3.7 due to a change in cibuildwheel.
|
||||||
|
|
||||||
|
## [3.4.2](https://github.com/Ousret/charset_normalizer/compare/3.4.1...3.4.2) (2025-05-02)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Addressed the DeprecationWarning in our CLI regarding `argparse.FileType` by backporting the target class into the package. (#591)
|
||||||
|
- Improved the overall reliability of the detector with CJK Ideographs. (#605) (#587)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Optional mypyc compilation upgraded to version 1.15 for Python >= 3.8
|
||||||
|
|
||||||
|
## [3.4.1](https://github.com/Ousret/charset_normalizer/compare/3.4.0...3.4.1) (2024-12-24)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Project metadata are now stored using `pyproject.toml` instead of `setup.cfg` using setuptools as the build backend.
|
||||||
|
- Enforce delayed loading of annotations for simpler and more consistent types in the project.
|
||||||
|
- Optional mypyc compilation upgraded to version 1.14 for Python >= 3.8
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- pre-commit configuration.
|
||||||
|
- noxfile.
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- `build-requirements.txt` as per using `pyproject.toml` native build configuration.
|
||||||
|
- `bin/integration.py` and `bin/serve.py` in favor of downstream integration test (see noxfile).
|
||||||
|
- `setup.cfg` in favor of `pyproject.toml` metadata configuration.
|
||||||
|
- Unused `utils.range_scan` function.
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Converting content to Unicode bytes may insert `utf_8` instead of preferred `utf-8`. (#572)
|
||||||
|
- Deprecation warning "'count' is passed as positional argument" when converting to Unicode bytes on Python 3.13+
|
||||||
|
|
||||||
|
## [3.4.0](https://github.com/Ousret/charset_normalizer/compare/3.3.2...3.4.0) (2024-10-08)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Argument `--no-preemptive` in the CLI to prevent the detector from searching for hints.
|
||||||
|
- Support for Python 3.13 (#512)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything other than a CharsetMatch.
|
||||||
|
- Improved the general reliability of the detector based on user feedback. (#520) (#509) (#498) (#407) (#537)
|
||||||
|
- Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. (#381)
|
||||||
|
|
||||||
|
## [3.3.2](https://github.com/Ousret/charset_normalizer/compare/3.3.1...3.3.2) (2023-10-31)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Unintentional memory usage regression when using a large payload that matches several encodings (#376)
|
||||||
|
- Regression on some detection cases showcased in the documentation (#371)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife)
|
||||||
|
|
||||||
|
## [3.3.1](https://github.com/Ousret/charset_normalizer/compare/3.3.0...3.3.1) (2023-10-22)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
|
||||||
|
- Improved the general detection reliability based on reports from the community
|
||||||
|
|
||||||
|
## [3.3.0](https://github.com/Ousret/charset_normalizer/compare/3.2.0...3.3.0) (2023-09-30)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Allow to execute the CLI (e.g. normalizer) through `python -m charset_normalizer.cli` or `python -m charset_normalizer`
|
||||||
|
- Support for 9 forgotten encodings that are supported by Python but unlisted in `encodings.aliases` as they have no alias (#323)
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only
|
||||||
|
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
|
||||||
|
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in \_\_lt\_\_ (#350)
|
||||||
|
|
||||||
|
## [3.2.0](https://github.com/Ousret/charset_normalizer/compare/3.1.0...3.2.0) (2023-06-07)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Typehint for function `from_path` no longer enforces `PathLike` as its first argument
|
||||||
|
- Minor improvement over the global detection reliability
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Introduce function `is_binary` that relies on main capabilities, and optimized to detect binaries
|
||||||
|
- Propagate `enable_fallback` argument throughout `from_bytes`, `from_path`, and `from_fp`, allowing deeper control over the detection (default True)
|
||||||
|
- Explicit support for Python 3.12
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Edge case detection failure where a file would contain 'very-long' camel cased word (Issue #289)
|
||||||
|
|
||||||
|
## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #262)
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- Support for Python 3.6 (PR #260)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Optional speedup provided by mypy/c 1.0.1
|
||||||
|
|
||||||
|
## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Speedup provided by mypy/c 0.990 on Python >= 3.7
|
||||||
|
|
||||||
|
## [3.0.0](https://github.com/Ousret/charset_normalizer/compare/2.1.1...3.0.0) (2022-10-20)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Extend the capability of explain=True: when cp_isolation contains one or two entries, the Mess-detector results are logged in detail
|
||||||
|
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
|
||||||
|
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
|
||||||
|
- `normalizer --version` now specifies whether the current version provides an extra speedup (i.e. a mypyc-compiled wheel)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Build with static metadata using 'build' frontend
|
||||||
|
- Make the language detection stricter
|
||||||
|
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- CLI with option --normalize failing when using full paths for files
|
||||||
|
- TooManyAccentuatedPlugin inducing false positives in the mess detection when too few alpha characters have been fed to it
|
||||||
|
- Sphinx warnings when generating the documentation
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead
|
||||||
|
- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
|
||||||
|
- Breaking: Method `first()` and `best()` from CharsetMatch
|
||||||
|
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
|
||||||
|
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
|
||||||
|
- Breaking: Top-level function `normalize`
|
||||||
|
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
|
||||||
|
- Support for the backport `unicodedata2`
|
||||||
|
|
||||||
|
## [3.0.0rc1](https://github.com/Ousret/charset_normalizer/compare/3.0.0b2...3.0.0rc1) (2022-10-18)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Extend the capability of explain=True: when cp_isolation contains one or two entries, the Mess-detector results are logged in detail
|
||||||
|
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
|
||||||
|
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Build with static metadata using 'build' frontend
|
||||||
|
- Make the language detection stricter
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- CLI with option --normalize failing when using full paths for files
|
||||||
|
- TooManyAccentuatedPlugin inducing false positives in the mess detection when too few alpha characters have been fed to it
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead
|
||||||
|
- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
|
||||||
|
|
||||||
|
## [3.0.0b2](https://github.com/Ousret/charset_normalizer/compare/3.0.0b1...3.0.0b2) (2022-08-21)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- `normalizer --version` now specifies whether the current version provides an extra speedup (i.e. a mypyc-compiled wheel)
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- Breaking: Method `first()` and `best()` from CharsetMatch
|
||||||
|
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Sphinx warnings when generating the documentation
|
||||||
|
|
||||||
|
## [3.0.0b1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...3.0.0b1) (2022-08-15)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
|
||||||
|
- Breaking: Top-level function `normalize`
|
||||||
|
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
|
||||||
|
- Support for the backport `unicodedata2`
|
||||||
|
|
||||||
|
## [2.1.1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...2.1.1) (2022-08-19)
|
||||||
|
|
||||||
|
### Deprecated
|
||||||
|
- Function `normalize` scheduled for removal in 3.0
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Removed useless call to decode in fn is_unprintable (#206)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from [@aleksandernovikov](https://github.com/aleksandernovikov) (#204)
|
||||||
|
|
||||||
|
## [2.1.0](https://github.com/Ousret/charset_normalizer/compare/2.0.12...2.1.0) (2022-06-19)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Output the Unicode table version when running the CLI with `--version` (PR #194)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Re-use decoded buffer for single byte character sets from [@nijel](https://github.com/nijel) (PR #175)
|
||||||
|
- Fixing some performance bottlenecks from [@deedy5](https://github.com/deedy5) (PR #183)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR #175)
|
||||||
|
- CLI default threshold aligned with the API threshold from [@oleksandr-kuzmenko](https://github.com/oleksandr-kuzmenko) (PR #181)
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- Support for Python 3.5 (PR #192)
|
||||||
|
|
||||||
|
### Deprecated
|
||||||
|
- Use of backport unicodedata from `unicodedata2` as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)
|
||||||
|
|
||||||
|
## [2.0.12](https://github.com/Ousret/charset_normalizer/compare/2.0.11...2.0.12) (2022-02-12)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- ASCII misdetection in rare cases (PR #170)
|
||||||
|
|
||||||
|
## [2.0.11](https://github.com/Ousret/charset_normalizer/compare/2.0.10...2.0.11) (2022-01-30)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Explicit support for Python 3.11 (PR #164)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- The logging behavior has been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)
|
||||||
|
|
||||||
|
## [2.0.10](https://github.com/Ousret/charset_normalizer/compare/2.0.9...2.0.10) (2022-01-04)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR #154)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Skipping the language-detection (CD) on ASCII (PR #155)
|
||||||
|
|
||||||
|
## [2.0.9](https://github.com/Ousret/charset_normalizer/compare/2.0.8...2.0.9) (2021-12-03)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Moderating the logging impact (since 2.0.8) for specific environments (PR #147)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Wrong logging level applied when setting kwarg `explain` to True (PR #146)
|
||||||
|
|
||||||
|
## [2.0.8](https://github.com/Ousret/charset_normalizer/compare/2.0.7...2.0.8) (2021-11-24)
|
||||||
|
### Changed
|
||||||
|
- Improvement over Vietnamese detection (PR #126)
|
||||||
|
- MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)
|
||||||
|
- Efficiency improvements in cd/alphabet_languages from [@adbar](https://github.com/adbar) (PR #122)
|
||||||
|
- call sum() without an intermediary list following PEP 289 recommendations from [@adbar](https://github.com/adbar) (PR #129)
|
||||||
|
- Code style as refactored by Sourcery-AI (PR #131)
|
||||||
|
- Minor adjustment on the MD around european words (PR #133)
|
||||||
|
- Remove and replace SRTs from assets / tests (PR #139)
|
||||||
|
- Initialize the library logger with a `NullHandler` by default from [@nmaynes](https://github.com/nmaynes) (PR #135)
|
||||||
|
- Setting kwarg `explain` to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
|
||||||
|
- Avoid using too insignificant chunk (PR #137)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Add and expose function `set_logging_handler` to configure a specific StreamHandler from [@nmaynes](https://github.com/nmaynes) (PR #135)
|
||||||
|
- Add `CHANGELOG.md` entries, format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) (PR #141)
|
||||||
|
|
||||||
|
## [2.0.7](https://github.com/Ousret/charset_normalizer/compare/2.0.6...2.0.7) (2021-10-11)
|
||||||
|
### Added
|
||||||
|
- Add support for Kazakh (Cyrillic) language detection (PR #109)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Further improve inferring the language from a given single-byte code page (PR #112)
|
||||||
|
- Vainly trying to leverage PEP263 when PEP3120 is not supported (PR #116)
|
||||||
|
- Refactoring for potential performance improvements in loops from [@adbar](https://github.com/adbar) (PR #113)
|
||||||
|
- Various detection improvement (MD+CD) (PR #117)
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- Remove redundant logging entry about detected language(s) (PR #115)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR #117 #102)
|
||||||
|
|
||||||
|
## [2.0.6](https://github.com/Ousret/charset_normalizer/compare/2.0.5...2.0.6) (2021-09-18)
|
||||||
|
### Fixed
|
||||||
|
- Unforeseen regression with the loss of backward compatibility with some older minor versions of Python 3.5.x (PR #100)
|
||||||
|
- Fix CLI crash when using --minimal output in certain cases (PR #103)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Minor improvement to the detection efficiency (less than 1%) (PR #106 #101)
|
||||||
|
|
||||||
|
## [2.0.5](https://github.com/Ousret/charset_normalizer/compare/2.0.4...2.0.5) (2021-09-14)
|
||||||
|
### Changed
|
||||||
|
- The project now complies with flake8, mypy, isort and black to ensure a better overall quality (PR #81)
|
||||||
|
- The BC-support with v1.x was improved, the old staticmethods are restored (PR #82)
|
||||||
|
- The Unicode detection is slightly improved (PR #93)
|
||||||
|
- Add syntax sugar \_\_bool\_\_ for results CharsetMatches list-container (PR #91)
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- The project no longer raises a warning on tiny content given for detection; it is simply logged as a warning instead (PR #92)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- In some rare cases, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection (PR #95)
|
||||||
|
- Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection (PR #96)
|
||||||
|
- The MANIFEST.in was not exhaustive (PR #78)
|
||||||
|
|
||||||
|
## [2.0.4](https://github.com/Ousret/charset_normalizer/compare/2.0.3...2.0.4) (2021-07-30)
|
||||||
|
### Fixed
|
||||||
|
- The CLI no longer raises an unexpected exception when no encoding has been found (PR #70)
|
||||||
|
- Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR #68)
|
||||||
|
- The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR #72)
|
||||||
|
- Submatch factoring could be wrong in rare edge cases (PR #72)
|
||||||
|
- Multiple files given to the CLI were ignored when publishing results to STDOUT. (After the first path) (PR #72)
|
||||||
|
- Fix line endings from CRLF to LF for certain project files (PR #67)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR #69 #76)
|
||||||
|
- Allow fallback on specified encoding if any (PR #71)
|
||||||
|
|
||||||
|
## [2.0.3](https://github.com/Ousret/charset_normalizer/compare/2.0.2...2.0.3) (2021-07-16)
|
||||||
|
### Changed
|
||||||
|
- Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. (PR #63)
|
||||||
|
- According to the community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. (PR #64)
|
||||||
|
|
||||||
|
## [2.0.2](https://github.com/Ousret/charset_normalizer/compare/2.0.1...2.0.2) (2021-07-15)
|
||||||
|
### Fixed
|
||||||
|
- Empty/Too small JSON payload misdetection fixed. Report from [@tseaver](https://github.com/tseaver) (PR #59)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Don't inject unicodedata2 into sys.modules from [@akx](https://github.com/akx) (PR #57)
|
||||||
|
|
||||||
|
## [2.0.1](https://github.com/Ousret/charset_normalizer/compare/2.0.0...2.0.1) (2021-07-13)
|
||||||
|
### Fixed
|
||||||
|
- Make it work where there isn't a filesystem available, dropping assets frequencies.json. Report from [@sethmlarson](https://github.com/sethmlarson). (PR #55)
|
||||||
|
- Using explain=False permanently disabled the verbose output in the current runtime (PR #47)
|
||||||
|
- One log entry (language target preemptive) was not shown in logs when using explain=True (PR #47)
|
||||||
|
- Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR #52)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Public function normalize default args values were not aligned with from_bytes (PR #53)
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- You may now use charset aliases in cp_isolation and cp_exclusion arguments (PR #47)
|
||||||
|
|
||||||
|
## [2.0.0](https://github.com/Ousret/charset_normalizer/compare/1.4.1...2.0.0) (2021-07-02)
|
||||||
|
### Changed
|
||||||
|
- 4 to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet.
|
||||||
|
- Emphasis has been put on UTF-8 detection, which should perform nearly instantaneously.
|
||||||
|
- The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible.
|
||||||
|
- The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time)
|
||||||
|
- The program has been rewritten to ease the readability and maintainability. (Using static typing)
|
||||||
|
- utf_7 detection has been reinstated.
|
||||||
|
|
||||||
|
### Removed
|
||||||
|
- This package no longer requires anything when used with Python 3.5 (Dropped cached_property)
|
||||||
|
- Removed support for these languages: Catalan, Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.
|
||||||
|
- The exception hook on UnicodeDecodeError has been removed.
|
||||||
|
|
||||||
|
### Deprecated
|
||||||
|
- Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- The CLI output used the relative path of the file(s). Should be absolute.
|
||||||
|
|
||||||
|
## [1.4.1](https://github.com/Ousret/charset_normalizer/compare/1.4.0...1.4.1) (2021-05-28)
|
||||||
|
### Fixed
|
||||||
|
- Logger configuration/usage no longer conflict with others (PR #44)
|
||||||
|
|
||||||
|
## [1.4.0](https://github.com/Ousret/charset_normalizer/compare/1.3.9...1.4.0) (2021-05-21)
|
||||||
|
### Removed
|
||||||
|
- Using standard logging instead of using the package loguru.
|
||||||
|
- Dropping nose test framework in favor of the maintained pytest.
|
||||||
|
- Choose to not use dragonmapper package to help with gibberish Chinese/CJK text.
|
||||||
|
- Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version.
|
||||||
|
- Stop support for UTF-7 that does not contain a SIG.
|
||||||
|
- Dropping PrettyTable, replaced with pure JSON output in CLI.
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process.
|
||||||
|
- Not searching properly for the BOM when trying utf32/16 parent codec.
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Improving the package final size by compressing frequencies.json.
|
||||||
|
- Huge improvement on the largest payloads.
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- CLI now produces JSON consumable output.
|
||||||
|
- Return ASCII if given sequences fit. Given reasonable confidence.
|
||||||
|
|
||||||
|
## [1.3.9](https://github.com/Ousret/charset_normalizer/compare/1.3.8...1.3.9) (2021-05-13)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload (PR #40)
|
||||||
|
|
||||||
|
## [1.3.8](https://github.com/Ousret/charset_normalizer/compare/1.3.7...1.3.8) (2021-05-12)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Empty given payload for detection may cause an exception if trying to access the `alphabets` property. (PR #39)
|
||||||
|
|
||||||
|
## [1.3.7](https://github.com/Ousret/charset_normalizer/compare/1.3.6...1.3.7) (2021-05-12)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- The legacy detect function should return UTF-8-SIG if sig is present in the payload. (PR #38)
|
||||||
|
|
||||||
|
## [1.3.6](https://github.com/Ousret/charset_normalizer/compare/1.3.5...1.3.6) (2021-02-09)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Amend the previous release to allow prettytable 2.0 (PR #35)
|
||||||
|
|
||||||
|
## [1.3.5](https://github.com/Ousret/charset_normalizer/compare/1.3.4...1.3.5) (2021-02-08)
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
- Fix error while using the package with a Python pre-release interpreter (PR #33)
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
- Dependencies refactoring, constraints revised.
|
||||||
|
|
||||||
|
### Added
|
||||||
|
- Add Python 3.9 and 3.10 to the supported interpreters
|
||||||
|
|
||||||
|
MIT License
|
||||||
|
|
||||||
|
Copyright (c) 2025 TAHRI Ahmed R.
|
||||||
|
|
||||||
|
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||||
|
of this software and associated documentation files (the "Software"), to deal
|
||||||
|
in the Software without restriction, including without limitation the rights
|
||||||
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||||
|
copies of the Software, and to permit persons to whom the Software is
|
||||||
|
furnished to do so, subject to the following conditions:
|
||||||
|
|
||||||
|
The above copyright notice and this permission notice shall be included in all
|
||||||
|
copies or substantial portions of the Software.
|
||||||
|
|
||||||
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||||
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||||
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||||
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||||
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||||
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||||
|
SOFTWARE.
|
||||||
@ -0,0 +1,36 @@
|
|||||||
|
../../../bin/normalizer,sha256=Vl5b9ip1NWayflTDPQ3qTFq2TwHnjq-qPimg4xHPduc,256
|
||||||
|
81d243bd2c585b0f4821__mypyc.cpython-312-x86_64-linux-gnu.so,sha256=xPTGB-9iuOqJ5RfI3qaB1WzuFAm1oYWbgN1Jz9U1wn0,433312
|
||||||
|
charset_normalizer-3.4.7.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
|
||||||
|
charset_normalizer-3.4.7.dist-info/METADATA,sha256=K8lK8L8LaZ1YmKvWLt3zEkpIxiCOC58xNhzFQrfQJxQ,40931
|
||||||
|
charset_normalizer-3.4.7.dist-info/RECORD,,
|
||||||
|
charset_normalizer-3.4.7.dist-info/WHEEL,sha256=Tc3fF66yn9Kh-hkUUsdKQyuB9Lw0CDoeANnEbSVc3f4,190
|
||||||
|
charset_normalizer-3.4.7.dist-info/entry_points.txt,sha256=ADSTKrkXZ3hhdOVFi6DcUEHQRS0xfxDIE_pEz4wLIXA,65
|
||||||
|
charset_normalizer-3.4.7.dist-info/licenses/LICENSE,sha256=bQ1Bv-FwrGx9wkjJpj4lTQ-0WmDVCoJX0K-SxuJJuIc,1071
|
||||||
|
charset_normalizer-3.4.7.dist-info/top_level.txt,sha256=c_vZbitqecT2GfK3zdxSTLCn8C-6pGnHQY5o_5Y32M0,47
|
||||||
|
charset_normalizer/__init__.py,sha256=OKRxRv2Zhnqk00tqkN0c1BtJjm165fWXLydE52IKuHc,1590
|
||||||
|
charset_normalizer/__main__.py,sha256=yzYxMR-IhKRHYwcSlavEv8oGdwxsR89mr2X09qXGdps,109
|
||||||
|
charset_normalizer/__pycache__/__init__.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/__main__.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/api.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/cd.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/constant.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/legacy.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/md.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/models.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/utils.cpython-312.pyc,,
|
||||||
|
charset_normalizer/__pycache__/version.cpython-312.pyc,,
|
||||||
|
charset_normalizer/api.py,sha256=387F3n23MlMu-xfSbFULW2DLGsBmVrZVGhnkiGXeKBo,38844
|
||||||
|
charset_normalizer/cd.cpython-312-x86_64-linux-gnu.so,sha256=gOe65H__3O8_4a-aSVMB8gxHsRxVyQDUqqaIurPmIhE,15912
|
||||||
|
charset_normalizer/cd.py,sha256=v0iPJweGsRegXywrM1LzUgqW9bJ1KFvIblQHP1jm5FQ,15174
|
||||||
|
charset_normalizer/cli/__init__.py,sha256=D8I86lFk2-py45JvqxniTirSj_sFyE6sjaY_0-G1shc,136
|
||||||
|
charset_normalizer/cli/__main__.py,sha256=E9FFSV1E2iOE_B2B1tJHQT9ExJqc60Ks_c-08sNawh8,11940
|
||||||
|
charset_normalizer/cli/__pycache__/__init__.cpython-312.pyc,,
|
||||||
|
charset_normalizer/cli/__pycache__/__main__.cpython-312.pyc,,
|
||||||
|
charset_normalizer/constant.py,sha256=yvLAWDrdSC743Cu4amhwHLIO-FGuRTOTZouCzZKGikc,44431
|
||||||
|
charset_normalizer/legacy.py,sha256=yBIFMNABNPE5JkdKOWyVo36fZtV9nm8bf37LrDWulz8,2661
|
||||||
|
charset_normalizer/md.cpython-312-x86_64-linux-gnu.so,sha256=iYaQbya7NVRR7xg5FtK1yAKS5shmTFwmtkqqQbbvEWs,15912
|
||||||
|
charset_normalizer/md.py,sha256=AYCdfDX79FrgoId3zXqmbCuDcbGr1NRuGqgJN94Rx9Q,30441
|
||||||
|
charset_normalizer/models.py,sha256=FbaQnI6ECmVmyHRSvVM5fHNeMAQ3KSGdwLjGcQqWDws,12821
|
||||||
|
charset_normalizer/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
|
||||||
|
charset_normalizer/utils.py,sha256=9cpi-_0-vC9pGDfuoarhC6VlF_Jxwx5Jsa_8I4w2D8k,12282
|
||||||
|
charset_normalizer/version.py,sha256=2LxFuGp3BBuIwt95cp64y7v8bCNHcMAi08IfXt_47Co,115
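
Each RECORD line above follows the `path,sha256=<digest>,<size>` layout, where the digest is the URL-safe base64 encoding of the SHA-256 hash with the trailing `=` padding stripped. The following is a minimal sketch of how such a line could be produced and checked; the helper names `record_entry` and `verify_entry` are hypothetical and are not part of pip or setuptools.

```python
import base64
import hashlib


def record_entry(path: str, data: bytes) -> str:
    """Build one RECORD-style line: path,sha256=<digest>,<size>."""
    digest = hashlib.sha256(data).digest()
    # RECORD uses URL-safe base64 with the trailing '=' padding removed.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return f"{path},sha256={encoded},{len(data)}"


def verify_entry(line: str, data: bytes) -> bool:
    """Check a RECORD line against file contents.

    Entries with an empty hash field (RECORD itself, .pyc files) are accepted.
    """
    path, hash_field, _size_field = line.rsplit(",", 2)
    if not hash_field:
        return True
    return line == record_entry(path, data)


# The empty py.typed marker file reproduces the entry listed above.
print(record_entry("charset_normalizer/py.typed", b""))
```

Splitting from the right with `rsplit(",", 2)` keeps the sketch working even if a path itself contained a comma, which a plain `split(",")` would not handle.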
|
||||||
@ -0,0 +1,7 @@
|
|||||||
|
Wheel-Version: 1.0
|
||||||
|
Generator: setuptools (82.0.1)
|
||||||
|
Root-Is-Purelib: false
|
||||||
|
Tag: cp312-cp312-manylinux_2_17_x86_64
|
||||||
|
Tag: cp312-cp312-manylinux2014_x86_64
|
||||||
|
Tag: cp312-cp312-manylinux_2_28_x86_64
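
The WHEEL file is a set of `Key: value` header lines, and each `Tag` value has the form `<python tag>-<abi tag>-<platform tag>`. As a rough illustration, it can be read with the standard library's email parser; the function name `read_tags` is an assumption made for this sketch, and real installers rely on dedicated packaging libraries instead.

```python
from email.parser import Parser

WHEEL_TEXT = """\
Wheel-Version: 1.0
Generator: setuptools (82.0.1)
Root-Is-Purelib: false
Tag: cp312-cp312-manylinux_2_17_x86_64
Tag: cp312-cp312-manylinux2014_x86_64
Tag: cp312-cp312-manylinux_2_28_x86_64
"""


def read_tags(wheel_text: str) -> list[tuple[str, str, str]]:
    """Return (python, abi, platform) triples for every Tag header."""
    message = Parser().parsestr(wheel_text)
    tags = []
    for value in message.get_all("Tag", []):
        # Platform tags use underscores, so two splits on '-' are enough.
        python_tag, abi_tag, platform_tag = value.split("-", 2)
        tags.append((python_tag, abi_tag, platform_tag))
    return tags


for tag in read_tags(WHEEL_TEXT):
    print(tag)
```

`Root-Is-Purelib: false` together with the `cp312` tags indicates a platform-specific build, which matches the compiled `.so` files listed in RECORD.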
|
||||||
|
|
||||||
@ -0,0 +1,2 @@
|
|||||||
|
[console_scripts]
|
||||||
|
normalizer = charset_normalizer.cli:cli_detect
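
The entry above maps the `normalizer` command to the object `cli_detect` in the module `charset_normalizer.cli`, using the `name = module:attribute` convention. A small sketch of reading that mapping with `configparser` follows; `parse_console_scripts` is a hypothetical helper name, and in practice `importlib.metadata` exposes installed entry points directly.

```python
import configparser

ENTRY_POINTS_TXT = """\
[console_scripts]
normalizer = charset_normalizer.cli:cli_detect
"""


def parse_console_scripts(text: str) -> dict[str, tuple[str, str]]:
    """Map each script name to its (module, attribute) target."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep script names case-sensitive
    parser.read_string(text)
    scripts = {}
    for name, target in parser["console_scripts"].items():
        module, _, attr = target.partition(":")
        scripts[name] = (module, attr)
    return scripts


print(parse_console_scripts(ENTRY_POINTS_TXT))
```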
|
||||||
@ -0,0 +1,21 @@
|
|||||||
|
MIT License
|
||||||
|
|
||||||
|
Copyright (c) 2025 TAHRI Ahmed R.
|
||||||
|
|
||||||
|
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||||
|
of this software and associated documentation files (the "Software"), to deal
|
||||||
|
in the Software without restriction, including without limitation the rights
|
||||||
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||||
|
copies of the Software, and to permit persons to whom the Software is
|
||||||
|
furnished to do so, subject to the following conditions:
|
||||||
|
|
||||||
|
The above copyright notice and this permission notice shall be included in all
|
||||||
|
copies or substantial portions of the Software.
|
||||||
|
|
||||||
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||||
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||||
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||||
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||||
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||||
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
||||||
|
SOFTWARE.
|
||||||
@ -0,0 +1,2 @@
|
|||||||
|
81d243bd2c585b0f4821__mypyc
|
||||||
|
charset_normalizer
|
||||||
@ -0,0 +1,48 @@
|
|||||||
|
"""
|
||||||
|
Charset-Normalizer
|
||||||
|
~~~~~~~~~~~~~~
|
||||||
|
The Real First Universal Charset Detector.
|
||||||
|
A library that helps you read text from an unknown charset encoding.
|
||||||
|
Motivated by chardet, this package tries to resolve the issue by taking a new approach.
|
||||||
|
All IANA character set names for which the Python core library provides codecs are supported.
|
||||||
|
|
||||||
|
Basic usage:
|
||||||
|
>>> from charset_normalizer import from_bytes
|
||||||
|
>>> results = from_bytes('Bсеки човек има право на образование. Oбразованието!'.encode('utf_8'))
|
||||||
|
>>> best_guess = results.best()
|
||||||
|
>>> str(best_guess)
|
||||||
|
'Bсеки човек има право на образование. Oбразованието!'
|
||||||
|
|
||||||
|
Other methods and usages are available - see the full documentation
|
||||||
|
at <https://github.com/Ousret/charset_normalizer>.
|
||||||
|
:copyright: (c) 2021 by Ahmed TAHRI
|
||||||
|
:license: MIT, see LICENSE for more details.
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
|
||||||
|
from .api import from_bytes, from_fp, from_path, is_binary
|
||||||
|
from .legacy import detect
|
||||||
|
from .models import CharsetMatch, CharsetMatches
|
||||||
|
from .utils import set_logging_handler
|
||||||
|
from .version import VERSION, __version__
|
||||||
|
|
||||||
|
__all__ = (
|
||||||
|
"from_fp",
|
||||||
|
"from_path",
|
||||||
|
"from_bytes",
|
||||||
|
"is_binary",
|
||||||
|
"detect",
|
||||||
|
"CharsetMatch",
|
||||||
|
"CharsetMatches",
|
||||||
|
"__version__",
|
||||||
|
"VERSION",
|
||||||
|
"set_logging_handler",
|
||||||
|
)
|
||||||
|
|
||||||
|
# Attach a NullHandler to the top level logger by default
|
||||||
|
# https://docs.python.org/3.3/howto/logging.html#configuring-logging-for-a-library
|
||||||
|
|
||||||
|
logging.getLogger("charset_normalizer").addHandler(logging.NullHandler())
|
||||||
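The NullHandler attached above means the library stays silent unless the application (or the `explain` toggle of `from_bytes`) adds a real handler. The following is a minimal, stdlib-only sketch of that pattern. The logger name `demo_library` and the helper `emit_with_optional_handler` are assumptions made for illustration and are not part of charset_normalizer.

```python
import io
import logging

# Hypothetical library logger; "demo_library" is an assumed name for this sketch.
lib_logger = logging.getLogger("demo_library")
lib_logger.addHandler(logging.NullHandler())
lib_logger.propagate = False  # keep the demo isolated from the root logger


def emit_with_optional_handler(explain: bool, stream: io.StringIO) -> None:
    """Log one debug record, temporarily attaching a handler when explain is True."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s | %(message)s"))
    previous_level = lib_logger.level
    if explain:
        lib_logger.addHandler(handler)
        lib_logger.setLevel(logging.DEBUG)
    try:
        lib_logger.debug("probing candidate encodings")
    finally:
        # Restore the logger so later calls are silent again.
        if explain:
            lib_logger.removeHandler(handler)
            lib_logger.setLevel(previous_level)


quiet, verbose = io.StringIO(), io.StringIO()
emit_with_optional_handler(False, quiet)
emit_with_optional_handler(True, verbose)
```

Restoring the previous level and removing the handler on every exit path keeps the temporary debugging output from leaking into later calls.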
@ -0,0 +1,6 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from .cli import cli_detect
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
cli_detect()
|
||||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@ -0,0 +1,988 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import logging
|
||||||
|
from os import PathLike
|
||||||
|
from typing import BinaryIO
|
||||||
|
|
||||||
|
from .cd import (
|
||||||
|
coherence_ratio,
|
||||||
|
encoding_languages,
|
||||||
|
mb_encoding_languages,
|
||||||
|
merge_coherence_ratios,
|
||||||
|
)
|
||||||
|
from .constant import (
|
||||||
|
IANA_SUPPORTED,
|
||||||
|
IANA_SUPPORTED_SIMILAR,
|
||||||
|
TOO_BIG_SEQUENCE,
|
||||||
|
TOO_SMALL_SEQUENCE,
|
||||||
|
TRACE,
|
||||||
|
)
|
||||||
|
from .md import mess_ratio
|
||||||
|
from .models import CharsetMatch, CharsetMatches
|
||||||
|
from .utils import (
|
||||||
|
any_specified_encoding,
|
||||||
|
cut_sequence_chunks,
|
||||||
|
iana_name,
|
||||||
|
identify_sig_or_bom,
|
||||||
|
is_multi_byte_encoding,
|
||||||
|
should_strip_sig_or_bom,
|
||||||
|
)
|
||||||
|
|
||||||
|
logger = logging.getLogger("charset_normalizer")
|
||||||
|
explain_handler = logging.StreamHandler()
|
||||||
|
explain_handler.setFormatter(
|
||||||
|
logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
|
||||||
|
)
|
||||||
|
|
||||||
|
# Pre-compute a reordered encoding list: multibyte first, then single-byte.
|
||||||
|
# This allows the mb_definitive_match optimization to fire earlier, skipping
|
||||||
|
# all single-byte encodings for genuine CJK content. Multibyte codecs
|
||||||
|
# hard-fail (UnicodeDecodeError) on single-byte data almost instantly, so
|
||||||
|
# testing them first costs negligible time for non-CJK files.
|
||||||
|
_mb_supported: list[str] = []
|
||||||
|
_sb_supported: list[str] = []
|
||||||
|
|
||||||
|
for _supported_enc in IANA_SUPPORTED:
|
||||||
|
try:
|
||||||
|
if is_multi_byte_encoding(_supported_enc):
|
||||||
|
_mb_supported.append(_supported_enc)
|
||||||
|
else:
|
||||||
|
_sb_supported.append(_supported_enc)
|
||||||
|
except ImportError:
|
||||||
|
_sb_supported.append(_supported_enc)
|
||||||
|
|
||||||
|
IANA_SUPPORTED_MB_FIRST: list[str] = _mb_supported + _sb_supported
|
||||||
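The reordering above can be illustrated with a small stdlib-only sketch. The codec sample, the `ASSUMED_MULTIBYTE` set (standing in for `is_multi_byte_encoding`), and both helper functions are assumptions for illustration, not the library's own code. It also shows the behaviour the comment relies on: a strict multi-byte codec such as utf_8 typically raises on single-byte data, while a single-byte codec such as latin_1 decodes any byte string.

```python
from __future__ import annotations

# Hypothetical sample of codec names; the real list comes from IANA_SUPPORTED.
SAMPLE_ENCODINGS = ["cp1252", "big5", "latin_1", "shift_jis", "cp1251", "gb2312"]

# Assumption: a fixed set standing in for is_multi_byte_encoding().
ASSUMED_MULTIBYTE = {"big5", "shift_jis", "gb2312"}


def order_multibyte_first(names: list[str]) -> list[str]:
    """Return multi-byte codecs first, then single-byte ones, keeping relative order."""
    mb = [n for n in names if n in ASSUMED_MULTIBYTE]
    sb = [n for n in names if n not in ASSUMED_MULTIBYTE]
    return mb + sb


def hard_fails(payload: bytes, encoding: str) -> bool:
    """True when strict decoding raises UnicodeDecodeError."""
    try:
        payload.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


ORDERED = order_multibyte_first(SAMPLE_ENCODINGS)

# Latin-1 bytes: 0xE9 followed by an ASCII letter is not valid UTF-8.
latin_payload = "déjà vu".encode("latin_1")
```

Because the hard failure happens at the first invalid byte, trying the strict codecs first adds little cost for single-byte content.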
|
|
||||||
|
|
||||||
|
def from_bytes(
|
||||||
|
sequences: bytes | bytearray,
|
||||||
|
steps: int = 5,
|
||||||
|
chunk_size: int = 512,
|
||||||
|
threshold: float = 0.2,
|
||||||
|
cp_isolation: list[str] | None = None,
|
||||||
|
cp_exclusion: list[str] | None = None,
|
||||||
|
preemptive_behaviour: bool = True,
|
||||||
|
explain: bool = False,
|
||||||
|
language_threshold: float = 0.1,
|
||||||
|
enable_fallback: bool = True,
|
||||||
|
) -> CharsetMatches:
|
||||||
|
"""
|
||||||
|
Given a raw bytes sequence, return the best possible charsets usable to render str objects.
|
||||||
|
If there are no results, it is a strong indicator that the source is binary/not text.
|
||||||
|
By default, the process will extract 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence.
|
||||||
|
And will give up a particular code page after 20% of measured mess. Those criteria are customizable at will.
|
||||||
|
|
||||||
|
The preemptive behavior DOES NOT replace the traditional detection workflow; it prioritizes a particular code page
|
||||||
|
but never takes it for granted. It can improve performance.
|
||||||
|
|
||||||
|
You may want to focus on some code pages and/or exclude others; use cp_isolation and cp_exclusion for that
|
||||||
|
purpose.
|
||||||
|
|
||||||
|
This function will strip the SIG in the payload/sequence every time except on UTF-16, UTF-32.
|
||||||
|
By default the library does not set up any handler other than the NullHandler; if you choose to set the 'explain'
|
||||||
|
toggle to True it will alter the logger configuration to add a StreamHandler that is suitable for debugging.
|
||||||
|
Custom logging format and handler can be set manually.
|
||||||
|
"""
|
||||||
|
|
||||||
|
if not isinstance(sequences, (bytearray, bytes)):
|
||||||
|
raise TypeError(
|
||||||
|
"Expected object of type bytes or bytearray, got: {}".format(
|
||||||
|
type(sequences)
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
if explain:
|
||||||
|
previous_logger_level: int = logger.level
|
||||||
|
logger.addHandler(explain_handler)
|
||||||
|
logger.setLevel(TRACE)
|
||||||
|
|
||||||
|
length: int = len(sequences)
|
||||||
|
|
||||||
|
if length == 0:
|
||||||
|
logger.debug("Encoding detection on empty bytes, assuming utf_8 intention.")
|
||||||
|
if explain: # Defensive: ensure exit path clean handler
|
||||||
|
logger.removeHandler(explain_handler)
|
||||||
|
logger.setLevel(previous_logger_level)
|
||||||
|
return CharsetMatches([CharsetMatch(sequences, "utf_8", 0.0, False, [], "")])
|
||||||
|
|
||||||
|
if cp_isolation is not None:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"cp_isolation is set. use this flag for debugging purpose. "
|
||||||
|
"limited list of encoding allowed : %s.",
|
||||||
|
", ".join(cp_isolation),
|
||||||
|
)
|
||||||
|
cp_isolation = [iana_name(cp, False) for cp in cp_isolation]
|
||||||
|
else:
|
||||||
|
cp_isolation = []
|
||||||
|
|
||||||
|
if cp_exclusion is not None:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"cp_exclusion is set. use this flag for debugging purpose. "
|
||||||
|
"limited list of encoding excluded : %s.",
|
||||||
|
", ".join(cp_exclusion),
|
||||||
|
)
|
||||||
|
cp_exclusion = [iana_name(cp, False) for cp in cp_exclusion]
|
||||||
|
else:
|
||||||
|
cp_exclusion = []
|
||||||
|
|
||||||
|
if length <= (chunk_size * steps):
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.",
|
||||||
|
steps,
|
||||||
|
chunk_size,
|
||||||
|
length,
|
||||||
|
)
|
||||||
|
steps = 1
|
||||||
|
chunk_size = length
|
||||||
|
|
||||||
|
if steps > 1 and length / steps < chunk_size:
|
||||||
|
chunk_size = int(length / steps)
|
||||||
|
|
||||||
|
is_too_small_sequence: bool = len(sequences) < TOO_SMALL_SEQUENCE
|
||||||
|
is_too_large_sequence: bool = len(sequences) >= TOO_BIG_SEQUENCE
|
||||||
|
|
||||||
|
if is_too_small_sequence:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Trying to detect encoding from a tiny portion of ({}) byte(s).".format(
|
||||||
|
length
|
||||||
|
),
|
||||||
|
)
|
||||||
|
elif is_too_large_sequence:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Using lazy str decoding because the payload is quite large, ({}) byte(s).".format(
|
||||||
|
length
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
prioritized_encodings: list[str] = []
|
||||||
|
|
||||||
|
specified_encoding: str | None = (
|
||||||
|
any_specified_encoding(sequences) if preemptive_behaviour else None
|
||||||
|
)
|
||||||
|
|
||||||
|
if specified_encoding is not None:
|
||||||
|
prioritized_encodings.append(specified_encoding)
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Detected declarative mark in sequence. Priority +1 given for %s.",
|
||||||
|
specified_encoding,
|
||||||
|
)
|
||||||
|
|
||||||
|
tested: set[str] = set()
|
||||||
|
tested_but_hard_failure: list[str] = []
|
||||||
|
tested_but_soft_failure: list[str] = []
|
||||||
|
soft_failure_skip: set[str] = set()
|
||||||
|
success_fast_tracked: set[str] = set()
|
||||||
|
|
||||||
|
# Cache for decoded payload deduplication: hash(decoded_payload) -> (mean_mess_ratio, cd_ratios_merged, passed)
|
||||||
|
# When multiple encodings decode to the exact same string, we can skip the expensive
|
||||||
|
# mess_ratio and coherence_ratio analysis and reuse the results from the first encoding.
|
||||||
|
payload_result_cache: dict[int, tuple[float, list[tuple[str, float]], bool]] = {}
|
||||||
|
|
||||||
|
# When a definitive result (chaos=0.0 and good coherence) is found after testing
|
||||||
|
# the prioritized encodings (ascii, utf_8), we can significantly reduce the remaining
|
||||||
|
# work. Encodings that target completely different language families (e.g., Cyrillic
|
||||||
|
# when the definitive match is Latin) are skipped entirely.
|
||||||
|
# Additionally, for same-family encodings that pass chaos probing, we reuse the
|
||||||
|
# definitive match's coherence ratios instead of recomputing them — a major savings
|
||||||
|
# since coherence_ratio accounts for ~30% of total time on slow Latin files.
|
||||||
|
definitive_match_found: bool = False
|
||||||
|
definitive_target_languages: set[str] = set()
|
||||||
|
# After the definitive match fires, we cap the number of additional same-family
|
||||||
|
# single-byte encodings that pass chaos probing. Once we've accumulated enough
|
||||||
|
# good candidates (N), further same-family SB encodings are unlikely to produce
|
||||||
|
# a better best() result and just waste mess_ratio + coherence_ratio time.
|
||||||
|
# The first encoding to trigger the definitive match is NOT counted (it's already in).
|
||||||
|
post_definitive_sb_success_count: int = 0
|
||||||
|
POST_DEFINITIVE_SB_CAP: int = 7
|
||||||
|
|
||||||
|
# When a non-UTF multibyte encoding passes chaos probing with significant multibyte
|
||||||
|
# content (decoded length < 98% of raw length), skip all remaining single-byte encodings.
|
||||||
|
# Rationale: multi-byte decoders (CJK) have strict byte-sequence validation — if they
|
||||||
|
# decode without error AND pass chaos probing with substantial multibyte content, the
|
||||||
|
# data is genuinely multibyte encoded. Single-byte encodings will always decode (every
|
||||||
|
# byte maps to something) but waste time on mess_ratio before failing.
|
||||||
|
# The 98% threshold prevents false triggers on files that happen to have a few valid
|
||||||
|
# multibyte pairs (e.g., cp424/_ude_1.txt where big5 decodes with 99% ratio).
|
||||||
|
mb_definitive_match_found: bool = False
|
||||||
|
|
||||||
|
fallback_ascii: CharsetMatch | None = None
|
||||||
|
fallback_u8: CharsetMatch | None = None
|
||||||
|
fallback_specified: CharsetMatch | None = None
|
||||||
|
|
||||||
|
results: CharsetMatches = CharsetMatches()
|
||||||
|
|
||||||
|
early_stop_results: CharsetMatches = CharsetMatches()
|
||||||
|
|
||||||
|
sig_encoding, sig_payload = identify_sig_or_bom(sequences)
|
||||||
|
|
||||||
|
if sig_encoding is not None:
|
||||||
|
prioritized_encodings.append(sig_encoding)
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Detected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.",
|
||||||
|
len(sig_payload),
|
||||||
|
sig_encoding,
|
||||||
|
)
|
||||||
|
|
||||||
|
prioritized_encodings.append("ascii")
|
||||||
|
|
||||||
|
if "utf_8" not in prioritized_encodings:
|
||||||
|
prioritized_encodings.append("utf_8")
|
||||||
|
|
||||||
|
for encoding_iana in prioritized_encodings + IANA_SUPPORTED_MB_FIRST:
|
||||||
|
if cp_isolation and encoding_iana not in cp_isolation:
|
||||||
|
continue
|
||||||
|
|
||||||
|
if cp_exclusion and encoding_iana in cp_exclusion:
|
||||||
|
continue
|
||||||
|
|
||||||
|
if encoding_iana in tested:
|
||||||
|
continue
|
||||||
|
|
||||||
|
tested.add(encoding_iana)
|
||||||
|
|
||||||
|
decoded_payload: str | None = None
|
||||||
|
bom_or_sig_available: bool = sig_encoding == encoding_iana
|
||||||
|
strip_sig_or_bom: bool = bom_or_sig_available and should_strip_sig_or_bom(
|
||||||
|
encoding_iana
|
||||||
|
)
|
||||||
|
|
||||||
|
if encoding_iana in {"utf_16", "utf_32"} and not bom_or_sig_available:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Encoding %s won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
if encoding_iana in {"utf_7"} and not bom_or_sig_available:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Encoding %s won't be tested as-is because detection is unreliable without BOM/SIG.",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Skip encodings similar to ones that already soft-failed (high mess ratio).
|
||||||
|
# Checked BEFORE the expensive decode attempt.
|
||||||
|
if encoding_iana in soft_failure_skip:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"%s is deemed too similar to a code page that was already considered unsuited. Continuing!",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Skip encodings that were already fast-tracked from a similar successful encoding.
|
||||||
|
if encoding_iana in success_fast_tracked:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Skipping %s: already fast-tracked from a similar successful encoding.",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
is_multi_byte_decoder: bool = is_multi_byte_encoding(encoding_iana)
|
||||||
|
except (ModuleNotFoundError, ImportError): # Defensive:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Encoding %s does not provide an IncrementalDecoder",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
# When we've already found a definitive match (chaos=0.0 with good coherence)
|
||||||
|
# after testing the prioritized encodings, skip encodings that target
|
||||||
|
# completely different language families. This avoids running expensive
|
||||||
|
# mess_ratio + coherence_ratio on clearly unrelated candidates (e.g., Cyrillic
|
||||||
|
# when the definitive match is Latin-based).
|
||||||
|
if definitive_match_found:
|
||||||
|
if not is_multi_byte_decoder:
|
||||||
|
enc_languages = set(encoding_languages(encoding_iana))
|
||||||
|
else:
|
||||||
|
enc_languages = set(mb_encoding_languages(encoding_iana))
|
||||||
|
if not enc_languages.intersection(definitive_target_languages):
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Skipping %s: definitive match already found, this encoding targets different languages (%s vs %s).",
|
||||||
|
encoding_iana,
|
||||||
|
enc_languages,
|
||||||
|
definitive_target_languages,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
# After the definitive match, cap the number of additional same-family
|
||||||
|
# single-byte encodings that pass chaos probing. This avoids testing the
|
||||||
|
# tail of rare, low-value same-family encodings (mac_iceland, cp860, etc.)
|
||||||
|
# that almost never change best() but each cost ~1-2ms of mess_ratio + coherence.
|
||||||
|
if (
|
||||||
|
definitive_match_found
|
||||||
|
and not is_multi_byte_decoder
|
||||||
|
and post_definitive_sb_success_count >= POST_DEFINITIVE_SB_CAP
|
||||||
|
):
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Skipping %s: already accumulated %d same-family results after definitive match (cap=%d).",
|
||||||
|
encoding_iana,
|
||||||
|
post_definitive_sb_success_count,
|
||||||
|
POST_DEFINITIVE_SB_CAP,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
# When a multibyte encoding with significant multibyte content has already
|
||||||
|
# passed chaos probing, skip all single-byte encodings. They will either fail
|
||||||
|
# chaos probing (wasting mess_ratio time) or produce inferior results.
|
||||||
|
if mb_definitive_match_found and not is_multi_byte_decoder:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Skipping single-byte %s: multi-byte definitive match already found.",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
if is_too_large_sequence and is_multi_byte_decoder is False:
|
||||||
|
str(
|
||||||
|
(
|
||||||
|
sequences[: int(50e4)]
|
||||||
|
if strip_sig_or_bom is False
|
||||||
|
else sequences[len(sig_payload) : int(50e4)]
|
||||||
|
),
|
||||||
|
encoding=encoding_iana,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
# UTF-7 BOM is encoded in modified Base64 whose byte boundary
|
||||||
|
# can overlap with the next character. Stripping raw SIG bytes
|
||||||
|
# before decoding may leave stray bytes that decode as garbage.
|
||||||
|
# Decode the full sequence and remove the leading BOM char instead.
|
||||||
|
# see https://github.com/jawah/charset_normalizer/issues/718
|
||||||
|
# and https://github.com/jawah/charset_normalizer/issues/716
|
||||||
|
if encoding_iana == "utf_7" and bom_or_sig_available:
|
||||||
|
decoded_payload = str(
|
||||||
|
sequences,
|
||||||
|
encoding=encoding_iana,
|
||||||
|
)
|
||||||
|
if decoded_payload and decoded_payload[0] == "\ufeff":
|
||||||
|
decoded_payload = decoded_payload[1:]
|
||||||
|
else:
|
||||||
|
decoded_payload = str(
|
||||||
|
(
|
||||||
|
sequences
|
||||||
|
if strip_sig_or_bom is False
|
||||||
|
else sequences[len(sig_payload) :]
|
||||||
|
),
|
||||||
|
encoding=encoding_iana,
|
||||||
|
)
|
||||||
|
except (UnicodeDecodeError, LookupError) as e:
|
||||||
|
if not isinstance(e, LookupError):
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Code page %s does not fit given bytes sequence at ALL. %s",
|
||||||
|
encoding_iana,
|
||||||
|
str(e),
|
||||||
|
)
|
||||||
|
tested_but_hard_failure.append(encoding_iana)
|
||||||
|
continue
|
||||||
|
|
||||||
|
r_ = range(
|
||||||
|
0 if not bom_or_sig_available else len(sig_payload),
|
||||||
|
length,
|
||||||
|
int(length / steps),
|
||||||
|
)
|
||||||
|
|
||||||
|
multi_byte_bonus: bool = (
|
||||||
|
is_multi_byte_decoder
|
||||||
|
and decoded_payload is not None
|
||||||
|
and len(decoded_payload) < length
|
||||||
|
)
|
||||||
|
|
||||||
|
if multi_byte_bonus:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Code page %s is a multi byte encoding table and it appear that at least one character "
|
||||||
|
"was encoded using n-bytes.",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Payload-hash deduplication: if another encoding already decoded to the
|
||||||
|
# exact same string, reuse its mess_ratio and coherence results entirely.
|
||||||
|
# This is strictly more general than the old IANA_SUPPORTED_SIMILAR approach
|
||||||
|
# because it catches ALL identical decoding, not just pre-mapped ones.
|
||||||
|
if decoded_payload is not None and not is_multi_byte_decoder:
|
||||||
|
payload_hash: int = hash(decoded_payload)
|
||||||
|
cached = payload_result_cache.get(payload_hash)
|
||||||
|
if cached is not None:
|
||||||
|
cached_mess, cached_cd, cached_passed = cached
|
||||||
|
if cached_passed:
|
||||||
|
# The previous encoding with identical output passed chaos probing.
|
||||||
|
fast_match = CharsetMatch(
|
||||||
|
sequences,
|
||||||
|
encoding_iana,
|
||||||
|
cached_mess,
|
||||||
|
bom_or_sig_available,
|
||||||
|
cached_cd,
|
||||||
|
(
|
||||||
|
decoded_payload
|
||||||
|
if (
|
||||||
|
is_too_large_sequence is False
|
||||||
|
or encoding_iana
|
||||||
|
in [specified_encoding, "ascii", "utf_8"]
|
||||||
|
)
|
||||||
|
else None
|
||||||
|
),
|
||||||
|
preemptive_declaration=specified_encoding,
|
||||||
|
)
|
||||||
|
results.append(fast_match)
|
||||||
|
success_fast_tracked.add(encoding_iana)
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"%s fast-tracked (identical decoded payload to a prior encoding, chaos=%f %%).",
|
||||||
|
encoding_iana,
|
||||||
|
round(cached_mess * 100, ndigits=3),
|
||||||
|
)
|
||||||
|
|
||||||
|
if (
|
||||||
|
encoding_iana in [specified_encoding, "ascii", "utf_8"]
|
||||||
|
and cached_mess < 0.1
|
||||||
|
):
|
||||||
|
if cached_mess == 0.0:
|
||||||
|
logger.debug(
|
||||||
|
"Encoding detection: %s is most likely the one.",
|
||||||
|
fast_match.encoding,
|
||||||
|
)
|
||||||
|
if explain:
|
||||||
|
logger.removeHandler(explain_handler)
|
||||||
|
logger.setLevel(previous_logger_level)
|
||||||
|
return CharsetMatches([fast_match])
|
||||||
|
early_stop_results.append(fast_match)
|
||||||
|
|
||||||
|
if (
|
||||||
|
len(early_stop_results)
|
||||||
|
and (specified_encoding is None or specified_encoding in tested)
|
||||||
|
and "ascii" in tested
|
||||||
|
and "utf_8" in tested
|
||||||
|
):
|
||||||
|
probable_result: CharsetMatch = early_stop_results.best() # type: ignore[assignment]
|
||||||
|
logger.debug(
|
||||||
|
"Encoding detection: %s is most likely the one.",
|
||||||
|
probable_result.encoding,
|
||||||
|
)
|
||||||
|
if explain:
|
||||||
|
logger.removeHandler(explain_handler)
|
||||||
|
logger.setLevel(previous_logger_level)
|
||||||
|
return CharsetMatches([probable_result])
|
||||||
|
|
||||||
|
continue
|
||||||
|
else:
|
||||||
|
# The previous encoding with identical output failed chaos probing.
|
||||||
|
tested_but_soft_failure.append(encoding_iana)
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"%s fast-skipped (identical decoded payload to a prior encoding that failed chaos probing).",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
# Prepare fallbacks for special encodings even when skipped.
|
||||||
|
if enable_fallback and encoding_iana in [
|
||||||
|
"ascii",
|
||||||
|
"utf_8",
|
||||||
|
specified_encoding,
|
||||||
|
"utf_16",
|
||||||
|
"utf_32",
|
||||||
|
]:
|
||||||
|
fallback_entry = CharsetMatch(
|
||||||
|
sequences,
|
||||||
|
encoding_iana,
|
||||||
|
threshold,
|
||||||
|
bom_or_sig_available,
|
||||||
|
[],
|
||||||
|
decoded_payload,
|
||||||
|
preemptive_declaration=specified_encoding,
|
||||||
|
)
|
||||||
|
if encoding_iana == specified_encoding:
|
||||||
|
fallback_specified = fallback_entry
|
||||||
|
elif encoding_iana == "ascii":
|
||||||
|
fallback_ascii = fallback_entry
|
||||||
|
else:
|
||||||
|
fallback_u8 = fallback_entry
|
||||||
|
continue
|
||||||
|
|
||||||
|
max_chunk_gave_up: int = int(len(r_) / 4)
|
||||||
|
|
||||||
|
max_chunk_gave_up = max(max_chunk_gave_up, 2)
|
||||||
|
early_stop_count: int = 0
|
||||||
|
lazy_str_hard_failure = False
|
||||||
|
|
||||||
|
md_chunks: list[str] = []
|
||||||
|
md_ratios = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
for chunk in cut_sequence_chunks(
|
||||||
|
sequences,
|
||||||
|
encoding_iana,
|
||||||
|
r_,
|
||||||
|
chunk_size,
|
||||||
|
bom_or_sig_available,
|
||||||
|
strip_sig_or_bom,
|
||||||
|
sig_payload,
|
||||||
|
is_multi_byte_decoder,
|
||||||
|
decoded_payload,
|
||||||
|
):
|
||||||
|
md_chunks.append(chunk)
|
||||||
|
|
||||||
|
md_ratios.append(
|
||||||
|
mess_ratio(
|
||||||
|
chunk,
|
||||||
|
threshold,
|
||||||
|
explain is True and 1 <= len(cp_isolation) <= 2,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
if md_ratios[-1] >= threshold:
|
||||||
|
early_stop_count += 1
|
||||||
|
|
||||||
|
if (early_stop_count >= max_chunk_gave_up) or (
|
||||||
|
bom_or_sig_available and strip_sig_or_bom is False
|
||||||
|
):
|
||||||
|
break
|
||||||
|
except (
|
||||||
|
UnicodeDecodeError
|
||||||
|
) as e: # Lazy str loading may have missed something there
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"LazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %s",
|
||||||
|
encoding_iana,
|
||||||
|
str(e),
|
||||||
|
)
|
||||||
|
early_stop_count = max_chunk_gave_up
|
||||||
|
lazy_str_hard_failure = True
|
||||||
|
|
||||||
|
# We might want to check the sequence again with the whole content
|
||||||
|
# Only if the initial MD tests pass
|
||||||
|
if (
|
||||||
|
not lazy_str_hard_failure
|
||||||
|
and is_too_large_sequence
|
||||||
|
and not is_multi_byte_decoder
|
||||||
|
):
|
||||||
|
try:
|
||||||
|
sequences[int(50e3) :].decode(encoding_iana, errors="strict")
|
||||||
|
except UnicodeDecodeError as e:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %s",
|
||||||
|
encoding_iana,
|
||||||
|
str(e),
|
||||||
|
)
|
||||||
|
tested_but_hard_failure.append(encoding_iana)
|
||||||
|
continue
|
||||||
|
|
||||||
|
mean_mess_ratio: float = sum(md_ratios) / len(md_ratios) if md_ratios else 0.0
|
||||||
|
if mean_mess_ratio >= threshold or early_stop_count >= max_chunk_gave_up:
|
||||||
|
tested_but_soft_failure.append(encoding_iana)
|
||||||
|
if encoding_iana in IANA_SUPPORTED_SIMILAR:
|
||||||
|
soft_failure_skip.update(IANA_SUPPORTED_SIMILAR[encoding_iana])
|
||||||
|
# Cache this soft-failure so identical decoding from other encodings
|
||||||
|
# can be skipped immediately.
|
||||||
|
if decoded_payload is not None and not is_multi_byte_decoder:
|
||||||
|
payload_result_cache.setdefault(
|
||||||
|
hash(decoded_payload), (mean_mess_ratio, [], False)
|
||||||
|
)
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"%s was excluded because of initial chaos probing. Gave up %i time(s). "
|
||||||
|
"Computed mean chaos is %f %%.",
|
||||||
|
encoding_iana,
|
||||||
|
early_stop_count,
|
||||||
|
round(mean_mess_ratio * 100, ndigits=3),
|
||||||
|
)
|
||||||
|
# Preparing those fallbacks in case we got nothing.
|
||||||
|
if (
|
||||||
|
enable_fallback
|
||||||
|
and encoding_iana
|
||||||
|
in ["ascii", "utf_8", specified_encoding, "utf_16", "utf_32"]
|
||||||
|
and not lazy_str_hard_failure
|
||||||
|
):
|
||||||
|
fallback_entry = CharsetMatch(
|
||||||
|
sequences,
|
||||||
|
encoding_iana,
|
||||||
|
threshold,
|
||||||
|
bom_or_sig_available,
|
||||||
|
[],
|
||||||
|
decoded_payload,
|
||||||
|
preemptive_declaration=specified_encoding,
|
||||||
|
)
|
||||||
|
if encoding_iana == specified_encoding:
|
||||||
|
fallback_specified = fallback_entry
|
||||||
|
elif encoding_iana == "ascii":
|
||||||
|
fallback_ascii = fallback_entry
|
||||||
|
else:
|
||||||
|
fallback_u8 = fallback_entry
|
||||||
|
continue
|
||||||
|
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"%s passed initial chaos probing. Mean measured chaos is %f %%",
|
||||||
|
encoding_iana,
|
||||||
|
round(mean_mess_ratio * 100, ndigits=3),
|
||||||
|
)
|
||||||
|
|
||||||
|
if not is_multi_byte_decoder:
|
||||||
|
target_languages: list[str] = encoding_languages(encoding_iana)
|
||||||
|
else:
|
||||||
|
target_languages = mb_encoding_languages(encoding_iana)
|
||||||
|
|
||||||
|
if target_languages:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"{} should target any language(s) of {}".format(
|
||||||
|
encoding_iana, str(target_languages)
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
cd_ratios = []
|
||||||
|
|
||||||
|
# Run coherence detection on all chunks. We previously tried limiting to
|
||||||
|
# 1-2 chunks for post-definitive encodings to save time, but this caused
|
||||||
|
# coverage regressions by producing unrepresentative coherence scores.
|
||||||
|
# The SB cap and language-family skip optimizations provide sufficient
|
||||||
|
# speedup without sacrificing coherence accuracy.
|
||||||
|
if encoding_iana != "ascii":
|
||||||
|
# We skip the CD when it's about ASCII.
|
||||||
|
# Most of the time it's not relevant to run "language-detection" on it.
|
||||||
|
for chunk in md_chunks:
|
||||||
|
chunk_languages = coherence_ratio(
|
||||||
|
chunk,
|
||||||
|
language_threshold,
|
||||||
|
",".join(target_languages) if target_languages else None,
|
||||||
|
)
|
||||||
|
|
||||||
|
cd_ratios.append(chunk_languages)
|
||||||
|
cd_ratios_merged = merge_coherence_ratios(cd_ratios)
|
||||||
|
else:
|
||||||
|
cd_ratios_merged = merge_coherence_ratios(cd_ratios)
|
||||||
|
|
||||||
|
if cd_ratios_merged:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"We detected language {} using {}".format(
|
||||||
|
cd_ratios_merged, encoding_iana
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
current_match = CharsetMatch(
|
||||||
|
sequences,
|
||||||
|
encoding_iana,
|
||||||
|
mean_mess_ratio,
|
||||||
|
bom_or_sig_available,
|
||||||
|
cd_ratios_merged,
|
||||||
|
(
|
||||||
|
decoded_payload
|
||||||
|
if (
|
||||||
|
is_too_large_sequence is False
|
||||||
|
or encoding_iana in [specified_encoding, "ascii", "utf_8"]
|
||||||
|
)
|
||||||
|
else None
|
||||||
|
),
|
||||||
|
preemptive_declaration=specified_encoding,
|
||||||
|
)
|
||||||
|
|
||||||
|
results.append(current_match)
|
||||||
|
|
||||||
|
# Cache the successful result for payload-hash deduplication.
|
||||||
|
if decoded_payload is not None and not is_multi_byte_decoder:
|
||||||
|
payload_result_cache.setdefault(
|
||||||
|
hash(decoded_payload),
|
||||||
|
(mean_mess_ratio, cd_ratios_merged, True),
|
||||||
|
)
|
||||||
|
|
||||||
|
# Count post-definitive same-family SB successes for the early termination cap.
|
||||||
|
# Only count low-mess encodings (< 2%) toward the cap. High-mess encodings are
|
||||||
|
# marginal results that shouldn't prevent better-quality candidates from being
|
||||||
|
# tested. For example, iso8859_4 (mess=0%) should not be skipped just because
|
||||||
|
# 7 high-mess Latin encodings (cp1252 at 8%, etc.) were tried first.
|
||||||
|
if (
|
||||||
|
definitive_match_found
|
||||||
|
and not is_multi_byte_decoder
|
||||||
|
and mean_mess_ratio < 0.02
|
||||||
|
):
|
||||||
|
post_definitive_sb_success_count += 1
|
||||||
|
|
||||||
|
if (
|
||||||
|
encoding_iana in [specified_encoding, "ascii", "utf_8"]
|
||||||
|
and mean_mess_ratio < 0.1
|
||||||
|
):
|
||||||
|
# If md says nothing to worry about, then... stop immediately!
|
||||||
|
if mean_mess_ratio == 0.0:
|
||||||
|
logger.debug(
|
||||||
|
"Encoding detection: %s is most likely the one.",
|
||||||
|
current_match.encoding,
|
||||||
|
)
|
||||||
|
if explain: # Defensive: ensure exit path clean handler
|
||||||
|
logger.removeHandler(explain_handler)
|
||||||
|
logger.setLevel(previous_logger_level)
|
||||||
|
return CharsetMatches([current_match])
|
||||||
|
|
||||||
|
early_stop_results.append(current_match)
|
||||||
|
|
||||||
|
if (
|
||||||
|
len(early_stop_results)
|
||||||
|
and (specified_encoding is None or specified_encoding in tested)
|
||||||
|
and "ascii" in tested
|
||||||
|
and "utf_8" in tested
|
||||||
|
):
|
||||||
|
probable_result = early_stop_results.best() # type: ignore[assignment]
|
||||||
|
logger.debug(
|
||||||
|
"Encoding detection: %s is most likely the one.",
|
||||||
|
probable_result.encoding, # type: ignore[union-attr]
|
||||||
|
)
|
||||||
|
if explain: # Defensive: ensure exit path clean handler
|
||||||
|
logger.removeHandler(explain_handler)
|
||||||
|
logger.setLevel(previous_logger_level)
|
||||||
|
|
||||||
|
return CharsetMatches([probable_result])
|
||||||
|
|
||||||
|
# Once we find a result with good coherence (>= 0.5) after testing the
|
||||||
|
# prioritized encodings (ascii, utf_8), activate "definitive mode": skip
|
||||||
|
# encodings that target completely different language families. This avoids
|
||||||
|
# running expensive mess_ratio + coherence_ratio on clearly unrelated
|
||||||
|
# candidates (e.g., Cyrillic encodings when the match is Latin-based).
|
||||||
|
# We require coherence >= 0.5 to avoid false positives (e.g., cp1251 decoding
|
||||||
|
# Hebrew text with 0.0 chaos but wrong language detection at coherence 0.33).
|
||||||
|
if not definitive_match_found and not is_multi_byte_decoder:
|
||||||
|
best_coherence = (
|
||||||
|
max((v for _, v in cd_ratios_merged), default=0.0)
|
||||||
|
if cd_ratios_merged
|
||||||
|
else 0.0
|
||||||
|
)
|
||||||
|
if best_coherence >= 0.5 and "ascii" in tested and "utf_8" in tested:
|
||||||
|
definitive_match_found = True
|
||||||
|
definitive_target_languages.update(target_languages)
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Definitive match found: %s (chaos=%.3f, coherence=%.2f). Encodings targeting different language families will be skipped.",
|
||||||
|
encoding_iana,
|
||||||
|
mean_mess_ratio,
|
||||||
|
best_coherence,
|
||||||
|
)
|
||||||
|
|
||||||
|
# When a non-UTF multibyte encoding passes chaos probing with significant
|
||||||
|
# multibyte content (decoded < 98% of raw), activate mb_definitive_match.
|
||||||
|
# This skips all remaining single-byte encodings which would either soft-fail
|
||||||
|
# (running expensive mess_ratio for nothing) or produce inferior results.
|
||||||
|
if (
|
||||||
|
not mb_definitive_match_found
|
||||||
|
and is_multi_byte_decoder
|
||||||
|
and multi_byte_bonus
|
||||||
|
and decoded_payload is not None
|
||||||
|
and len(decoded_payload) < length * 0.98
|
||||||
|
and encoding_iana
|
||||||
|
not in {
|
||||||
|
"utf_8",
|
||||||
|
"utf_8_sig",
|
||||||
|
"utf_16",
|
||||||
|
"utf_16_be",
|
||||||
|
"utf_16_le",
|
||||||
|
"utf_32",
|
||||||
|
"utf_32_be",
|
||||||
|
"utf_32_le",
|
||||||
|
"utf_7",
|
||||||
|
}
|
||||||
|
and "ascii" in tested
|
||||||
|
and "utf_8" in tested
|
||||||
|
):
|
||||||
|
mb_definitive_match_found = True
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Multi-byte definitive match: %s (chaos=%.3f, decoded=%d/%d=%.1f%%). Single-byte encodings will be skipped.",
|
||||||
|
encoding_iana,
|
||||||
|
mean_mess_ratio,
|
||||||
|
len(decoded_payload),
|
||||||
|
length,
|
||||||
|
len(decoded_payload) / length * 100,
|
||||||
|
)
|
||||||
|
|
||||||
|
if encoding_iana == sig_encoding:
|
||||||
|
logger.debug(
|
||||||
|
"Encoding detection: %s is most likely the one as we detected a BOM or SIG within "
|
||||||
|
"the beginning of the sequence.",
|
||||||
|
encoding_iana,
|
||||||
|
)
|
||||||
|
if explain: # Defensive: ensure exit path clean handler
|
||||||
|
logger.removeHandler(explain_handler)
|
||||||
|
logger.setLevel(previous_logger_level)
|
||||||
|
return CharsetMatches([results[encoding_iana]])
|
||||||
|
|
||||||
|
if len(results) == 0:
|
||||||
|
if fallback_u8 or fallback_ascii or fallback_specified:
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Nothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.",
|
||||||
|
)
|
||||||
|
|
||||||
|
if fallback_specified:
|
||||||
|
logger.debug(
|
||||||
|
"Encoding detection: %s will be used as a fallback match",
|
||||||
|
fallback_specified.encoding,
|
||||||
|
)
|
||||||
|
results.append(fallback_specified)
|
||||||
|
elif (
|
||||||
|
(fallback_u8 and fallback_ascii is None)
|
||||||
|
or (
|
||||||
|
fallback_u8
|
||||||
|
and fallback_ascii
|
||||||
|
and fallback_u8.fingerprint != fallback_ascii.fingerprint
|
||||||
|
)
|
||||||
|
or (fallback_u8 is not None)
|
||||||
|
):
|
||||||
|
logger.debug("Encoding detection: utf_8 will be used as a fallback match")
|
||||||
|
results.append(fallback_u8)
|
||||||
|
elif fallback_ascii:
|
||||||
|
logger.debug("Encoding detection: ascii will be used as a fallback match")
|
||||||
|
results.append(fallback_ascii)
|
||||||
|
|
||||||
|
if results:
|
||||||
|
logger.debug(
|
||||||
|
"Encoding detection: Found %s as plausible (best-candidate) for content. With %i alternatives.",
|
||||||
|
results.best().encoding, # type: ignore
|
||||||
|
len(results) - 1,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
logger.debug("Encoding detection: Unable to determine any suitable charset.")
|
||||||
|
|
||||||
|
if explain:
|
||||||
|
logger.removeHandler(explain_handler)
|
||||||
|
logger.setLevel(previous_logger_level)
|
||||||
|
|
||||||
|
return results
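The fallback branch above appends at most one fallback match when detection yields nothing. A minimal sketch of that priority order follows, assuming a hypothetical helper `pick_fallback` with plain strings standing in for the match objects (neither is part of charset_normalizer). The three-part `elif` in the code above reduces to "utf_8 fallback is not None", because its last clause subsumes the first two.

```python
from __future__ import annotations


def pick_fallback(
    specified: str | None, utf_8: str | None, ascii_: str | None
) -> str | None:
    """Hypothetical helper: mirror the fallback priority of the branch above."""
    if specified is not None:
        return specified  # an explicitly declared encoding wins
    if utf_8 is not None:
        return utf_8  # then the utf_8 fallback
    return ascii_  # finally ascii, or None when nothing decoded at all


print(pick_fallback(None, "utf_8", "ascii"))
```

With `enable_fallback=False`, as in `is_binary`, none of these fallbacks would be available and an empty result simply stays empty.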
|
||||||
|
|
||||||
|
|
||||||
|
def from_fp(
    fp: BinaryIO,
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Same as the function from_bytes, but reads the payload from a binary file pointer that is already open.
    Will not close the file pointer.
    """
    return from_bytes(
        fp.read(),
        steps,
        chunk_size,
        threshold,
        cp_isolation,
        cp_exclusion,
        preemptive_behaviour,
        explain,
        language_threshold,
        enable_fallback,
    )
|
||||||
|
|
||||||
|
|
||||||
|
def from_path(
    path: str | bytes | PathLike,  # type: ignore[type-arg]
    steps: int = 5,
    chunk_size: int = 512,
    threshold: float = 0.20,
    cp_isolation: list[str] | None = None,
    cp_exclusion: list[str] | None = None,
    preemptive_behaviour: bool = True,
    explain: bool = False,
    language_threshold: float = 0.1,
    enable_fallback: bool = True,
) -> CharsetMatches:
    """
    Same as the function from_bytes, with one extra step: opening and reading the given file path in binary mode.
    Can raise IOError.
    """
    with open(path, "rb") as fp:
        return from_fp(
            fp,
            steps,
            chunk_size,
            threshold,
            cp_isolation,
            cp_exclusion,
            preemptive_behaviour,
            explain,
            language_threshold,
            enable_fallback,
        )
|
||||||
|
|
||||||
|
|
||||||
|
def is_binary(
|
||||||
|
fp_or_path_or_payload: PathLike | str | BinaryIO | bytes, # type: ignore[type-arg]
|
||||||
|
steps: int = 5,
|
||||||
|
chunk_size: int = 512,
|
||||||
|
threshold: float = 0.20,
|
||||||
|
cp_isolation: list[str] | None = None,
|
||||||
|
cp_exclusion: list[str] | None = None,
|
||||||
|
preemptive_behaviour: bool = True,
|
||||||
|
explain: bool = False,
|
||||||
|
language_threshold: float = 0.1,
|
||||||
|
enable_fallback: bool = False,
|
||||||
|
) -> bool:
|
||||||
|
"""
|
||||||
|
Detect whether the given input (file, bytes, or path) is binary content, i.e. not text.
|
||||||
|
Uses the same main heuristics and default kwargs as from_bytes, with the sole exception that fallback matches are disabled, so that ASCII-compatible content that is unlikely to be text is judged more strictly.
|
||||||
|
"""
|
||||||
|
if isinstance(fp_or_path_or_payload, (str, PathLike)):
|
||||||
|
guesses = from_path(
|
||||||
|
fp_or_path_or_payload,
|
||||||
|
steps=steps,
|
||||||
|
chunk_size=chunk_size,
|
||||||
|
threshold=threshold,
|
||||||
|
cp_isolation=cp_isolation,
|
||||||
|
cp_exclusion=cp_exclusion,
|
||||||
|
preemptive_behaviour=preemptive_behaviour,
|
||||||
|
explain=explain,
|
||||||
|
language_threshold=language_threshold,
|
||||||
|
enable_fallback=enable_fallback,
|
||||||
|
)
|
||||||
|
elif isinstance(
|
||||||
|
fp_or_path_or_payload,
|
||||||
|
(
|
||||||
|
bytes,
|
||||||
|
bytearray,
|
||||||
|
),
|
||||||
|
):
|
||||||
|
guesses = from_bytes(
|
||||||
|
fp_or_path_or_payload,
|
||||||
|
steps=steps,
|
||||||
|
chunk_size=chunk_size,
|
||||||
|
threshold=threshold,
|
||||||
|
cp_isolation=cp_isolation,
|
||||||
|
cp_exclusion=cp_exclusion,
|
||||||
|
preemptive_behaviour=preemptive_behaviour,
|
||||||
|
explain=explain,
|
||||||
|
language_threshold=language_threshold,
|
||||||
|
enable_fallback=enable_fallback,
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
guesses = from_fp(
|
||||||
|
fp_or_path_or_payload,
|
||||||
|
steps=steps,
|
||||||
|
chunk_size=chunk_size,
|
||||||
|
threshold=threshold,
|
||||||
|
cp_isolation=cp_isolation,
|
||||||
|
cp_exclusion=cp_exclusion,
|
||||||
|
preemptive_behaviour=preemptive_behaviour,
|
||||||
|
explain=explain,
|
||||||
|
language_threshold=language_threshold,
|
||||||
|
enable_fallback=enable_fallback,
|
||||||
|
)
|
||||||
|
|
||||||
|
return not guesses
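`is_binary` dispatches on the type of its first argument: paths go to `from_path`, raw bytes to `from_bytes`, and anything else is treated as an open binary file. The same three-way dispatch can be sketched with the standard library alone. `read_payload` is a hypothetical helper for illustration, not a charset_normalizer function.

```python
from __future__ import annotations

import io
from os import PathLike
from typing import BinaryIO


def read_payload(source: PathLike | str | BinaryIO | bytes | bytearray) -> bytes:
    """Hypothetical helper: normalise a path, raw bytes or open file to bytes."""
    if isinstance(source, (str, PathLike)):
        with open(source, "rb") as fp:  # paths are opened in binary mode
            return fp.read()
    if isinstance(source, (bytes, bytearray)):
        return bytes(source)  # raw payloads are used as-is
    return source.read()  # anything else is assumed to be a binary file object


print(read_payload(b"abc"), read_payload(io.BytesIO(b"abc")))
```

As in the function above, a `bytes` argument is always read as payload, never as a file path.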
|
||||||
Binary file not shown.
@ -0,0 +1,454 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import importlib
|
||||||
|
from codecs import IncrementalDecoder
|
||||||
|
from collections import Counter
|
||||||
|
from functools import lru_cache
|
||||||
|
from typing import Counter as TypeCounter
|
||||||
|
|
||||||
|
from .constant import (
|
||||||
|
FREQUENCIES,
|
||||||
|
KO_NAMES,
|
||||||
|
LANGUAGE_SUPPORTED_COUNT,
|
||||||
|
TOO_SMALL_SEQUENCE,
|
||||||
|
ZH_NAMES,
|
||||||
|
_FREQUENCIES_SET,
|
||||||
|
_FREQUENCIES_RANK,
|
||||||
|
)
|
||||||
|
from .md import is_suspiciously_successive_range
|
||||||
|
from .models import CoherenceMatches
|
||||||
|
from .utils import (
|
||||||
|
is_accentuated,
|
||||||
|
is_latin,
|
||||||
|
is_multi_byte_encoding,
|
||||||
|
is_unicode_range_secondary,
|
||||||
|
unicode_range,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def encoding_unicode_range(iana_name: str) -> list[str]:
|
||||||
|
"""
|
||||||
|
Return the unicode ranges associated with a single-byte code page.
|
||||||
|
"""
|
||||||
|
if is_multi_byte_encoding(iana_name):
|
||||||
|
raise OSError( # Defensive:
|
||||||
|
"Function not supported on multi-byte code page"
|
||||||
|
)
|
||||||
|
|
||||||
|
decoder = importlib.import_module(f"encodings.{iana_name}").IncrementalDecoder
|
||||||
|
|
||||||
|
p: IncrementalDecoder = decoder(errors="ignore")
|
||||||
|
seen_ranges: dict[str, int] = {}
|
||||||
|
character_count: int = 0
|
||||||
|
|
||||||
|
for i in range(0x40, 0xFF):
|
||||||
|
chunk: str = p.decode(bytes([i]))
|
||||||
|
|
||||||
|
if chunk:
|
||||||
|
character_range: str | None = unicode_range(chunk)
|
||||||
|
|
||||||
|
if character_range is None:
|
||||||
|
continue
|
||||||
|
|
||||||
|
if is_unicode_range_secondary(character_range) is False:
|
||||||
|
if character_range not in seen_ranges:
|
||||||
|
seen_ranges[character_range] = 0
|
||||||
|
seen_ranges[character_range] += 1
|
||||||
|
character_count += 1
|
||||||
|
|
||||||
|
return sorted(
|
||||||
|
[
|
||||||
|
character_range
|
||||||
|
for character_range in seen_ranges
|
||||||
|
if seen_ranges[character_range] / character_count >= 0.15
|
||||||
|
]
|
||||||
|
)
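The loop above decodes every byte from 0x40 to 0xFE through the code page's incremental decoder and tallies the unicode range of each resulting character. A rough standard-library sketch of the same idea follows. It assumes the first word of the Unicode character name ("LATIN", "CYRILLIC", ...) as a stand-in for the library's `unicode_range()` lookup, and `script_tally` is a hypothetical name. It does not reproduce the 15% share filter or the secondary-range exclusion.

```python
import importlib
import unicodedata
from collections import Counter


def script_tally(codec_name: str) -> Counter:
    """Hypothetical helper: count alphabetic characters per script for one code page."""
    decoder = importlib.import_module(f"encodings.{codec_name}").IncrementalDecoder(
        errors="ignore"
    )
    tally: Counter = Counter()
    for byte in range(0x40, 0xFF):
        char = decoder.decode(bytes([byte]))  # "" when the byte is undefined
        if char and char.isalpha():
            # First word of the Unicode name approximates the character's script.
            tally[unicodedata.name(char, "UNKNOWN").split()[0]] += 1
    return tally


print(script_tally("cp1251").most_common(2))
```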
|
||||||
|
|
||||||
|
|
||||||
|
def unicode_range_languages(primary_range: str) -> list[str]:
    """
    Return inferred languages used with a unicode range.
    """
    languages: list[str] = []

    for language, characters in FREQUENCIES.items():
        for character in characters:
            if unicode_range(character) == primary_range:
                languages.append(language)
                break

    return languages
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache()
def encoding_languages(iana_name: str) -> list[str]:
    """
    Single-byte encoding language association. Some code pages are heavily linked to particular language(s).
    This function resolves that correspondence.
    """
    unicode_ranges: list[str] = encoding_unicode_range(iana_name)
    primary_range: str | None = None

    for specified_range in unicode_ranges:
        if "Latin" not in specified_range:
            primary_range = specified_range
            break

    if primary_range is None:
        return ["Latin Based"]

    return unicode_range_languages(primary_range)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache()
def mb_encoding_languages(iana_name: str) -> list[str]:
    """
    Multi-byte encoding language association. Some code pages are heavily linked to particular language(s).
    This function resolves that correspondence.
    """
    if (
        iana_name.startswith("shift_")
        or iana_name.startswith("iso2022_jp")
        or iana_name.startswith("euc_j")
        or iana_name == "cp932"
    ):
        return ["Japanese"]
    if iana_name.startswith("gb") or iana_name in ZH_NAMES:
        return ["Chinese"]
    if iana_name.startswith("iso2022_kr") or iana_name in KO_NAMES:
        return ["Korean"]

    return []
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=LANGUAGE_SUPPORTED_COUNT)
def get_target_features(language: str) -> tuple[bool, bool]:
    """
    Determine two main traits of a supported language: whether it contains accents and whether it is pure Latin.
    """
    target_have_accents: bool = False
    target_pure_latin: bool = True

    for character in FREQUENCIES[language]:
        if not target_have_accents and is_accentuated(character):
            target_have_accents = True
        if target_pure_latin and is_latin(character) is False:
            target_pure_latin = False

    return target_have_accents, target_pure_latin
|
||||||
|
|
||||||
|
|
||||||
|
def alphabet_languages(
|
||||||
|
characters: list[str], ignore_non_latin: bool = False
|
||||||
|
) -> list[str]:
|
||||||
|
"""
|
||||||
|
Return the languages associated with the given characters.
|
||||||
|
"""
|
||||||
|
languages: list[tuple[str, float]] = []
|
||||||
|
|
||||||
|
characters_set: frozenset[str] = frozenset(characters)
|
||||||
|
source_have_accents = any(is_accentuated(character) for character in characters)
|
||||||
|
|
||||||
|
for language, language_characters in FREQUENCIES.items():
|
||||||
|
target_have_accents, target_pure_latin = get_target_features(language)
|
||||||
|
|
||||||
|
if ignore_non_latin and target_pure_latin is False:
|
||||||
|
continue
|
||||||
|
|
||||||
|
if target_have_accents is False and source_have_accents:
|
||||||
|
continue
|
||||||
|
|
||||||
|
character_count: int = len(language_characters)
|
||||||
|
|
||||||
|
character_match_count: int = len(_FREQUENCIES_SET[language] & characters_set)
|
||||||
|
|
||||||
|
ratio: float = character_match_count / character_count
|
||||||
|
|
||||||
|
if ratio >= 0.2:
|
||||||
|
languages.append((language, ratio))
|
||||||
|
|
||||||
|
languages = sorted(languages, key=lambda x: x[1], reverse=True)
|
||||||
|
|
||||||
|
return [compatible_language[0] for compatible_language in languages]
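The core of `alphabet_languages` is an overlap ratio: the share of a language's frequent characters that also occur in the input, with a 0.2 cut-off and a descending sort. A minimal sketch follows, assuming a toy two-language table (`TOY_FREQUENCIES` is illustrative, not the library's `FREQUENCIES`) and leaving out the accent and pure-Latin pre-filters.

```python
from __future__ import annotations

# Illustrative table only; the real FREQUENCIES constant is far larger.
TOY_FREQUENCIES = {
    "English": "etaoinshrd",
    "Greek": "αειοντσρκπ",
}


def candidate_languages(characters: list[str], minimum_ratio: float = 0.2) -> list[str]:
    """Hypothetical helper: rank languages by shared-character ratio."""
    observed = frozenset(characters)
    scored: list[tuple[str, float]] = []
    for language, alphabet in TOY_FREQUENCIES.items():
        ratio = len(frozenset(alphabet) & observed) / len(alphabet)
        if ratio >= minimum_ratio:
            scored.append((language, ratio))
    scored.sort(key=lambda item: item[1], reverse=True)
    return [language for language, _ in scored]


print(candidate_languages(list("tea")))
```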
|
||||||
|
|
||||||
|
|
||||||
|
def characters_popularity_compare(
|
||||||
|
language: str, ordered_characters: list[str]
|
||||||
|
) -> float:
|
||||||
|
"""
|
||||||
|
Determine whether an ordered list of characters (from most to least frequent) matches a particular language.
|
||||||
|
The result is a ratio between 0. (absolutely no correspondence) and 1. (near perfect fit).
|
||||||
|
Beware that this function is not strict on the match, in order to ease detection (meaning a close match scores 1.).
|
||||||
|
"""
|
||||||
|
if language not in FREQUENCIES:
|
||||||
|
raise ValueError(f"{language} not available") # Defensive:
|
||||||
|
|
||||||
|
character_approved_count: int = 0
|
||||||
|
frequencies_language_set: frozenset[str] = _FREQUENCIES_SET[language]
|
||||||
|
lang_rank: dict[str, int] = _FREQUENCIES_RANK[language]
|
||||||
|
|
||||||
|
ordered_characters_count: int = len(ordered_characters)
|
||||||
|
target_language_characters_count: int = len(FREQUENCIES[language])
|
||||||
|
|
||||||
|
large_alphabet: bool = target_language_characters_count > 26
|
||||||
|
|
||||||
|
expected_projection_ratio: float = (
|
||||||
|
target_language_characters_count / ordered_characters_count
|
||||||
|
)
|
||||||
|
|
||||||
|
# Pre-built rank dict for ordered_characters (avoids repeated list slicing).
|
||||||
|
ordered_rank: dict[str, int] = {
|
||||||
|
char: rank for rank, char in enumerate(ordered_characters)
|
||||||
|
}
|
||||||
|
|
||||||
|
# Pre-compute characters common to both orderings.
|
||||||
|
# Avoids repeated `c in ordered_rank` dict lookups in the inner counts.
|
||||||
|
common_chars: list[tuple[int, int]] = [
|
||||||
|
(lr, ordered_rank[c]) for c, lr in lang_rank.items() if c in ordered_rank
|
||||||
|
]
|
||||||
|
|
||||||
|
# Pre-extract lr and orr arrays for faster iteration in the inner loop.
|
||||||
|
# Plain integer loops with local arrays are much faster under mypyc than
|
||||||
|
# generator expression sums over a list of tuples.
|
||||||
|
common_count: int = len(common_chars)
|
||||||
|
common_lr: list[int] = [p[0] for p in common_chars]
|
||||||
|
common_orr: list[int] = [p[1] for p in common_chars]
|
||||||
|
|
||||||
|
for character, character_rank in zip(
|
||||||
|
ordered_characters, range(0, ordered_characters_count)
|
||||||
|
):
|
||||||
|
if character not in frequencies_language_set:
|
||||||
|
continue
|
||||||
|
|
||||||
|
character_rank_in_language: int = lang_rank[character]
|
||||||
|
character_rank_projection: int = int(character_rank * expected_projection_ratio)
|
||||||
|
|
||||||
|
if (
|
||||||
|
large_alphabet is False
|
||||||
|
and abs(character_rank_projection - character_rank_in_language) > 4
|
||||||
|
):
|
||||||
|
continue
|
||||||
|
|
||||||
|
if (
|
||||||
|
large_alphabet is True
|
||||||
|
and abs(character_rank_projection - character_rank_in_language)
|
||||||
|
< target_language_characters_count / 3
|
||||||
|
):
|
||||||
|
character_approved_count += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Count how many characters appear "before" in both orderings,
|
||||||
|
# and how many appear "at or after" in both orderings.
|
||||||
|
# Single pass over pre-extracted arrays — much faster under mypyc
|
||||||
|
# than two generator expression sums.
|
||||||
|
before_match_count: int = 0
|
||||||
|
after_match_count: int = 0
|
||||||
|
for i in range(common_count):
|
||||||
|
lr_i: int = common_lr[i]
|
||||||
|
orr_i: int = common_orr[i]
|
||||||
|
if lr_i < character_rank_in_language:
|
||||||
|
if orr_i < character_rank:
|
||||||
|
before_match_count += 1
|
||||||
|
else:
|
||||||
|
if orr_i >= character_rank:
|
||||||
|
after_match_count += 1
|
||||||
|
|
||||||
|
after_len: int = target_language_characters_count - character_rank_in_language
|
||||||
|
|
||||||
|
if character_rank_in_language == 0 and before_match_count <= 4:
|
||||||
|
character_approved_count += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
if after_len == 0 and after_match_count <= 4:
|
||||||
|
character_approved_count += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
if (
|
||||||
|
character_rank_in_language > 0
|
||||||
|
and before_match_count / character_rank_in_language >= 0.4
|
||||||
|
) or (after_len > 0 and after_match_count / after_len >= 0.4):
|
||||||
|
character_approved_count += 1
|
||||||
|
continue
|
||||||
|
|
||||||
|
return character_approved_count / len(ordered_characters)
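Before comparing ranks, the function above rescales each observed rank onto the language's rank scale using `expected_projection_ratio`. The arithmetic can be sketched in isolation. `project_rank` is a hypothetical name, and the surrounding approval rules are not reproduced here.

```python
def project_rank(observed_rank: int, language_size: int, observed_size: int) -> int:
    """Hypothetical helper: scale an observed rank onto the language's rank scale."""
    expected_projection_ratio = language_size / observed_size
    return int(observed_rank * expected_projection_ratio)


# A 26-character language compared against 13 distinct observed characters:
# the observed rank 3 lands on rank 6 of the language ordering.
print(project_rank(3, 26, 13))
```

The distance between this projected rank and the character's actual rank in the language is what the `> 4` and `< target_language_characters_count / 3` checks above operate on.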
|
||||||
|
|
||||||
|
|
||||||
|
def alpha_unicode_split(decoded_sequence: str) -> list[str]:
|
||||||
|
"""
|
||||||
|
Given a decoded text sequence, return a list of str. Unicode range / alphabet separation.
|
||||||
|
E.g. a text containing English/Latin with a bit of Hebrew will return two items in the resulting list: one containing the Latin letters and the other the Hebrew letters.
|
||||||
|
"""
|
||||||
|
layers: dict[str, list[str]] = {}
|
||||||
|
|
||||||
|
# Fast path: track single-layer key to skip dict iteration for single-script text.
|
||||||
|
single_layer_key: str | None = None
|
||||||
|
multi_layer: bool = False
|
||||||
|
|
||||||
|
# Cache the last character_range and its resolved layer to avoid repeated
|
||||||
|
# is_suspiciously_successive_range calls for consecutive same-range chars.
|
||||||
|
prev_character_range: str | None = None
|
||||||
|
prev_layer_target: str | None = None
|
||||||
|
|
||||||
|
for character in decoded_sequence:
|
||||||
|
if character.isalpha() is False:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# ASCII fast-path: a-z and A-Z are always "Basic Latin".
|
||||||
|
# Avoids unicode_range() function call overhead for the most common case.
|
||||||
|
character_ord: int = ord(character)
|
||||||
|
if character_ord < 128:
|
||||||
|
character_range: str | None = "Basic Latin"
|
||||||
|
else:
|
||||||
|
character_range = unicode_range(character)
|
||||||
|
|
||||||
|
if character_range is None:
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Fast path: same range as previous character → reuse cached layer target.
|
||||||
|
if character_range == prev_character_range:
|
||||||
|
if prev_layer_target is not None:
|
||||||
|
layers[prev_layer_target].append(character)
|
||||||
|
continue
|
||||||
|
|
||||||
|
layer_target_range: str | None = None
|
||||||
|
|
||||||
|
if multi_layer:
|
||||||
|
for discovered_range in layers:
|
||||||
|
if (
|
||||||
|
is_suspiciously_successive_range(discovered_range, character_range)
|
||||||
|
is False
|
||||||
|
):
|
||||||
|
layer_target_range = discovered_range
|
||||||
|
break
|
||||||
|
elif single_layer_key is not None:
|
||||||
|
if (
|
||||||
|
is_suspiciously_successive_range(single_layer_key, character_range)
|
||||||
|
is False
|
||||||
|
):
|
||||||
|
layer_target_range = single_layer_key
|
||||||
|
|
||||||
|
if layer_target_range is None:
|
||||||
|
layer_target_range = character_range
|
||||||
|
|
||||||
|
if layer_target_range not in layers:
|
||||||
|
layers[layer_target_range] = []
|
||||||
|
if single_layer_key is None:
|
||||||
|
single_layer_key = layer_target_range
|
||||||
|
else:
|
||||||
|
multi_layer = True
|
||||||
|
|
||||||
|
layers[layer_target_range].append(character)
|
||||||
|
|
||||||
|
# Cache for next iteration
|
||||||
|
prev_character_range = character_range
|
||||||
|
prev_layer_target = layer_target_range
|
||||||
|
|
||||||
|
return ["".join(chars).lower() for chars in layers.values()]
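The layering idea above can be illustrated with a much simpler grouping. This is a sketch under stated assumptions: it keys on the first word of the Unicode character name instead of calling `unicode_range()`, and it never merges related ranges the way `is_suspiciously_successive_range` allows. `split_by_script` is a hypothetical name.

```python
from __future__ import annotations

import unicodedata


def split_by_script(text: str) -> list[str]:
    """Hypothetical helper: group alphabetic characters by script, lowercased."""
    layers: dict[str, list[str]] = {}
    for char in text:
        if not char.isalpha():
            continue  # digits, spaces and punctuation never form a layer
        script = unicodedata.name(char, "UNKNOWN").split()[0]
        layers.setdefault(script, []).append(char)
    return ["".join(chars).lower() for chars in layers.values()]


print(split_by_script("Hello, мир!"))
```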
|
||||||
|
|
||||||
|
|
||||||
|
def merge_coherence_ratios(results: list[CoherenceMatches]) -> CoherenceMatches:
|
||||||
|
"""
|
||||||
|
This function merges results previously returned by the function coherence_ratio.
|
||||||
|
The return type is the same as coherence_ratio.
|
||||||
|
"""
|
||||||
|
per_language_ratios: dict[str, list[float]] = {}
|
||||||
|
for result in results:
|
||||||
|
for sub_result in result:
|
||||||
|
language, ratio = sub_result
|
||||||
|
if language not in per_language_ratios:
|
||||||
|
per_language_ratios[language] = [ratio]
|
||||||
|
continue
|
||||||
|
per_language_ratios[language].append(ratio)
|
||||||
|
|
||||||
|
merge = [
|
||||||
|
(
|
||||||
|
language,
|
||||||
|
round(
|
||||||
|
sum(per_language_ratios[language]) / len(per_language_ratios[language]),
|
||||||
|
4,
|
||||||
|
),
|
||||||
|
)
|
||||||
|
for language in per_language_ratios
|
||||||
|
]
|
||||||
|
|
||||||
|
return sorted(merge, key=lambda x: x[1], reverse=True)
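A small worked example of the merge step: ratios for the same language are averaged across chunks, rounded to four decimals, and sorted in descending order. `merge_ratios` below is a standalone re-statement for illustration, not the library function itself.

```python
from __future__ import annotations


def merge_ratios(
    results: list[list[tuple[str, float]]],
) -> list[tuple[str, float]]:
    """Illustrative re-statement: average each language's ratios across chunks."""
    per_language: dict[str, list[float]] = {}
    for result in results:
        for language, ratio in result:
            per_language.setdefault(language, []).append(ratio)
    merged = [
        (language, round(sum(ratios) / len(ratios), 4))
        for language, ratios in per_language.items()
    ]
    return sorted(merged, key=lambda item: item[1], reverse=True)


chunks = [[("English", 0.75), ("French", 0.25)], [("English", 0.25)]]
print(merge_ratios(chunks))
```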
|
||||||
|
|
||||||
|
|
||||||
|
def filter_alt_coherence_matches(results: CoherenceMatches) -> CoherenceMatches:
|
||||||
|
"""
|
||||||
|
We shall NOT return "English—" in CoherenceMatches because it is an alternative of "English". This function keeps only the best match and removes the em-dash from its name.
|
||||||
|
"""
|
||||||
|
index_results: dict[str, list[float]] = dict()
|
||||||
|
|
||||||
|
for result in results:
|
||||||
|
language, ratio = result
|
||||||
|
no_em_name: str = language.replace("—", "")
|
||||||
|
|
||||||
|
if no_em_name not in index_results:
|
||||||
|
index_results[no_em_name] = []
|
||||||
|
|
||||||
|
index_results[no_em_name].append(ratio)
|
||||||
|
|
||||||
|
if any(len(index_results[e]) > 1 for e in index_results):
|
||||||
|
filtered_results: CoherenceMatches = []
|
||||||
|
|
||||||
|
for language in index_results:
|
||||||
|
filtered_results.append((language, max(index_results[language])))
|
||||||
|
|
||||||
|
return filtered_results
|
||||||
|
|
||||||
|
return results
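To make the behaviour concrete, here is a standalone sketch of the same filtering. `keep_best_variant` is a hypothetical name. Like the function above, it returns the input unchanged when no language appears in more than one variant.

```python
from __future__ import annotations

EM_DASH = "\u2014"  # suffix marking an alternative frequency table


def keep_best_variant(results: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Illustrative re-statement: keep the best ratio per base language name."""
    grouped: dict[str, list[float]] = {}
    for language, ratio in results:
        grouped.setdefault(language.replace(EM_DASH, ""), []).append(ratio)
    if all(len(ratios) == 1 for ratios in grouped.values()):
        return results  # nothing to collapse
    return [(name, max(ratios)) for name, ratios in grouped.items()]


sample = [("English", 0.5), ("English" + EM_DASH, 0.75), ("French", 0.25)]
print(keep_best_variant(sample))
```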
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=2048)
|
||||||
|
def coherence_ratio(
|
||||||
|
decoded_sequence: str, threshold: float = 0.1, lg_inclusion: str | None = None
|
||||||
|
) -> CoherenceMatches:
|
||||||
|
"""
|
||||||
|
Detect ANY language that can be identified in the given sequence. The sequence will be analysed by layers.
|
||||||
|
A layer = Character extraction by alphabets/ranges.
|
||||||
|
"""
|
||||||
|
|
||||||
|
results: list[tuple[str, float]] = []
|
||||||
|
ignore_non_latin: bool = False
|
||||||
|
|
||||||
|
sufficient_match_count: int = 0
|
||||||
|
|
||||||
|
lg_inclusion_list = lg_inclusion.split(",") if lg_inclusion is not None else []
|
||||||
|
if "Latin Based" in lg_inclusion_list:
|
||||||
|
ignore_non_latin = True
|
||||||
|
lg_inclusion_list.remove("Latin Based")
|
||||||
|
|
||||||
|
for layer in alpha_unicode_split(decoded_sequence):
|
||||||
|
sequence_frequencies: TypeCounter[str] = Counter(layer)
|
||||||
|
most_common = sequence_frequencies.most_common()
|
||||||
|
|
||||||
|
character_count: int = len(layer)
|
||||||
|
|
||||||
|
if character_count <= TOO_SMALL_SEQUENCE:
|
||||||
|
continue
|
||||||
|
|
||||||
|
popular_character_ordered: list[str] = [c for c, o in most_common]
|
||||||
|
|
||||||
|
for language in lg_inclusion_list or alphabet_languages(
|
||||||
|
popular_character_ordered, ignore_non_latin
|
||||||
|
):
|
||||||
|
ratio: float = characters_popularity_compare(
|
||||||
|
language, popular_character_ordered
|
||||||
|
)
|
||||||
|
|
||||||
|
if ratio < threshold:
|
||||||
|
continue
|
||||||
|
elif ratio >= 0.8:
|
||||||
|
sufficient_match_count += 1
|
||||||
|
|
||||||
|
results.append((language, round(ratio, 4)))
|
||||||
|
|
||||||
|
if sufficient_match_count >= 3:
|
||||||
|
break
|
||||||
|
|
||||||
|
return sorted(
|
||||||
|
filter_alt_coherence_matches(results), key=lambda x: x[1], reverse=True
|
||||||
|
)
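The `popular_character_ordered` list that feeds the comparison is simply each layer's characters sorted by frequency. A minimal sketch of that step with `collections.Counter` follows (`ordered_by_popularity` is a hypothetical name):

```python
from __future__ import annotations

from collections import Counter


def ordered_by_popularity(layer: str) -> list[str]:
    """Hypothetical helper: characters of a layer, most frequent first."""
    return [char for char, _count in Counter(layer).most_common()]


print(ordered_by_popularity("banana"))
```

`Counter.most_common()` keeps first-seen order among characters with equal counts, so ties depend on the input order.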
|
||||||
@ -0,0 +1,8 @@
|
|||||||
|
from __future__ import annotations

from .__main__ import cli_detect, query_yes_no

__all__ = (
    "cli_detect",
    "query_yes_no",
)
|
||||||
@ -0,0 +1,362 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sys
|
||||||
|
import typing
|
||||||
|
from json import dumps
|
||||||
|
from os.path import abspath, basename, dirname, join, realpath
|
||||||
|
from platform import python_version
|
||||||
|
from unicodedata import unidata_version
|
||||||
|
|
||||||
|
import charset_normalizer.md as md_module
|
||||||
|
from charset_normalizer import from_fp
|
||||||
|
from charset_normalizer.models import CliDetectionResult
|
||||||
|
from charset_normalizer.version import __version__
|
||||||
|
|
||||||
|
|
||||||
|
def query_yes_no(question: str, default: str = "yes") -> bool: # Defensive:
|
||||||
|
"""Ask a yes/no question via input() and return the answer as a bool."""
|
||||||
|
prompt = " [Y/n] " if default == "yes" else " [y/N] "
|
||||||
|
|
||||||
|
while True:
|
||||||
|
choice = input(question + prompt).strip().lower()
|
||||||
|
if not choice:
|
||||||
|
return default == "yes"
|
||||||
|
if choice in ("y", "yes"):
|
||||||
|
return True
|
||||||
|
if choice in ("n", "no"):
|
||||||
|
return False
|
||||||
|
print("Please respond with 'y' or 'n'.")
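Because `query_yes_no` calls `input()` directly, it is awkward to exercise without a terminal. One possible variation, shown purely as a sketch, injects the reader function so that answers can be scripted. `ask_yes_no` and its `read` parameter are assumptions, not part of the code above.

```python
from __future__ import annotations

from typing import Callable


def ask_yes_no(
    question: str, default: str = "yes", read: Callable[[str], str] = input
) -> bool:
    """Hypothetical variant of the prompt loop with an injectable reader."""
    prompt = " [Y/n] " if default == "yes" else " [y/N] "
    while True:
        choice = read(question + prompt).strip().lower()
        if not choice:
            return default == "yes"  # empty answer falls back to the default
        if choice in ("y", "yes"):
            return True
        if choice in ("n", "no"):
            return False
        # Any other answer loops and asks again.


answers = iter(["maybe", "Y"])
print(ask_yes_no("Replace file?", read=lambda _prompt: next(answers)))
```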
|
||||||
|
|
||||||
|
|
||||||
|
class FileType:
|
||||||
|
"""Factory for creating file object types
|
||||||
|
|
||||||
|
Instances of FileType are typically passed as type= arguments to the
|
||||||
|
ArgumentParser add_argument() method.
|
||||||
|
|
||||||
|
Keyword Arguments:
|
||||||
|
- mode -- A string indicating how the file is to be opened. Accepts the
|
||||||
|
same values as the builtin open() function.
|
||||||
|
- bufsize -- The file's desired buffer size. Accepts the same values as
|
||||||
|
the builtin open() function.
|
||||||
|
- encoding -- The file's encoding. Accepts the same values as the
|
||||||
|
builtin open() function.
|
||||||
|
- errors -- A string indicating how encoding and decoding errors are to
|
||||||
|
be handled. Accepts the same value as the builtin open() function.
|
||||||
|
|
||||||
|
Backported from CPython 3.12
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
mode: str = "r",
|
||||||
|
bufsize: int = -1,
|
||||||
|
encoding: str | None = None,
|
||||||
|
errors: str | None = None,
|
||||||
|
):
|
||||||
|
self._mode = mode
|
||||||
|
self._bufsize = bufsize
|
||||||
|
self._encoding = encoding
|
||||||
|
self._errors = errors
|
||||||
|
|
||||||
|
def __call__(self, string: str) -> typing.IO: # type: ignore[type-arg]
|
||||||
|
# the special argument "-" means sys.std{in,out}
|
||||||
|
if string == "-":
|
||||||
|
if "r" in self._mode:
|
||||||
|
return sys.stdin.buffer if "b" in self._mode else sys.stdin
|
||||||
|
elif any(c in self._mode for c in "wax"):
|
||||||
|
return sys.stdout.buffer if "b" in self._mode else sys.stdout
|
||||||
|
else:
|
||||||
|
msg = f'argument "-" with mode {self._mode}'
|
||||||
|
raise ValueError(msg)
|
||||||
|
|
||||||
|
# all other arguments are used as file names
|
||||||
|
try:
|
||||||
|
return open(string, self._mode, self._bufsize, self._encoding, self._errors)
|
||||||
|
except OSError as e:
|
||||||
|
message = f"can't open '{string}': {e}"
|
||||||
|
raise argparse.ArgumentTypeError(message)
|
||||||
|
|
||||||
|
def __repr__(self) -> str:
|
||||||
|
args = self._mode, self._bufsize
|
||||||
|
kwargs = [("encoding", self._encoding), ("errors", self._errors)]
|
||||||
|
args_str = ", ".join(
|
||||||
|
[repr(arg) for arg in args if arg != -1]
|
||||||
|
+ [f"{kw}={arg!r}" for kw, arg in kwargs if arg is not None]
|
||||||
|
)
|
||||||
|
return f"{type(self).__name__}({args_str})"
|
||||||
|
|
||||||
|
|
||||||
|
def cli_detect(argv: list[str] | None = None) -> int:
|
||||||
|
"""
|
||||||
|
CLI assistant using ARGV and ArgumentParser
|
||||||
|
:param argv:
|
||||||
|
:return: 0 if everything is fine, anything else signals trouble
|
||||||
|
"""
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="The Real First Universal Charset Detector. "
|
||||||
|
"Discover originating encoding used on text file. "
|
||||||
|
"Normalize text to unicode."
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"files", type=FileType("rb"), nargs="+", help="File(s) to be analysed"
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-v",
|
||||||
|
"--verbose",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="verbose",
|
||||||
|
help="Display complementary information about file if any. "
|
||||||
|
"Stdout will contain logs about the detection process.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-a",
|
||||||
|
"--with-alternative",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="alternatives",
|
||||||
|
help="Output complementary possibilities if any. Top-level JSON WILL be a list.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-n",
|
||||||
|
"--normalize",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="normalize",
|
||||||
|
help="Permit to normalize input file. If not set, program does not write anything.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-m",
|
||||||
|
"--minimal",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="minimal",
|
||||||
|
help="Only output the charset detected to STDOUT. Disabling JSON output.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-r",
|
||||||
|
"--replace",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="replace",
|
||||||
|
help="Replace file when trying to normalize it instead of creating a new one.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-f",
|
||||||
|
"--force",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="force",
|
||||||
|
help="Replace file without asking if you are sure, use this flag with caution.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-i",
|
||||||
|
"--no-preemptive",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="no_preemptive",
|
||||||
|
help="Disable looking at a charset declaration to hint the detector.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-t",
|
||||||
|
"--threshold",
|
||||||
|
action="store",
|
||||||
|
default=0.2,
|
||||||
|
type=float,
|
||||||
|
dest="threshold",
|
||||||
|
help="Define a custom maximum amount of noise allowed in decoded content. 0. <= noise <= 1.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--version",
|
||||||
|
action="version",
|
||||||
|
version="Charset-Normalizer {} - Python {} - Unicode {} - SpeedUp {}".format(
|
||||||
|
__version__,
|
||||||
|
python_version(),
|
||||||
|
unidata_version,
|
||||||
|
"OFF" if md_module.__file__.lower().endswith(".py") else "ON",
|
||||||
|
),
|
||||||
|
help="Show version information and exit.",
|
||||||
|
)
|
||||||
|
|
||||||
|
args = parser.parse_args(argv)
|
||||||
|
|
||||||
|
if args.replace is True and args.normalize is False:
|
||||||
|
if args.files:
|
||||||
|
for my_file in args.files:
|
||||||
|
my_file.close()
|
||||||
|
print("Use --replace only in addition to --normalize.", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
|
||||||
|
if args.force is True and args.replace is False:
|
||||||
|
if args.files:
|
||||||
|
for my_file in args.files:
|
||||||
|
my_file.close()
|
||||||
|
print("Use --force only in addition to --replace.", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
|
||||||
|
if args.threshold < 0.0 or args.threshold > 1.0:
|
||||||
|
if args.files:
|
||||||
|
for my_file in args.files:
|
||||||
|
my_file.close()
|
||||||
|
print("--threshold VALUE should be between 0. AND 1.", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
|
||||||
|
x_ = []
|
||||||
|
|
||||||
|
for my_file in args.files:
|
||||||
|
matches = from_fp(
|
||||||
|
my_file,
|
||||||
|
threshold=args.threshold,
|
||||||
|
explain=args.verbose,
|
||||||
|
preemptive_behaviour=args.no_preemptive is False,
|
||||||
|
)
|
||||||
|
|
||||||
|
best_guess = matches.best()
|
||||||
|
|
||||||
|
if best_guess is None:
|
||||||
|
print(
|
||||||
|
'Unable to identify originating encoding for "{}". {}'.format(
|
||||||
|
my_file.name,
|
||||||
|
(
|
||||||
|
"Maybe try increasing maximum amount of chaos."
|
||||||
|
if args.threshold < 1.0
|
||||||
|
else ""
|
||||||
|
),
|
||||||
|
),
|
||||||
|
file=sys.stderr,
|
||||||
|
)
|
||||||
|
x_.append(
|
||||||
|
CliDetectionResult(
|
||||||
|
abspath(my_file.name),
|
||||||
|
None,
|
||||||
|
[],
|
||||||
|
[],
|
||||||
|
"Unknown",
|
||||||
|
[],
|
||||||
|
False,
|
||||||
|
1.0,
|
||||||
|
0.0,
|
||||||
|
None,
|
||||||
|
True,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
cli_result = CliDetectionResult(
|
||||||
|
abspath(my_file.name),
|
||||||
|
best_guess.encoding,
|
||||||
|
best_guess.encoding_aliases,
|
||||||
|
[
|
||||||
|
cp
|
||||||
|
for cp in best_guess.could_be_from_charset
|
||||||
|
if cp != best_guess.encoding
|
||||||
|
],
|
||||||
|
best_guess.language,
|
||||||
|
best_guess.alphabets,
|
||||||
|
best_guess.bom,
|
||||||
|
best_guess.percent_chaos,
|
||||||
|
best_guess.percent_coherence,
|
||||||
|
None,
|
||||||
|
True,
|
||||||
|
)
|
||||||
|
x_.append(cli_result)
|
||||||
|
|
||||||
|
if len(matches) > 1 and args.alternatives:
|
||||||
|
for el in matches:
|
||||||
|
if el != best_guess:
|
||||||
|
x_.append(
|
||||||
|
CliDetectionResult(
|
||||||
|
abspath(my_file.name),
|
||||||
|
el.encoding,
|
||||||
|
el.encoding_aliases,
|
||||||
|
[
|
||||||
|
cp
|
||||||
|
for cp in el.could_be_from_charset
|
||||||
|
if cp != el.encoding
|
||||||
|
],
|
||||||
|
el.language,
|
||||||
|
el.alphabets,
|
||||||
|
el.bom,
|
||||||
|
el.percent_chaos,
|
||||||
|
el.percent_coherence,
|
||||||
|
None,
|
||||||
|
False,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
if args.normalize is True:
|
||||||
|
if best_guess.encoding.startswith("utf") is True:
|
||||||
|
print(
|
||||||
|
'"{}" file does not need to be normalized, as it already came from unicode.'.format(
|
||||||
|
my_file.name
|
||||||
|
),
|
||||||
|
file=sys.stderr,
|
||||||
|
)
|
||||||
|
if my_file.closed is False:
|
||||||
|
my_file.close()
|
||||||
|
continue
|
||||||
|
|
||||||
|
dir_path = dirname(realpath(my_file.name))
|
||||||
|
file_name = basename(realpath(my_file.name))
|
||||||
|
|
||||||
|
o_: list[str] = file_name.split(".")
|
||||||
|
|
||||||
|
if args.replace is False:
|
||||||
|
o_.insert(-1, best_guess.encoding)
|
||||||
|
if my_file.closed is False:
|
||||||
|
my_file.close()
|
||||||
|
elif (
|
||||||
|
args.force is False
|
||||||
|
and query_yes_no(
|
||||||
|
'Are you sure you want to normalize "{}" by replacing it?'.format(
|
||||||
|
my_file.name
|
||||||
|
),
|
||||||
|
"no",
|
||||||
|
)
|
||||||
|
is False
|
||||||
|
):
|
||||||
|
if my_file.closed is False:
|
||||||
|
my_file.close()
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
cli_result.unicode_path = join(dir_path, ".".join(o_))
|
||||||
|
|
||||||
|
with open(cli_result.unicode_path, "wb") as fp:
|
||||||
|
fp.write(best_guess.output())
|
||||||
|
except OSError as e: # Defensive:
|
||||||
|
print(str(e), file=sys.stderr)
|
||||||
|
if my_file.closed is False:
|
||||||
|
my_file.close()
|
||||||
|
return 2
|
||||||
|
|
||||||
|
if my_file.closed is False:
|
||||||
|
my_file.close()
|
||||||
|
|
||||||
|
if args.minimal is False:
|
||||||
|
print(
|
||||||
|
dumps(
|
||||||
|
[el.__dict__ for el in x_] if len(x_) > 1 else x_[0].__dict__,
|
||||||
|
ensure_ascii=True,
|
||||||
|
indent=4,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
for my_file in args.files:
|
||||||
|
print(
|
||||||
|
", ".join(
|
||||||
|
[
|
||||||
|
el.encoding or "undefined"
|
||||||
|
for el in x_
|
||||||
|
if el.path == abspath(my_file.name)
|
||||||
|
]
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__": # Defensive:
|
||||||
|
cli_detect()
|
||||||
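The output name that `cli_detect` builds when `--normalize` is used without `--replace` (split the file name on ".", then `insert(-1, encoding)`) can be tried in isolation. The following is a minimal sketch of that one step only; the helper name `normalized_name` is hypothetical and is not part of the file above.

```python
def normalized_name(file_name: str, encoding: str) -> str:
    """Mirror the CLI naming step: place the encoding before the last dot-separated part."""
    parts = file_name.split(".")
    # list.insert(-1, x) puts x in front of the final element.
    parts.insert(-1, encoding)
    return ".".join(parts)


if __name__ == "__main__":
    print(normalized_name("notes.txt", "cp1252"))
    print(normalized_name("README", "cp1252"))
```

For a name without an extension the only element is also the last one, so the encoding ends up in front (`cp1252.README`) rather than as a suffix.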
Binary file not shown.
File diff suppressed because it is too large
@ -0,0 +1,79 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from typing import TYPE_CHECKING, Any
|
||||||
|
from warnings import warn
|
||||||
|
|
||||||
|
from .api import from_bytes
|
||||||
|
from .constant import CHARDET_CORRESPONDENCE, TOO_SMALL_SEQUENCE
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from typing import TypedDict
|
||||||
|
|
||||||
|
class ResultDict(TypedDict):
|
||||||
|
encoding: str | None
|
||||||
|
language: str
|
||||||
|
confidence: float | None
|
||||||
|
|
||||||
|
|
||||||
|
def detect(
|
||||||
|
byte_str: bytes, should_rename_legacy: bool = False, **kwargs: Any
|
||||||
|
) -> ResultDict:
|
||||||
|
"""
|
||||||
|
chardet legacy method
|
||||||
|
Detect the encoding of the given byte string. It should be mostly backward-compatible.
|
||||||
|
Encoding names match chardet's own spelling whenever possible (except for encodings chardet does not support).
|
||||||
|
This function is deprecated and exists to ease migration from chardet; consult the documentation for
|
||||||
|
further information. It is not planned for removal.
|
||||||
|
|
||||||
|
:param byte_str: The byte sequence to examine.
|
||||||
|
:param should_rename_legacy: Should we rename legacy encodings
|
||||||
|
to their more modern equivalents?
|
||||||
|
"""
|
||||||
|
if len(kwargs):
|
||||||
|
warn(
|
||||||
|
f"charset-normalizer disregard arguments '{','.join(list(kwargs.keys()))}' in legacy function detect()"
|
||||||
|
)
|
||||||
|
|
||||||
|
if not isinstance(byte_str, (bytearray, bytes)):
|
||||||
|
raise TypeError( # pragma: nocover
|
||||||
|
f"Expected object of type bytes or bytearray, got: {type(byte_str)}"
|
||||||
|
)
|
||||||
|
|
||||||
|
if isinstance(byte_str, bytearray):
|
||||||
|
byte_str = bytes(byte_str)
|
||||||
|
|
||||||
|
r = from_bytes(byte_str).best()
|
||||||
|
|
||||||
|
encoding = r.encoding if r is not None else None
|
||||||
|
language = r.language if r is not None and r.language != "Unknown" else ""
|
||||||
|
confidence = 1.0 - r.chaos if r is not None else None
|
||||||
|
|
||||||
|
# automatically lower confidence
|
||||||
|
# on small bytes samples.
|
||||||
|
# https://github.com/jawah/charset_normalizer/issues/391
|
||||||
|
if (
|
||||||
|
confidence is not None
|
||||||
|
and confidence >= 0.9
|
||||||
|
and encoding
|
||||||
|
not in {
|
||||||
|
"utf_8",
|
||||||
|
"ascii",
|
||||||
|
}
|
||||||
|
and r.bom is False # type: ignore[union-attr]
|
||||||
|
and len(byte_str) < TOO_SMALL_SEQUENCE
|
||||||
|
):
|
||||||
|
confidence -= 0.2
|
||||||
|
|
||||||
|
# Note: CharsetNormalizer does not return 'UTF-8-SIG' as the sig gets stripped in the detection/normalization process
|
||||||
|
# but chardet does return 'utf-8-sig' and it is a valid codec name.
|
||||||
|
if r is not None and encoding == "utf_8" and r.bom:
|
||||||
|
encoding += "_sig"
|
||||||
|
|
||||||
|
if should_rename_legacy is False and encoding in CHARDET_CORRESPONDENCE:
|
||||||
|
encoding = CHARDET_CORRESPONDENCE[encoding]
|
||||||
|
|
||||||
|
return {
|
||||||
|
"encoding": encoding,
|
||||||
|
"language": language,
|
||||||
|
"confidence": confidence,
|
||||||
|
}
|
||||||
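The small-sample penalty applied in `detect()` above can be exercised on its own. This is a hedged sketch, not the library's code: the helper name `adjust_confidence` is hypothetical, and the value 32 for `TOO_SMALL_SEQUENCE` is an assumption, since the constant's definition is not shown in this diff.

```python
from __future__ import annotations

TOO_SMALL_SEQUENCE = 32  # assumed value; the real constant is imported from .constant


def adjust_confidence(
    confidence: float | None, encoding: str | None, has_bom: bool, size: int
) -> float | None:
    """Lower a high confidence by 0.2 for short, BOM-less input that is not UTF-8/ASCII."""
    if (
        confidence is not None
        and confidence >= 0.9
        and encoding not in {"utf_8", "ascii"}
        and not has_bom
        and size < TOO_SMALL_SEQUENCE
    ):
        return confidence - 0.2
    return confidence


if __name__ == "__main__":
    # A 10-byte sample guessed as cp1252 loses 0.2 of its confidence.
    print(adjust_confidence(0.95, "cp1252", False, 10))
    # The same guess on a longer sample is left untouched.
    print(adjust_confidence(0.95, "cp1252", False, 100))
```

The penalty only applies when every condition holds, so a detected BOM, a UTF-8 or ASCII result, or a sufficiently long sample each leave the confidence unchanged.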
Binary file not shown.
@ -0,0 +1,936 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import sys
|
||||||
|
from functools import lru_cache
|
||||||
|
from logging import getLogger
|
||||||
|
|
||||||
|
if sys.version_info >= (3, 8):
|
||||||
|
from typing import final
|
||||||
|
else:
|
||||||
|
try:
|
||||||
|
from typing_extensions import final
|
||||||
|
except ImportError:
|
||||||
|
|
||||||
|
def final(cls): # type: ignore[misc,no-untyped-def]
|
||||||
|
return cls
|
||||||
|
|
||||||
|
|
||||||
|
from .constant import (
|
||||||
|
COMMON_CJK_CHARACTERS,
|
||||||
|
COMMON_SAFE_ASCII_CHARACTERS,
|
||||||
|
TRACE,
|
||||||
|
UNICODE_SECONDARY_RANGE_KEYWORD,
|
||||||
|
_ACCENTUATED,
|
||||||
|
_ARABIC,
|
||||||
|
_ARABIC_ISOLATED_FORM,
|
||||||
|
_CJK,
|
||||||
|
_HANGUL,
|
||||||
|
_HIRAGANA,
|
||||||
|
_KATAKANA,
|
||||||
|
_LATIN,
|
||||||
|
_THAI,
|
||||||
|
)
|
||||||
|
from .utils import (
|
||||||
|
_character_flags,
|
||||||
|
is_emoticon,
|
||||||
|
is_punctuation,
|
||||||
|
is_separator,
|
||||||
|
is_symbol,
|
||||||
|
remove_accent,
|
||||||
|
unicode_range,
|
||||||
|
)
|
||||||
|
|
||||||
|
# Combined bitmask for CJK/Hangul/Katakana/Hiragana/Thai glyph detection.
|
||||||
|
_GLYPH_MASK: int = _CJK | _HANGUL | _KATAKANA | _HIRAGANA | _THAI
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class CharInfo:
|
||||||
|
"""Pre-computed character properties shared across all detectors.
|
||||||
|
|
||||||
|
Instantiated once and reused via :meth:`update` on every character
|
||||||
|
in the hot loop so that redundant calls to str methods
|
||||||
|
(``isalpha``, ``isupper``, …) and cached utility functions
|
||||||
|
(``_character_flags``, ``is_punctuation``, …) are avoided when
|
||||||
|
several plugins need the same information.
|
||||||
|
"""
|
||||||
|
|
||||||
|
__slots__ = (
|
||||||
|
"character",
|
||||||
|
"printable",
|
||||||
|
"alpha",
|
||||||
|
"upper",
|
||||||
|
"lower",
|
||||||
|
"space",
|
||||||
|
"digit",
|
||||||
|
"is_ascii",
|
||||||
|
"case_variable",
|
||||||
|
"flags",
|
||||||
|
"accentuated",
|
||||||
|
"latin",
|
||||||
|
"is_cjk",
|
||||||
|
"is_arabic",
|
||||||
|
"is_glyph",
|
||||||
|
"punct",
|
||||||
|
"sym",
|
||||||
|
)
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self.character: str = ""
|
||||||
|
self.printable: bool = False
|
||||||
|
self.alpha: bool = False
|
||||||
|
self.upper: bool = False
|
||||||
|
self.lower: bool = False
|
||||||
|
self.space: bool = False
|
||||||
|
self.digit: bool = False
|
||||||
|
self.is_ascii: bool = False
|
||||||
|
self.case_variable: bool = False
|
||||||
|
self.flags: int = 0
|
||||||
|
self.accentuated: bool = False
|
||||||
|
self.latin: bool = False
|
||||||
|
self.is_cjk: bool = False
|
||||||
|
self.is_arabic: bool = False
|
||||||
|
self.is_glyph: bool = False
|
||||||
|
self.punct: bool = False
|
||||||
|
self.sym: bool = False
|
||||||
|
|
||||||
|
def update(self, character: str) -> None:
|
||||||
|
"""Update all properties for *character* (called once per character)."""
|
||||||
|
self.character = character
|
||||||
|
|
||||||
|
# ASCII fast-path: for characters with ord < 128, we can skip
|
||||||
|
# _character_flags() entirely and derive most properties from ord.
|
||||||
|
o: int = ord(character)
|
||||||
|
if o < 128:
|
||||||
|
self.is_ascii = True
|
||||||
|
self.accentuated = False
|
||||||
|
self.is_cjk = False
|
||||||
|
self.is_arabic = False
|
||||||
|
self.is_glyph = False
|
||||||
|
# ASCII alpha: a-z (97-122) or A-Z (65-90)
|
||||||
|
if 65 <= o <= 90:
|
||||||
|
# Uppercase ASCII letter
|
||||||
|
self.alpha = True
|
||||||
|
self.upper = True
|
||||||
|
self.lower = False
|
||||||
|
self.space = False
|
||||||
|
self.digit = False
|
||||||
|
self.printable = True
|
||||||
|
self.case_variable = True
|
||||||
|
self.flags = _LATIN
|
||||||
|
self.latin = True
|
||||||
|
self.punct = False
|
||||||
|
self.sym = False
|
||||||
|
elif 97 <= o <= 122:
|
||||||
|
# Lowercase ASCII letter
|
||||||
|
self.alpha = True
|
||||||
|
self.upper = False
|
||||||
|
self.lower = True
|
||||||
|
self.space = False
|
||||||
|
self.digit = False
|
||||||
|
self.printable = True
|
||||||
|
self.case_variable = True
|
||||||
|
self.flags = _LATIN
|
||||||
|
self.latin = True
|
||||||
|
self.punct = False
|
||||||
|
self.sym = False
|
||||||
|
elif 48 <= o <= 57:
|
||||||
|
# ASCII digit 0-9
|
||||||
|
self.alpha = False
|
||||||
|
self.upper = False
|
||||||
|
self.lower = False
|
||||||
|
self.space = False
|
||||||
|
self.digit = True
|
||||||
|
self.printable = True
|
||||||
|
self.case_variable = False
|
||||||
|
self.flags = 0
|
||||||
|
self.latin = False
|
||||||
|
self.punct = False
|
||||||
|
self.sym = False
|
||||||
|
elif o == 32 or (9 <= o <= 13):
|
||||||
|
# Space, tab, newline, etc.
|
||||||
|
self.alpha = False
|
||||||
|
self.upper = False
|
||||||
|
self.lower = False
|
||||||
|
self.space = True
|
||||||
|
self.digit = False
|
||||||
|
self.printable = o == 32
|
||||||
|
self.case_variable = False
|
||||||
|
self.flags = 0
|
||||||
|
self.latin = False
|
||||||
|
self.punct = False
|
||||||
|
self.sym = False
|
||||||
|
else:
|
||||||
|
# Other ASCII (punctuation, symbols, control chars)
|
||||||
|
self.printable = character.isprintable()
|
||||||
|
self.alpha = False
|
||||||
|
self.upper = False
|
||||||
|
self.lower = False
|
||||||
|
self.space = False
|
||||||
|
self.digit = False
|
||||||
|
self.case_variable = False
|
||||||
|
self.flags = 0
|
||||||
|
self.latin = False
|
||||||
|
self.punct = is_punctuation(character) if self.printable else False
|
||||||
|
self.sym = is_symbol(character) if self.printable else False
|
||||||
|
else:
|
||||||
|
# Non-ASCII path
|
||||||
|
self.is_ascii = False
|
||||||
|
self.printable = character.isprintable()
|
||||||
|
self.alpha = character.isalpha()
|
||||||
|
self.upper = character.isupper()
|
||||||
|
self.lower = character.islower()
|
||||||
|
self.space = character.isspace()
|
||||||
|
self.digit = character.isdigit()
|
||||||
|
self.case_variable = self.lower != self.upper
|
||||||
|
|
||||||
|
# Flag-based classification (single unicodedata.name() call, lru-cached)
|
||||||
|
flags: int
|
||||||
|
if self.alpha:
|
||||||
|
flags = _character_flags(character)
|
||||||
|
else:
|
||||||
|
flags = 0
|
||||||
|
self.flags = flags
|
||||||
|
self.accentuated = bool(flags & _ACCENTUATED)
|
||||||
|
self.latin = bool(flags & _LATIN)
|
||||||
|
self.is_cjk = bool(flags & _CJK)
|
||||||
|
self.is_arabic = bool(flags & _ARABIC)
|
||||||
|
self.is_glyph = bool(flags & _GLYPH_MASK)
|
||||||
|
|
||||||
|
# Eagerly compute punct and sym (avoids property dispatch overhead
|
||||||
|
# on 300K+ accesses in the hot loop).
|
||||||
|
self.punct = is_punctuation(character) if self.printable else False
|
||||||
|
self.sym = is_symbol(character) if self.printable else False
|
||||||
|
|
||||||
|
|
||||||
|
class MessDetectorPlugin:
|
||||||
|
"""
|
||||||
|
Base abstract class used for mess detection plugins.
|
||||||
|
All detectors MUST extend and implement given methods.
|
||||||
|
"""
|
||||||
|
|
||||||
|
__slots__ = ()
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""
|
||||||
|
The main routine to be executed upon character.
|
||||||
|
Insert the logic in which the text would be considered chaotic.
|
||||||
|
"""
|
||||||
|
raise NotImplementedError # Defensive:
|
||||||
|
|
||||||
|
def reset(self) -> None: # Defensive:
|
||||||
|
"""
|
||||||
|
Reset the plugin to its initial state.
|
||||||
|
"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
"""
|
||||||
|
Compute the chaos ratio based on what your feed_info() has seen.
|
||||||
|
Must NOT be lower than 0.0; there is no upper bound.
|
||||||
|
"""
|
||||||
|
raise NotImplementedError # Defensive:
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class TooManySymbolOrPunctuationPlugin(MessDetectorPlugin):
|
||||||
|
__slots__ = (
|
||||||
|
"_punctuation_count",
|
||||||
|
"_symbol_count",
|
||||||
|
"_character_count",
|
||||||
|
"_last_printable_char",
|
||||||
|
"_frenzy_symbol_in_word",
|
||||||
|
)
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._punctuation_count: int = 0
|
||||||
|
self._symbol_count: int = 0
|
||||||
|
self._character_count: int = 0
|
||||||
|
|
||||||
|
self._last_printable_char: str | None = None
|
||||||
|
self._frenzy_symbol_in_word: bool = False
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
self._character_count += 1
|
||||||
|
|
||||||
|
if (
|
||||||
|
character != self._last_printable_char
|
||||||
|
and character not in COMMON_SAFE_ASCII_CHARACTERS
|
||||||
|
):
|
||||||
|
if info.punct:
|
||||||
|
self._punctuation_count += 1
|
||||||
|
elif not info.digit and info.sym and not is_emoticon(character):
|
||||||
|
self._symbol_count += 2
|
||||||
|
|
||||||
|
self._last_printable_char = character
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._punctuation_count = 0
|
||||||
|
self._character_count = 0
|
||||||
|
self._symbol_count = 0
|
||||||
|
self._last_printable_char = None
|
||||||
|
self._frenzy_symbol_in_word = False
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count == 0:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
ratio_of_punctuation: float = (
|
||||||
|
self._punctuation_count + self._symbol_count
|
||||||
|
) / self._character_count
|
||||||
|
|
||||||
|
return ratio_of_punctuation if ratio_of_punctuation >= 0.3 else 0.0
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class TooManyAccentuatedPlugin(MessDetectorPlugin):
|
||||||
|
__slots__ = ("_character_count", "_accentuated_count")
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._character_count: int = 0
|
||||||
|
self._accentuated_count: int = 0
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
self._character_count += 1
|
||||||
|
|
||||||
|
if info.accentuated:
|
||||||
|
self._accentuated_count += 1
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._character_count = 0
|
||||||
|
self._accentuated_count = 0
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count < 8:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
ratio_of_accentuation: float = self._accentuated_count / self._character_count
|
||||||
|
return ratio_of_accentuation if ratio_of_accentuation >= 0.35 else 0.0
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class UnprintablePlugin(MessDetectorPlugin):
|
||||||
|
__slots__ = ("_unprintable_count", "_character_count")
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._unprintable_count: int = 0
|
||||||
|
self._character_count: int = 0
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
if (
|
||||||
|
not info.space
|
||||||
|
and not info.printable
|
||||||
|
and character != "\x1a"
|
||||||
|
and character != "\ufeff"
|
||||||
|
):
|
||||||
|
self._unprintable_count += 1
|
||||||
|
self._character_count += 1
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._unprintable_count = 0
|
||||||
|
self._character_count = 0
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count == 0: # Defensive:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
return (self._unprintable_count * 8) / self._character_count
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class SuspiciousDuplicateAccentPlugin(MessDetectorPlugin):
|
||||||
|
__slots__ = (
|
||||||
|
"_successive_count",
|
||||||
|
"_character_count",
|
||||||
|
"_last_latin_character",
|
||||||
|
"_last_was_accentuated",
|
||||||
|
)
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._successive_count: int = 0
|
||||||
|
self._character_count: int = 0
|
||||||
|
|
||||||
|
self._last_latin_character: str | None = None
|
||||||
|
self._last_was_accentuated: bool = False
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
self._character_count += 1
|
||||||
|
if (
|
||||||
|
self._last_latin_character is not None
|
||||||
|
and info.accentuated
|
||||||
|
and self._last_was_accentuated
|
||||||
|
):
|
||||||
|
if info.upper and self._last_latin_character.isupper():
|
||||||
|
self._successive_count += 1
|
||||||
|
if remove_accent(character) == remove_accent(self._last_latin_character):
|
||||||
|
self._successive_count += 1
|
||||||
|
self._last_latin_character = character
|
||||||
|
self._last_was_accentuated = info.accentuated
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._successive_count = 0
|
||||||
|
self._character_count = 0
|
||||||
|
self._last_latin_character = None
|
||||||
|
self._last_was_accentuated = False
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count == 0:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
return (self._successive_count * 2) / self._character_count
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class SuspiciousRange(MessDetectorPlugin):
|
||||||
|
__slots__ = (
|
||||||
|
"_suspicious_successive_range_count",
|
||||||
|
"_character_count",
|
||||||
|
"_last_printable_seen",
|
||||||
|
"_last_printable_range",
|
||||||
|
)
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._suspicious_successive_range_count: int = 0
|
||||||
|
self._character_count: int = 0
|
||||||
|
self._last_printable_seen: str | None = None
|
||||||
|
self._last_printable_range: str | None = None
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
self._character_count += 1
|
||||||
|
|
||||||
|
if info.space or info.punct or character in COMMON_SAFE_ASCII_CHARACTERS:
|
||||||
|
self._last_printable_seen = None
|
||||||
|
self._last_printable_range = None
|
||||||
|
return
|
||||||
|
|
||||||
|
if self._last_printable_seen is None:
|
||||||
|
self._last_printable_seen = character
|
||||||
|
self._last_printable_range = unicode_range(character)
|
||||||
|
return
|
||||||
|
|
||||||
|
unicode_range_a: str | None = self._last_printable_range
|
||||||
|
unicode_range_b: str | None = unicode_range(character)
|
||||||
|
|
||||||
|
if is_suspiciously_successive_range(unicode_range_a, unicode_range_b):
|
||||||
|
self._suspicious_successive_range_count += 1
|
||||||
|
|
||||||
|
self._last_printable_seen = character
|
||||||
|
self._last_printable_range = unicode_range_b
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._character_count = 0
|
||||||
|
self._suspicious_successive_range_count = 0
|
||||||
|
self._last_printable_seen = None
|
||||||
|
self._last_printable_range = None
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count <= 13:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
ratio_of_suspicious_range_usage: float = (
|
||||||
|
self._suspicious_successive_range_count * 2
|
||||||
|
) / self._character_count
|
||||||
|
|
||||||
|
return ratio_of_suspicious_range_usage
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class SuperWeirdWordPlugin(MessDetectorPlugin):
|
||||||
|
__slots__ = (
|
||||||
|
"_word_count",
|
||||||
|
"_bad_word_count",
|
||||||
|
"_foreign_long_count",
|
||||||
|
"_is_current_word_bad",
|
||||||
|
"_foreign_long_watch",
|
||||||
|
"_character_count",
|
||||||
|
"_bad_character_count",
|
||||||
|
"_buffer_length",
|
||||||
|
"_buffer_last_char",
|
||||||
|
"_buffer_last_char_accentuated",
|
||||||
|
"_buffer_accent_count",
|
||||||
|
"_buffer_glyph_count",
|
||||||
|
"_buffer_upper_count",
|
||||||
|
)
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._word_count: int = 0
|
||||||
|
self._bad_word_count: int = 0
|
||||||
|
self._foreign_long_count: int = 0
|
||||||
|
|
||||||
|
self._is_current_word_bad: bool = False
|
||||||
|
self._foreign_long_watch: bool = False
|
||||||
|
|
||||||
|
self._character_count: int = 0
|
||||||
|
self._bad_character_count: int = 0
|
||||||
|
|
||||||
|
self._buffer_length: int = 0
|
||||||
|
self._buffer_last_char: str | None = None
|
||||||
|
self._buffer_last_char_accentuated: bool = False
|
||||||
|
self._buffer_accent_count: int = 0
|
||||||
|
self._buffer_glyph_count: int = 0
|
||||||
|
self._buffer_upper_count: int = 0
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
if info.alpha:
|
||||||
|
self._buffer_length += 1
|
||||||
|
self._buffer_last_char = character
|
||||||
|
|
||||||
|
if info.upper:
|
||||||
|
self._buffer_upper_count += 1
|
||||||
|
|
||||||
|
self._buffer_last_char_accentuated = info.accentuated
|
||||||
|
|
||||||
|
if info.accentuated:
|
||||||
|
self._buffer_accent_count += 1
|
||||||
|
if (
|
||||||
|
not self._foreign_long_watch
|
||||||
|
and (not info.latin or info.accentuated)
|
||||||
|
and not info.is_glyph
|
||||||
|
):
|
||||||
|
self._foreign_long_watch = True
|
||||||
|
if info.is_glyph:
|
||||||
|
self._buffer_glyph_count += 1
|
||||||
|
return
|
||||||
|
if not self._buffer_length:
|
||||||
|
return
|
||||||
|
if info.space or info.punct or is_separator(character):
|
||||||
|
self._word_count += 1
|
||||||
|
buffer_length: int = self._buffer_length
|
||||||
|
|
||||||
|
self._character_count += buffer_length
|
||||||
|
|
||||||
|
if buffer_length >= 4:
|
||||||
|
if self._buffer_accent_count / buffer_length >= 0.5:
|
||||||
|
self._is_current_word_bad = True
|
||||||
|
elif (
|
||||||
|
self._buffer_last_char_accentuated
|
||||||
|
and self._buffer_last_char.isupper() # type: ignore[union-attr]
|
||||||
|
and self._buffer_upper_count != buffer_length
|
||||||
|
):
|
||||||
|
self._foreign_long_count += 1
|
||||||
|
self._is_current_word_bad = True
|
||||||
|
elif self._buffer_glyph_count == 1:
|
||||||
|
self._is_current_word_bad = True
|
||||||
|
self._foreign_long_count += 1
|
||||||
|
if buffer_length >= 24 and self._foreign_long_watch:
|
||||||
|
probable_camel_cased: bool = (
|
||||||
|
self._buffer_upper_count > 0
|
||||||
|
and self._buffer_upper_count / buffer_length <= 0.3
|
||||||
|
)
|
||||||
|
|
||||||
|
if not probable_camel_cased:
|
||||||
|
self._foreign_long_count += 1
|
||||||
|
self._is_current_word_bad = True
|
||||||
|
|
||||||
|
if self._is_current_word_bad:
|
||||||
|
self._bad_word_count += 1
|
||||||
|
self._bad_character_count += buffer_length
|
||||||
|
self._is_current_word_bad = False
|
||||||
|
|
||||||
|
self._foreign_long_watch = False
|
||||||
|
self._buffer_length = 0
|
||||||
|
self._buffer_last_char = None
|
||||||
|
self._buffer_last_char_accentuated = False
|
||||||
|
self._buffer_accent_count = 0
|
||||||
|
self._buffer_glyph_count = 0
|
||||||
|
self._buffer_upper_count = 0
|
||||||
|
elif (
|
||||||
|
character not in {"<", ">", "-", "=", "~", "|", "_"}
|
||||||
|
and not info.digit
|
||||||
|
and info.sym
|
||||||
|
):
|
||||||
|
self._is_current_word_bad = True
|
||||||
|
self._buffer_length += 1
|
||||||
|
self._buffer_last_char = character
|
||||||
|
self._buffer_last_char_accentuated = False
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._buffer_length = 0
|
||||||
|
self._buffer_last_char = None
|
||||||
|
self._buffer_last_char_accentuated = False
|
||||||
|
self._is_current_word_bad = False
|
||||||
|
self._foreign_long_watch = False
|
||||||
|
self._bad_word_count = 0
|
||||||
|
self._word_count = 0
|
||||||
|
self._character_count = 0
|
||||||
|
self._bad_character_count = 0
|
||||||
|
self._foreign_long_count = 0
|
||||||
|
self._buffer_accent_count = 0
|
||||||
|
self._buffer_glyph_count = 0
|
||||||
|
self._buffer_upper_count = 0
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._word_count <= 10 and self._foreign_long_count == 0:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
return self._bad_character_count / self._character_count
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class CjkUncommonPlugin(MessDetectorPlugin):
|
||||||
|
"""
|
||||||
|
Detect messy CJK text that probably means nothing.
|
||||||
|
"""
|
||||||
|
|
||||||
|
__slots__ = ("_character_count", "_uncommon_count")
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._character_count: int = 0
|
||||||
|
self._uncommon_count: int = 0
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
self._character_count += 1
|
||||||
|
|
||||||
|
if character not in COMMON_CJK_CHARACTERS:
|
||||||
|
self._uncommon_count += 1
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._character_count = 0
|
||||||
|
self._uncommon_count = 0
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count < 8:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
uncommon_form_usage: float = self._uncommon_count / self._character_count
|
||||||
|
|
||||||
|
# We can be fairly sure it is garbage when uncommon characters are widely
|
||||||
|
# used. Otherwise it could simply be Traditional Chinese, for example.
|
||||||
|
return uncommon_form_usage / 10 if uncommon_form_usage > 0.5 else 0.0
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class ArchaicUpperLowerPlugin(MessDetectorPlugin):
|
||||||
|
__slots__ = (
|
||||||
|
"_buf",
|
||||||
|
"_character_count_since_last_sep",
|
||||||
|
"_successive_upper_lower_count",
|
||||||
|
"_successive_upper_lower_count_final",
|
||||||
|
"_character_count",
|
||||||
|
"_last_alpha_seen",
|
||||||
|
"_last_alpha_seen_upper",
|
||||||
|
"_last_alpha_seen_lower",
|
||||||
|
"_current_ascii_only",
|
||||||
|
)
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._buf: bool = False
|
||||||
|
|
||||||
|
self._character_count_since_last_sep: int = 0
|
||||||
|
|
||||||
|
self._successive_upper_lower_count: int = 0
|
||||||
|
self._successive_upper_lower_count_final: int = 0
|
||||||
|
|
||||||
|
self._character_count: int = 0
|
||||||
|
|
||||||
|
self._last_alpha_seen: str | None = None
|
||||||
|
self._last_alpha_seen_upper: bool = False
|
||||||
|
self._last_alpha_seen_lower: bool = False
|
||||||
|
self._current_ascii_only: bool = True
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
is_concerned: bool = info.alpha and info.case_variable
|
||||||
|
chunk_sep: bool = not is_concerned
|
||||||
|
|
||||||
|
if chunk_sep and self._character_count_since_last_sep > 0:
|
||||||
|
if (
|
||||||
|
self._character_count_since_last_sep <= 64
|
||||||
|
and not info.digit
|
||||||
|
and not self._current_ascii_only
|
||||||
|
):
|
||||||
|
self._successive_upper_lower_count_final += (
|
||||||
|
self._successive_upper_lower_count
|
||||||
|
)
|
||||||
|
|
||||||
|
self._successive_upper_lower_count = 0
|
||||||
|
self._character_count_since_last_sep = 0
|
||||||
|
self._last_alpha_seen = None
|
||||||
|
self._buf = False
|
||||||
|
self._character_count += 1
|
||||||
|
self._current_ascii_only = True
|
||||||
|
|
||||||
|
return
|
||||||
|
|
||||||
|
if self._current_ascii_only and not info.is_ascii:
|
||||||
|
self._current_ascii_only = False
|
||||||
|
|
||||||
|
if self._last_alpha_seen is not None:
|
||||||
|
if (info.upper and self._last_alpha_seen_lower) or (
|
||||||
|
info.lower and self._last_alpha_seen_upper
|
||||||
|
):
|
||||||
|
if self._buf:
|
||||||
|
self._successive_upper_lower_count += 2
|
||||||
|
self._buf = False
|
||||||
|
else:
|
||||||
|
self._buf = True
|
||||||
|
else:
|
||||||
|
self._buf = False
|
||||||
|
|
||||||
|
self._character_count += 1
|
||||||
|
self._character_count_since_last_sep += 1
|
||||||
|
self._last_alpha_seen = character
|
||||||
|
self._last_alpha_seen_upper = info.upper
|
||||||
|
self._last_alpha_seen_lower = info.lower
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._character_count = 0
|
||||||
|
self._character_count_since_last_sep = 0
|
||||||
|
self._successive_upper_lower_count = 0
|
||||||
|
self._successive_upper_lower_count_final = 0
|
||||||
|
self._last_alpha_seen = None
|
||||||
|
self._last_alpha_seen_upper = False
|
||||||
|
self._last_alpha_seen_lower = False
|
||||||
|
self._buf = False
|
||||||
|
self._current_ascii_only = True
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count == 0: # Defensive:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
return self._successive_upper_lower_count_final / self._character_count
|
||||||
|
|
||||||
|
|
||||||
|
@final
|
||||||
|
class ArabicIsolatedFormPlugin(MessDetectorPlugin):
|
||||||
|
__slots__ = ("_character_count", "_isolated_form_count")
|
||||||
|
|
||||||
|
def __init__(self) -> None:
|
||||||
|
self._character_count: int = 0
|
||||||
|
self._isolated_form_count: int = 0
|
||||||
|
|
||||||
|
def reset(self) -> None: # Abstract
|
||||||
|
self._character_count = 0
|
||||||
|
self._isolated_form_count = 0
|
||||||
|
|
||||||
|
def feed_info(self, character: str, info: CharInfo) -> None:
|
||||||
|
"""Optimized feed using pre-computed character info."""
|
||||||
|
self._character_count += 1
|
||||||
|
|
||||||
|
if info.flags & _ARABIC_ISOLATED_FORM:
|
||||||
|
self._isolated_form_count += 1
|
||||||
|
|
||||||
|
@property
|
||||||
|
def ratio(self) -> float:
|
||||||
|
if self._character_count < 8:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
isolated_form_usage: float = self._isolated_form_count / self._character_count
|
||||||
|
|
||||||
|
return isolated_form_usage
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=1024)
|
||||||
|
def is_suspiciously_successive_range(
|
||||||
|
unicode_range_a: str | None, unicode_range_b: str | None
|
||||||
|
) -> bool:
|
||||||
|
"""
|
||||||
|
Determine whether two Unicode ranges seen next to each other can be considered suspicious.
|
||||||
|
"""
|
||||||
|
if unicode_range_a is None or unicode_range_b is None:
|
||||||
|
return True
|
||||||
|
|
||||||
|
if unicode_range_a == unicode_range_b:
|
||||||
|
return False
|
||||||
|
|
||||||
|
if "Latin" in unicode_range_a and "Latin" in unicode_range_b:
|
||||||
|
return False
|
||||||
|
|
||||||
|
if "Emoticons" in unicode_range_a or "Emoticons" in unicode_range_b:
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Latin characters can be accompanied with a combining diacritical mark
|
||||||
|
# eg. Vietnamese.
|
||||||
|
if ("Latin" in unicode_range_a or "Latin" in unicode_range_b) and (
|
||||||
|
"Combining" in unicode_range_a or "Combining" in unicode_range_b
|
||||||
|
):
|
||||||
|
return False
|
||||||
|
|
||||||
|
keywords_range_a, keywords_range_b = (
|
||||||
|
unicode_range_a.split(" "),
|
||||||
|
unicode_range_b.split(" "),
|
||||||
|
)
|
||||||
|
|
||||||
|
for el in keywords_range_a:
|
||||||
|
if el in UNICODE_SECONDARY_RANGE_KEYWORD:
|
||||||
|
continue
|
||||||
|
if el in keywords_range_b:
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Japanese Exception
|
||||||
|
range_a_jp_chars, range_b_jp_chars = (
|
||||||
|
unicode_range_a
|
||||||
|
in (
|
||||||
|
"Hiragana",
|
||||||
|
"Katakana",
|
||||||
|
),
|
||||||
|
unicode_range_b in ("Hiragana", "Katakana"),
|
||||||
|
)
|
||||||
|
if (range_a_jp_chars or range_b_jp_chars) and (
|
||||||
|
"CJK" in unicode_range_a or "CJK" in unicode_range_b
|
||||||
|
):
|
||||||
|
return False
|
||||||
|
if range_a_jp_chars and range_b_jp_chars:
|
||||||
|
return False
|
||||||
|
|
||||||
|
if "Hangul" in unicode_range_a or "Hangul" in unicode_range_b:
|
||||||
|
if "CJK" in unicode_range_a or "CJK" in unicode_range_b:
|
||||||
|
return False
|
||||||
|
if unicode_range_a == "Basic Latin" or unicode_range_b == "Basic Latin":
|
||||||
|
return False
|
||||||
|
|
||||||
|
# Chinese/Japanese use dedicated range for punctuation and/or separators.
|
||||||
|
if ("CJK" in unicode_range_a or "CJK" in unicode_range_b) or (
|
||||||
|
unicode_range_a in ["Katakana", "Hiragana"]
|
||||||
|
and unicode_range_b in ["Katakana", "Hiragana"]
|
||||||
|
):
|
||||||
|
if "Punctuation" in unicode_range_a or "Punctuation" in unicode_range_b:
|
||||||
|
return False
|
||||||
|
if "Forms" in unicode_range_a or "Forms" in unicode_range_b:
|
||||||
|
return False
|
||||||
|
if unicode_range_a == "Basic Latin" or unicode_range_b == "Basic Latin":
|
||||||
|
return False
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=2048)
|
||||||
|
def mess_ratio(
|
||||||
|
decoded_sequence: str, maximum_threshold: float = 0.2, debug: bool = False
|
||||||
|
) -> float:
|
||||||
|
"""
|
||||||
|
Compute a mess ratio for a decoded sequence. Computation stops early once the ratio reaches the maximum threshold.
|
||||||
|
"""
|
||||||
|
|
||||||
|
seq_len: int = len(decoded_sequence)
|
||||||
|
|
||||||
|
if seq_len < 511:
|
||||||
|
step: int = 32
|
||||||
|
elif seq_len < 1024:
|
||||||
|
step = 64
|
||||||
|
else:
|
||||||
|
step = 128
|
||||||
|
|
||||||
|
# Create each detector as a named local variable (unrolled from the generic loop).
|
||||||
|
# This eliminates per-character iteration over the detector list and
|
||||||
|
# per-character eligible() virtual dispatch, while keeping every plugin class
|
||||||
|
# intact and fully readable.
|
||||||
|
d_sp: TooManySymbolOrPunctuationPlugin = TooManySymbolOrPunctuationPlugin()
|
||||||
|
d_ta: TooManyAccentuatedPlugin = TooManyAccentuatedPlugin()
|
||||||
|
d_up: UnprintablePlugin = UnprintablePlugin()
|
||||||
|
d_sda: SuspiciousDuplicateAccentPlugin = SuspiciousDuplicateAccentPlugin()
|
||||||
|
d_sr: SuspiciousRange = SuspiciousRange()
|
||||||
|
d_sw: SuperWeirdWordPlugin = SuperWeirdWordPlugin()
|
||||||
|
d_cu: CjkUncommonPlugin = CjkUncommonPlugin()
|
||||||
|
d_au: ArchaicUpperLowerPlugin = ArchaicUpperLowerPlugin()
|
||||||
|
d_ai: ArabicIsolatedFormPlugin = ArabicIsolatedFormPlugin()
|
||||||
|
|
||||||
|
# Local references for feed_info methods called in the hot loop.
|
||||||
|
d_sp_feed = d_sp.feed_info
|
||||||
|
d_ta_feed = d_ta.feed_info
|
||||||
|
d_up_feed = d_up.feed_info
|
||||||
|
d_sda_feed = d_sda.feed_info
|
||||||
|
d_sr_feed = d_sr.feed_info
|
||||||
|
d_sw_feed = d_sw.feed_info
|
||||||
|
d_cu_feed = d_cu.feed_info
|
||||||
|
d_au_feed = d_au.feed_info
|
||||||
|
d_ai_feed = d_ai.feed_info
|
||||||
|
|
||||||
|
# Single reusable CharInfo object (avoids per-character allocation).
|
||||||
|
info: CharInfo = CharInfo()
|
||||||
|
info_update = info.update
|
||||||
|
|
||||||
|
mean_mess_ratio: float
|
||||||
|
|
||||||
|
for block_start in range(0, seq_len, step):
|
||||||
|
for character in decoded_sequence[block_start : block_start + step]:
|
||||||
|
# Pre-compute all character properties once (shared across all plugins).
|
||||||
|
info_update(character)
|
||||||
|
|
||||||
|
# Detectors with eligible() == always True
|
||||||
|
d_up_feed(character, info)
|
||||||
|
d_sw_feed(character, info)
|
||||||
|
d_au_feed(character, info)
|
||||||
|
|
||||||
|
# Detectors with eligible() == isprintable
|
||||||
|
if info.printable:
|
||||||
|
d_sp_feed(character, info)
|
||||||
|
d_sr_feed(character, info)
|
||||||
|
|
||||||
|
# Detectors with eligible() == isalpha
|
||||||
|
if info.alpha:
|
||||||
|
d_ta_feed(character, info)
|
||||||
|
# SuspiciousDuplicateAccent: isalpha() and is_latin()
|
||||||
|
if info.latin:
|
||||||
|
d_sda_feed(character, info)
|
||||||
|
# CjkUncommon: is_cjk()
|
||||||
|
if info.is_cjk:
|
||||||
|
d_cu_feed(character, info)
|
||||||
|
# ArabicIsolatedForm: is_arabic()
|
||||||
|
if info.is_arabic:
|
||||||
|
d_ai_feed(character, info)
|
||||||
|
|
||||||
|
mean_mess_ratio = (
|
||||||
|
d_sp.ratio
|
||||||
|
+ d_ta.ratio
|
||||||
|
+ d_up.ratio
|
||||||
|
+ d_sda.ratio
|
||||||
|
+ d_sr.ratio
|
||||||
|
+ d_sw.ratio
|
||||||
|
+ d_cu.ratio
|
||||||
|
+ d_au.ratio
|
||||||
|
+ d_ai.ratio
|
||||||
|
)
|
||||||
|
|
||||||
|
if mean_mess_ratio >= maximum_threshold:
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
# Flush last word buffer in SuperWeirdWordPlugin via trailing newline.
|
||||||
|
info_update("\n")
|
||||||
|
d_sw_feed("\n", info)
|
||||||
|
d_au_feed("\n", info)
|
||||||
|
d_up_feed("\n", info)
|
||||||
|
|
||||||
|
mean_mess_ratio = (
|
||||||
|
d_sp.ratio
|
||||||
|
+ d_ta.ratio
|
||||||
|
+ d_up.ratio
|
||||||
|
+ d_sda.ratio
|
||||||
|
+ d_sr.ratio
|
||||||
|
+ d_sw.ratio
|
||||||
|
+ d_cu.ratio
|
||||||
|
+ d_au.ratio
|
||||||
|
+ d_ai.ratio
|
||||||
|
)
|
||||||
|
|
||||||
|
if debug: # Defensive:
|
||||||
|
logger = getLogger("charset_normalizer")
|
||||||
|
|
||||||
|
logger.log(
|
||||||
|
TRACE,
|
||||||
|
"Mess-detector extended-analysis start. "
|
||||||
|
f"intermediary_mean_mess_ratio_calc={step} mean_mess_ratio={mean_mess_ratio} "
|
||||||
|
f"maximum_threshold={maximum_threshold}",
|
||||||
|
)
|
||||||
|
|
||||||
|
if seq_len > 16:
|
||||||
|
logger.log(TRACE, f"Starting with: {decoded_sequence[:16]}")
|
||||||
|
logger.log(TRACE, f"Ending with: {decoded_sequence[-16::]}")
|
||||||
|
|
||||||
|
for dt in [d_sp, d_ta, d_up, d_sda, d_sr, d_sw, d_cu, d_au, d_ai]:
|
||||||
|
logger.log(TRACE, f"{dt.__class__}: {dt.ratio}")
|
||||||
|
|
||||||
|
return round(mean_mess_ratio, 3)
|
||||||
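The block-wise early exit used by `mess_ratio` above can be sketched in isolation. This is a minimal illustration, not the library's implementation: `UnprintableCounter` and `chunked_ratio` are hypothetical names, and a single toy detector stands in for the nine plugins. Only the step selection and the break-on-threshold loop mirror the listing.

```python
from __future__ import annotations


class UnprintableCounter:
    """Toy detector (hypothetical): share of non-printable characters seen so far."""

    def __init__(self) -> None:
        self.seen = 0
        self.bad = 0

    def feed(self, character: str) -> None:
        self.seen += 1
        # str.isprintable() is False for whitespace such as "\n",
        # so whitespace is excluded explicitly.
        if not character.isprintable() and not character.isspace():
            self.bad += 1

    @property
    def ratio(self) -> float:
        return self.bad / self.seen if self.seen else 0.0


def chunked_ratio(text: str, maximum_threshold: float = 0.2) -> float:
    """Feed `text` block by block and stop once the ratio reaches the threshold."""
    seq_len = len(text)
    # Same step selection as the listing: longer inputs are checked less often.
    if seq_len < 511:
        step = 32
    elif seq_len < 1024:
        step = 64
    else:
        step = 128

    detector = UnprintableCounter()
    ratio = 0.0
    for block_start in range(0, seq_len, step):
        for character in text[block_start : block_start + step]:
            detector.feed(character)
        ratio = detector.ratio
        if ratio >= maximum_threshold:
            break  # early exit: the remaining blocks are never inspected
    return round(ratio, 3)
```

Because the ratio is only re-evaluated at block boundaries, a messy prefix ends the scan after one block, while clean text costs one ratio computation per block rather than per character.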
@ -0,0 +1,369 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
from encodings.aliases import aliases
|
||||||
|
from json import dumps
|
||||||
|
from re import sub
|
||||||
|
from typing import Any, Iterator, List, Tuple
|
||||||
|
|
||||||
|
from .constant import RE_POSSIBLE_ENCODING_INDICATION, TOO_BIG_SEQUENCE
|
||||||
|
from .utils import iana_name, is_multi_byte_encoding, unicode_range
|
||||||
|
|
||||||
|
|
||||||
|
class CharsetMatch:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
payload: bytes | bytearray,
|
||||||
|
guessed_encoding: str,
|
||||||
|
mean_mess_ratio: float,
|
||||||
|
has_sig_or_bom: bool,
|
||||||
|
languages: CoherenceMatches,
|
||||||
|
decoded_payload: str | None = None,
|
||||||
|
preemptive_declaration: str | None = None,
|
||||||
|
):
|
||||||
|
self._payload: bytes | bytearray = payload
|
||||||
|
|
||||||
|
self._encoding: str = guessed_encoding
|
||||||
|
self._mean_mess_ratio: float = mean_mess_ratio
|
||||||
|
self._languages: CoherenceMatches = languages
|
||||||
|
self._has_sig_or_bom: bool = has_sig_or_bom
|
||||||
|
self._unicode_ranges: list[str] | None = None
|
||||||
|
|
||||||
|
self._leaves: list[CharsetMatch] = []
|
||||||
|
self._mean_coherence_ratio: float = 0.0
|
||||||
|
|
||||||
|
self._output_payload: bytes | None = None
|
||||||
|
self._output_encoding: str | None = None
|
||||||
|
|
||||||
|
self._string: str | None = decoded_payload
|
||||||
|
|
||||||
|
self._preemptive_declaration: str | None = preemptive_declaration
|
||||||
|
|
||||||
|
def __eq__(self, other: object) -> bool:
|
||||||
|
if not isinstance(other, CharsetMatch):
|
||||||
|
if isinstance(other, str):
|
||||||
|
return iana_name(other) == self.encoding
|
||||||
|
return False
|
||||||
|
return self.encoding == other.encoding and self.fingerprint == other.fingerprint
|
||||||
|
|
||||||
|
def __lt__(self, other: object) -> bool:
|
||||||
|
"""
|
||||||
|
Implemented so that sorted() can be used on CharsetMatches items.
|
||||||
|
"""
|
||||||
|
if not isinstance(other, CharsetMatch):
|
||||||
|
raise ValueError
|
||||||
|
|
||||||
|
chaos_difference: float = abs(self.chaos - other.chaos)
|
||||||
|
coherence_difference: float = abs(self.coherence - other.coherence)
|
||||||
|
|
||||||
|
# Below 0.5% difference --> Use Coherence
|
||||||
|
if chaos_difference < 0.005 and coherence_difference > 0.02:
|
||||||
|
return self.coherence > other.coherence
|
||||||
|
elif chaos_difference < 0.005 and coherence_difference <= 0.02:
|
||||||
|
# When the decision is difficult, prefer the result that decoded the most multi-byte characters.
|
||||||
|
# Skip that comparison for oversized payloads to preserve RAM.
|
||||||
|
if len(self._payload) >= TOO_BIG_SEQUENCE:
|
||||||
|
return self.chaos < other.chaos
|
||||||
|
return self.multi_byte_usage > other.multi_byte_usage
|
||||||
|
|
||||||
|
return self.chaos < other.chaos
|
||||||
|
|
||||||
|
@property
|
||||||
|
def multi_byte_usage(self) -> float:
|
||||||
|
return 1.0 - (len(str(self)) / len(self.raw))
|
||||||
|
|
||||||
|
def __str__(self) -> str:
|
||||||
|
# Lazy Str Loading
|
||||||
|
if self._string is None:
|
||||||
|
self._string = str(self._payload, self._encoding, "strict")
|
||||||
|
# UTF-7 BOM is encoded in modified Base64 whose byte boundary
|
||||||
|
# can overlap with the next character, so raw-byte stripping
|
||||||
|
# is unreliable. Strip the decoded BOM character instead.
|
||||||
|
if (
|
||||||
|
self._has_sig_or_bom
|
||||||
|
and self._encoding == "utf_7"
|
||||||
|
and self._string
|
||||||
|
and self._string[0] == "\ufeff"
|
||||||
|
):
|
||||||
|
self._string = self._string[1:]
|
||||||
|
return self._string
|
||||||
|
|
||||||
|
def __repr__(self) -> str:
|
||||||
|
return f"<CharsetMatch '{self.encoding}' fp({self.fingerprint})>"
|
||||||
|
|
||||||
|
def add_submatch(self, other: CharsetMatch) -> None:
|
||||||
|
if not isinstance(other, CharsetMatch) or other == self:
|
||||||
|
raise ValueError(
|
||||||
|
"Unable to add instance <{}> as a submatch of a CharsetMatch".format(
|
||||||
|
other.__class__
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
other._string = None # Unload RAM usage; dirty trick.
|
||||||
|
self._leaves.append(other)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def encoding(self) -> str:
|
||||||
|
return self._encoding
|
||||||
|
|
||||||
|
@property
|
||||||
|
def encoding_aliases(self) -> list[str]:
|
||||||
|
"""
|
||||||
|
An encoding is known by many names; this can help, for example, when searching for IBM855 while it is listed as CP855.
|
||||||
|
"""
|
||||||
|
also_known_as: list[str] = []
|
||||||
|
for u, p in aliases.items():
|
||||||
|
if self.encoding == u:
|
||||||
|
also_known_as.append(p)
|
||||||
|
elif self.encoding == p:
|
||||||
|
also_known_as.append(u)
|
||||||
|
return also_known_as
|
||||||
|
|
||||||
|
@property
|
||||||
|
def bom(self) -> bool:
|
||||||
|
return self._has_sig_or_bom
|
||||||
|
|
||||||
|
@property
|
||||||
|
def byte_order_mark(self) -> bool:
|
||||||
|
return self._has_sig_or_bom
|
||||||
|
|
||||||
|
@property
|
||||||
|
def languages(self) -> list[str]:
|
||||||
|
"""
|
||||||
|
Return the complete list of possible languages found in decoded sequence.
|
||||||
|
Usually not really useful. The returned list may be empty even if the 'language' property returns something other than 'Unknown'.
|
||||||
|
"""
|
||||||
|
return [e[0] for e in self._languages]
|
||||||
|
|
||||||
|
@property
|
||||||
|
def language(self) -> str:
|
||||||
|
"""
|
||||||
|
Most probable language found in decoded sequence. If none were detected or inferred, the property will return
|
||||||
|
"Unknown".
|
||||||
|
"""
|
||||||
|
if not self._languages:
|
||||||
|
# Trying to infer the language based on the given encoding
|
||||||
|
# It's either English, or in certain cases we should not commit to a language.
|
||||||
|
if "ascii" in self.could_be_from_charset:
|
||||||
|
return "English"
|
||||||
|
|
||||||
|
# Imported here to avoid a circular import.
|
||||||
|
from charset_normalizer.cd import encoding_languages, mb_encoding_languages
|
||||||
|
|
||||||
|
languages = (
|
||||||
|
mb_encoding_languages(self.encoding)
|
||||||
|
if is_multi_byte_encoding(self.encoding)
|
||||||
|
else encoding_languages(self.encoding)
|
||||||
|
)
|
||||||
|
|
||||||
|
if len(languages) == 0 or "Latin Based" in languages:
|
||||||
|
return "Unknown"
|
||||||
|
|
||||||
|
return languages[0]
|
||||||
|
|
||||||
|
return self._languages[0][0]
|
||||||
|
|
||||||
|
@property
|
||||||
|
def chaos(self) -> float:
|
||||||
|
return self._mean_mess_ratio
|
||||||
|
|
||||||
|
@property
|
||||||
|
def coherence(self) -> float:
|
||||||
|
if not self._languages:
|
||||||
|
return 0.0
|
||||||
|
return self._languages[0][1]
|
||||||
|
|
||||||
|
@property
|
||||||
|
def percent_chaos(self) -> float:
|
||||||
|
return round(self.chaos * 100, ndigits=3)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def percent_coherence(self) -> float:
|
||||||
|
return round(self.coherence * 100, ndigits=3)
|
||||||
|
|
||||||
|
@property
|
||||||
|
def raw(self) -> bytes | bytearray:
|
||||||
|
"""
|
||||||
|
Original untouched bytes.
|
||||||
|
"""
|
||||||
|
return self._payload
|
||||||
|
|
||||||
|
@property
|
||||||
|
def submatch(self) -> list[CharsetMatch]:
|
||||||
|
return self._leaves
|
||||||
|
|
||||||
|
@property
|
||||||
|
def has_submatch(self) -> bool:
|
||||||
|
return len(self._leaves) > 0
|
||||||
|
|
||||||
|
@property
|
||||||
|
def alphabets(self) -> list[str]:
|
||||||
|
if self._unicode_ranges is not None:
|
||||||
|
return self._unicode_ranges
|
||||||
|
# list detected ranges
|
||||||
|
detected_ranges: list[str | None] = [unicode_range(char) for char in str(self)]
|
||||||
|
# filter and sort
|
||||||
|
self._unicode_ranges = sorted(list({r for r in detected_ranges if r}))
|
||||||
|
return self._unicode_ranges
|
||||||
|
|
||||||
|
@property
|
||||||
|
def could_be_from_charset(self) -> list[str]:
|
||||||
|
"""
|
||||||
|
The complete list of encodings that produce the exact SAME str result and therefore could be the originating
|
||||||
|
encoding.
|
||||||
|
This list does include the encoding available in property 'encoding'.
|
||||||
|
"""
|
||||||
|
return [self._encoding] + [m.encoding for m in self._leaves]
|
||||||
|
|
||||||
|
def output(self, encoding: str = "utf_8") -> bytes:
|
||||||
|
"""
|
||||||
|
Return the payload re-encoded with the given target encoding. Defaults to UTF-8.
|
||||||
|
Characters that cannot be encoded are replaced by the encoder, not ignored.
|
||||||
|
"""
|
||||||
|
if self._output_encoding is None or self._output_encoding != encoding:
|
||||||
|
self._output_encoding = encoding
|
||||||
|
decoded_string = str(self)
|
||||||
|
if (
|
||||||
|
self._preemptive_declaration is not None
|
||||||
|
and self._preemptive_declaration.lower()
|
||||||
|
not in ["utf-8", "utf8", "utf_8"]
|
||||||
|
):
|
||||||
|
patched_header = sub(
|
||||||
|
RE_POSSIBLE_ENCODING_INDICATION,
|
||||||
|
lambda m: m.string[m.span()[0] : m.span()[1]].replace(
|
||||||
|
m.groups()[0],
|
||||||
|
iana_name(self._output_encoding).replace("_", "-"), # type: ignore[arg-type]
|
||||||
|
),
|
||||||
|
decoded_string[:8192],
|
||||||
|
count=1,
|
||||||
|
)
|
||||||
|
|
||||||
|
decoded_string = patched_header + decoded_string[8192:]
|
||||||
|
|
||||||
|
self._output_payload = decoded_string.encode(encoding, "replace")
|
||||||
|
|
||||||
|
return self._output_payload # type: ignore
|
||||||
|
|
||||||
|
@property
|
||||||
|
def fingerprint(self) -> int:
|
||||||
|
"""
|
||||||
|
Retrieve a hash fingerprint of the decoded payload, used for deduplication.
|
||||||
|
"""
|
||||||
|
return hash(str(self))
|
||||||
|
|
||||||
|
|
||||||
|
class CharsetMatches:
|
||||||
|
"""
|
||||||
|
Container holding every CharsetMatch item, ordered by default from most probable to least probable.
|
||||||
|
Acts like a list (iterable) but does not implement all related methods.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, results: list[CharsetMatch] | None = None):
|
||||||
|
self._results: list[CharsetMatch] = sorted(results) if results else []
|
||||||
|
|
||||||
|
def __iter__(self) -> Iterator[CharsetMatch]:
|
||||||
|
yield from self._results
|
||||||
|
|
||||||
|
def __getitem__(self, item: int | str) -> CharsetMatch:
|
||||||
|
"""
|
||||||
|
Retrieve a single item either by its position or encoding name (alias may be used here).
|
||||||
|
Raise KeyError when the encoding is not present in results; an out-of-range position raises IndexError.
|
||||||
|
"""
|
||||||
|
if isinstance(item, int):
|
||||||
|
return self._results[item]
|
||||||
|
if isinstance(item, str):
|
||||||
|
item = iana_name(item, False)
|
||||||
|
for result in self._results:
|
||||||
|
if item in result.could_be_from_charset:
|
||||||
|
return result
|
||||||
|
raise KeyError
|
||||||
|
|
||||||
|
def __len__(self) -> int:
|
||||||
|
return len(self._results)
|
||||||
|
|
||||||
|
def __bool__(self) -> bool:
|
||||||
|
return len(self._results) > 0
|
||||||
|
|
||||||
|
def append(self, item: CharsetMatch) -> None:
|
||||||
|
"""
|
||||||
|
Insert a single match at the position that preserves the sort order.
|
||||||
|
It may instead be attached to an existing match as a submatch.
|
||||||
|
"""
|
||||||
|
if not isinstance(item, CharsetMatch):
|
||||||
|
raise ValueError(
|
||||||
|
"Cannot append instance '{}' to CharsetMatches".format(
|
||||||
|
str(item.__class__)
|
||||||
|
)
|
||||||
|
)
|
||||||
|
# Disable submatch factoring when the input is too large (conserves RAM).
|
||||||
|
if len(item.raw) < TOO_BIG_SEQUENCE:
|
||||||
|
for match in self._results:
|
||||||
|
if match.fingerprint == item.fingerprint and match.chaos == item.chaos:
|
||||||
|
match.add_submatch(item)
|
||||||
|
return
|
||||||
|
self._results.append(item)
|
||||||
|
self._results = sorted(self._results)
|
||||||
|
|
||||||
|
def best(self) -> CharsetMatch | None:
|
||||||
|
"""
|
||||||
|
Return the first match, or None when there are no results.
|
||||||
|
"""
|
||||||
|
if not self._results:
|
||||||
|
return None
|
||||||
|
return self._results[0]
|
||||||
|
|
||||||
|
def first(self) -> CharsetMatch | None:
|
||||||
|
"""
|
||||||
|
Redundant method that calls best(). Kept for backward compatibility.
|
||||||
|
"""
|
||||||
|
return self.best()
|
||||||
|
|
||||||
|
|
||||||
|
CoherenceMatch = Tuple[str, float]
|
||||||
|
CoherenceMatches = List[CoherenceMatch]
|
||||||
|
|
||||||
|
|
||||||
|
class CliDetectionResult:
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
path: str,
|
||||||
|
encoding: str | None,
|
||||||
|
encoding_aliases: list[str],
|
||||||
|
alternative_encodings: list[str],
|
||||||
|
language: str,
|
||||||
|
alphabets: list[str],
|
||||||
|
has_sig_or_bom: bool,
|
||||||
|
chaos: float,
|
||||||
|
coherence: float,
|
||||||
|
unicode_path: str | None,
|
||||||
|
is_preferred: bool,
|
||||||
|
):
|
||||||
|
self.path: str = path
|
||||||
|
self.unicode_path: str | None = unicode_path
|
||||||
|
self.encoding: str | None = encoding
|
||||||
|
self.encoding_aliases: list[str] = encoding_aliases
|
||||||
|
self.alternative_encodings: list[str] = alternative_encodings
|
||||||
|
self.language: str = language
|
||||||
|
self.alphabets: list[str] = alphabets
|
||||||
|
self.has_sig_or_bom: bool = has_sig_or_bom
|
||||||
|
self.chaos: float = chaos
|
||||||
|
self.coherence: float = coherence
|
||||||
|
self.is_preferred: bool = is_preferred
|
||||||
|
|
||||||
|
@property
|
||||||
|
def __dict__(self) -> dict[str, Any]: # type: ignore
|
||||||
|
return {
|
||||||
|
"path": self.path,
|
||||||
|
"encoding": self.encoding,
|
||||||
|
"encoding_aliases": self.encoding_aliases,
|
||||||
|
"alternative_encodings": self.alternative_encodings,
|
||||||
|
"language": self.language,
|
||||||
|
"alphabets": self.alphabets,
|
||||||
|
"has_sig_or_bom": self.has_sig_or_bom,
|
||||||
|
"chaos": self.chaos,
|
||||||
|
"coherence": self.coherence,
|
||||||
|
"unicode_path": self.unicode_path,
|
||||||
|
"is_preferred": self.is_preferred,
|
||||||
|
}
|
||||||
|
|
||||||
|
def to_json(self) -> str:
|
||||||
|
return dumps(self.__dict__, ensure_ascii=True, indent=4)
|
||||||
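The ordering rule in `CharsetMatch.__lt__` above can be tried out on its own. This is a minimal sketch under stated assumptions: `Candidate` is a hypothetical stand-in that carries only the fields the comparison reads, the sample values are invented for illustration, and the `TOO_BIG_SEQUENCE` shortcut for oversized payloads is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a match: only the fields the ordering needs."""

    encoding: str
    chaos: float  # mess ratio, lower is better
    coherence: float  # language coherence, higher is better
    multi_byte_usage: float  # share of the payload decoded as multi-byte

    def __lt__(self, other: "Candidate") -> bool:
        chaos_difference = abs(self.chaos - other.chaos)
        coherence_difference = abs(self.coherence - other.coherence)

        # Chaos within 0.5%: it no longer discriminates, so use coherence.
        if chaos_difference < 0.005 and coherence_difference > 0.02:
            return self.coherence > other.coherence
        # Chaos and coherence both close: prefer the more multi-byte decoding.
        if chaos_difference < 0.005:
            return self.multi_byte_usage > other.multi_byte_usage
        # Otherwise the less chaotic candidate wins.
        return self.chaos < other.chaos


ranked = sorted(
    [
        Candidate("cp1252", chaos=0.010, coherence=0.40, multi_byte_usage=0.0),
        Candidate("utf_8", chaos=0.012, coherence=0.80, multi_byte_usage=0.1),
        Candidate("latin_1", chaos=0.150, coherence=0.90, multi_byte_usage=0.0),
    ]
)
print([c.encoding for c in ranked])  # ['utf_8', 'cp1252', 'latin_1']
```

Here `utf_8` outranks `cp1252` despite a slightly higher chaos because the two are within 0.5% of each other and coherence decides, while `latin_1` comes last on chaos alone even though it has the highest coherence.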
@ -0,0 +1,422 @@
|
|||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import importlib
|
||||||
|
import logging
|
||||||
|
import unicodedata
|
||||||
|
from bisect import bisect_right
|
||||||
|
from codecs import IncrementalDecoder
|
||||||
|
from encodings.aliases import aliases
|
||||||
|
from functools import lru_cache
|
||||||
|
from re import findall
|
||||||
|
from typing import Generator
|
||||||
|
|
||||||
|
from _multibytecodec import ( # type: ignore[import-not-found,import]
|
||||||
|
MultibyteIncrementalDecoder,
|
||||||
|
)
|
||||||
|
|
||||||
|
from .constant import (
|
||||||
|
ENCODING_MARKS,
|
||||||
|
IANA_SUPPORTED_SIMILAR,
|
||||||
|
RE_POSSIBLE_ENCODING_INDICATION,
|
||||||
|
UNICODE_RANGES_COMBINED,
|
||||||
|
UNICODE_SECONDARY_RANGE_KEYWORD,
|
||||||
|
UTF8_MAXIMAL_ALLOCATION,
|
||||||
|
COMMON_CJK_CHARACTERS,
|
||||||
|
_LATIN,
|
||||||
|
_CJK,
|
||||||
|
_HANGUL,
|
||||||
|
_KATAKANA,
|
||||||
|
_HIRAGANA,
|
||||||
|
_THAI,
|
||||||
|
_ARABIC,
|
||||||
|
_ARABIC_ISOLATED_FORM,
|
||||||
|
_ACCENT_KEYWORDS,
|
||||||
|
_ACCENTUATED,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def _character_flags(character: str) -> int:
|
||||||
|
"""Compute all name-based classification flags with a single unicodedata.name() call."""
|
||||||
|
try:
|
||||||
|
desc: str = unicodedata.name(character)
|
||||||
|
except ValueError:
|
||||||
|
return 0
|
||||||
|
|
||||||
|
flags: int = 0
|
||||||
|
|
||||||
|
if "LATIN" in desc:
|
||||||
|
flags |= _LATIN
|
||||||
|
if "CJK" in desc:
|
||||||
|
flags |= _CJK
|
||||||
|
if "HANGUL" in desc:
|
||||||
|
flags |= _HANGUL
|
||||||
|
if "KATAKANA" in desc:
|
||||||
|
flags |= _KATAKANA
|
||||||
|
if "HIRAGANA" in desc:
|
||||||
|
flags |= _HIRAGANA
|
||||||
|
if "THAI" in desc:
|
||||||
|
flags |= _THAI
|
||||||
|
if "ARABIC" in desc:
|
||||||
|
flags |= _ARABIC
|
||||||
|
if "ISOLATED FORM" in desc:
|
||||||
|
flags |= _ARABIC_ISOLATED_FORM
|
||||||
|
|
||||||
|
for kw in _ACCENT_KEYWORDS:
|
||||||
|
if kw in desc:
|
||||||
|
flags |= _ACCENTUATED
|
||||||
|
break
|
||||||
|
|
||||||
|
return flags
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_accentuated(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _ACCENTUATED)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def remove_accent(character: str) -> str:
|
||||||
|
decomposed: str = unicodedata.decomposition(character)
|
||||||
|
if not decomposed:
|
||||||
|
return character
|
||||||
|
|
||||||
|
codes: list[str] = decomposed.split(" ")
|
||||||
|
|
||||||
|
return chr(int(codes[0], 16))
|
||||||
|
|
||||||
|
|
||||||
|
# Pre-built sorted lookup table for O(log n) binary search in unicode_range().
|
||||||
|
# Each entry is (range_start, range_end_exclusive, range_name).
|
||||||
|
_UNICODE_RANGES_SORTED: list[tuple[int, int, str]] = sorted(
|
||||||
|
(ord_range.start, ord_range.stop, name)
|
||||||
|
for name, ord_range in UNICODE_RANGES_COMBINED.items()
|
||||||
|
)
|
||||||
|
_UNICODE_RANGE_STARTS: list[int] = [e[0] for e in _UNICODE_RANGES_SORTED]
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def unicode_range(character: str) -> str | None:
|
||||||
|
"""
|
||||||
|
Retrieve the official Unicode range name of a single character.
|
||||||
|
"""
|
||||||
|
character_ord: int = ord(character)
|
||||||
|
|
||||||
|
# Binary search: find the rightmost range whose start <= character_ord
|
||||||
|
idx = bisect_right(_UNICODE_RANGE_STARTS, character_ord) - 1
|
||||||
|
if idx >= 0:
|
||||||
|
start, stop, name = _UNICODE_RANGES_SORTED[idx]
|
||||||
|
if character_ord < stop:
|
||||||
|
return name
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_latin(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _LATIN)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_punctuation(character: str) -> bool:
|
||||||
|
character_category: str = unicodedata.category(character)
|
||||||
|
|
||||||
|
if "P" in character_category:
|
||||||
|
return True
|
||||||
|
|
||||||
|
character_range: str | None = unicode_range(character)
|
||||||
|
|
||||||
|
if character_range is None:
|
||||||
|
return False
|
||||||
|
|
||||||
|
return "Punctuation" in character_range
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_symbol(character: str) -> bool:
|
||||||
|
character_category: str = unicodedata.category(character)
|
||||||
|
|
||||||
|
if "S" in character_category or "N" in character_category:
|
||||||
|
return True
|
||||||
|
|
||||||
|
character_range: str | None = unicode_range(character)
|
||||||
|
|
||||||
|
if character_range is None:
|
||||||
|
return False
|
||||||
|
|
||||||
|
return "Forms" in character_range and character_category != "Lo"
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_emoticon(character: str) -> bool:
|
||||||
|
character_range: str | None = unicode_range(character)
|
||||||
|
|
||||||
|
if character_range is None:
|
||||||
|
return False
|
||||||
|
|
||||||
|
return "Emoticons" in character_range or "Pictographs" in character_range
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_separator(character: str) -> bool:
|
||||||
|
if character.isspace() or character in {"|", "+", "<", ">"}:
|
||||||
|
return True
|
||||||
|
|
||||||
|
character_category: str = unicodedata.category(character)
|
||||||
|
|
||||||
|
return "Z" in character_category or character_category in {"Po", "Pd", "Pc"}
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_case_variable(character: str) -> bool:
|
||||||
|
return character.islower() != character.isupper()
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_cjk(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _CJK)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_hiragana(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _HIRAGANA)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_katakana(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _KATAKANA)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_hangul(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _HANGUL)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_thai(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _THAI)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_arabic(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _ARABIC)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_arabic_isolated_form(character: str) -> bool:
|
||||||
|
return bool(_character_flags(character) & _ARABIC_ISOLATED_FORM)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_cjk_uncommon(character: str) -> bool:
|
||||||
|
return character not in COMMON_CJK_CHARACTERS
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=len(UNICODE_RANGES_COMBINED))
|
||||||
|
def is_unicode_range_secondary(range_name: str) -> bool:
|
||||||
|
return any(keyword in range_name for keyword in UNICODE_SECONDARY_RANGE_KEYWORD)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
|
||||||
|
def is_unprintable(character: str) -> bool:
|
||||||
|
return (
|
||||||
|
character.isspace() is False # includes \n \t \r \v
|
||||||
|
and character.isprintable() is False
|
||||||
|
and character != "\x1a"  # Why? It's the ASCII substitute character.
|
||||||
|
and character != "\ufeff" # bug discovered in Python,
|
||||||
|
# Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space.
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def any_specified_encoding(
|
||||||
|
sequence: bytes | bytearray, search_zone: int = 8192
|
||||||
|
) -> str | None:
|
||||||
|
"""
|
||||||
|
Using an ASCII-only decoder, extract any encoding declared in the first n bytes.
|
||||||
|
"""
|
||||||
|
if not isinstance(sequence, (bytes, bytearray)):
|
||||||
|
raise TypeError
|
||||||
|
|
||||||
|
seq_len: int = len(sequence)
|
||||||
|
|
||||||
|
results: list[str] = findall(
|
||||||
|
RE_POSSIBLE_ENCODING_INDICATION,
|
||||||
|
sequence[: min(seq_len, search_zone)].decode("ascii", errors="ignore"),
|
||||||
|
)
|
||||||
|
|
||||||
|
if len(results) == 0:
|
||||||
|
return None
|
||||||
|
|
||||||
|
for specified_encoding in results:
|
||||||
|
specified_encoding = specified_encoding.lower().replace("-", "_")
|
||||||
|
|
||||||
|
encoding_alias: str
|
||||||
|
encoding_iana: str
|
||||||
|
|
||||||
|
for encoding_alias, encoding_iana in aliases.items():
|
||||||
|
if encoding_alias == specified_encoding:
|
||||||
|
return encoding_iana
|
||||||
|
if encoding_iana == specified_encoding:
|
||||||
|
return encoding_iana
|
||||||
|
|
||||||
|
return None
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=128)
|
||||||
|
def is_multi_byte_encoding(name: str) -> bool:
|
||||||
|
"""
|
||||||
|
Verify whether a specific encoding is a multi-byte one, based on its IANA name.
|
||||||
|
"""
|
||||||
|
return name in {
|
||||||
|
"utf_8",
|
||||||
|
"utf_8_sig",
|
||||||
|
"utf_16",
|
||||||
|
"utf_16_be",
|
||||||
|
"utf_16_le",
|
||||||
|
"utf_32",
|
||||||
|
"utf_32_le",
|
||||||
|
"utf_32_be",
|
||||||
|
"utf_7",
|
||||||
|
} or issubclass(
|
||||||
|
importlib.import_module(f"encodings.{name}").IncrementalDecoder,
|
||||||
|
MultibyteIncrementalDecoder,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def identify_sig_or_bom(sequence: bytes | bytearray) -> tuple[str | None, bytes]:
|
||||||
|
"""
|
||||||
|
Identify and extract SIG/BOM in given sequence.
|
||||||
|
"""
|
||||||
|
|
||||||
|
for iana_encoding in ENCODING_MARKS:
|
||||||
|
marks: bytes | list[bytes] = ENCODING_MARKS[iana_encoding]
|
||||||
|
|
||||||
|
if isinstance(marks, bytes):
|
||||||
|
marks = [marks]
|
||||||
|
|
||||||
|
for mark in marks:
|
||||||
|
if sequence.startswith(mark):
|
||||||
|
return iana_encoding, mark
|
||||||
|
|
||||||
|
return None, b""
|
||||||
|
|
||||||
|
|
||||||
|
def should_strip_sig_or_bom(iana_encoding: str) -> bool:
|
||||||
|
return iana_encoding not in {"utf_16", "utf_32"}
|
||||||
|
|
||||||
|
|
||||||
|
def iana_name(cp_name: str, strict: bool = True) -> str:
|
||||||
|
"""Returns the Python normalized encoding name (Not the IANA official name)."""
|
||||||
|
cp_name = cp_name.lower().replace("-", "_")
|
||||||
|
|
||||||
|
encoding_alias: str
|
||||||
|
encoding_iana: str
|
||||||
|
|
||||||
|
for encoding_alias, encoding_iana in aliases.items():
|
||||||
|
if cp_name in [encoding_alias, encoding_iana]:
|
||||||
|
return encoding_iana
|
||||||
|
|
||||||
|
if strict:
|
||||||
|
raise ValueError(f"Unable to retrieve IANA for '{cp_name}'")
|
||||||
|
|
||||||
|
return cp_name
|
||||||
|
|
||||||
|
|
||||||
|
def cp_similarity(iana_name_a: str, iana_name_b: str) -> float:
|
||||||
|
if is_multi_byte_encoding(iana_name_a) or is_multi_byte_encoding(iana_name_b):
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
decoder_a = importlib.import_module(f"encodings.{iana_name_a}").IncrementalDecoder
|
||||||
|
decoder_b = importlib.import_module(f"encodings.{iana_name_b}").IncrementalDecoder
|
||||||
|
|
||||||
|
id_a: IncrementalDecoder = decoder_a(errors="ignore")
|
||||||
|
id_b: IncrementalDecoder = decoder_b(errors="ignore")
|
||||||
|
|
||||||
|
character_match_count: int = 0
|
||||||
|
|
||||||
|
for i in range(256):
|
||||||
|
to_be_decoded: bytes = bytes([i])
|
||||||
|
if id_a.decode(to_be_decoded) == id_b.decode(to_be_decoded):
|
||||||
|
character_match_count += 1
|
||||||
|
|
||||||
|
return character_match_count / 256
|
||||||
|
|
||||||
|
|
||||||
|
def is_cp_similar(iana_name_a: str, iana_name_b: str) -> bool:
|
||||||
|
"""
|
||||||
|
Determine if two code pages are at least 80% similar. The IANA_SUPPORTED_SIMILAR dict was generated using
|
||||||
|
the function cp_similarity.
|
||||||
|
"""
|
||||||
|
return (
|
||||||
|
iana_name_a in IANA_SUPPORTED_SIMILAR
|
||||||
|
and iana_name_b in IANA_SUPPORTED_SIMILAR[iana_name_a]
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def set_logging_handler(
|
||||||
|
name: str = "charset_normalizer",
|
||||||
|
level: int = logging.INFO,
|
||||||
|
format_string: str = "%(asctime)s | %(levelname)s | %(message)s",
|
||||||
|
) -> None:
|
||||||
|
logger = logging.getLogger(name)
|
||||||
|
logger.setLevel(level)
|
||||||
|
|
||||||
|
handler = logging.StreamHandler()
|
||||||
|
handler.setFormatter(logging.Formatter(format_string))
|
||||||
|
logger.addHandler(handler)
|
||||||
|
|
||||||
|
|
||||||
|
def cut_sequence_chunks(
|
||||||
|
sequences: bytes | bytearray,
|
||||||
|
encoding_iana: str,
|
||||||
|
offsets: range,
|
||||||
|
chunk_size: int,
|
||||||
|
bom_or_sig_available: bool,
|
||||||
|
strip_sig_or_bom: bool,
|
||||||
|
sig_payload: bytes,
|
||||||
|
is_multi_byte_decoder: bool,
|
||||||
|
decoded_payload: str | None = None,
|
||||||
|
) -> Generator[str, None, None]:
|
||||||
|
if decoded_payload and is_multi_byte_decoder is False:
|
||||||
|
for i in offsets:
|
||||||
|
chunk = decoded_payload[i : i + chunk_size]
|
||||||
|
if not chunk:
|
||||||
|
break
|
||||||
|
yield chunk
|
||||||
|
else:
|
||||||
|
for i in offsets:
|
||||||
|
chunk_end = i + chunk_size
|
||||||
|
if chunk_end > len(sequences) + 8:
|
||||||
|
continue
|
||||||
|
|
||||||
|
cut_sequence = sequences[i : i + chunk_size]
|
||||||
|
|
||||||
|
if bom_or_sig_available and strip_sig_or_bom is False:
|
||||||
|
cut_sequence = sig_payload + cut_sequence
|
||||||
|
|
||||||
|
chunk = cut_sequence.decode(
|
||||||
|
encoding_iana,
|
||||||
|
errors="ignore" if is_multi_byte_decoder else "strict",
|
||||||
|
)
|
||||||
|
|
||||||
|
# multi-byte bad cutting detector and adjustment
|
||||||
|
# not the cleanest way to perform that fix but clever enough for now.
|
||||||
|
if is_multi_byte_decoder and i > 0:
|
||||||
|
chunk_partial_size_chk: int = min(chunk_size, 16)
|
||||||
|
|
||||||
|
if (
|
||||||
|
decoded_payload
|
||||||
|
and chunk[:chunk_partial_size_chk] not in decoded_payload
|
||||||
|
):
|
||||||
|
for j in range(i, i - 4, -1):
|
||||||
|
cut_sequence = sequences[j:chunk_end]
|
||||||
|
|
||||||
|
if bom_or_sig_available and strip_sig_or_bom is False:
|
||||||
|
cut_sequence = sig_payload + cut_sequence
|
||||||
|
|
||||||
|
chunk = cut_sequence.decode(encoding_iana, errors="ignore")
|
||||||
|
|
||||||
|
if chunk[:chunk_partial_size_chk] in decoded_payload:
|
||||||
|
break
|
||||||
|
|
||||||
|
yield chunk
|
||||||
@ -0,0 +1,8 @@
|
|||||||
|
"""
|
||||||
|
Expose version
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
__version__ = "3.4.7"
|
||||||
|
VERSION = __version__.split(".")
|
||||||
@ -0,0 +1 @@
|
|||||||
|
pip
|
||||||
@ -0,0 +1,194 @@
|
|||||||
|
Metadata-Version: 2.4
|
||||||
|
Name: idna
|
||||||
|
Version: 3.15
|
||||||
|
Summary: Internationalized Domain Names in Applications (IDNA)
|
||||||
|
Author-email: Kim Davies <kim+pypi@gumleaf.org>
|
||||||
|
Requires-Python: >=3.8
|
||||||
|
Description-Content-Type: text/markdown
|
||||||
|
License-Expression: BSD-3-Clause
|
||||||
|
Classifier: Development Status :: 5 - Production/Stable
|
||||||
|
Classifier: Intended Audience :: Developers
|
||||||
|
Classifier: Intended Audience :: System Administrators
|
||||||
|
Classifier: Operating System :: OS Independent
|
||||||
|
Classifier: Programming Language :: Python
|
||||||
|
Classifier: Programming Language :: Python :: 3
|
||||||
|
Classifier: Programming Language :: Python :: 3 :: Only
|
||||||
|
Classifier: Programming Language :: Python :: 3.8
|
||||||
|
Classifier: Programming Language :: Python :: 3.9
|
||||||
|
Classifier: Programming Language :: Python :: 3.10
|
||||||
|
Classifier: Programming Language :: Python :: 3.11
|
||||||
|
Classifier: Programming Language :: Python :: 3.12
|
||||||
|
Classifier: Programming Language :: Python :: 3.13
|
||||||
|
Classifier: Programming Language :: Python :: 3.14
|
||||||
|
Classifier: Programming Language :: Python :: Implementation :: CPython
|
||||||
|
Classifier: Programming Language :: Python :: Implementation :: PyPy
|
||||||
|
Classifier: Topic :: Internet :: Name Service (DNS)
|
||||||
|
Classifier: Topic :: Software Development :: Libraries :: Python Modules
|
||||||
|
Classifier: Topic :: Utilities
|
||||||
|
License-File: LICENSE.md
|
||||||
|
Requires-Dist: ruff >= 0.6.2 ; extra == "all"
|
||||||
|
Requires-Dist: mypy >= 1.11.2 ; extra == "all"
|
||||||
|
Requires-Dist: pytest >= 8.3.2 ; extra == "all"
|
||||||
|
Project-URL: Changelog, https://github.com/kjd/idna/blob/master/HISTORY.md
|
||||||
|
Project-URL: Issue tracker, https://github.com/kjd/idna/issues
|
||||||
|
Project-URL: Source, https://github.com/kjd/idna
|
||||||
|
Provides-Extra: all
|
||||||
|
|
||||||
|
# Internationalized Domain Names in Applications (IDNA)
|
||||||
|
|
||||||
|
Support for [Internationalized Domain Names in
|
||||||
|
Applications (IDNA)](https://tools.ietf.org/html/rfc5891)
|
||||||
|
and [Unicode IDNA Compatibility Processing](https://unicode.org/reports/tr46/).
|
||||||
|
|
||||||
|
The latest versions of these standards supplied here provide
|
||||||
|
more comprehensive language coverage and reduce the potential of
|
||||||
|
allowing domains with known security vulnerabilities. This library
|
||||||
|
is a suitable replacement for the "encodings.idna"
|
||||||
|
module that comes with the Python standard library, but which
|
||||||
|
only supports an older superseded IDNA specification from 2003.
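
One visible difference between the two specifications is the treatment of the German sharp s. The sketch below uses only the standard library: the built-in `idna` codec (IDNA 2003) maps `ß` to `ss`, whereas IDNA 2008 keeps `ß` as a valid character, so the label is Punycode-encoded instead. The second value is computed here with the standard `punycode` codec as an approximation of what an IDNA 2008 encoder produces for this one label; it does not call this library.

```python
# IDNA 2003 via the standard library codec: the sharp s is mapped to "ss".
legacy = "faß.de".encode("idna")
print(legacy)  # b'fass.de'

# IDNA 2008 keeps the sharp s, so the label needs an "xn--" Punycode form.
# Approximation only: plain Punycode with the ACE prefix, no IDNA validation.
modern_label = b"xn--" + "faß".encode("punycode")
print(modern_label)  # b'xn--fa-hia'
```

The two results name different domains, which is why the choice of specification matters for lookups.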
|
||||||
|
|
||||||
|
Basic functions are simply executed:
|
||||||
|
|
||||||
|
```pycon
|
||||||
|
>>> import idna
|
||||||
|
>>> idna.encode('ドメイン.テスト')
|
||||||
|
b'xn--eckwd4c7c.xn--zckzah'
|
||||||
|
>>> print(idna.decode('xn--eckwd4c7c.xn--zckzah'))
|
||||||
|
ドメイン.テスト
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
This package is available for installation from PyPI via the
|
||||||
|
typical mechanisms, such as:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
$ python3 -m pip install idna
|
||||||
|
```
|
||||||
|
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
For typical usage, the `encode` and `decode` functions will take a
|
||||||
|
domain name argument and perform a conversion to ASCII-compatible encoding
|
||||||
|
(known as A-labels), or to Unicode strings (known as U-labels)
|
||||||
|
respectively.
|
||||||
|
|
||||||
|
```pycon
|
||||||
|
>>> import idna
|
||||||
|
>>> idna.encode('ドメイン.テスト')
|
||||||
|
b'xn--eckwd4c7c.xn--zckzah'
|
||||||
|
>>> print(idna.decode('xn--eckwd4c7c.xn--zckzah'))
|
||||||
|
ドメイン.テスト
|
||||||
|
```
|
||||||
|
|
||||||
|
Conversions can be applied on a per-label basis using the `ulabel` or
|
||||||
|
`alabel` functions if necessary:
|
||||||
|
|
||||||
|
```pycon
|
||||||
|
>>> idna.alabel('测试')
|
||||||
|
b'xn--0zwm56d'
|
||||||
|
```
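
The relationship between a U-label and its A-label can be approximated with the standard library alone. The sketch below is an illustration, not this library's implementation: `naive_alabel` is a hypothetical helper that prepends the `xn--` prefix to the Punycode form of a label and skips every check that `idna.alabel` performs (codepoint eligibility, bidi rules, contextual rules, length limits).

```python
ACE_PREFIX = b"xn--"  # prefix that marks an A-label


def naive_alabel(label: str) -> bytes:
    """Hypothetical helper: Punycode-encode one label with no IDNA validation."""
    try:
        # Pure-ASCII labels are passed through unchanged.
        return label.encode("ascii")
    except UnicodeEncodeError:
        # The standard "punycode" codec produces the part after "xn--".
        return ACE_PREFIX + label.encode("punycode")


print(naive_alabel("测试"))     # b'xn--0zwm56d'
print(naive_alabel("example"))  # b'example'
```

Because it validates nothing, a helper like this would also accept labels that the library rejects, so it is only useful for seeing where the `xn--` form comes from.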
|
||||||
|
|
||||||
|
|
||||||
|
### Compatibility Mapping (UTS #46)
|
||||||
|
|
||||||
|
This library provides support for [Unicode IDNA Compatibility
|
||||||
|
Processing](https://unicode.org/reports/tr46/) which normalizes input from
|
||||||
|
different potential ways a user may input a domain prior to performing the IDNA
|
||||||
|
conversion operations. This functionality, known as a
|
||||||
|
[mapping](https://tools.ietf.org/html/rfc5895), is considered by the
|
||||||
|
specification to be a local user-interface issue distinct from IDNA
|
||||||
|
conversion functionality.
|
||||||
|
|
||||||
|
For example, "Königsgäßchen" is not a permissible label as *LATIN
|
||||||
|
CAPITAL LETTER K* is not allowed (nor are capital letters in general).
|
||||||
|
UTS 46 will convert this into lower case prior to applying the IDNA
|
||||||
|
conversion.
|
||||||
|
|
||||||
|
```pycon
|
||||||
|
>>> import idna
|
||||||
|
>>> idna.encode('Königsgäßchen')
|
||||||
|
...
|
||||||
|
idna.core.InvalidCodepoint: Codepoint U+004B at position 1 of 'Königsgäßchen' not allowed
|
||||||
|
>>> idna.encode('Königsgäßchen', uts46=True)
|
||||||
|
b'xn--knigsgchen-b4a3dun'
|
||||||
|
>>> print(idna.decode('xn--knigsgchen-b4a3dun'))
|
||||||
|
königsgäßchen
|
||||||
|
```
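
For this particular input, the effect of the mapping step can be imitated with the standard library. The sketch below is a crude stand-in, assuming that lower-casing plus NFC normalization is enough for the label at hand; it does not consult the UTS 46 mapping table and performs no IDNA validation. `crude_map_then_encode` is a hypothetical name.

```python
import unicodedata


def crude_map_then_encode(label: str) -> bytes:
    """Hypothetical stand-in: lower-case, NFC-normalize, then Punycode-encode."""
    mapped = unicodedata.normalize("NFC", label.lower())
    if mapped.isascii():
        return mapped.encode("ascii")
    # Plain Punycode with the ACE prefix; none of the library's checks are run.
    return b"xn--" + mapped.encode("punycode")


print(crude_map_then_encode("Königsgäßchen"))  # b'xn--knigsgchen-b4a3dun'
```

Note that `str.lower()` leaves `ß` unchanged, which matches the output shown above; `str.casefold()` would turn it into `ss` and yield a different label.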
|
||||||
|
|
||||||
|
|
||||||
|
## Exceptions
|
||||||
|
|
||||||
|
All errors raised during the conversion following the specification
|
||||||
|
should raise an exception derived from the `idna.IDNAError` base
|
||||||
|
class.
|
||||||
|
|
||||||
|
More specific exceptions that may be generated are `idna.IDNABidiError`
|
||||||
|
when the error reflects an illegal combination of left-to-right and
|
||||||
|
right-to-left characters in a label; `idna.InvalidCodepoint` when
|
||||||
|
a specific codepoint is an illegal character in an IDN label (i.e.
|
||||||
|
INVALID); and `idna.InvalidCodepointContext` when the codepoint is
|
||||||
|
illegal based on its position in the string (i.e. it is CONTEXTO or CONTEXTJ
|
||||||
|
but the contextual requirements are not satisfied.)
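
A handler can therefore catch the most specific class first and fall back to the base class. The sketch below mirrors the hierarchy with local stand-in classes so that it runs without the package installed; the base class in `idna/core.py` derives from `UnicodeError`, and with the real library the `except` clauses would name `idna.InvalidCodepoint` and `idna.IDNAError` instead. `describe_failure` is a hypothetical helper.

```python
class IDNAError(UnicodeError):
    """Stand-in for idna.IDNAError."""


class InvalidCodepoint(IDNAError):
    """Stand-in for idna.InvalidCodepoint."""


class IDNABidiError(IDNAError):
    """Stand-in for idna.IDNABidiError."""


def describe_failure(exc: Exception) -> str:
    """Hypothetical helper: classify an error from most to least specific."""
    try:
        raise exc
    except InvalidCodepoint:
        return "invalid codepoint"
    except IDNAError:
        return "other IDNA error"
    except UnicodeError:
        return "non-IDNA Unicode error"


print(describe_failure(InvalidCodepoint("U+004B not allowed")))  # invalid codepoint
print(describe_failure(IDNABidiError("mixed directions")))       # other IDNA error
print(describe_failure(UnicodeError("unrelated")))               # non-IDNA Unicode error
```

The order of the `except` clauses matters: placing the base class first would hide the more specific cases.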
|
||||||
|
|
||||||
|
## Building and Diagnostics
|
||||||
|
|
||||||
|
The IDNA and UTS 46 functionality relies upon pre-calculated lookup
|
||||||
|
tables for performance. These tables are derived from computing against
|
||||||
|
eligibility criteria in the respective standards using the command-line
|
||||||
|
script `tools/idna-data`.
|
||||||
|
|
||||||
|
This tool will fetch relevant codepoint data from the Unicode repository
|
||||||
|
and perform the required calculations to identify eligibility. There are
|
||||||
|
three main modes:
|
||||||
|
|
||||||
|
* `idna-data make-libdata`. Generates `idnadata.py` and
|
||||||
|
`uts46data.py`, the pre-calculated lookup tables used for IDNA and
|
||||||
|
UTS 46 conversions. Implementers who wish to track this library against
|
||||||
|
a different Unicode version may use this tool to manually generate a
|
||||||
|
different version of the `idnadata.py` and `uts46data.py` files.
|
||||||
|
|
||||||
|
* `idna-data make-table`. Generate a table of the IDNA disposition
|
||||||
|
(e.g. PVALID, CONTEXTJ, CONTEXTO) in the format found in Appendix
|
||||||
|
B.1 of RFC 5892 and the pre-computed tables published by [IANA](https://www.iana.org/).
|
||||||
|
|
||||||
|
* `idna-data U+0061`. Prints debugging output on the various
|
||||||
|
properties associated with an individual Unicode codepoint (in this
|
||||||
|
case, U+0061), that are used to assess the IDNA and UTS 46 status of a
|
||||||
|
codepoint. This is helpful in debugging or analysis.
|
||||||
|
|
||||||
|
The tool accepts a number of arguments, described using `idna-data -h`.
|
||||||
|
Most notably, the `--version` argument allows the specification
|
||||||
|
of the version of Unicode to be used in computing the table data. For
|
||||||
|
example, `idna-data --version 9.0.0 make-libdata` will generate
|
||||||
|
library data against Unicode 9.0.0.
|
||||||
|
|
||||||
|
|
||||||
|
## Additional Notes
|
||||||
|
|
||||||
|
* **Packages**. The latest tagged release version is published in the
|
||||||
|
[Python Package Index](https://pypi.org/project/idna/).
|
||||||
|
|
||||||
|
* **Version support**. This library supports Python 3.8 and higher.
|
||||||
|
As this library serves as a low-level toolkit for a variety of
|
||||||
|
applications, many of which strive for broad compatibility with older
|
||||||
|
Python versions, there is no rush to remove older interpreter support.
|
||||||
|
Support for older versions is likely to be removed from new releases
|
||||||
|
as automated tests can no longer easily be run, i.e. once the Python
|
||||||
|
version is officially end-of-life.
|
||||||
|
|
||||||
|
* **Testing**. The library has a test suite based on each rule of the
|
||||||
|
IDNA specification, as well as tests that are provided as part of the
|
||||||
|
Unicode Technical Standard 46, [Unicode IDNA Compatibility Processing](https://unicode.org/reports/tr46/).
|
||||||
|
|
||||||
|
* **Emoji**. It is an occasional request to support emoji domains in
|
||||||
|
this library. Encoding of symbols like emoji is expressly prohibited by
|
||||||
|
the IDNA technical standard, and emoji domains are broadly phased
|
||||||
|
out across the domain industry due to associated security risks.
|
||||||
|
|
||||||
|
* **Transitional processing**. Unicode 16.0.0 removed transitional
|
||||||
|
processing so the `transitional` argument for the encode() method
|
||||||
|
no longer has any effect and will be removed at a later date.
|
||||||
|
|
||||||
@ -0,0 +1,22 @@
|
|||||||
|
idna-3.15.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
|
||||||
|
idna-3.15.dist-info/METADATA,sha256=y4rCRb8knJ4jRkJ7xUu_eZcclre_a0aGLkM1-nYAGCc,7699
|
||||||
|
idna-3.15.dist-info/RECORD,,
|
||||||
|
idna-3.15.dist-info/WHEEL,sha256=G2gURzTEtmeR8nrdXUJfNiB3VYVxigPQ-bEQujpNiNs,82
|
||||||
|
idna-3.15.dist-info/licenses/LICENSE.md,sha256=GppPDj1HmickDd1ZqRN6ZqtKD539yMphiMwL_YUYfwQ,1541
|
||||||
|
idna/__init__.py,sha256=MPqNDLZbXqGaNdXxAFhiqFPKEQXju2jNQhCey6-5eJM,868
|
||||||
|
idna/__pycache__/__init__.cpython-312.pyc,,
|
||||||
|
idna/__pycache__/codec.cpython-312.pyc,,
|
||||||
|
idna/__pycache__/compat.cpython-312.pyc,,
|
||||||
|
idna/__pycache__/core.cpython-312.pyc,,
|
||||||
|
idna/__pycache__/idnadata.cpython-312.pyc,,
|
||||||
|
idna/__pycache__/intranges.cpython-312.pyc,,
|
||||||
|
idna/__pycache__/package_data.cpython-312.pyc,,
|
||||||
|
idna/__pycache__/uts46data.cpython-312.pyc,,
|
||||||
|
idna/codec.py,sha256=wtyXiDNWK2PtIs8LAhTHR7FFCNC4EcMwwASksHJqBxs,5047
|
||||||
|
idna/compat.py,sha256=AepA39ceRHxkfHP41-FvKW5Ki-f4PfUZ90RUMlCNdmo,1353
|
||||||
|
idna/core.py,sha256=7rMnUtssWtnkkXrR7lRAXvEZe2xKtLcc49dZDN2ATxU,23119
|
||||||
|
idna/idnadata.py,sha256=EYkhqhOze0Whw1pIMum_-qEHM-8utUwc_IBoIUWW8Go,92613
|
||||||
|
idna/intranges.py,sha256=egpSEXYDJ6BE0Ejo8djRdhsb4JJUtyoAQ_e49GP-7VE,1882
|
||||||
|
idna/package_data.py,sha256=adrXxBOTH4qXst7KXV_z8WJwP5OKbQfG8SOHBfFK42s,21
|
||||||
|
idna/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
|
||||||
|
idna/uts46data.py,sha256=8Vqa3jakE4ZWYYMkt-cTkp7IzpwJEmorYDyAaAZOPI4,202681
|
||||||
@ -0,0 +1,4 @@
|
|||||||
|
Wheel-Version: 1.0
|
||||||
|
Generator: flit 3.12.0
|
||||||
|
Root-Is-Purelib: true
|
||||||
|
Tag: py3-none-any
|
||||||
@ -0,0 +1,31 @@
|
|||||||
|
BSD 3-Clause License
|
||||||
|
|
||||||
|
Copyright (c) 2013-2026, Kim Davies and contributors.
|
||||||
|
All rights reserved.
|
||||||
|
|
||||||
|
Redistribution and use in source and binary forms, with or without
|
||||||
|
modification, are permitted provided that the following conditions are
|
||||||
|
met:
|
||||||
|
|
||||||
|
1. Redistributions of source code must retain the above copyright
|
||||||
|
notice, this list of conditions and the following disclaimer.
|
||||||
|
|
||||||
|
2. Redistributions in binary form must reproduce the above copyright
|
||||||
|
notice, this list of conditions and the following disclaimer in the
|
||||||
|
documentation and/or other materials provided with the distribution.
|
||||||
|
|
||||||
|
3. Neither the name of the copyright holder nor the names of its
|
||||||
|
contributors may be used to endorse or promote products derived from
|
||||||
|
this software without specific prior written permission.
|
||||||
|
|
||||||
|
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
|
||||||
|
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
|
||||||
|
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
|
||||||
|
A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
|
||||||
|
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
|
||||||
|
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
|
||||||
|
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
|
||||||
|
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
|
||||||
|
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
|
||||||
|
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
|
||||||
|
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
|
||||||
@ -0,0 +1,45 @@
|
|||||||
|
from .core import (
|
||||||
|
IDNABidiError,
|
||||||
|
IDNAError,
|
||||||
|
InvalidCodepoint,
|
||||||
|
InvalidCodepointContext,
|
||||||
|
alabel,
|
||||||
|
check_bidi,
|
||||||
|
check_hyphen_ok,
|
||||||
|
check_initial_combiner,
|
||||||
|
check_label,
|
||||||
|
check_nfc,
|
||||||
|
decode,
|
||||||
|
encode,
|
||||||
|
ulabel,
|
||||||
|
uts46_remap,
|
||||||
|
valid_contextj,
|
||||||
|
valid_contexto,
|
||||||
|
valid_label_length,
|
||||||
|
valid_string_length,
|
||||||
|
)
|
||||||
|
from .intranges import intranges_contain
|
||||||
|
from .package_data import __version__
|
||||||
|
|
||||||
|
__all__ = [
|
||||||
|
"__version__",
|
||||||
|
"IDNABidiError",
|
||||||
|
"IDNAError",
|
||||||
|
"InvalidCodepoint",
|
||||||
|
"InvalidCodepointContext",
|
||||||
|
"alabel",
|
||||||
|
"check_bidi",
|
||||||
|
"check_hyphen_ok",
|
||||||
|
"check_initial_combiner",
|
||||||
|
"check_label",
|
||||||
|
"check_nfc",
|
||||||
|
"decode",
|
||||||
|
"encode",
|
||||||
|
"intranges_contain",
|
||||||
|
"ulabel",
|
||||||
|
"uts46_remap",
|
||||||
|
"valid_contextj",
|
||||||
|
"valid_contexto",
|
||||||
|
"valid_label_length",
|
||||||
|
"valid_string_length",
|
||||||
|
]
|
||||||
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
@ -0,0 +1,159 @@
|
|||||||
|
import codecs
|
||||||
|
from typing import Any, Optional, Tuple
|
||||||
|
|
||||||
|
from .core import IDNAError, _unicode_dots_re, alabel, decode, encode, ulabel
|
||||||
|
|
||||||
|
|
||||||
|
class Codec(codecs.Codec):
|
||||||
|
"""Stateless IDNA 2008 codec.
|
||||||
|
|
||||||
|
Implements the :class:`codecs.Codec` protocol so that the whole-domain
|
||||||
|
encoder (:func:`idna.encode`) and decoder (:func:`idna.decode`) are
|
||||||
|
accessible through the standard codec machinery as ``"idna2008"``.
|
||||||
|
|
||||||
|
Only the ``"strict"`` error handler is supported; any other handler
|
||||||
|
raises :exc:`~idna.IDNAError`.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def encode(self, data: str, errors: str = "strict") -> Tuple[bytes, int]: # ty: ignore[invalid-method-override]
|
||||||
|
if errors != "strict":
|
||||||
|
raise IDNAError(f'Unsupported error handling "{errors}"')
|
||||||
|
|
||||||
|
if not data:
|
||||||
|
return b"", 0
|
||||||
|
|
||||||
|
return encode(data), len(data)
|
||||||
|
|
||||||
|
def decode(self, data: bytes, errors: str = "strict") -> Tuple[str, int]: # ty: ignore[invalid-method-override]
|
||||||
|
if errors != "strict":
|
||||||
|
raise IDNAError(f'Unsupported error handling "{errors}"')
|
||||||
|
|
||||||
|
if not data:
|
||||||
|
return "", 0
|
||||||
|
|
||||||
|
return decode(data), len(data)
|
||||||
|
|
||||||
|
|
||||||
|
class IncrementalEncoder(codecs.BufferedIncrementalEncoder):
|
||||||
|
"""Incremental IDNA 2008 encoder.
|
||||||
|
|
||||||
|
Buffers a partial trailing label across calls until either the next
|
||||||
|
label separator is seen or ``final=True``, so that streamed input is
|
||||||
|
encoded one whole label at a time. Any of the four Unicode label
|
||||||
|
separators (``U+002E``, ``U+3002``, ``U+FF0E``, ``U+FF61``) ends a
|
||||||
|
label; the result always uses ``U+002E`` as the separator.
|
||||||
|
|
||||||
|
Only the ``"strict"`` error handler is supported.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def _buffer_encode(self, data: str, errors: str, final: bool) -> Tuple[bytes, int]: # ty: ignore[invalid-method-override]
|
||||||
|
if errors != "strict":
|
||||||
|
raise IDNAError(f'Unsupported error handling "{errors}"')
|
||||||
|
|
||||||
|
if not data:
|
||||||
|
return b"", 0
|
||||||
|
|
||||||
|
labels = _unicode_dots_re.split(data)
|
||||||
|
trailing_dot = b""
|
||||||
|
if labels:
|
||||||
|
if not labels[-1]:
|
||||||
|
trailing_dot = b"."
|
||||||
|
del labels[-1]
|
||||||
|
elif not final:
|
||||||
|
# Keep potentially unfinished label until the next call
|
||||||
|
del labels[-1]
|
||||||
|
if labels:
|
||||||
|
trailing_dot = b"."
|
||||||
|
|
||||||
|
result = []
|
||||||
|
size = 0
|
||||||
|
for label in labels:
|
||||||
|
result.append(alabel(label))
|
||||||
|
if size:
|
||||||
|
size += 1
|
||||||
|
size += len(label)
|
||||||
|
|
||||||
|
# Join with U+002E
|
||||||
|
result_bytes = b".".join(result) + trailing_dot
|
||||||
|
size += len(trailing_dot)
|
||||||
|
return result_bytes, size
|
||||||
|
|
||||||
|
|
||||||
|
class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
|
||||||
|
"""Incremental IDNA 2008 decoder.
|
||||||
|
|
||||||
|
Buffers a partial trailing label across calls until either the next
|
||||||
|
label separator is seen or ``final=True``, so that streamed input is
|
||||||
|
decoded one whole label at a time.
|
||||||
|
|
||||||
|
Only the ``"strict"`` error handler is supported.
|
||||||
|
"""
|
||||||
|
|
||||||
|
def _buffer_decode(self, data: Any, errors: str, final: bool) -> Tuple[str, int]: # ty: ignore[invalid-method-override]
|
||||||
|
if errors != "strict":
|
||||||
|
raise IDNAError(f'Unsupported error handling "{errors}"')
|
||||||
|
|
||||||
|
if not data:
|
||||||
|
return ("", 0)
|
||||||
|
|
||||||
|
if not isinstance(data, str):
|
||||||
|
data = str(data, "ascii")
|
||||||
|
|
||||||
|
labels = _unicode_dots_re.split(data)
|
||||||
|
trailing_dot = ""
|
||||||
|
if labels:
|
||||||
|
if not labels[-1]:
|
||||||
|
trailing_dot = "."
|
||||||
|
del labels[-1]
|
||||||
|
elif not final:
|
||||||
|
# Keep potentially unfinished label until the next call
|
||||||
|
del labels[-1]
|
||||||
|
if labels:
|
||||||
|
trailing_dot = "."
|
||||||
|
|
||||||
|
result = []
|
||||||
|
size = 0
|
||||||
|
for label in labels:
|
||||||
|
result.append(ulabel(label))
|
||||||
|
if size:
|
||||||
|
size += 1
|
||||||
|
size += len(label)
|
||||||
|
|
||||||
|
result_str = ".".join(result) + trailing_dot
|
||||||
|
size += len(trailing_dot)
|
||||||
|
return (result_str, size)
|
||||||
|
|
||||||
|
|
||||||
|
class StreamWriter(Codec, codecs.StreamWriter):
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class StreamReader(Codec, codecs.StreamReader):
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def search_function(name: str) -> Optional[codecs.CodecInfo]:
|
||||||
|
"""Codec search function registered with :mod:`codecs`.
|
||||||
|
|
||||||
|
Returns a :class:`codecs.CodecInfo` for the ``"idna2008"`` codec name
|
||||||
|
so that ``str.encode("idna2008")`` and ``bytes.decode("idna2008")``
|
||||||
|
invoke the IDNA 2008 codec defined in this module.
|
||||||
|
|
||||||
|
:param name: The codec name being looked up.
|
||||||
|
:returns: A :class:`codecs.CodecInfo` instance if ``name`` is
|
||||||
|
``"idna2008"``, otherwise ``None``.
|
||||||
|
"""
|
||||||
|
if name != "idna2008":
|
||||||
|
return None
|
||||||
|
return codecs.CodecInfo(
|
||||||
|
name=name,
|
||||||
|
encode=Codec().encode,
|
||||||
|
decode=Codec().decode, # type: ignore
|
||||||
|
incrementalencoder=IncrementalEncoder,
|
||||||
|
incrementaldecoder=IncrementalDecoder,
|
||||||
|
streamwriter=StreamWriter,
|
||||||
|
streamreader=StreamReader,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
codecs.register(search_function)
|
||||||
@ -0,0 +1,41 @@
|
|||||||
|
from typing import Any, Union
|
||||||
|
|
||||||
|
from .core import decode, encode
|
||||||
|
|
||||||
|
|
||||||
|
def ToASCII(label: str) -> bytes:
|
||||||
|
"""Compatibility shim for :rfc:`3490` ``ToASCII``.
|
||||||
|
|
||||||
|
Delegates to :func:`idna.encode` (IDNA 2008). Provided to ease porting
|
||||||
|
of code written against the legacy :mod:`encodings.idna` API; new code
|
||||||
|
should call :func:`idna.encode` directly.
|
||||||
|
|
||||||
|
:param label: The label or domain to encode.
|
||||||
|
:returns: The encoded form as ASCII :class:`bytes`.
|
||||||
|
"""
|
||||||
|
return encode(label)
|
||||||
|
|
||||||
|
|
||||||
|
def ToUnicode(label: Union[bytes, bytearray]) -> str:
|
||||||
|
"""Compatibility shim for :rfc:`3490` ``ToUnicode``.
|
||||||
|
|
||||||
|
Delegates to :func:`idna.decode` (IDNA 2008). Provided to ease porting
|
||||||
|
of code written against the legacy :mod:`encodings.idna` API; new code
|
||||||
|
should call :func:`idna.decode` directly.
|
||||||
|
|
||||||
|
:param label: The label or domain to decode.
|
||||||
|
:returns: The decoded Unicode form.
|
||||||
|
"""
|
||||||
|
return decode(label)
|
||||||
|
|
||||||
|
|
||||||
|
def nameprep(s: Any) -> None:
|
||||||
|
"""Stub for :rfc:`3491` Nameprep, which is not used by IDNA 2008.
|
||||||
|
|
||||||
|
IDNA 2008 (:rfc:`5891`) replaces Nameprep with the per-codepoint
|
||||||
|
validity classes from :rfc:`5892`; this function exists only to
|
||||||
|
return a clear error if legacy code attempts to call it.
|
||||||
|
|
||||||
|
:raises NotImplementedError: Always.
|
||||||
|
"""
|
||||||
|
raise NotImplementedError("IDNA 2008 does not utilise nameprep protocol")
|
||||||
@ -0,0 +1,620 @@
|
|||||||
|
import bisect
|
||||||
|
import re
|
||||||
|
import unicodedata
|
||||||
|
import warnings
|
||||||
|
from typing import Optional, Union
|
||||||
|
|
||||||
|
from . import idnadata
|
||||||
|
from .intranges import intranges_contain
|
||||||
|
|
||||||
|
_virama_combining_class = 9
|
||||||
|
_alabel_prefix = b"xn--"
|
||||||
|
_unicode_dots_re = re.compile("[\u002e\u3002\uff0e\uff61]")
|
||||||
|
|
||||||
|
|
||||||
|
# Bidi category sets from RFC 5893, hoisted out of the per-codepoint loop
|
||||||
|
_bidi_rtl_first = frozenset({"R", "AL"})
|
||||||
|
_bidi_rtl_categories = frozenset({"R", "AL", "AN"})
|
||||||
|
_bidi_rtl_allowed = frozenset({"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"})
|
||||||
|
_bidi_rtl_valid_ending = frozenset({"R", "AL", "EN", "AN"})
|
||||||
|
_bidi_rtl_numeric = frozenset({"AN", "EN"})
|
||||||
|
_bidi_ltr_allowed = frozenset({"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"})
|
||||||
|
_bidi_ltr_valid_ending = frozenset({"L", "EN"})
|
||||||
|
_bidi_joiner_l_or_d = frozenset({ord("L"), ord("D")})
|
||||||
|
_bidi_joiner_r_or_d = frozenset({ord("R"), ord("D")})
|
||||||
|
|
||||||
|
|
||||||
|
class IDNAError(UnicodeError):
|
||||||
|
"""Base exception for all IDNA-encoding related problems"""
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class IDNABidiError(IDNAError):
|
||||||
|
"""Exception when bidirectional requirements are not satisfied"""
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class InvalidCodepoint(IDNAError):
|
||||||
|
"""Exception when a disallowed or unallocated codepoint is used"""
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class InvalidCodepointContext(IDNAError):
|
||||||
|
"""Exception when the codepoint is not valid in the context it is used"""
|
||||||
|
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
def _combining_class(cp: int) -> int:
|
||||||
|
v = unicodedata.combining(chr(cp))
|
||||||
|
if v == 0 and not unicodedata.name(chr(cp)):
|
||||||
|
raise ValueError("Unknown character in unicodedata")
|
||||||
|
return v
|
||||||
|
|
||||||
|
|
||||||
|
def _is_script(cp: str, script: str) -> bool:
|
||||||
|
return intranges_contain(ord(cp), idnadata.scripts[script])
|
||||||
|
|
||||||
|
|
||||||
|
def _punycode(s: str) -> bytes:
|
||||||
|
return s.encode("punycode")
|
||||||
|
|
||||||
|
|
||||||
|
def _unot(s: int) -> str:
|
||||||
|
return f"U+{s:04X}"
|
||||||
|
|
||||||
|
|
||||||
|
def valid_label_length(label: Union[bytes, str]) -> bool:
|
||||||
|
"""Check that a label does not exceed the maximum permitted length.
|
||||||
|
|
||||||
|
Per :rfc:`1035` (and :rfc:`5891` §4.2.4) a DNS label must not exceed
|
||||||
|
63 octets. The argument may be either a :class:`str` (a U-label, where
|
||||||
|
length is measured in characters) or :class:`bytes` (an A-label, where
|
||||||
|
length is measured in octets).
|
||||||
|
|
||||||
|
:param label: The label to check.
|
||||||
|
:returns: ``True`` if the label is within the length limit, otherwise
|
||||||
|
``False``.
|
||||||
|
"""
|
||||||
|
return len(label) <= 63
|
||||||
|
|
||||||
|
|
||||||
|
def valid_string_length(domain: Union[bytes, str], trailing_dot: bool) -> bool:
|
||||||
|
"""Check that a full domain name does not exceed the maximum length.
|
||||||
|
|
||||||
|
Per :rfc:`1035`, a domain name is limited to 253 octets when no trailing
|
||||||
|
dot is present, or 254 octets when one is included.
|
||||||
|
|
||||||
|
:param domain: The full (possibly multi-label) domain name.
|
||||||
|
:param trailing_dot: ``True`` if ``domain`` includes a trailing ``.``.
|
||||||
|
:returns: ``True`` if the domain is within the length limit, otherwise
|
||||||
|
``False``.
|
||||||
|
"""
|
||||||
|
return len(domain) <= (254 if trailing_dot else 253)
|
||||||
|
|
||||||
|
|
||||||
|
def check_bidi(label: str, check_ltr: bool = False) -> bool:
|
||||||
|
"""Validate the Bidi Rule from :rfc:`5893` for a single label.
|
||||||
|
|
||||||
|
The Bidi Rule constrains how bidirectional characters (Hebrew, Arabic,
|
||||||
|
etc.) may appear within a label. By default the check is only applied
|
||||||
|
when the label contains at least one right-to-left character (Unicode
|
||||||
|
bidirectional categories ``R``, ``AL``, or ``AN``); set ``check_ltr``
|
||||||
|
to ``True`` to apply it to LTR-only labels as well.
|
||||||
|
|
||||||
|
:param label: The label to validate, as a Unicode string.
|
||||||
|
:param check_ltr: If ``True``, apply the rules even when the label
|
||||||
|
contains no RTL characters.
|
||||||
|
:returns: ``True`` if the label satisfies the Bidi Rule.
|
||||||
|
:raises IDNABidiError: If any of Bidi Rule conditions 1-6 are violated,
|
||||||
|
or if the directional category of a codepoint cannot be determined.
|
||||||
|
"""
|
||||||
|
# Bidi rules should only be applied if string contains RTL characters
|
||||||
|
bidi_label = False
|
||||||
|
for idx, cp in enumerate(label, 1):
|
||||||
|
direction = unicodedata.bidirectional(cp)
|
||||||
|
if direction == "":
|
||||||
|
# String likely comes from a newer version of Unicode
|
||||||
|
raise IDNABidiError(f"Unknown directionality in label {repr(label)} at position {idx}")
|
||||||
|
if direction in _bidi_rtl_categories:
|
||||||
|
bidi_label = True
|
||||||
|
if not bidi_label and not check_ltr:
|
||||||
|
return True
|
||||||
|
|
||||||
|
# Bidi rule 1
|
||||||
|
direction = unicodedata.bidirectional(label[0])
|
||||||
|
if direction in _bidi_rtl_first:
|
||||||
|
rtl = True
|
||||||
|
elif direction == "L":
|
||||||
|
rtl = False
|
||||||
|
else:
|
||||||
|
raise IDNABidiError(f"First codepoint in label {repr(label)} must be directionality L, R or AL")
|
||||||
|
|
||||||
|
valid_ending = False
|
||||||
|
number_type: Optional[str] = None
|
||||||
|
for idx, cp in enumerate(label, 1):
|
||||||
|
direction = unicodedata.bidirectional(cp)
|
||||||
|
|
||||||
|
if rtl:
|
||||||
|
# Bidi rule 2
|
||||||
|
if direction not in _bidi_rtl_allowed:
|
||||||
|
raise IDNABidiError(f"Invalid direction for codepoint at position {idx} in a right-to-left label")
|
||||||
|
# Bidi rule 3
|
||||||
|
if direction in _bidi_rtl_valid_ending:
|
||||||
|
valid_ending = True
|
||||||
|
elif direction != "NSM":
|
||||||
|
valid_ending = False
|
||||||
|
# Bidi rule 4
|
||||||
|
if direction in _bidi_rtl_numeric:
|
||||||
|
if not number_type:
|
||||||
|
number_type = direction
|
||||||
|
else:
|
||||||
|
if number_type != direction:
|
||||||
|
raise IDNABidiError("Can not mix numeral types in a right-to-left label")
|
||||||
|
else:
|
||||||
|
# Bidi rule 5
|
||||||
|
if direction not in _bidi_ltr_allowed:
|
||||||
|
raise IDNABidiError(f"Invalid direction for codepoint at position {idx} in a left-to-right label")
|
||||||
|
# Bidi rule 6
|
||||||
|
if direction in _bidi_ltr_valid_ending:
|
||||||
|
valid_ending = True
|
||||||
|
elif direction != "NSM":
|
||||||
|
valid_ending = False
|
||||||
|
|
||||||
|
if not valid_ending:
|
||||||
|
raise IDNABidiError("Label ends with illegal codepoint directionality")
|
||||||
|
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def check_initial_combiner(label: str) -> bool:
|
||||||
|
"""Reject labels that begin with a combining mark.
|
||||||
|
|
||||||
|
Per :rfc:`5891` §4.2.3.2 a label must not start with a character of
|
||||||
|
Unicode general category ``M`` (Mark).
|
||||||
|
|
||||||
|
:param label: The label to check.
|
||||||
|
:returns: ``True`` if the first character is not a combining mark.
|
||||||
|
:raises IDNAError: If the label begins with a combining character.
|
||||||
|
"""
|
||||||
|
if unicodedata.category(label[0])[0] == "M":
|
||||||
|
raise IDNAError("Label begins with an illegal combining character")
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def check_hyphen_ok(label: str) -> bool:
|
||||||
|
"""Validate the hyphen restrictions for a label.
|
||||||
|
|
||||||
|
Per :rfc:`5891` §4.2.3.1 a label must not start or end with a hyphen
|
||||||
|
(``U+002D``), and must not have hyphens in both the third and fourth
|
||||||
|
positions (the prefix reserved for A-labels).
|
||||||
|
|
||||||
|
:param label: The label to check.
|
||||||
|
:returns: ``True`` if the hyphen restrictions are satisfied.
|
||||||
|
:raises IDNAError: If any of the hyphen restrictions are violated.
|
||||||
|
"""
|
||||||
|
if label[2:4] == "--":
|
||||||
|
raise IDNAError("Label has disallowed hyphens in 3rd and 4th position")
|
||||||
|
if label[0] == "-" or label[-1] == "-":
|
||||||
|
raise IDNAError("Label must not start or end with a hyphen")
|
||||||
|
return True
|
||||||
|
|
||||||
|
|
||||||
|
def check_nfc(label: str) -> None:
|
||||||
|
"""Require that a label is in Unicode Normalization Form C.
|
||||||
|
|
||||||
|
:param label: The label to check.
|
||||||
|
:raises IDNAError: If ``label`` differs from its NFC normalisation.
|
||||||
|
"""
|
||||||
|
if unicodedata.normalize("NFC", label) != label:
|
||||||
|
raise IDNAError("Label must be in Normalization Form C")
|
||||||
|
|
||||||
|
|
||||||
|
def valid_contextj(label: str, pos: int) -> bool:
|
||||||
|
"""Validate the CONTEXTJ rules from :rfc:`5892` Appendix A.
|
||||||
|
|
||||||
|
These rules govern the contextual use of the joiner codepoints
|
||||||
|
``U+200C`` (ZERO WIDTH NON-JOINER, Appendix A.1) and ``U+200D``
|
||||||
|
(ZERO WIDTH JOINER, Appendix A.2) within a label.
|
||||||
|
|
||||||
|
:param label: The label containing the codepoint.
|
||||||
|
:param pos: Index of the joiner codepoint within ``label``.
|
||||||
|
:returns: ``True`` if the codepoint at ``pos`` satisfies its CONTEXTJ
|
||||||
|
rule, ``False`` otherwise (including when the codepoint at
|
||||||
|
``pos`` is not a recognised joiner).
|
||||||
|
:raises ValueError: If an adjacent codepoint has no Unicode name when
|
||||||
|
determining its combining class.
|
||||||
|
"""
|
||||||
|
cp_value = ord(label[pos])
|
||||||
|
|
||||||
|
if cp_value == 0x200C:
|
||||||
|
if pos > 0 and _combining_class(ord(label[pos - 1])) == _virama_combining_class:
|
||||||
|
return True
|
||||||
|
|
||||||
|
ok = False
|
||||||
|
for i in range(pos - 1, -1, -1):
|
||||||
|
joining_type = idnadata.joining_types().get(ord(label[i]))
|
||||||
|
if joining_type == ord("T"):
|
||||||
|
continue
|
||||||
|
elif joining_type in _bidi_joiner_l_or_d:
|
||||||
|
ok = True
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
|
||||||
|
if not ok:
|
||||||
|
return False
|
||||||
|
|
||||||
|
ok = False
|
||||||
|
for i in range(pos + 1, len(label)):
|
||||||
|
joining_type = idnadata.joining_types().get(ord(label[i]))
|
||||||
|
if joining_type == ord("T"):
|
||||||
|
continue
|
||||||
|
elif joining_type in _bidi_joiner_r_or_d:
|
||||||
|
ok = True
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
break
|
||||||
|
return ok
|
||||||
|
|
||||||
|
if cp_value == 0x200D:
|
||||||
|
return pos > 0 and _combining_class(ord(label[pos - 1])) == _virama_combining_class
|
||||||
|
|
||||||
|
else:
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def valid_contexto(label: str, pos: int, exception: bool = False) -> bool:
|
||||||
|
"""Validate the CONTEXTO rules from :rfc:`5892` Appendix A.
|
||||||
|
|
||||||
|
Covers the contextual rules for codepoints such as MIDDLE DOT
|
||||||
|
(``U+00B7``), Greek lower numeral sign, Hebrew punctuation, Katakana
|
||||||
|
middle dot, and the Arabic-Indic / Extended Arabic-Indic digit ranges.
|
||||||
|
|
||||||
|
:param label: The label containing the codepoint.
|
||||||
|
:param pos: Index of the codepoint within ``label``.
|
||||||
|
:param exception: Reserved for forward compatibility; currently unused.
|
||||||
|
:returns: ``True`` if the codepoint at ``pos`` satisfies its CONTEXTO
|
||||||
|
rule, ``False`` otherwise (including when the codepoint is not a
|
||||||
|
recognised CONTEXTO codepoint).
|
||||||
|
"""
|
||||||
|
cp_value = ord(label[pos])
|
||||||
|
|
||||||
|
if cp_value == 0x00B7:
|
||||||
|
return 0 < pos < len(label) - 1 and ord(label[pos - 1]) == 0x006C and ord(label[pos + 1]) == 0x006C
|
||||||
|
|
||||||
|
elif cp_value == 0x0375:
|
||||||
|
if pos < len(label) - 1 and len(label) > 1:
|
||||||
|
return _is_script(label[pos + 1], "Greek")
|
||||||
|
return False
|
||||||
|
|
||||||
|
elif cp_value == 0x05F3 or cp_value == 0x05F4:
|
||||||
|
if pos > 0:
|
||||||
|
return _is_script(label[pos - 1], "Hebrew")
|
||||||
|
return False
|
||||||
|
|
||||||
|
elif cp_value == 0x30FB:
|
||||||
|
for cp in label:
|
||||||
|
if cp == "\u30fb":
|
||||||
|
continue
|
||||||
|
if _is_script(cp, "Hiragana") or _is_script(cp, "Katakana") or _is_script(cp, "Han"):
|
||||||
|
return True
|
||||||
|
return False
|
||||||
|
|
||||||
|
elif 0x660 <= cp_value <= 0x669:
|
||||||
|
return not any(0x6F0 <= ord(cp) <= 0x06F9 for cp in label)
|
||||||
|
|
||||||
|
elif 0x6F0 <= cp_value <= 0x6F9:
|
||||||
|
return not any(0x660 <= ord(cp) <= 0x0669 for cp in label)
|
||||||
|
|
||||||
|
return False
|
||||||
|
|
||||||
|
|
||||||
|
def check_label(label: Union[str, bytes, bytearray]) -> None:
|
||||||
|
"""Run the full set of IDNA 2008 validity checks on a single label.
|
||||||
|
|
||||||
|
Applies, in order: NFC normalisation (:func:`check_nfc`), hyphen
|
||||||
|
restrictions (:func:`check_hyphen_ok`), the no-leading-combiner rule
|
||||||
|
(:func:`check_initial_combiner`), per-codepoint validity (PVALID,
|
||||||
|
CONTEXTJ, CONTEXTO classes from :rfc:`5892`), and the Bidi Rule
|
||||||
|
(:func:`check_bidi`).
|
||||||
|
|
||||||
|
:param label: The label to validate. ``bytes`` or ``bytearray`` input
|
||||||
|
is decoded as UTF-8 first.
|
||||||
|
:raises IDNAError: If the label is empty, too long, or fails a structural rule.
|
||||||
|
:raises InvalidCodepoint: If the label contains a DISALLOWED or
|
||||||
|
UNASSIGNED codepoint.
|
||||||
|
:raises InvalidCodepointContext: If a CONTEXTJ or CONTEXTO codepoint
|
||||||
|
is not valid in its context.
|
||||||
|
:raises IDNABidiError: If the Bidi Rule is violated.
|
||||||
|
"""
|
||||||
|
if isinstance(label, (bytes, bytearray)):
|
||||||
|
label = label.decode("utf-8")
|
||||||
|
if len(label) == 0:
|
||||||
|
raise IDNAError("Empty Label")
|
||||||
|
|
||||||
|
# Reject on domain length rather than label length to support some UTS 46
# use cases, while still limiting the processing of label contextual rules
|
||||||
|
if not valid_string_length(label, trailing_dot=True):
|
||||||
|
raise IDNAError("Label too long")
|
||||||
|
|
||||||
|
check_nfc(label)
|
||||||
|
check_hyphen_ok(label)
|
||||||
|
check_initial_combiner(label)
|
||||||
|
|
||||||
|
for pos, cp in enumerate(label):
|
||||||
|
cp_value = ord(cp)
|
||||||
|
if intranges_contain(cp_value, idnadata.codepoint_classes["PVALID"]):
|
||||||
|
continue
|
||||||
|
elif intranges_contain(cp_value, idnadata.codepoint_classes["CONTEXTJ"]):
|
||||||
|
try:
|
||||||
|
if not valid_contextj(label, pos):
|
||||||
|
raise InvalidCodepointContext(
|
||||||
|
f"Joiner {_unot(cp_value)} not allowed at position {pos + 1} in {repr(label)}"
|
||||||
|
)
|
||||||
|
except ValueError as err:
|
||||||
|
raise IDNAError(
|
||||||
|
f"Unknown codepoint adjacent to joiner {_unot(cp_value)} at position {pos + 1} in {repr(label)}"
|
||||||
|
) from err
|
||||||
|
elif intranges_contain(cp_value, idnadata.codepoint_classes["CONTEXTO"]):
|
||||||
|
if not valid_contexto(label, pos):
|
||||||
|
raise InvalidCodepointContext(
|
||||||
|
f"Codepoint {_unot(cp_value)} not allowed at position {pos + 1} in {repr(label)}"
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
raise InvalidCodepoint(f"Codepoint {_unot(cp_value)} at position {pos + 1} of {repr(label)} not allowed")
|
||||||
|
|
||||||
|
check_bidi(label)
|
||||||
|
|
||||||
|
|
||||||
|
def alabel(label: str) -> bytes:
|
||||||
|
"""Convert a single U-label into its A-label form.
|
||||||
|
|
||||||
|
The result is the ASCII-Compatible Encoding (ACE) form per :rfc:`5891`
|
||||||
|
§4: the label is validated, Punycode-encoded, and prefixed with
|
||||||
|
``xn--``. Pure ASCII labels that are already valid IDNA labels are
|
||||||
|
returned unchanged (as :class:`bytes`).
|
||||||
|
|
||||||
|
:param label: The label to convert, as a Unicode string.
|
||||||
|
:returns: The A-label as ASCII-encoded :class:`bytes`.
|
||||||
|
:raises IDNAError: If the label is invalid or the resulting A-label
|
||||||
|
exceeds 63 octets.
|
||||||
|
"""
|
||||||
|
try:
|
||||||
|
label_bytes = label.encode("ascii")
|
||||||
|
ulabel(label_bytes)
|
||||||
|
if not valid_label_length(label_bytes):
|
||||||
|
raise IDNAError("Label too long")
|
||||||
|
return label_bytes
|
||||||
|
except UnicodeEncodeError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
check_label(label)
|
||||||
|
label_bytes = _alabel_prefix + _punycode(label)
|
||||||
|
|
||||||
|
if not valid_label_length(label_bytes):
|
||||||
|
raise IDNAError("Label too long")
|
||||||
|
|
||||||
|
return label_bytes
|
||||||
|
|
||||||
|
|
||||||
|
def ulabel(label: Union[str, bytes, bytearray]) -> str:
|
||||||
|
"""Convert a single A-label into its U-label form.
|
||||||
|
|
||||||
|
Performs the inverse of :func:`alabel`: an ``xn--``-prefixed label is
|
||||||
|
Punycode-decoded and validated. Labels that are already Unicode (or
|
||||||
|
plain ASCII without the ACE prefix) are validated and returned as a
|
||||||
|
Unicode string.
|
||||||
|
|
||||||
|
:param label: The label to convert. ``bytes`` or ``bytearray`` input
|
||||||
|
is treated as ASCII.
|
||||||
|
:returns: The U-label as a Unicode string.
|
||||||
|
:raises IDNAError: If the label is malformed or fails validation.
|
||||||
|
"""
|
||||||
|
if not isinstance(label, (bytes, bytearray)):
|
||||||
|
try:
|
||||||
|
label_bytes = label.encode("ascii")
|
||||||
|
except UnicodeEncodeError:
|
||||||
|
check_label(label)
|
||||||
|
return label
|
||||||
|
else:
|
||||||
|
label_bytes = bytes(label)
|
||||||
|
|
||||||
|
label_bytes = label_bytes.lower()
|
||||||
|
if label_bytes.startswith(_alabel_prefix):
|
||||||
|
label_bytes = label_bytes[len(_alabel_prefix) :]
|
||||||
|
if not label_bytes:
|
||||||
|
raise IDNAError("Malformed A-label, no Punycode eligible content found")
|
||||||
|
if label_bytes.endswith(b"-"):
|
||||||
|
raise IDNAError("A-label must not end with a hyphen")
|
||||||
|
else:
|
||||||
|
check_label(label_bytes)
|
||||||
|
return label_bytes.decode("ascii")
|
||||||
|
|
||||||
|
try:
|
||||||
|
label = label_bytes.decode("punycode")
|
||||||
|
except UnicodeError as err:
|
||||||
|
raise IDNAError("Invalid A-label") from err
|
||||||
|
check_label(label)
|
||||||
|
return label
|
||||||
|
|
||||||
|
|
||||||
|
def uts46_remap(domain: str, std3_rules: bool = True, transitional: bool = False) -> str:
|
||||||
|
"""Apply the UTS #46 character mapping to a domain string.
|
||||||
|
|
||||||
|
Implements the mapping table from `UTS #46 §4
|
||||||
|
<https://www.unicode.org/reports/tr46/>`_: each character is kept,
|
||||||
|
replaced, dropped, or rejected based on its status (``V``, ``M``, ``D``, ``3``,
|
||||||
|
``I``). The result is returned in Normalisation Form C.
|
||||||
|
|
||||||
|
:param domain: The full domain name to remap.
|
||||||
|
:param std3_rules: If ``True``, apply the stricter STD3 ASCII rules
|
||||||
|
(status ``3`` codepoints raise instead of being kept or mapped).
|
||||||
|
:param transitional: If ``True``, use transitional processing (status
|
||||||
|
``D`` codepoints are mapped instead of kept). Transitional
|
||||||
|
processing has been removed from UTS #46 and this option is
|
||||||
|
retained only for backwards compatibility.
|
||||||
|
:returns: The remapped domain, in Normalisation Form C.
|
||||||
|
:raises InvalidCodepoint: If the domain contains a disallowed
|
||||||
|
codepoint under the chosen rules.
|
||||||
|
"""
|
||||||
|
from .uts46data import uts46data
|
||||||
|
|
||||||
|
output = ""
|
||||||
|
|
||||||
|
for pos, char in enumerate(domain):
|
||||||
|
code_point = ord(char)
|
||||||
|
uts46row = uts46data[code_point if code_point < 256 else bisect.bisect_left(uts46data, (code_point, "Z")) - 1]
|
||||||
|
status = uts46row[1]
|
||||||
|
replacement: Optional[str] = None
|
||||||
|
if len(uts46row) == 3:
|
||||||
|
replacement = uts46row[2] # ty: ignore[index-out-of-bounds]
|
||||||
|
|
||||||
|
# UTS #46 §4: V is always valid, D is deviation (kept unless transitional),
|
||||||
|
# 3 is disallowed-STD3 (kept unmapped if std3_rules is off and no mapping).
|
||||||
|
keep_as_is = (
|
||||||
|
status == "V" or (status == "D" and not transitional) or (status == "3" and not std3_rules and replacement is None)
|
||||||
|
)
|
||||||
|
# M is mapped, 3-with-replacement and transitional D fall through to the
|
||||||
|
# same replacement output path.
|
||||||
|
use_replacement = replacement is not None and (
|
||||||
|
status == "M" or (status == "3" and not std3_rules) or (status == "D" and transitional)
|
||||||
|
)
|
||||||
|
|
||||||
|
if keep_as_is:
|
||||||
|
output += char
|
||||||
|
elif use_replacement:
|
||||||
|
assert replacement is not None # narrowed by use_replacement
|
||||||
|
output += replacement
|
||||||
|
elif status == "I":
|
||||||
|
continue
|
||||||
|
else:
|
||||||
|
raise InvalidCodepoint(f"Codepoint {_unot(code_point)} not allowed at position {pos + 1} in {repr(domain)}")
|
||||||
|
|
||||||
|
return unicodedata.normalize("NFC", output)
|
||||||
|
|
||||||
|
|
||||||
|
def encode(
|
||||||
|
s: Union[str, bytes, bytearray],
|
||||||
|
strict: bool = False,
|
||||||
|
uts46: bool = False,
|
||||||
|
std3_rules: bool = False,
|
||||||
|
transitional: bool = False,
|
||||||
|
) -> bytes:
|
||||||
|
"""Encode a Unicode domain name into its ASCII (A-label) form.
|
||||||
|
|
||||||
|
Splits the input on label separators (only ``U+002E`` if ``strict`` is
|
||||||
|
set; otherwise also IDEOGRAPHIC FULL STOP ``U+3002``, FULLWIDTH FULL
|
||||||
|
STOP ``U+FF0E``, and HALFWIDTH IDEOGRAPHIC FULL STOP ``U+FF61``),
|
||||||
|
encodes each label with :func:`alabel`, and rejoins them with ``.``.
|
||||||
|
Optionally pre-processes the input through :func:`uts46_remap`.
|
||||||
|
|
||||||
|
:param s: The domain name to encode.
|
||||||
|
:param strict: If ``True``, only ``U+002E`` is recognised as a label
|
||||||
|
separator.
|
||||||
|
:param uts46: If ``True``, apply UTS #46 mapping before encoding.
|
||||||
|
:param std3_rules: Forwarded to :func:`uts46_remap` when ``uts46`` is
|
||||||
|
``True``.
|
||||||
|
:param transitional: Forwarded to :func:`uts46_remap` when ``uts46``
|
||||||
|
is ``True``. Deprecated: emits a :class:`DeprecationWarning` and
|
||||||
|
will be removed in a future version.
|
||||||
|
:returns: The encoded domain as ASCII :class:`bytes`.
|
||||||
|
:raises IDNAError: If the domain is empty, contains an invalid label,
|
||||||
|
or exceeds the maximum domain length.
|
||||||
|
"""
|
||||||
|
if transitional:
|
||||||
|
warnings.warn(
|
||||||
|
"Transitional processing has been removed from UTS #46. "
|
||||||
|
"The transitional argument will be removed in a future version.",
|
||||||
|
DeprecationWarning,
|
||||||
|
stacklevel=2,
|
||||||
|
)
|
||||||
|
if not isinstance(s, str):
|
||||||
|
try:
|
||||||
|
s = str(s, "ascii")
|
||||||
|
except (UnicodeDecodeError, TypeError) as err:
|
||||||
|
raise IDNAError("should pass a unicode string to the function rather than a byte string.") from err
|
||||||
|
if uts46:
|
||||||
|
s = uts46_remap(s, std3_rules, transitional)
|
||||||
|
|
||||||
|
# Reject inputs that exceed the maximum DNS domain length up-front
|
||||||
|
# to avoid expensive computation on long inputs.
|
||||||
|
if not valid_string_length(s, trailing_dot=True):
|
||||||
|
raise IDNAError("Domain too long")
|
||||||
|
|
||||||
|
trailing_dot = False
|
||||||
|
result = []
|
||||||
|
labels = s.split(".") if strict else _unicode_dots_re.split(s)
|
||||||
|
if not labels or labels == [""]:
|
||||||
|
raise IDNAError("Empty domain")
|
||||||
|
if labels[-1] == "":
|
||||||
|
del labels[-1]
|
||||||
|
trailing_dot = True
|
||||||
|
for label in labels:
|
||||||
|
s = alabel(label)
|
||||||
|
if s:
|
||||||
|
result.append(s)
|
||||||
|
else:
|
||||||
|
raise IDNAError("Empty label")
|
||||||
|
if trailing_dot:
|
||||||
|
result.append(b"")
|
||||||
|
s = b".".join(result)
|
||||||
|
if not valid_string_length(s, trailing_dot):
|
||||||
|
raise IDNAError("Domain too long")
|
||||||
|
return s
|
||||||
|
|
||||||
|
|
||||||
|
def decode(
|
||||||
|
s: Union[str, bytes, bytearray],
|
||||||
|
strict: bool = False,
|
||||||
|
uts46: bool = False,
|
||||||
|
std3_rules: bool = False,
|
||||||
|
) -> str:
|
||||||
|
"""Decode an A-label-encoded domain name back to Unicode.
|
||||||
|
|
||||||
|
Splits the input on label separators (see :func:`encode` for the
|
||||||
|
rules), decodes each label with :func:`ulabel`, and rejoins them
|
||||||
|
with ``.``. Optionally pre-processes the input through
|
||||||
|
:func:`uts46_remap`.
|
||||||
|
|
||||||
|
:param s: The domain name to decode.
|
||||||
|
:param strict: If ``True``, only ``U+002E`` is recognised as a label
|
||||||
|
separator.
|
||||||
|
:param uts46: If ``True``, apply UTS #46 mapping before decoding.
|
||||||
|
:param std3_rules: Forwarded to :func:`uts46_remap` when ``uts46`` is
|
||||||
|
``True``.
|
||||||
|
:returns: The decoded domain as a Unicode string.
|
||||||
|
:raises IDNAError: If byte input is not valid ASCII, or if the domain is
empty, exceeds the maximum domain length, or contains an invalid label.
|
||||||
|
"""
|
||||||
|
if not isinstance(s, str):
|
||||||
|
try:
|
||||||
|
s = str(s, "ascii")
|
||||||
|
except (UnicodeDecodeError, TypeError) as err:
|
||||||
|
raise IDNAError("Invalid ASCII in A-label") from err
|
||||||
|
if uts46:
|
||||||
|
s = uts46_remap(s, std3_rules, False)
|
||||||
|
# Reject inputs that exceed the maximum DNS domain length up-front
|
||||||
|
# to avoid expensive computation on long inputs.
|
||||||
|
if not valid_string_length(s, trailing_dot=True):
|
||||||
|
raise IDNAError("Domain too long")
|
||||||
|
trailing_dot = False
|
||||||
|
result = []
|
||||||
|
labels = s.split(".") if strict else _unicode_dots_re.split(s)
|
||||||
|
if not labels or labels == [""]:
|
||||||
|
raise IDNAError("Empty domain")
|
||||||
|
if not labels[-1]:
|
||||||
|
del labels[-1]
|
||||||
|
trailing_dot = True
|
||||||
|
for label in labels:
|
||||||
|
s = ulabel(label)
|
||||||
|
if s:
|
||||||
|
result.append(s)
|
||||||
|
else:
|
||||||
|
raise IDNAError("Empty label")
|
||||||
|
if trailing_dot:
|
||||||
|
result.append("")
|
||||||
|
return ".".join(result)
|
||||||
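The structural part of `alabel` can be sketched with the standard library alone. The function below is a simplified stand-in, not the code from this commit: the name `sketch_alabel` is hypothetical, it raises `ValueError` instead of `IDNAError`, and it deliberately omits the per-codepoint (PVALID / CONTEXTJ / CONTEXTO) and Bidi checks. Unlike the real function, it also rejects input that is already an `xn--` A-label, because it applies the hyphen rule before anything else.

```python
import unicodedata

ACE_PREFIX = b"xn--"  # same prefix value as _alabel_prefix in the diff


def sketch_alabel(label: str) -> bytes:
    """Simplified A-label conversion: structural checks plus Punycode only."""
    if not label:
        raise ValueError("Empty label")
    # NFC requirement (compare check_nfc)
    if unicodedata.normalize("NFC", label) != label:
        raise ValueError("Label must be in Normalization Form C")
    # Hyphen restrictions (compare check_hyphen_ok)
    if label[2:4] == "--":
        raise ValueError("Label has disallowed hyphens in 3rd and 4th position")
    if label[0] == "-" or label[-1] == "-":
        raise ValueError("Label must not start or end with a hyphen")
    # No leading combining mark (compare check_initial_combiner)
    if unicodedata.category(label[0])[0] == "M":
        raise ValueError("Label begins with an illegal combining character")
    try:
        # Pure ASCII labels pass through unchanged
        encoded = label.encode("ascii")
    except UnicodeEncodeError:
        # Otherwise Punycode-encode and add the ACE prefix
        encoded = ACE_PREFIX + label.encode("punycode")
    # DNS label limit of 63 octets (compare valid_label_length)
    if len(encoded) > 63:
        raise ValueError("Label too long")
    return encoded


if __name__ == "__main__":
    print(sketch_alabel("b\u00fccher"))
    print(sketch_alabel("example"))
```

The length check runs on the encoded bytes rather than the input string, since Punycode output can be longer than the Unicode label it represents.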
Some files were not shown because too many files have changed in this diff.