Rethinking Tokenization from the Perspectives of the Blind and CFL (Chinese as a Foreign Language) Learners

Tokenization is a classic problem in many fields. A recent and interesting example is that the token list of OpenAI's GPT-4o was not cleaned well. As a result, it contains NSFW and gambling-related content, and the model gives many answers that do not make sense.

It is also interesting to consider tokenization from the perspectives of blind users and CFL (Chinese as a Foreign Language) learners. I will add my explorations and findings here over time.

In China, how do blind people use computers and phones for typing or other online services? - eureka's answer - Zhihu

Homophones in multimodal AI

In my tests with the iOS Chinese keyboard, VoiceOver supported only some polyphonic characters (a single character with more than one pronunciation).
How could this be implemented? Why is it easy to integrate with an IME (input method editor)?
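
One possible answer to the integration question, as a rough sketch: an IME already knows the pinyin the user typed, so it can pick the matching reading of a polyphonic character and speak it together with an example word. The `READINGS` table and the helpers `strip_tones` and `announce` are hypothetical names with toy data. They are an assumption about how such a feature might work, not how VoiceOver or any real IME is implemented.

```python
import unicodedata

# Hypothetical toy data: each character maps to its readings,
# and each reading carries one example word that uses it.
READINGS = {
    "行": [("xíng", "行走"), ("háng", "银行")],
    "乐": [("lè", "快乐"), ("yuè", "音乐")],
}


def strip_tones(pinyin: str) -> str:
    """Drop tone marks so 'háng' can be compared with the typed 'hang'."""
    decomposed = unicodedata.normalize("NFD", pinyin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def announce(char: str, typed_pinyin: str) -> str:
    """Build the text a screen reader could speak for a candidate character."""
    for pinyin, example in READINGS.get(char, []):
        if strip_tones(pinyin) == typed_pinyin.lower():
            return f"{char} ({pinyin}, as in {example})"
    # Unknown character or no matching reading: fall back to the bare character.
    return char


print(announce("行", "hang"))
print(announce("行", "xing"))
```

The design point is that the disambiguating information (the typed pinyin) exists only inside the IME, which is why a screen reader working from the finished text alone has a harder job.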

talkback/talkback/src/main/res/raw/phonetic_letters.json at master · google/talkback · GitHub

Search for "phonetic" in the TalkBack repo.
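
To make the idea of a phonetic-letter table concrete, here is a minimal lookup sketch. It assumes a JSON layout of locale, then letter, then spoken word. That layout and the sample entries are my assumption for illustration and should be checked against the actual `phonetic_letters.json` file. `phonetic_for` is a hypothetical helper, not a TalkBack API.

```python
import json
from typing import Optional

# Assumed layout (locale -> letter -> spoken word); sample entries only.
SAMPLE_JSON = """
{
  "en_US": {"a": "Alpha", "b": "Bravo", "c": "Charlie"}
}
"""

PHONETIC_LETTERS = json.loads(SAMPLE_JSON)


def phonetic_for(letter: str, locale: str) -> Optional[str]:
    """Return the spoken word for a letter in a locale, or None if unknown."""
    table = PHONETIC_LETTERS.get(locale, {})
    return table.get(letter.lower())


print(phonetic_for("B", "en_US"))
print(phonetic_for("a", "xx_XX"))
```

A table like this plays the same role for letters that example words play for Chinese homophones: it replaces an ambiguous sound with an unambiguous spoken description.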

Candidate Display Styles in Japanese Input