Tokenization is a classic problem in many fields. The latest interesting case is that the token list of OpenAI's GPT-4o was not well cleaned: the vocabulary contains long Chinese spam phrases, so the model can produce nonsensical answers containing NSFW or gambling-related content.
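You can see the problem directly by scanning the vocabulary. Below is a minimal sketch (assuming the `tiktoken` package is installed) that walks the `o200k_base` encoding used by GPT-4o and prints tokens that decode to long runs of Chinese characters, which is where the unfiltered spam phrases tend to show up:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the GPT-4o vocabulary

def long_cjk_tokens(min_chars: int = 6):
    """Yield (id, text) for tokens that decode to long runs of CJK
    characters; the poorly filtered phrases are usually among them."""
    for token_id in range(enc.n_vocab):
        try:
            text = enc.decode_single_token_bytes(token_id).decode("utf-8")
        except (KeyError, UnicodeDecodeError):
            continue  # skip special tokens and partial UTF-8 byte sequences
        stripped = text.strip()
        if len(stripped) >= min_chars and all(
            "\u4e00" <= ch <= "\u9fff" for ch in stripped
        ):
            yield token_id, stripped

for token_id, text in long_cjk_tokens():
    print(token_id, text)
```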
It is also interesting to consider this issue from the perspectives of blind users and CFL (Chinese as a Foreign Language) learners. I will gradually add my explorations and findings here.
Tested with the iOS Chinese keyboard: VoiceOver only handles some polyphonic characters correctly (that is, the same character with multiple pronunciations).
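As a rough illustration of what VoiceOver is up against, here is a small sketch using the third-party `pypinyin` library (my choice for illustration, not something VoiceOver itself uses) to list the multiple readings of a few polyphonic characters:

```python
# Sketch only: pypinyin (pip install pypinyin) just makes the
# ambiguity visible; VoiceOver uses its own pronunciation data.
from pypinyin import pinyin

for char in "行乐中":
    readings = pinyin(char, heteronym=True)[0]  # all known readings
    print(char, readings)
# 行 -> e.g. ['xíng', 'háng', ...]; a screen reader must pick one
# reading from context, which is where VoiceOver only partly succeeds.
```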
How could this be implemented? And why would it be easy to integrate with an IME (input method editor)?
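I don't have a full answer yet, but the core data structure is simple enough to sketch. Below is a toy, entirely hypothetical pinyin-to-candidate table; a real IME ships a large dictionary and ranks candidates with a language model, but this syllable-keyed reverse index is what makes pronunciation data natural to bolt onto an IME:

```python
from typing import Dict, List

# Toy, hand-written candidate table (hypothetical data); a real IME
# uses a large dictionary plus contextual ranking.
CANDIDATES: Dict[str, List[str]] = {
    "xing": ["行", "星", "形"],
    "hang": ["行", "航", "杭"],
    "le":   ["了", "乐"],
    "yue":  ["月", "乐", "越"],
}

def lookup(syllable: str) -> List[str]:
    """Return candidate characters for one toneless pinyin syllable."""
    return CANDIDATES.get(syllable, [])

# 行 appears under both "xing" and "hang": the IME's index already
# encodes exactly the polyphonic readings a screen reader needs.
print(lookup("xing"))  # ['行', '星', '形']
print(lookup("hang"))  # ['行', '航', '杭']
```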