Evaluating the Effectiveness of Text Tokenization Methods
Abstract
The paper considers text tokenization methods that rely on a token dictionary as well as methods that work without one. Algorithms for tokenization and for building token dictionaries are presented. The effectiveness of the tokenization methods is evaluated on the part-of-speech tagging (POST), document classification (DC), and punctuation restoration (PR) tasks. The POST and PR tasks are reduced to an n-gram classification problem (with n = 3 and n = 5, respectively). The classifier is a neural network in the form of a multilayer perceptron that takes as input vector representations of documents in the DC task, or vector representations of n-grams in the POST and PR tasks. The vector representations are produced with the word2vec model. The experiments performed make it possible to compare the effectiveness of seven tokenization methods. Three of them, forming group G1, map text elements to tokens selected from pre-built dictionaries; the dictionaries are constructed with the BPE, BPE-dropout, and Kudo algorithms. The other four methods, forming group G2, tokenize without accessing a token dictionary: they produce tokens in the form of the original word forms, the initial (dictionary) forms of word forms, word-form stems, and word-form stems together with affixes. The token dictionaries and the vector representations of tokens are produced by Python procedures from a text corpus that combines the corpora of the DC and PR tasks. All corpora used contain Russian-language texts. On the tasks considered, the tokenization methods demonstrate different effectiveness: replacing the best G1 method with the best G2 method increases F1 by 4.04, 2.29, and 3.34% in the POST, DC, and PR tasks, respectively.
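For illustration, a minimal sketch of one G1 variant of the pipeline described above is given below: it builds a BPE token dictionary with the SentencePiece Python wrapper, trains word2vec vectors for the resulting tokens with gensim, and classifies n-gram vectors (n = 3, as in the POST task) with a Keras multilayer perceptron. The file names, hyperparameters, and the way an n-gram vector is formed (concatenation of token vectors) are illustrative assumptions, not the exact settings used in the experiments.

# Hypothetical sketch of the G1 pipeline: BPE dictionary -> word2vec -> MLP.
# Assumes a plain-text corpus "corpus.txt" (one sentence per line); all
# hyperparameters below are placeholders, not the paper's settings.
import numpy as np
import sentencepiece as spm
from gensim.models import Word2Vec
from tensorflow import keras

# (1) Build a BPE token dictionary with the SentencePiece Python wrapper.
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe",
    vocab_size=8000, model_type="bpe")
sp = spm.SentencePieceProcessor(model_file="bpe.model")

# Tokenize the corpus: each line becomes a list of BPE tokens.
with open("corpus.txt", encoding="utf-8") as f:
    tokenized = [sp.encode(line.strip(), out_type=str) for line in f]

# (2) Train skip-gram word2vec vectors for the tokens.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5,
               min_count=1, sg=1, epochs=10)

def ngram_vector(tokens):
    # Represent an n-gram as the concatenation of its token vectors
    # (one plausible choice; the paper may form n-gram vectors differently).
    return np.concatenate([w2v.wv[t] for t in tokens])

# (3) A multilayer perceptron that classifies n-gram vectors,
# e.g., into part-of-speech tags; num_classes is task-dependent.
n, num_classes = 3, 12
mlp = keras.Sequential([
    keras.layers.Input(shape=(n * 100,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(num_classes, activation="softmax"),
])
mlp.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
# mlp.fit(X_train, y_train, epochs=20, validation_split=0.1)

For a G2 method, the SentencePiece step would be replaced by, for example, lemmatization with pymorphy2 or stemming with the NLTK Snowball stemmer, with the rest of the pipeline left unchanged; this substitution is given here only as an assumption about how such a variant could be assembled.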
---
For citation: Bartenyev O.V. Evaluating the Effectiveness of Text Tokenization Methods. Bulletin of MPEI. 2023;6:144—156. (in Russian). DOI: 10.24160/1993-6982-2023-6-144-156

