Latest State-of-the-art Models for Text Representation/Feature Extraction and Text Classification

INDRAJITH EKANAYAKE
Published in CloudZone · Sep 20, 2021


Traditionally, when dealing with a text corpus, we use count-based approaches for feature extraction, such as bag of words, bag of n-grams, and TF-IDF (term frequency-inverse document frequency). All of these approaches are effective, but they deal only with individual terms, so we lose important information such as context (the relationships among words), structure, and semantics. This can lead to poor models and even overfitting.
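To make that concrete, here is a minimal sketch of these count-based extractors using scikit-learn; the toy corpus and settings are my own illustration, not taken from any of the cited work:

```python
# Count-based feature extraction: bag of words, bag of n-grams, TF-IDF.
# Assumes scikit-learn is installed; the corpus is a toy example.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "a great great film",
]

bow = CountVectorizer().fit_transform(corpus)                       # raw term counts
ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(corpus)  # unigrams + bigrams
tfidf = TfidfVectorizer().fit_transform(corpus)                     # counts reweighted by rarity

print(bow.shape, ngrams.shape, tfidf.shape)  # sparse document-term matrices
```

Each row is a document and each column a term, so beyond short n-grams these matrices capture nothing about word order or meaning, which is exactly the limitation described above.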

Image by Dipanjan Sarkar on Towards Data Science

As shown above, audio and image recognition systems receive their information as rich, dense feature vectors. Raw text data is the opposite: we cannot work with individual words alone (even though words carry meaning on their own); we need to consider the context, not just the individual meanings (Sarkar, 2018). That is why we need a vector representation of words, also known as word embeddings. Marco Baroni, a researcher at Facebook, distinguishes two families of contextual word vectors: count-based methods and predictive methods (Baroni et al., 2014). The deep learning models discussed in this study report fall under predictive methods, where neural network-based language models are used to predict words.

Fine-tuned Word2Vec for short sentence categorization

Word2Vec is not a new model, but it is worth mentioning because most state-of-the-art models use improved versions of it. Google created it in 2013, and people still use this unsupervised model to deal with massive text corpora (Mikolov et al., 2013). The authors introduced two different model architectures leveraged by Word2Vec, both illustrated in the short sketch after the figure below:

  • CBOW model: predicts the target word from its surrounding context words.
  • Skip-gram model: predicts the surrounding context words from a given target word.
CBOW and Skip-gram model architectures (Source: https://arxiv.org/pdf/1301.3781.pdf)
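A minimal sketch of the two architectures using gensim's Word2Vec implementation; the toy corpus and hyperparameters are purely illustrative:

```python
# CBOW vs. Skip-gram with gensim (assumes gensim >= 4 is installed).
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "chased", "the", "cat"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=0 trains CBOW (predict the target word from its context);
# sg=1 trains Skip-gram (predict the context words from the target word).
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["cat"].shape)             # (50,) dense embedding for "cat"
print(skipgram.wv.most_similar("cat"))  # nearest neighbours in the vector space
```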

In the research mentioned above, the authors used LDC corpora as training data (Mikolov et al., 2011) and later evaluated the models against publicly available word vectors to measure their accuracy.

Source: https://arxiv.org/pdf/1301.3781.pdf

In 2020, a team of researchers introduced a fine-tuned Word2Vec model for short sentence categorization (Sharma et al., 2020). They generate word vectors with Word2Vec and feed them into a CNN layer for better feature extraction; the CNN extracts substructures from the sentences, which are then used for prediction. The proposed architecture was applied to a dataset of 18,765 short sentences, and the accuracy was promising.

Experiment results (Source: https://www.sciencedirect.com/science/article/pii/S1877050920308826)
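A hedged Keras sketch of the Word2Vec-plus-CNN idea; the layer sizes are my own assumptions rather than the paper's values, and randomly initialized embeddings stand in for the fine-tuned Word2Vec vectors:

```python
# CNN over word embeddings for short-sentence classification (illustrative sizes).
# Assumes TensorFlow is installed; in the paper's setup the Embedding layer
# would be initialized with fine-tuned Word2Vec weights.
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 10_000, 100, 50

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    layers.Embedding(vocab_size, embed_dim),
    layers.Conv1D(128, kernel_size=3, activation="relu"),  # extracts n-gram-like substructures
    layers.GlobalMaxPooling1D(),                           # keeps the strongest response per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                 # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```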

Attention-based Bi-LSTM model

LSTM was proposed by two scientists, Sepp Hochreiter and Jürgen Schmidhuber, in 1995. Since then, the architecture has received considerable research attention and has been evaluated extensively. However, one of its major drawbacks is lower accuracy, caused by its inability to weight different parts of a document differently. The attention-based Bi-LSTM model was proposed by five Korean researchers in 2020 (Jang et al., 2020). They combine a bidirectional LSTM (long short-term memory) model with a CNN to extract higher-level features, adding an attention mechanism for sentiment classification. They used IMDB movie review data to evaluate the performance of the proposed model.

Proposed model accuracy (Source: https://www.mdpi.com/2076-3417/10/17/5841/htm)

As shown above, the proposed model's accuracy is higher and more stable after epoch 15. It also shows higher accuracy as the size of the data increases.
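For intuition, here is a minimal Keras sketch of a Bi-LSTM with a simple additive attention head; this is a common simplification of the idea, not the authors' exact architecture:

```python
# Bi-LSTM + attention for binary sentiment classification (illustrative sizes).
# Assumes TensorFlow is installed; the attention head is a simplified variant.
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 10_000, 100, 200

inputs = layers.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, embed_dim)(inputs)

# The bidirectional LSTM keeps one hidden state per time step.
h = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)  # (batch, seq, 128)

# Attention: score each time step, normalize across the sequence,
# then take the attention-weighted sum of the hidden states.
scores = layers.Dense(1, activation="tanh")(h)   # (batch, seq, 1)
weights = layers.Softmax(axis=1)(scores)         # attention distribution over time steps
context = layers.Dot(axes=1)([weights, h])       # (batch, 1, 128) weighted sum
context = layers.Flatten()(context)

outputs = layers.Dense(1, activation="sigmoid")(context)  # positive vs. negative review
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```

The attention weights make explicit which time steps (words) the classifier relies on, which addresses the drawback mentioned above.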

fastText model

This model is comparatively old, but its high adaptability across languages makes it worth mentioning in this study report. Facebook introduced the fastText model in 2016 (later revised in 2017) as an extension of the vanilla Word2Vec model (Bojanowski et al., 2017). In the proposed model, each word is represented as a bag of character n-grams; each character n-gram has a vector representation, and the sum of these representations represents the word. Because a word's vector can be composed from its n-grams, the model can represent even words that never appear in the training data, something other morphological word representations struggle with. As of today, the fastText GitHub repository contains pre-trained word vectors for 157 languages. I personally have not seen this level of robustness in any other text classification model.
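A short gensim sketch shows the key property, subword composition for out-of-vocabulary words; the data and hyperparameters are toy values for illustration:

```python
# fastText-style subword embeddings via gensim (assumes gensim >= 4 is installed).
from gensim.models import FastText

corpus = [
    ["machine", "learning", "models", "learn", "representations"],
    ["subword", "information", "helps", "with", "rare", "words"],
]

# min_n/max_n control the range of character n-grams each word is split into.
model = FastText(corpus, vector_size=50, window=3, min_count=1, min_n=3, max_n=5)

# A word is the sum of its character n-gram vectors, so even a word never
# seen in training ("learnings") still receives an embedding.
print(model.wv["learnings"].shape)  # (50,)
```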

Summary

In this article, we have discussed some of the latest state-of-the-art deep learning models for feature extraction and text classification. Even though most of the proposed deep learning models build on existing ones, each architecture brings some novelty in terms of accuracy and its applications. Finally, all of the above models give us a good idea of how to address problems such as context identification and word semantics.

References

Jang, B., Kim, M., Harerimana, G., Kang, S. & Kim, J. W., 2020. Bi-LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism. Applied Sciences, 10(17), 5841.

Sharma, A. K., Chaurasia, S. & Srivastava, D. K., 2020. Sentimental Short Sentences Classification by Using CNN Deep Learning Model with Fine Tuned Word2Vec. Procedia Computer Science, Volume 167, pp. 1139–1147.

Baroni, M., Dinu, G. & Kruszewski, G., 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).

Bojanowski, P., Grave, E., Joulin, A. & Mikolov, T., 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, Volume 5, pp. 135–146.

Mikolov, T. et al., 2013. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, Volume 26.

Sarkar, D., 2018. A hands-on intuitive approach to Deep Learning Methods for Text Data — Word2Vec, GloVe and FastText. [Online]
Available at: https://towardsdatascience.com/understanding-feature-engineering-part-4-deep-learning-methods-for-text-data-96c44370bbfa
[Accessed 15 July 2021].

Mikolov, T., Deoras, A., Kombrink, S., Burget, L. & Černocký, J., 2011. Strategies for Training Large Scale Neural Network Language Models. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU).

Thanks for reading!
