Automatic classification of documents using natural language processing

August 24, 2016 By jvmtech


The field of artificial intelligence is strongly connected with Big Data technologies. One of its areas deals with the study of natural language. Computers can be taught to recognize certain patterns in the texts they process and, based on those patterns, to automatically classify sentences, phrases or even full documents into predefined groups. Using open source tools, one can easily set up such a project, capable of classifying text after an automatic learning phase on a prepared set of input data.

Machine Learning in Natural Language Processing

Many machine learning algorithms related to natural language processing (NLP) use a statistical model, where decisions are made following a probabilistic approach. More recently deep learning approaches were also applied with extremely good results. The input data is composed of text fragments that can be viewed as simple sequences of words, full phrases or even full documents. The text is transformed and some weights are assigned to different features of the data.

The machine learning models are trained on the input data and can later be applied to new, unfamiliar input. Because such algorithms learn from the data, they cope better with new or erroneous input, like spelling mistakes or missing words, than the alternative linguistic models. Linguistic models are based on a set of predefined grammar rules that tend to fail on unfamiliar or incorrect input and, moreover, are harder to maintain in large and complex systems.

The accuracy of a machine learning model depends on the amount of input data. Supplying new texts for the model to learn from generally improves its predictions on newly processed data.

Natural Language Processing covers a variety of fields, ranging from part-of-speech tagging and named entity recognition (which seeks to locate and identify named entities like people, places, organizations and others) to machine translation, question answering, sentiment analysis, speech recognition and many more.

Topic discovery or automatic document classification

Some Natural Language Processing capabilities focus on topic segmentation and recognition. This implies a set of input data and some machine learning models that are capable of classifying the text into several topics.

The problem can be approached in two ways: unsupervised and supervised learning. Supervised learning makes use of data that has been labeled with the correct classes or topics, while unsupervised algorithms use input data that has not been hand-annotated with the correct class or topic. Generally, unsupervised learning is more complex and yields less accurate results than supervised learning. Nevertheless, the volume of unlabeled data is much greater than the volume of data that has the correct classes assigned, and in some situations an unsupervised algorithm is the only option.

In the context of NLP, unsupervised algorithms rely on clustering algorithms to separate the text data into groups and to identify the class of each group.

Supervised algorithms, on the other hand, require a set of textual data whose correct labels have been filled in beforehand. How to get labeled text? You can make use of existing data that has already been assigned to a category: movies assigned to a genre, products assigned to product categories, documents assigned to document categories, and comments assigned to topics.

A supervised NLP algorithm for classification will look at the input data and should be able to indicate which topic or class a new text belongs to, picking from the existing classes found in the training data.

Most algorithms also report a level of confidence for each match, usually in the range 0 to 1. Up to a point, the confidence grows with the amount of available training data, and it also depends on how similar the new incoming text is to the texts the model has learned from. The user can set a confidence threshold below which a prediction is discarded. The result can come as one or several classes that the model has chosen.
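
The thresholding step described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_labels`, its parameters and the example scores are assumptions made for this sketch, not the output format of any particular library.

```python
def select_labels(scores, threshold=0.5, top_k=1):
    """Return up to top_k (label, score) pairs whose score meets the threshold.

    scores: dict mapping label -> confidence in [0, 1] (hypothetical model output).
    An empty list means the prediction is discarded as too uncertain.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(label, score) for label, score in ranked[:top_k] if score >= threshold]


# Made-up confidences for one incoming text.
scores = {"sports": 0.72, "politics": 0.21, "technology": 0.07}

print(select_labels(scores, threshold=0.6))           # [('sports', 0.72)]
print(select_labels(scores, threshold=0.2, top_k=2))  # [('sports', 0.72), ('politics', 0.21)]
print(select_labels(scores, threshold=0.9))           # []
```

Raising the threshold trades coverage for precision: fewer texts receive a label, but the labels that are kept are more likely to be correct.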

An end-to-end flow for a supervised NLP algorithm consists of gathering labeled data, data preparation, creating the model for categorization and using the model to predict the topic of a new text.

Data preparation

Document classification starts from a list of documents composed of word sequences or full phrases. They cannot be used as machine learning features in their original form, as a whole. The documents must be split and transformed so that they make good features for a machine learning model.


The text is often split into words or grams. The grams, often referred to as n-grams, are sequences of n consecutive words from the text.
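
As a minimal sketch of this splitting step, the following Python snippet tokenizes a sentence and builds its n-grams. The helper names `tokenize` and `ngrams` and the simple regular expression are assumptions chosen for illustration; production tokenizers handle many more cases.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenize("The quick brown fox jumps.")
print(tokens)             # ['the', 'quick', 'brown', 'fox', 'jumps']
print(ngrams(tokens, 2))  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]
```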

Punctuation and stop word removal

The stop words are the most common words in a language. In English, some stop words are: “and”, “this”, “all”, “he”, “is”. Removing them should be one of the very first data cleaning steps, unless phrase search is needed, in which case they are helpful. Stop words are often noise that can reduce accuracy.
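
A rough sketch of this cleaning step in Python might look as follows. The stop word list here is a tiny hand-picked sample for illustration only (real lists contain a few hundred entries), and `clean_tokens` is a hypothetical helper name.

```python
import string

# Tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"and", "this", "all", "he", "is", "the", "a", "of"}


def clean_tokens(text, stop_words=STOP_WORDS):
    """Strip punctuation, lowercase, split on whitespace, and drop stop words."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.lower().split() if word not in stop_words]


print(clean_tokens("He is the author of this book, and all of its sequels."))
# ['author', 'book', 'its', 'sequels']
```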

Lemmatisation and Stemming

Words can take many forms. For example, the verb “to run” can be found as “ran”, “running” and “runs”. The base word, in this case “run”, is called the lemma of the word. Lemmatisation tries to reduce words to their lemma, since in essence all the forms represent the same thing.

Lemmatisation is related to, and often confused with, stemming. Both try to reduce inflectional forms to a common base. Stemming operates on a single word, without context or vocabulary lookups. It is a heuristic process that usually strips the end of the word to remove derivational forms and plurals. Lemmatisation, on the other hand, takes the whole context into account. Since the same word can have different meanings, lemmatisation involves a morphological analysis of the text and requires a vocabulary or dictionary. For example, the word “better” has “good” as its lemma, a link that stemming cannot capture because no suffix can be stripped to get from one to the other. However, stemming is simpler to implement and cheaper to run. This comes with a penalty on accuracy which, depending on the context, can be considered acceptable in exchange for a shorter running time and lower resource use.
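
The difference can be illustrated with a deliberately naive Python sketch. Both helpers below are toy assumptions written for this example: `naive_stem` is not the Porter stemmer or any other published algorithm, and `LEMMAS` is a hypothetical four-entry dictionary standing in for a real vocabulary with part-of-speech information.

```python
# Hypothetical mini-dictionary; a real lemmatizer uses a full vocabulary.
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "better": "good"}


def naive_stem(word):
    """Strip a known suffix without any dictionary lookup (toy heuristic)."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word


def lookup_lemma(word):
    """Return the dictionary lemma, or the word itself when it is unknown."""
    return LEMMAS.get(word, word)


for word in ["running", "runs", "ran", "better"]:
    print(word, naive_stem(word), lookup_lemma(word))
# running run run
# runs run run
# ran ran run
# better better good
```

The stemmer handles the regular forms but leaves “ran” and “better” untouched, while the dictionary lookup resolves them at the cost of maintaining the dictionary.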

Training the machine learning model

There are many ways to achieve this goal. The requirement is to have a set of training data in which each text, considered as one entity, has the correct classes or topics assigned. Depending on the chosen implementation, data cleaning can be included in the model creation step, using built-in functionality, or it has to be done in a prior step.

One open-source option for creating such a model is OpenNLP, a Java-based machine learning toolkit for Natural Language Processing. To use it, one simply adds the jar dependency to the project.

For auto-categorization, OpenNLP has a built-in document classifier based on the maximum entropy framework. Classifications are requirement-specific, hence there is no pre-built model for the document categorizer. The model must be created from the annotated training data and the language of the text. By default, the OpenNLP document classifier expects an input text stream with one document per line, in which the topic comes first and is separated from the text by whitespace. Based on this, a binary machine learning model is created, which is used at a later stage to make predictions on new incoming data.
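
As an illustration of that input format, the following Python sketch turns labeled documents into one line per document, with the topic first. It only prepares text in the layout described above and does not call OpenNLP; the helper name `to_training_lines`, the underscore convention for multi-word topics and the sample documents are assumptions made for this example.

```python
def to_training_lines(labeled_docs):
    """Format (category, text) pairs as one 'category text' line per document."""
    lines = []
    for category, text in labeled_docs:
        # The category must be a single token, and the text must stay on one line.
        label = "_".join(category.split())
        body = " ".join(text.split())
        lines.append(f"{label} {body}")
    return lines


docs = [
    ("Home Appliances", "Stainless steel kettle\nwith auto shut-off"),
    ("Books", "A beginner's guide to natural language processing"),
]
for line in to_training_lines(docs):
    print(line)
# Home_Appliances Stainless steel kettle with auto shut-off
# Books A beginner's guide to natural language processing
```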

If Java is not the desired programming language and one is more familiar with Python, NLTK (Natural Language Toolkit) provides some easy-to-use interfaces. NLTK is a leading platform for Python users interested in working with language data. It offers many text processing libraries, including some for classification. The classification can be either single-category or multi-category (often called multi-label elsewhere). In single-category classification, the text is expected to belong to exactly one category, while in multi-category classification, the text can belong to zero or more categories.

NLTK defines several classifiers, among which one can find the following:

  • DecisionTreeClassifier – a model that decides which labels to assign based on a tree structure where each branch corresponds to conditions on the feature data
  • MaxentClassifier – a maximum entropy classifier
  • NaiveBayesClassifier – based on the Naive Bayes algorithm
  • SklearnClassifier – a wrapper over the scikit-learn machine learning library for Python; it supports several classification algorithms including SVM, Naive Bayes, logistic regression and decision trees. One can choose the wrapper if they are more comfortable with scikit-learn, or directly use a classifier offered by NLTK.
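
To give an idea of what a classifier of this kind does internally, here is a minimal pure-Python sketch of multinomial Naive Bayes with add-one smoothing. It is not NLTK's implementation and does not use NLTK's API; the class name `TinyNaiveBayes`, its methods and the four training documents are assumptions made for this illustration.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, labeled_docs):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()
        for label, tokens in labeled_docs:
            self.doc_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict_proba(self, tokens):
        """Return a dict of label -> probability that sums to 1."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(n_docs / total_docs)  # class prior
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            log_scores[label] = score
        # Normalize in a numerically stable way.
        highest = max(log_scores.values())
        exp_scores = {lab: math.exp(s - highest) for lab, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {lab: value / norm for lab, value in exp_scores.items()}

    def predict(self, tokens):
        probabilities = self.predict_proba(tokens)
        return max(probabilities, key=probabilities.get)


# Made-up, already tokenized training documents.
training = [
    ("sports", ["match", "goal", "team", "win"]),
    ("sports", ["team", "coach", "season"]),
    ("tech", ["software", "release", "bug", "fix"]),
    ("tech", ["hardware", "chip", "release"]),
]
model = TinyNaiveBayes().fit(training)
print(model.predict(["team", "goal"]))      # sports
print(model.predict(["software", "chip"]))  # tech
```

With a real toolkit, the training and prediction calls play the same roles, and the per-label probabilities are what a confidence threshold would be applied to.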

What about Spark? Spark has one of the largest open source communities in Big Data, but can it be used for NLP and text classification? Yes, it can. Unlike the two tools mentioned above, it does not have a dedicated document classifier, but it offers some helpful feature transformer classes, such as:

  • Tokenizer – used to split a sentence into words
  • StopWordsRemover – defines a list of stop words for some languages and drops them from a list of words
  • NGram – generates n-grams from a list of words
  • Stanford CoreNLP wrapper – the Stanford NLP Group is formed of several professionals at Stanford University who are heavily involved in studying natural language processing and who provide numerous NLP tools. The Spark wrapper, a separate package rather than a part of Spark itself, packages the Stanford CoreNLP annotators as Spark DataFrame functions.
  • TF-IDF (Term Frequency – Inverse Document Frequency) – a vectorization method used in text mining to reflect the importance of a term to a document in the corpus. By measuring how often a term appears in a document and in how many documents of the corpus it appears, we can assign weights to terms based on their importance. If we only measured how often a term appears in a document, it would be very easy to overemphasize terms that carry little meaning, as some terms, like “a”, are frequent in all documents. By transforming the word vector with TF-IDF, one obtains a vector of word weights that can later be used as machine learning features.
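
The TF-IDF weighting can be sketched in plain Python as follows. This uses the basic textbook form (term frequency times the logarithm of the inverse document frequency); libraries often use smoothed variants, so the exact numbers will differ from theirs. The function name `tf_idf` and the three sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf  = term count / document length
    idf = log(N / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["a", "red", "kettle"],
    ["a", "blue", "kettle"],
    ["a", "guide", "to", "python"],
]
for weights in tf_idf(docs):
    print({term: round(w, 3) for term, w in weights.items()})
```

In this sample, “a” appears in every document and therefore receives a weight of zero, while “red”, which appears in only one document, is weighted higher than “kettle”, which appears in two.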

After the text data is transformed, regular (not necessarily NLP-specific) binary and multi-class classifiers can be applied to obtain the desired labels.

Automatic classification of text data

The model that was previously created must be loaded again if it is a separate component. For the new input to be classified, it must be transformed to match the form of the training data used to generate the model. The same set of transformations that were applied to the labeled data used for training must be applied to the text that needs to be classified. Afterwards, the model can predict one or several classes for the new text data.

Some examples of NLP auto-categorization are classifying comments and text into topics, predicting movie genres, automatically assigning products to the correct department, guessing book or document categories, classifying ads and emails into categories, filtering spam, and many more. The data to be included in the training process depends on the use case. For the movie example, the storyline can be a good indicator, as can the actors and the production company. For the comments example, the comment body should be sufficient. For the products use case, several elements can be taken into account, like the product name, description, brand and external category.

Each example is unique in its own way. Selecting the best features depends on the use case, and it is the job of the person implementing the process to pick the most appropriate ones. For automatic classification, the tools are there: they are open-source and ready to be put to use.