How to extract information from text? | Machine Learning | NLP

Why NLP?

In the 21st century, about 79% of the data around the world is in an unstructured format: text messages, tweets, Instagram posts, and so on. Natural Language Processing (NLP) is used to extract meaningful information from that data. It is a branch of computer science and AI that deals with human languages and improves human-computer interaction. The applications of NLP and the various tasks involved are covered below.

Whether the data is structured or unstructured, computers should be able to understand the information mentioned explicitly or implicitly in the text. NLP tasks are based on statistical models, which are purely mathematical. We will scope the problem here in a general way; if possible, the mathematics will be covered in a future update.

Applications of NLP

Some of the biggest applications of NLP are:

  • Sentiment Analysis
  • Chatbots
  • Speech Recognition
  • Machine Translation
  • Advertisement Matching
  • Keyword Search

Tasks involved in NLP

There are many tasks involved in Natural Language Processing. Using these tasks, we can extract more meaningful information from text.

1. Tokenization

This is the base operation for any NLP task. Tokenization is the process of splitting a sentence into a group of words (tokens).

Example:

Input: Virat is the captain of India
Output: Virat, is, the, captain, of, India

Here, the sentence is divided into six tokens using a simple whitespace tokenizer. We can also train custom tokenizer models for specific tasks.
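
To make this concrete, here is a minimal sketch using NLTK's whitespace tokenizer (the library choice is mine, not a requirement; it assumes NLTK is installed):

    # A minimal tokenization sketch using NLTK (assumes: pip install nltk).
    from nltk.tokenize import WhitespaceTokenizer

    sentence = "Virat is the captain of India"
    tokens = WhitespaceTokenizer().tokenize(sentence)
    print(tokens)
    # ['Virat', 'is', 'the', 'captain', 'of', 'India']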

2. Stemming

Stemming is the process of normalizing words into their base or root form, essentially trimming extra characters off the base word. It is mainly used to determine domain vocabularies in domain analysis.

Example:

Input: Affection, Affects, Affected, Affecting
Output: Affect

The most widely used stemming algorithms are the Porter stemmer, the Snowball stemmer (an improved version of Porter), and the Lancaster stemmer.
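
As a quick sketch, NLTK ships all three of these stemmers, so they are easy to compare side by side (again, assuming NLTK is installed):

    # Comparing the three stemmers mentioned above (assumes: pip install nltk).
    from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    lancaster = LancasterStemmer()

    for word in ["affection", "affects", "affected", "affecting"]:
        print(word, porter.stem(word), snowball.stem(word), lancaster.stem(word))
    # e.g. "affection" -> "affect" under Porter; Lancaster tends to trim more aggressively.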

3. Lemmatization

Lemmatization is the morphological analysis of a word. Detailed dictionaries covering all possible word forms have to be maintained, and the algorithm looks up a word's base form in those dictionaries.

Example:

Input: gone, going, went
Output: go

It groups the different forms of a word under a base word called the lemma. This is similar to stemming, but the output is a proper dictionary word, whereas in stemming the output is merely the trimmed word.
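
A minimal sketch with NLTK's WordNet lemmatizer (my choice of tool; note that the part of speech has to be supplied, otherwise nouns are assumed):

    # WordNet-based lemmatization (assumes: pip install nltk, plus
    # nltk.download('wordnet') for the dictionary data).
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    for word in ["gone", "going", "went"]:
        print(lemmatizer.lemmatize(word, pos="v"))
    # go, go, go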

4. POS Tags

A POS (part-of-speech) tag is the grammatical category of a word. It indicates how a word functions, both in meaning and grammatically, within the sentence. A word can have different POS tags depending on the context.

Example:

Input: He is the best student in the class
Output: He (Pronoun) is (Verb) the (Determiner) best (Adjective) student (Noun) in (Preposition) the (Determiner) class (Noun)

A standard reference for the different types of POS tags is the Penn Treebank tagset.
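
A short sketch using NLTK's default tagger, which outputs Penn Treebank tags (my choice of tooling; it assumes the relevant NLTK data packages are downloaded):

    # POS tagging with NLTK (assumes: pip install nltk, plus
    # nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')).
    import nltk

    tokens = nltk.word_tokenize("He is the best student in the class")
    print(nltk.pos_tag(tokens))
    # [('He', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('best', 'JJS'),
    #  ('student', 'NN'), ('in', 'IN'), ('the', 'DT'), ('class', 'NN')]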

5. Named Entity Recognition

Named Entity Recognition (NER) is used to detect specific entities in a given sentence, such as a person, location, organization, or movie. It is one of the core NLP tasks and fits naturally as a supervised machine-learning classification problem, where conditional probabilities are used to identify the entity type of a given word.

Examples:

Input: Donald Trump visited India on Friday at 3 pm
Output: Donald Trump (Person) visited India (Place) on Friday (Date) at 3 pm (Time)

We need to train probabilistic machine-learning models to identify the entities of the words: gather labelled data and feed it to the model, and the model will decide the output based on the probabilities learned from the training data.
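
Before training anything ourselves, a pretrained model is worth trying. Here is a minimal sketch with spaCy (my choice of library; the exact entity labels depend on the model version):

    # NER with a pretrained spaCy model (assumes: pip install spacy and
    # python -m spacy download en_core_web_sm).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Donald Trump visited India on Friday at 3 pm")
    for ent in doc.ents:
        print(ent.text, ent.label_)
    # Typically: Donald Trump PERSON / India GPE / Friday DATE / 3 pm TIME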

6. Chunking

Chunking picks up individual pieces of information and groups them into bigger pieces. After extracting the individual pieces of information, we may need to group specific pieces under categories to represent the information in a meaningful way.

Example:

Input: Chennai’s John went to Australia yesterday
NER Output: Chennai’s (Place) John (Person) went to Australia (Place) yesterday (Date)

Here, after extracting the entities, there is ambiguity about how to connect John (Person) with the places (Chennai, Australia). We need to know where John went, and we cannot simply assume that the place following a name is the destination; given the structure of the English language, it can appear either way. This is where grouping information helps: we train a model to group John and Australia together, and from that group we can tell where he went. A simple rule-based sketch follows the chunking output below.

Chunking Output: Chennai’s (Place) John (Person) (Group 1) went to Australia (Place) (Group 1) yesterday (Date)
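
For experimenting, NLTK's RegexpParser provides simple rule-based chunking. The sketch below only groups noun phrases by POS pattern; learning the Person/Place grouping described above would require a trained model, and the grammar here is just an illustrative choice of mine:

    # Rule-based noun-phrase chunking with NLTK (assumes the NLTK data
    # downloads used in the POS example above).
    import nltk

    tokens = nltk.word_tokenize("Chennai's John went to Australia yesterday")
    tagged = nltk.pos_tag(tokens)

    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"  # a toy noun-phrase rule
    chunker = nltk.RegexpParser(grammar)
    print(chunker.parse(tagged))
    # Nouns such as "Chennai", "John" and "Australia" come out as NP chunks.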

Overview

The above-mentioned tasks are general NLP tasks. We need not include all of them in a project; we can choose only the required ones. Suppose I need to extract the intent from this text:

Example: Hey Alexa, Book me an appointment for tomorrow

Here, we can choose the tokenizer and NER alone for the model:

Tokenizer: Hey, Alexa, Book, me, an, appointment, for, tomorrow

NER: Hey Alexa, Book (Intent) me an appointment (Option) for tomorrow (Date)

Here, using the tokenizer and NER alone, we can identify the intent, the option, and the date from the text. We can then apply the option to the intent on the mentioned date in our project. That's it: the work is done and the appointment is booked for tomorrow. A toy end-to-end sketch follows.
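
To tie the flow together, here is a toy end-to-end sketch. The tokenizer is real NLTK, but the intent and date extraction is a couple of hand-written rules standing in for a trained NER model; every name in it (INTENT_KEYWORDS, DATE_WORDS, extract_booking) is hypothetical:

    # Toy intent extraction: tokenize, then tag intent and date with
    # hand-written rules in place of a trained NER model. All names here
    # (INTENT_KEYWORDS, DATE_WORDS, extract_booking) are hypothetical.
    from nltk.tokenize import WhitespaceTokenizer

    INTENT_KEYWORDS = {"book": "BookAppointment"}
    DATE_WORDS = {"today", "tomorrow"}

    def extract_booking(text):
        tokens = [t.strip(",.").lower() for t in WhitespaceTokenizer().tokenize(text)]
        intent = next((INTENT_KEYWORDS[t] for t in tokens if t in INTENT_KEYWORDS), None)
        date = next((t for t in tokens if t in DATE_WORDS), None)
        return {"intent": intent, "date": date}

    print(extract_booking("Hey Alexa, Book me an appointment for tomorrow"))
    # {'intent': 'BookAppointment', 'date': 'tomorrow'}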

Open Source Libraries

If you want to build your own NLP tasks, use well-known open-source libraries such as NLTK, spaCy, Stanford CoreNLP, or Gensim in your projects.

More blogs to dig deeper:

  1. https://medium.com/analytics-vidhya/understanding-encoder-decoder-sequence-to-sequence-architecture-in-deep-learning-ffafe365ef11
  2. https://medium.com/analytics-vidhya/convolutional-neural-networks-for-computer-vision-applications-4808a6d984c2
  3. https://medium.com/analytics-vidhya/knowledge-distillation-dark-knowledge-of-neural-network-9c1dfb418e6a
  4. https://medium.com/@abinesh.mba13/neural-architecture-search-nas-the-future-of-deep-learning-4b35ca473b9

