Posted on: July 20, 2020
NLP using OpenNLP

NLP : Natural Language Processing is a branch of Artificial Intelligence that enables computers to analyze and understand human language. NLP was formulated to build software that generates and understands natural languages, so that a user can have natural conversations with a computer. NLP combines AI with computational linguistics and computer science to process human (natural) languages and speech.

As humans, we have a natural ability to understand and learn language. Now, thanks to AI, ML & Deep Learning, we have made these once-dumb machines intelligent, and they have started to interpret our language and even our feelings.

To understand NLP fundamentals, please refer to my other articles:

  1. A Guide To NLP : A Confluence Of AI And Linguistics
  2. Natural Language Processing : Basic Fundamentals

Let’s understand how to implement NLP using the open-source Java library OpenNLP.

Apache OpenNLP :

The Apache OpenNLP library is a machine-learning-based toolkit for processing natural language text. The OpenNLP library supports:

  1. Tokenization
  2. Sentence Segmentation
  3. Part-Of-Speech Tagging
  4. Named Entity Extraction
  5. Data Chunking
  6. Data Parsing
  7. Co-Reference Resolution

OpenNLP Features :

This open-source Java library comes loaded with the following features that developers can take advantage of to build robust Artificial Intelligence & Machine Learning based solutions. Some of the prominent features of this library are:

A. Named Entity Recognition (NER) − OpenNLP supports NER, helping developers extract names of locations, people, and things while processing queries (a small example follows this feature list).

Image : training pipeline for an NER model with OpenNLP.

B. Summarization − It helps in summarizing paragraphs, articles, documents, or collections of them.

C. Searching − In OpenNLP, a given search string or its synonyms can be identified in a given text, even when the word is altered or misspelled.

D. Tagging (POS) − Tagging in NLP is used to divide the text into various grammatical elements for further analysis.

E. Translation − It helps in translating text from one language into another.

F. Information grouping − This option groups related textual information in the content of a document, much like parts of speech.

G. Natural Language Generation − It is used for generating text from a database and automating information reports such as weather analyses or medical reports.

H. Feedback Analysis − As the name implies, feedback from people about a product is collected and analyzed with NLP to determine how well the product is winning their hearts.

I. Speech recognition − Though it is difficult to analyze human speech, NLP has some built-in features for this requirement.
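
To make the NER feature concrete, here is a minimal sketch of name finding with OpenNLP's NameFinderME. It assumes a pre-trained model file such as en-ner-person.bin downloaded separately from the OpenNLP models page; the class name, file path, and sample tokens are only illustrative.

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

public class NerExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to a pre-trained person-name model from the OpenNLP models page.
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
            NameFinderME nameFinder = new NameFinderME(model);

            // The name finder expects already-tokenized text.
            String[] tokens = {"Pierre", "Vinken", "is", "chairman", "of", "Elsevier", "N.V."};
            Span[] names = nameFinder.find(tokens);

            // Print each detected name span with its type (e.g. "person").
            for (Span span : names) {
                String name = String.join(" ", Arrays.copyOfRange(tokens, span.getStart(), span.getEnd()));
                System.out.println(span.getType() + ": " + name);
            }
        }
    }
}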

OpenNLP API :

The OpenNLP library's classes and interfaces help developers implement the various tasks it offers, such as sentence detection, tokenization, and name extraction, as mentioned above. We can also train & evaluate our own models for any of these tasks using the OpenNLP CLI (Command Line Interface).
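
For example, the standalone opennlp script shipped in the binary distribution's bin directory can train a sentence-detector model and then apply it. A minimal sketch based on the tool names documented in the OpenNLP manual, where en-sent.train, en-sent.bin, and input.txt are placeholder file names:

opennlp SentenceDetectorTrainer -model en-sent.bin -lang en -data en-sent.train -encoding UTF-8
opennlp SentenceDetector en-sent.bin < input.txt > output.txt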

Let's understand how to use Apache OpenNLP:

Installing OpenNLP : This process has been referenced from TutorialsPoint.

First, we need to go through the OpenNLP installation process:

Step 1 − Open https://opennlp.apache.org/ to go to the home page of Apache OpenNLP. There you will see an option to download the OpenNLP library.

Step 2 − On clicking download you will find various mirrors which will redirect you to the Apache Software Foundation Distribution directory.

Step 3 − This directory lists various Apache distributions. Browse through them, find the OpenNLP distribution, and click it.

Step 4 − On clicking, you will be redirected to the directory where you can see the index of the OpenNLP distribution.

Click on the latest version from the available distributions.

Step 5 − Each distribution provides the source and binary files of the OpenNLP library in various formats. Download the source and binary files, apache-opennlp-1.6.0-bin.zip and apache-opennlp-1.6.0-src.zip (for Windows).

Set the Classpath :

After downloading the OpenNLP library, you need to set the path to its bin directory. Assume that you have downloaded the OpenNLP library to the E drive of your system.

Now, follow the steps that are given below −

Step 1 − Right-click on ‘My Computer’ and select ‘Properties’.

Step 2 − Click on the ‘Environment Variables’ button under the ‘Advanced’ tab.

Step 3 − Select the Path variable and click the Edit button.

Step 4 − In the Edit Environment Variable window, click the New button, add the path to the OpenNLP bin directory E:\apache-opennlp-1.6.0\bin, and click the OK button.

Eclipse Installation

You can set the Eclipse environment for OpenNLP library, either by setting the Build path to the JAR files or by using pom.xml.

Setting Build Path to the JAR Files

Follow the steps given below to install OpenNLP in Eclipse −

Step 1 − Make sure that you have Eclipse environment installed in your system.

Step 2 − Open Eclipse. Click File → New to open a new project.

Step 3 − You will get the New Project wizard. In this wizard, select Java project and proceed by clicking the Next button.

Step 4 − Next, you will get the New Java Project wizard. Here, you need to create a new project and click the Next button.

Step 5 − After creating a new project, right-click on it, select Build Path and click Configure Build Path.

Step 6 − Next, you will get the Java Build Path wizard. Here, click the Add External JARs button.

Step 7 − Select the JAR files opennlp-tools-1.6.0.jar and opennlp-uima-1.6.0.jar located in the lib folder of the apache-opennlp-1.6.0 folder.

On clicking the Open button, the selected files will be added to your library.

On clicking OK, you will successfully add the required JAR files to the current project, and you can verify the added libraries by expanding Referenced Libraries.

Using pom.xml

Convert the project into a Maven project and add the following code to its pom.xml.

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>myproject</groupId>
    <artifactId>myproject</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>1.6.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-uima</artifactId>
            <version>1.6.0</version>
        </dependency>
    </dependencies>
</project>

Once installation is done, it's time to get our hands dirty with some code.


Sentence Detection Using Java OpenNLP :

The Sentence Detector in OpenNLP works by detecting whether a punctuation character marks the end of a sentence. Here a sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The first and last sentences are exceptions to this rule: the first non-whitespace character is assumed to begin a sentence, and the last non-whitespace character is assumed to end one.

Sentence Detection, or Sentence Segmentation, is the process of finding the start and end of each sentence in a given paragraph.

The sample text below (Source : opennlp.apache.org) should be segmented into its sentences.

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29. Mr. Vinken is
chairman of Elsevier N.V., the Dutch publishing group. Rudolph Agnew, 55 years
old and former chairman of Consolidated Gold Fields PLC, was named a director of this
British industrial conglomerate.

After detecting the sentence boundaries, each sentence is written on its own line.

Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC,
was named a director of this British industrial conglomerate.

Usually Sentence Detection is done before the text is tokenized and that’s the way the pre-trained models on the web site are trained, but it is also possible to perform tokenization first and let the Sentence Detector process the already tokenized text.

Sentence Detection API :

In order to integrate the Sentence Detector into our apps, we can use its API, which requires loading the Sentence Detector model and instantiating the Sentence Detector as shown below :

Source : opennlp.apache.org

InputStream modelIn = new FileInputStream("en-sent.bin");

try {
    SentenceModel model = new SentenceModel(modelIn);
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    if (modelIn != null) {
        try {
            modelIn.close();
        }
        catch (IOException e) {
        }
    }
}

Once the model is successfully loaded, as shown in the code snippet above, we need to instantiate SentenceDetectorME as shown below :

SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);

The Sentence Detector can output an array of Strings, where each String is one sentence.

String sentences[] = sentenceDetector.sentDetect("  First sentence. Second sentence. ");

The result array now contains two entries. The first String is “First sentence.” and the second String is “Second sentence.” The whitespace before, between and after the input String is removed.

We can also use the API to get the spans of the detected sentences in the input string, as shown below :

Span sentences[] = sentenceDetector.sentPosDetect("  First sentence. Second sentence. ");

The result array again contains two entries. The first span begins at index 2 and ends at 17. The second span begins at 18 and ends at 34. The utility method Span.getCoveredText can be used to create a substring which only covers the chars in the span.
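
Continuing the same two-sentence example, here is a small sketch of how Span.getCoveredText can be applied to the spans returned by sentPosDetect (variable names follow the snippets above):

String text = "  First sentence. Second sentence. ";
Span spans[] = sentenceDetector.sentPosDetect(text);

for (Span span : spans) {
    // getCoveredText returns only the characters of the original string covered by the span.
    System.out.println(span.getStart() + ".." + span.getEnd() + ": " + span.getCoveredText(text));
}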

OpenNLP Training API for Sentence Detection :

The OpenNLP training API trains the sentence model using the 3 basic steps mentioned below:

  • The application must open a sample data input stream (en-sent.train is the sample training data file used in the code snippet below)
  • Call the SentenceDetectorME.train method
  • Save the SentenceModel to a file or directly use it

The following sample code illustrates these steps:

Charset charset = Charset.forName("UTF-8");
ObjectStream<String> lineStream =
    new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

SentenceModel model;

try {
    model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
}
finally {
    sampleStream.close();
}

OutputStream modelOut = null;
try {
    modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
    model.serialize(modelOut);
} finally {
    if (modelOut != null)
        modelOut.close();
}

Example: How to Train the Sentence Detector

Let's understand how to train and test the sentence detector with a sample program using Apache OpenNLP. This code snippet has been taken from denismigol.com and illustrates the training-text generation method, the train method, the sample text, and the test method.

package com.denismigol.examples.nlp;

import opennlp.tools.sentdetect.*;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;

/**
 * @author Denis Migol
 */
public class OpenNlpSentenceDetectorTrainDemo {

    // Generates simple training text: each training sentence on its own line,
    // with varying leading/trailing whitespace and end-of-sentence punctuation.
    private static String generateTrainText() {
        final String lineSeparator = System.lineSeparator();
        StringBuilder sb = new StringBuilder();
        for (String space : Arrays.asList(" ", "\t")) {
            for (String end : Arrays.asList(".", "!", "?", "...")) {
                for (String trainSentence : Arrays.asList("Train sentence", "This is a demo sentence", "Demo sentence")) {
                    sb.append(trainSentence).append(end);
                    sb.append(lineSeparator);
                    sb.append(space).append(trainSentence).append(end);
                    sb.append(lineSeparator);
                    sb.append(space).append(trainSentence).append(end).append(space);
                    sb.append(lineSeparator);
                }
            }
        }
        return sb.toString();
    }

    // Trains a SentenceModel from the generated training text.
    private static SentenceModel train(final String trainText) throws IOException {
        try (ObjectStream<String> lineStream = new PlainTextByLineStream(
                () -> new ByteArrayInputStream(trainText.getBytes()), Charset.forName("UTF-8"));
             ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream)) {
            SentenceDetectorFactory sdFactory = new SentenceDetectorFactory("en", true, null, null);
            return SentenceDetectorME.train("en", sampleStream, sdFactory, TrainingParameters.defaultParams());
        }
    }

    private static String getSampleText() {
        return "This is sample sentence. " +
                "This is another sample sentence. " +
                "If this is one more sample sentence? " +
                "Of course!";
    }

    // Detects sentences in the given text with the trained model and prints them.
    private static void test(SentenceModel sentenceModel, String text) {
        SentenceDetector sentenceDetector = new SentenceDetectorME(sentenceModel);
        String[] sentences = sentenceDetector.sentDetect(text);

        System.out.println("Detected sentences (" + sentences.length + "):");
        for (String sentence : sentences) {
            System.out.println(sentence);
        }
    }

    public static void main(String[] args) throws IOException {
        SentenceModel sentenceModel = train(generateTrainText());
        test(sentenceModel, getSampleText());
    }
}

It will produce the following output:

Indexing events using cutoff of 5
Computing event counts...  done. 108 events
Indexing... done.
Sorting and merging events... done. Reduced 108 events to 29.
Done indexing.
Incorporating indexed data for training...
done.
Number of Event Tokens: 29
Number of Outcomes: 2
Number of Predicates: 20
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-74.85989550047407 0.6666666666666666
2: ... loglikelihood=-58.05378981516979 0.6666666666666666
3: ... loglikelihood=-51.90066031416953 0.6759259259259259
4: ... loglikelihood=-47.10480932596981 0.6759259259259259
5: ... loglikelihood=-43.12315271447414 0.8333333333333334
6: ... loglikelihood=-39.758853405584226 0.9907407407407407
7: ... loglikelihood=-36.87983766138582 0.9907407407407407
8: ... loglikelihood=-34.389934596963975 0.9907407407407407
9: ... loglikelihood=-32.21673085289501 0.9907407407407407
10: ... loglikelihood=-30.304449071076302 0.9907407407407407
11: ... loglikelihood=-28.6094142915977 0.9907407407407407
12: ... loglikelihood=-27.096983843172893 0.9907407407407407
13: ... loglikelihood=-25.739375077819503 0.9907407407407407
14: ... loglikelihood=-24.51408462700304 0.9907407407407407
15: ... loglikelihood=-23.402716872892608 0.9907407407407407
16: ... loglikelihood=-22.390104164554103 0.9907407407407407
17: ... loglikelihood=-21.46363898411363 0.9907407407407407
18: ... loglikelihood=-20.61276211717005 0.9907407407407407
19: ... loglikelihood=-19.828566844491615 0.9907407407407407
20: ... loglikelihood=-19.103490211942965 0.9907407407407407
21: ... loglikelihood=-18.431070219308292 0.9907407407407407
22: ... loglikelihood=-17.80575332530088 0.9907407407407407
23: ... loglikelihood=-17.22274066995802 0.9907407407407407
24: ... loglikelihood=-16.67786432394832 0.9907407407407407
25: ... loglikelihood=-16.167487002853644 0.9907407407407407
26: ... loglikelihood=-15.688420253869538 0.9907407407407407
27: ... loglikelihood=-15.237857288007417 0.9907407407407407
28: ... loglikelihood=-14.813317502980363 0.9907407407407407
29: ... loglikelihood=-14.412600399157458 0.9907407407407407
30: ... loglikelihood=-14.033747089758762 0.9907407407407407
31: ... loglikelihood=-13.675007987658752 0.9907407407407407
32: ... loglikelihood=-13.334815544472482 0.9907407407407407
33: ... loglikelihood=-13.011761144767837 0.9907407407407407
34: ... loglikelihood=-12.704575435316638 0.9907407407407407
35: ... loglikelihood=-12.41211150816689 0.9907407407407407
36: ... loglikelihood=-12.133330465877437 0.9907407407407407
37: ... loglikelihood=-11.86728898418637 0.9907407407407407
38: ... loglikelihood=-11.613128556740147 0.9907407407407407
39: ... loglikelihood=-11.370066162137675 0.9907407407407407
40: ... loglikelihood=-11.137386138387878 0.9907407407407407
41: ... loglikelihood=-10.914433086207524 0.9907407407407407
42: ... loglikelihood=-10.700605652154392 0.9907407407407407
43: ... loglikelihood=-10.495351066765874 0.9907407407407407
44: ... loglikelihood=-10.298160332724722 0.9907407407407407
45: ... loglikelihood=-10.10856397444299 0.9907407407407407
46: ... loglikelihood=-9.926128274007866 0.9907407407407407
47: ... loglikelihood=-9.750451929696485 0.9907407407407407
48: ... loglikelihood=-9.581163082663396 0.9907407407407407
49: ... loglikelihood=-9.417916665270553 0.9907407407407407
50: ... loglikelihood=-9.260392031138629 0.9907407407407407
51: ... loglikelihood=-9.108290832568638 0.9907407407407407
52: ... loglikelihood=-8.9613351156931 0.9907407407407407
53: ... loglikelihood=-8.819265607711275 0.9907407407407407
54: ... loglikelihood=-8.681840173961826 0.9907407407407407
55: ... loglikelihood=-8.548832425486301 0.9907407407407407
56: ... loglikelihood=-8.420030460217728 0.9907407407407407
57: ... loglikelihood=-8.295235723057136 0.9907407407407407
58: ... loglikelihood=-8.174261971931452 0.9907407407407407
59: ... loglikelihood=-8.05693433850452 0.9907407407407407
60: ... loglikelihood=-7.943088473577603 0.9907407407407407
61: ... loglikelihood=-7.832569768397395 0.9907407407407407
62: ... loglikelihood=-7.725232644116459 0.9907407407407407
63: ... loglikelihood=-7.620939902543833 0.9907407407407407
64: ... loglikelihood=-7.519562132102944 0.9907407407407407
65: ... loglikelihood=-7.420977163594624 0.9907407407407407
66: ... loglikelihood=-7.325069570959361 0.9907407407407407
67: ... loglikelihood=-7.231730212756028 0.9907407407407407
68: ... loglikelihood=-7.140855810534172 0.9907407407407407
69: ... loglikelihood=-7.0523485606820255 0.9907407407407407
70: ... loglikelihood=-6.966115776689699 0.9907407407407407
71: ... loglikelihood=-6.882069559082903 0.9907407407407407
72: ... loglikelihood=-6.800126490561962 0.9907407407407407
73: ... loglikelihood=-6.720207354129125 0.9907407407407407
74: ... loglikelihood=-6.642236872206987 0.9907407407407407
75: ... loglikelihood=-6.566143464947115 0.9907407407407407
76: ... loglikelihood=-6.491859026102301 0.9907407407407407
77: ... loglikelihood=-6.4193187149914595 0.9907407407407407
78: ... loglikelihood=-6.348460763225681 0.9907407407407407
79: ... loglikelihood=-6.279226294988016 0.9907407407407407
80: ... loglikelihood=-6.211559159771504 0.9907407407407407
81: ... loglikelihood=-6.145405776579712 0.9907407407407407
82: ... loglikelihood=-6.080714988684165 0.9907407407407407
83: ... loglikelihood=-6.017437928113908 0.9907407407407407
84: ... loglikelihood=-5.955527889125054 0.9907407407407407
85: ... loglikelihood=-5.8949402099642345 0.9907407407407407
86: ... loglikelihood=-5.8356321622987375 0.9907407407407407
87: ... loglikelihood=-5.7775628477400325 0.9907407407407407
88: ... loglikelihood=-5.720693100935707 1.0
89: ... loglikelihood=-5.6649853987488195 1.0
90: ... loglikelihood=-5.61040377508344 1.0
91: ... loglikelihood=-5.556913740951428 1.0
92: ... loglikelihood=-5.5044822094082075 1.0
93: ... loglikelihood=-5.453077425015335 1.0
94: ... loglikelihood=-5.402668897514784 1.0
95: ... loglikelihood=-5.353227339424586 1.0
96: ... loglikelihood=-5.304724607288421 1.0
97: ... loglikelihood=-5.257133646331944 1.0
98: ... loglikelihood=-5.210428438297972 1.0
99: ... loglikelihood=-5.1645839522496315 1.0
100: ... loglikelihood=-5.119576098146454 1.0
Detected sentences (4):
This is sample sentence.
This is another sample sentence.
If this is one more sample sentence?
Of course!

As the output shows, a model was trained and 4 sentences were detected.

OpenNLP can be included in a project as a Maven dependency. Sample pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.denismigol.example.nlp</groupId>
    <artifactId>nlp</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <java.version>1.8</java.version>
        <opennlp.version>1.7.2</opennlp.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.opennlp</groupId>
            <artifactId>opennlp-tools</artifactId>
            <version>${opennlp.version}</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

For more details, refer to these tutorials :

  1. https://www.tutorialspoint.com/opennlp/opennlp_sentence_detection.htm
  2. https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html#tools.sentdetect.detection

Reference books : To learn more about OpenNLP :

  1. http://opennlp.apache.org/books-tutorials-and-talks.html

Signing off with a wonderful thought which I came across while penning down this article :

In our continuous endeavor to make machines speak, AI & ML have come a long way. I wonder whether we even need to speak to get our work done.

In the next article in this NLP using OpenNLP series, we will cover the basics of Tokenization.

If you are enjoying my contribution, please Click Here and subscribe for more; I would feel blessed to hear from you and to respond.
