Natural Language Processing with Apache OpenNLP

Natural language processing (NLP) is one of the most important frontiers in software. The basic idea, how to consume and generate human language effectively, has been an ongoing effort since the dawn of digital computing. The effort continues today, with machine learning and graph databases on the front lines of the effort to master natural language.

This article is a hands-on introduction to Apache OpenNLP, a Java-based machine learning project that delivers primitives like chunking and lemmatization, both required for building NLP-enabled systems.

What is Apache OpenNLP?

A machine learning natural language processing system such as Apache OpenNLP typically has three parts:

  1. Learning from a corpus, which is a set of textual data (plural: corpora).
  2. A model that is generated from the corpus.
  3. Using the model to perform tasks on the target text.

To make things even simpler, OpenNLP has pre-trained models available for many common use cases. For more sophisticated requirements, you might need to train your own models. For a simpler scenario, you can just download an existing model and apply it to the task at hand.
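The three parts listed above can be sketched without any library at all. The toy below is only an illustration under loud assumptions: the class name ToyLanguageModel, the one-sentence corpora, and the character-trigram scoring are all hypothetical choices made to keep the example small, and none of it reflects how OpenNLP builds or applies its models.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of corpus -> model -> prediction.
// NOT OpenNLP's algorithm: the names and the trigram scoring are hypothetical.
final class ToyLanguageModel {
  // The "model": for each language, how often each character trigram occurred.
  private final Map<String, Map<String, Integer>> trigramCounts = new HashMap<>();

  // Parts 1 and 2: learn from a (tiny) corpus and keep the result as a model.
  void train(String lang, String corpusText) {
    Map<String, Integer> counts = trigramCounts.computeIfAbsent(lang, k -> new HashMap<>());
    String text = corpusText.toLowerCase();
    for (int i = 0; i + 3 <= text.length(); i++) {
      counts.merge(text.substring(i, i + 3), 1, Integer::sum);
    }
  }

  // Part 3: apply the model to target text by counting trigrams seen in training.
  String predict(String target) {
    String text = target.toLowerCase();
    String best = null;
    int bestScore = -1;
    for (Map.Entry<String, Map<String, Integer>> entry : trigramCounts.entrySet()) {
      int score = 0;
      for (int i = 0; i + 3 <= text.length(); i++) {
        if (entry.getValue().containsKey(text.substring(i, i + 3))) {
          score++;
        }
      }
      if (score > bestScore) {
        bestScore = score;
        best = entry.getKey();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    ToyLanguageModel model = new ToyLanguageModel();
    model.train("eng", "the quick brown fox jumps over the lazy dog and then the cat");
    model.train("spa", "el perro corre por la calle y el gato duerme en la casa");
    System.out.println(model.predict("the dog and the fox"));
  }
}
```

A pre-trained OpenNLP model plays the role of the trigramCounts map here: someone else has already done the training step, so your code only needs the final "apply" step.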

Language detection with OpenNLP

Let’s build a basic application that we can use to see how OpenNLP works. We can start the layout with a Maven archetype, as shown in Listing 1.

Listing 1. Make a new project

~/apache-maven-3.8.6/bin/mvn archetype:generate -DgroupId=com.infoworld -DartifactId=opennlp -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false

This archetype will scaffold a new Java project. Next, add the Apache OpenNLP dependency to the pom.xml in the project’s root directory, as shown in Listing 2. (You can use whatever version of the OpenNLP dependency is most current.)

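Listing 2. The OpenNLP Maven dependency

<dependency>
  <groupId>org.apache.opennlp</groupId>
  <artifactId>opennlp-tools</artifactId>
  <version>2.0.0</version>
</dependency>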


To make it easier to run the program, also add the following entry to the <plugins> section of the pom.xml file:

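Listing 3. Main class execution target for Maven POM

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <version>3.0.0</version>
  <configuration>
    <mainClass>com.infoworld.App</mainClass>
  </configuration>
</plugin>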


Now run the program with mvn compile exec:java. (You’ll need Maven and an installed JDK to run this command.) Running it now will just give you the familiar “Hello World!” output.

Download and set up a language detection model

We are now ready to use OpenNLP to detect the language in our example program. The first step is to download a language detection model. Download the latest Language Detector component from the OpenNLP models download page. As of this writing, the current version is langdetect-183.bin.

To make the model easy to get at, let’s go into the Maven project and mkdir a new directory at /opennlp/src/main/resources, then copy the langdetect-*.bin file there.

Now, let’s modify an existing file to what you see in Listing 4. We’ll use the App class under /opennlp/src/main/java/com/infoworld/ for this example.

Listing 4.

package com.infoworld;

import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import opennlp.tools.langdetect.LanguageDetectorModel;
import opennlp.tools.langdetect.LanguageDetector;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.Language;

public class App {
  public static void main( String[] args ) {
    System.out.println( "Hello World!" );
    App app = new App();
    try {
      app.nlp();
    } catch (IOException ioe){
      System.err.println("Problem: " + ioe);
    }
  }

  public void nlp() throws IOException {
    InputStream is = this.getClass().getClassLoader().getResourceAsStream("langdetect-183.bin"); // 1
    LanguageDetectorModel langModel = new LanguageDetectorModel(is); // 2
    String input = "This is a test.  This is only a test.  Do not pass go.  Do not collect $200.  When in the course of human history."; // 3
    LanguageDetector langDetect = new LanguageDetectorME(langModel); // 4
    Language langGuess = langDetect.predictLanguage(input); // 5

    System.out.println("Language best guess: " + langGuess.getLang());

    Language[] languages = langDetect.predictLanguages(input);
    System.out.println("Languages: " + Arrays.toString(languages));
  }
}

Now, you can run this program with the command mvn compile exec:java. When you do, you’ll get output similar to that shown in Listing 5.

Listing 5. Language detection run 1

Language best guess: eng
Languages: [eng (0.09568318011427969), tgl (0.027236092538322446), cym (0.02607472496029117), war (0.023722424236917564)...

The “ME” in this sample stands for maximum entropy. Maximum entropy is a concept from statistics that is used in natural language processing to optimize for best results.
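To make the term slightly more concrete: the maximum entropy principle prefers, among all probability distributions consistent with the training data, the one that is otherwise as uniform as possible. The sketch below only computes Shannon entropy for two hand-made distributions to show that the uniform one scores higher. The class and method names are hypothetical, the numbers are invented for illustration, and this is not OpenNLP's training code.

```java
// Shannon entropy of a discrete probability distribution, measured in bits.
// Hypothetical helper for illustration only; not part of OpenNLP.
final class EntropyDemo {
  static double entropy(double[] probabilities) {
    double bits = 0.0;
    for (double p : probabilities) {
      if (p > 0.0) { // by convention, a zero-probability outcome contributes nothing
        bits -= p * (Math.log(p) / Math.log(2));
      }
    }
    return bits;
  }

  public static void main(String[] args) {
    double[] uniform = {0.25, 0.25, 0.25, 0.25}; // no outcome is favored
    double[] peaked = {0.85, 0.05, 0.05, 0.05};  // one outcome dominates
    System.out.println("uniform: " + entropy(uniform));
    System.out.println("peaked:  " + entropy(peaked));
  }
}
```

The uniform distribution has the larger entropy because it commits to the least; a maximum entropy model stays that noncommittal except where the training data gives it a reason not to.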

Evaluate the results

After running the program, you will see that the OpenNLP language detector accurately guessed that the language of the text in the example program was English. We’ve also output some of the probabilities the language detection algorithm came up with. After English, it guessed the language might be Tagalog (tgl), Welsh (cym), or Waray (war). In the detector’s defense, the language sample was small. Correctly identifying the language from just a handful of sentences, with no other context, is pretty impressive.

Before we move on, look back at Listing 4. The flow is pretty simple. Each commented line works like so:

  1. Open the langdetect-183.bin file as an input stream.
  2. Use the input stream to parameterize instantiation of the LanguageDetectorModel.
  3. Create a string to use as input.
  4. Make a language detector object, using the LanguageDetectorModel from line 2.
  5. Run the langDetect.predictLanguage() method on the input from line 3.

Testing probability

If we add more English language text to the string and run it again, the probability assigned to eng should go up. Let’s try it by pasting in the contents of the United States Declaration of Independence into a new file in our project directory: /src/main/resources/declaration.txt. We’ll load that and process it as shown in Listing 6, replacing the inline string:

Listing 6. Load the Declaration of Independence text

String input = new String(this.getClass().getClassLoader().getResourceAsStream("declaration.txt").readAllBytes());

If you run this, you’ll see that English is still the detected language.
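One practical caveat with the one-liner in Listing 6: getResourceAsStream returns null when the file is not on the classpath, and new String(byte[]) decodes with the JVM's default charset, which is only guaranteed to be UTF-8 on newer JDKs. A slightly more defensive variant might look like the sketch below. The helper name readText is hypothetical, and the code assumes the text file is UTF-8 encoded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: read a whole stream as UTF-8 text, failing clearly if it is missing.
final class ResourceText {
  static String readText(InputStream in, String name) throws IOException {
    if (in == null) {
      // getResourceAsStream signals "not found" by returning null
      throw new IOException("Resource not found on classpath: " + name);
    }
    try (InputStream stream = in) { // make sure the stream is closed afterwards
      return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    // In the article's project the stream would come from
    // getClass().getClassLoader().getResourceAsStream("declaration.txt").
    // An in-memory stream stands in for it here so the sketch runs on its own.
    InputStream demo = new ByteArrayInputStream(
        "When in the Course of human events".getBytes(StandardCharsets.UTF_8));
    System.out.println(readText(demo, "declaration.txt"));
  }
}
```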

Detecting sentences with OpenNLP

You’ve seen the language detection model at work. Now, let’s try out a model for detecting sentences. To start, return to the OpenNLP model download page, and add the latest Sentence English model component to your project’s /resources directory. Notice that knowing the language of the text is a prerequisite for detecting sentences.

We’ll follow a similar pattern to what we did with the language detection model: load the file (in my case opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin) and use it to instantiate a sentence detector. Then, we’ll use the detector on the input file. You can see the new code in Listing 7 (along with its imports); the rest of the code remains the same.

Listing 7. Detecting sentences

import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.sentdetect.SentenceDetectorME;

InputStream modelFile = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin");
    SentenceModel sentModel = new SentenceModel(modelFile);
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentModel);
    String[] sentences = sentenceDetector.sentDetect(input);
    System.out.println("Sentences: " + sentences.length + " first line: " + sentences[2]);

Running the file will now output something like Listing 8.

Listing 8. Sentence detector output

Sentences: 41 first line: In Congress, July 4, 1776

The unanimous Declaration of the thirteen united States of America, When in the Course of human events, ...

Notice that the sentence detector found 41 sentences, which sounds about right. Also notice that this detector model is fairly simple: it just looks for periods and spaces to find the breaks. It doesn’t have logic for grammar. That is why we used index 2 on the sentences array to get the actual preamble: the header lines were put together as two sentences. (The founding documents are notoriously inconsistent with punctuation, and the sentence detector makes no attempt to consider “When in the Course…” as a new sentence.)
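To see why punctuation alone is a fragile signal, here is a deliberately naive splitter in plain Java. It is a sketch of the "periods and spaces" idea only: the class name NaiveSentenceSplitter and the sample text are hypothetical, and this is not the logic inside SentenceDetectorME.

```java
import java.util.Arrays;

// Naive rule: a sentence ends at '.', '!' or '?' when whitespace follows.
// Hypothetical illustration; it knows nothing about grammar or abbreviations.
final class NaiveSentenceSplitter {
  static String[] split(String text) {
    // The lookbehind keeps the punctuation attached to the sentence it closes.
    return text.trim().split("(?<=[.!?])\\s+");
  }

  public static void main(String[] args) {
    String[] pieces = split("Dr. Franklin signed it. So did Mr. Adams.");
    System.out.println(pieces.length + " pieces: " + Arrays.toString(pieces));
  }
}
```

The sample contains two sentences, but the rule also breaks after "Dr." and "Mr.", so it reports more pieces than there are sentences. Headers without a closing period have the opposite problem and get glued to whatever follows, which matches what happened to the Declaration's title lines.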

Tokenization with OpenNLP

After breaking documents into sentences, tokenizing is the next level of granularity. Tokenization is the process of breaking the document down into words and punctuation. We can use the code shown in Listing 9:

Listing 9. Tokenization

import opennlp.tools.tokenize.SimpleTokenizer;
SimpleTokenizer tokenizer = SimpleTokenizer.INSTANCE;
    String[] tokens = tokenizer.tokenize(input);
    System.out.println("tokens: " + tokens.length + " : " + tokens[73] + " " + tokens[74] + " " + tokens[75]);

This will give output like that shown in Listing 10.

Listing 10. Tokenizer output

tokens: 1704 : human events ,

So, the tokenizer broke the document into 1704 tokens. We can access the array of tokens: the words “human” and “events” and the comma that follows each occupy one element.
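For intuition about why the comma ends up in its own array element, the following plain-Java sketch tokenizes with a regular expression. It only approximates the idea of splitting words from punctuation: the class name RegexTokenizerSketch and the pattern are hypothetical and are not SimpleTokenizer's actual rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer sketch: a run of letters or digits is one token,
// and any other non-whitespace character is a token by itself.
final class RegexTokenizerSketch {
  private static final Pattern TOKEN =
      Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

  static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    Matcher matcher = TOKEN.matcher(text);
    while (matcher.find()) {
      tokens.add(matcher.group());
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(tokenize("the Course of human events, it becomes necessary"));
  }
}
```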

Name finding with OpenNLP

Now, we’ll grab the “Person name finder” model for English, called en-ner-person.bin. Note that this model is located on the SourceForge model downloads page. Once you have the model, put it in the resources directory for your project and use it to find names in the document, as shown in Listing 11.

Listing 11. Name finding with OpenNLP

import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.util.Span;
InputStream nameFinderFile = this.getClass().getClassLoader().getResourceAsStream("en-ner-person.bin");
    TokenNameFinderModel nameFinderModel = new TokenNameFinderModel(nameFinderFile);
    NameFinderME nameFinder = new NameFinderME(nameFinderModel);
    Span[] names = nameFinder.find(tokens); // expects the already tokenized text
    System.out.println("names: " + names.length);
    for (Span nameSpan : names){
      // getStart() is the first token of the name; getEnd() is one past the last token
      System.out.println("name: " + nameSpan + " : " + tokens[nameSpan.getStart()] + " " + tokens[nameSpan.getEnd()-1]);
    }

In Listing 11, we load the model and use it to instantiate a NameFinderME object, which we then use to get an array of names, modeled as span objects. A span has a start and an end that tell us where the detector thinks the name begins and ends in the set of tokens. Note that the name finder expects an array of already tokenized strings.
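Turning a span back into readable text is mostly a matter of index bookkeeping. The plain-Java sketch below assumes the convention documented for OpenNLP's Span, where the start index is inclusive and the end index is one past the last token. The class SpanSketch, the method cover, and the sample tokens and indices are hypothetical and exist only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper showing how a [start, end) token span maps back to text.
final class SpanSketch {
  // Joins tokens[start] through tokens[end - 1]; 'end' itself is excluded.
  static String cover(String[] tokens, int start, int end) {
    return String.join(" ", Arrays.copyOfRange(tokens, start, end));
  }

  public static void main(String[] args) {
    String[] tokens = {"Signed", "by", "John", "Hancock", "in", "1776"};
    // Indices a name finder might plausibly report for the tokens above (invented here).
    int start = 2;
    int end = 4;
    System.out.println(cover(tokens, start, end));
  }
}
```

With real OpenNLP spans you would pass nameSpan.getStart() and nameSpan.getEnd() as the two indices. The library's Span class also provides a static spansToStrings helper that performs this conversion for a whole array of spans.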

Tagging parts of speech with OpenNLP

OpenNLP allows us to tag parts of speech (POS) against tokenized strings. Listing 12 is an example of parts-of-speech tagging.

Listing 12. Parts-of-speech tagging

import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
InputStream posIS = this.getClass().getClassLoader().getResourceAsStream("opennlp-en-ud-ewt-pos-1.0-1.9.3.bin");
POSModel posModel = new POSModel(posIS);
POSTaggerME posTagger = new POSTaggerME(posModel);
String[] tags = posTagger.tag(tokens);
System.out.println("tags: " + tags.length);

for (int i = 0; i < 15; i++){
  System.out.println(tokens[i] + " = " + tags[i]);
}

The process is similar, with the model file loaded into a model class and then used on the array of tokens. It outputs something like Listing 13.

Listing 13. Parts-of-speech output

tags: 1704
Declaration = NOUN
of = ADP
Independence = NOUN
: = PUNCT
A = DET
Transcription = NOUN
Print = VERB
This = DET
Page = NOUN
Note = NOUN
: = PUNCT
The = DET
following = VERB
text = NOUN
is = AUX

Unlike the name finding model, the POS tagger has done a good job. It correctly identified several different parts of speech. Examples in Listing 13 included NOUN, ADP (which stands for adposition), and PUNCT (for punctuation).
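Once every token has a tag, a quick way to get a feel for the result is to count how often each tag occurs. The sketch below does that in plain Java. The class name TagCounts is hypothetical, and the short tags array is an invented stand-in for the String[] that posTagger.tag(tokens) returns in Listing 12.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: tally how many tokens received each part-of-speech tag.
final class TagCounts {
  static Map<String, Integer> count(String[] tags) {
    Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps tags in alphabetical order
    for (String tag : tags) {
      counts.merge(tag, 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    // Invented sample; in the article's program, pass the real tags array instead.
    String[] tags = {"NOUN", "ADP", "NOUN", "NOUN", "VERB", "DET", "NOUN"};
    System.out.println(count(tags));
  }
}
```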


In this article, you’ve seen how to add Apache OpenNLP to a Java project and use pre-built models for natural language processing. In some cases, you may need to develop your own model, but the pre-existing models will often do the trick. In addition to the models demonstrated here, OpenNLP includes features such as a document categorizer, a stemmer (which breaks words down to their roots), a chunker, and a parser. All of these are the building blocks of a natural language processing system, and they are freely available with OpenNLP.

Copyright © 2022 IDG Communications, Inc.
