This article begins a four-part mini-tour through the world of Natural Language Processing. Your guide is Philip R. Burns, better known as Pib, who has worked in this area for decades. We’ll look at how natural language processing enables extraction of structured data from unstructured text. This first article covers basic concepts.
What is a Natural Language?
A natural language develops as the result of people speaking, reading, and writing it over historic stretches of time. This includes the approximately 7,000 current human languages as well as the extinct languages of the past, but does not include programming languages like Java and Python, artificial languages like Interlingua, or fictional languages such as Klingon.
Structured versus Unstructured Data
We are familiar with structured data stored in relational databases and spreadsheets like Excel. This data possesses well-defined data types and we can apply mathematical and statistical methods to analyze and visualize it using programs like SPSS and Tableau.
Conversely, natural language is unstructured and messy: the information in a natural language text lacks consistent formatting and data types. Ideally, we would like to transform unstructured text into structured data. We could then apply familiar software and methods to the data extracted from the text, and we could also merge in existing structured data, such as demographic information, for analysis.
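To make the contrast concrete, here is a minimal sketch of turning one unstructured sentence into a structured record. The example sentence and field names are invented for illustration, and the hand-written pattern only works for this exact phrasing; generalizing beyond such brittle rules is precisely what NLP is for.

```python
import re

# An unstructured sentence containing facts we would like as structured data.
note = "Patient Jane Doe, age 34, was admitted on 2021-03-15."

# A hand-written pattern with named groups. It captures this one phrasing
# only; any variation in wording would break it.
pattern = re.compile(
    r"Patient (?P<name>[A-Z][a-z]+ [A-Z][a-z]+), "
    r"age (?P<age>\d+), was admitted on (?P<date>\d{4}-\d{2}-\d{2})"
)

# Extract the named groups into a dictionary and give "age" a proper type.
record = pattern.search(note).groupdict()
record["age"] = int(record["age"])
print(record)  # {'name': 'Jane Doe', 'age': 34, 'date': '2021-03-15'}
```

Once the text is in this form, it has well-defined fields and types and can be loaded into a database or spreadsheet alongside existing structured data.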
Natural Language Processing
As human beings, we are good at processing and understanding human language. We correct grammatical mistakes, resolve ambiguous expressions, and infer implicit meanings. To convert unstructured text into structured data automatically, we must train computers to perform these same tasks. We call the algorithms which result from such training natural language processing (NLP) methods.
We use natural language processing all the time without realizing it. NLP backs the technology that allows us to ask our smartphones for directions, our cable remotes for TV programs, and our smart devices like Amazon’s Echo for news and songs. NLP powers the automated call centers we reach when dialing customer service lines. Online search technologies like Google and Bing interpret the queries we enter using NLP.
How Does Natural Language Processing Work?
Natural Language Processing takes complex and context-dependent human language text and transforms it into the kind of structured data that a computer can understand and act upon. How does that happen?
Early efforts at teaching computers to understand human language tried to teach computers the rules of grammar and syntax. That didn’t work well because people frequently don’t follow the rules. Misspellings, idioms, slang, and common grammatical errors may not prevent a person from understanding a text, but computers fare poorly when the rules aren’t followed precisely. It took quite a while to realize that rule-based systems were inadequate, but eventually researchers like Frederick Jelinek demonstrated that statistical methods worked better for some problems, resulting in significant advances in NLP methods.
Current researchers mainly use machine learning methods to build natural language processing algorithms and models. Machine learning combines real-world data with human-supplied characteristics (called “features”) to train computers to identify patterns and make predictions. The desired features are typically marked up in a training text or induced from a training set using statistical methods. This allows the creation of natural language processing methods that better capture how language is actually used, rather than how syntactic and grammatical rules specify language should be used. Probability judgments produced using machine learning have turned out to be an effective way for computers to approximate those rules.
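As a toy illustration of this statistical approach, the sketch below trains a naive Bayes classifier whose features are simply the words in a handful of labeled sentences, then uses word probabilities to predict a label for new text. The training sentences and topic labels are invented for this example, and real systems use far larger training sets and richer features.

```python
from collections import Counter, defaultdict
import math

# A tiny invented training set: each sentence is labeled with a topic.
training = [
    ("the game ended in a late goal", "sports"),
    ("the team won the final match", "sports"),
    ("the court ruled on the new law", "law"),
    ("the judge delayed the trial", "law"),
]

# Count word "features" per label, and how often each label occurs.
word_counts = defaultdict(Counter)
label_counts = Counter()
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counter in word_counts.values() for w in counter}

def predict(text):
    """Score each label by log P(label) + sum of log P(word | label),
    using add-one smoothing so unseen words don't zero out a label."""
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(training))
        for word in text.split():
            score += math.log(
                (word_counts[label][word] + 1) / (total + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("the match had one goal"))   # "sports"
print(predict("the trial set a new law"))  # "law"
```

Nothing here encodes a rule of grammar; the classifier simply learns, from counts, which words tend to occur with which label, which is the essence of the probabilistic approach described above.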
Despite the many successes NLP has achieved in recent years, we should remain cautious about its general applicability. There remain substantial problems yet to be fully solved, such as recognizing sarcasm and irony (something even humans can have trouble doing).
Basic Tasks for Natural Language Processing
The most basic NLP tasks to prepare for extracting actionable data from text include:
- Language detection – determining the language(s) in which a text is written. The remaining tasks below are language dependent, so it’s important to get the language right before anything else.
- Tokenization – splitting text into words and punctuation.
- Sentence splitting – splitting text into sentences.
- Part of speech tagging – adorning tokens with parts of speech (noun, verb, adjective, etc.).
- Lemmatization – reducing inflected forms to dictionary headwords. For example, “write”, “wrote”, and “written” are all forms of the verbal headword “write”. Stemming, a simpler and less accurate process, removes grammatical endings without concern for the grammatical correctness of the result.
- Parsing – determining the structure of words in a sentence. This is a mechanized version of the sentence diagramming drills some learned in grammar school.
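A rough, self-contained sketch of several of these tasks appears below. Production NLP libraries use trained models for each step; the stopword lists, regular expressions, and suffix rules here are deliberately simplistic stand-ins for illustration only.

```python
import re

# Language detection (crude): count hits against tiny stopword lists.
# These lists are illustrative fragments, not real resources.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "is"},
    "es": {"el", "la", "de", "que", "es"},
}

def detect_language(text):
    words = set(re.findall(r"[a-záéíóú]+", text.lower()))
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

# Sentence splitting (crude): break after ., !, or ? followed by whitespace.
# Real splitters must handle abbreviations like "Dr." that this misses.
def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

# Tokenization: separate word characters from punctuation marks.
def tokenize(sentence):
    return re.findall(r"\w+|[^\w\s]", sentence)

# Stemming (simpler than lemmatization): blindly strip common endings
# without checking that the result is a real word.
def stem(token):
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The cat sat on the mat. It was sleeping!"
print(detect_language(text))         # en
print(split_sentences(text))         # two sentences
print(tokenize("It was sleeping!"))  # ['It', 'was', 'sleeping', '!']
print(stem("sleeping"))              # sleep
print(stem("flies"))                 # "fli" -- a non-word, showing why
                                     # stemming is less accurate than
                                     # lemmatization
```

Part of speech tagging and parsing are omitted because they genuinely require trained models rather than a few lines of rules, which is exactly the point of the accuracy figures that follow.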
We can detect the language given enough text almost 100% of the time. We can tokenize the text, split the text into sentences, and lemmatize the words correctly nearly 100% of the time for English using current approaches. We can determine the correct parts of speech about 98% of the time. Parsing a sentence lags behind, with current methods yielding a correct parse about 93% of the time.
We can create higher-level algorithms using these results to build advanced NLP applications. We’ll look at some of these in the next article in this series.