short answer: no
A more detailed answer... You can learn about the technology behind ChatGPT from Professor Stephen Wolfram in this long but informative blog post: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/ You will learn that what ChatGPT does is based on a simple principle: a colossal language model was trained to make word predictions. Given a sequence of words, what is the most probable word that should come next? So given an input text, a question, or a piece of text followed by a question, the system completes it, and usually this completion is the answer to your question. I will not go into details, but a critical underlying principle of the method is the distributional semantics approach: the semantics of words are learned from their use and represented as vectors of N dimensions.

Traditionally, the processing of natural language utterances was done in a modular way, mirroring the way linguists study natural languages: morphology, syntax, semantics, pragmatics, etc. Computer scientists also embraced the idea, since modularity is at the heart of programming: we break complex tasks and structures into simple ones that can be independently developed and then combined into the final solution. This is also supported by computational linguistics, which uses computational systems and methods to study languages and test hypotheses about them. We learn to start from a string, break it into tokens, group sequences of tokens into sentences, and keep adding metadata to these data structures or combining them into more elaborate ones. We usually consider tasks such as tokenization, part-of-speech tagging, lemmatization, syntactic/semantic parsing, named entity recognition, word sense disambiguation, etc. We have many libraries that experiment with different tasks and with different orders in which to combine them. Libraries such as OpenNLP or Freeling (https://nlp.lsi.upc.edu/freeling/) adopted this pipeline approach. More sophisticated systems recognize that humans don’t necessarily decide whether a given word is a noun or a verb before comprehending its contribution to the sentence, so instead of a pipeline of independent steps, they use a more integrated approach. Nevertheless, the idea is the same: from a string, construct data structures (or symbolic representations) to be further enriched or used directly in final applications. Applications include question answering, fact extraction from texts, sentiment analysis, translation, etc.

In recent years, more and more of the tasks described above have tended to ignore explicit linguistic knowledge encoded as rules (e.g., morphosyntactic rules) or enumerated in hand-crafted resources such as lexical-semantic dictionaries (see https://wordnet.princeton.edu/ or https://nlp.cs.nyu.edu/nomlex/). Instead, we started to see texts being annotated so that systems can learn how to reproduce the same analysis when they see similar text. See https://universaldependencies.org/, a vast collection of sentences in many languages annotated with syntactic analyses, used to train parsers. This approach of learning from annotated data (examples) became popular and started to give people the wrong impression that deep linguistic knowledge is irrelevant. Once a lot of annotated data became freely available, people forgot the cost of constructing these datasets and the value of the annotators and maintainers, who usually need proper linguistic training.
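To make a couple of the ideas above concrete, here are some toy sketches in Python. None of this is how ChatGPT (or OpenNLP) is actually implemented; the code and data are made up for illustration only. First, the next-word-prediction principle, reduced to a tiny bigram model that just counts which word most often follows which:

from collections import Counter, defaultdict

# Toy corpus; a real model is trained on a colossal amount of text
# and conditions on much longer contexts than a single preceding word.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each preceding word (bigrams).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def most_probable_next(word):
    # Return the most frequent continuation of `word` and its estimated probability.
    counts = following[word]
    best, freq = counts.most_common(1)[0]
    return best, freq / sum(counts.values())

print(most_probable_next("the"))  # ('cat', 0.25) -- four equally likely continuations
print(most_probable_next("sat"))  # ('on', 1.0)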
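Second, the distributional semantics idea: word meanings represented as vectors, with similarity of meaning approximated by similarity of vectors. The 3-dimensional numbers below are invented; real embeddings are learned from word usage and have hundreds of dimensions:

import math

# Invented toy vectors; real models learn these from co-occurrence in text.
vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(vectors["cat"], vectors["dog"]))  # close to 1.0: similar usage, similar meaning
print(cosine(vectors["cat"], vectors["car"]))  # around 0.3: much less similar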
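Third, the modular pipeline approach: each step enriches the output of the previous one. The tokenizer and tagger here are deliberately naive stand-ins for what libraries like OpenNLP or Freeling do with proper models:

import re

def tokenize(text):
    # Naive tokenizer: words and punctuation marks become separate tokens.
    return re.findall(r"\w+|[^\w\s]", text)

def pos_tag(tokens):
    # Toy tagger with a tiny hand-made lexicon; real taggers use trained models
    # (and later stages would add lemmas, parse trees, named entities, etc.).
    lexicon = {"the": "DET", "cat": "NOUN", "sat": "VERB", "on": "ADP", "mat": "NOUN"}
    return [(tok, lexicon.get(tok.lower(), "X")) for tok in tokens]

def pipeline(text):
    return pos_tag(tokenize(text))

print(pipeline("The cat sat on the mat."))
# [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('on', 'ADP'),
#  ('the', 'DET'), ('mat', 'NOUN'), ('.', 'X')]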
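Finally, to make "annotated data" concrete: Universal Dependencies treebanks are distributed in the CoNLL-U format, with ten tab-separated columns per token. The tiny sentence below is my own example, not taken from a real treebank, and the reader is a minimal sketch, not a full CoNLL-U parser:

# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
sample = (
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tcat\tcat\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tsleeps\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
)

def read_conllu(text):
    # Keep only a few of the ten columns for this sketch.
    tokens = []
    for line in text.strip().splitlines():
        cols = line.split("\t")
        tokens.append({"form": cols[1], "lemma": cols[2], "upos": cols[3],
                       "head": int(cols[6]), "deprel": cols[7]})
    return tokens

for tok in read_conllu(sample):
    print(tok["form"], tok["upos"], "-> head", tok["head"], tok["deprel"])
# The DET -> head 2 det
# cat NOUN -> head 3 nsubj
# sleeps VERB -> head 0 root

Treebanks like these are what parsers are trained on; building and maintaining them is exactly the costly, linguistically informed work mentioned above.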
In this view, the linguistic knowledge needed to model language in executable tools, such as computational grammars, becomes obsolete. Grammar engineers, for example, working with solid formalisms like HPSG and LFG, became dinosaurs, like COBOL programmers (or https://en.wikipedia.org/wiki/Jedi, if you prefer). But that was not the end. Later, developers of NLP applications, encouraged by the success of machine learning (in particular deep learning and other unsupervised methods) on many tasks, started experimenting with end-to-end learning without considering the intermediate tasks. Why not try to answer a natural language question directly from the input, without the cost of constructing and manipulating intermediate representations? Well, not quite directly, but adopting the minimum possible representation that can be universally manipulated. Yes, vectors. Once the input text is transformed into vectors or matrices of numbers, we need only an efficient linear algebra library to manipulate them. Simpler systems can be deployed faster, and tech companies love that.

So this is the trend in the area now: given the massive amount of text we have on the internet, we learn how to transform words and sentences into vectors; we turn our problems into optimization tasks and manipulate the vectors to obtain the parameters that maximize the performance of the system on a reference dataset. The parameters define a function we can then apply to other texts to solve the same problem.

The new methods and the LLMs are incredibly effective for some tasks, and ChatGPT impresses many people. But effectiveness in some practical use cases has nothing to do with other goals related to the study of language. How do languages work? What are their fundamental parts? How do humans understand and produce language? How do we develop a system as competent as humans in the use of language? See https://youtu.be/wPonuHqbNds. To make an analogy: how does studying the human body, its parts, and how they work together contribute to medicine? In the past, medicine was a collection of practices said to work in many cases. Practices were generalized from examples only, and false correlations were unfortunately taken as causation (see https://en.wikipedia.org/wiki/Bloodletting).

I hope that helps! Sorry for the long answer; even so, I made a lot of simplifications to keep the message a reasonable size! ;-)

Best,

--
Alexandre Rademaker
http://arademaker.github.io


> On 12 Aug 2023, at 06:12, Turritopsis Dohrnii Teo En Ming <tdtemc...@gmail.com> wrote:
>
> Good day from Singapore,
>
> Is Apache OpenNLP one of the building blocks of ChatGPT?
>
> Thank you.
>
> Regards,
>
> Mr. Turritopsis Dohrnii Teo En Ming
> Targeted Individual in Singapore
> Blogs:
> https://tdtemcerts.blogspot.com
> https://tdtemcerts.wordpress.com
> GIMP also stands for Government-Induced Medical Problems.