Re: Apache "Text Analysis" top-level project?

Michael Schmitz Mon, 09 Jul 2012 14:17:41 -0700

I like OpenNLP for this reason too.  There are too many things that
can be done wrong when making a framework, especially from my narrow
perspective!


Peace.  Michael


On Mon, Jul 9, 2012 at 11:57 AM, John Stewart <[email protected]> wrote:
> FWIW I agree with this perspective.  So many Java-based tools grow
> into configuration monsters (cough! Hadoop).  I very much appreciate
> the simplicity and self-containedness of OpenNLP, which makes it
> straightforward to integrate into many different architectures, even
> non-Java.  In fact, I'd argue for making this aspect of the project
> one of its defining characteristics.
>
> jds
>
> On Mon, Jul 9, 2012 at 2:18 PM, Jeyendran Balakrishnan
> <[email protected]> wrote:
>> It is exactly the "production"-oriented aspects of UIMA that, IMHO, make it 
>> unattractive for many users.
>> Almost all commercial implementations of NLP use NLP as one of the 
>> components in their overall software stack,
>> i.e., they typically have their own custom framework, platform, and 
>> architecture that do many things other than NLP.
>> What these commercial systems need are open source libraries to do specific 
>> things, leaving their developers free to put them together according to 
>> their own requirements and tradeoffs.
>> Any NLP system that forces the developer to use one particular way (e.g. UIMA)
>> of putting things together will not be attractive,
>> and might steer them away from otherwise great algorithm implementations due 
>> to the significant additional baggage that the algorithms come with.
>>
>> Today, there are so many ways of connecting components together, including 
>> workflows, platforms, configuration management, parallel batch processing 
>> (Hadoop, anyone?), parallel stream processing (Storm), etc. Almost the 
>> entire code-base of UIMA has nothing to do with NLP. There's a big reason 
>> why IBM gave away all this code to Apache, and kept their core algorithms to 
>> themselves - it was clear to them where the value is. At least to me, it is 
>> clear where the value is - algorithms, and not frameworks.
>>
>> Software developers in industry are very capable of easily putting together 
>> their own frameworks. What they need help with are core NLP algorithms that 
>> they don't have the background to do themselves. One example I would suggest 
>> (at least according to my view) is the difference between Lucene and Nutch.
>> Being a library, Lucene has pretty much taken over search engine software 
>> development. Nutch, on the other hand, tries to be a full-fledged platform 
>> for crawling, indexing and search, and has not gathered anywhere near the 
>> same usage levels.
>>
>> My vote is to please keep OpenNLP clean, smart, algorithm-centered, 
>> user-focused.
>> Keep it simple.
>> Math, stat, and algos.
>> And excel at it.
>>
>> Please don’t dumb OpenNLP down with unnecessary bloat that any decent 
>> software team can do easily, and might often prefer to implement in a 
>> different way.
>> Connectors, not merging.
>>
>> My two bits... :-)
>>
>> Cheers,
>> Jeyendran
>>
>>
>> -----Original Message-----
>> From: Jörn Kottmann [mailto:[email protected]]
>> Sent: Monday, July 09, 2012 1:50 AM
>> To: [email protected]
>> Subject: Re: Apache "Text Analysis" top-level project?
>>
>> On 07/09/2012 05:56 AM, Lance Norskog wrote:
>>> Would it make sense to join OpenNLP, UIMA, and Open Relevance into one
>>> top-level "Text Analysis" project? There are already cross-project
>>> connections between UIMA and OpenNLP. ORP seems dormant. It also seems
>>> a more natural place than OpenNLP for a database of tagged text.
>>>
>>>
>>
>> OpenNLP and UIMA align nicely in my opinion. OpenNLP just implements engines 
>> for various NLP tasks without any further support.
>> UIMA, on the other hand, can do a lot of the additional things you need to
>> run OpenNLP in a production system, e.g. scaling the engines to many
>> machines, providing workflow support, resource loading and management, etc.
>> So there is not really an overlap between the two.
>>
>> UIMA has some NLP-related add-ons in its sandbox. Some of them duplicate
>> functionality that OpenNLP also provides, e.g. POS tagging or the
>> dictionary annotator, but the overlap does not seem large.
>>
>> Lucene contains a lot of NLP code for stemming and word segmentation in 
>> different languages. That's probably the biggest NLP-related code base next
>> to OpenNLP at Apache.
>>
>> Jörn
>>
>>
>>
