Re: Apache "Text Analysis" top-level project?

John Stewart Mon, 09 Jul 2012 11:58:20 -0700

FWIW I agree with this perspective.  So many Java-based tools grow
into configuration monsters (cough! Hadoop).  I very much appreciate
the simplicity and self-containedness of OpenNLP, which makes it
straightforward to integrate into many different architectures, even
non-Java.  In fact, I'd argue for making this aspect of the project
one of its defining characteristics.


jds

On Mon, Jul 9, 2012 at 2:18 PM, Jeyendran Balakrishnan
<[email protected]> wrote:
> It is exactly the "production"-oriented aspects of UIMA that, IMHO, make it 
> unattractive for many users.
> Almost all commercial implementations of NLP use NLP as one of the components 
> in their overall software stack,
> i.e., they typically have their own custom framework, platform, and 
> architecture that do many things other than NLP.
> What these commercial systems need are open source libraries to do specific 
> things, leaving their developers free to put them together according to their 
> own requirements and tradeoffs.
> Any NLP system that forces the developer to use one particular way (e.g. UIMA) 
> of putting things together will not be attractive,
> and might steer them away from otherwise great algorithm implementations due 
> to the significant additional baggage that the algorithms come with.
>
> Today, there are so many ways of connecting components together, including 
> workflows, platforms, configuration management, parallel batch processing 
> (Hadoop, anyone?), parallel stream processing (Storm), etc. Almost the entire 
> code-base of UIMA has nothing to do with NLP. There's a big reason why IBM 
> gave away all this code to Apache, and kept their core algorithms to 
> themselves - it was clear to them where the value is. At least to me, it is 
> clear where the value is - algorithms, and not frameworks.
>
> Software developers in industry are very capable of easily putting together 
> their own frameworks. What they need help with are core NLP algorithms that 
> they don't have the background to do themselves. One example I would suggest 
> (at least according to my view) is the difference between Lucene and Nutch. 
> Being a library, Lucene has pretty much taken over search engine software 
> development. Nutch, on the other hand, tries to be a full-fledged platform 
> for crawling, indexing and search, and has not gathered anywhere near the 
> same usage levels.
>
> My vote is to please keep OpenNLP clean, smart, algorithm-centered, 
> user-focused.
> Keep it simple.
> Math, stat, and algos.
> And excel at it.
>
> Please don’t dumb OpenNLP down with unnecessary bloat that any decent 
> software team can do easily, and might often prefer to implement in a 
> different way.
> Connectors, not merging.
>
> My two bits... :-)
>
> Cheers,
> Jeyendran
>
>
> -----Original Message-----
> From: Jörn Kottmann [mailto:[email protected]]
> Sent: Monday, July 09, 2012 1:50 AM
> To: [email protected]
> Subject: Re: Apache "Text Analysis" top-level project?
>
> On 07/09/2012 05:56 AM, Lance Norskog wrote:
>> Would it make sense to join OpenNLP, UIMA, and Open Relevance into one
>> top-level "Text Analysis" project? There are already cross-project
>> connections between UIMA and OpenNLP. ORP seems dormant. It also seems
>> a more natural place than OpenNLP for a database of tagged text.
>>
>>
>
> OpenNLP and UIMA align nicely in my opinion. OpenNLP just implements engines 
> for various NLP tasks without any further support.
> UIMA, on the other hand, can do a lot of the additional things you need to 
> run OpenNLP in a production system e.g. scaling the engines to many machines, 
> providing workflow support, resource loading and management, etc.
> So there is not really an overlap between the two.
>
> UIMA has some NLP-related add-ons in its sandbox. Some of them duplicate 
> functionality that OpenNLP also provides, e.g. POS tagging or the 
> dictionary annotator, but the overlap does not seem to be that large.
>
> Lucene contains a lot of NLP code for stemming and word segmentation in 
> different languages. That's probably the biggest NLP-related code base next to 
> OpenNLP at Apache.
>
> Jörn
>
>
>
