I did not articulate it well. I meant an "umbrella" project, not
merging the code bases. Yes, merging the code bases would be a
disaster. The "flagship" projects draw attention to the less-active
projects in the umbrella.

What prompted this is that I'm used to the activity levels of Solr and
Mahout. These are very "alive" projects which attract new algorithms
as well as improvements to the existing code. These projects have a
momentum threshold: above it, a project attracts new people; below it,
the project languishes. OpenNLP seems to be a little below that
threshold. This suggestion is one way to push it back up.

As an example, I'm playing with using LSA to summarize documents and
create tag clouds. I planned to contribute this to Lucene/Solr, and it
did not occur to me that it might also work in OpenNLP.
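
To make this concrete, here is a rough sketch of the kind of thing I
mean. This is not the code I'd actually contribute - just a toy
illustration, assuming Commons Math for the SVD, a naive whitespace
tokenizer, and a hard-coded three-document corpus: build a
term-document matrix, take its SVD, and size tag-cloud terms by the
magnitude of their weight on the strongest latent concept.

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

import java.util.*;

/** Toy LSA sketch: weight tag-cloud terms by the dominant latent concept. */
public class LsaTagCloudSketch {

    public static void main(String[] args) {
        // Hypothetical mini-corpus standing in for real documents.
        List<String> docs = Arrays.asList(
                "lucene indexes text and solr searches text",
                "opennlp tags text with maxent models",
                "solr uses lucene for search and faceting");

        // Vocabulary (sorted, unique) and raw term-document count matrix.
        SortedSet<String> vocab = new TreeSet<String>();
        for (String doc : docs) {
            vocab.addAll(tokenize(doc));
        }
        List<String> terms = new ArrayList<String>(vocab);
        double[][] counts = new double[terms.size()][docs.size()];
        for (int d = 0; d < docs.size(); d++) {
            for (String tok : tokenize(docs.get(d))) {
                counts[terms.indexOf(tok)][d]++;
            }
        }

        // SVD of the term-document matrix; column 0 of U is the strongest latent concept.
        SingularValueDecomposition svd =
                new SingularValueDecomposition(new Array2DRowRealMatrix(counts));
        RealMatrix u = svd.getU();

        // Score each term by the magnitude of its loading on that concept.
        final Map<String, Double> weights = new HashMap<String, Double>();
        for (int t = 0; t < terms.size(); t++) {
            weights.put(terms.get(t), Math.abs(u.getEntry(t, 0)));
        }

        // The highest-weighted terms would become the biggest words in the cloud.
        List<String> ranked = new ArrayList<String>(terms);
        Collections.sort(ranked, new Comparator<String>() {
            public int compare(String a, String b) {
                return weights.get(b).compareTo(weights.get(a));
            }
        });
        for (String term : ranked.subList(0, Math.min(5, ranked.size()))) {
            System.out.printf("%-10s %.3f%n", term, weights.get(term));
        }
    }

    private static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase().split("\\s+"));
    }
}

A real version would want TF-IDF weighting, a proper analyzer, and more
than the first singular vector, but the shape of the computation is the
same.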

I'm only now coming to this project. What is the most recent new
algorithm suite added? When? Are the contributors committers?

On Tue, Jul 10, 2012 at 12:58 AM, Benson Margulies
<[email protected]> wrote:
> The board is not enthusiastic about 'umbrella' projects. Cooperate,
> co-market -- great. Merge, not so good.
>
> On Tuesday, July 10, 2012, Jeyendran Balakrishnan wrote:
>
>> Hi Julien,
>>
>> My hat's off to you and the rest of the Nutch developer team for improving
>> Nutch over the past several years, to the level where anybody with
>> heavy-duty crawling needs can just use it off the shelf.
>> I agree with you that Lucene vs Nutch is not as clear an analogy for the
>> library vs framework debate.
>>
>> In my usage, I have tended to use Nutch more as a library (in one
>> application only the crawling part, and in another just Fetcher2 [a great
>> component], hacked up a bit to remove its dependency on the rest of Nutch).
>> The point I was trying to make, not very clearly, was that Nutch aggregates
>> other components (Hadoop for distributed processing, Lucene/Solr for
>> indexing and search, Tika for parsing, etc.) along with its own custom
>> crawler component code and custom data flow design, into a platform for
>> end-to-end crawling, indexing and search, as opposed to, for example, being
>> a pure-play crawler library on top of Hadoop.
>>
>> I look forward with interest to following how this debate evolves regarding
>> OpenNLP and UIMA.
>>
>> Cheers,
>> Jeyendran
>>
>>
>> -----Original Message-----
>> From: Julien Nioche [mailto:[email protected]]
>> Sent: Monday, July 09, 2012 2:01 PM
>> To: [email protected]; [email protected]
>> Subject: Re: Apache "Text Analysis" top-level project?
>>
>> Jeyendran,
>>
>> > One example I would suggest (at least according to my view) is the difference
>> > between Lucene and Nutch. Being a library, Lucene has pretty much
>> > taken over search engine software development. Nutch, on the other
>> > hand, tries to be a full-fledged platform for crawling, indexing and
>> > search, and has not gathered anywhere near the same usage levels.
>> >
>>
>> That Nutch does not have the same audience as Lucene is completely
>> understandable given that they are quite different in scope and nature. Not
>> everybody needs to crawl on a large scale, but when they do they often use
>> Nutch. And by the way, Nutch does not do indexing and search itself - it
>> delegates this to other tools like Solr, so it is mostly a crawler.
>>
>> The comparison between UIMA and OpenNLP is a better illustration of the
>> difference between a framework and a library, IMHO.
>>
>> Julien
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
>>



-- 
Lance Norskog
[email protected]
