Nutch dev. plans

2009-07-17 Thread Andrzej Bialecki

Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan 
will be importing his HBase work as 'nutchbase'. Tika work is the least 
disruptive, so it could occur even on trunk. The OSGi plugin work (which
I'd like to tackle) means significant refactoring, so I'd rather put this
on a branch too.


Dogacan, you mentioned that you would like to work on Katta integration. 
Could you shed some light on how this fits with the abstract indexing &
searching layer that we now have, and how distributed Solr fits into 
this picture?


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: Nutch dev. plans

2009-07-17 Thread Doğacan Güney
Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki <a...@getopt.org> wrote:
 Hi all,

 I think we should be creating a sandbox area, where we can collaborate
 on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will
 be importing his HBase work as 'nutchbase'. Tika work is the least
 disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
 like to tackle) means significant refactoring so I'd rather put this on a
 branch too.


Thanks for starting the discussion, Andrzej.

Can you detail your OSGi plugin framework design? Maybe I missed the
discussion, but updating the plugin system has been something that I have
wanted to do for a long time :) so I am very much interested in your design.

 Dogacan, you mentioned that you would like to work on Katta integration.
 Could you shed some light on how this fits with the abstract indexing &
 searching layer that we now have, and how distributed Solr fits into this
 picture?


I haven't yet given much thought to Katta integration. But basically, I am
thinking of indexing newly-crawled documents as Lucene shards and uploading
them to Katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied Katta too much nor
given much thought to integration, so I may be missing obvious stuff.
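
To sketch the idea (just a sketch - Lucene 2.x API; the shard path and
field names below are made up): each reduce task would write an ordinary
Lucene index that Katta can then deploy as one shard of the full index.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class ShardWriterSketch {
  public static void main(String[] args) throws Exception {
    // One shard per reduce task; Katta deploys this directory later.
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/tmp/shard-00000"),   // made-up shard path
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field("url", "http://example.com/", Field.Store.YES,
        Field.Index.NOT_ANALYZED));
    doc.add(new Field("content", "parsed page text ...", Field.Store.NO,
        Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();   // shard is now ready to be uploaded to Katta
  }
}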

About distributed Solr: I would very much like to do this and again, I
think this should be possible to do within Nutch. However, distributed
Solr is ultimately uninteresting to me because (AFAIK) it doesn't have
the reliability and high-availability that Hadoop/HBase have, i.e. if a
machine dies you lose that part of the index.

Are there any projects going on that are live indexing systems like Solr,
yet are backed by Hadoop HDFS like Katta?

 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com






-- 
Doğacan Güney


Re: Nutch dev. plans

2009-07-17 Thread Andrzej Bialecki

Doğacan Güney wrote:

Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki <a...@getopt.org> wrote:

Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will
be importing his HBase work as 'nutchbase'. Tika work is the least
disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
like to tackle) means significant refactoring so I'd rather put this on a
branch too.



Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion but
updating the plugin system has been something that I wanted to do for
a long time :)
so I am very much interested in your design.


There's no specific design yet, except that I can't stand the existing
plugin framework anymore ... ;) I started reading up on OSGi and it seems
that it supports the functionality that we need, and much more - it
certainly looks like a better alternative than maintaining our plugin
system beyond 1.x ...


Oh, an additional comment about the scoring API: I don't think the 
claimed benefits of OPIC outweigh the widespread complications that it 
caused in the API. Besides, getting the static scoring right is very 
very tricky, so from the engineer's point of view IMHO it's better to do 
the computation offline, where you have more control over the process 
and can easily re-run the computation, rather than rely on an online 
unstable algorithm that modifies scores in place ...
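
To make the "offline" part concrete: a toy, in-memory version of the kind
of link-graph iteration I mean (the real thing would of course be a
MapReduce job over the linkdb, and the graph below is made up):

import java.util.*;

public class OfflineScoreSketch {
  public static void main(String[] args) {
    // Toy link graph: page -> outlinks (in practice this lives in the linkdb).
    Map<String, List<String>> links = new HashMap<String, List<String>>();
    links.put("a", Arrays.asList("b", "c"));
    links.put("b", Arrays.asList("c"));
    links.put("c", Arrays.asList("a"));

    // Start from a uniform score distribution.
    Map<String, Double> score = new HashMap<String, Double>();
    for (String page : links.keySet()) score.put(page, 1.0 / links.size());

    double d = 0.85;                              // damping factor
    for (int iter = 0; iter < 20; iter++) {       // fixed number of sweeps
      Map<String, Double> next = new HashMap<String, Double>();
      for (String page : links.keySet())
        next.put(page, (1 - d) / links.size());   // teleport mass
      for (Map.Entry<String, List<String>> e : links.entrySet()) {
        double share = score.get(e.getKey()) / e.getValue().size();
        for (String out : e.getValue())
          next.put(out, next.get(out) + d * share);
      }
      score = next;   // nothing modified in place - easy to re-run or discard
    }
    System.out.println(score);
  }
}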






Dogacan, you mentioned that you would like to work on Katta integration.
Could you shed some light on how this fits with the abstract indexing &
searching layer that we now have, and how distributed Solr fits into this
picture?



I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.


Me too..


About distributed solr: I very much like to do this and again, I
think, this should be possible to
do within nutch. However, distributed solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that Hadoop/HBase have, i.e. if a machine dies you
lose that part of the index.


Grant Ingersoll is doing some initial work on integrating distributed Solr
and ZooKeeper; once this is in a usable shape, I think it will be more or
less equivalent to Katta. I have a patch in my queue that adds direct
Hadoop-to-Solr indexing, using a Hadoop OutputFormat. So there will be
many options to push index updates to distributed indexes. We just need to
offer the right API to implement the integration, and the current API is
IMHO quite close.
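
The gist of it is roughly the following - a sketch, not the actual patch;
the class name and the "solr.server.url" property are made up, but the
SolrJ calls and the old mapred OutputFormat contract are real:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Hypothetical OutputFormat that pushes each record straight into Solr.
public class SolrOutputFormat implements OutputFormat<Text, Text> {

  public RecordWriter<Text, Text> getRecordWriter(FileSystem fs, JobConf job,
      String name, Progressable progress) throws IOException {
    final SolrServer solr = new CommonsHttpSolrServer(job.get("solr.server.url"));
    return new RecordWriter<Text, Text>() {
      public void write(Text url, Text content) throws IOException {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url.toString());
        doc.addField("content", content.toString());
        try {
          solr.add(doc);        // a real implementation would batch these
        } catch (Exception e) {
          throw new IOException(e.toString());
        }
      }
      public void close(Reporter reporter) throws IOException {
        try {
          solr.commit();        // make the updates visible to searchers
        } catch (Exception e) {
          throw new IOException(e.toString());
        }
      }
    };
  }

  public void checkOutputSpecs(FileSystem fs, JobConf job) throws IOException {
    // nothing to check - Solr itself is the output
  }
}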




Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?


There is the Bailey.sf.net project that fits this description, but it's
dormant - either it was too early, or there were just too many design
questions (or simply the committers moved on to other things).



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-17 Thread Doğacan Güney
On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki <a...@getopt.org> wrote:
 Doğacan Güney wrote:

 Hey list,

 On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki <a...@getopt.org> wrote:

 Hi all,

 I think we should be creating a sandbox area, where we can collaborate
 on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
 will
 be importing his HBase work as 'nutchbase'. Tika work is the least
 disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
 like to tackle) means significant refactoring so I'd rather put this on a
 branch too.


 Thanks for starting the discussion, Andrzej.

 Can you detail your OSGI plugin framework design? Maybe I missed the
 discussion but
 updating the plugin system has been something that I wanted to do for
 a long time :)
 so I am very much interested in your design.

 There's no specific design yet except I can't stand the existing plugin
 framework anymore ... ;) I started reading on OSGI and it seems that it
 supports the functionality that we need, and much more - it certainly looks
 like a better alternative than maintaining our plugin system beyond 1.x ...


Couldn't agree more with the "can't stand the plugin framework" part :D

Any good links on OSGI stuff?

 Oh, an additional comment about the scoring API: I don't think the claimed
 benefits of OPIC outweigh the widespread complications that it caused in the
 API. Besides, getting the static scoring right is very very tricky, so from
 the engineer's point of view IMHO it's better to do the computation offline,
 where you have more control over the process and can easily re-run the
 computation, rather than rely on an online unstable algorithm that modifies
 scores in place ...


Yeah, I am convinced :). I am not done yet, but I think OPIC-like scoring
will feel very natural in an HBase-backed Nutch. Give me a couple more days
to polish the scoring API, then we can change it if you are not happy with it.
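
For example (just a sketch against the new HBase client API; the "webpage"
table, family and qualifier names are made up, and the real thing would run
inside a MapReduce job), updating a page's score cell is a simple
read-modify-write:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class ScoreUpdateSketch {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "webpage"); // made-up table
    byte[] row = Bytes.toBytes("com.example/");   // reversed-host row key
    byte[] fam = Bytes.toBytes("score");          // made-up column family
    byte[] qual = Bytes.toBytes("opic");          // made-up qualifier

    Result r = table.get(new Get(row));
    // Assumes the cell exists; a real version would null-check getValue().
    float current = Bytes.toFloat(r.getValue(fam, qual));
    float cash = 0.1f;                            // contribution from an inlink

    Put put = new Put(row);
    put.add(fam, qual, Bytes.toBytes(current + cash)); // write the updated score
    table.put(put);
  }
}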



 Dogacan, you mentioned that you would like to work on Katta integration.
 Could you shed some light on how this fits with the abstract indexing &
 searching layer that we now have, and how distributed Solr fits into this
 picture?


 I haven't yet given much thought to Katta integration. But basically,
 I am thinking of
 indexing newly-crawled documents as lucene shards and uploading them
 to katta for searching. This should be very possible with the new
 indexing system. But so far, I have neither studied katta too much nor
 given much thought to integration. So I may be missing obvious stuff.

 Me too..

 About distributed solr: I very much like to do this and again, I
 think, this should be possible to
 do within nutch. However, distributed solr is ultimately uninteresting
 to me because (AFAIK) it doesn't have the reliability and
 high-availability that Hadoop/HBase have, i.e. if a machine dies you
 lose that part of the index.

 Grant Ingersoll is doing some initial work on integrating distributed Solr
 and Zookeeper, once this is in a usable shape then I think perhaps it's more
 or less equivalent to Katta. I have a patch in my queue that adds direct
 Hadoop-Solr indexing, using Hadoop OutputFormat. So there will be many
 options to push index updates to distributed indexes. We just need to offer
 the right API to implement the integration, and the current API is IMHO
 quite close.


 Are there any projects going on that are live indexing systems like
 solr, yet are backed up by hadoop HDFS like katta?

 There is the Bailey.sf.net project that fits this description, but it's
 dormant - either it was too early, or there were just too many design
 questions (or simply the committers moved to other things).


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com





-- 
Doğacan Güney


Re: Nutch dev. plans

2009-07-17 Thread Dennis Kubes



Doğacan Güney wrote:

On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki <a...@getopt.org> wrote:

Doğacan Güney wrote:

Hey list,

On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki <a...@getopt.org> wrote:

Hi all,

I think we should be creating a sandbox area, where we can collaborate
on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
will
be importing his HBase work as 'nutchbase'. Tika work is the least
disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
like to tackle) means significant refactoring so I'd rather put this on a
branch too.


Thanks for starting the discussion, Andrzej.

Can you detail your OSGI plugin framework design? Maybe I missed the
discussion but
updating the plugin system has been something that I wanted to do for
a long time :)
so I am very much interested in your design.

There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly looks
like a better alternative than maintaining our plugin system beyond 1.x ...


I think I remember a conversation a while back about this :) Not OSGi
specifically, but changing the plugin framework. I am all for changing it
to something like OSGi, though.


Dennis





Couldn't agree more with the can't stand plugin framework :D

Any good links on OSGI stuff?


Oh, an additional comment about the scoring API: I don't think the claimed
benefits of OPIC outweigh the widespread complications that it caused in the
API. Besides, getting the static scoring right is very very tricky, so from
the engineer's point of view IMHO it's better to do the computation offline,
where you have more control over the process and can easily re-run the
computation, rather than rely on an online unstable algorithm that modifies
scores in place ...



Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will
feel very natural in a hbase-backed nutch. Give me a couple more days to polish
the scoring API then we can change it if you are not happy with it.


Dogacan, you mentioned that you would like to work on Katta integration.
Could you shed some light on how this fits with the abstract indexing &
searching layer that we now have, and how distributed Solr fits into this
picture?


I haven't yet given much thought to Katta integration. But basically,
I am thinking of
indexing newly-crawled documents as lucene shards and uploading them
to katta for searching. This should be very possible with the new
indexing system. But so far, I have neither studied katta too much nor
given much thought to integration. So I may be missing obvious stuff.

Me too..


About distributed solr: I very much like to do this and again, I
think, this should be possible to
do within nutch. However, distributed solr is ultimately uninteresting
to me because (AFAIK) it doesn't have the reliability and
high-availability that Hadoop/HBase have, i.e. if a machine dies you
lose that part of the index.

Grant Ingersoll is doing some initial work on integrating distributed Solr
and Zookeeper, once this is in a usable shape then I think perhaps it's more
or less equivalent to Katta. I have a patch in my queue that adds direct
Hadoop-Solr indexing, using Hadoop OutputFormat. So there will be many
options to push index updates to distributed indexes. We just need to offer
the right API to implement the integration, and the current API is IMHO
quite close.


Are there any projects going on that are live indexing systems like
solr, yet are backed up by hadoop HDFS like katta?

There is the Bailey.sf.net project that fits this description, but it's
dormant - either it was too early, or there were just too many design
questions (or simply the committers moved to other things).


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com








Re: Nutch dev. plans

2009-07-17 Thread Andrzej Bialecki

Doğacan Güney wrote:


There's no specific design yet except I can't stand the existing plugin
framework anymore ... ;) I started reading on OSGI and it seems that it
supports the functionality that we need, and much more - it certainly looks
like a better alternative than maintaining our plugin system beyond 1.x ...



Couldn't agree more with the can't stand plugin framework :D

Any good links on OSGI stuff?


I found this:

http://neilbartlett.name/blog/osgi-articles


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Nutch dev. plans

2009-07-17 Thread Kirby Bohling
On Fri, Jul 17, 2009 at 5:21 PM, Andrzej Bialecki <a...@getopt.org> wrote:
 Doğacan Güney wrote:

 There's no specific design yet except I can't stand the existing plugin
 framework anymore ... ;) I started reading on OSGI and it seems that it
 supports the functionality that we need, and much more - it certainly
 looks
 like a better alternative than maintaining our plugin system beyond 1.x
 ...


 Couldn't agree more with the can't stand plugin framework :D

 Any good links on OSGI stuff?

 I found this:

 http://neilbartlett.name/blog/osgi-articles


Plugins are called "bundles" in OSGi parlance, but I'll use "plugin" as
that's the term used by Nutch.

I have done quite a bit of OSGi work (I used to develop RCP applications
for a living). OSGi is great, as long as you don't plan on using
reflection to load classes directly, and don't plan on using a library
that does.

Pretty much every usage like this:

Class<?> clazz = Class.forName(stringFromConfig);
// Code to create an object using this class, e.g. clazz.newInstance() ...

will fail unless the code is very classloader-aware. So if you're going
to switch over to using OSGi (which I think would be wonderful), you'll
want to ensure that you can deal with all of the third-party libraries. I
haven't played much with any of the Declarative Services stuff (I think
that was slated for OSGi, but it might have just been an Eclipse
extension).

We managed to get most of the code to play nice, and had a few
horrific hacks for allowing the use of Spring if necessary.

OSGi uses classloader segmentation to allow multiple conflicting versions
of the same code inside the same project. So, given a pattern like:

Plugin A: nutch.api (which contains, say, the interface Parser { })
Plugin B: parser.word (which has class WordParser implements Parser)

Plugin B has to depend on Plugin A so it can see the Parser interface. In
this case, Plugin A can't have code that uses Class.forName("WordParser");

OSGi changes the default classloader delegation: you can only see classes
in plugins you depend upon, and cycles in the dependencies are not
allowed.

If you want to do that, you end up having to do something like:

ClassLoader loader = ParserRegistry.lookupPlugin("WordParser");  // hypothetical registry
Class<?> clazz = Class.forName("WordParser", true, loader);

OSGi has an SPI-like way to have a plugin advertise the fact that it
contributes an implementation of the Parser interface. Eclipse builds on
top of this; it's what Eclipse 3.x implemented the Extension/ExtensionPoint
system on top of. I believe these are called services in raw OSGi.
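
In raw OSGi, publishing and consuming such a service looks roughly like
this (a sketch, reusing the hypothetical Parser/WordParser classes from
above):

import org.osgi.framework.BundleActivator;
import org.osgi.framework.BundleContext;
import org.osgi.framework.ServiceReference;

// In plugin B (parser.word): publish WordParser under the Parser interface.
public class WordParserActivator implements BundleActivator {
  public void start(BundleContext context) {
    context.registerService(Parser.class.getName(), new WordParser(), null);
  }
  public void stop(BundleContext context) {
    // services registered in start() are unregistered automatically
  }
}

Then a consumer asks the framework for the service by interface name, so
it never has to load the WordParser class itself:

ServiceReference ref = context.getServiceReference(Parser.class.getName());
Parser parser = (Parser) context.getService(ref);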

It's not a huge deal to write that yourself for APIs you implement. The
problem is that it can be difficult to integrate really useful third-party
libraries that don't account for this change in classloader behaviour. At
times it can be very problematic to use a specific XML parser that has the
features you want (or that some library you want to use requires), because
they do this sort of thing all the time.

I'm guessing that Tika isn't ready for this.  Given that it's an
Apache and/or Lucene project, it can probably be addressed.  My guess
is that a number of the libraries they depend upon won't be.

You can use fragments to get away from that (a fragment requires a host
bundle, and the fragment's classes are loaded using the same classloader
as the host), but doing that defeats a lot of the reason for using OSGi
(at least in terms of allowing you to use multiple conflicting libraries
in the same application).

Thanks,
Kirby


 --
 Best regards,
 Andrzej Bialecki     
  ___. ___ ___ ___ _ _   __
 [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
 ___|||__||  \|  ||  |  Embedded Unix, System Integration
 http://www.sigram.com  Contact: info at sigram dot com