Re: Nutch 2.0 : Design issue

Julien Nioche Fri, 02 Jul 2010 04:48:36 -0700

On 2 July 2010 12:22, Andrzej Bialecki <a...@getopt.org> wrote:

> On 2010-07-02 12:42, Julien Nioche wrote:
>
>> Hi guys,
>>
>> You've probably seen that there has been some progress on 2.0 lately.
>> We've updated the nutchbase svn branch with the latest developments done
>> on Dogacan's Github, i.e. using GORA as a storage layer.
>> One of the main issues [1] I raised after using nutchbase was that:
>>
>>> NutchBase currently marks entries in the table to be fetched | parsed |
>>> etc... and needs to go through the whole table at every step. As the
>>> table gets bigger it takes more and more time to read through the
>>> entries and check their marks, which is not a viable option. NutchBase
>>> is currently slower than Nutch 1.1 (might be issues with Gora but
>>> still...).
>>> I suggest instead that we create fetchlists in separate tables, fetch &
>>> parse in these tables, then merge the entries back to the main table.
>>> The segment tables could then be deleted if necessary. We would then
>>> have a linear processing time for fetching + parsing + updating
>>> depending on the size of the segments and NOT on the size of the main
>>> table. This would be an improvement compared to 1.1, where the
>>> processing time in the updates is relative to the size of the crawldb.
>>>
>>
>> Doing this requires being able to separate the name of a schema from the
>> name of a table in Gora [2], which should not be a big problem.
>>
>
> I think this is a good idea - this model is conceptually close to the
> current model, and I bet it will be easier to debug problems when changes
> are limited to a separate table... we could create 1 table per segment.
>
> (Oh, and let's stop calling them segments, please - maybe call them a batch
> or "crawl cycle" or something. The name "segments" caused a lot of confusion
> already, and it doesn't convey any useful meaning..)
>


Makes sense


>
> As for the time savings .. this remains to be seen. At the end of the
> fetching/parsing job we need to merge this data back into the main table,
> which is a massive update that also takes time.


True


>
>
>
>> On second thought, I was wondering whether it would also make sense to
>> actually keep the segments as they currently are, i.e. stored as
>> NutchWritables in HDFS. The advantages of doing this would be that we'd
>> keep exactly the same code for the fetching + parsing, would only need to
>> modify the generation and update steps, and would be able to easily port
>> pre-2.0 segments to the webtable. The drawbacks are that there would be a
>> dual storage GORA / HDFS and we'd need to keep the legacy Nutch Writable
>> objects.
>>
>
> The fetcher code is already ported in nutchbase not to use the plain files.
> I doubt there would be many users who want to jump to Nutch 2.0 and still
> want to hold on to their old segments... so I think this is not useful. Dual
> storage .. *shudder* that's asking for trouble.
>

Right, and I'm not too keen on keeping the legacy objects. Another advantage
of having the GORA-based tables for the segments (or fetch_cycles ;-) ) is
that it makes it easier to restart an interrupted fetch or parse.

Forget about the HDFS based storage, let's just do it with GORA
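
To make the per-batch table idea a bit more concrete, here is a minimal
sketch in Python. It is only an in-memory model: plain dicts stand in for
the GORA-backed tables, and every name in it (generate, fetch, parse,
update_main, the "status" values) is hypothetical rather than taken from
nutchbase or Gora. It shows the fetch and parse steps touching only the
batch table, and a restarted fetch skipping rows that were already done.

```python
# Illustrative model only: dicts stand in for Gora-backed tables.
# All function and field names are hypothetical, not nutchbase code.

def generate(webtable, batch_size):
    """Copy up to batch_size unfetched URLs from the main table into a new batch table."""
    batch = {}
    for url, row in webtable.items():
        if len(batch) >= batch_size:
            break
        if row["status"] == "unfetched":
            batch[url] = {"status": "unfetched", "content": None, "outlinks": []}
    return batch


def fetch(batch, downloader, limit=None):
    """Fetch rows that are still unfetched; safe to call again after an interruption."""
    done = 0
    for url, row in batch.items():
        if row["status"] != "unfetched":
            continue  # already handled by an earlier run
        if limit is not None and done >= limit:
            break  # stands in for an interrupted job
        row["content"] = downloader(url)
        row["status"] = "fetched"
        done += 1
    return done


def parse(batch, extract_links):
    """Parse fetched rows in the batch table only."""
    for row in batch.values():
        if row["status"] == "fetched":
            row["outlinks"] = extract_links(row["content"])
            row["status"] = "parsed"


def update_main(webtable, batch):
    """Merge the batch back: mark its URLs as fetched and add newly found links."""
    for url, row in batch.items():
        if row["status"] != "parsed":
            continue
        webtable[url]["status"] = "fetched"
        for link in row["outlinks"]:
            webtable.setdefault(link, {"status": "unfetched"})


if __name__ == "__main__":
    webtable = {u: {"status": "unfetched"} for u in ("http://a/", "http://b/", "http://c/")}
    links = {"http://a/": ["http://d/"], "http://b/": []}
    batch = generate(webtable, batch_size=2)
    fetch(batch, downloader=lambda url: url, limit=1)  # interrupted after one URL
    fetch(batch, downloader=lambda url: url)           # restart picks up the rest
    parse(batch, extract_links=lambda content: links[content])
    update_main(webtable, batch)
    print(sorted(webtable))
```

In this model the fetch and parse loops only ever iterate over the batch,
while update_main is the step that writes back into the main table, which
is where the merge cost Andrzej mentions would show up.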



>
>> Note that it would not change anything about the content of the main
>> webtable or the operations done on it. Maybe it would make sense to do
>> that anyway, at least as a transition while we make the webtable and GORA
>> operations stable, and then see if there is an advantage in storing the
>> segments as GORA tables as well.
>>
>> I am pretty confident that we need to address the point raised in [1]
>> anyway. What do you guys think?
>>
>> [1] http://github.com/dogacan/nutchbase/issues#issue/8
>> [2] http://github.com/enis/gora/issues#issue/30
>>
>
> +1 to both points, -1 to the dual storage.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com
