On 2 July 2010 12:22, Andrzej Bialecki <a...@getopt.org> wrote:

> On 2010-07-02 12:42, Julien Nioche wrote:
>
>> Hi guys,
>>
>> You've probably seen that there has been some progress on 2.0 lately.
>> We've updated the nutchbase svn branch with the latest developments done
>> on Dogacan's Github, i.e. using GORA as a storage layer.
>> One of the main issues [1] I raised after using nutchbase was that:
>>
>>> NutchBase currently marks entries in the table to be fetched | parsed |
>>> etc... and needs to go through the whole table at every step. As the
>>> table gets bigger it takes more and more time to read through the
>>> entries and check their marks, which is not a viable option. NutchBase
>>> is currently slower than Nutch 1.1 (might be issues with Gora but
>>> still...)
>>> I suggest instead that we create fetchlists in separate tables, fetch &
>>> parse in these tables, then merge the entries back to the main table.
>>> The segment tables could then be deleted if necessary. We would then
>>> have a linear processing time for fetching + parsing + updating
>>> depending on the size of the segments and NOT on the size of the main
>>> table. This would be an improvement compared to 1.1, where the
>>> processing time in the updates is relative to the size of the crawldb.
>>
>> Doing this requires being able to separate the name of a schema from the
>> name of a table in Gora [2], which should not be a big problem.
>
> I think this is a good idea - this model is conceptually close to the
> current model, and I bet it will be easier to debug problems when changes
> are limited to a separate table... we could create 1 table per segment.
>
> (Oh, and let's stop calling them segments, please - maybe call them a
> batch or "crawl cycle" or something. The name "segments" caused a lot of
> confusion already, and it doesn't convey any useful meaning..)
Makes sense

> As for the time savings .. this remains to be seen. At the end of the
> fetching/parsing job we need to merge this data back into the main table,
> which is a massive update that also takes time.

True

>> On second thought, I was wondering whether it would also make sense to
>> actually keep the segments as they currently are, i.e. stored as
>> NutchWritables in HDFS. The advantages of doing this would be that we'd
>> keep exactly the same code for the fetching + parsing + would only need
>> to modify the generate and update steps + would be able to easily port
>> pre-2.0 segments to the webtable. The drawbacks being that there would be
>> a dual storage GORA / HDFS and we'd need to keep the legacy Nutch
>> Writable objects.
>
> The fetcher code is already ported in nutchbase not to use the plain files.
> I doubt there would be many users who want to jump to Nutch 2.0 and still
> want to hold on to their old segments... so I think this is not useful.
> Dual storage .. *shudder* that's asking for trouble.

Right, + am not too keen on keeping the legacy objects. Another advantage of
having the GORA-based tables for the segments (or fetch_cycles ;-) ) is that
it makes it easier to restart an interrupted fetch or parse. Forget about the
HDFS-based storage, let's just do it with GORA.

>> Note that it would not change anything in the content of the main
>> webtable nor in the operations done on it. Maybe it would make sense to
>> do that anyway, at least as a transition while we make the webtable and
>> GORA operations stable, and then see if there is an advantage in storing
>> the segments as GORA tables as well.
>>
>> I am pretty confident that we need to address the point raised in [1]
>> anyway. What do you guys think?
>>
>> [1] http://github.com/dogacan/nutchbase/issues#issue/8
>> [2] http://github.com/enis/gora/issues#issue/30
>
> +1 to both points, -1 to the dual storage.
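
For illustration only, here is a minimal sketch of the per-batch table idea
discussed above. It is not Nutch or Gora code: plain Python dicts stand in
for the tables, and every name in it (generate_batch, fetch_and_parse,
merge_back, the "status" field) is hypothetical. It only aims to show that
fetching, parsing and merging touch the rows of the batch rather than the
whole main table, and that a batch can be resumed after an interruption.

```python
# Illustrative sketch only: dicts stand in for Gora-backed tables.
# All function and field names are hypothetical, not Nutch/Gora APIs.

def generate_batch(webtable, batch_size):
    """Copy up to batch_size unfetched rows into a separate batch table."""
    batch = {}
    for url, row in webtable.items():
        if row["status"] == "unfetched":
            batch[url] = dict(row)  # copy, so the main table is untouched
            if len(batch) == batch_size:
                break
    return batch


def fetch_and_parse(batch, fetch):
    """Process only the rows of the batch.

    Rows already marked as fetched are skipped, so calling this again
    after an interruption resumes where the previous run stopped.
    Returns the number of rows processed in this call.
    """
    processed = 0
    for url, row in batch.items():
        if row["status"] == "fetched":
            continue
        row["content"] = fetch(url)
        row["status"] = "fetched"
        processed += 1
    return processed


def merge_back(webtable, batch):
    """Write the batch rows back; the work is proportional to len(batch)."""
    for url, row in batch.items():
        webtable[url] = dict(row)
    return len(batch)
```

In this sketch the generate step still has to select rows from the main
table; only the fetch, parse and merge steps are bounded by the batch size.
How expensive the merge is in a real store remains an open question, as
noted above.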
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
DigitalPebble Ltd
Open Source Solutions for Text Engineering
http://www.digitalpebble.com