Hi Renato,

Great to hear from you.

On 16 September 2013 18:42, Renato Marroquín Mogrovejo <[email protected]> wrote:

> Thanks for sharing Julien! These are indeed interesting results.
> Just a quick question: did you use a single server to run this? Or did you
> set up a minimum number of servers for it?

As explained in the blog, this is in pseudo-distributed mode, i.e. a single
server.

> This is because HBase or Cassandra will improve their latency if we scale
> them out.

See the conclusion of my post. I pointed at a number of possible
explanations, mostly to do with GORA. Scaling out would also make 1.x
faster :-) The question is whether there is a size of the crawldb / number of
machines where the balance would change. Can you explain why processing a
smaller db on a single node with Nutch 2 would take proportionally longer
than a larger db on a larger cluster?

Thanks

Julien

> Renato M.
>
> 2013/9/16 Markus Jelsma <[email protected]>
>
> > Thanks! That was interesting.
> >
> > -----Original message-----
> > From: Julien Nioche <[email protected]>
> > Sent: Monday 16th September 2013 18:45
> > To: [email protected]; [email protected]
> > Cc: Otis Gospodnetic <[email protected]>
> > Subject: Re: 2.x vs. 1.x speed
> >
> > Guys,
> >
> > Following the discussion we had some time ago about comparing 1.x with
> > 2.x, we did some tests and put the results on
> >
> > http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
> >
> > Feel free to comment.
> >
> > Best,
> >
> > Julien
> >
> > On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:
> >
> > > > I am sure that Renato (if he is watching) can maybe plug in as well.
> > > > We find in Gora that, in every sense of the word, with native Hadoop
> > > > stores such as Avro, HBase and Accumulo, when we execute a query with
> > > > GoraInputFormat via getPartitions we retrieve GoraInputSplits
> > > > natively, which means splits are obtained for MapReduce jobs... such
> > > > as many of the jobs we run in Nutch as well. On the other hand,
> > > > stores such as Cassandra and Web service stores such as DynamoDB do
> > > > not (currently) support Hadoop out of the box (the former we are
> > > > working on and hope to have implemented in Gora soon), therefore it
> > > > is not as simple to get partitions in the same way we would in a
> > > > Hadoop native store. We therefore obtain one partition to be used as
> > > > an InputSplit for the MR job. This is certainly an area for concern
> > > > and right now a bottleneck for some operations. We continue to work
> > > > on this.
> > > >
> > > > On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:
> > > >
> > > > > Hi Otis
> > > > >
> > > > > Definitely *not* the fetching speed. Actually everything but *not*
> > > > > the fetching speed. The fetcher is pretty much the same as 1.x and
> > > > > anyway the performance with fetching is pretty much always limited
> > > > > by the politeness settings, not the implementation.
> > > > >
> > > > > Re backend: some backend implementations are more mature than
> > > > > others. The one for HBase is probably the one most widely used, the
> > > > > Cassandra one has been greatly improved, in particular
> > > > > performance-wise, the SQL one is broken etc... We need to measure
> > > > > this as this is just a gut feeling at this stage.
> > > > >
> > > > > Now for what is slower and why, again this has to be measured but I
> > > > > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > > > > entries is not done by the backends (some might provide a way of
> > > > > doing it) but on the client side, when we create the input for
> > > > > mapred. In other words we pull things from the backend just to
> > > > > discard them. Since 2.x does not have segments like 1.x (which the
> > > > > fetch + parse mapreduce jobs take as single input) we scan the whole
> > > > > table even if we want to fetch or parse a handful of entries.
> > > > >
> > > > > On the other hand, 2.x specifies what columns to retrieve for a
> > > > > given job, whereas 1.x will for instance deserialize the crawldatum
> > > > > entirely. The metadata objects are costly to read/write so 2.x
> > > > > might have the upper hand from that point of view since it pulls
> > > > > and deserializes only what it needs.
> > > > >
> > > > > Finally the most costly steps in a large crawl in 1.x are the
> > > > > generation and update as we have to read/write the crawldb
> > > > > entirely. The way the updates are done in 2.x is different and
> > > > > should be a lot faster.
> > > > >
> > > > > Please could anyone correct me if I am wrong. Some of this is based
> > > > > on my understanding of 2.x, which dates back quite a while, and some
> > > > > of the stuff might have changed in the meantime. The performance
> > > > > would probably vary a lot based on the fine tuning of each backend
> > > > > implementation but having some basic comparison would confirm some
> > > > > of the assertions above.
> > > > >
> > > > > Julien
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/GORA-119
> > > > >
> > > > > > Julien, could you please elaborate a bit about your comment about
> > > > > > speed depending on the backend used?
> > > > > >
> > > > > > Yes, you were the person I was referring to :)
> > > > > >
> > > > > > Oh, and I *believe* you said it was the fetching speed that was
> > > > > > different between 1.x and 2.x. Is that right? Or is some other
> > > > > > phase slower in 2.x?
> > > > > >
> > > > > > Thanks,
> > > > > > Otis
> > > > > > ----
> > > > > > Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> > > > > > http://sematext.com/spm
> > > > > >
> > > > > > > ________________________________
> > > > > > > From: Julien Nioche <[email protected]>
> > > > > > > To: "[email protected]" <[email protected]>
> > > > > > > Sent: Tuesday, August 6, 2013 10:54 AM
> > > > > > > Subject: Re: 2.x vs. 1.x speed
> > > > > > >
> > > > > > > Hi Otis,
> > > > > > >
> > > > > > > That certainly depends on the backend used but on the whole it
> > > > > > > wouldn't be surprising. Would be good to have some data to
> > > > > > > substantiate it. I am planning to put my intern on the case and
> > > > > > > have some basic comparison as soon as she gets a good grip of
> > > > > > > Hadoop / Nutch etc... but if someone else wants to do it please
> > > > > > > go ahead.
> > > > > > >
> > > > > > > In case I happen to be the person who told you that Otis, well
> > > > > > > at least I am consistent ;-)
> > > > > > >
> > > > > > > Julien
> > > > > > >
> > > > > > > On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello,
> > > > > > > >
> > > > > > > > At some point earlier this year I spoke to a person who told
> > > > > > > > me 2.x is (a little?) slower than 1.x. Is that still the case?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Otis
> > > > > > > > --
> > > > > > > > Solr & ElasticSearch Support -- http://sematext.com/
> > > > > > > > Performance Monitoring -- http://sematext.com/spm
> > > > > > >
> > > > > > > --
> > > > > > > Open Source Solutions for Text Engineering
> > > > > > >
> > > > > > > http://digitalpebble.blogspot.com/
> > > > > > > http://www.digitalpebble.com
> > > > > > > http://twitter.com/digitalpebble
> > > >
> > > > --
> > > > *Lewis*

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
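
The point made in the thread about client-side filtering (2.x scanning the whole table to find a handful of entries, whereas 1.x jobs read a segment holding only the selected entries) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Nutch or Gora code: the table layout, the batch mark, and the names `make_table`, `client_side_filter` and `segment_input` are all hypothetical and exist only to count how many rows each approach has to read.

```python
from typing import Dict, List, Optional, Tuple


def make_table(total: int, selected: int, batch_id: str) -> Dict[str, Optional[str]]:
    """Toy web table: url -> batch mark (None when the row is not selected)."""
    return {
        f"http://example.com/{i}": (batch_id if i < selected else None)
        for i in range(total)
    }


def client_side_filter(
    table: Dict[str, Optional[str]], batch_id: str
) -> Tuple[int, List[str]]:
    """2.x-style input as described in the thread: every row is pulled from
    the store and the non-matching ones are discarded on the client."""
    pulled = 0
    kept: List[str] = []
    for url, mark in table.items():
        pulled += 1  # the row has already been read from the backend
        if mark == batch_id:
            kept.append(url)
    return pulled, kept


def segment_input(segment: List[str]) -> Tuple[int, List[str]]:
    """1.x-style input: the job reads a segment that only holds the
    entries selected by the generate step."""
    return len(segment), list(segment)


# Hypothetical sizes: a 100,000-row table with 100 entries to fetch.
table = make_table(total=100_000, selected=100, batch_id="batch-1")

pulled_2x, kept_2x = client_side_filter(table, "batch-1")

# Stand-in for what a generate step would have written out as a segment.
segment = [url for url, mark in table.items() if mark == "batch-1"]
pulled_1x, kept_1x = segment_input(segment)

print(f"client-side filter: read {pulled_2x} rows to keep {len(kept_2x)}")
print(f"segment input:      read {pulled_1x} rows to keep {len(kept_1x)}")
```

Both approaches end up with the same entries to process. The difference is only in how many rows are read to get there, which is why the gap would be expected to grow with the size of the table rather than with the size of the batch. The model deliberately ignores the other effects mentioned in the thread (column projection, cheaper updates in 2.x), so it says nothing about overall speed.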

