Re: 2.x vs. 1.x speed

Julien Nioche Mon, 16 Sep 2013 12:31:29 -0700

Hi Renato

Great to hear from you


On 16 September 2013 18:42, Renato Marroquín Mogrovejo <
[email protected]> wrote:

> Thanks for sharing Julien! These are indeed interesting results.
> Just a quick question, did you use a single server to run this? or did you
> set up a minimum number of servers for it?


As explained in the blog post, this was run in pseudo-distributed mode,
i.e. on a single server.

> This is because HBase or Cassandra will improve their latency if we
> scale them out.
>

See the conclusion of my post. I pointed at a number of possible
explanations, mostly to do with GORA. Scaling out would also make 1.x
faster :-) The question is whether there is a size of the crawldb / number
of machines at which the balance would change.

Can you explain why processing a smaller db on a single node with
Nutch 2 would take proportionally longer than processing a larger db on a
larger cluster?

Thanks

Julien



>
>
> Renato M.
>
>
> 2013/9/16 Markus Jelsma <[email protected]>
>
> > Thanks! That was interesting.
> >
> > -----Original message-----
> > From: Julien Nioche<[email protected]>
> > Sent: Monday 16th September 2013 18:45
> > To: [email protected]; [email protected]
> > Cc: Otis Gospodnetic <[email protected]>
> > Subject: Re: 2.x vs. 1.x speed
> >
> > Guys,
> >
> > Following the discussion we had some time ago about comparing 1.x with
> > 2.x, we did some tests and put the results on
> >
> > http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html
> >
> > Feel free to comment.
> >
> > Best,
> >
> > Julien
> >
> > On 24 August 2013 05:51, Lewis John Mcgibbney <[email protected]> wrote:
> >
> > I am sure that Renato (if he is watching) can maybe chime in as well.
> >
> > We find in Gora that, for native Hadoop stores such as Avro, HBase and
> > Accumulo, when we execute a query with GoraInputFormat via getPartitions
> > we retrieve GoraInputSplits natively, which means splits are obtained for
> > MapReduce jobs, such as many of the jobs we run in Nutch. On the other
> > hand, stores such as Cassandra and web service stores such as DynamoDB
> > do not (currently) support Hadoop out of the box (the former we are
> > working on and hope to have implemented in Gora soon), so it is not as
> > simple to get partitions in the same way we would in a Hadoop-native
> > store. We therefore obtain one partition to be used as an InputSplit for
> > the MR job. This is certainly an area of concern and right now a
> > bottleneck for some operations. We continue to work on this.
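
A minimal sketch of why a single partition becomes a bottleneck, assuming the number of map tasks that can run concurrently is capped by the number of input splits. All names here (`Split`, `get_splits`, `max_parallel_map_tasks`) are hypothetical and only illustrate the idea; this is not the Gora or Hadoop API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Split:
    """A key range handed to one map task (None means open-ended)."""
    start_key: Optional[str]
    end_key: Optional[str]


def get_splits(boundaries: Optional[List[str]]) -> List[Split]:
    """Return one split per key range the store can report.

    A store that cannot report its partitions (boundaries is None or
    empty) yields a single catch-all split covering the whole table.
    """
    if not boundaries:
        return [Split(None, None)]
    edges: List[Optional[str]] = [None, *sorted(boundaries), None]
    return [Split(lo, hi) for lo, hi in zip(edges, edges[1:])]


def max_parallel_map_tasks(splits: List[Split], map_slots: int) -> int:
    """Map parallelism is bounded by both the splits and the free slots."""
    return min(len(splits), map_slots)


# Hypothetical store that exposes three region boundaries: four splits.
native_splits = get_splits(["g", "n", "t"])
# Hypothetical store without partition support: one split only.
single_split = get_splits(None)

print(max_parallel_map_tasks(native_splits, map_slots=8))
print(max_parallel_map_tasks(single_split, map_slots=8))
```

Under these assumptions, adding machines (more map slots) helps the first store but not the second, which still reads the whole table through a single map task.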
> >
> > On Wednesday, August 7, 2013, Julien Nioche <[email protected]> wrote:
> >
> > > Hi Otis
> >
> > >
> >
> > > Definitely *not* the fetching speed. Actually, everything *but* the
> > > fetching speed. The fetcher is pretty much the same as in 1.x, and
> > > anyway fetching performance is pretty much always limited by the
> > > politeness settings, not the implementation.
> >
> > >
> >
> > > Re backends: some backend implementations are more mature than others.
> > > The one for HBase is probably the most widely used, the Cassandra one
> > > has been greatly improved, in particular performance-wise, the SQL one
> > > is broken, etc. We need to measure this, as it is just a gut feeling at
> > > this stage.
> >
> > >
> >
> > > Now for what is slower and why: again, this has to be measured, but I
> > > expect 2.x to be slower partly because of [1], i.e. the filtering of
> > > entries is not done by the backends (some might provide a way of doing
> > > it) but on the client side, when we create the input for mapred. In
> > > other words, we pull things from the backend just to discard them.
> > > Since 2.x does not have segments like 1.x (which the fetch + parse
> > > mapreduce jobs take as their single input), we scan the whole table
> > > even if we want to fetch or parse only a handful of entries.
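
The cost difference described above can be sketched as follows. This is a toy model, assuming the dominant cost is the number of rows pulled from storage; the function names, the `batch` field and the row counts are hypothetical and do not come from Nutch or Gora.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def select_by_client_side_filter(table: List[Row], batch_id: str) -> Tuple[List[Row], int]:
    """2.x-style sketch: pull every row, then discard the ones not wanted."""
    rows_read = 0
    selected = []
    for row in table:
        rows_read += 1  # the row has already been fetched from the backend
        if row["batch"] == batch_id:
            selected.append(row)
    return selected, rows_read


def select_from_segment(segments: Dict[str, List[Row]], name: str) -> Tuple[List[Row], int]:
    """1.x-style sketch: the job input is only the segment it works on."""
    rows = list(segments[name])
    return rows, len(rows)


# Hypothetical table of 1000 entries, 5 of which belong to batch "b1".
table = [
    {"url": f"http://example.com/{i}", "batch": "b1" if i < 5 else "other"}
    for i in range(1000)
]
segments = {"b1": [row for row in table if row["batch"] == "b1"]}

picked_2x, read_2x = select_by_client_side_filter(table, "b1")
picked_1x, read_1x = select_from_segment(segments, "b1")

print(len(picked_2x), read_2x)
print(len(picked_1x), read_1x)
```

Both approaches end up with the same handful of entries, but in this model the client-side filter reads the whole table to find them, while the segment-based input reads only what it needs.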
> >
> > >
> >
> > > On the other hand, 2.x specifies which columns to retrieve for a given
> > > job, whereas 1.x will, for instance, deserialize the CrawlDatum
> > > entirely. The metadata objects are costly to read/write, so 2.x might
> > > have the upper hand from that point of view since it pulls and
> > > deserializes only what it needs.
> >
> > >
> >
> > > Finally, the most costly steps in a large crawl in 1.x are the
> > > generation and update, as we have to read/write the crawldb entirely.
> > > The way the updates are done in 2.x is different and should be a lot
> > > faster.
> >
> > >
> >
> > > Please could anyone correct me if I am wrong. Some of this is based on
> > > my understanding of 2.x, which dates back quite a while, and some of
> > > the stuff might have changed in the meantime. The performance would
> > > probably vary a lot based on the fine tuning of each backend
> > > implementation, but having some basic comparison would confirm some of
> > > the assertions above.
> >
> > >
> >
> > > Julien
> >
> > >
> >
> > >
> >
> > > [1] https://issues.apache.org/jira/browse/GORA-119
> >
> > >
> >
> > >
> >
> > > >> Julien, could you please elaborate a bit about your comment about
> > > >> speed depending on the backend used?
> > > >>
> > > >> Yes, you were the person I was referring to :)
> > > >>
> > > >> Oh, and I *believe* you said it was the fetching speed that was
> > > >> different between 1.x and 2.x.  Is that right?  Or is some other phase
> > > >> slower in 2.x?
> >
> > >>
> >
> > >> Thanks,
> >
> > >> Otis
> >
> > >> ----
> >
> > >> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> >
> > >> http://sematext.com/spm <http://sematext.com/spm>
> >
> > >>
> >
> > >>
> >
> > >>
> >
> > >>
> >
> > >> >________________________________
> >
> > > >> > From: Julien Nioche <[email protected]>
> > > >> >To: "[email protected]" <[email protected]>
> >
> > >> >Sent: Tuesday, August 6, 2013 10:54 AM
> >
> > >> >Subject: Re: 2.x vs. 1.x speed
> >
> > >> >
> >
> > >> >
> >
> > >> >Hi Otis,
> >
> > >> >
> >
> > > >> >That certainly depends on the backend used, but on the whole it
> > > >> >wouldn't be surprising. It would be good to have some data to
> > > >> >substantiate it. I am planning to put my intern on the case and have
> > > >> >some basic comparison as soon as she gets a good grip of Hadoop /
> > > >> >Nutch etc., but if someone else wants to do it please go ahead.
> >
> > >> >
> >
> > > >> >In case I happen to be the person who told you that, Otis, well, at
> > > >> >least I am consistent ;-)
> >
> > >> >
> >
> > >> >Julien
> >
> > >> >
> >
> > > >> >On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
> >
> > >> >
> >
> > >> >> Hello,
> >
> > >> >>
> >
> > > >> >> At some point earlier this year I spoke to a person who told me 2.x
> > > >> >> is (a little?) slower than 1.x.  Is that still the case?
> >
> > >> >>
> >
> > >> >> Thanks,
> >
> > >> >> Otis
> >
> > >> >> --
> >
> > > >> >> Solr & ElasticSearch Support -- http://sematext.com/
> > > >> >> Performance Monitoring -- http://sematext.com/spm
> >
> > >> >>
> >
> > >> >
> >
> > >> >
> >
> > >> >
> >
> > >> >--
> >
> > >> >*
> >
> > >> >*Open Source Solutions for Text Engineering
> >
> > >> >
> >
> > >> >http://digitalpebble.blogspot.com/ <
> http://digitalpebble.blogspot.com/
> > >
> >
> > >> >http://www.digitalpebble.com <http://www.digitalpebble.com>
> >
> > >> >http://twitter.com/digitalpebble <http://twitter.com/digitalpebble>
> >
> > >> >
> >
> > >> >
> >
> > >> >
> >
> > >>
> >
> > >
> >
> > >
> >
> > >
> >
> > > --
> >
> > > *
> >
> > > *Open Source Solutions for Text Engineering
> >
> > >
> >
> > > http://digitalpebble.blogspot.com/ <http://digitalpebble.blogspot.com/
> >
> >
> > > http://www.digitalpebble.com <http://www.digitalpebble.com>
> >
> > > http://twitter.com/digitalpebble <http://twitter.com/digitalpebble>
> >
> > >
> >
> > --
> >
> > *Lewis*
> >
> > --
> >
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/ <http://digitalpebble.blogspot.com/>
> > http://www.digitalpebble.com <http://www.digitalpebble.com>
> > http://twitter.com/digitalpebble <http://twitter.com/digitalpebble>
> >
> >
> >
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
