Hi Otis

Definitely *not* the fetching speed. Actually, everything but the fetching
speed. The fetcher is pretty much the same as in 1.x, and in any case
fetching performance is almost always limited by the politeness settings,
not by the implementation.
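
As a rough illustration of why politeness dominates, here is a
back-of-the-envelope sketch. It assumes one request at a time per host and
a fixed delay between requests to the same host (the kind of value Nutch
reads from fetcher.server.delay). The function name and the numbers are
made up for the example, and download time is ignored.

```python
# Upper bound on fetch throughput imposed by politeness alone.
# All numbers are illustrative assumptions, not measurements.

def max_pages_per_hour(num_hosts: int, delay_seconds: float) -> float:
    """Ceiling on pages fetched per hour when each host is hit at most
    once every `delay_seconds`, ignoring download and parse time."""
    per_host = 3600.0 / delay_seconds
    return num_hosts * per_host


if __name__ == "__main__":
    # Hypothetical crawl: 100 distinct hosts, 5 s between requests per host.
    print(max_pages_per_hour(num_hosts=100, delay_seconds=5.0))
```

Whatever the fetcher implementation does, it cannot exceed this ceiling,
so a faster or slower fetcher changes little unless the crawl covers a
very large number of hosts.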

Re backends: some backend implementations are more mature than others. The
one for HBase is probably the most widely used, the Cassandra one has been
greatly improved, in particular performance-wise, the SQL one is broken,
etc. We need to measure this, as it is just a gut feeling at this stage.

As for what is slower and why: again, this has to be measured, but I expect
2.x to be slower partly because of [1], i.e. the filtering of entries is
not done by the backends (some might provide a way of doing it) but on the
client side, when we create the input for MapReduce. In other words, we
pull entries from the backend just to discard them. Since 2.x does not have
segments like 1.x (which the fetch and parse MapReduce jobs take as their
sole input), we scan the whole table even if we only want to fetch or parse
a handful of entries.
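
The difference can be sketched with a toy model that counts how many rows
each approach has to read. This is not Nutch or Gora code: the table
layout, the "batch_id" field and the function names are assumptions made
up for the illustration.

```python
# Toy model of rows read: client-side filtering vs. a pre-built segment.
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def build_table(size: int, every: int) -> Dict[str, Row]:
    """Hypothetical web table where one row in `every` is marked for batch b1."""
    return {
        f"http://example.com/{i}": {"batch_id": "b1" if i % every == 0 else None}
        for i in range(size)
    }


def client_side_filter(table: Dict[str, Row], batch_id: str) -> Tuple[List[str], int]:
    """2.x-style: every row is pulled from the backend, then most are discarded."""
    rows_read = 0
    selected = []
    for url, row in table.items():
        rows_read += 1
        if row["batch_id"] == batch_id:
            selected.append(url)
    return selected, rows_read


def segment_read(segment: List[str]) -> Tuple[List[str], int]:
    """1.x-style: the segment already contains only the selected entries."""
    return list(segment), len(segment)


if __name__ == "__main__":
    table = build_table(size=10_000, every=1_000)
    selected, scanned = client_side_filter(table, "b1")
    _, from_segment = segment_read(selected)
    print(len(selected), scanned, from_segment)
```

In this model both approaches return the same entries, but the client-side
filter reads the whole table to get them, while the segment read touches
only the entries it needs.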

On the other hand, 2.x specifies which columns to retrieve for a given job,
whereas 1.x will, for instance, deserialize the CrawlDatum entirely. The
metadata objects are costly to read and write, so 2.x might have the upper
hand from that point of view, since it pulls and deserializes only what it
needs.
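
A minimal sketch of that trade-off, using JSON as a stand-in for whatever
serialization is actually used. The record layout, field names and helper
functions are hypothetical and only meant to show how much has to be
deserialized in each case.

```python
# Whole-record deserialization vs. reading only the columns a job needs.
import json
from typing import Dict, Iterable, Tuple


def sample_record() -> dict:
    """Hypothetical crawl entry with a small status and bulky metadata."""
    return {
        "status": "fetched",
        "fetch_time": 1375779240,
        "metadata": {f"key{i}": "x" * 40 for i in range(25)},
    }


def store_whole(record: dict) -> bytes:
    """1.x-style: the record is one serialized blob."""
    return json.dumps(record).encode("utf-8")


def store_by_column(record: dict) -> Dict[str, bytes]:
    """2.x-style: each field is stored as its own column."""
    return {name: json.dumps(value).encode("utf-8") for name, value in record.items()}


def read_whole(blob: bytes) -> Tuple[dict, int]:
    """Deserialize everything; return the record and the bytes decoded."""
    return json.loads(blob.decode("utf-8")), len(blob)


def read_columns(columns: Dict[str, bytes], wanted: Iterable[str]) -> Tuple[dict, int]:
    """Deserialize only the requested columns; return them and the bytes decoded."""
    result = {}
    decoded = 0
    for name in wanted:
        blob = columns[name]
        decoded += len(blob)
        result[name] = json.loads(blob.decode("utf-8"))
    return result, decoded


if __name__ == "__main__":
    record = sample_record()
    _, whole_bytes = read_whole(store_whole(record))
    _, partial_bytes = read_columns(store_by_column(record), ["status", "fetch_time"])
    print(whole_bytes, partial_bytes)
```

A job that only needs the status and fetch time never touches the metadata
column, which is where most of the bytes sit in this example.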

Finally, the most costly steps in a large crawl in 1.x are the generation
and update, as we have to read and write the crawldb entirely. The way the
updates are done in 2.x is different and should be a lot faster.

Could anyone please correct me if I am wrong? Some of this is based on my
understanding of 2.x, which dates back quite a while, and some of it might
have changed in the meantime. Performance would probably vary a lot
depending on the fine-tuning of each backend implementation, but having
some basic comparison would confirm some of the assertions above.

Julien


[1] https://issues.apache.org/jira/browse/GORA-119


> Julien, could you please elaborate a bit about your comment about speed
> depending on the backend used?
>
> Yes, you were the person I was referring to :)
>
> Oh, and I *believe* you said it was the fetching speed that was different
> between 1.x and 2.x.  Is that right?  Or is some other phase slower in 2.x?
>
> Thanks,
> Otis
> ----
> Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase -
> http://sematext.com/spm
>
>
>
>
> >________________________________
> > From: Julien Nioche <[email protected]>
> >To: "[email protected]" <[email protected]>
> >Sent: Tuesday, August 6, 2013 10:54 AM
> >Subject: Re: 2.x vs. 1.x speed
> >
> >
> >Hi Otis,
> >
> >That certainly depends on the backend used but on the whole it wouldn't be
> >surprising. Would be good to have some data to substantiate it. I am
> >planning to put my intern on the case and have some basic comparison as
> >soon as she gets a good grip of Hadoop / Nutch etc... but if someone else
> >wants to do it please go ahead.
> >
> >In case I happen to be the person who told you that Otis, well at least I
> >am consistent ;-)
> >
> >Julien
> >
> >On 6 August 2013 09:08, Otis Gospodnetic <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> At some point earlier this year I spoke to a person who told me 2.x is
> >> (a little?) slower than 1.x.  Is that still the case?
> >>
> >> Thanks,
> >> Otis
> >> --
> >> Solr & ElasticSearch Support -- http://sematext.com/
> >> Performance Monitoring -- http://sematext.com/spm
> >>
> >
> >
> >
> >--
> >Open Source Solutions for Text Engineering
> >
> >http://digitalpebble.blogspot.com/
> >http://www.digitalpebble.com
> >http://twitter.com/digitalpebble
> >
> >
> >
>



