Re: Réf : Re: Nutch 2.2.1 with Map Reduce

Julien Nioche Thu, 10 Oct 2013 13:49:01 -0700

Hi Thomas

Well your name + email address is a good clue, isn't it?


Have you looked at the presentations listed in
http://wiki.apache.org/nutch/Presentations? They should help you understand
some of the concepts. There is also a very good chapter on Nutch in Tom
White's book on Hadoop.
There is also relevant stuff on the Gora site.

I don't have much time to answer in detail, but:

> - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
> and by whom (Nutch/Gora/datastore/...)?

Every task in Nutch (1 and 2) is one or more MapReduce jobs. Nutch is
basically that: a collection of MapReduce jobs called sequentially (+ a
few other things of course). Nutch 2 uses GORA to provide the inputs for
the MapReduce jobs from various datastores.
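
To make that concrete, here is a toy sketch in Python. It is not Nutch or
Gora code: the dict-based store, the job runner, the fake web and the two
jobs are all invented for illustration. The dict stands in for the datastore
that Gora would expose, and each "job" reads its input from that store, runs
a map and a reduce phase, and writes the result back. The real jobs are Java
MapReduce jobs and do a lot more.

```python
from collections import defaultdict

# Hypothetical stand-in for the web: URL -> outlinks found on that page.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/"],
    "http://b.example/": [],
}

def make_store(seed_urls):
    """Toy 'inject' step: one row per seed URL in a dict-based store."""
    return {url: {"status": "unfetched", "outlinks": []} for url in seed_urls}

def run_job(store, mapper, reducer):
    """Run one toy job: read rows from the store, map, group by key, reduce, write back."""
    grouped = defaultdict(list)
    for key, row in list(store.items()):          # input comes from the store
        for out_key, out_value in mapper(key, row):
            grouped[out_key].append(out_value)
    for out_key, values in grouped.items():       # output goes back to the store
        store[out_key] = reducer(out_key, values)

def fetch_mapper(url, row):
    # Only rows that have not been fetched yet produce output.
    if row["status"] == "unfetched":
        yield url, {"status": "fetched", "outlinks": FAKE_WEB.get(url, [])}

def fetch_reducer(url, values):
    return values[-1]

def update_mapper(url, row):
    yield url, row                 # keep what we already know about this URL
    for link in row["outlinks"]:
        yield link, None           # None marks a URL that was merely discovered

def update_reducer(url, values):
    known = [v for v in values if v is not None]
    if known:
        return known[0]            # existing row wins over a bare discovery
    return {"status": "unfetched", "outlinks": []}

def crawl(store, rounds):
    """The 'crawler' is just the jobs called one after the other."""
    for _ in range(rounds):
        run_job(store, fetch_mapper, fetch_reducer)
        run_job(store, update_mapper, update_reducer)
```

With a store seeded with http://a.example/, one round fetches that page and
adds its outlink as an unfetched row; a second round fetches the new URL.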

> - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
> uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?

Let's put it this way: "use Nutch 2.2.1 on a Hadoop cluster". And as I
said, GORA provides the input from NoSQL stores. Why? Because it can
simplify the architecture (no more segments), allows atomic operations
(reads/writes), which HDFS data structures can't do, and gives more options
on how to do certain things. For instance, the update step in Nutch 1 is
costly: if you want to modify just a few URLs, you still need to read and
write the whole crawldb.
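
A deliberately simplified model of that cost difference, in Python. The two
functions and the record layout are assumptions made up for this sketch, not
Nutch or Gora APIs: the first treats the crawldb as a file that can only be
rewritten in full, the second treats it as a keyed table where single rows
can be written.

```python
def update_flat_file(crawldb, changes):
    """Rewrite-everything model: every record is read and written again,
    even if only a handful of URLs actually change."""
    new_db = []
    for url, status in crawldb:
        new_db.append((url, changes.get(url, status)))
    records_read = len(crawldb)
    records_written = len(new_db)
    return new_db, records_read, records_written

def update_keyed_store(table, changes):
    """Keyed-store model: only the rows that change are written."""
    writes = 0
    for url, status in changes.items():
        table[url] = status
        writes += 1
    return writes

# 1000 known URLs, 2 of which need a status change.
crawldb = [(f"http://example.com/{i}", "unfetched") for i in range(1000)]
changes = {
    "http://example.com/3": "fetched",
    "http://example.com/7": "fetched",
}

new_db, records_read, records_written = update_flat_file(crawldb, changes)
row_writes = update_keyed_store(dict(crawldb), changes)
print(records_read, records_written, row_writes)
```

In this toy model the flat-file update touches all 1000 records while the
keyed store writes 2 rows. It says nothing about real-world throughput, which
depends on the backend.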

At the moment the performance of Nutch 2 is not on par with Nutch 1;
hopefully some of that gap will be addressed in GORA at some point.

If you don't have a specific reason to use Nutch 2 then Nutch 1 would be a
good starting point and would help you get familiar with the main concepts.

Julien


On 10 October 2013 19:21, Thomas COUDERC <[email protected]> wrote:

>
> Hi Julien,
>
> Thank you very much for your answer!
> I see you noticed I am French ;)
> I added my answers below, as you did before:
>
>
> > Bonjour Thomas
>
> > answers below
>
>
> > On 10 October 2013 13:10, Thomas COUDERC <[email protected]> wrote:
>
> >> Hi everybody,
> >>
> >> I'm new to mailing lists so excuse me if I made a mistake. Also, I'm a
> >> new dev contributor for Sauce Labs and DynamoDB subjects.
> >>
> >> I read all the tutorials for Nutch 2.x and got Nutch 2.2.1 working with
> >> Cassandra 1.2.8 using Gora 0.3.
> >> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
> >> cluster.
> >> I also know that Gora manages some MapReduce operations for the backend.
>
> > GORA wraps the content from the backends into inputs for Mapreduce.
>
> For which MapReduce task? Any task (inject, generate, ...)?
> Does Gora wrap the content from the backend without using any MapReduce?
>
> >> I have two questions :
> >>
> >> 1/ If Nutch is deployed on a MapReduce cluster and, for example, HBase
> >> is used as the datastore, where are the MapReduce tasks distributed? On
> >> the Nutch Hadoop cluster or in HBase (via Gora)?
>
> > It is not clear what you mean by distributed.
> > Nutch uses Gora internally to pull the content from the backends, and this
> > happens on the Hadoop side so to speak, not within the backends.
>
> I don't really understand what you mean. I think I am a bit confused by
> the fact that a datastore can work on top of some MapReduce system (HBase,
> maybe Cassandra too, ...) and the fact that Nutch can also be deployed on
> top of such a system. In that case, which one does GORA deal with?
>
> >> 2/ In my case, I use Cassandra standalone. If I deploy Nutch 2.2.1 on a
> >> Hadoop cluster, how many of Nutch can fetch URLs? (1 or all?)
>
> > I don't understand what you mean by 'how many of Nutch'. The number of
> > mappers used for the fetching depends on your configuration, the
> > distribution of URLs and the configuration of the Hadoop cluster.
>
> I thought that in a Nutch cluster there were as many Nutch instances as
> machines. For example, with a 5-machine cluster, I thought that there
> were 5 Nutch instances available, but I think I'm totally wrong. I don't
> really understand how the Nutch .job file (in the deploy folder) works and
> what it means. I cannot find any information on that point.
> In fact the question was: can the mappers used for fetching be located on
> each machine of the cluster, so that it is possible to see incoming network
> traffic on each machine?
>
>
> Maybe I am really confused on these 2 points:
>  - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
> and by whom (Nutch/Gora/datastore/...)?
>  - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
> uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?
>
> I will try to find this information in the source code or on the
> internet over the next few days. If you have some links it would really
> help me.
>
> Maybe I could summarize this information in diagrams for
> the wiki.
>
>
> Again, thank you very much for your help Julien.
>
>
> HTH
>
> Julien
>
>
>
> >
> > Thank you for helping me, and excuse me for my poor English.
> >
> > Thomas
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
