Hi Thomas,

Well, your name + email address is a good clue, isn't it?
Have you looked at the presentations listed in http://wiki.apache.org/nutch/Presentations? They should help you understand some of the concepts. There is also a very good chapter on Nutch in Tom White's book on Hadoop, and there is relevant material on the Gora site.

I don't have much time to answer in detail, but:

*- For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...) and by whom (Nutch/Gora/datastore/...)?*

Every task in Nutch (1 and 2) is one or more MapReduce jobs. Nutch is basically that: a collection of MapReduce jobs called sequentially (plus a few other things, of course). Nutch 2 uses GORA to provide the inputs for the MapReduce jobs from various datastores.

*- Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?*

Let's put it this way: "use Nutch 2.2.1 on a Hadoop cluster". As I said, GORA provides the input from NoSQL stores. Why? Because it can simplify the architecture (no more segments), allow atomic operations (reads/writes), which HDFS data structures can't do, and give more options for how to do certain things. For instance, the update step in Nutch 1 is costly: if you want to modify just a few URLs, you need to read and write the whole crawldb anyway.

At the moment the performance of Nutch 2 is not on par with Nutch 1; hopefully some of that will be addressed in GORA at some point. If you don't have a specific reason to use Nutch 2, then Nutch 1 would be a good starting point and would help you get familiar with the main concepts.

Julien

On 10 October 2013 19:21, Thomas COUDERC <[email protected]> wrote:

> Hi Julien,
>
> Thank you very much for your answer!
> I see you noticed I was French ;)
> I added my answers below, as you did before:
>
> > Bonjour Thomas
> >
> > answers below
> >
> > On 10 October 2013 13:10, Thomas COUDERC <[email protected]> wrote:
> >
> >> Hi everybody,
> >>
> >> I'm new to mailing lists, so excuse me if I made a mistake. Also, I'm a
> >> new dev contributor for Sauce Labs and DynamoDB subjects.
> >>
> >> I read all the tutorials for Nutch 2.x and I got Nutch 2.2.1 working
> >> with Cassandra 1.2.8 using Gora 0.3.
> >> I read that Nutch 2.2.1 (and previous versions) can be run on a Hadoop
> >> cluster.
> >> I also know that Gora manages some MapReduce operations for the backend.
> >
> > GORA wraps the content from the backends into inputs for MapReduce.
>
> For which MapReduce task? Any task (inject, generate, ...)?
> Does Gora wrap the content from the backend without using any MapReduce?
>
> >> I have two questions:
> >>
> >> 1/ If Nutch is deployed on a MapReduce cluster and, for example, HBase
> >> is used as the datastore, where are the MapReduce tasks distributed? The
> >> Nutch Hadoop cluster or HBase (via Gora)?
> >
> > Not clear what you mean by distributed.
> > Nutch uses Gora internally to pull the content from the backends, and
> > this happens on the Hadoop side, so to speak, not within the backends.
>
> I don't really understand what you mean. I think I am a bit confused by
> the fact that a datastore can work on top of some MapReduce system (HBase,
> maybe Cassandra too, ...) and the fact that Nutch can also be deployed on
> top of such a system. In that case, which one does GORA deal with?
>
> >> 2/ In my case, I use standalone Cassandra. If I deploy Nutch 2.2.1 on a
> >> Hadoop cluster, how many of Nutch can fetch URLs? (1 or all?)
> >
> > I don't understand what you mean by 'how many of Nutch'. The number of
> > mappers used for the fetching depends on your configuration, the
> > distribution of URLs and the configuration of the Hadoop cluster.
>
> I thought that in a Nutch cluster there were as many Nutch instances as
> machines. For example, with a 5-machine cluster, I thought that there
> were 5 Nutch instances available, but I think I'm totally wrong. I don't
> really understand how the Nutch .job file (in the deploy folder) works and
> what it means. I cannot find any information on that point.
> In fact the question was: can the mappers used for fetching be located on
> each machine of the cluster, so that incoming network traffic can be seen
> on each machine?
>
> Maybe I am really confused on these 2 points:
> - For Nutch 2.2.1, when is MapReduce used (jobs/content retrieval/...)
> and by whom (Nutch/Gora/datastore/...)?
> - Does it make sense to use a Nutch 2.2.1 cluster (on top of Hadoop) that
> uses a NoSQL datastore (like HBase, which also uses Hadoop)? And why?
>
> I will try to find this information in the source code or on the
> internet over the next few days. If you have some links, it would really
> help me.
>
> Maybe I could summarize this information in diagrams for the wiki.
>
> Again, thank you very much for your help, Julien.
>
> > HTH
> > Julien
> >
> > Thank you for helping me, and excuse me for my poor English.
> >
> > Thomas
> >
> > We remind you that the results produced by Médiamétrie are and remain its
> > sole property, covered by both copyright and database protection.
> > This message is confidential and intended solely for the addressees.
> > E-mails are susceptible to alteration.
> > Médiamétrie shall not be liable for the message if altered, changed or
> > falsified.
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
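
A rough illustration of the update-cost point made in the reply above. This is a toy model in plain Python, not the Nutch or Gora API: the function names (update_flat_file, update_keyed_store) and the record layout are hypothetical and exist only to contrast rewriting a whole sequential crawldb with writing individual rows to a keyed store.

```python
# Toy model only: these functions are hypothetical, not Nutch or Gora APIs.

def update_flat_file(crawldb, changes):
    """Nutch 1 style: the crawldb is a sequential structure, so every record
    is read and written again even when only a few URLs change."""
    new_db = []
    records_written = 0
    for url, status in crawldb:                 # full scan of the old crawldb
        new_db.append((url, changes.get(url, status)))
        records_written += 1                    # each record is rewritten
    return new_db, records_written


def update_keyed_store(store, changes):
    """Nutch 2 style: a keyed datastore allows writing only the changed rows."""
    records_written = 0
    for url, status in changes.items():
        store[url] = status                     # single-row write
        records_written += 1
    return store, records_written


urls = [f"http://example.com/page{i}" for i in range(1000)]
crawldb = [(u, "unfetched") for u in urls]
store = dict(crawldb)
changes = {urls[3]: "fetched", urls[42]: "fetched"}

flat_db, flat_writes = update_flat_file(crawldb, changes)
keyed_db, keyed_writes = update_keyed_store(store, changes)
print(flat_writes, keyed_writes)
```

Both versions end in the same state; only the number of records written differs. This counts writes in a toy model and says nothing about overall speed: as noted above, Nutch 2 was not on par with Nutch 1 in performance at the time.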

