Re: Nutch 2.1 generate: how to get multiple maps in deploy-mode

Martin Aesch Wed, 05 Jun 2013 06:57:17 -0700

Thanks, feng. Another question occurred to me: is it possible to simply run
multiple generates and fetches in parallel? In some situations this would be
almost as good.


Cassandra is eventually consistent, so what happens if two generate
instances run at the same time? Or during the reduce step, when the actual
tagging takes place? I would not care too much if the interference is
small, say, if I wait a few minutes before starting a second generate. In
the reduce step, this second generate would ignore all tagged URLs, right?

Is it advisable to run multiple generates/fetches/updates in parallel?
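
To make the overlap concern concrete, here is a toy Python model. It is not
Nutch code: the dict-based store and the names generate_map, generate_reduce
and fresh_store are made up for illustration. It assumes, as far as I
understand the generator, that the map step skips rows that already carry a
generate mark and that the mark itself is only written in the reduce step.

```python
# Toy model of two generate runs sharing one web table.
# Assumption: map selects unmarked rows, reduce writes the mark later.

def generate_map(store, top_n):
    """Select up to top_n URLs that carry no generate mark yet."""
    unmarked = [url for url, mark in sorted(store.items()) if mark is None]
    return unmarked[:top_n]


def generate_reduce(store, selected, batch_id):
    """Tag the selected URLs with this run's batch id."""
    for url in selected:
        store[url] = batch_id


def fresh_store(n=6):
    """Hypothetical web table: URL -> generate mark (None means untagged)."""
    return {f"http://example.com/{i}": None for i in range(n)}


# Case 1: run B starts only after run A has finished tagging.
store = fresh_store()
a = generate_map(store, 3)
generate_reduce(store, a, "batch-A")
b = generate_map(store, 3)
generate_reduce(store, b, "batch-B")
sequential_overlap = set(a) & set(b)

# Case 2: run B's map reads before run A's reduce has written its marks.
store = fresh_store()
a = generate_map(store, 3)
b = generate_map(store, 3)
generate_reduce(store, a, "batch-A")
generate_reduce(store, b, "batch-B")
interleaved_overlap = set(a) & set(b)

print(len(sequential_overlap), len(interleaved_overlap))  # prints "0 3"
```

In this model, waiting until the first run has tagged its URLs gives no
overlap, while interleaving puts the same URLs into both batches and the later
write wins. Whether the real job on Cassandra behaves like this is exactly
what I am asking.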



On Tue, 2013-05-21 at 22:34 +0800, feng lu wrote:
> Nutch 2.1 uses Apache Gora to access the Cassandra database, and Gora
> implements the Hadoop InputSplit interface to generate input splits from
> Cassandra. So you may find some documentation about the gora-cassandra model.
> 
> 
> 
> On May 21, 2013 7:45 PM, "Martin Aesch" <[email protected]> wrote:
> 
> > Hi,
> >
> > I'm running nutch-2.1 (on top of cassandra) on a single-node hadoop
> > "cluster". Sorry in case my question is noobish or more a hadoop issue.
> >
> > In short: How can I force nutch generate to produce more than one input
> > split, which seems to be necessary to run multiple map tasks?
> >
> > I am seeing that only one input split is generated for nutch generate:
> > For
> > ./bin/nutch generate -topN 1000000 -noFilter
> >
> >
> > 2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
> > Input size for job job_201305211335_0001 = 0. Number of splits = 1
> > 2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
> > job_201305211335_0001 LOCALITY_WAIT_FACTOR=0.0
> > 2013-05-21 13:36:33,961 INFO org.apache.hadoop.mapred.JobInProgress: Job
> > job_201305211335_0001 initialized successfully with 1 map tasks and 2
> > reduce tasks.
> > 2013-05-21 13:36:34,278 INFO org.apache.hadoop.mapred.JobTracker: Adding
> > task (JOB_SETUP) 'attempt_201305211335_0001_m_000002_0' to tip
> > task_201305211335_0001_m_000002, for tracker
> > 'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
> > 2013-05-21 13:36:36,728 INFO org.apache.hadoop.mapred.JobInProgress:
> > Task 'attempt_201305211335_0001_m_000002_0' has completed
> > task_201305211335_0001_m_000002 successfully.
> > 2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobInProgress:
> > Choosing a non-local task task_201305211335_0001_m_000000
> > 2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobTracker: Adding
> > task (MAP) 'attempt_201305211335_0001_m_000000_0' to tip
> > task_201305211335_0001_m_000000, for tracker
> > 'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
> >
> >
> > System load was nowhere near its maximum, neither in terms of CPU
> > nor in terms of I/O wait. The job took 100 minutes, which seems fair
> > for a single map task, since I have about 50M webpages in my database.
> > The JobTracker says max map tasks is 2 and max reduce tasks is 2, but only 1
> > map task is running.
> >
> > This is my mapred-site.xml, which should be OK; in particular, I
> > override mapred.job.tracker so that it is not "local":
> >
> > <?xml version="1.0"?>
> > <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> > <!-- Put site-specific property overrides in this file. -->
> >
> > <configuration>
> >      <property>
> >          <name>mapred.job.tracker</name>
> >          <value>localhost:9001</value>
> >      </property>
> >      <property>
> >          <name>mapreduce.jobtracker.staging.root.dir</name>
> >          <value>/user</value>
> >      </property>
> >      <property>
> >          <name>mapred.map.tasks</name>
> >          <value>2</value>
> >          <description>
> >              define mapred.map tasks to be number of slave hosts
> >          </description>
> >      </property>
> >      <property>
> >          <name>mapred.reduce.tasks</name>
> >          <value>2</value>
> >          <description>
> >              define mapred.reduce tasks to be number of slave hosts
> >          </description>
> >      </property>
> >
> > </configuration>
> >
> >
> > Thanks and best regards,
> > Martin
> >
