Nutch 2.1 uses Apache Gora to access the Cassandra database, and Gora implements the Hadoop InputSplit interface to generate the input splits from Cassandra. So you may find some useful documentation on the gora-cassandra module.
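
To illustrate the idea, here is a toy sketch. It is not Gora's actual code, and every name in it (SplitSketch, PartitionedStore, getSplits) is made up for this mail. As far as I understand, the number of map tasks follows the number of input splits, and the input format derives the splits from whatever partitions the data store reports. mapred.map.tasks is only a hint to the input format, so if the store reports a single partition you would get a single map task regardless of that setting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    /** Hypothetical stand-in for a data store that reports how its data is partitioned. */
    interface PartitionedStore {
        List<String> getPartitions();
    }

    /** Toy input format: emits exactly one split per partition reported by the store. */
    static List<String> getSplits(PartitionedStore store) {
        List<String> splits = new ArrayList<String>();
        for (String partition : store.getPartitions()) {
            splits.add("split[" + partition + "]");
        }
        return splits;
    }

    public static void main(String[] args) {
        // A store that reports one partition yields one split, hence one map task.
        PartitionedStore single = () -> Arrays.asList("all-rows");
        // A store that reports two ranges yields two splits, hence two map tasks.
        PartitionedStore ranged = () -> Arrays.asList("range-A", "range-B");

        System.out.println("single store splits: " + getSplits(single).size());
        System.out.println("ranged store splits: " + getSplits(ranged).size());
    }
}
```

So the place to look is how the gora-cassandra store partitions a query, rather than the Hadoop task settings.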

On May 21, 2013 7:45 PM, "Martin Aesch" <[email protected]> wrote:

> Hi,
>
> I'm running nutch-2.1 (on top of cassandra) on a single-node hadoop
> "cluster". Sorry in case my question is noobish or more a hadoop issue.
>
> In short: How can I force nutch generate to provide a filesplitsize>1
> which seems to be necessary to run multiple map jobs?
>
> I am seeing that only one input split is generated for nutch generate:
> For
> ./bin/nutch generate -topN 1000000 -noFilter
>
> 2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
> Input size for job job_201305211335_0001 = 0. Number of splits = 1
> 2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
> job_201305211335_0001 LOCALITY_WAIT_FACTOR=0.0
> 2013-05-21 13:36:33,961 INFO org.apache.hadoop.mapred.JobInProgress: Job
> job_201305211335_0001 initialized successfully with 1 map tasks and 2
> reduce tasks.
> 2013-05-21 13:36:34,278 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (JOB_SETUP) 'attempt_201305211335_0001_m_000002_0' to tip
> task_201305211335_0001_m_000002, for tracker
> 'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
> 2013-05-21 13:36:36,728 INFO org.apache.hadoop.mapred.JobInProgress:
> Task 'attempt_201305211335_0001_m_000002_0' has completed
> task_201305211335_0001_m_000002 successfully.
> 2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobInProgress:
> Choosing a non-local task task_201305211335_0001_m_000000
> 2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobTracker: Adding
> task (MAP) 'attempt_201305211335_0001_m_000000_0' to tip
> task_201305211335_0001_m_000000, for tracker
> 'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
>
> System load did not reach its maximum by far, neither in terms of CPU
> nor in terms of i/o-waiting. It took 100 minutes, which seems very fair
> for one single map, since I have about 50M webpages in my database.
> Jobtracker says max map tasks is 2 and max reduce tasks is 2, but only 1
> map task is running.
>
> This is my mapred-site.xml, which should be ok, in particular I
> overwrite mapred.job.tracker not to be "local":
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>localhost:9001</value>
>   </property>
>   <property>
>     <name>mapreduce.jobtracker.staging.root.dir</name>
>     <value>/user</value>
>   </property>
>   <property>
>     <name>mapred.map.tasks</name>
>     <value>2</value>
>     <description>
>       define mapred.map tasks to be number of slave hosts
>     </description>
>   </property>
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>2</value>
>     <description>
>       define mapred.reduce tasks to be number of slave hosts
>     </description>
>   </property>
>
> </configuration>
>
> Thanks and best regards,
> Martin

