Nutch 2.1 generate: how to get multiple maps in deploy-mode

Martin Aesch Tue, 21 May 2013 04:45:23 -0700

Hi,

I'm running Nutch 2.1 (on top of Cassandra) on a single-node Hadoop
"cluster". Sorry if my question is noobish or really more of a Hadoop issue.


In short: how can I force nutch generate to produce more than one input
split, which seems to be necessary to run multiple map tasks?

I am seeing that only one input split is generated for nutch generate. For

./bin/nutch generate -topN 1000000 -noFilter

the JobTracker log shows:


2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
Input size for job job_201305211335_0001 = 0. Number of splits = 1
2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
job_201305211335_0001 LOCALITY_WAIT_FACTOR=0.0
2013-05-21 13:36:33,961 INFO org.apache.hadoop.mapred.JobInProgress: Job
job_201305211335_0001 initialized successfully with 1 map tasks and 2
reduce tasks.
2013-05-21 13:36:34,278 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (JOB_SETUP) 'attempt_201305211335_0001_m_000002_0' to tip
task_201305211335_0001_m_000002, for tracker
'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
2013-05-21 13:36:36,728 INFO org.apache.hadoop.mapred.JobInProgress:
Task 'attempt_201305211335_0001_m_000002_0' has completed
task_201305211335_0001_m_000002 successfully.
2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201305211335_0001_m_000000
2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201305211335_0001_m_000000_0' to tip
task_201305211335_0001_m_000000, for tracker
'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
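
For anyone who wants to check the same thing in their own JobTracker log,
here is a minimal Python sketch that pulls the split and task counts out of
lines like the ones above. It is not part of Nutch or Hadoop; the function
name summarize and the regular expressions are my own, and they assume the
log wording shown in this mail.

```python
import re

# Sample copied from the JobTracker log above; the mail client wrapped long lines.
LOG = """\
2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
Input size for job job_201305211335_0001 = 0. Number of splits = 1
2013-05-21 13:36:33,961 INFO org.apache.hadoop.mapred.JobInProgress: Job
job_201305211335_0001 initialized successfully with 1 map tasks and 2
reduce tasks.
"""

# Patterns assume the exact message wording seen in the excerpt above.
SPLITS_RE = re.compile(
    r"Input size for job (\S+) = (\d+)\. Number of splits = (\d+)"
)
TASKS_RE = re.compile(
    r"Job (\S+) initialized successfully with (\d+) map tasks and (\d+) reduce tasks"
)


def summarize(log_text):
    """Return {job_id: counts} with split and task counts found in log_text."""
    # Collapse all whitespace so that wrapped log lines become one long string.
    flat = " ".join(log_text.split())
    jobs = {}
    for job, size, splits in SPLITS_RE.findall(flat):
        jobs.setdefault(job, {}).update(input_size=int(size), splits=int(splits))
    for job, maps, reduces in TASKS_RE.findall(flat):
        jobs.setdefault(job, {}).update(maps=int(maps), reduces=int(reduces))
    return jobs


if __name__ == "__main__":
    for job, info in summarize(LOG).items():
        print(job, info)
```

If splits and maps both come out as 1, as they do here, the single map task
follows from the single input split rather than from the task tracker slots.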


System load was nowhere near its maximum, neither in terms of CPU nor in
terms of I/O wait. The job took 100 minutes, which seems fair for a single
map task, since I have about 50M webpages in my database. The JobTracker
says max map tasks is 2 and max reduce tasks is 2, but only 1 map task is
running.

This is my mapred-site.xml, which should be OK; in particular, I
override mapred.job.tracker so that it is not "local":

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
     <property>
         <name>mapreduce.jobtracker.staging.root.dir</name>
         <value>/user</value>
     </property>
     <property>
         <name>mapred.map.tasks</name>
         <value>2</value>
         <description>
         define mapred.map tasks to be number of slave hosts
         </description>
     </property>
     <property>
         <name>mapred.reduce.tasks</name>
         <value>2</value>
         <description>
         define mapred.reduce tasks to be number of slave hosts
         </description>
     </property>

</configuration>


Thanks and best regards,
Martin
