Hi,
I'm running nutch-2.1 (on top of cassandra) on a single-node hadoop
"cluster". Sorry if my question is naive or is really a hadoop issue.
In short: How can I force nutch generate to produce more than one input
split, which seems to be necessary to run multiple map tasks?
I see that only one input split is generated for nutch generate. For
./bin/nutch generate -topN 1000000 -noFilter
the jobtracker log shows:
2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
Input size for job job_201305211335_0001 = 0. Number of splits = 1
2013-05-21 13:36:33,960 INFO org.apache.hadoop.mapred.JobInProgress:
job_201305211335_0001 LOCALITY_WAIT_FACTOR=0.0
2013-05-21 13:36:33,961 INFO org.apache.hadoop.mapred.JobInProgress: Job
job_201305211335_0001 initialized successfully with 1 map tasks and 2
reduce tasks.
2013-05-21 13:36:34,278 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (JOB_SETUP) 'attempt_201305211335_0001_m_000002_0' to tip
task_201305211335_0001_m_000002, for tracker
'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
2013-05-21 13:36:36,728 INFO org.apache.hadoop.mapred.JobInProgress:
Task 'attempt_201305211335_0001_m_000002_0' has completed
task_201305211335_0001_m_000002 successfully.
2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobInProgress:
Choosing a non-local task task_201305211335_0001_m_000000
2013-05-21 13:36:36,735 INFO org.apache.hadoop.mapred.JobTracker: Adding
task (MAP) 'attempt_201305211335_0001_m_000000_0' to tip
task_201305211335_0001_m_000000, for tracker
'tracker_Ubuntu-1204-precise-64-minimal:localhost/127.0.0.1:34436'
System load stayed far below its maximum, both in terms of CPU and in
terms of I/O wait. The job took 100 minutes, which seems reasonable for
a single map task, since I have about 50M webpages in my database.
The jobtracker reports a maximum of 2 map tasks and 2 reduce tasks, but
only 1 map task is running.
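To make these numbers concrete, here is a rough back-of-the-envelope
sketch in Python. It is only my own model, not anything taken from nutch
or hadoop code: the variable names are made up, and it assumes that the
100 minutes were spent almost entirely in the single map task. It relies
on the fact that hadoop starts one map task per input split, so the
number of slots only caps how many of them run at once.

```python
# Figures taken from the log and the description above.
num_splits = 1        # "Number of splits = 1" in the JobInProgress log
map_slots = 2         # max map tasks reported by the jobtracker
pages = 50_000_000    # approximate number of webpages in the database
minutes = 100         # observed runtime (assumed to be mostly the map task)

# One map task is created per input split, so the slots only limit
# concurrency; they cannot create additional map tasks.
concurrent_maps = min(num_splits, map_slots)

# Approximate throughput of the single map task, in pages per second.
pages_per_second = pages / (minutes * 60)

print("concurrent map tasks:", concurrent_maps)
print("pages per second:", round(pages_per_second))
```

With a single split, the second map slot stays idle no matter how the
slots are configured, which matches what I observe.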
This is my mapred-site.xml, which should be OK; in particular, I
override mapred.job.tracker so that it is not "local":
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
<property>
<name>mapreduce.jobtracker.staging.root.dir</name>
<value>/user</value>
</property>
<property>
<name>mapred.map.tasks</name>
<value>2</value>
<description>
define mapred.map tasks to be number of slave hosts
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>2</value>
<description>
define mapred.reduce tasks to be number of slave hosts
</description>
</property>
</configuration>
Thanks and best regards,
Martin