Hi everybody,
a parse cycle has been running on my machine for two days now. I think
this is way too long.
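For reference, I started the parse step with the standard parse command,
something along these lines (path shortened; the segment is the one shown
in the listing below):

bin/nutch parse crawl/segments/20110808145606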
The Hadoop log file contains nothing but the following, endlessly
repeating messages:
2011-08-10 11:15:08,863 INFO mapred.LocalJobRunner - reduce > reduce
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS
Unfortunately, I can't interpret these messages. Can anybody tell me
whether this is normal?
Here are a few more details about the segment and my machine.
Contents and size of the segment:
root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
total 0
drwxr-xr-x 3 root root 23 Aug 8 15:22 content
drwxr-xr-x 3 root root 23 Aug 8 15:22 crawl_fetch
drwxr-xr-x 2 root root 45 Aug 8 14:56 crawl_generate
drwxr-xr-x 2 root root 45 Aug 8 19:28 crawl_parse
drwxr-xr-x 3 root root 23 Aug 8 19:28 parse_data
drwxr-xr-x 3 root root 23 Aug 8 19:28 parse_text
drwxr-xr-x 2 root root 6 Aug 8 15:34 _temporary
root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
8.4M ./crawl_generate
9.4M ./crawl_fetch/part-00000
9.4M ./crawl_fetch
2.6G ./content/part-00000
2.6G ./content
0 ./_temporary
64M ./parse_text/part-00000
64M ./parse_text
30M ./parse_data/part-00000
30M ./parse_data
80M ./crawl_parse
2.8G .
System status:
top - 13:10:39 up 72 days, 23:13, 4 users, load average: 1.53, 4.47, 5.36
Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
Cpu(s): 64.4%us, 0.4%sy, 0.0%ni, 35.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8003904k total, 7944152k used, 59752k free, 100172k buffers
Swap: 418808k total, 7916k used, 410892k free, 2807036k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11697 root 20 0 4746m 4.2g 12m S 259 54.5 7683:28 java
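If a thread dump of the parser JVM would help, I can grab one and post it,
e.g. with the JDK's jstack tool (11697 is the java PID from the top output
above):

jstack 11697 > parse-threaddump.txt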
I hope somebody can help me :)
Thanks
PS: I think there are a lot of PDF files to process, and the HTTP content
limit was set to 10 MB, so single documents can get quite large. Is there
a parser timeout setting I could use so that one huge PDF can't stall the
whole job?
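For completeness, the content limit is set in my conf/nutch-site.xml
roughly like this (10 MB given in bytes; property name as documented in
nutch-default.xml):

<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>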