Hi everybody,

a parse cycle has been running on my machine for two days now, which seems far too long.
The Hadoop log file contains nothing but the following, endlessly repeating messages:

2011-08-10 11:15:08,863 INFO  mapred.LocalJobRunner - reduce > reduce
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS

Unfortunately, I can't interpret these messages. Can anybody tell me whether this is normal?
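By the way, the DEBUG counter lines above are probably just noise; if they are cluttering the log, I guess I could silence them in conf/log4j.properties (assuming the standard log4j setup that ships with Nutch):

# Only log INFO and above for the Hadoop counters class,
# so the repeating "Adding ..." DEBUG lines disappear
log4j.logger.org.apache.hadoop.mapred.Counters=INFO

That would at least make it easier to see whether anything else is happening in the job.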

Here are a few more details about the segment and my machine:

Content and Size of the segment:

root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
total 0
drwxr-xr-x 3 root root 23 Aug  8 15:22 content
drwxr-xr-x 3 root root 23 Aug  8 15:22 crawl_fetch
drwxr-xr-x 2 root root 45 Aug  8 14:56 crawl_generate
drwxr-xr-x 2 root root 45 Aug  8 19:28 crawl_parse
drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_data
drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_text
drwxr-xr-x 2 root root  6 Aug  8 15:34 _temporary
root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
8.4M    ./crawl_generate
9.4M    ./crawl_fetch/part-00000
9.4M    ./crawl_fetch
2.6G    ./content/part-00000
2.6G    ./content
0       ./_temporary
64M     ./parse_text/part-00000
64M     ./parse_text
30M     ./parse_data/part-00000
30M     ./parse_data
80M     ./crawl_parse
2.8G    .

System status:

top - 13:10:39 up 72 days, 23:13,  4 users,  load average: 1.53, 4.47, 5.36
Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
Cpu(s): 64.4%us, 0.4%sy, 0.0%ni, 35.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   8003904k total,  7944152k used,    59752k free,   100172k buffers
Swap:   418808k total,     7916k used,   410892k free,  2807036k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

11697 root      20   0 4746m 4.2g  12m S  259 54.5   7683:28 java

I hope somebody can help me :)

Thanks

PS: I think there are many PDF files to process. The http.content.limit was set to 10 MB.
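In case the parser is stuck on one of those huge PDFs, I was thinking of adding a parse timeout to conf/nutch-site.xml (assuming the parser.timeout property is available in my Nutch version; as far as I understand the value is in seconds):

<property>
  <name>parser.timeout</name>
  <value>30</value>
  <description>Give up parsing a single document after 30 seconds
  instead of blocking the whole reduce phase.</description>
</property>

Would that be a reasonable workaround, or is there a better way to find out which document the parser is hanging on?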
