Hi everybody,
a parse cycle has been running on my machine for two days now. I think
this is way too long.
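For reference, I started the parse step with the standard parse command,
something along these lines (path shortened; the segment is the one shown
in the listing below):

bin/nutch parse crawl/segments/20110808145606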
The Hadoop log file contains nothing but the following, endlessly
repeating messages:
2011-08-10 11:15:08,863 INFO mapred.LocalJobRunner - reduce > reduce
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS
Unfortunately, I can't interpret these messages. Can anybody tell me
whether this is normal?
Here are a few more details about the segment and my machine.
Contents and size of the segment:
root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
total 0
drwxr-xr-x 3 root root 23 Aug 8 15:22 content
drwxr-xr-x 3 root root 23 Aug 8 15:22 crawl_fetch
drwxr-xr-x 2 root root 45 Aug 8 14:56 crawl_generate
drwxr-xr-x 2 root root 45 Aug 8 19:28 crawl_parse
drwxr-xr-x 3 root root 23 Aug 8 19:28 parse_data
drwxr-xr-x 3 root root 23 Aug 8 19:28 parse_text
drwxr-xr-x 2 root root 6 Aug 8 15:34 _temporary
root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
8.4M ./crawl_generate
9.4M ./crawl_fetch/part-00000
9.4M ./crawl_fetch
2.6G ./content/part-00000
2.6G ./content
0 ./_temporary
64M ./parse_text/part-00000
64M ./parse_text
30M ./parse_data/part-00000
30M ./parse_data
80M ./crawl_parse
2.8G .
System status:
top - 13:10:39 up 72 days, 23:13, 4 users, load average: 1.53, 4.47, 5.36
Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
Cpu(s): 64.4%us, 0.4%sy, 0.0%ni, 35.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8003904k total, 7944152k used, 59752k free, 100172k buffers
Swap: 418808k total, 7916k used, 410892k free, 2807036k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11697 root 20 0 4746m 4.2g 12m S 259 54.5 7683:28 java
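If a thread dump of the parser JVM would help, I can grab one and post it,
e.g. with the JDK's jstack tool (11697 is the java PID from the top output
above):

jstack 11697 > parse-threaddump.txt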
I hope somebody can help me :)
Thanks
PS: I think there are a lot of PDF files to process, and the HTTP content
limit was set to 10 MB, so single documents can get quite large. Is there
a parser timeout setting I could use so that one huge PDF can't stall the
whole job?
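For completeness, the content limit is set in my conf/nutch-site.xml
roughly like this (10 MB given in bytes; property name as documented in
nutch-default.xml):

<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>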