That indeed doesn't sound good. Perhaps the parser chokes on truncated PDF 
files, which can happen when a document exceeds the content limit and gets 
cut off. Are you sure all PDFs are below the 10 MB limit?
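For reference, that limit is controlled by the `http.content.limit` property. A sketch of how the 10 MB setting might look in conf/nutch-site.xml (10485760 is simply 10 MB in bytes):

```xml
<!-- conf/nutch-site.xml: cap fetched content at 10 MB (10485760 bytes).
     Anything larger is truncated, and truncated PDFs can break the parser. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
```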

You can apply this patch to see parsing progress when running local 
jobs: https://issues.apache.org/jira/browse/NUTCH-1028
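If you want to check up front which PDFs would be truncated, a minimal sketch along these lines could issue HEAD requests and compare Content-Length against the limit (the `is_truncated` helper and the example URL are my own illustration, not part of Nutch):

```python
import urllib.request

LIMIT = 10 * 1024 * 1024  # 10 MB, matching http.content.limit above

def is_truncated(content_length, limit=LIMIT):
    """Return True if a document of this size would be cut off at the limit."""
    return content_length is not None and content_length > limit

def check_pdf(url):
    """HEAD the URL and report whether its Content-Length exceeds the limit.

    Returns None when the server sends no Content-Length header.
    """
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        length = resp.headers.get("Content-Length")
        return is_truncated(int(length)) if length else None

if __name__ == "__main__":
    # Hypothetical URL, for illustration only.
    print(check_pdf("http://example.org/big.pdf"))
```

Note that some servers omit Content-Length for dynamically generated responses, so this is only a rough pre-check.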

On Saturday 10 September 2011 13:12:07 Marek Bachmann wrote:
> Hi everybody,
> 
> a parse cycle has been running on my machine for two days now. I think this
> is far too long.
> The Hadoop log file contains nothing but this always-repeating message:
> 
> 2011-08-10 11:15:08,863 INFO  mapred.LocalJobRunner - reduce > reduce
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS
> 
> Unfortunately, I can't interpret this message. Can anybody tell me if
> this is normal?
> 
> Here are a few more details about the segment and my machine:
> 
> Content and Size of the segment:
> 
> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments
> /20110808145606# ll
> total 0
> drwxr-xr-x 3 root root 23 Aug  8 15:22 content
> drwxr-xr-x 3 root root 23 Aug  8 15:22 crawl_fetch
> drwxr-xr-x 2 root root 45 Aug  8 14:56 crawl_generate
> drwxr-xr-x 2 root root 45 Aug  8 19:28 crawl_parse
> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_data
> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_text
> drwxr-xr-x 2 root root  6 Aug  8 15:34 _temporary
> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments
> /20110808145606# du -h
> 8.4M    ./crawl_generate
> 9.4M    ./crawl_fetch/part-00000
> 9.4M    ./crawl_fetch
> 2.6G    ./content/part-00000
> 2.6G    ./content
> 0       ./_temporary
> 64M     ./parse_text/part-00000
> 64M     ./parse_text
> 30M     ./parse_data/part-00000
> 30M     ./parse_data
> 80M     ./crawl_parse
> 2.8G    .
> 
> System status:
> 
> top - 13:10:39 up 72 days, 23:13,  4 users,  load average: 1.53, 4.47, 5.36
> Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
> Cpu(s): 64.4%us,  0.4%sy,  0.0%ni, 35.2%id,  0.0%wa,  0.0%hi,  0.0%si,
> 0.0%st
> Mem:   8003904k total,  7944152k used,    59752k free,   100172k buffers
> Swap:   418808k total,     7916k used,   410892k free,  2807036k cached
> 
>    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 
> 
> 11697 root      20   0 4746m 4.2g  12m S  259 54.5   7683:28 java
> 
> Hope somebody can help me :)
> 
> Thanks
> 
> PS: I think there are many PDF files to process. The http content limit
> was set to 10 MB

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
