Also check this fix for truncated docs:
https://issues.apache.org/jira/browse/NUTCH-965
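
If your Nutch build already includes that fix, a sketch of the relevant nutch-site.xml settings might look like the following. Treat the property names as assumptions taken from NUTCH-965 — verify them against the nutch-default.xml shipped with your version:

```xml
<!-- nutch-site.xml — sketch, assuming the NUTCH-965 patch is applied -->
<property>
  <name>parser.skip.truncated</name>
  <value>true</value>
  <description>Skip parsing of documents that were truncated during
  fetching because they exceeded http.content.limit, instead of handing
  incomplete bytes to the parser.</description>
</property>
<property>
  <name>http.content.limit</name>
  <!-- 10 MB, as used in this crawl; -1 would disable the limit -->
  <value>10485760</value>
</property>
```

With a setting like this, truncated PDFs would be skipped rather than passed to the PDF parser, which is often where parse jobs hang.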



On Wednesday 10 August 2011 14:52:26 Marek Bachmann wrote:
> Hi Markus,
> 
> thanks for the reply. I am sure that they are NOT all below 10 MB; some
> of them actually contain images and are much bigger. I chose 10 MB on
> the assumption that it would be large enough for most text-only PDFs.
> 
> I'll stop the process and apply the patch. Hopefully it will reveal the
> issue. :)
> 
> On 10.08.2011 14:24, Markus Jelsma wrote:
> > That doesn't sound good indeed. Perhaps the parser chokes on your
> > truncated PDF files, which can happen when files exceed the content
> > limit. Are you sure all PDFs are below the 10 MB limit?
> > 
> > You can apply this patch to see parsing progress when running local
> > jobs: https://issues.apache.org/jira/browse/NUTCH-1028
> > 
> > On Saturday 10 September 2011 13:12:07 Marek Bachmann wrote:
> >> Hi everybody,
> >> 
> >> a parse cycle has been running on my machine for two days, which I
> >> think is far too long.
> >> The Hadoop log file contains nothing but this constantly repeating
> >> message:
> >> 
> >> 2011-08-10 11:15:08,863 INFO  mapred.LocalJobRunner - reduce > reduce
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS
> >> 
> >> Unfortunately, I can't interpret these messages. Can anybody tell me
> >> whether this is normal?
> >> 
> >> Here are a few more details about the segment and my machine:
> >> 
> >> Content and Size of the segment:
> >> 
> >> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
> >> total 0
> >> drwxr-xr-x 3 root root 23 Aug  8 15:22 content
> >> drwxr-xr-x 3 root root 23 Aug  8 15:22 crawl_fetch
> >> drwxr-xr-x 2 root root 45 Aug  8 14:56 crawl_generate
> >> drwxr-xr-x 2 root root 45 Aug  8 19:28 crawl_parse
> >> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_data
> >> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_text
> >> drwxr-xr-x 2 root root  6 Aug  8 15:34 _temporary
> >> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
> >> 8.4M    ./crawl_generate
> >> 9.4M    ./crawl_fetch/part-00000
> >> 9.4M    ./crawl_fetch
> >> 2.6G    ./content/part-00000
> >> 2.6G    ./content
> >> 0       ./_temporary
> >> 64M     ./parse_text/part-00000
> >> 64M     ./parse_text
> >> 30M     ./parse_data/part-00000
> >> 30M     ./parse_data
> >> 80M     ./crawl_parse
> >> 2.8G    .
> >> 
> >> System status:
> >> 
> >> top - 13:10:39 up 72 days, 23:13,  4 users,  load average: 1.53, 4.47, 5.36
> >> Tasks: 125 total,   1 running, 124 sleeping,   0 stopped,   0 zombie
> >> Cpu(s): 64.4%us,  0.4%sy,  0.0%ni, 35.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
> >> Mem:   8003904k total,  7944152k used,    59752k free,   100172k buffers
> >> Swap:   418808k total,     7916k used,   410892k free,  2807036k cached
> >> 
> >>     PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >> 
> >> 11697 root      20   0 4746m 4.2g  12m S  259 54.5   7683:28 java
> >> 
> >> Hope somebody can help me :)
> >> 
> >> Thanks
> >> 
> >> PS: I think there are many PDF files to process. The HTTP content
> >> limit was set to 10 MB.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
