That doesn't sound good indeed. Perhaps the parser chokes on truncated PDF files, which can happen when documents exceed the content limit. Are you sure all PDFs are below the 10 MB limit?
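For reference, the limit is usually controlled by the `http.content.limit` property in `nutch-site.xml`. A sketch of what that could look like (the exact value and the comment about `-1` are assumptions on my side, check nutch-default.xml for the defaults on your version):

```xml
<!-- nutch-site.xml: sketch of a content-limit override -->
<property>
  <name>http.content.limit</name>
  <!-- 10 MB in bytes; anything larger gets truncated, which can
       break PDF parsing. Setting -1 disables the limit, which may
       be worth a test run if you suspect truncated PDFs. -->
  <value>10485760</value>
</property>
```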
You can add this patch so you can see progress in parsing when running local jobs: https://issues.apache.org/jira/browse/NUTCH-1028

On Saturday 10 September 2011 13:12:07 Marek Bachmann wrote:
> Hi everybody,
>
> a parse cycle has been running for two days on my machine. I think this
> is way too long.
> The Hadoop log file contains nothing but this always-repeating message:
>
> 2011-08-10 11:15:08,863 INFO mapred.LocalJobRunner - reduce > reduce
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS
>
> Unfortunately, I can't interpret this message. Can anybody tell me if
> this is normal?
>
> Here are a few more details about the segment and my machine:
>
> Content and size of the segment:
>
> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
> total 0
> drwxr-xr-x 3 root root 23 Aug  8 15:22 content
> drwxr-xr-x 3 root root 23 Aug  8 15:22 crawl_fetch
> drwxr-xr-x 2 root root 45 Aug  8 14:56 crawl_generate
> drwxr-xr-x 2 root root 45 Aug  8 19:28 crawl_parse
> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_data
> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_text
> drwxr-xr-x 2 root root  6 Aug  8 15:34 _temporary
> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
> 8.4M  ./crawl_generate
> 9.4M  ./crawl_fetch/part-00000
> 9.4M  ./crawl_fetch
> 2.6G  ./content/part-00000
> 2.6G  ./content
> 0     ./_temporary
> 64M   ./parse_text/part-00000
> 64M   ./parse_text
> 30M   ./parse_data/part-00000
> 30M   ./parse_data
> 80M   ./crawl_parse
> 2.8G  .
>
> System status:
>
> top - 13:10:39 up 72 days, 23:13, 4 users, load average: 1.53, 4.47, 5.36
> Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
> Cpu(s): 64.4%us, 0.4%sy, 0.0%ni, 35.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem:  8003904k total, 7944152k used,  59752k free,  100172k buffers
> Swap:  418808k total,    7916k used, 410892k free, 2807036k cached
>
>   PID USER PR NI VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
> 11697 root 20  0 4746m 4.2g 12m S  259 54.5 7683:28 java
>
> Hope anybody can help me :)
>
> Thanks
>
> PS: I think there are many PDF files to process. The http content limit
> was set to 10 MB

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

