Also check this fix for truncated docs: https://issues.apache.org/jira/browse/NUTCH-965
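For reference, NUTCH-965 adds a switch so the parse job skips documents that were truncated at fetch time instead of handing them to the parser. Assuming a Nutch build with that patch applied, the relevant nutch-site.xml entries might look like the sketch below (10485760 bytes matches the 10 MB limit discussed in this thread; check your nutch-default.xml for the exact property names in your version):

```xml
<!-- nutch-site.xml: sketch, assuming the NUTCH-965 patch is applied -->
<property>
  <!-- Truncate fetched content beyond this many bytes (10 MB here). -->
  <name>http.content.limit</name>
  <value>10485760</value>
</property>
<property>
  <!-- Added by NUTCH-965: skip parsing of truncated documents. -->
  <name>parser.skip.truncated</name>
  <value>true</value>
</property>
```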
On Wednesday 10 August 2011 14:52:26 Marek Bachmann wrote:
> Hi Markus,
>
> thanks for the reply. I am sure that they are NOT all below 10 MB; some
> of them actually contain images and are much bigger. I chose 10 MB only
> because it seemed large enough for most text PDFs.
>
> I'll stop the process and add the patch. Hopefully it will uncover the
> issue. :)
>
> On 10.08.2011 14:24, Markus Jelsma wrote:
> > That doesn't sound good indeed. Perhaps the parser chokes on your
> > truncated PDF files, which may happen when the content limit is too
> > low. Are you sure all PDFs are below the 10 MB limit?
> >
> > You can add this patch so you can see progress in parsing when running
> > local jobs: https://issues.apache.org/jira/browse/NUTCH-1028
> >
> > On Saturday 10 September 2011 13:12:07 Marek Bachmann wrote:
> >> Hi everybody,
> >>
> >> a parse cycle has been running for two days on my machine. I think
> >> this is way too long.
> >> The Hadoop log file contains nothing but this always-repeating message:
> >>
> >> 2011-08-10 11:15:08,863 INFO  mapred.LocalJobRunner - reduce > reduce
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group ParserStatus with nothing
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding failed
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding success
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group FileSystemCounters with nothing
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_READ
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding FILE_BYTES_WRITTEN
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Creating group org.apache.hadoop.mapred.Task$Counter with bundle
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_OUTPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding SPILLED_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_BYTES
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_INPUT_BYTES
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding MAP_OUTPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding COMBINE_INPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_GROUPS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_SHUFFLE_BYTES
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_OUTPUT_RECORDS
> >> 2011-08-10 11:15:08,864 DEBUG mapred.Counters - Adding REDUCE_INPUT_RECORDS
> >>
> >> Unfortunately, I can't interpret this message. Can anybody tell me if
> >> this is normal?
> >>
> >> Here are a few more details on the segment and my machine:
> >>
> >> Content and size of the segment:
> >>
> >> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# ll
> >> total 0
> >> drwxr-xr-x 3 root root 23 Aug  8 15:22 content
> >> drwxr-xr-x 3 root root 23 Aug  8 15:22 crawl_fetch
> >> drwxr-xr-x 2 root root 45 Aug  8 14:56 crawl_generate
> >> drwxr-xr-x 2 root root 45 Aug  8 19:28 crawl_parse
> >> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_data
> >> drwxr-xr-x 3 root root 23 Aug  8 19:28 parse_text
> >> drwxr-xr-x 2 root root  6 Aug  8 15:34 _temporary
> >> root@hrz-vm180:/home/nutchServer/uni_nutch/runtime/local/bin/crawl/segments/20110808145606# du -h
> >> 8.4M ./crawl_generate
> >> 9.4M ./crawl_fetch/part-00000
> >> 9.4M ./crawl_fetch
> >> 2.6G ./content/part-00000
> >> 2.6G ./content
> >> 0    ./_temporary
> >> 64M  ./parse_text/part-00000
> >> 64M  ./parse_text
> >> 30M  ./parse_data/part-00000
> >> 30M  ./parse_data
> >> 80M  ./crawl_parse
> >> 2.8G .
> >>
> >> System status:
> >>
> >> top - 13:10:39 up 72 days, 23:13, 4 users, load average: 1.53, 4.47, 5.36
> >> Tasks: 125 total, 1 running, 124 sleeping, 0 stopped, 0 zombie
> >> Cpu(s): 64.4%us, 0.4%sy, 0.0%ni, 35.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
> >> Mem:  8003904k total, 7944152k used,  59752k free, 100172k buffers
> >> Swap:  418808k total,    7916k used, 410892k free, 2807036k cached
> >>
> >>   PID USER PR NI VIRT  RES SHR S %CPU %MEM   TIME+ COMMAND
> >> 11697 root 20  0 4746m 4.2g 12m S  259 54.5 7683:28 java
> >>
> >> Hope somebody can help me :)
> >>
> >> Thanks
> >>
> >> PS: I think there are many PDF files to process. The http content
> >> limit was set to 10 MB

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
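As a quick offline check of the truncation theory in this thread: a well-formed PDF ends with an %%EOF marker, so content that was cut off at http.content.limit will usually lack it near the tail. A minimal sketch (the helper name is made up, and the check is a heuristic, not a full validation):

```python
def looks_truncated(data: bytes) -> bool:
    """Heuristic: a complete PDF carries an %%EOF marker at its tail."""
    return not data.rstrip().endswith(b"%%EOF")

# A toy "complete" PDF tail versus the same bytes cut off mid-stream,
# as a too-low http.content.limit would do to an oversized fetch.
complete = b"%PDF-1.4\n...objects...\nstartxref\n116\n%%EOF\n"
truncated = complete[:20]

print(looks_truncated(complete))   # False
print(looks_truncated(truncated))  # True
```

Running something like this over the fetched PDFs would show whether the parser is being fed truncated input rather than simply being slow.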

