Hi,

I tried the parsechecker tool, and it turns out that it hangs after printing:

Content Metadata: Vary=Accept-Encoding Date=Thu, 23 Feb 2012 15:27:43 GMT Content-Length=3992 Expires=Thu, 19 Nov 1981 08:52:00 GMT Content-Encoding=gzip Set-Cookie=Shoper4Shop=a3ojqpk5ep6opahejfpiv98hf6; path=/ Content-Type=text/html Connection=close X-Powered-By=PHP/5.2.17 Server=Apache Pragma=no-cache Cache-Control=no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

However, it does not give me a specific error or anything like that. Is there some way to turn that on, i.e. which Java class should I increase the log level for?

I also found a similar issue on some URLs from another host. Is there any way to defend against this, e.g. by setting a maximum timeout parameter on the parser threads? Filtering out the problematic URLs by hand is a tedious process.

Best regards,
Magnus

On Mon, Feb 20, 2012 at 4:16 AM, remi tassing <[email protected]> wrote:
> Hi,
>
> Could you also try the parsechecker tool on that last URL? It's
> possible that the file has a problem, or that it is simply a bug.
>
> Remi
>
> On Sunday, February 19, 2012, Magnús Skúlason <[email protected]> wrote:
>> Hi,
>>
>> According to my logs, a really long time (2+ hours) elapses between
>> parsing the last page in a segment and ParseSegment finishing, as
>> can be seen here:
>>
>> 2012-02-19 00:51:43,471 INFO parse.ParseSegment - Parsing: http:// ....
>> 2012-02-19 03:15:18,604 INFO parse.ParseSegment - ParseSegment: finished at 2012-02-19 03:15:18, elapsed: 02:57:24
>>
>> Since the total time of the parse job is just around 3 hours, this
>> represents a huge portion of the overall time.
>>
>> Is it normal that the last step in the job takes such a long time, and
>> is there anything I can do to speed it up? I have been running the
>> generator with -topN 20000; I wouldn't have expected that to be a big
>> enough value to cause a problem. I have now reconfigured my script to
>> skip the -topN parameter to see what happens.
>>
>> Best regards,
>> Magnus
>>
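
P.S. To make the timeout question above more concrete: what I have in mind is something along the lines of the sketch below, which bounds a single parse call with a `Future` and gives up when the deadline passes. This is plain `java.util.concurrent` code, not taken from Nutch; the names `TimedParse` and `runWithTimeout` are made up for illustration, and the sleeping task merely stands in for a parser that hangs.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Nutch: runs one task with an upper time bound.
class TimedParse {

    // Returns the task's result, or Optional.empty() if it timed out or failed.
    static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        // Daemon thread, so a parser that ignores interruption cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true);
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck task and move on
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself threw; treated as a failed parse
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A task that finishes well within the limit.
        System.out.println(runWithTimeout(() -> "parsed", 1000));
        // A stand-in for a hanging parser: sleeps far longer than the limit.
        System.out.println(runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 200));
    }
}
```

Note that cancelling only interrupts the worker thread; a parser stuck in a tight loop that never checks for interruption would keep burning CPU in the background, so this limits the waiting time rather than guaranteeing that the work stops.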

