Hi,

I tried the parsechecker tool, and it turns out that it hangs after printing:
Content Metadata: Vary=Accept-Encoding Date=Thu, 23 Feb 2012 15:27:43
GMT Content-Length=3992 Expires=Thu, 19 Nov 1981 08:52:00 GMT
Content-Encoding=gzip
Set-Cookie=Shoper4Shop=a3ojqpk5ep6opahejfpiv98hf6; path=/
Content-Type=text/html Connection=close X-Powered-By=PHP/5.2.17
Server=Apache Pragma=no-cache Cache-Control=no-store, no-cache,
must-revalidate, post-check=0, pre-check=0
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8

However, it does not report a specific error. Is there some way to
turn that on, i.e. which Java class should I raise the log level
for?

I also found a similar issue with some URLs from another host. Is
there any way to defend against this, e.g. by setting a maximum
timeout on the parser threads? Filtering out the problematic URLs by
hand is a tedious process.
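
To make the timeout idea concrete, here is a minimal sketch in plain
Java (java.util.concurrent only). It is not Nutch's own parser code:
the class name TimedParse and the method runWithTimeout are
hypothetical, and the lambdas stand in for a real parse call. It also
assumes the stuck parse responds to thread interruption; a parser
spinning in a tight loop would keep running in the background even
after the caller has moved on.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {

    /**
     * Runs a task on a worker thread and stops waiting after timeoutMillis.
     * Returns Optional.empty() if the task timed out, threw, or returned null.
     */
    public static <T> Optional<T> runWithTimeout(Callable<T> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Interrupts the worker; this only helps if the task honours interrupts.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            // The task itself failed; treat it like any other unparseable document.
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a document that parses quickly.
        Optional<String> fast = runWithTimeout(() -> "parsed", 1000);
        // Stand-in for a document that makes the parser hang.
        Optional<String> slow = runWithTimeout(() -> {
            Thread.sleep(60_000);
            return "never";
        }, 200);
        System.out.println("fast: " + fast.orElse("(timed out)"));
        System.out.println("slow: " + slow.orElse("(timed out)"));
    }
}
```

With a wrapper like this, a hanging document costs at most the
timeout instead of stalling the whole job, and the skipped URL can be
logged for later inspection rather than filtered out by hand.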

best regards,
Magnus

On Mon, Feb 20, 2012 at 4:16 AM, remi tassing <[email protected]> wrote:
> Hi,
>
> Could you also try the parsechecker tool on that last URL? It's
> possible that the file has a problem, or that this is simply a bug.
>
> Remi
>
> On Sunday, February 19, 2012, Magnús Skúlason <[email protected]> wrote:
>> Hi,
>>
>> According to my logs, a really long time (2+ hours) elapses between
>> parsing the last page in a segment and ParseSegment finishing, as
>> can be seen here:
>>
>> 2012-02-19 00:51:43,471 INFO  parse.ParseSegment - Parsing: http:// ....
>> 2012-02-19 03:15:18,604 INFO  parse.ParseSegment - ParseSegment:
>> finished at 2012-02-19 03:15:18, elapsed: 02:57:24
>>
>> Since the total time of the parse job is just around 3 hours, this
>> represents a huge portion of the overall time.
>>
>> Is it normal for the last step in the job to take such a long time,
>> and is there anything I can do to speed it up? I have been running
>> the generator with -topN 20000, which I wouldn't have expected to be
>> a big enough value to cause a problem. I have now reconfigured my
>> script to skip the -topN parameter to see what happens.
>>
>> best regards,
>> Magnus
>>
