Thank you. I found it myself a minute ago and was about to write the email.
The offending rule in regex-normalize.xml is:

([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)

Perhaps I was too tired yesterday and only thought I had already
disabled the regex normalizer.
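
For anyone reading the archive, here is a minimal sketch of why this rule collapses all ten outlinks into one. It assumes the rule's substitution is `$4` (only the trailing delimiter is kept); the class name `SessionIdRuleDemo` and the `normalize` helper are made up for illustration and are not part of Nutch.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class SessionIdRuleDemo {
    // Pattern exactly as quoted in this thread (Java string escaping added).
    static final Pattern SESSION_ID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // ASSUMPTION: the substitution is "$4", i.e. the matched parameter is
    // dropped and only the trailing delimiter (group 4) survives.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Three of the outlinks from the TRACE log in this thread.
        List<String> outlinks = List.of(
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951",
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936",
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929");

        // Collect the distinct URLs that remain after normalization.
        Set<String> distinct = new LinkedHashSet<>();
        for (String url : outlinks) {
            distinct.add(normalize(url));
        }

        // Prints a single entry: the case-insensitive "sid" alternative
        // matches the "sID=" inside "NewsID=", so every URL ends at "?New".
        System.out.println(distinct);
    }
}
```

Because every normalized outlink is the same string, only one can survive as a key in the 'ol' family, which matches what the scan showed.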


Regards,
Ake Tangkananond




On 8/15/12 2:13 PM, "Ferdy Galema" <[email protected]> wrote:

>Ah I think you got bit by the session ids normalization. There is a
>normalize rule in regex-normalize.xml that removes 'sid=.*' from the url.
>Looks like a bug if it strips off query parameters from values like
>'newsid=.*'. There is already a Jira for this: NUTCH-706.
>
>For now, remove the 'sid' value from line 32 in regex-normalize.xml or
>remove the line altogether to solve this.
>
>On Tue, Aug 14, 2012 at 6:29 PM, Ake Tangkananond <[email protected]>
>wrote:
>
>> Hi Ferdy,
>>
>> Thanks for your advice. I don't have any special filtering/normalizing
>> rules except the standard ones. I even tried disabling all URL
>> normalization plugins, but the result is no different.
>>
>> The url left over in the ol is
>> column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New
>>
>> Yes, it's truncated at "New". I'm wondering whether the URL is being
>> truncated to fit 49 chars, so that all the truncated URLs are identical
>> and only one is left?
>>
>> In that case, what makes the URL truncated?
>>
>>
>> Regards,
>> Ake Tangkananond
>>
>>
>>
>>
>> On 8/14/12 7:12 PM, "Ferdy Galema" <[email protected]> wrote:
>>
>> >Do you have specific filtering/normalizing rules? From all urls that are
>> >logged, what url is left over in the 'ol' field?
>> >
>> >On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond <[email protected]>
>> >wrote:
>> >
>> >> Thanks for the reply, Ferdy.
>> >>
>> >> 'db.max.outlinks.per.page' is set to 100, and HTML pages parse fine.
>> >>
>> >>
>> >> Regards,
>> >> Ake Tangkananond
>> >>
>> >>
>> >>
>> >>
>> >> On 8/14/12 6:43 PM, "Ferdy Galema" <[email protected]> wrote:
>> >>
>> >> >Hi,
>> >> >
>> >> >Judging by your logs, it might be that you have accidentally set
>> >> >'db.max.outlinks.per.page' to 1? If this is not the case, could you
>> >> >try to parse some other document types, for example an HTML page?
>> >> >Please note that I'm not using the TikaParser at all; it could be
>> >> >that there is a bug with it in Nutch2.
>> >> >
>> >> >Ferdy.
>> >> >
>> >> >On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]>
>> >> >wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I'm getting unexpected behavior from Nutch's parsing mechanism.
>> >> >> Perhaps I don't really understand Nutch well. Here is what I find
>> >> >> weird. Could you please advise?
>> >> >>
>> >> >> I crawl a website of mimeType application/rss+xml. The fetched
>> >> >> content is parsed by Tika's org.apache.tika.parser.feed.FeedParser.
>> >> >> I'm expecting it to give all outlinks in the RSS feed, but my command
>> >> >> > `scan 'webpage', {COLUMNS => 'ol'}`
>> >> >> gives only one entry in the 'ol' column family.
>> >> >>
>> >> >> Then I added code at TikaParser.java line 192, as follows, to see
>> >> >> all the outlinks:
>> >> >> > ...
>> >> >> > Parse parse = new Parse(text, title, outlinks, status);
>> >> >> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
>> >> >> >
>> >> >> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
>> >> >> >   LOG.trace(outlink.getToUrl());
>> >> >> > }
>> >> >> >
>> >> >> > if (metaTags.getNoCache()) { // not okay to cache
>> >> >> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
>> >> >> >       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
>> >> >> > }
>> >> >> >
>> >> >> > return parse;
>> >> >>
>> >> >> The result is as expected: it prints all the URLs in the content.
>> >> >> But I really wonder why only one URL is stored in the 'ol' column
>> >> >> family. Here's the log4j log:
>> >> >> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
>> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for http://www.manager.co.th/RSS/Politics/Politics.xml: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
>> >> >> >  * general tags:
>> >> >> >    - description        =       Manager Online Update ตลอด 24 ชม.
>> >> >> >  * http-equiv tags:
>> >> >> >
>> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
>> >> >> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
>> >> >> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
>> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in http://www.manager.co.th/RSS/Politics/Politics.xml
>> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
>> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
>> >> >>
>> >> >> Now I wonder why only one outlink is stored in the 'ol' column
>> >> >> family. Any advice, please?
>> >> >>
>> >> >>
>> >> >> Regards,
>> >> >> Ake Tangkananond
>> >> >>
>> >> >>
>> >> >>
>> >>
>> >>
>> >>
>>
>>
>>

