Thank you. I just found it a minute ago and was going to write the email.

([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\?|&|#|$)

Perhaps I was too tired yesterday and only thought I had already disabled the normalization regex.

Regards,
Ake Tangkananond

On 8/15/12 2:13 PM, "Ferdy Galema" <[email protected]> wrote:

>Ah I think you got bitten by the session ID normalization. There is a
>normalize rule in regex-normalize.xml that removes 'sid=.*' from the URL.
>It looks like a bug if it strips off query parameters with names like
>'newsid=.*'. There is already a Jira for this: NUTCH-706.
>
>For now, remove the 'sid' value from line 32 in regex-normalize.xml or
>remove the line altogether to solve this.
>
>On Tue, Aug 14, 2012 at 6:29 PM, Ake Tangkananond <[email protected]>
>wrote:
>
>> Hi Ferdy,
>>
>> Thanks for your advice. I don't have any special filtering/normalizing
>> rules except the standard ones. I even tried disabling all URL
>> normalization plugins, but the result is no different.
>>
>> The URL left over in the ol is
>> column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New
>>
>> Yes, it's truncated at "New". I wonder whether the URL is being
>> truncated to fit 49 chars, so that all truncated URLs are the same
>> and only one is left?
>>
>> In that case, what makes the URL truncated?
>>
>> Regards,
>> Ake Tangkananond
>>
>> On 8/14/12 7:12 PM, "Ferdy Galema" <[email protected]> wrote:
>>
>> >Do you have specific filtering/normalizing rules? Of all the URLs that
>> >are logged, which URL is left over in the 'ol' field?
>> >
>> >On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond <[email protected]>
>> >wrote:
>> >
>> >> Thanks for the reply, Ferdy.
>> >>
>> >> Variable 'db.max.outlinks.per.page' was set to 100. And I could parse
>> >> HTML fine.
>> >>
>> >> Regards,
>> >> Ake Tangkananond
>> >>
>> >> On 8/14/12 6:43 PM, "Ferdy Galema" <[email protected]> wrote:
>> >>
>> >> >Hi,
>> >> >
>> >> >Judging by your logs, it might be that you have accidentally set
>> >> >'db.max.outlinks.per.page' to 1? If this is not the case, could you
>> >> >try to parse some other document types, for example an HTML page?
>> >> >Please note that I'm not using the TikaParser at all; it could be
>> >> >that there is a bug with it in Nutch2.
>> >> >
>> >> >Ferdy.
>> >> >
>> >> >On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]>
>> >> >wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> I'm getting unexpected behavior from the Nutch parsing mechanism.
>> >> >> Perhaps I don't really understand Nutch well. Here is what I find
>> >> >> weird. Could you please advise?
>> >> >>
>> >> >> I crawl a website of mimeType application/rss+xml. The fetched
>> >> >> content is parsed by Tika's org.apache.tika.parser.feed.FeedParser.
>> >> >> I'm expecting it to give all outlinks in the RSS feed, but my command
>> >> >> > `scan 'webpage', {COLUMNS => 'ol'}`
>> >> >> gives only one ol cf entry.
>> >> >>
>> >> >> Then I added code at TikaParser.java line 192 as follows to see all
>> >> >> the outlinks:
>> >> >> > ...
>> >> >> > Parse parse = new Parse(text, title, outlinks, status);
>> >> >> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
>> >> >> >
>> >> >> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
>> >> >> >   LOG.trace(outlink.getToUrl());
>> >> >> > }
>> >> >> >
>> >> >> > if (metaTags.getNoCache()) { // not okay to cache
>> >> >> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
>> >> >> >       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
>> >> >> > }
>> >> >> >
>> >> >> > return parse;
>> >> >>
>> >> >> The result is as expected. It prints all URL links in the content.
>> >> >> But I really wonder why only one URL is stored in the ol cf.
>> >> >> Here's a log4j log:
>> >> >> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
>> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for http://www.manager.co.th/RSS/Politics/Politics.xml: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
>> >> >> > * general tags:
>> >> >> > - description = Manager Online Update ตลอด 24 ชม.
>> >> >> > * http-equiv tags:
>> >> >> >
>> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
>> >> >> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
>> >> >> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
>> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in http://www.manager.co.th/RSS/Politics/Politics.xml
>> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
>> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
>> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
>> >> >>
>> >> >> Now I wonder why only one outlink is stored in the ol column family.
>> >> >> Any advice, please?
>> >> >>
>> >> >> Regards,
>> >> >> Ake Tangkananond
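
For anyone who finds this thread later, here is a minimal sketch of why the session-ID rule leaves only one outlink. It applies the pattern quoted at the top of this message with plain java.util.regex rather than through the Nutch normalizer plugin. Two things are assumptions of mine and not taken from the thread: that the rule replaces each match with group 4 (`$4`), and that the workaround of "removing the 'sid' value" amounts to dropping `sid` from the alternation. The class and method names are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

public class SidNormalizationDemo {
    // Pattern as quoted in this thread. The inline (?i) makes "sid" match the
    // "sID" inside "NewsID".
    static final Pattern WITH_SID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)sid|phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // ASSUMPTION: one reading of the suggested workaround, with 'sid' removed.
    static final Pattern WITHOUT_SID = Pattern.compile(
        "([;_]?((?i)l|j|bv_)?((?i)phpsessid|sessionid)=.*?)(\\?|&|#|$)");

    // ASSUMPTION: the rule substitutes "$4", i.e. keeps only the trailing delimiter.
    static String normalize(String url) {
        return WITH_SID.matcher(url).replaceAll("$4");
    }

    static String normalizeWithoutSid(String url) {
        return WITHOUT_SID.matcher(url).replaceAll("$4");
    }

    public static void main(String[] args) {
        // Three of the ten outlinks from the log above.
        List<String> outlinks = List.of(
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951",
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936",
            "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929");

        // Distinct URLs after normalization, in first-seen order.
        Set<String> stored = new LinkedHashSet<>();
        for (String url : outlinks) {
            stored.add(normalize(url));
        }
        // prints [http://www.manager.co.th/asp-bin/mgrview.aspx?New]
        System.out.println(stored);

        // With 'sid' dropped from the pattern, the URL passes through unchanged.
        System.out.println(normalizeWithoutSid(outlinks.get(0)));
    }
}
```

Under those assumptions every NewsID URL normalizes to the same 49-character string ending in "?New", which would explain the single ol entry.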

