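Ah, I think you got bitten by the session ID normalization. There is a normalize rule in regex-normalize.xml that removes 'sid=.*' from the URL. The match is apparently not anchored to the start of the parameter name, so it also hits parameters like 'newsid=.*' and strips everything from 'sid=' onward. That would explain why all ten of your outlinks collapse to the same URL ending in "New" and only one 'ol' entry is left. It looks like a bug, and there is already a Jira issue for it: NUTCH-706.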
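To make the failure mode concrete, here is a minimal sketch in plain Java. The pattern is a simplified, hypothetical stand-in for the session ID rule, not a copy of what ships in regex-normalize.xml, and the class and method names are made up for illustration. It only shows how an unanchored, case-insensitive 'sid=' match can cut 'NewsID=...' down to 'New' and make distinct outlinks identical.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class SidNormalizeDemo {
    // Hypothetical, simplified stand-in for the session ID rule. The real
    // pattern in regex-normalize.xml is longer and may differ in detail.
    static final Pattern SESSION_ID = Pattern.compile("(?i)sid=.*?(&|#|$)");

    // Drop the matched "sid=..." run, keeping whatever delimiter ended it.
    static String normalize(String url) {
        return SESSION_ID.matcher(url).replaceAll("$1");
    }

    public static void main(String[] args) {
        String base = "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=";
        String[] ids = {"9550000099951", "9550000099936", "9550000099929"};

        // A set keeps one entry per distinct normalized URL.
        Set<String> normalized = new LinkedHashSet<>();
        for (String id : ids) {
            normalized.add(normalize(base + id));
        }

        // Prints: [http://www.manager.co.th/asp-bin/mgrview.aspx?New]
        System.out.println(normalized);
    }
}
```

Three different input URLs come out as one, which matches the single truncated entry you see in 'ol'.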
For now, remove the 'sid' value from line 32 in regex-normalize.xml, or remove the line altogether, to work around this.

On Tue, Aug 14, 2012 at 6:29 PM, Ake Tangkananond <[email protected]> wrote:

> Hi Ferdy,
>
> Thanks for your advice. I don't have any special filtering/normalizing
> rules except the standard ones. I even tried disabling all URL
> normalization plugins, but the result is no different.
>
> The URL left over in the ol is
> column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New
>
> Yes, it's truncated at "New". I'm wondering whether the URL is truncated
> to make it fit 49 chars, and all truncated URLs are the same, so there is
> only one left?
>
> In that case, what makes the URL truncated?
>
> Regards,
> Ake Tangkananond
>
> On 8/14/12 7:12 PM, "Ferdy Galema" <[email protected]> wrote:
>
> >Do you have specific filtering/normalizing rules? Of all the URLs that
> >are logged, which URL is left over in the 'ol' field?
> >
> >On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond <[email protected]>
> >wrote:
> >
> >> Thanks for the reply, Ferdy.
> >>
> >> Variable 'db.max.outlinks.per.page' was set to 100. And I could parse
> >> HTML fine.
> >>
> >> Regards,
> >> Ake Tangkananond
> >>
> >> On 8/14/12 6:43 PM, "Ferdy Galema" <[email protected]> wrote:
> >>
> >> >Hi,
> >> >
> >> >Judging by your logs, it might be that you have accidentally set
> >> >'db.max.outlinks.per.page' to 1? If this is not the case, could you
> >> >try to parse some other document types, for example an HTML page?
> >> >Please note that I'm not using the TikaParser at all; it could be
> >> >that there is a bug with it in Nutch2.
> >> >
> >> >Ferdy.
> >> >
> >> >On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]>
> >> >wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm getting unexpected behavior from the Nutch parsing mechanism.
> >> >> Perhaps I don't really understand Nutch well. Here is what I find
> >> >> weird. Could you please advise?
> >> >>
> >> >> I crawl a website of mimeType application/rss+xml. The fetched
> >> >> content is parsed by Tika's org.apache.tika.parser.feed.FeedParser.
> >> >> I'm expecting it to give all outlinks in the RSS feed, but my command
> >> >> > `scan 'webpage', {COLUMNS => 'ol'}`
> >> >> gives only one ol cf entry.
> >> >>
> >> >> Then I added code at TikaParser.java line 192, as follows, to see
> >> >> what all the outlinks are:
> >> >> > ...
> >> >> > Parse parse = new Parse(text, title, outlinks, status);
> >> >> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
> >> >> >
> >> >> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
> >> >> >   LOG.trace(outlink.getToUrl());
> >> >> > }
> >> >> >
> >> >> > if (metaTags.getNoCache()) { // not okay to cache
> >> >> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
> >> >> >       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
> >> >> > }
> >> >> >
> >> >> > return parse;
> >> >>
> >> >> The result is as expected. It prints all URL links in the content.
> >> >> But I really wonder why only one URL is stored in the ol cf. Here's
> >> >> a log4j log:
> >> >> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for http://www.manager.co.th/RSS/Politics/Politics.xml: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
> >> >> > * general tags:
> >> >> > - description = Manager Online Update ตลอด 24 ชม.
> >> >> > * http-equiv tags:
> >> >> >
> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
> >> >> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
> >> >> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in http://www.manager.co.th/RSS/Politics/Politics.xml
> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
> >> >>
> >> >> Now I wonder why only one outlink is stored in the ol column family.
> >> >> Any advice, please?
> >> >>
> >> >> Regards,
> >> >> Ake Tangkananond

