Thanks for the reply, Ferdy. The property 'db.max.outlinks.per.page' was set to 100, and I could parse HTML fine.

Regards,
Ake Tangkananond

On 8/14/12 6:43 PM, "Ferdy Galema" <[email protected]> wrote:

> Hi,
>
> Judging by your logs, it might be that you have accidentally set
> 'db.max.outlinks.per.page' to 1? If this is not the case, could you try to
> parse some other document types, for example an HTML page? Please note that
> I'm not using the TikaParser at all; it could be that there is a bug with
> it in Nutch2.
>
> Ferdy.
>
> On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]>
> wrote:
>
>> Hi,
>>
>> I'm getting unexpected behavior from the Nutch parsing mechanism. Perhaps
>> I don't really understand Nutch well. Here is what I find weird. Could
>> you please advise?
>>
>> I crawl a website of mimeType application/rss+xml. The fetched content is
>> parsed by Tika's org.apache.tika.parser.feed.FeedParser. I'm expecting it
>> to give all outlinks in the RSS feed, but my command
>>
>> > `scan 'webpage', {COLUMNS => 'ol'}`
>>
>> gives only one ol cf entry.
>>
>> Then I added code at TikaParser.java line 192, as follows, to see what
>> all the outlinks are:
>>
>> > …
>> > Parse parse = new Parse(text, title, outlinks, status);
>> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
>> >
>> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
>> >   LOG.trace(outlink.getToUrl());
>> > }
>> >
>> > if (metaTags.getNoCache()) { // not okay to cache
>> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
>> >       ByteBuffer.wrap(Bytes
>> >           .toBytes(cachingPolicy)));
>> > }
>> >
>> > return parse;
>>
>> The result is as expected: it prints all URL links in the content. But I
>> really wonder why only one URL is stored in the ol column family.
>>
>> Here's a log4j log:
>>
>> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
>> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for http://www.manager.co.th/RSS/Politics/Politics.xml: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
>> > * general tags:
>> > - description = Manager Online Update ตลอด 24 ชม.
>> > * http-equiv tags:
>> >
>> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
>> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
>> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
>> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in http://www.manager.co.th/RSS/Politics/Politics.xml
>> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
>> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
>>
>> Now I wonder why only one outlink is stored in the ol column family. Any
>> advice, please?
>>
>> Regards,
>> Ake Tangkananond
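
To make the hypothesis about 'db.max.outlinks.per.page' concrete, here is a minimal Java sketch of how a per-page cap would truncate a list of parsed outlinks. This is not Nutch code: the class `OutlinkCapSketch`, the helper `capOutlinks`, and the example.com URLs are hypothetical, and treating a negative cap as "no limit" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkCapSketch {

    // Hypothetical helper (not part of Nutch): keep at most maxOutlinks URLs.
    // Assumption of this sketch: a negative cap means "no limit".
    static List<String> capOutlinks(List<String> outlinks, int maxOutlinks) {
        if (maxOutlinks < 0 || outlinks.size() <= maxOutlinks) {
            return new ArrayList<>(outlinks);
        }
        return new ArrayList<>(outlinks.subList(0, maxOutlinks));
    }

    // Builds ten placeholder URLs, standing in for the ten feed items
    // reported by the parser in the log.
    static List<String> sampleOutlinks() {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            urls.add("http://example.com/item?id=" + i);
        }
        return urls;
    }

    public static void main(String[] args) {
        List<String> found = sampleOutlinks();
        // A cap of 1 would leave a single outlink; a cap of 100 keeps all ten.
        System.out.println("cap=1   kept " + capOutlinks(found, 1).size());
        System.out.println("cap=100 kept " + capOutlinks(found, 100).size());
    }
}
```

Under these assumptions a cap of 100 leaves all ten parsed outlinks untouched, so with the reported setting the cap alone would not explain a single stored entry.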

