Hi,

Judging by your logs, could it be that you have accidentally set 'db.max.outlinks.per.page' to 1? If that is not the case, could you try parsing some other document types, for example an HTML page? Please note that I'm not using the TikaParser at all; it could be that there is a bug with it in Nutch 2.
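
To illustrate why a cap of 1 would produce exactly this symptom, here is a minimal sketch of how a per-page outlink limit truncates the parsed list. It is only an illustration, not Nutch's actual code: the class `OutlinkCapSketch` and the method `capOutlinks` are hypothetical names, and treating a negative cap as "no limit" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of Nutch: illustrates a per-page outlink cap.
class OutlinkCapSketch {

  // Returns at most maxOutlinks entries, preserving the parsed order.
  // Assumption of this sketch: a negative cap means "no limit".
  static List<String> capOutlinks(List<String> outlinks, int maxOutlinks) {
    if (maxOutlinks < 0 || outlinks.size() <= maxOutlinks) {
      return new ArrayList<>(outlinks);
    }
    return new ArrayList<>(outlinks.subList(0, maxOutlinks));
  }

  public static void main(String[] args) {
    // Three of the URLs from the trace log, used as sample input.
    List<String> parsed = Arrays.asList(
        "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951",
        "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936",
        "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929");

    // With a cap of 1, only the first parsed outlink survives.
    for (String url : capOutlinks(parsed, 1)) {
      System.out.println(url);
    }
  }
}
```

If something like this is happening, the parser would still log all 10 links (as in your trace), and only the first would end up in the store.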
Ferdy.

On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]> wrote:

> Hi,
>
> I'm getting unexpected behavior from the Nutch parsing mechanism. Perhaps I
> don't really understand Nutch well. Here is what I find weird. Could you
> please advise?
>
> I crawl a website of mimeType application/rss+xml. The fetched content is
> parsed by Tika's org.apache.tika.parser.feed.FeedParser. I'm expecting it to
> give all outlinks in the RSS feed, but my command
>
> `scan 'webpage', {COLUMNS => 'ol'}`
>
> gives only one ol cf entry.
>
> Then I added some code at TikaParser.java line 192, as follows, to see what
> all the outlinks are:
>
> > …
> > Parse parse = new Parse(text, title, outlinks, status);
> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
> >
> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
> >   LOG.trace(outlink.getToUrl());
> > }
> >
> > if (metaTags.getNoCache()) { // not okay to cache
> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
> >       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
> > }
> >
> > return parse;
>
> The result is as expected. It prints all URL links in the content. But I
> really wonder why only one URL is stored in the ol column family. Here's the
> log4j log:
>
> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for http://www.manager.co.th/RSS/Politics/Politics.xml: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
> > * general tags:
> > - description = Manager Online Update ตลอด 24 ชม.
> > * http-equiv tags:
> >
> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in http://www.manager.co.th/RSS/Politics/Politics.xml
> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser - http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
>
> Now I wonder why only one outlink is stored in the ol column family. Any
> advice, please?
>
> Regards,
> Ake Tangkananond

