Hi,

I'm seeing unexpected behavior from Nutch's parsing mechanism. Perhaps I
don't understand Nutch well enough. Here is what I find odd. Could you
please advise?

I crawl a URL whose MIME type is application/rss+xml. The fetched content is
parsed by Tika's org.apache.tika.parser.feed.FeedParser. I expect it to
yield all the outlinks in the RSS feed, but my command
> `scan 'webpage', {COLUMNS => 'ol'}`
returns only one entry in the ol column family (cf).

I then added some code at line 192 of TikaParser.java, as shown here, to log
all the outlinks:
> …  
> Parse parse = new Parse(text, title, outlinks, status);
> parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
> 
> for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
>   LOG.trace(outlink.getToUrl());
> }
> 
> if (metaTags.getNoCache()) { // not okay to cache
>   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
>       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
> }
> 
> return parse;

The result is as expected: it prints every URL linked in the content. But I
still wonder why only one URL is stored in the ol column family. Here's the
log4j output:
> 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser
> org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
> 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for
> http://www.manager.co.th/RSS/Politics/Politics.xml: base=null, noCache=false,
> noFollow=false, noIndex=false, refresh=false, refreshHref=null
>  * general tags:
>    - description        =       Manager Online Update ตลอด 24 ชม.
>  * http-equiv tags:
> 
> 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
> 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
> 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
> 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in
> http://www.manager.co.th/RSS/Politics/Politics.xml
> 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
> 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
> 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843

So why is only one outlink stored in the ol column family? Any advice,
please?


Regards,
Ake Tangkananond

