Thanks for the reply, Ferdy.

The property 'db.max.outlinks.per.page' was set to 100, and I can parse HTML
pages fine.
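
To show why I don't think the limit explains it, here is a minimal sketch of
how a per-page outlink cap would behave. The class and method names are
hypothetical and this is not Nutch's actual code. It assumes the cap simply
keeps the first N outlinks and that a negative value means "no limit", which
is how I read the property description. The URLs are placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class OutlinkCapSketch {

    // Hypothetical helper (not part of Nutch): keep at most maxOutlinks
    // entries. A negative limit is assumed to mean "keep everything".
    public static List<String> capOutlinks(List<String> outlinks, int maxOutlinks) {
        if (maxOutlinks < 0) {
            return new ArrayList<>(outlinks);
        }
        int keep = Math.min(outlinks.size(), maxOutlinks);
        return new ArrayList<>(outlinks.subList(0, keep));
    }

    public static void main(String[] args) {
        // Ten placeholder outlinks, matching the count reported by the parser.
        List<String> outlinks = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            outlinks.add("http://example.com/item?id=" + i);
        }

        // With a limit of 100, all ten outlinks survive.
        System.out.println(capOutlinks(outlinks, 100).size()); // prints 10

        // Only a limit of 1 would leave a single outlink.
        System.out.println(capOutlinks(outlinks, 1).size()); // prints 1
    }
}
```

So with the limit at 100 and only 10 outlinks found, I would expect all 10 to
be stored if the cap were the only thing involved.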


Regards,
Ake Tangkananond




On 8/14/12 6:43 PM, "Ferdy Galema" <[email protected]> wrote:

>Hi,
>
>Judging by your logs, it might be that you have accidentally set
>'db.max.outlinks.per.page' to 1? If this is not the case, could you try to
>parse some other document types, for example an HTML page? Please note that
>I'm not using the TikaParser at all; it could be that there is a bug with
>it in Nutch2.
>
>Ferdy.
>
>On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]>
>wrote:
>
>> Hi,
>>
>> I'm getting unexpected behavior from the Nutch parsing mechanism.
>> Perhaps I don't really understand Nutch well. Here is what I find
>> weird. Could you please advise?
>>
>> I crawl a website of mimeType application/rss+xml. The fetched content
>> is parsed by Tika's org.apache.tika.parser.feed.FeedParser. I'm
>> expecting it to give all outlinks in the RSS feed, but my command
>> > `scan 'webpage', {COLUMNS => 'ol'}`
>> gives only one entry in the 'ol' column family.
>>
>> Then I added some code at TikaParser.java line 192, as follows, to see
>> what all the outlinks are:
>> > …
>> > Parse parse = new Parse(text, title, outlinks, status);
>> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
>> >
>> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
>> >   LOG.trace(outlink.getToUrl());
>> > }
>> >
>> > if (metaTags.getNoCache()) { // not okay to cache
>> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
>> >       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
>> > }
>> >
>> > return parse;
>>
>> The result is as expected: it prints all the URL links in the content.
>> But I really wonder why only one URL is stored in the 'ol' column
>> family. Here's the log4j log:
>> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser
>> > org.apache.tika.parser.feed.FeedParser for mime-type
>> > application/rss+xml
>> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for
>> > http://www.manager.co.th/RSS/Politics/Politics.xml: base=null,
>> > noCache=false,
>> > noFollow=false, noIndex=false, refresh=false, refreshHref=null
>> >  * general tags:
>> >    - description        =       Manager Online Update ตลอด 24 ชม.
>> >  * http-equiv tags:
>> >
>> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
>> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
>> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
>> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in
>> > http://www.manager.co.th/RSS/Politics/Politics.xml
>> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
>> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
>> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
>>
>> Now I wonder why only one outlink is stored in the 'ol' column family.
>> Any advice, please?
>>
>>
>> Regards,
>> Ake Tangkananond
>>
>>
>>

