Hi Ferdy,

Thanks for your advice. I don't have any special filtering/normalizing
rules except the standard ones. I even tried disabling all URL normalization
plugins, but the result is no different.

The only URL left over in ol is
column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New

Yes, it's truncated at "New". I'm wondering whether the URL is being
truncated to fit 49 characters. If so, all the truncated URLs would be
identical, which would explain why only one is left.

In that case, what makes the URL truncated?
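
To make the hypothesis concrete, here is a minimal sketch in plain Java. It is
not Nutch or HBase code: the class name TruncationSketch, the method
truncatedKeys and the 49-character limit are my own assumptions for
illustration. It takes three of the outlinks from my log, cuts each one to 49
characters, and collects the results as unique keys:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Nutch.
public class TruncationSketch {

  // Three of the ten outlinks printed in the TRACE log.
  public static final List<String> LOGGED_OUTLINKS = List.of(
      "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951",
      "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936",
      "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929");

  // Cut every URL to at most maxLen characters and keep the distinct results.
  public static Set<String> truncatedKeys(List<String> urls, int maxLen) {
    Set<String> keys = new LinkedHashSet<>();
    for (String url : urls) {
      keys.add(url.length() > maxLen ? url.substring(0, maxLen) : url);
    }
    return keys;
  }

  public static void main(String[] args) {
    Set<String> keys = truncatedKeys(LOGGED_OUTLINKS, 49); // 49 is the assumed limit
    // Prints: 1 distinct key(s): [http://www.manager.co.th/asp-bin/mgrview.aspx?New]
    System.out.println(keys.size() + " distinct key(s): " + keys);
  }
}
```

This only shows that a 49-character cut would leave a single entry matching
the one I see in ol. It says nothing about where such a cut would happen.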


Regards,
Ake Tangkananond




On 8/14/12 7:12 PM, "Ferdy Galema" <[email protected]> wrote:

>Do you have specific filtering/normalizing rules? Of all the URLs that are
>logged, which URL is left over in the 'ol' field?
>
>On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond <[email protected]>
>wrote:
>
>> Thanks for the reply, Ferdy.
>>
>> 'db.max.outlinks.per.page' was set to 100, and I could parse HTML
>> fine.
>>
>>
>> Regards,
>> Ake Tangkananond
>>
>>
>>
>>
>> On 8/14/12 6:43 PM, "Ferdy Galema" <[email protected]> wrote:
>>
>> >Hi,
>> >
>> >Judging by your logs, it might be that you have accidentally set
>> >'db.max.outlinks.per.page' to 1? If this is not the case, could you try
>> >to parse some other document types, for example an HTML page? Please
>> >note that I'm not using the TikaParser at all; it could be that there is
>> >a bug with it in Nutch2.
>> >
>> >Ferdy.
>> >
>> >On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]>
>> >wrote:
>> >
>> >> Hi,
>> >>
>> >> I'm getting unexpected behavior from the Nutch parsing mechanism.
>> >> Perhaps I don't really understand Nutch well. Here is what I find
>> >> weird. Could you please advise?
>> >>
>> >> I crawl a website of mimeType application/rss+xml. The fetched content
>> >> is parsed by Tika's org.apache.tika.parser.feed.FeedParser. I'm
>> >> expecting it to give all outlinks in the RSS feed, but my command
>> >> > `scan 'webpage', {COLUMNS => 'ol'}`
>> >> gives only one ol cf entry.
>> >>
>> >> Then I added some code at TikaParser.java line 192, as follows, to see
>> >> what all the outlinks are:
>> >> > ...
>> >> > Parse parse = new Parse(text, title, outlinks, status);
>> >> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
>> >> >
>> >> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
>> >> >   LOG.trace(outlink.getToUrl());
>> >> > }
>> >> >
>> >> > if (metaTags.getNoCache()) { // not okay to cache
>> >> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
>> >> >       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
>> >> > }
>> >> >
>> >> > return parse;
>> >>
>> >> The result is as expected: it prints all URL links in the content. But
>> >> I really wonder why only one URL is stored in the ol column family.
>> >> Here's the log4j log:
>> >> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser
>> >> > org.apache.tika.parser.feed.FeedParser for mime-type application/rss+xml
>> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for
>> >> > http://www.manager.co.th/RSS/Politics/Politics.xml: base=null,
>> >> > noCache=false, noFollow=false, noIndex=false, refresh=false,
>> >> > refreshHref=null
>> >> >  * general tags:
>> >> >    - description        =       Manager Online Update ตลอด 24 ชม.
>> >> >  * http-equiv tags:
>> >> >
>> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
>> >> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
>> >> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
>> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in
>> >> > http://www.manager.co.th/RSS/Politics/Politics.xml
>> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
>> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
>> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
>> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
>> >>
>> >> Now I wonder why only one outlink is stored in the ol column family. Any
>> >> advice, please?
>> >>
>> >>
>> >> Regards,
>> >> Ake Tangkananond
>> >>
>> >>
>> >>
>>
>>
>>

