Re: Tika's outlink is not as expected

Ferdy Galema Wed, 15 Aug 2012 00:13:33 -0700

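Ah, I think you got bitten by the session ID normalization. There is a
normalize rule in regex-normalize.xml that removes 'sid=.*' from the URL.
It looks like a bug that it also strips query parameters whose names merely
end in 'sid', such as 'newsid=.*'. That would turn every 'NewsID=...' link
into the same '...?New' URL, which is why only one outlink is stored. There
is already a Jira issue for this: NUTCH-706.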


For now, remove the 'sid' value from line 32 in regex-normalize.xml or
remove the line altogether to solve this.
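
To illustrate the effect, here is a minimal Java sketch. Note that the
pattern in it is a simplified stand-in for the session ID rule, not the exact
expression from regex-normalize.xml, and that the class name, the method
names and the stricter variant are hypothetical (the stricter variant is not
the NUTCH-706 patch). It only shows how an unanchored 'sid=' match collapses
the NewsID links into one URL:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

public class SidNormalizeDemo {
    // ASSUMPTION: simplified stand-in for the session ID rule; the real
    // expression in regex-normalize.xml is more elaborate.
    static final Pattern LOOSE = Pattern.compile("(?i)sid=.*?(&|#|$)");

    // HYPOTHETICAL stricter variant: 'sid' must directly follow '?', '&' or ';'.
    static final Pattern STRICT =
            Pattern.compile("(?i)(?<=[?&;])sid=.*?(&|#|$)");

    // Remove the matched parameter but keep the delimiter that ended it.
    static String normalize(Pattern rule, String url) {
        return rule.matcher(url).replaceAll("$1");
    }

    // Normalize every URL and collect the distinct results.
    static Set<String> distinct(Pattern rule, String... urls) {
        Set<String> out = new LinkedHashSet<>();
        for (String url : urls) {
            out.add(normalize(rule, url));
        }
        return out;
    }

    public static void main(String[] args) {
        String base = "http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=";
        String[] urls = { base + "9550000099951", base + "9550000099936" };
        System.out.println(distinct(LOOSE, urls));   // loose rule
        System.out.println(distinct(STRICT, urls));  // stricter rule
    }
}
```

With the loose pattern both links normalize to the same string ending in
'?New', so only one distinct outlink survives. With the stricter pattern the
NewsID links are left untouched, while a real '?sid=...' parameter is still
removed.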

On Tue, Aug 14, 2012 at 6:29 PM, Ake Tangkananond <[email protected]> wrote:

> Hi Ferdy,
>
> Thanks for your advice. I don't have any special filtering/normalizing
> rules except the standard ones. I even tried disabling all URL
> normalization plugins, but the result is no different.
>
> The url left over in the ol is
> column=ol:http://www.manager.co.th/asp-bin/mgrview.aspx?New
>
> Yes, it's truncated at "New". I wonder whether the URL is being truncated
> to fit 49 chars, so that all truncated URLs are identical and only one is
> left?
>
> In that case, what makes the URL truncated?
>
>
> Regards,
> Ake Tangkananond
>
>
>
>
> On 8/14/12 7:12 PM, "Ferdy Galema" <[email protected]> wrote:
>
> >Do you have specific filtering/normalizing rules? Of all the URLs that
> >are logged, which URL is left over in the 'ol' field?
> >
> >On Tue, Aug 14, 2012 at 1:49 PM, Ake Tangkananond <[email protected]>
> >wrote:
> >
> >> Thanks for the reply, Ferdy.
> >>
> >> Variable 'db.max.outlinks.per.page' was set to 100. And I could parse
> >> HTML fine.
> >>
> >>
> >> Regards,
> >> Ake Tangkananond
> >>
> >>
> >>
> >>
> >> On 8/14/12 6:43 PM, "Ferdy Galema" <[email protected]> wrote:
> >>
> >> >Hi,
> >> >
> >> >Judging by your logs, it might be that you have accidentally set
> >> >'db.max.outlinks.per.page' to 1? If this is not the case, could you
> >> >try to parse some other document types, for example an HTML page?
> >> >Please note that I'm not using the TikaParser at all; it could be
> >> >that there is a bug with it in Nutch2.
> >> >
> >> >Ferdy.
> >> >
> >> >On Tue, Aug 14, 2012 at 1:15 PM, Ake Tangkananond <[email protected]>
> >> >wrote:
> >> >
> >> >> Hi,
> >> >>
> >> >> I'm getting unexpected behavior from the Nutch parsing mechanism.
> >> >> Perhaps I don't really understand Nutch well. Here is what I find
> >> >> weird. Could you please advise?
> >> >>
> >> >> I crawl a website of mimeType application/rss+xml. The fetched
> >> >> content is parsed by Tika's org.apache.tika.parser.feed.FeedParser.
> >> >> I'm expecting it to give all outlinks in the RSS feed, but my command
> >> >> > `scan 'webpage', {COLUMNS => 'ol'}`
> >> >> gives only one entry in the ol column family.
> >> >>
> >> >> Then I added some code at TikaParser.java line 192, as follows, to
> >> >> see what all the outlinks are:
> >> >> > ...
> >> >> > Parse parse = new Parse(text, title, outlinks, status);
> >> >> > parse = htmlParseFilters.filter(url, page, parse, metaTags, root);
> >> >> >
> >> >> > for (Outlink outlink : parse.getOutlinks()) { // THIS IS LINE 192
> >> >> >   LOG.trace(outlink.getToUrl());
> >> >> > }
> >> >> >
> >> >> > if (metaTags.getNoCache()) { // not okay to cache
> >> >> >   page.putToMetadata(new Utf8(Nutch.CACHING_FORBIDDEN_KEY),
> >> >> >       ByteBuffer.wrap(Bytes.toBytes(cachingPolicy)));
> >> >> > }
> >> >> >
> >> >> > return parse;
> >> >>
> >> >> The result is as expected: it prints all URL links in the content.
> >> >> But I really wonder why only one URL is stored in the ol column
> >> >> family. Here's the log4j log:
> >> >> > 2012-08-14 18:03:50,172 DEBUG tika.TikaParser - Using Tika parser
> >> >> > org.apache.tika.parser.feed.FeedParser for mime-type
> >> >> > application/rss+xml
> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Meta tags for
> >> >> > http://www.manager.co.th/RSS/Politics/Politics.xml: base=null,
> >> >> > noCache=false,
> >> >> > noFollow=false, noIndex=false, refresh=false, refreshHref=null
> >> >> >  * general tags:
> >> >> >    - description        =       Manager Online Update ตลอด 24 ชม.
> >> >> >  * http-equiv tags:
> >> >> >
> >> >> > 2012-08-14 18:03:50,201 TRACE tika.TikaParser - Getting text...
> >> >> > 2012-08-14 18:03:50,205 TRACE tika.TikaParser - Getting title...
> >> >> > 2012-08-14 18:03:50,206 TRACE tika.TikaParser - Getting links...
> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser - found 10 outlinks in
> >> >> > http://www.manager.co.th/RSS/Politics/Politics.xml
> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099951
> >> >> > 2012-08-14 18:03:50,207 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099936
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099929
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099913
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099899
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099882
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099874
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099870
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099859
> >> >> > 2012-08-14 18:03:50,208 TRACE tika.TikaParser -
> >> >> > http://www.manager.co.th/asp-bin/mgrview.aspx?NewsID=9550000099843
> >> >>
> >> >> Now I wonder why only one outlink is stored in the ol column family.
> >> >> Any advice, please?
> >> >>
> >> >>
> >> >> Regards,
> >> >> Ake Tangkananond
> >> >>
> >> >>
> >> >>
> >>
> >>
> >>
>
>
>
