Strange, I get none of the http://id.wikipedia.org/wiki/mediawiki.. URLs as outlinks using either parse-html or parse-tika on 1.4-dev. I also tried branch 1.3 but don't see any outlinks with mediawiki substrings, even when I disabled parser.html.outlinks.ignore_tags.
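For what it's worth, the reported URLs are exactly what relative-link resolution produces when a bare token such as mediawiki.page.startup is resolved against the page URL. Whether the parser actually picks such tokens up from inline JavaScript is an assumption, not something confirmed in this thread. The sketch below only illustrates the resolution step, using Python's standard urllib.parse.urljoin rather than any Nutch code; the helper name resolve_outlink is hypothetical.

```python
from urllib.parse import urljoin

# Page URL taken from the quoted report.
BASE = "http://id.wikipedia.org/wiki/Halaman_Utama"


def resolve_outlink(base: str, token: str) -> str:
    """Resolve a token as a relative link against base (hypothetical helper)."""
    return urljoin(base, token)


# Tokens copied from the reported outlinks; assumed (not confirmed) to be
# module names that appear inside the page's inline JavaScript.
for token in ("mediawiki.page.startup", "ext.flaggedRevs.advanced"):
    print(resolve_outlink(BASE, token))
```

If this is what happens, the last path segment (Halaman_Utama) is replaced by the token, which would explain why every weird outlink sits under /wiki/.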

On Wednesday 12 October 2011 19:27:16 Michael.Sulistijo wrote:
> Hi all,
>
> I seem to have found a problem with the Nutch 1.3 parser: it is currently
> generating weird outlinks. The example I will use is
> http://id.wikipedia.org/wiki/Halaman_Utama. First, I used the
> ParserChecker class to see how many outlinks the page has, and found
> that it generates some weird outlinks like:
>
> ---------
> Url
> ---------------
> http://id.wikipedia.org/wiki/Halaman_Utama---------
> ParseData
> ---------
> Version: 5
> Status: success(1,0)
> Title: Wikipedia bahasa Indonesia, ensiklopedia bebas
> Outlinks: 369
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.startup anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.user anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.util anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.page.ready anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.wikibits anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.ajax anchor:
>   outlink: toUrl: http://id.wikipedia.org/wiki/mediawiki.legacy.mwsuggest anchor:
>   outlink: toUrl: *http://id.wikipedia.org/wiki/ext.flaggedRevs.advanced* anchor:
> ...
>
> Then I checked the page in the browser and tried to find the links above,
> but found nothing. However, when I searched the page for
> "ext.flaggedRevs.advanced", this is what I found:
> ...
>
> ...
>
> I have set the "parser.html.outlinks.ignore_tags" property in
> "nutch-site.xml" to ignore the script tag. Also, I have looked at other
> threads like:
> http://lucene.472066.n3.nabble.com/Consider-relative-outlinks-conditionally-as-absolute-URL-td3350098.html
> lucene.472066.n3.nabble.com/Outlinks-with-embedded-params-td3332396.html
>
> I even applied a patch to my Nutch:
> https://issues.apache.org/jira/browse/NUTCH-1115
>
> However, it is still showing those links that lead to "empty/non-existent"
> Wikipedia articles. Can anyone shed light on how to set up the Nutch 1.3
> parser to exclude those kinds of links?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Nutch-1-3-Parser-generating-weird-outlinks-tp3416347p3416347.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

