Markus, Shaya, thanks for the responses. I was hoping this scenario had come up before and that there was a ready-made solution.
As for parsers, I tried both Nutch's HTML parser plugin and the Tika parser plugin. I used

./bin/nutch org.apache.nutch.parse.ParserChecker http://crawledsite.blogspot.com/ > parserresult.txt

to check the outlinks from a page. Neither parser yields the non-anchored URLs.

To cater for my scenario, should I be looking into OutlinkExtractor, or into the parser plugins (html, tika), to get the outlinks?

Background: I am pretty much a Nutch user who writes a plugin here and there and tweaks configs to get things done. I do not have an in-depth understanding of the Nutch code base, but I am open to digging further to get this working.

Thanks,

Ye

On Mon, Aug 27, 2012 at 12:25 AM, Markus Jelsma <[email protected]> wrote:

> Nutch's parser relies on Nutch's OutlinkExtractor if the underlying parser
> did not yield any outlinks.
>
> -----Original message-----
> > From: Ye T Thet <[email protected]>
> > Sent: Sun 26-Aug-2012 18:09
> > To: [email protected]
> > Subject: Extracting non anchored URLs from page
> >
> > Hi Folks,
> >
> > I am using Nutch (1.2 and 1.5) to crawl some websites.
> >
> > The short question: is there any way, or any plug-in, to extract URLs
> > which are not in anchor tags in a page?
> >
> > The long question:
> >
> > The crawler is not extracting some of the URLs from the page. After
> > investigating, I noticed that the URLs are not technically links, i.e. not
> > inside anchor elements. The URLs are inside the values of other HTML tags
> > used by JavaScript.
> >
> > Following is the snippet of the contents.
> >
> > <div class='widget-content'>
> > <h2 class="sidebar-title">
> > <form action="../" name="bloglinkform">
> > <select onchange="this.form.window_namer.value++;if
> > (this.options[this.selectedIndex].value!='MORE')
> > {window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
> > name="bloglinkselect">
> > <option selected="selected" value="MORE"/>text 1
> > <option value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text 2
> > <option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text 3
> > </select>
> > <input value="1" name="window_namer" type="hidden"/>
> > </form></h2>
> > </div>
> >
> > As mentioned above, the URLs are not in HTML anchor tags, but are rather
> > valid URLs used by JavaScript when the user clicks the items. As a result,
> > those addresses are not crawled. To make matters worse, there is no site
> > map or index page from which such URLs can be reached, other than the
> > above-mentioned links.
> >
> > Has anyone encountered such cases and figured out a solution? Any tips
> > or direction would be great.
> >
> > Thanks,
> >
> > Ye
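
To make the extraction asked about in this thread concrete, here is a minimal sketch, independent of Nutch itself, that pulls absolute URLs out of <option value="..."> attributes with a regular expression. The class name OptionUrlExtractor and its extract method are hypothetical and not part of Nutch; wiring this into a crawl would presumably mean calling similar logic from a custom parse plugin, which is not shown here. It assumes quoted attribute values and absolute http/https URLs, and a regex is only a rough stand-in for real HTML parsing.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: collects URLs found in
// the value attribute of <option> tags.
final class OptionUrlExtractor {

    // Matches value="..." or value='...' inside an <option ...> tag.
    // DOTALL lets a value span a line break, as in the wrapped snippet.
    private static final Pattern OPTION_VALUE = Pattern.compile(
            "<option\\b[^>]*?\\bvalue\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the distinct absolute http/https URLs, in document order.
    static List<String> extract(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = OPTION_VALUE.matcher(html);
        while (m.find()) {
            String value = m.group(2).trim();
            // Skip placeholder values such as "MORE".
            if (value.startsWith("http://") || value.startsWith("https://")) {
                urls.add(value);
            }
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) {
        // Sample input modelled on the snippet quoted in this thread.
        String html =
                "<select name=\"bloglinkselect\">"
              + "<option selected=\"selected\" value=\"MORE\"/>text 1"
              + "<option value=\"http://craweledsite.blogspot.com/2007/11/blog-post_7360.html\"/>text 2"
              + "<option value=\"http://craweledsite.blogspot.com/2007/09/blog-post_10.html\"/>text 3"
              + "</select>";
        for (String url : extract(html)) {
            System.out.println(url);
        }
    }
}
```

Each string returned by extract would then still need to be turned into an outlink and handed to the crawler; how best to do that inside Nutch is exactly the open question above.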

