Nutch's parser relies on Nutch's OutlinkExtractor if the underlying parser
did not yield any outlinks.
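
For pages like the one quoted below, where the URLs sit in option value
attributes rather than in anchor tags, a plain regex pass over the raw page
text will still pick them up. The class below is a minimal, standalone sketch
of that idea; the regex and the class name are my own simplification, not the
actual pattern or API shipped with OutlinkExtractor.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Standalone sketch of regex-based outlink extraction, in the spirit of the
 * OutlinkExtractor fallback: scan the raw page text for absolute http/https
 * URLs regardless of whether they sit inside an anchor tag, so URLs hidden
 * in <option value="..."> attributes are found as well.
 */
public class NonAnchorUrlExtractor {

    // Simplified URL pattern (an assumption): scheme, host, optional path.
    private static final Pattern URL_PATTERN =
        Pattern.compile("https?://[\\w.-]+(?:/[^\\s\"'<>]*)?");

    public static List<String> extractUrls(String pageText) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(pageText);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String snippet = "<option value=\"http://craweledsite.blogspot.com/"
            + "2007/11/blog-post_7360.html\"/>text 2";
        // Prints the option-value URL even though it is not inside an <a> tag.
        extractUrls(snippet).forEach(System.out::println);
    }
}

Keep in mind that the regex fallback only kicks in when the parser produces
no outlinks at all; if the page already yields some anchors, a custom parse
filter (HtmlParseFilter) plugin would be the usual place to add extraction of
these extra URLs.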
 
-----Original message-----
> From:Ye T Thet <[email protected]>
> Sent: Sun 26-Aug-2012 18:09
> To: [email protected]
> Subject: Extracting non anchored URLs from page
> 
> Hi Folks,
> 
> I am using Nutch (1.2 and 1.5) to crawl some websites.
> 
> The short question is: is there any way, or a plug-in, to extract URLs
> which are not in anchor tags on a page?
> 
> The long question:
> 
> The crawler is not extracting some of the URLs from the page. After
> investigating, I noticed that the URLs are technically not links, i.e. not
> inside anchor elements. The URLs are inside the value attributes of other
> HTML tags and are used by JavaScript.
> 
> Following is a snippet of the content.
> 
> <div class='widget-content'>
> <h2 class="sidebar-title">
> <form action="../" name="bloglinkform">
> <select onchange="this.form.window_namer.value++;if
> (this.options[this.selectedIndex].value!='MORE')
> {window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
> name="bloglinkselect">
> <option selected="selected" value="MORE"/>text 1
> <option 
> value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text
> 2
> <option 
> value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text
> 3
> </select>
> <input value="1" name="window_namer" type="hidden"/>
> </form></h2>
> </div>
> 
> As mentioned above, the URLs are not in HTML anchor tags, but they are
> valid URLs used by JavaScript when the user clicks the items. As a result,
> those addresses are not crawled. To make matters worse, there is no site
> map or index page where such URLs can be reached other than through the
> above-mentioned links.
> 
> Has anyone encountered such cases and figured out a solution? Any tips
> or direction would be great.
> 
> Thanks,
> 
> Ye
> 
