Extracting non anchored URLs from page

Ye T Thet Sun, 26 Aug 2012 09:06:41 -0700

Hi Folks,

I am using Nutch (1.2 and 1.5) to crawl a website.


The short question: is there any way, or any plug-in, to extract URLs
that are not inside anchor tags in a page?

The long question:

The crawler is not extracting some of the URLs from the page. After
investigating, I noticed that these URLs are not technically links, i.e.
they are not inside anchor elements. They appear in the value attributes
of other HTML tags and are used by JavaScript.

The following is a snippet of the page content.

<div class='widget-content'>
<h2 class="sidebar-title">
<form action="../" name="bloglinkform">
<select onchange="this.form.window_namer.value++;if
(this.options[this.selectedIndex].value!='MORE')
{window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
name="bloglinkselect">
<option selected="selected" value="MORE"/>text 1
<option value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text 2
<option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text 3
</select>
<input value="1" name="window_namer" type="hidden"/>
</form></h2>
</div>

As mentioned above, the URLs are not in HTML anchor tags; they are valid
URLs that JavaScript opens when the user selects an item. As a result,
those addresses are not crawled. To make matters worse, there is no site
map or index page from which these URLs can be reached other than the
links mentioned above.
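
To make clear what kind of extraction I am after, here is a minimal
sketch in plain Python, outside Nutch. It collects absolute http(s) URLs
from any attribute value except <a href>. The names AttributeUrlExtractor
and extract_non_anchor_urls are made up for illustration, nothing here is
a Nutch API, and the sample markup is a shortened copy of the snippet
above.

```python
from html.parser import HTMLParser


class AttributeUrlExtractor(HTMLParser):
    """Collect absolute http(s) URLs from attribute values, skipping <a href>."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <option .../>.
        for name, value in attrs:
            if value is None:
                continue  # attribute without a value, e.g. <option selected>
            if tag == "a" and name == "href":
                continue  # anchor links are already picked up by the crawler
            candidate = value.strip()
            # Assumption: only attribute values that are themselves absolute
            # URLs are wanted; URLs buried inside script text are ignored.
            if candidate.startswith(("http://", "https://")):
                if candidate not in self.urls:
                    self.urls.append(candidate)


def extract_non_anchor_urls(html):
    """Return the non-anchor URLs found in html, in document order."""
    parser = AttributeUrlExtractor()
    parser.feed(html)
    parser.close()
    return parser.urls


sample = """
<select name="bloglinkselect">
<option selected="selected" value="MORE"/>text 1
<option value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text 2
<option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text 3
</select>
<input value="1" name="window_namer" type="hidden"/>
"""

if __name__ == "__main__":
    for url in extract_non_anchor_urls(sample):
        print(url)
```

What I am missing is the equivalent of this inside the Nutch parsing
step, so that such URLs end up as outlinks.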

Has anyone encountered such a case and figured out a solution? Any tips
or pointers would be great.

Thanks,

Ye
