Re: Extracting non anchored URLs from page

Ye T Thet Sun, 26 Aug 2012 09:52:38 -0700

Markus, Shaya.

Thanks for the response. I was hoping the scenario had been brought up
before and that there was a ready-made solution.


As for parsers, I tried both Nutch's HTML parser plugin and the Tika parser plugin.

I used ./bin/nutch org.apache.nutch.parse.ParserChecker
http://crawledsite.blogspot.com/ > parserresult.txt to check the outlinks
from a page. Neither parser yielded the non-anchored URLs.

To cater for my scenario, should I be looking into OutlinkExtractor, or into
the parser plug-ins (html, tika), to get the outlinks?
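
To make the goal concrete, here is a minimal sketch of the extraction I am
after, written in Python purely for illustration and not as Nutch plugin code.
It assumes the URLs are absolute http(s) addresses sitting in attribute values
such as <option value="...">, possibly with leading whitespace. The names
AttributeUrlCollector and extract_non_anchor_urls are made up for this example
and are not part of Nutch.

```python
import re
from html.parser import HTMLParser


class AttributeUrlCollector(HTMLParser):
    """Collects absolute http(s) URLs from attribute values other than <a href>."""

    # Assumption: a URL ends at whitespace, a quote, or an angle bracket.
    URL_RE = re.compile(r"https?://[^\s\"'<>]+")

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <option .../>.
        for name, value in attrs:
            if value is None:
                continue  # attribute without a value, e.g. "disabled"
            if tag == "a" and name == "href":
                continue  # ordinary anchors are already handled by the parser
            for url in self.URL_RE.findall(value):
                if url not in self.urls:
                    self.urls.append(url)


def extract_non_anchor_urls(html):
    """Return the URLs found outside anchor hrefs, in document order."""
    collector = AttributeUrlCollector()
    collector.feed(html)
    collector.close()
    return collector.urls


if __name__ == "__main__":
    snippet = """
    <select name="bloglinkselect">
    <option selected="selected" value="MORE"/>text 1
    <option value=" http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text 2
    </select>
    """
    for url in extract_non_anchor_urls(snippet):
        print(url)
```

In Nutch terms, something like this would presumably have to live in a parse
filter or in the outlink extraction step, which is the part I am unsure about.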

Background: I am pretty much a Nutch user who writes a plugin here and there
and tweaks configs to get things done. I do not have an in-depth understanding
of the Nutch code base, but I am open to digging further to get this working.

Thanks,

Ye



On Mon, Aug 27, 2012 at 12:25 AM, Markus Jelsma
<[email protected]> wrote:

> Nutch's parser relies on Nutch's OutlinkExtractor if the underlying parser
> did not yield any outlinks.
>
> -----Original message-----
> > From:Ye T Thet <[email protected]>
> > Sent: Sun 26-Aug-2012 18:09
> > To: [email protected]
> > Subject: Extracting non anchored URLs from page
> >
> > Hi Folks,
> >
> > I am using Nutch (1.2 and 1.5) to crawl some websites.
> >
> > The short question: is there any way, or any plug-in, to extract URLs
> > that are not in anchor tags in a page?
> >
> > The long question:
> >
> > The crawler is not extracting some of the URLs from the page. After
> > investigating, I noticed that the URLs are not technically links, i.e. they
> > are not inside anchor elements. The URLs are inside attribute values of
> > other HTML tags and are used by JavaScript.
> >
> > The following is a snippet of the content.
> >
> > <div class='widget-content'>
> > <h2 class="sidebar-title">
> > <form action="../" name="bloglinkform">
> > <select onchange="this.form.window_namer.value++;if
> > (this.options[this.selectedIndex].value!='MORE')
> > {window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
> > name="bloglinkselect">
> > <option selected="selected" value="MORE"/>text 1
> > <option value=" http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text 2
> > <option value=" http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text 3
> > </select>
> > <input value="1" name="window_namer" type="hidden"/>
> > </form></h2>
> > </div>
> >
> > As mentioned above, the URLs are not in HTML anchor tags, but are valid
> > URLs used by JavaScript when the user clicks the items. As a result, those
> > addresses are not crawled. To make matters worse, there is no site map or
> > index page from which such URLs can be reached, other than the links
> > mentioned above.
> >
> > Has anyone encountered such cases and figured out a solution? Any tips
> > or direction would be great.
> >
> > Thanks,
> >
> > Ye
> >
>
