Re: Extracting non anchored URLs from page

Shaya Potter Sun, 26 Aug 2012 09:16:18 -0700

finding urls in plain text is hard

http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html

I'm dealing with plain text emails, so people might also try to offset URLs with () or _.


What I do, based on Jeff Atwood's post:

    static public HashSet<String> urlExtractor(String text) {
        HashSet<String> results = new HashSet<String>();

        Pattern pattern = Pattern.compile(
            "[(_]?http://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

        Matcher matcher = pattern.matcher(text);

        while (matcher.find()) {
            String url = matcher.group();

            if (url.startsWith("(") || url.startsWith("_")) {
                if (url.endsWith(")") || url.endsWith("_")) {
                    url = url.substring(1, url.length() - 1);
                } else {
                    url = url.substring(1, url.length());
                }
            }

            results.add(url);
        }

        return results;
    }

Also, when processing the URLs, if I get a 404 I check whether the URL ends with a ')'. For example, "(for instance, check out http://my.site.com)" would leave a bad ')' at the end of the extracted URL.
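A minimal sketch of that 404 fallback, assuming you already have some way to fetch a URL and see the status code. The class and method names here are made up for illustration; the method only builds the list of URLs to try, in order, so the fetching stays in your own code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or of the urlExtractor method above.
final class TrailingParenFallback {

    // Returns the URL exactly as extracted, followed by a variant with one
    // trailing ')' removed when the extracted URL ends with ')'.
    static List<String> candidates(String url) {
        List<String> out = new ArrayList<>();
        out.add(url);
        if (url.endsWith(")")) {
            out.add(url.substring(0, url.length() - 1));
        }
        return out;
    }
}
```

The idea would be to fetch the first candidate and only move on to the second one if the first comes back as a 404, so URLs that legitimately end in ')' still work.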

What you might want to do, then, is throw all the outbound links into a set, and then do a pass like this over the document, throwing all the links it finds into the same set.
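Roughly like this, as a sketch only. It uses plain Java sets rather than anything from the Nutch plugin API, the class and method names are made up, and the pattern is a simplified variant of the one above (no leading ( or _ handling, and it accepts https as well, which is my assumption, not something the original pattern does).

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: merges links found by the normal anchor parsing
// with URLs found by a regex pass over the raw page text.
final class OutlinkMerger {

    // Simplified variant of the pattern in urlExtractor; "https?" is an assumption.
    private static final Pattern URL = Pattern.compile(
        "https?://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    static Set<String> merge(Set<String> anchorLinks, String rawPage) {
        // Start from the links the parser already found.
        Set<String> all = new LinkedHashSet<>(anchorLinks);
        Matcher matcher = URL.matcher(rawPage);
        while (matcher.find()) {
            // The set silently drops URLs that were already present.
            all.add(matcher.group());
        }
        return all;
    }
}
```

Because '"' is not in the character classes, a URL inside something like value="..." stops at the closing quote, so the option values in your snippet should be picked up cleanly.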


On 08/26/2012 12:06 PM, Ye T Thet wrote:

Hi Folks,

I am using Nutch (1.2 and 1.5) to crawl some websites.

The short question: is there any way, or any plug-in, to extract URLs
that are not in anchor tags in a page?

The long question:

The crawler is not extracting some of the URLs from the page. After
investigating, I noticed that the URLs are not technically links, i.e. not
inside anchor elements. The URLs are inside the values of other HTML tags
used by JavaScript.

The following is a snippet of the content.

<div class='widget-content'>
<h2 class="sidebar-title">
<form action="../" name="bloglinkform">
<select onchange="this.form.window_namer.value++;if
(this.options[this.selectedIndex].value!='MORE')
{window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
name="bloglinkselect">
<option selected="selected" value="MORE"/>text 1
<option 
value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text
2
<option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text
3
</select>
<input value="1" name="window_namer" type="hidden"/>
</form></h2>
</div>

As mentioned above, the URLs are not in HTML anchor tags, but are valid
URLs used by JavaScript when the user clicks the items. As a result,
those addresses are not crawled. To make matters worse, there is no site
map or index page from which these URLs can be reached, other than the
links mentioned above.

Has anyone encountered such cases and figured out a solution? Any tips
or direction would be great.

Thanks,

Ye
