Finding URLs in plain text is hard:

http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html

I'm dealing with plain-text emails, so people might try to offset URLs with () or _ as well.

Here is what I do, based on Jeff Atwood's post:

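    import java.util.HashSet;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    static public HashSet<String> urlExtractor(String text) {
        HashSet<String> results = new HashSet<String>();

        // An optional leading '(' or '_' is captured so that wrapped URLs
        // can be unwrapped below. The last character class leaves out
        // punctuation such as '.' and ',' so a sentence-ending mark is dropped.
        Pattern pattern = Pattern.compile(
                "[(_]?http://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");
        Matcher matcher = pattern.matcher(text);

        while (matcher.find()) {
            String url = matcher.group();

            if (url.startsWith("(") || url.startsWith("_")) {
                if (url.endsWith(")") || url.endsWith("_")) {
                    // Wrapped on both sides, e.g. (http://...) or _http://..._
                    url = url.substring(1, url.length() - 1);
                } else {
                    // Only the leading wrapper character was captured.
                    url = url.substring(1);
                }
            }

            results.add(url);
        }

        return results;
    }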

Also, when processing the URLs, if I get a 404 I check whether the URL ends with a ')'. For example, "(for instance, check out http://my.site.com)" would leave a bad ')' at the end of the extracted URL.
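A minimal sketch of that fallback, in case it helps. The class and method names (TrailingParen, candidates) are made up for illustration and are not part of the method above; the HTTP fetch is left out, so this only builds the list of URLs to try in order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the URLs to try for one extracted match.
final class TrailingParen {
    /**
     * Returns the URL as extracted, followed by a copy without the
     * trailing ')' when there is one. The caller would fetch the first
     * entry and only move on to the second after a 404.
     */
    static List<String> candidates(String url) {
        List<String> out = new ArrayList<>();
        out.add(url);
        if (url.endsWith(")")) {
            out.add(url.substring(0, url.length() - 1));
        }
        return out;
    }
}
```

Trying the URL unchanged first matters because some legitimate URLs really do end in ')', so stripping it unconditionally would break those.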

What you might want to do, then, is throw all the outbound links into a set, then do a pass like this over the document and throw every URL it finds into the same set.
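Roughly, that could look like the sketch below. LinkMerger and merge are hypothetical names, and how you get the parser's outbound links out of Nutch is not shown; I am just assuming you have them as a collection of strings. The pattern uses the same character classes as the one above, minus the leading ( or _ handling, since it runs over raw HTML rather than email text.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: combines parser outlinks with regex-found URLs.
final class LinkMerger {
    private static final Pattern URL = Pattern.compile(
            "http://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    /**
     * Starts from the links the parser already found, then adds every
     * URL the pattern matches in the raw page text. The set removes
     * duplicates, so a link found both ways is kept once.
     */
    static Set<String> merge(Collection<String> parsedOutlinks, String rawText) {
        Set<String> all = new LinkedHashSet<>(parsedOutlinks);
        Matcher m = URL.matcher(rawText);
        while (m.find()) {
            all.add(m.group());
        }
        return all;
    }
}
```

Because a double quote is not in either character class, a URL sitting inside an attribute such as value="..." is matched up to, but not including, the closing quote.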

On 08/26/2012 12:06 PM, Ye T Thet wrote:
Hi Folks,

I am using Nutch (1.2 and 1.5) to crawl some websites.

The short question: is there any way, or any plug-in, to extract URLs
that are not in anchor tags in a page?

The long question:

The crawler is not extracting some of the URLs from the page. After some
investigation I noticed that these URLs are not technically links, i.e. not
inside anchor elements. They are inside the values of other HTML tags used by
JavaScript.

Following is a snippet of the content.

<div class='widget-content'>
<h2 class="sidebar-title">
<form action="../" name="bloglinkform">
<select onchange="this.form.window_namer.value++;if
(this.options[this.selectedIndex].value!='MORE')
{window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
name="bloglinkselect">
<option selected="selected" value="MORE"/>text 1
<option value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text 2
<option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text 3
</select>
<input value="1" name="window_namer" type="hidden"/>
</form></h2>
</div>

As mentioned above, the URLs are not in HTML anchor tags, but are valid
URLs used by JavaScript when the user clicks the items. As a result,
those addresses are not crawled. To make matters worse, there is no site
map or index page from which such URLs can be reached other than the
links mentioned above.

Has anyone encountered such cases and figured out a solution? Any tips
or direction would be great.

Thanks,

Ye
