Finding URLs in plain text is hard:
http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html
I'm dealing with plain-text emails, so people might also try to set URLs
off with () or _.
Here is what I do, based on Jeff Atwood's post:
// Needs java.util.HashSet, java.util.regex.Matcher, java.util.regex.Pattern
public static HashSet<String> urlExtractor(String text) {
    HashSet<String> results = new HashSet<String>();
    // Optionally capture a leading '(' or '_' so the wrapper can be stripped below.
    Pattern pattern = Pattern
            .compile("[(_]?http://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");
    Matcher matcher = pattern.matcher(text);
    while (matcher.find()) {
        String url = matcher.group();
        if (url.startsWith("(") || url.startsWith("_")) {
            if (url.endsWith(")") || url.endsWith("_")) {
                // Wrapped on both sides, e.g. (http://...) or _http://..._
                url = url.substring(1, url.length() - 1);
            } else {
                // Only the leading delimiter was captured.
                url = url.substring(1);
            }
        }
        results.add(url);
    }
    return results;
}
Also, when processing the URLs, if I get a 404 I check whether the URL
ends with a ')'. For example, "(for instance, check out http://my.site.com)"
would leave a bad ')' at the end of the extracted URL.
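
A minimal sketch of that 404 fallback, in case it helps. The class name,
the method names and the statusOf parameter are all made up for
illustration; statusOf stands in for whatever code actually fetches the
URL and returns the HTTP status, so nothing here is Nutch API.

```java
import java.util.function.ToIntFunction;

public class TrailingParenRetry {

    /** Drops one trailing ')' if present; otherwise returns the URL unchanged. */
    static String stripTrailingParen(String url) {
        return url.endsWith(")") ? url.substring(0, url.length() - 1) : url;
    }

    /**
     * Returns the URL to keep: the original, unless it gives a 404 and the
     * same URL without its trailing ')' does not.
     * statusOf is a hypothetical stand-in for the real HTTP fetch.
     */
    static String resolve(String url, ToIntFunction<String> statusOf) {
        if (statusOf.applyAsInt(url) != 404) {
            return url;
        }
        String trimmed = stripTrailingParen(url);
        if (!trimmed.equals(url) && statusOf.applyAsInt(trimmed) != 404) {
            return trimmed;
        }
        // Both variants failed (or there was nothing to strip): keep the original.
        return url;
    }
}
```

Only retrying after a 404 keeps URLs that legitimately end in ')' intact,
since those resolve on the first attempt.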
What you might want to do, then, is throw all the outbound links into a
set, and then do a pass like this over the document, adding every link
it finds to the same set.
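
Roughly like this, as a sketch only. OutlinkMerger and mergeLinks are
hypothetical names, and I'm assuming the outlinks the parser already found
are available as plain strings; how you get them out of Nutch and feed the
result back in is not shown. The pattern is the one from above without the
optional leading ( or _.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OutlinkMerger {

    // Same character classes as the extractor above, minus the optional "(" / "_" prefix.
    private static final Pattern URL = Pattern.compile(
            "http://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    /**
     * Returns the parsed outlinks plus every URL the regex finds in the raw
     * page text. The set drops duplicates, so a link found both ways is kept once.
     */
    static Set<String> mergeLinks(Set<String> parsedOutlinks, String rawPage) {
        Set<String> all = new LinkedHashSet<>(parsedOutlinks);
        Matcher matcher = URL.matcher(rawPage);
        while (matcher.find()) {
            all.add(matcher.group());
        }
        return all;
    }
}
```

Run over the raw HTML, this also picks up URLs sitting in option value
attributes like the ones in your snippet, because the closing quote is not
in the character class and ends the match.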
On 08/26/2012 12:06 PM, Ye T Thet wrote:
Hi Folks,
I am using Nutch (1.2 and 1.5) to crawl some website.
The short question: is there any way, or any plugin, to extract URLs
that are not in anchor tags in a page?
The long question:
The crawler is not extracting some of the URLs from the page. After some
investigation I noticed that the URLs are technically not links, i.e. not
inside anchor elements. The URLs are inside the values of other HTML tags
used by JavaScript.
Following is a snippet of the contents.
<div class='widget-content'>
<h2 class="sidebar-title">
<form action="../" name="bloglinkform">
<select onchange="this.form.window_namer.value++;if
(this.options[this.selectedIndex].value!='MORE')
{window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
name="bloglinkselect">
<option selected="selected" value="MORE"/>text 1
<option
value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text
2
<option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text
3
</select>
<input value="1" name="window_namer" type="hidden"/>
</form></h2>
</div>
As mentioned above, the URLs are not in HTML anchor tags, but are valid
URLs used by JavaScript when the user clicks the items. As a result,
those addresses are not crawled. To make matters worse, there is no site
map or index page where such URLs can be reached other than the links
mentioned above.
Has anyone encountered such cases and figured out a solution? Any tips
or direction would be great.
Thanks,
Ye