Re: [twincling] url extraction regex

Saifi Khan Tue, 02 Jun 2009 18:30:45 -0700

On Wed, 20 May 2009, Dhiraj Chawla wrote:

> Hello Saifi,
> 
> You can try this one also:
> 
> ((https?|ftp)://[\w\d+\?\.\-:#...@%/&=~_]*)
> 
> Actually I have modified it a bit, as I was using this regex in my Java
> code (it may have an extra '\' here and there). But this is the most
> standard way to extract URLs in any web-mining or web-indexing code.
> 
> regards,
> Dhiraj Chawla
> 
> > Hi all:
> >
> > There is an HTML file with more than 5000 entries that look like
> >
> > <li><a href="http://blogtrader.net/dcaoyuan/feed/entries/atom"
> > title="subscribe"><img src="p_files/feed-icon-10x10.png" alt=
> > "(feed)"></a> <a href="http://blogtrader.net/" title=
> > "BlogTrader">Caoyuan Deng</a></li>
> > ...
> > ...
> > ...
> >
> > and I need to extract the URL links.
> >
> > Here is my PERL solution.
> >
> > #!/usr/bin/env perl
> >
> > $ok = open(FH, "<", "p.htm");
> >
> > foreach (<FH>)
> > {
> >     if ( /href=\"*[^\">]*/ )
> >     {
> >         print "$& \n";
> >     }
> > }
> >
> > close(FH);
> > --
> >
> > Is there a better way to express the regular expression and
> > print the URL links?
> >
> > All suggestions and code snippets are welcome :)
> >
> >
> > thanks
> > Saifi.
> >
>


Hi Dhiraj:

Thanks for the suggestion.
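
For the original problem, a minimal Java sketch of pulling out the href
values might look like the following. The class name HrefExtractor, the
method extractLinks and the pattern itself are assumptions made for
illustration; it uses a simpler href-based pattern rather than the one
quoted above. Note that inside a Java string literal every regex backslash
has to be doubled, which may be where an extra '\' comes from.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    // Hypothetical pattern: captures the text between the double quotes of
    // an href attribute. "\\s" in the Java literal is the regex \s.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every double-quoted href value found in the text, in order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {          // find() advances to the next match
            links.add(m.group(1));  // group 1 is the URL without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample entry in the shape shown in the quoted message.
        String sample =
            "<li><a href=\"http://blogtrader.net/dcaoyuan/feed/entries/atom\" "
            + "title=\"subscribe\"></a> <a href=\"http://blogtrader.net/\" "
            + "title=\"BlogTrader\">Caoyuan Deng</a></li>";
        for (String link : extractLinks(sample)) {
            System.out.println(link);
        }
    }
}
```

The find() loop collects every href in the input, including several on the
same line, and the capture group leaves out the leading href=" text. It
only handles double-quoted attribute values.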

Are the regex syntaxes in
 . Java
 . Boost
 . Perl
 . PCRE (Philip Hazel)
compatible with each other, or does one need to tweak them?


thanks
Saifi.
