On Wed, 20 May 2009, Dhiraj Chawla wrote:
> Hello Saifi,
>
> You can try this one also:
>
> ((https?|ftp)://[\w\d+\?\.\-:#...@%/&=~_]*)
>
> Actually I have modified it a bit as I was using this regex in my java
> code (may have an extra '\' here and there). But this is the most standard
> way to extract urls in any web-mining or web-indexing code.
>
> regards,
> Dhiraj Chawla
>
> > Hi all:
> >
> > There is an HTML file with more than 5000 entries that look like
> >
> > <li><a href="http://blogtrader.net/dcaoyuan/feed/entries/atom"
> > title="subscribe"><img src="p_files/feed-icon-10x10.png" alt=
> > "(feed)"></a> <a href="http://blogtrader.net/" title=
> > "BlogTrader">Caoyuan Deng</a></li>
> > ...
> > ...
> > ...
> >
> > and i need to extract the URL links.
> >
> > Here is my PERL solution.
> >
> > #!/usr/bin/env perl
> >
> > $ok = open(FH, "<", "p.htm");
> >
> > foreach (<FH>)
> > {
> >   if ( /href=\"*[^\">]*/ )
> >   {
> >     print "$& \n";
> >   }
> > }
> >
> > close(FH);
> > --
> >
> > Is there a better way to express the regular expression and
> > print the URL links ?
> >
> > All suggestions and code snippets are welcome :)
> >
> >
> > thanks
> > Saifi.

Hi Dhiraj:

Thanks for the suggestion.

Are the RegEx representations in

 . Java
 . Boost
 . PERL
 . PCRE (Philip Hazel)

compatible with each other, or does one need to tweak them?

thanks
Saifi.

