Re: {Disarmed} [twincling] url extraction regex

Dhiraj Chawla Wed, 20 May 2009 09:54:18 -0700

Hello Saifi,

You can try this one also:


((https?|ftp)://[\w\d+\?\.\-:#...@%/&=~_]*)

I have modified it a bit, as I was using this regex in my Java code, so it
may have an extra '\' here and there. It is a common way to extract URLs
in web-mining or web-indexing code.
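
For illustration, here is a minimal Java sketch of how a pattern of this
shape could be applied with java.util.regex. The class name UrlExtractor,
the method extractUrls and the sample input are hypothetical, and the
character class is a simplified assumption that leaves out the part that
is elided in the pattern above, so treat it as a starting point rather
than the exact expression I use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UrlExtractor {
    // Assumed, simplified character class: word characters plus a set of
    // punctuation commonly seen in URLs. A double quote is not in the
    // class, so a match stops at the closing quote of an href attribute.
    private static final Pattern URL =
        Pattern.compile("(https?|ftp)://[\\w+?.\\-:#@%/&=~]*");

    // Returns every URL-like substring found in the given text, in order.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Sample input modelled on the entries quoted below.
        String html =
            "<li><a href=\"http://blogtrader.net/dcaoyuan/feed/entries/atom\" "
            + "title=\"subscribe\">feed</a> "
            + "<a href=\"http://blogtrader.net/\" title=\"BlogTrader\">"
            + "Caoyuan Deng</a></li>";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

Because Matcher.find() is called in a loop, this picks up every URL on a
line rather than only the first one, and the "href=" prefix is never part
of the match.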

regards,
Dhiraj Chawla

> Hi all:
>
> There is an HTML file with more than 5000 entries that look like
>
> <li><a href="http://blogtrader.net/dcaoyuan/feed/entries/atom"
> title="subscribe"><img src="p_files/feed-icon-10x10.png" alt=
> "(feed)"></a> <a href="http://blogtrader.net/" title=
> "BlogTrader">Caoyuan Deng</a></li>
> ...
> ...
> ...
>
> and I need to extract the URL links.
>
> Here is my Perl solution.
>
> #!/usr/bin/env perl
>
> $ok = open(FH, "<", "p.htm");
>
> foreach (<FH>)
> {
>     if ( /href=\"*[^\">]*/ )
>     {
>         print "$& \n";
>     }
> }
>
> close(FH);
> --
>
> Is there a better way to express the regular expression and
> print the URL links?
>
> All suggestions and code snippets are welcome :)
>
>
> thanks
> Saifi.
>
