Re: Regular expression extractor for spider

sebb Tue, 03 Sep 2013 12:04:52 -0700

On 3 September 2013 19:08, Jordi Carretero <[email protected]> wrote:
> Hi
>
> I'm building a spider using a regular expression extractor and a for-each
> controller, and it works pretty well, but...
>
> I'm using <a href="[.]*/([^"]+)" as the extractor expression, and it works
> well for extracting links like:
> <a href="../rel/c/items" >
> <a href="/professions.html"
>
> but I cannot find an expression that also works for
> links found on some sites, like:
>
> <a
> href="http://www.mysite.es/index.php?main_page=page&amp;id=20"
>
> which include the full domain at the beginning (it has to be removed).
>
> It's a matter of working with the perl expression but after some days I
> could not manage to make it work, so any help will be appreciated


If you want to ignore an optional string, use something like:

(?:http://www\.mysite\.es)?

The form (abc)? means abc or nothing; the (?:...) form is a non-capturing
group, so its contents are not saved.
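
To see the difference between the two forms, here is a small sketch in
Python (chosen only for illustration; JMeter's extractor uses its own
Perl5-style engine, so treat this as an approximation). The strings
"abc" and "def" are just placeholders:

```python
import re

# (abc)? is optional and capturing; (?:abc)? is optional and non-capturing.
capturing = re.compile(r'(abc)?(def)')
non_capturing = re.compile(r'(?:abc)?(def)')

# With the capturing form, the optional part occupies group 1.
with_prefix = capturing.match('abcdef').groups()
without_prefix = capturing.match('def').groups()

# With the non-capturing form, only "def" is ever saved.
nc_with_prefix = non_capturing.match('abcdef').groups()
nc_without_prefix = non_capturing.match('def').groups()

print(with_prefix, without_prefix)
print(nc_with_prefix, nc_without_prefix)
```

With the non-capturing form, the wanted text is always group 1, whether
or not the optional prefix was present.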

In your case, if you want to ignore any of ".", ".." and
"http://www.mysite.es", you could use:

(?:http://www\.mysite\.es|\.\.?)?

BTW, rather than use "[.]" to escape the meta-character ".", the usual
method is "\.".
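
Putting it together, one possible full expression is sketched below in
Python (again only for illustration; the JMeter engine may differ in
details). It assumes the optional prefix simply replaces the "[.]*" in
your original expression, and the "\s+" after "<a" is my own addition to
tolerate a line break there. The sample HTML is made up from the links
in your message:

```python
import re

# Assumption: the optional, non-capturing prefix goes in front of the "/"
# of the original expression. "\s+" (an addition, not in the original)
# allows whitespace or a newline between "<a" and "href".
LINK = re.compile(r'<a\s+href="(?:http://www\.mysite\.es|\.\.?)?/([^"]+)"')

# Sample input built from the three link styles quoted above.
html = '''
<a href="../rel/c/items" >
<a href="/professions.html">
<a
href="http://www.mysite.es/index.php?main_page=page&amp;id=20">
'''

# With a single capturing group, findall returns the captured paths.
paths = LINK.findall(html)
for path in paths:
    print(path)
```

Each of the three links should yield only the part after the leading
"/", with the "..", empty, or full-domain prefix dropped.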

> Thanks

