Thanks Sebb, that was very illustrative for me and helped me find the
solution:

<a href="(?:http://www\.mysite\.com)*[.]*/([^"]+)

Used in the Regular Expression Extractor, this expression extracts the
links from each page. The matches can then be used, through a variable, to
populate the Path field of the recursive HTTP Request inside the ForEach
Controller.
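
For anyone who wants to check the expression outside JMeter, here is a
minimal sketch in Python. It assumes Python's re module treats these
particular constructs the same way as JMeter's Perl5-style engine, and the
sample HTML and the extract_paths helper are made up for illustration:

```python
import re

# The expression from this message, copied as-is (no closing quote needed,
# because the capture group stops at the first double quote).
LINK_RE = re.compile(r'<a href="(?:http://www\.mysite\.com)*[.]*/([^"]+)')

# Hypothetical sample page covering the three link styles in the thread.
SAMPLE_HTML = '''
<a href="../rel/c/items" >
<a href="/professions.html">
<a href="http://www.mysite.com/index.php?main_page=page&id=20">
'''


def extract_paths(html_text):
    """Return the captured path of every link the expression matches."""
    return LINK_RE.findall(html_text)


paths = extract_paths(SAMPLE_HTML)
for path in paths:
    print(path)
# prints:
# rel/c/items
# professions.html
# index.php?main_page=page&id=20
```

Note that the captured value never includes the leading "/" or the "../"
prefix, so the Path field has to supply the leading slash itself if it is
needed.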

To make the PHP links work properly, though, I had to change the "Response
Field to check" setting from Body to Body (unescaped). I do not really know
why :(
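
One possible explanation, offered as an assumption rather than a confirmed
diagnosis: the PHP links in the page source contain "&amp;" between query
parameters, and Body (unescaped) replaces HTML escape codes before the
expression runs. The sketch below uses Python's html.unescape as a stand-in
for that step; the sample body is made up:

```python
import html
import re

# Same expression as in this message.
LINK_RE = re.compile(r'<a href="(?:http://www\.mysite\.com)*[.]*/([^"]+)')

# Hypothetical raw response body with an HTML-escaped ampersand.
RAW_BODY = '<a href="http://www.mysite.com/index.php?main_page=page&amp;id=20">'

# Extracting from the raw body keeps the escape code in the path.
escaped_path = LINK_RE.findall(RAW_BODY)[0]

# Unescaping first (the stand-in for Body (unescaped)) yields a usable query string.
unescaped_path = LINK_RE.findall(html.unescape(RAW_BODY))[0]

print(escaped_path)    # index.php?main_page=page&amp;id=20
print(unescaped_path)  # index.php?main_page=page&id=20
```

If the first form is sent as the request path, the server would see a
parameter named "amp;id" instead of "id", which might be why the PHP pages
did not load correctly.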

Thanks again
Jordi




On Tue, Sep 3, 2013 at 8:36 PM, sebb <[email protected]> wrote:

> On 3 September 2013 19:08, Jordi Carretero <[email protected]>
> wrote:
> > Hi
> >
> > I'm building a spider using a regular expression extractor and a
> > for-each controller, and it works pretty well, but..
> >
> > I'm using <a href="[.]*/([^"]+)" as the extractor expression, and it
> > works well to extract links like:
> > <a href="../rel/c/items" >
> > <a href="/professions.html"
> >
> > but I cannot find any expression that will work at the same time for
> > links found on some sites, like:
> >
> > <a href="http://www.mysite.es/index.php?main_page=page&amp;id=20"
> >
> > which include the full domain at the beginning (and it has to be removed)
> >
> > It's a matter of getting the Perl expression right, but after some days I
> > still could not make it work, so any help will be appreciated
>
> If you want to ignore an optional string, use something like:
>
> (?:http://www\.mysite\.es)?
>
> The form (abc)? means abc or nothing; the (?:) form means don't save
> the contents.
>
> In your case, if you want to ignore ".", ".." and
> "http://www.mysite.es" you could use:
>
> (?:http://www\.mysite\.es|\.\.?)?
>
> BTW, rather than use "[.]" to escape the meta-character ".", the usual
> method is "\.".
>
> > Thanks
>
