Re: Regular expression extractor for spider

Deepak Shetty Wed, 04 Sep 2013 09:25:34 -0700

Note that the for-each method is restricted to a single thread, which
isn't ideal for a spider.
http://theworkaholic.blogspot.com/2009/10/spidering-site-with-jmeter.html



On Wed, Sep 4, 2013 at 9:19 AM, Jordi Carretero <[email protected]> wrote:

> Thanks Sebb. That was very illustrative for me and helped me find the
> solution:
>
> <a href="(?:http://www\.mysite\.com)*[.]*/([^"]+)
>
> This expression, placed in the Regular Expression Extractor, extracts
> the links in each page. Its matches can then populate the path field of the
> recursive HTTP request (inside the ForEach Controller) through a variable.
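
A minimal sketch of what the pattern above captures, using Python's re
module rather than JMeter's own regex engine (the syntax used here behaves
the same in both). The sample HTML is made up for illustration.

```python
import re

# Jordi's pattern from the message above, unchanged.
LINK_RE = re.compile(r'<a href="(?:http://www\.mysite\.com)*[.]*/([^"]+)')

# Hypothetical page content, invented for this example.
SAMPLE_HTML = '''
<a href="../rel/c/items" >items</a>
<a href="/professions.html">professions</a>
<a href="http://www.mysite.com/index.php?main_page=page&amp;id=20">page</a>
'''

# With a single capturing group, findall returns just the captured paths,
# which is roughly the list a ForEach Controller would iterate over.
paths = LINK_RE.findall(SAMPLE_HTML)
for path in paths:
    print(path)
# rel/c/items
# professions.html
# index.php?main_page=page&amp;id=20
```

Note that the last path still contains the HTML entity &amp; because the
pattern ran against the raw body.
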
>
> To make PHP links work, though, I had to change the "Response Field to
> check" setting to Body (unescaped) instead of Body (I don't really know why :(
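
One likely explanation, offered as an assumption and not verified against
this test plan: Body (unescaped) replaces HTML escape codes before the
regex runs, so &amp; in an href becomes a plain & and the extracted path is
a valid query string. The sketch below uses Python's html.unescape as a
stand-in for that step; it is not JMeter's implementation.

```python
import html

# Path as it appears in the raw HTML body (entity still escaped).
raw_path = "index.php?main_page=page&amp;id=20"

# Stand-in for the unescaping step: entities become literal characters.
unescaped_path = html.unescape(raw_path)

print(raw_path)        # index.php?main_page=page&amp;id=20
print(unescaped_path)  # index.php?main_page=page&id=20
```

Requesting the first form would send "amp;id" as a parameter name, which a
PHP page expecting "id" would not recognize.
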
>
> Thanks again
> Jordi
>
>
>
>
> On Tue, Sep 3, 2013 at 8:36 PM, sebb <[email protected]> wrote:
>
> > On 3 September 2013 19:08, Jordi Carretero <[email protected]>
> > wrote:
> > > Hi
> > >
> > > I'm building a spider using a regular expression extractor and a
> > > for-each controller, and it works pretty well, but...
> > >
> > > I'm using <a href="[.]*/([^"]+)" as the extractor expression, and it works
> > > well to extract links like:
> > > <a href="../rel/c/items" >
> > > <a href="/professions.html"
> > >
> > > but I cannot find any expression that also works for
> > > links found on some sites, like:
> > >
> > > <a href="http://www.mysite.es/index.php?main_page=page&amp;id=20"
> > >
> > > that include the full domain at the beginning (which has to be removed).
> > >
> > > It's a matter of working out the Perl expression, but after some days I
> > > still could not make it work, so any help will be appreciated.
> >
> > If you want to ignore an optional string, use something like:
> >
> > (?:http://www\.mysite\.es)?
> >
> > The form (abc)? means abc or nothing; the (?:) form means don't save
> > the contents.
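
A small sketch of the difference between the two group forms, written with
Python's re module as a stand-in for JMeter's regex engine. The href="
prefix and the trailing (/[^"]+) group are assumptions added so there is
something to capture.

```python
import re

# Optional prefix in a non-capturing group: only the path is saved.
non_capturing = re.compile(r'href="(?:http://www\.mysite\.es)?(/[^"]+)')
# Same pattern with a plain group: the prefix is saved as group 1.
capturing = re.compile(r'href="(http://www\.mysite\.es)?(/[^"]+)')

absolute = 'href="http://www.mysite.es/index.php"'
relative = 'href="/index.php"'

print(non_capturing.search(absolute).groups())  # ('/index.php',)
print(non_capturing.search(relative).groups())  # ('/index.php',)
print(capturing.search(absolute).groups())      # ('http://www.mysite.es', '/index.php')
print(capturing.search(relative).groups())      # (None, '/index.php')
```

With the non-capturing form the path is always group 1, so the template in
the extractor does not have to change when the domain is present.
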
> >
> > In your case, if you want to ignore any of ".", ".." and
> > "http://www.mysite.es" you could use:
> >
> > (?:http://www\.mysite\.es|\.\.?)?
> >
> > BTW, rather than use "[.]" to escape the meta-character ".", the usual
> > method is "\.".
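
As a rough check of the alternation above, here it is combined with the
/([^"]+) tail from the original expression and run through Python's re
module (an assumption; JMeter uses its own engine). The "./contact.html"
sample is invented; the other hrefs come from the thread.

```python
import re

# sebb's optional prefix, followed by the slash and captured path.
LINK_RE = re.compile(r'<a href="(?:http://www\.mysite\.es|\.\.?)?/([^"]+)')

samples = [
    '<a href="../rel/c/items" >',
    '<a href="./contact.html"',  # hypothetical sample
    '<a href="/professions.html"',
    '<a href="http://www.mysite.es/index.php?main_page=page&amp;id=20"',
]

captured = [LINK_RE.search(s).group(1) for s in samples]
print(captured)
# ['rel/c/items', 'contact.html', 'professions.html',
#  'index.php?main_page=page&amp;id=20']

# "[.]" and "\." match exactly the same thing: one literal dot.
dot_class = re.compile(r'[.]')
dot_escaped = re.compile(r'\.')
same = all(
    bool(dot_class.fullmatch(ch)) == bool(dot_escaped.fullmatch(ch))
    for ch in '.a/'
)
print(same)  # True
```
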
> >
> > > Thanks
> >
>
