Correct, but I guess any other approach would be too complex for my limited knowledge so far. The good thing is that I will never stress the sites being spidered :)
One update for the regular expression, in case it helps anyone:

<a href="(?:http://${__V(${URL})})*[.]*/([^"]+)

(not my idea, extracted from
http://stackoverflow.com/questions/5341908/regex-extractor-equipped-with-dynamic-regular-expression-in-jmeter )
where URL is the name of the variable I use for the sites to be spidered.
I fill the variable using a CSV file and a CSV Data Set Config.

About the multithreading, I was thinking of launching several JMeter
instances, each one loaded with its own CSV full of sites. Not ideal, but
I'll get some performance improvement.

Jordi

On Wed, Sep 4, 2013 at 6:24 PM, Deepak Shetty <[email protected]> wrote:
> Note that the for-each method is restricted to using a single thread,
> which isn't ideal for a spider.
> http://theworkaholic.blogspot.com/2009/10/spidering-site-with-jmeter.html
>
>
> On Wed, Sep 4, 2013 at 9:19 AM, Jordi Carretero <[email protected]> wrote:
>
> > Thanks Sebb, that was very illustrative for me and helped me find the
> > solution:
> >
> > <a href="(?:http://www\.mysite\.com)*[.]*/([^"]+)
> >
> > This expression, used in the Regular Expression Extractor, extracts the
> > links in the pages and can be used to populate the path field of the
> > recursive (ForEach Controller) HTTP request via a variable.
> >
> > To make PHP links work properly, though, I had to change the "Response
> > field to check" to Body (unescaped) instead of Body (I don't really
> > know why :( )
> >
> > Thanks again
> > Jordi
> >
> >
> > On Tue, Sep 3, 2013 at 8:36 PM, sebb <[email protected]> wrote:
> >
> > > On 3 September 2013 19:08, Jordi Carretero <[email protected]> wrote:
> > > > Hi
> > > >
> > > > I'm building a spider using a Regular Expression Extractor and a
> > > > ForEach Controller, and it works pretty well, but...
> > > >
> > > > I'm using <a href="[.]*/([^"]+)" as the extractor expression, and
> > > > it works well to extract links like:
> > > >
> > > > <a href="../rel/c/items"
> > > > <a href="/professions.html"
> > > >
> > > > but I cannot find any expression that will also work for the links
> > > > found on some sites, like:
> > > >
> > > > <a href="http://www.mysite.es/index.php?main_page=page&id=20"
> > > >
> > > > which include the full domain at the beginning (and it has to be
> > > > removed).
> > > >
> > > > It's a matter of working with the Perl expression, but after some
> > > > days I could not manage to make it work, so any help will be
> > > > appreciated.
> > >
> > > If you want to ignore an optional string, use something like:
> > >
> > > (?:http://www\.mysite\.es)?
> > >
> > > The form (abc)? means abc or nothing; the (?:) form means don't save
> > > the contents.
> > >
> > > In your case, if you want to ignore ".", ".." and
> > > "http://www.mysite.es", you could use:
> > >
> > > (?:http://www\.mysite\.es|\.\.?)?
> > >
> > > BTW, rather than use "[.]" to escape the meta-character ".", the
> > > usual method is "\.".
> > >
> > > > Thanks
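
For anyone who wants to sanity-check the pattern outside JMeter first, below is a minimal standalone Java sketch (the class name and the sample HTML are invented for illustration) that runs sebb's optional-prefix idea against the three href styles from the thread. JMeter's Regular Expression Extractor actually uses Perl5-style (Jakarta ORO) regexes rather than java.util.regex, but this particular pattern behaves the same way in both.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractorDemo {

    public static void main(String[] args) {
        // The three href styles discussed in the thread.
        String body =
              "<a href=\"../rel/c/items\">relative</a>\n"
            + "<a href=\"/professions.html\">root-relative</a>\n"
            + "<a href=\"http://www.mysite.es/index.php?main_page=page&id=20\">absolute</a>\n";

        // sebb's suggestion: make the domain (or "." / "..") an optional,
        // non-capturing prefix so that only the path lands in group 1.
        Pattern linkPattern = Pattern.compile(
                "<a href=\"(?:http://www\\.mysite\\.es|\\.\\.?)?/([^\"]+)");

        Matcher m = linkPattern.matcher(body);
        while (m.find()) {
            // Group 1 is what the extractor would store in its reference
            // variable, e.g. to feed a ForEach Controller.
            System.out.println(m.group(1));
        }
        // Prints:
        // rel/c/items
        // professions.html
        // index.php?main_page=page&id=20
    }
}

With Match No. set to -1 in the extractor, these group-1 values are roughly what would end up in the numbered variables (e.g. LINK_1, LINK_2, ... if the reference name were LINK) that a ForEach Controller then iterates over.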
