Re: Regular expression extractor for spider

sebb Wed, 04 Sep 2013 10:22:12 -0700

On 4 September 2013 17:48, Jordi Carretero <[email protected]> wrote:
> Correct, but I guess any other approach would be too complex for my limited
> knowledge so far. Good thing is that I will never stress the sites being
> spidered :)
>
> One update to the regular expression, in case it helps anyone:
>
> <a href="(?:http://${__V(${URL})})*[.]*/([^"]+)


[.] should really be \.
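
To see why the escaping matters, here is a small sketch in Python rather than JMeter. JMeter's Regular Expression Extractor uses a Perl5-style engine, and as far as I know these constructs behave the same way there, but treat that as an assumption. The host value, the extract_path helper and the sample links are made up for illustration; re.escape() stands in for writing \. by hand.

```python
import re

# Hypothetical host value; in the test plan this would come from the URL variable.
host = "www.mysite.com"

# An unescaped "." matches any character, so this also accepts a wrong host.
loose = re.compile("http://" + host)
# re.escape() backslash-escapes each ".", so only the literal host matches.
strict = re.compile("http://" + re.escape(host))

# "[.]" and "\." both match exactly one literal dot and nothing else.
assert re.fullmatch(r"[.]", ".") and re.fullmatch(r"\.", ".")
assert not re.fullmatch(r"[.]", "x") and not re.fullmatch(r"\.", "x")


def extract_path(html_fragment, site=host):
    """Return the captured path, or None if the link does not match."""
    pattern = r'<a href="(?:http://' + re.escape(site) + r')*\.*/([^"]+)'
    m = re.search(pattern, html_fragment)
    return m.group(1) if m else None
```

Both spellings of the dot work; \. is simply the conventional one. The larger risk is a host name inserted from a variable with its dots left unescaped, which is the loose case above.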

> (not my idea, extracted from
> http://stackoverflow.com/questions/5341908/regex-extractor-equipped-with-dynamic-regular-expression-in-jmeter
>  )
>
> where URL is the name of the variable I use for the sites to be spidered. I
> fill the variable from a CSV file using CSV Data Set Config.
>
> About multithreading, I was thinking of launching several JMeter instances,
> each one loaded with its own CSV full of sites. Not ideal, but I'll get some
> performance improvement.

Why not just use multiple JMeter threads? Each one will get a
different CSV entry (unless you specify otherwise).
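
As a rough analogy only (this is plain Python, not how JMeter is implemented), the sketch below shows several threads pulling rows from one shared source so that each row goes to exactly one thread. The site names and the run_threads helper are hypothetical.

```python
import queue
import threading

# Hypothetical site list standing in for the rows of the CSV file.
sites = ["site-a.example", "site-b.example", "site-c.example", "site-d.example"]


def run_threads(rows, num_threads=2):
    """Let several workers pull rows from one shared source, one row at a time."""
    pending = queue.Queue()
    for row in rows:
        pending.put(row)

    seen = []                      # (thread name, row) pairs
    seen_lock = threading.Lock()

    def worker():
        while True:
            try:
                row = pending.get_nowait()   # next unread row, shared by all workers
            except queue.Empty:
                return                       # no rows left
            with seen_lock:
                seen.append((threading.current_thread().name, row))

    threads = [threading.Thread(target=worker, name=f"t{i}") for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Which thread gets which row is not deterministic, but no row is handed out twice, which is the property that matters for a spider.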

> Jordi
>
>
>
>
> On Wed, Sep 4, 2013 at 6:24 PM, Deepak Shetty <[email protected]> wrote:
>
>> Note that the for-each method is restricted to using a single thread, which
>> isn't ideal for a spider.
>> http://theworkaholic.blogspot.com/2009/10/spidering-site-with-jmeter.html
>>
>>
>> On Wed, Sep 4, 2013 at 9:19 AM, Jordi Carretero <[email protected]> wrote:
>>
>> > Thanks Sebb, that was very illustrative for me and helped me find the
>> > solution:
>> >
>> > <a href="(?:http://www\.mysite\.com)*[.]*/([^"]+)
>> >
>> > This expression goes in the Regular Expression Extractor. It extracts
>> > the links in the pages, and the result can be used, via a variable, to
>> > populate the path field of the recursive (ForEach Controller) HTTP request.
>> >
>> > To make PHP links work, though, I had to change "Response Field to
>> > check" to Body (unescaped) instead of Body (I don't really know why :(
>> >
>> > Thanks again
>> > Jordi
>> >
>> >
>> >
>> >
>> > On Tue, Sep 3, 2013 at 8:36 PM, sebb <[email protected]> wrote:
>> >
>> > > On 3 September 2013 19:08, Jordi Carretero <[email protected]>
>> > > wrote:
>> > > > Hi
>> > > >
>> > > > I'm building a spider using a regular expression extractor and a
>> > > > for-each controller, and it works pretty well, but...
>> > > >
>> > > > I'm using <a href="[.]*/([^"]+)" as the extractor expression, and it
>> > > > works well to extract links like:
>> > > > <a href="../rel/c/items" >
>> > > > <a href="/professions.html"
>> > > >
>> > > > but I cannot find any expression that will also work for links
>> > > > found on some sites, like:
>> > > >
>> > > > <a href="http://www.mysite.es/index.php?main_page=page&amp;id=20"
>> > > >
>> > > > that include the full domain at the beginning (which has to be removed)
>> > > >
>> > > > It's a matter of working out the Perl expression, but after some days I
>> > > > could not manage to make it work, so any help will be appreciated.
>> > >
>> > > If you want to ignore an optional string, use something like:
>> > >
>> > > (?:http://www\.mysite\.es)?
>> > >
>> > > The form (abc)? means abc or nothing; the (?:) form means don't save
>> > > the contents.
>> > >
>> > > In your case, if you want to ignore ".", ".." and
>> > > "http://www.mysite.es" you could use:
>> > >
>> > > (?:http://www\.mysite\.es|\.\.?)?
>> > >
>> > > BTW, rather than use "[.]" to escape the meta-character ".", the usual
>> > > method is "\.".
>> > >
>> > > > Thanks
>> > >
>> > > ---------------------------------------------------------------------
>> > > To unsubscribe, e-mail: [email protected]
>> > > For additional commands, e-mail: [email protected]
>> > >
>> > >
>> >
>>
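
For anyone who wants to try the optional non-capturing group quoted above outside JMeter, here is a minimal Python sketch. Combining that group with the /([^"]+)" tail from the original expression is my own assumption about how the pieces fit together, and the link_path helper and sample tags are hypothetical. Python's re module is not the engine JMeter uses, though I would expect these constructs to behave the same.

```python
import re

# "(abc)?" is optional and captured; "(?:abc)?" is optional and not captured.
assert re.compile(r"(abc)?x").groups == 1
assert re.compile(r"(?:abc)?x").groups == 0

# Optional prefix: the site URL, "." or "..", or nothing at all.
LINK = re.compile(r'<a href="(?:http://www\.mysite\.es|\.\.?)?/([^"]+)"')


def link_path(tag):
    """Return the path after the optional prefix, or None if the tag does not match."""
    m = LINK.search(tag)
    return m.group(1) if m else None
```

Because the prefix group is non-capturing, the path is always group 1, whichever form of link matched.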

