Re: worker thread with qpidd

Carl Trieloff Wed, 03 Jun 2009 06:28:57 -0700

Generally, if you set the number of threads larger than the core count, your performance will go down, as expected. However, the reason the option is there is so that if you pin the process to fewer than the number of cores, the thread count can be adjusted to match.

On a 2-core machine with the client on the same machine, there are not that many options, as the client and broker will contend for the resources on the machine.

My employer did a report with HP. It is too big to mail out to the list, but here is some of the basic setup that was done for it.


regards
Carl.


     Throughput (Perftest)

For throughput, perftest is used to drive the broker for this benchmark. This harness is able to start up multiple producers and consumers in balanced (n:n) or unbalanced (x:y) configurations.



What the test does:

   * creates a control queue

   * starts x:y producers and consumers

   * waits for all processes to signal they are ready

   * controller records a timestamp

   * producers reliably enqueue messages onto the broker as fast as
     they can

   * consumers reliably dequeue messages from the broker as fast as
     they can

   * once the last message, which is marked, is received, the
     controller is signaled

   * controller waits for all complete signals, records a timestamp and
     calculates the rate

The throughput is then calculated as the total number of messages reliably transferred divided by the time taken to transfer those messages.
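The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not perftest's own code; the function name and the sample timestamps are made up for the example.

```python
def throughput(messages_transferred, start_timestamp, end_timestamp):
    """Messages per second between the controller's two timestamps.

    Timestamps are in seconds; the names here are illustrative only.
    """
    elapsed = end_timestamp - start_timestamp
    if elapsed <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    return messages_transferred / elapsed


# Hypothetical run: 200000 messages transferred in 4 seconds.
rate = throughput(200000, 10.0, 14.0)
print(rate)
```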



     Latency (Latencytest)

For latency, latencytest is used to drive the broker for this benchmark. This harness is able to produce messages at a specified rate, or for a specified number of messages, that are timestamped, sent to the broker, and looped back to the client node. The client will report the minimum, maximum, and average time for a reporting interval when a rate is used, or for all the messages sent when a count is used.



   Tuning & Parameter Settings

For the testing in this paper the systems were not used for any other purposes. Therefore, the configuration and tuning detailed here should be reviewed when other applications run alongside MRG Messaging.



     Processes

For the testing performed the following were disabled (unless specified otherwise):



SELinux

cpuspeed

irqbalance

haldaemon

yum-updatesd

smartd

setroubleshoot

sendmail

rpcgssd

rpcidmapd

rpcsvcgssd

rhnsd

pcscd

mdmonitor

mcstrans

kdump

isdn

iptables

ip6tables

hplip

hidd

gpm

cups

bluetooth

avahi-daemon

restorecond

auditd


     SysCtl

The following kernel parameters were added to /etc/sysctl.conf.

net.ipv4.conf.default.arp_filter, net.ipv4.conf.all.arp_filter
     Value: 1
     Only respond to ARP requests on matching interface.

net.core.rmem_max, net.core.wmem_max
     Value: 8388608
     Maximum receive/send socket buffer size in bytes.

net.core.rmem_default, net.core.wmem_default
     Value: 262144
     Default setting of the socket receive/send buffer in bytes.

net.ipv4.tcp_rmem, net.ipv4.tcp_wmem
     Value: 65536 4194304 8388608
     Vector of 3 integers: min, default, max
     min - minimal size of receive/send buffer used by TCP sockets
     default - default size of receive/send buffer used by TCP sockets
     max - maximal size of receive/send buffer allowed for automatically
     selected receiver buffers for TCP socket

net.core.netdev_max_backlog
     Value: 10000
     Maximum number of packets queued on the input side when the
     interface receives packets faster than the kernel can process them.
     Applies to non-NAPI devices only.

net.ipv4.tcp_window_scaling
     Value: 0
     Enable window scaling as defined in RFC1323.

net.ipv4.tcp_mem
     Value: 262144 4194304 8388608
     Vector of 3 integers: low, pressure, high
     low - below this number of pages TCP is not bothered about its
     memory appetite.
     pressure - when amount of memory allocated by TCP exceeds this
     number of pages, TCP moderates its memory consumption and enters
     memory pressure mode, which is exited when memory consumption
     falls under "low".
     high - number of pages allowed for queueing by all TCP sockets.

/*Table 1*/
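As a rough sketch, the values in Table 1 could be written out as sysctl.conf lines like this. The dictionary simply restates the table; the helper name render_sysctl_conf is invented for the example, and the "key = value" layout is the usual sysctl.conf form rather than something taken from the report.

```python
# Values restated from Table 1; nothing here is tuned beyond what the table lists.
SYSCTL_SETTINGS = {
    "net.ipv4.conf.default.arp_filter": "1",
    "net.ipv4.conf.all.arp_filter": "1",
    "net.core.rmem_max": "8388608",
    "net.core.wmem_max": "8388608",
    "net.core.rmem_default": "262144",
    "net.core.wmem_default": "262144",
    "net.ipv4.tcp_rmem": "65536 4194304 8388608",
    "net.ipv4.tcp_wmem": "65536 4194304 8388608",
    "net.core.netdev_max_backlog": "10000",
    "net.ipv4.tcp_window_scaling": "0",
    "net.ipv4.tcp_mem": "262144 4194304 8388608",
}


def render_sysctl_conf(settings):
    """Render a dict of kernel parameters as 'key = value' lines."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items()) + "\n"


if __name__ == "__main__":
    # Print the fragment; appending it to /etc/sysctl.conf is left to the operator.
    print(render_sysctl_conf(SYSCTL_SETTINGS), end="")
```

After editing the file, running "sysctl -p" as root is the usual way to load the new values without a reboot.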


     ethtool

Some of the options ethtool allows the operator to change relate to coalesce and offload settings. However, during experimentation only changing the ring settings had a noticeable effect for throughput testing.


# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256

# ethtool -G eth1 rx 2048 tx 2048

# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048

#
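If you want to check ring sizes across several machines, output in the shape shown above is easy to parse. The following is a small sketch that assumes the layout is exactly as in this transcript (a section header line followed by "name: value" lines); parse_ring_parameters is a made-up helper, and other ethtool versions may print things differently.

```python
SAMPLE = """\
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256
"""


def parse_ring_parameters(text):
    """Parse 'ethtool -g' style output into {section: {name: value}}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, shell prompts and the title line.
        if not line or line.startswith("#") or line.startswith("Ring parameters"):
            continue
        name, _, value = line.partition(":")
        value = value.strip()
        if value == "":
            # A line such as "Pre-set maximums:" starts a new section.
            current = sections.setdefault(name.strip(), {})
        elif current is not None:
            # Keep non-numeric values (if any) as plain strings.
            current[name.strip()] = int(value) if value.isdigit() else value
    return sections


rings = parse_ring_parameters(SAMPLE)
print(rings["Current hardware settings"]["RX"], "of", rings["Pre-set maximums"]["RX"])
```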


     ifconfig

ifconfig was used to increase the /maximum transfer unit/ (MTU) to support jumbo frames and to increase /txqueuelen/ for throughput testing when these changes had a noticeable effect.


# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
     inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
     inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
     UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
     RX packets:8 errors:0 dropped:0 overruns:0 frame:0
     TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:1000
     RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
     Memory:fdee0000-fdf00000

# ifconfig eth1 mtu 9000 txqueuelen 2000

# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
     inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
     inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
     UP BROADCAST MULTICAST MTU:9000 Metric:1
     RX packets:8 errors:0 dropped:0 overruns:0 frame:0
     TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
     collisions:0 txqueuelen:2000
     RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
     Memory:fdee0000-fdf00000

#


     CPU affinity

For latency testing, all interrupts from the cores of one CPU socket were reassigned to other cores. The interrupts for the interconnect under test were assigned to cores of this vacated socket. The processes related to the interconnect (e.g. ib_mad, ipoib) were then scheduled to run on the vacated cores. The Qpid daemon was also scheduled to run on these or a subset of the vacated cores. How latencytest was scheduled was determined by the results of experiments limiting or not limiting the latencytest process to certain cores.

Experiments with perftest showed that the best performance was usually achieved with the affinity settings as they were after boot, without any manipulation.

Interrupts can be directed to be handled by specific cores. /proc/interrupts can be queried to identify the interrupts for devices and the number of times each CPU/core has handled each interrupt. For each interrupt, a file named /proc/irq/<IRQ #>/smp_affinity contains a hexadecimal mask which controls which cores can respond to that interrupt. The contents of these files can be queried or set.
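The hexadecimal mask has one bit per core, with bit N standing for core N. A small sketch of how to work out the mask for a given set of cores follows; smp_affinity_mask is a made-up helper, and it assumes a machine with no more than 32 cores (larger machines use comma-separated groups, which this does not handle).

```python
def smp_affinity_mask(cores):
    """Return the hex mask string selecting the given core numbers.

    Bit N of the mask corresponds to core N. Assumes fewer than 33 cores.
    """
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core
    return format(mask, "x")


# Cores 4-7, e.g. the four cores of a vacated second socket (hypothetical layout).
print(smp_affinity_mask([4, 5, 6, 7]))
```

The resulting string is what would be written, as root, to the smp_affinity file of the interrupt in question.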

Processes can be restricted to run on a set of CPUs/cores. taskset can be used to define the list of CPUs/cores that a process can be scheduled to execute on.
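As an illustration, here is a sketch that builds a taskset command line for pinning the broker to a list of cores. The helper name is invented; "taskset -c" with a comma-separated core list is the standard util-linux form. Matching --worker-threads to the number of pinned cores is only one possible choice for the example, not a recommendation from the report.

```python
def taskset_command(cores, command):
    """Build an argv list that runs `command` restricted to `cores`."""
    if not cores:
        raise ValueError("at least one core is required")
    cpu_list = ",".join(str(core) for core in sorted(set(cores)))
    return ["taskset", "-c", cpu_list] + list(command)


# Hypothetical: pin qpidd to cores 0 and 1 and size the worker pool to match.
pinned = [0, 1]
argv = taskset_command(pinned, ["qpidd", "--worker-threads", str(len(pinned))])
print(" ".join(argv))
```

The list can be handed to subprocess.run(argv) on a machine where taskset and qpidd are installed.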

The MRG Realtime product includes an application, tuna, that allows for easy setting of the affinity of interrupts and processes, through a GUI or the command line.



     AMQP parameters

Qpid parameters can be specified on the command line, through environment variables or through the Qpid configuration file.



The tests were run with the following qpidd options:

--auth no
     turn off connection authentication, which makes setting up the
     test environment easier

--mgmt-enable no
     disable the collection of management data

--tcp-nodelay
     disable the batching of packets

--worker-threads <#>
     set the number of IO worker threads to <#>
     This was only used for the latency test, where the range used was
     between 1 and one more than the number of cores in a socket.
     The default, which was used for throughput, is one more than the
     total number of active cores.


/*Table 2*/
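The worker-thread numbers in Table 2 come down to simple arithmetic, sketched below. The function names are made up for illustration; the rules themselves (default is active cores plus one, latency range is 1 up to cores per socket plus one) are the ones stated in the table.

```python
def default_worker_threads(active_cores):
    """Default per Table 2: one more than the total number of active cores."""
    return active_cores + 1


def latency_test_worker_range(cores_per_socket):
    """Range tried for latency tests: 1 up to cores-per-socket + 1, inclusive."""
    return range(1, cores_per_socket + 2)


# A 2-core broker machine would default to 3 worker threads.
print(default_worker_threads(2))
# A hypothetical 4-core socket gives candidate values 1 through 5.
print(list(latency_test_worker_range(4)))
```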


     *Table 3* details the options which were specified for /perftest/.
     For all testing in this paper a /count/ of 200000 was used.
     Experimentation was used to detect if setting /tcp-nodelay/ was
     beneficial or not. For each /size/ reported, /npubs/ and /nsubs/
     were set to the same value, from 1 to 8 by powers of 2, while /qt/
     was set from 1 to 16, also by powers of 2. The highest value for
     each /size/ is reported.

--npubs <#>, --nsubs <#>
     number of publishers/subscribers per client

--count <#>
     number of messages sent per pub per qt,
     so total messages = count * qt * (npubs + nsubs)

--qt <#>
     number of queues being used

--size <#>
     message size

--tcp-nodelay
     disable the batching of packets

--protocol <tcp|rdma>
     used to specify RDMA, default is TCP

/*Table 3*/
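To make the perftest sweep concrete, here is a sketch that enumerates the combinations described for Table 3 (npubs equal to nsubs over 1, 2, 4, 8, and qt over 1, 2, 4, 8, 16) and applies the total-messages formula exactly as the table gives it. The names are invented for the example, and this does not run perftest itself.

```python
from itertools import product

COUNT = 200000                     # --count used for all testing in the report
CLIENT_COUNTS = [1, 2, 4, 8]       # npubs == nsubs, by powers of 2
QUEUE_COUNTS = [1, 2, 4, 8, 16]    # qt, by powers of 2


def total_messages(count, qt, npubs, nsubs):
    """Total messages, using the formula as stated in Table 3."""
    return count * qt * (npubs + nsubs)


def sweep():
    """All (npubs, nsubs, qt) combinations tried for each message size."""
    return [(n, n, qt) for n, qt in product(CLIENT_COUNTS, QUEUE_COUNTS)]


for npubs, nsubs, qt in sweep():
    print(npubs, nsubs, qt, total_messages(COUNT, qt, npubs, nsubs))
```

That is 20 runs per message size, before trying each with and without tcp-nodelay.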

The parameters that were used for /latencytest/ are listed in *Table 4*. A /rate/ of 10000 messages was chosen since all the tested interconnects would be able to maintain this rate. When specified, the /max-frame-size/ was set to 120 more than the /size/. When a /max-frame-size/ was specified, /bounds-multiplier/ was set to 1.



--rate <#>
     target message rate

--size <#>
     message size

--max-frame-size <#>
     the maximum frame size to request
     only specified for ethernet interconnects

--bounds-multiplier <#>
     bound size of write queue (as a multiple of the max frame size)
     only specified for ethernet interconnects

--tcp-nodelay
     disable the batching of packets

--protocol <tcp|rdma>
     used to specify RDMA, default is TCP

/*Table 4 */
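The latencytest settings can be put together along these lines. This sketch only assembles an argument list from the options in Table 4 and the rules stated with it (max-frame-size is size plus 120, bounds-multiplier is 1, both only for ethernet); the function and its keyword arguments are made up for illustration.

```python
def latencytest_args(rate, size, ethernet=False, tcp_nodelay=False, protocol=None):
    """Build a latencytest argument list from the Table 4 options."""
    args = ["latencytest", "--rate", str(rate), "--size", str(size)]
    if ethernet:
        # Frame size is 120 more than the message size; write queue bound is 1 frame.
        args += ["--max-frame-size", str(size + 120), "--bounds-multiplier", "1"]
    if tcp_nodelay:
        args.append("--tcp-nodelay")
    if protocol is not None:
        args += ["--protocol", protocol]
    return args


# Hypothetical 1024-byte message test over ethernet at the report's 10000 rate.
print(" ".join(latencytest_args(10000, 1024, ethernet=True)))
```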






ft420 wrote:

exchange used: fanout
we are running broker on 2 core machine. fanout send client is also running
on the same windows machine.
there are 3 recv applications running on three separate machines.

we were trying with-> --worker-thread 9 which gives poor performance
compared to without --worker-threads option
now we have taken --worker-threads 2 as no of processors on the machine
where broker is running is 2. in this case how many threads exactly has to
be used to so as to improve performance

Thanks



Gordon Sim wrote:
ft420 wrote:
hi,

without --worker-thread option pidstat command shows that there are by
default 6 threads created
with --worker-thread 6 option pidstat command shows that there are 9 i.e.
default 6 + 3 threads created.

Fyi: the extra three threads are timer threads for various different
tasks.

As per documentation worker threads option is used to improve
performance.
I checked with --worker-thread 10 and without --worker-thread.
direct_producer sends 100000 messages put time increases with
--worker-thread 10 as compared to running without the --worker-thread option.

Running more threads than there are processors will not improve any real parallelism. There is also no real value from using more threads than you have active connections (so in a test with just one producer and one consumer connection you won't see any benefit from having more than 2 worker threads).
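Put as a rule of thumb, the point above could be sketched like this. It is only an illustration of the reasoning in this thread, not anything the broker does itself: the function names are invented, and the count of three timer threads is just what was reported here for this build, not a guaranteed constant.

```python
TIMER_THREADS = 3  # as reported in this thread; may differ between broker versions


def useful_worker_threads(cores, active_connections):
    """Upper bound beyond which extra worker threads add no parallelism."""
    return max(1, min(cores, active_connections))


def expected_thread_count(worker_threads):
    """Threads a tool like pidstat would show: workers plus timer threads."""
    return worker_threads + TIMER_THREADS


# One producer and one consumer connection on an 8-core machine (hypothetical).
print(useful_worker_threads(8, 2))
# --worker-threads 6 shows up as 9 threads in total.
print(expected_thread_count(6))
```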
---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]
