Generally, if you set the number of threads higher than the core count, your performance will go down, as expected. The option is there so that the thread count can be adjusted when you pin the process to
fewer cores than the machine has.

On a 2-core machine with the client on the same machine, there are not many options, as the client
and broker will contend for the resources on the machine.

My employer did a report with HP. It is too big to mail out to the list, but here is some of the
basic setup that was done for it.


regards
Carl.


     Throughput (Perftest)

For the throughput benchmark, perftest is used to drive the broker. This harness can start multiple producers and consumers in balanced (n:n) or unbalanced (x:y) configurations.


What the test does:

   * creates a control queue
   * starts x:y producers and consumers
   * waits for all processes to signal they are ready
   * controller records a timestamp
   * producers reliably enqueue messages onto the broker as fast as
     they can
   * consumers reliably dequeue messages from the broker as fast as
     they can
   * once the last message, which is marked, is received, the
     controller is signaled
   * controller waits for all complete signals, records a timestamp and
     calculates the rate

The throughput is then calculated as the total number of messages reliably transferred divided by the time taken to transfer those messages.
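
As a rough illustration of that calculation, here is a minimal Python sketch. The function name and the sample numbers are made up for illustration; this is not perftest's own code.

```python
def throughput(total_messages, start_ts, end_ts):
    """Messages per second between the controller's two timestamps (in seconds)."""
    elapsed = end_ts - start_ts
    if elapsed <= 0:
        raise ValueError("end timestamp must be later than start timestamp")
    return total_messages / elapsed


# Hypothetical run: 400000 messages transferred in 8 seconds.
rate = throughput(400_000, start_ts=100.0, end_ts=108.0)
print(f"{rate:.0f} msgs/sec")
```

Because the start timestamp is taken only after everything has signaled ready, startup cost is not counted in the elapsed time.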


     Latency (Latencytest)

For the latency benchmark, latencytest is used to drive the broker. This harness can produce messages at a specified rate, or a specified number of messages; each message is timestamped, sent to the broker, and looped back to the client node. The client reports the minimum, maximum, and average time for each reporting interval when a rate is used, or for all the messages sent when a count is used.
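
The reported figures are simple summary statistics over the measured round-trip times. A minimal Python sketch, assuming the samples for one interval (or one whole run) are already collected in a list; the function name and sample values are hypothetical:

```python
def summarize_latencies(samples_ms):
    """Return (minimum, maximum, average) for a list of round-trip times."""
    if not samples_ms:
        raise ValueError("no samples to summarize")
    return min(samples_ms), max(samples_ms), sum(samples_ms) / len(samples_ms)


# Hypothetical round-trip times in milliseconds for one reporting interval.
lowest, highest, average = summarize_latencies([0.5, 1.0, 1.5, 3.0])
print(lowest, highest, average)
```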


   Tuning & Parameter Settings

For the testing in this paper, the systems were not used for any other purpose. Therefore, the configuration and tuning detailed here should be reviewed when other applications run alongside MRG Messaging.


     Processes

For the testing performed the following were disabled (unless specified otherwise):


SELinux
cpuspeed
irqbalance
haldaemon
yum-updatesd
smartd
setroubleshoot
sendmail
rpcgssd
rpcidmapd
rpcsvcgssd
rhnsd
pcscd
mdmonitor
mcstrans
kdump
isdn
iptables
ip6tables
hplip
hidd
gpm
cups
bluetooth
avahi-daemon
restorecond
auditd


     SysCtl

The following kernel parameters were added to /etc/sysctl.conf.

Parameter                          Value                   Description
---------------------------------  ----------------------  ------------------------------------------------------------
net.ipv4.conf.default.arp_filter,  1                       Only respond to ARP requests on the matching interface.
net.ipv4.conf.all.arp_filter

net.core.rmem_max,                 8388608                 Maximum receive/send socket buffer size in bytes.
net.core.wmem_max

net.core.rmem_default,             262144                  Default receive/send socket buffer size in bytes.
net.core.wmem_default

net.ipv4.tcp_rmem,                 65536 4194304 8388608   Vector of 3 integers: min, default, max.
net.ipv4.tcp_wmem                                          min: minimal size of the receive/send buffer used by TCP
                                                           sockets.
                                                           default: default size of the receive/send buffer used by
                                                           TCP sockets.
                                                           max: maximal size of the receive/send buffer allowed for
                                                           automatically selected buffers for TCP sockets.

net.core.netdev_max_backlog        10000                   Maximum number of packets queued on the input side when the
                                                           interface receives packets faster than the kernel can
                                                           process them. Applies to non-NAPI devices only.

net.ipv4.tcp_window_scaling        0                       Window scaling as defined in RFC 1323 (1 enables it,
                                                           0 disables it).

net.ipv4.tcp_mem                   262144 4194304 8388608  Vector of 3 integers: low, pressure, high.
                                                           low: below this number of pages TCP is not bothered about
                                                           its memory appetite.
                                                           pressure: when the amount of memory allocated by TCP
                                                           exceeds this number of pages, TCP moderates its memory
                                                           consumption and enters memory pressure mode, which is
                                                           exited when memory consumption falls under "low".
                                                           high: number of pages allowed for queueing by all TCP
                                                           sockets.


/*Table 1*/
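
Note that the tcp_mem values are counted in pages, while the rmem/wmem values are in bytes. A small Python sketch of the conversion, assuming a 4 KiB page size (common on x86 Linux, but an assumption here; the report does not state the page size):

```python
PAGE_SIZE_BYTES = 4096  # assumption: 4 KiB pages; check the real value on your system


def pages_to_gib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to GiB for a given page size in bytes."""
    return pages * page_size / 2**30


# The three net.ipv4.tcp_mem values from Table 1.
tcp_mem_pages = {"low": 262144, "pressure": 4194304, "high": 8388608}
for name, pages in tcp_mem_pages.items():
    print(f"tcp_mem {name}: {pages} pages = {pages_to_gib(pages):g} GiB")
```

On a machine with much less memory than these thresholds, the pressure and high limits would effectively never be reached.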


     ethtool

ethtool allows the operator to change coalesce and offload settings, among other options. However, during experimentation only changing the ring settings had a noticeable effect on throughput testing.

# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 256
RX Mini: 0
RX Jumbo: 0
TX: 256

# ethtool -G eth1 rx 2048 tx 2048
# ethtool -g eth1
Ring parameters for eth1:
Pre-set maximums:
RX: 4096
RX Mini: 0
RX Jumbo: 0
TX: 4096
Current hardware settings:
RX: 2048
RX Mini: 0
RX Jumbo: 0
TX: 2048


     ifconfig

ifconfig was used to increase the /maximum transfer unit/ (MTU) to support jumbo frames and to increase /txqueuelen/ for throughput testing when these changes had a noticeable effect.

# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
Memory:fdee0000-fdf00000

# ifconfig eth1 mtu 9000 txqueuelen 2000
# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:18:71:EC:02:80
inet addr:192.168.15.96 Bcast:192.168.15.255 Mask:255.255.255.0
inet6 addr: fe80::218:71ff:feec:280/64 Scope:Link
UP BROADCAST MULTICAST MTU:9000 Metric:1
RX packets:8 errors:0 dropped:0 overruns:0 frame:0
TX packets:9 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:2000
RX bytes:480 (480.0 b) TX bytes:594 (594.0 b)
Memory:fdee0000-fdf00000


     CPU affinity

For latency testing, all interrupts from the cores of one CPU socket were reassigned to other cores. The interrupts for the interconnect under test were assigned to cores of this vacated socket. The processes related to the interconnect (e.g. ib_mad, ipoib) were then scheduled to run on the vacated cores. The Qpid daemon was also scheduled to run on these or a subset of the vacated cores. How latencytest was scheduled was determined by the results of experiments limiting or not limiting the latencytest process to certain cores.


Experiments with perftest showed that the best performance was usually achieved with the affinity settings left as they were after boot, without any manipulation.


Interrupts can be directed to specific cores. /proc/interrupts can be queried to identify the interrupts for devices and the number of times each CPU/core has handled each interrupt. For each interrupt, a file named /proc/irq/<IRQ #>/smp_affinity contains a hexadecimal mask which controls which cores can respond to that interrupt. The contents of these files can be queried or set.
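
To make the hexadecimal mask concrete: bit N of the mask corresponds to core N. The Python sketch below converts between a list of core numbers and a mask string. The helper names are made up for illustration, and nothing here reads or writes /proc.

```python
def cores_to_mask(cores):
    """Hexadecimal affinity mask with one bit set per core number."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core
    return format(mask, "x")


def mask_to_cores(hex_mask):
    """Core numbers whose bits are set in a hexadecimal mask.

    Commas are dropped so that masks printed in comma-separated
    32-bit groups are also accepted.
    """
    value = int(hex_mask.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


print(cores_to_mask([4, 5, 6, 7]))  # mask selecting cores 4 to 7
print(mask_to_cores("f0"))          # cores selected by mask f0
```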


Processes can be restricted to run on a set of CPUs/cores. taskset can be used to define the list of CPUs/cores on which a process can be scheduled to execute.
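
For illustration only, the same kind of per-process CPU affinity can be set from Python on Linux through os.sched_setaffinity. This is a sketch and not how the tests in the report were run; the helper name is hypothetical, and the example restores the original affinity afterwards.

```python
import os


def pin_current_process(cores):
    """Restrict this process to the requested cores that are actually available."""
    available = os.sched_getaffinity(0)  # pid 0 means the calling process
    target = set(cores) & available
    if not target:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):  # these calls exist on Linux only
    original = os.sched_getaffinity(0)
    print(pin_current_process([min(original)]))  # pin to one available core
    os.sched_setaffinity(0, original)            # restore the original setting
```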


The MRG Realtime product includes an application, tuna, that allows easy setting of the affinity of interrupts and processes through a GUI or the command line.


     AMQP parameters

Qpid parameters can be specified on the command line, through environment variables or through the Qpid configuration file.


The tests were run with the following qpidd options:

Option                  Description
----------------------  ------------------------------------------------------------
--auth no               Turn off connection authentication; makes setting up the
                        test environment easier.
--mgmt-enable no        Disable the collection of management data.
--tcp-nodelay           Disable the batching of packets.
--worker-threads <#>    Set the number of I/O worker threads to <#>. This was only
                        used for the latency tests, where the range used was
                        between 1 and one more than the number of cores in a
                        socket. The default, which was used for throughput, is
                        one more than the total number of active cores.

/*Table 2*/
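
The worker-thread numbers above can be sketched in Python as follows. The helper names are made up for illustration, and the argument list is only built, not executed; the flags are the ones listed in Table 2.

```python
import os


def default_worker_threads(active_cores):
    """Default described above: one more than the number of active cores."""
    return active_cores + 1


def latency_worker_thread_range(cores_per_socket):
    """Values tried for latency tests: 1 up to cores-per-socket + 1, inclusive."""
    return range(1, cores_per_socket + 2)


def qpidd_args(worker_threads=None):
    """Build a qpidd argument list from the Table 2 options."""
    args = ["qpidd", "--auth", "no", "--mgmt-enable", "no", "--tcp-nodelay"]
    if worker_threads is not None:  # omit the flag to keep the broker default
        args += ["--worker-threads", str(worker_threads)]
    return args


cores = os.cpu_count() or 1  # cpu_count() can return None
print("default worker threads on this machine:", default_worker_threads(cores))
print(" ".join(qpidd_args(worker_threads=2)))
```

For the 2-core broker machine discussed in this thread, that default works out to 3 worker threads.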


*Table 3* details the options which were specified for /perftest/. For all testing in this paper a /count/ of 200000 was used. Experimentation was used to determine whether setting /tcp-nodelay/ was beneficial. For each /size/ reported, /npubs/ and /nsubs/ were set equal, from 1 to 8 by powers of 2, while /qt/ was set from 1 to 16, also by powers of 2. The highest value for each /size/ is reported.

Option                  Description
----------------------  ------------------------------------------------------------
--npubs <#>,            Number of publishers/subscribers per client.
--nsubs <#>
--count <#>             Number of messages sent per pub per qt, so
                        total messages = count * qt * (npub+nsub).
--qt <#>                Number of queues being used.
--size <#>              Message size.
--tcp-nodelay           Disable the batching of packets.
--protocol <tcp|rdma>   Used to specify RDMA; the default is TCP.

/*Table 3*/
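
To give a feel for the size of that parameter sweep, here is a Python sketch that enumerates the npubs/nsubs and qt combinations described above and applies the total-messages formula exactly as stated in Table 3. The function names are hypothetical.

```python
COUNT = 200_000  # the count used for all testing in the report


def total_messages(count, qt, npubs, nsubs):
    """Total messages, using the formula as given in Table 3."""
    return count * qt * (npubs + nsubs)


def sweep(count=COUNT):
    """Yield (npubs=nsubs, qt, total) for each combination that was tried."""
    for n in (1, 2, 4, 8):            # npubs and nsubs set equal, powers of 2
        for qt in (1, 2, 4, 8, 16):   # queues, powers of 2
            yield n, qt, total_messages(count, qt, n, n)


for n, qt, total in sweep():
    print(f"npubs=nsubs={n} qt={qt} total={total}")
```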


The parameters that were used for /latencytest/ are listed in *Table 4*. A /rate/ of 10000 messages was chosen since all the tested interconnects would be able to maintain this rate. When specified, /max-frame-size/ was set to 120 more than the /size/. When a /max-frame-size/ was specified, /bounds-multiplier/ was set to 1.


Option                   Description
-----------------------  ------------------------------------------------------------
--rate <#>               Target message rate.
--size <#>               Message size.
--max-frame-size <#>     The maximum frame size to request.
                         Only specified for Ethernet interconnects.
--bounds-multiplier <#>  Bound size of the write queue (as a multiple of the max
                         frame size). Only specified for Ethernet interconnects.
--tcp-nodelay            Disable the batching of packets.
--protocol <tcp|rdma>    Used to specify RDMA; the default is TCP.

/*Table 4 */
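
The relationship between /size/, /max-frame-size/ and /bounds-multiplier/ can be sketched in Python like this. The helper name and its keyword arguments are hypothetical; the flags and the 120-byte offset come from the text and Table 4, and the argument list is only built, not run.

```python
RATE = 10000          # message rate used in the report
FRAME_OVERHEAD = 120  # max-frame-size was set to size + 120 when specified


def latencytest_args(size, ethernet=False, tcp_nodelay=False, protocol="tcp"):
    """Build a latencytest argument list from the Table 4 options."""
    args = ["latencytest", "--rate", str(RATE), "--size", str(size)]
    if ethernet:
        # Frame size and bounds multiplier were only given for Ethernet runs.
        args += ["--max-frame-size", str(size + FRAME_OVERHEAD),
                 "--bounds-multiplier", "1"]
    if tcp_nodelay:
        args.append("--tcp-nodelay")
    if protocol != "tcp":  # TCP is the default, so only RDMA needs the flag
        args += ["--protocol", protocol]
    return args


print(" ".join(latencytest_args(1024, ethernet=True, tcp_nodelay=True)))
print(" ".join(latencytest_args(1024, protocol="rdma")))
```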






ft420 wrote:
Exchange used: fanout.

We are running the broker on a 2 core machine. The fanout send client is also running
on the same windows machine.
There are 3 recv applications running on three separate machines.

We were trying with --worker-threads 9, which gives poor performance
compared to running without the --worker-threads option. Now we have taken --worker-threads 2, as the no of processors on the machine
where the broker is running is 2. In this case, exactly how many threads have to
be used so as to improve performance?

Thanks



Gordon Sim wrote:
ft420 wrote:
hi,

Without the --worker-threads option, the pidstat command shows that there are by
default 6 threads created. With the --worker-threads 6 option, the pidstat command shows that there are 9, i.e.
the default 6 + 3 threads created.

Fyi: the extra three threads are timer threads for various different
tasks.

As per the documentation, the worker threads option is used to improve
performance. I checked with --worker-threads 10 and without --worker-threads.
direct_producer sends 100000 messages; put time increases with
--worker-threads 10 as compared to running without the --worker-threads option.

Running more threads than there are processors will not improve any real parallelism. There is also no real value from using more threads than you have active connections (so in a test with just one producer and one consumer connection you won't see any benefit from having more than 2 worker threads).

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:[email protected]




