Re: Network interconnect settings in IaaS environments

2016-09-17 Thread Paul Guo
Note that some L2 tunables depend heavily on the NIC driver the virtual
machine (VM) is using.

For example, with PCI SR-IOV VFs or full PCI device assignment, the NIC in a
VM behaves like a physical NIC, so most L2 tunables can stay at their usual
default values. For a para-virtualized NIC, however, different values for
those tunables (e.g. the TX queue length, or enabling/disabling NIC offloads
such as GSO and TSO) are often better.
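
As a rough illustration, here is a minimal sketch of how those tunables can be
inspected and adjusted on a Linux guest. The interface name eth0 and the
specific values are placeholders rather than recommendations; whether the
offloads should stay on or be turned off depends on the NIC driver, as noted
above.

    # Show the current offload settings (GSO, TSO, etc.) for the interface
    ethtool -k eth0

    # Example: turn GSO/TSO off for a para-virtualized NIC (placeholder choice)
    ethtool -K eth0 gso off tso off

    # Show the current transmit queue length (printed as "qlen" in the output)
    ip link show dev eth0

    # Example: change the transmit queue length (placeholder value)
    ip link set dev eth0 txqueuelen 1000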



2016-09-17 12:52 GMT+08:00 Lei Chang :

> Here is some more information about the hawq interconnect. NOTE that the
> default values were tuned on *physical* hardware, not on Azure. On Amazon
> and VMware, all of the default settings appear to work fine.
>
> ·   gp_interconnect_type: Sets the protocol used for inter-node
> communication. Valid values are "tcp" and "udp". "udp" is the new UDP
> interconnect implementation with flow control. The default value is "udp".
>
> ·   gp_interconnect_fc_method: Sets the flow control method used for the
> UDP interconnect. Valid values are "capacity" and "loss". With
> "capacity"-based flow control, senders do not send packets when receivers
> do not have capacity. "Loss"-based flow control builds on "capacity"-based
> flow control and additionally tunes the sending speed according to packet
> losses. The default value is "loss".
>
> ·   gp_interconnect_snd_queue_depth: A new parameter that specifies the
> average size of a send queue. The buffer pool size for each send process
> can be calculated as gp_interconnect_snd_queue_depth * the number of
> processes in the downstream gang. The default value is 2.
>
> ·   gp_interconnect_cache_future_packets: A new parameter that controls
> whether future packets are cached on the receiver side. The default value
> is "true".
>
> ·   gp_udp_bufsize_k: Changed from "PGC_SUSET" to "PGC_BACKEND" so that
> customers can customize the size of the socket buffers used by the
> interconnect. The maximum value is raised to 32768 KB (32 MB).
>
>
> For the UDP interconnect, end users should also tune the OS kernel memory
> used by sockets. On Linux, the relevant settings are
>
> ·   net.core.rmem_max
>
> ·   net.core.wmem_max
>
> ·   txqueuelen (Transmit Queue Length)
>
> The recommended value for net.core.rmem_max and net.core.wmem_max is 2 MB
> or greater. txqueuelen can be increased if the OS drops packets due to
> kernel ring buffer overflow. If the number of nodes is large, pay attention
> to the queue depth and socket buffer size settings to avoid packet losses
> caused by a small OS buffer.
>
>
> On Sat, Sep 17, 2016 at 12:44 PM, Lei Chang  wrote:
>
> > please see the comments inline
> >
> > On Sat, Sep 17, 2016 at 3:07 AM, Kyle Dunn  wrote:
> >
> >> In an ongoing evaluation of HAWQ in Azure, we've encountered some
> >> sub-optimal network performance. It would be great to get some additional
> >> information about a few server parameters related to the network:
> >>
> >> - gp_max_packet_size
> >>The default is documented at 8192. Why was this number chosen? Should
> >> this value be aligned with the network infrastructure's configured MTU,
> >> accounting for the packet header size of the chosen interconnect type?
> >>  (Azure only supports an MTU of 1500 and has shown better reliability
> >> using TCP in Greenplum)
> >>
> >
> > 8K is an empirical value from evaluating interconnect performance on
> > physical hardware; it showed the best performance there.
> >
> > We have not benchmarked it on Azure, and UDP on Azure looks unstable. You
> > can set "gp_interconnect_log_stats" to see per-query interconnect
> > statistics, and you can also use ifconfig to check for packet errors.
> >
> > If the network is not stable, it is worth trying to decrease the value to
> > below 1500 so that the user-space packet size is aligned with the maximum
> > kernel packet size (the MTU). Decreasing the value, however, increases the
> > CPU cost of marshalling/unmarshalling packets, so there is a tradeoff.
> >
> >
> >>
> >> - gp_interconnect_type
> >> The docs claim UDPIFC is the default; UDP is the observed default. Do
> >> the recommendations around which setting to use vary in an IaaS
> >> environment (AWS or Azure)?
> >>
> >
> > Which doc? When we released UDPIFC for gpdb, we kept the old UDP and
> > added UDPIFC as a separate value to avoid potential regressions, since
> > there were a lot of UDP deployments of gpdb at that time. After UDPIFC
> > was released, it proved much more stable and performed better than UDP,
> > so when we released hawq we simply replaced UDP with UDPIFC but kept the
> > name "UDP". So UDP in HAWQ is UDPIFC.
> >
> > There are two flow control methods in UDPIFC that I'd suggest you try via
> > gp_interconnect_fc_method (INTERCONNECT_FC_METHOD_CAPACITY and
> > INTERCONNECT_FC_METHOD_LOSS).
> >
> >
> >> - gp_interconnect_queue_depth
> >>My naive read of this is performance can be traded off for (potentially
> >> significant) RAM utilization. Is there 

Re: Network interconnect settings in IaaS environments

2016-09-16 Thread Lei Chang
Here is some more information about the hawq interconnect; a short sketch of
inspecting these settings from psql follows the parameter list below. NOTE
that the default values were tuned on *physical* hardware, not on Azure. On
Amazon and VMware, all of the default settings appear to work fine.

·   gp_interconnect_type: Sets the protocol used for inter-node
communication. Valid values are "tcp" and "udp". "udp" is the new UDP
interconnect implementation with flow control. The default value is "udp".

·   gp_interconnect_fc_method: Sets the flow control method used for the UDP
interconnect. Valid values are "capacity" and "loss". With "capacity"-based
flow control, senders do not send packets when receivers do not have
capacity. "Loss"-based flow control builds on "capacity"-based flow control
and additionally tunes the sending speed according to packet losses. The
default value is "loss".

·   gp_interconnect_snd_queue_depth: A new parameter that specifies the
average size of a send queue. The buffer pool size for each send process can
be calculated as gp_interconnect_snd_queue_depth * the number of processes in
the downstream gang. The default value is 2.

·   gp_interconnect_cache_future_packets: A new parameter that controls
whether future packets are cached on the receiver side. The default value is
"true".

·   gp_udp_bufsize_k: Changed from "PGC_SUSET" to "PGC_BACKEND" so that
customers can customize the size of the socket buffers used by the
interconnect. The maximum value is raised to 32768 KB (32 MB).
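
A minimal sketch of inspecting the parameters above from psql, assuming the
usual Postgres-style SHOW command applies to these GUCs in your HAWQ build
(the database name is a placeholder):

    # Inspect the current values of the interconnect parameters listed above
    psql -d postgres -c "SHOW gp_interconnect_type;"
    psql -d postgres -c "SHOW gp_interconnect_fc_method;"
    psql -d postgres -c "SHOW gp_interconnect_snd_queue_depth;"
    psql -d postgres -c "SHOW gp_interconnect_cache_future_packets;"

    # gp_udp_bufsize_k is PGC_BACKEND, so it cannot be changed with SET in an
    # existing session; it has to be supplied when the connection starts, for
    # example via the standard PGOPTIONS mechanism (value is a placeholder)
    PGOPTIONS="-c gp_udp_bufsize_k=2048" psql -d postgres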


For the UDP interconnect, end users should also tune the OS kernel memory
used by sockets. On Linux, the relevant settings are

·   net.core.rmem_max

·   net.core.wmem_max

·   txqueuelen (Transmit Queue Length)

The recommended value for net.core.rmem_max and net.core.wmem_max is 2 MB or
greater. txqueuelen can be increased if the OS drops packets due to kernel
ring buffer overflow. If the number of nodes is large, pay attention to the
queue depth and socket buffer size settings to avoid packet losses caused by
a small OS buffer.
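
A minimal sketch of applying those OS-level settings on a Linux host (the
2 MB value matches the floor suggested above; the txqueuelen value and the
interface name eth0 are placeholders):

    # Raise the socket buffer ceilings to 2 MB for the running kernel
    sysctl -w net.core.rmem_max=2097152
    sysctl -w net.core.wmem_max=2097152

    # Persist the settings across reboots
    echo "net.core.rmem_max = 2097152" >> /etc/sysctl.conf
    echo "net.core.wmem_max = 2097152" >> /etc/sysctl.conf

    # Increase the NIC transmit queue length if ring buffer overflow is observed
    ip link set dev eth0 txqueuelen 2000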


On Sat, Sep 17, 2016 at 12:44 PM, Lei Chang  wrote:

> please see the comments inline
>
> On Sat, Sep 17, 2016 at 3:07 AM, Kyle Dunn  wrote:
>
>> In an ongoing evaluation of HAWQ in Azure, we've encountered some
>> sub-optimal network performance. It would be great to get some additional
>> information about a few server parameters related to the network:
>>
>> - gp_max_packet_size
>>The default is documented at 8192. Why was this number chosen? Should
>> this value be aligned with the network infrastructure's configured MTU,
>> accounting for the packet header size of the chosen interconnect type?
>>  (Azure only supports an MTU of 1500 and has shown better reliability
>> using TCP in Greenplum)
>>
>
> 8K is an empirical value from evaluating interconnect performance on
> physical hardware; it showed the best performance there.
>
> We have not benchmarked it on Azure, and UDP on Azure looks unstable. You
> can set "gp_interconnect_log_stats" to see per-query interconnect
> statistics, and you can also use ifconfig to check for packet errors.
>
> If the network is not stable, it is worth trying to decrease the value to
> below 1500 so that the user-space packet size is aligned with the maximum
> kernel packet size (the MTU). Decreasing the value, however, increases the
> CPU cost of marshalling/unmarshalling packets, so there is a tradeoff.
>
>
>>
>> - gp_interconnect_type
>> The docs claim UDPIFC is the default; UDP is the observed default. Do
>> the recommendations around which setting to use vary in an IaaS
>> environment
>> (AWS or Azure)?
>>
>
> Which doc? When we released UDPIFC for gpdb, we kept the old UDP and added
> UDPIFC as a separate value to avoid potential regressions, since there were
> a lot of UDP deployments of gpdb at that time. After UDPIFC was released,
> it proved much more stable and performed better than UDP, so when we
> released hawq we simply replaced UDP with UDPIFC but kept the name "UDP".
> So UDP in HAWQ is UDPIFC.
>
> There are two flow control methods in UDPIFC that I'd suggest you try via
> gp_interconnect_fc_method (INTERCONNECT_FC_METHOD_CAPACITY and
> INTERCONNECT_FC_METHOD_LOSS).
>
>
>> - gp_interconnect_queue_depth
>>My naive read of this is performance can be traded off for (potentially
>> significant) RAM utilization. Is there additional detail around turning
>> this knob? How does the interaction between this and the underlying NIC
>> queue depth affect performance? As an example, in Azure, disabling TX
>> queuing (ifconfig eth0 txqueuelen 0) on the virtual NIC improved benchmark
>> performance, as the underlying Hyper-V host is doing its own queuing
>> anyway.
>>
>>
> This queue is an application-level queue used for caching and for handling
> out-of-order and lost packets.
>
> According to our past performance testing on physical hardware, increasing
> it to a large value does not show much benefit, but a value that is too
> small does hurt performance. It needs more testing on Azure, I think.

Re: Network interconnect settings in IaaS environments

2016-09-16 Thread Lei Chang
please see the comments inline

On Sat, Sep 17, 2016 at 3:07 AM, Kyle Dunn  wrote:

> In an ongoing evaluation of HAWQ in Azure, we've encountered some
> sub-optimal network performance. It would be great to get some additional
> information about a few server parameters related to the network:
>
> - gp_max_packet_size
>The default is documented at 8192. Why was this number chosen? Should
> this value be aligned with the network infrastructure's configured MTU,
> accounting for the packet header size of the chosen interconnect type?
>  (Azure only supports an MTU of 1500 and has shown better reliability
> using TCP in Greenplum)
>

8K is an empirical value from evaluating interconnect performance on physical
hardware; it showed the best performance there.

We have not benchmarked it on Azure, and UDP on Azure looks unstable. You can
set "gp_interconnect_log_stats" to see per-query interconnect statistics, and
you can also use ifconfig to check for packet errors.

If the network is not stable, it is worth trying to decrease the value to
below 1500 so that the user-space packet size is aligned with the maximum
kernel packet size (the MTU). Decreasing the value, however, increases the
CPU cost of marshalling/unmarshalling packets, so there is a tradeoff.
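
A minimal sketch of the checks mentioned above, run on a segment host; the
interface name eth0 and the database name are placeholders:

    # Per-interface error and drop counters (ifconfig reports similar fields)
    ip -s link show dev eth0

    # Protocol-level UDP statistics, including buffer-related errors
    netstat -su

    # Inspect the statistics GUC named above before enabling it for a session
    psql -d postgres -c "SHOW gp_interconnect_log_stats;"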


>
> - gp_interconnect_type
> The docs claim UDPIFC is the default; UDP is the observed default. Do
> the recommendations around which setting to use vary in an IaaS environment
> (AWS or Azure)?
>

Which doc? When we released UDPIFC for gpdb, we kept the old UDP and added
UDPIFC as a separate value to avoid potential regressions, since there were a
lot of UDP deployments of gpdb at that time. After UDPIFC was released, it
proved much more stable and performed better than UDP, so when we released
hawq we simply replaced UDP with UDPIFC but kept the name "UDP". So UDP in
HAWQ is UDPIFC.

There are two flow control methods in UDPIFC that I'd suggest you try via
gp_interconnect_fc_method (INTERCONNECT_FC_METHOD_CAPACITY and
INTERCONNECT_FC_METHOD_LOSS).
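
A minimal sketch of comparing the two methods on the same query, assuming the
GUC can be set at session level (database and table names are placeholders):

    # Run the same test query under each flow control method and compare timings
    psql -d postgres -c "SET gp_interconnect_fc_method = 'loss';     SELECT count(*) FROM my_table;"
    psql -d postgres -c "SET gp_interconnect_fc_method = 'capacity'; SELECT count(*) FROM my_table;"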


> - gp_interconnect_queue_depth
>My naive read of this is performance can be traded off for (potentially
> significant) RAM utilization. Is there additional detail around turning
> this knob? How does the interaction between this and the underlying NIC
> queue depth affect performance? As an example, in Azure, disabling TX
> queuing (ifconfig eth0 txqueuelen 0) on the virtual NIC improved benchmark
> performance, as the underlying Hyper-V host is doing its own queuing
> anyway.
>
>
This queue is an application-level queue used for caching and for handling
out-of-order and lost packets.

According to our past performance testing on physical hardware, increasing it
to a large value does not show much benefit, but a value that is too small
does hurt performance. It needs more testing on Azure, I think.
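
A minimal sketch of experimenting with this, assuming
gp_interconnect_queue_depth can be set at session level like the other
interconnect GUCs above (values, database and table names are placeholders,
not recommendations):

    # Compare a small and a larger application-level queue depth on one query
    psql -d postgres -c "SET gp_interconnect_queue_depth = 4;  SELECT count(*) FROM my_table;"
    psql -d postgres -c "SET gp_interconnect_queue_depth = 16; SELECT count(*) FROM my_table;"

    # The NIC-level experiment from the question: drop the virtual NIC's own
    # transmit queue and let the hypervisor do the queuing
    ip link set dev eth0 txqueuelen 0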


>
> Thanks,
> Kyle
> --
> *Kyle Dunn | Data Engineering | Pivotal*
> Direct: 303.905.3171 | Email: kd...@pivotal.io
>