Re: Low latency High Frequency Trading

2012-11-15 Thread Nicholas Marriott
Your first paragraph is not really true. For financial data UDP
multicast is more efficient and can be considerably faster than TCP,
even if you need to check integrity (which isn't always the case). Most
market data feeds are UDP multicast for a reason.

FPGAs can be very fast but they do have obvious disadvantages over a
general purpose platform. Which can be very fast too, often fast
enough. 50 microseconds should be within reach.

You are right that running this stuff on Windows is not a good idea
though :-).
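For concreteness, the fan-out side being discussed (one received message, many subscribers) is only a few lines of socket code. This is a minimal userland sketch; the group address, port, and TTL below are illustrative placeholders, not values from any real feed:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* IPv4 multicast addresses are 224.0.0.0/4: top nibble is 1110. */
int
is_multicast_v4(uint32_t addr_host_order)
{
	return (addr_host_order >> 28) == 0xE;
}

/*
 * Send one payload to a multicast group.  Returns 0 on success, -1 on
 * error.  A TTL of 1 keeps the traffic on the local segment.
 */
int
mcast_send_once(const char *group, uint16_t port,
    const void *buf, size_t len)
{
	struct sockaddr_in sin;
	unsigned char ttl = 1;
	int s, rv;

	if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
		return -1;
	setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	if (inet_pton(AF_INET, group, &sin.sin_addr) != 1) {
		close(s);
		return -1;
	}
	rv = sendto(s, buf, len, 0, (struct sockaddr *)&sin,
	    sizeof(sin)) == (ssize_t)len ? 0 : -1;
	close(s);
	return rv;
}
```

A receiver binds the same port and joins the group with IP_ADD_MEMBERSHIP; on real feeds, integrity checking typically rides in the application payload (per-message sequence numbers) rather than in the transport.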



On Sun, Nov 11, 2012 at 07:35:32AM -0500, Nico Kadel-Garcia wrote:
 On Thu, Nov 8, 2012 at 12:58 PM, Ariel Burbaickij
 ariel.burbaic...@gmail.com wrote:
  If money is not a problem -- go buy high-frequency-trading-on-the-chip
  solutions and have sub-microsecond resolution.
 
  http://lmgtfy.com/?q=high+frequency+trading+FPGA
 
 Seconded as a much more viable approach.  The existing multicast
 approach for such data is much like trying to hurl apple pies with F-6
 jets. By the time you've packaged the original data, blown it across
 the wire, re-assembled it, *and tagged and checksummed it for validity
 and correct packet order*, you're rarely any faster than a normal TCP
 transmission.  This doesn't matter much for streaming video, but when
 you're talking about billion dollar stock prices and tracking and
 responding to very small changes in prices of large companies, the
 validity of each packet becomes critical.
 
 Other factors also start becoming critical. Normal kernels aren't
 very good about consistently treating one service as incredibly high
 priority *and evening out the delays as they handle other processes*
 to keep behavior consistent. That's why I would *never* run such
 processing on Windows: between fancy graphics, unnecessary daemons,
 and critical anti-virus software, you just don't know when things will
 be delayed. And that's one of the many reasons that FPGAs, which
 entirely sidestep the "what else is the kernel doing" problem, are
 ideal for putting on much smaller, more modular devices. And the
 devices don't need anything so powerful or complex as even a stripped,
 optimized, BSD-style kernel. (Though these can admittedly be very
 lean and very fast as OS kernels go.)



Re: Low latency High Frequency Trading

2012-11-11 Thread Florenz Kley
On 10 Nov 2012, at 00:56, Ryan McBride mcbr...@openbsd.org wrote:
 http://www.brocade.com/solutions-technology/enterprise/application-delivery/fix-financial-applications/index.page

From the product info: "Client identity may be based on a choice of
Layer 3 (IP), Layer 4 (TCP Port) and Layer 7 (FIX header SenderCompID
field) information."

ohmigod. Sounds like people who earn my trust through the uncompromising
attention to detail with which they design highly secure systems.
Important for stuff like moving money around (even if imaginary).

fl



Re: Low latency High Frequency Trading

2012-11-11 Thread Nico Kadel-Garcia
On Thu, Nov 8, 2012 at 12:58 PM, Ariel Burbaickij
ariel.burbaic...@gmail.com wrote:
 If money is not a problem -- go buy high-frequency-trading-on-the-chip
 solutions and have sub-microsecond resolution.

 http://lmgtfy.com/?q=high+frequency+trading+FPGA

Seconded as a much more viable approach.  The existing multicast
approach for such data is much like trying to hurl apple pies with F-6
jets. By the time you've packaged the original data, blown it across
the wire, re-assembled it, *and tagged and checksummed it for validity
and correct packet order*, you're rarely any faster than a normal TCP
transmission.  This doesn't matter much for streaming video, but when
you're talking about billion dollar stock prices and tracking and
responding to very small changes in prices of large companies, the
validity of each packet becomes critical.

Other factors also start becoming critical. Normal kernels aren't
very good about consistently treating one service as incredibly high
priority *and evening out the delays as they handle other processes*
to keep behavior consistent. That's why I would *never* run such
processing on Windows: between fancy graphics, unnecessary daemons,
and critical anti-virus software, you just don't know when things will
be delayed. And that's one of the many reasons that FPGAs, which
entirely sidestep the "what else is the kernel doing" problem, are
ideal for putting on much smaller, more modular devices. And the
devices don't need anything so powerful or complex as even a stripped,
optimized, BSD-style kernel. (Though these can admittedly be very
lean and very fast as OS kernels go.)



Re: Low latency High Frequency Trading

2012-11-09 Thread Diana Eichert

On Fri, 9 Nov 2012, Tomas Bodzar wrote:


On Thu, Nov 8, 2012 at 8:55 PM, Diana Eichert deich...@wrench.com wrote:

take a look at Tilera TileGX boards
(you better hire a s/w developer.)



Some company is already working on that
http://mail-index.netbsd.org/netbsd-users/2012/10/31/msg011803.html


Porting an O/S is not software development.  The Tilera stuff is
different.

FWIW, I already knew about the NBSD stuff, but chose not to post
about other O/S on an OpenBSD list.

diana

Past hissy-fits are not a predictor of future hissy-fits.
Nick Holland(06 Dec 2005)



Re: Low latency High Frequency Trading

2012-11-09 Thread Ryan McBride
My immediate reaction is don't do it, but on the other hand I've never
known people for whom 'money is not a problem' to shy away from
something because of boring concerns like security. So...


Software:

Basically, to do this correctly you need to parse all the packets
running in both directions between the two endpoints, tracking the acks
and correctly emulating the behaviour of the TCP stacks on both sides to
determine what is valid data to convert to UDP.

Things to think about:

- IP fragment reassembly
- duplicate packets
- out of order packets
- lost packets
- TCP resends
- TCP checksums
- IP checksums
- TCP sequence number validation
- etc, etc.
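Two of the items above, the IP and TCP checksums, share the same RFC 1071 one's-complement sum. A minimal sketch (for TCP you would additionally sum the pseudo-header, omitted here):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * RFC 1071 Internet checksum: 16-bit one's-complement sum of the data,
 * then complemented.  The same routine covers both the IP and TCP
 * checksum items in the list above (TCP adds a pseudo-header).
 */
uint16_t
in_cksum(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t sum = 0;

	while (len > 1) {
		sum += (uint32_t)p[0] << 8 | p[1];
		p += 2;
		len -= 2;
	}
	if (len == 1)
		sum += (uint32_t)p[0] << 8;	/* pad odd byte with zero */
	while (sum >> 16)			/* fold carries back in */
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}
```

Verifying a received packet is the same operation: summing the data together with its embedded checksum must yield zero.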

Look at pf_normalise_state_tcp() in pf_norm.c and pf_test_state_tcp() in
pf.c for a small taste of the scope of what you're considering if you
want to write this in the kernel.  Further examples for TCP reassembly
could be found in the source code for ports/net/snort or
ports/net/tcpflow.

Of course you can take some shortcuts if you assume that the data you're
getting is clean, and even more if you don't have to parse the TCP
stream but can handle each individual TCP packet as an individual
payload. Perhaps your current problematic implementation already does
this? If so, it's also probably trivial to inject bogus data into the
stream and have it accepted. Maybe that's a feature.
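The shortcut described here, handling each segment independently and forwarding only data that arrives exactly in order, reduces the TCP state machine to a single expected sequence number per flow. A userland sketch of that idea (the names and the drop-everything-else policy are illustrative, not taken from any of the code mentioned above):

```c
#include <stdint.h>

/* Wraparound-safe TCP sequence comparison (RFC 793 modular arithmetic). */
#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)

struct flow {
	uint32_t next_seq;	/* next byte we expect from the stream */
};

enum verdict { ACCEPT, DUPLICATE, OUT_OF_ORDER };

/*
 * Minimal "shortcut" flow filter: forward a segment only when it starts
 * exactly at the next expected byte, and classify everything else so
 * the caller can drop (or buffer) it.  Partial overlaps are lumped in
 * with retransmits here; a real implementation would trim them.
 */
enum verdict
classify_segment(struct flow *f, uint32_t seq, uint32_t paylen)
{
	if (SEQ_LT(seq, f->next_seq))
		return DUPLICATE;	/* retransmit of data already sent */
	if (seq != f->next_seq)
		return OUT_OF_ORDER;	/* hole in the stream */
	f->next_seq = seq + paylen;	/* unsigned wraparound is fine */
	return ACCEPT;
}
```

The SEQ_LT trick is what makes this survive sequence-number wraparound, which a naive `<` comparison would get wrong.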

Remember: lots of attacks can be performed against this hacked-up
monstrosity unless everything is exactly perfect. Good luck with the
Frankenstein code; it's not supported.


Hardware:

- NIC: something that allows you to adjust the interrupt rate, e.g. em,
  bnx. On the other hand if the packet rate is not too high a cheaper
  network card without any bells and whistles might give you better
  performance (less overhead in the interrupt handler). I'd say you'd be
  best off buying a bunch and testing them.

- CPU: maximum SINGLE CORE turbo speed. Disable the other cores,
  they're not helping you at all; in theory you want the biggest,
  fastest cache possible, but perhaps not necessary depending on how much
  software you're running.

- Fast RAM might help, but you don't need much. probably the minimum you
  can get in a board with the above CPU.

Also, remember to use the shortest patch cables possible, to reduce
signal propagation latency.
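For a sense of scale on that last point: signals propagate through both fibre and twisted-pair copper at very roughly two thirds of c, i.e. on the order of 5 ns per metre, so cable length only matters at the very bottom of the latency budget. A back-of-the-envelope helper (the 0.66 velocity factor is a typical ballpark, not a vendor figure):

```c
/*
 * Rough one-way propagation delay through a cable, in nanoseconds.
 * velocity_factor is the fraction of c the medium supports; both fibre
 * and twisted-pair copper sit somewhere around 0.6-0.7.
 */
double
propagation_ns(double metres, double velocity_factor)
{
	const double c = 299792458.0;	/* speed of light in vacuum, m/s */

	return metres / (velocity_factor * c) * 1e9;
}
```

So shaving 10 m of patch cable buys on the order of 50 ns, two orders of magnitude below the 50 us target discussed in this thread.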



On Thu, Nov 08, 2012 at 08:08:05PM +0200, Dan Shechter wrote:
 For unrelated reasons, I can't directly receive the TCP stream.
 
 I must copy the TCP data from a running stream to another server. I
 can use tap or just port-mirroring on the switch. So I can't use any
 network stack or leverage any offloading.
 
 I also need to modify the received data, and add a few application
 headers before sending it as a multicast UDP stream.
 
 Winsock is userland. What I want to do is in the kernel, even before
 ip_input. I guess it should be faster.
 
 
 On Thu, Nov 8, 2012 at 7:36 PM, Johan Beisser j...@caustic.org wrote:
  On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter dans...@gmail.com wrote:
  Hi All,
 
  current situation
  A Windows 2008 server is receiving TCP traffic from a stock exchange
  and sends it, almost as is, using UDP multicast to automated high
  frequency traders.
 
  StockExchange --TCP--- windows2008 ---MCAST-UDP
 
  On average, the time it takes to do the TCP to UDP translation, using
  Winsock, is 240 microseconds. It can even be as high as 60,000
  microseconds.
  /current situation
 
  my idea
  1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
  box with two NICs. One for the TCP, the other for the multicast UDP.
 
  You'll incur an extra penalty offloading to the kernel. Winsock is
  already doing that, though.
 
  2. Put the TCP port in a promiscuous mode.
 
  Why? You can just set up the right bits to listen to on the network,
  and pull raw frames to be processed. Or, just let the network stack
  behave as it should.
 
  3. Write my TCP-UDP logic directly into ether_input.c
 
  Any reason to not use pf for this translation?
 
  /my idea
 
  Now for the questions:
  1. Am I on the right track? or in other words how crazy is my idea?
 
  Pretty crazy. You may want to see if there's hardware accelerated or
  on NIC TCP off-load options instead.
 
  2. What would be the latency? Can I achieve 50 microseconds between
  getting the interrupt and until sending the new packet through the
  NIC?
 
  See above. You'll end up having to do some tuning.
 
  3. Which NIC/CPU/Memory should I use? Money is not a problem.
 
  Custom order a few NICs, hire a developer to write a driver to offload
  TCP/UDP on the NIC, and enable as little kernel interference as
  possible.
 
  Money's not a problem, right?



Re: Low latency High Frequency Trading

2012-11-09 Thread Ariel Burbaickij
What is the rationale behind this statement:


...
- CPU: maximum SINGLE CORE turbo speed. Disable the other cores,
  they're not helping you at all...?

/wbr
Ariel Burbaickij

On Fri, Nov 9, 2012 at 3:47 PM, Ryan McBride mcbr...@openbsd.org wrote:

 My immediate reaction is don't do it, but on the other hand I've never
 known people for whom 'money is not a problem' to shy away from
 something because of boring concerns like security. So...


 Software:

 Basically, to do this correctly you need to parse all the packets
 running in both directions between the two endpoints, tracking the acks
 and correctly emulating the behaviour of the TCP stacks on both sides to
 determine what is valid data to convert to UDP.

 Things to think about:

 - IP fragment reassembly
 - duplicate packets
 - out of order packets
 - lost packets
 - TCP resends
 - TCP checksums
 - IP checksums
 - TCP sequence number validation
 - etc, etc.

 Look at pf_normalise_state_tcp() in pf_norm.c and pf_test_state_tcp() in
 pf.c for a small taste of the scope of what you're considering if you
 want to write this in the kernel.  Further examples for TCP reassembly
 could be found in the source code for ports/net/snort or
 ports/net/tcpflow.

 Of course you can take some shortcuts if you assume that the data you're
 getting is clean, and even more if you don't have to parse the TCP
 stream but can handle each individual TCP packet as an individual
 payload. Perhaps your current problematic implementation already does
 this? If so, it's also probably trivial to inject bogus data into the
 stream and have it accepted. Maybe that's a feature.

 Remember: Lots of attacks can be performed against this hacked up
 monstrosity unless everything is exactly perfect. Good luck with the
 frankenstein code, it's not supported.


 Hardware:

 - NIC: something that allows you to adjust the interrupt rate, e.g. em,
   bnx. On the other hand if the packet rate is not too high a cheaper
   network card without any bells and whistles might give you better
   performance (less overhead in the interrupt handler). I'd say you'd be
   best off buying a bunch and testing them.

 - CPU: maximum SINGLE CORE turbo speed. Disable the other cores,
   they're not helping you at all; in theory you want the biggest,
   fastest cache possible, but perhaps not necessary depending on how much
   software you're running.

 - Fast RAM might help, but you don't need much. probably the minimum you
   can get in a board with the above CPU.

 Also, remember to use the shortest patch cables possible, to reduce
 signal propagation latency.



 On Thu, Nov 08, 2012 at 08:08:05PM +0200, Dan Shechter wrote:
  For unrelated reasons, I can't directly receive the TCP stream.
 
  I must copy the TCP data from a running stream to another server. I
  can use tap or just port-mirroring on the switch. So I can't use any
  network stack or leverage any offloading.
 
  I also need to modify the received data, and add a few application
  headers before sending it as a multicast UDP stream.
 
  Winsock is userland. What I want to do is in the kernel, even before
  ip_input. I guess it should be faster.
 
 
  On Thu, Nov 8, 2012 at 7:36 PM, Johan Beisser j...@caustic.org wrote:
   On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter dans...@gmail.com
 wrote:
   Hi All,
  
   current situation
   A Windows 2008 server is receiving TCP traffic from a stock exchange
   and sends it, almost as is, using UDP multicast to automated high
   frequency traders.
  
   StockExchange --TCP--- windows2008 ---MCAST-UDP
  
   On average, the time it takes to do the TCP to UDP translation, using
   Winsock, is 240 microseconds. It can even be as high as 60,000
   microseconds.
   /current situation
  
   my idea
   1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
   box with two NICs. One for the TCP, the other for the multicast UDP.
  
   You'll incur an extra penalty offloading to the kernel. Winsock is
   already doing that, though.
  
   2. Put the TCP port in a promiscuous mode.
  
   Why? You can just set up the right bits to listen to on the network,
   and pull raw frames to be processed. Or, just let the network stack
   behave as it should.
  
   3. Write my TCP-UDP logic directly into ether_input.c
  
   Any reason to not use pf for this translation?
  
   /my idea
  
   Now for the questions:
   1. Am I on the right track? or in other words how crazy is my idea?
  
   Pretty crazy. You may want to see if there's hardware accelerated or
   on NIC TCP off-load options instead.
  
   2. What would be the latency? Can I achieve 50 microseconds between
   getting the interrupt and until sending the new packet through the
   NIC?
  
   See above. You'll end up having to do some tuning.
  
   3. Which NIC/CPU/Memory should I use? Money is not a problem.
  
   Custom order a few NICs, hire a developer to write a driver to offload
   TCP/UDP on the NIC, and enable as little kernel interference 

Re: Low latency High Frequency Trading

2012-11-09 Thread Ryan McBride
On Fri, Nov 09, 2012 at 04:14:28PM +0100, Ariel Burbaickij wrote:
 What is the rationale behind this statement:
 
 
 ...
 - CPU: maximum SINGLE CORE turbo speed. Disable the other cores,
   they're not helping you at all...?

OpenBSD doesn't run multiprocessor inside the kernel, so SMP provides no
benefit. Even if it did, the overhead of SMP would probably be a net
loss for this particular workload.



Re: Low latency High Frequency Trading

2012-11-09 Thread Ariel Burbaickij
Ah OK, as several other architectures/OSes were thrown around in this
thread I did not immediately understand that you were talking
specifically about the OpenBSD context. Thank you for the clarification.

On Fri, Nov 9, 2012 at 4:19 PM, Ryan McBride mcbr...@openbsd.org wrote:

 On Fri, Nov 09, 2012 at 04:14:28PM +0100, Ariel Burbaickij wrote:
  What is the rationale behind this statement:
 
 
  ...
  - CPU: maximum SINGLE CORE turbo speed. Disable the other cores,
they're not helping you at all...?

 OpenBSD doesn't run multiprocessor inside the kernel, so SMP provides no
 benefit. Even if it did, the overhead of SMP would probably be a net
 loss for this particular workload.



Re: Low latency High Frequency Trading

2012-11-09 Thread Christian Weisgerber
Ryan McBride mcbr...@openbsd.org wrote:

 Also, remember to use the shortest patch cables possible, to reduce
 signal propagation latency.

More seriously, is there an appreciable latency difference between
copper and fiber PHYs?

-- 
Christian naddy Weisgerber  na...@mips.inka.de



Re: Low latency High Frequency Trading

2012-11-09 Thread Dan Shechter
Hi Ryan,

Thanks for the detailed answer.

I can make some assumptions about the TCP flow and its origins. It's
coming from the stock exchange over IPsec gateways over leased lines.
I think I can trust the origin of the flow; at least I can trust it as
much as the off-the-shelf software does.

When I was saying money is not a problem, I was referring to the cost
of the server to run this.

I know that I need to implement a state machine for the TCP session
and keep some buffers for out-of-order packets.

Do you think the right place to put the code is in ether_input.c?

It's about 1k packets per second max.

I plan to coil the patch cable to make an electrical field surrounding
my device to protect it from evil.
Best regards,
Dan


On Fri, Nov 9, 2012 at 4:47 PM, Ryan McBride mcbr...@openbsd.org wrote:
 My immediate reaction is don't do it, but on the other hand I've never
 known people for whom 'money is not a problem' to shy away from
 something because of boring concerns like security. So...


 Software:

 Basically, to do this correctly you need to parse all the packets
 running in both directions between the two endpoints, tracking the acks
 and correctly emulating the behaviour of the TCP stacks on both sides to
 determine what is valid data to convert to UDP.

 Things to think about:

 - IP fragment reassembly
 - duplicate packets
 - out of order packets
 - lost packets
 - TCP resends
 - TCP checksums
 - IP checksums
 - TCP sequence number validation
 - etc, etc.

 Look at pf_normalise_state_tcp() in pf_norm.c and pf_test_state_tcp() in
 pf.c for a small taste of the scope of what you're considering if you
 want to write this in the kernel.  Further examples for TCP reassembly
 could be found in the source code for ports/net/snort or
 ports/net/tcpflow.

 Of course you can take some shortcuts if you assume that the data you're
 getting is clean, and even more if you don't have to parse the TCP
 stream but can handle each individual TCP packet as an individual
 payload. Perhaps your current problematic implementation already does
 this? If so, it's also probably trivial to inject bogus data into the
 stream and have it accepted. Maybe that's a feature.

 Remember: Lots of attacks can be performed against this hacked up
 monstrosity unless everything is exactly perfect. Good luck with the
 frankenstein code, it's not supported.


 Hardware:

 - NIC: something that allows you to adjust the interrupt rate, e.g. em,
   bnx. On the other hand if the packet rate is not too high a cheaper
   network card without any bells and whistles might give you better
   performance (less overhead in the interrupt handler). I'd say you'd be
   best off buying a bunch and testing them.

 - CPU: maximum SINGLE CORE turbo speed. Disable the other cores,
   they're not helping you at all; in theory you want the biggest,
   fastest cache possible, but perhaps not necessary depending on how much
   software you're running.

 - Fast RAM might help, but you don't need much. probably the minimum you
   can get in a board with the above CPU.

 Also, remember to use the shortest patch cables possible, to reduce
 signal propagation latency.



 On Thu, Nov 08, 2012 at 08:08:05PM +0200, Dan Shechter wrote:
 For unrelated reasons, I can't directly receive the TCP stream.

 I must copy the TCP data from a running stream to another server. I
 can use tap or just port-mirroring on the switch. So I can't use any
 network stack or leverage any offloading.

 I also need to modify the received data, and add a few application
 headers before sending it as a multicast UDP stream.

 Winsock is userland. What I want to do is in the kernel, even before
 ip_input. I guess it should be faster.


 On Thu, Nov 8, 2012 at 7:36 PM, Johan Beisser j...@caustic.org wrote:
  On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter dans...@gmail.com wrote:
  Hi All,
 
  current situation
  A Windows 2008 server is receiving TCP traffic from a stock exchange
  and sends it, almost as is, using UDP multicast to automated high
  frequency traders.
 
  StockExchange --TCP--- windows2008 ---MCAST-UDP
 
  On average, the time it takes to do the TCP to UDP translation, using
  Winsock, is 240 microseconds. It can even be as high as 60,000
  microseconds.
  /current situation
 
  my idea
  1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
  box with two NICs. One for the TCP, the other for the multicast UDP.
 
  You'll incur an extra penalty offloading to the kernel. Winsock is
  already doing that, though.
 
  2. Put the TCP port in a promiscuous mode.
 
  Why? You can just set up the right bits to listen to on the network,
  and pull raw frames to be processed. Or, just let the network stack
  behave as it should.
 
  3. Write my TCP-UDP logic directly into ether_input.c
 
  Any reason to not use pf for this translation?
 
  /my idea
 
  Now for the questions:
  1. Am I on the right track? or in other words how crazy is my idea?
 
  Pretty crazy. You 

Re: Low latency High Frequency Trading

2012-11-09 Thread Ryan McBride
On Fri, Nov 09, 2012 at 06:27:06PM +0200, Dan Shechter wrote:
 I can make some assumptions about the TCP flow and its origins. It's
 coming from the stock exchange over IPsec gateways over leased lines.
 I think I can trust the origin of the flow; at least I can trust it as
 much as the off-the-shelf software does.

If something goes wrong with the off-the-shelf software, you can blame
the vendor. Your own hand-rolled solution, not so much...


 When I was saying money is not a problem, I was referring to the cost
 of the server to run this.

I know, you said that already. But I've worked in this industry also and
I am well aware of the reality distortion that occurs when gambling with
billions of dollars of other people's imaginary money.


 I know that I need to implement a state machine for the TCP session
 and keep some buffers for out-of-order packets.
 
 Do you think the right place to place the code is in ether_input.c?

I guess you mean sys/net/if_ethersubr.c, or the ether_input()
function inside that file, but either way the bulk of your code should
go in a separate file if you don't want a maintenance nightmare.


 It's about 1k packets per second max.

This should be doable on all but the most ancient hardware. But you will
need to consider how you want to handle bursts or anomalies: is it more
important to never lose a packet, or is it acceptable to lose some
number of packets in order to keep latency low?


In the former case you need to rely on buffers; you will want to move
packets off the interface receive ring and into another buffer like the
IP input queue as quickly as possible, then do your magic in software
interrupt context. You'll also want to look at disabling the MCLGETI
functionality in the network card driver, and possibly increase the
receive ring size on the card.

Here you probably want to hook your code at the very beginning
of ipv4_input().


In the latter case, you'll want to reduce the receive ring size on the
interface and handle the bulk of your processing in the hardware
interrupt. Fragment reassembly, handling TCP resends and out-of-order
packets, etc. may no longer be useful (since it requires buffering
packets), and you may opt to simply drop data that doesn't arrive in
correct sequence.

This is when ether_input() would be the right place to hook your code.
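Whichever hook point is chosen, the first thing the hook has to do is decide cheaply whether a raw frame belongs to the relayed flow at all. A userland sketch of that early match (it assumes a fixed 14-byte Ethernet header with no VLAN tag, and the destination port 12345 is just a placeholder):

```c
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IPV4	0x0800
#define PROTO_TCP	6

/*
 * Decide, as early as possible, whether a raw frame is part of the one
 * TCP flow we relay (here: any TCP segment to port 12345).  Returns
 * the offset of the TCP payload within the frame, or -1 if the frame
 * is not ours or is malformed.
 */
int
fastpath_match(const uint8_t *frame, size_t len)
{
	size_t ip_off = 14;		/* fixed Ethernet header, no VLAN */
	size_t ihl, tcp_off, thl;
	uint16_t ethertype, dport;

	if (len < ip_off + 20)
		return -1;
	ethertype = (uint16_t)frame[12] << 8 | frame[13];
	if (ethertype != ETHERTYPE_IPV4)
		return -1;
	if ((frame[ip_off] >> 4) != 4)	/* IPv4 only */
		return -1;
	ihl = (frame[ip_off] & 0x0F) * 4;
	if (ihl < 20)			/* bogus header length */
		return -1;
	tcp_off = ip_off + ihl;
	if (frame[ip_off + 9] != PROTO_TCP || len < tcp_off + 20)
		return -1;
	dport = (uint16_t)frame[tcp_off + 2] << 8 | frame[tcp_off + 3];
	if (dport != 12345)
		return -1;
	thl = (frame[tcp_off + 12] >> 4) * 4;
	if (len < tcp_off + thl)
		return -1;
	return (int)(tcp_off + thl);	/* payload starts here */
}
```

On a match it returns the TCP payload offset, so the relay can copy payload bytes straight into its outgoing UDP packet without touching the rest of the stack.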



 I plan to coil the patch cable to make an electrical field surrounding
 my device to protect it from evil.

Think about how you might be able to use ACLs on a high-end switch to
guarantee that the packets you recieve fit a certain profile (for
example, ensure that all packets are IPv4 TCP port 12345 between hostA and
hostB), to help shrink your code path.

Similarly, it may be possible to configure the device that's handling
the IPsec tunnel to do IP and TCP reassembly for you (if not, can you
replace it with one that does?), in which case your code could be made
MUCH simpler.


You didn't mention the protocol you're handling, but solutions like the
following may be helpful (or you might be able to implement the whole
thing there, and avoid supporting a frankenkernel). 

http://www.brocade.com/solutions-technology/enterprise/application-delivery/fix-financial-applications/index.page

It's optimized for cloud service delivery, so it's at least 9000x as
awesome as OpenBSD.



Re: Low latency High Frequency Trading

2012-11-08 Thread Johan Beisser
On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter dans...@gmail.com wrote:
 Hi All,

 current situation
 A Windows 2008 server is receiving TCP traffic from a stock exchange
 and sends it, almost as is, using UDP multicast to automated high
 frequency traders.

 StockExchange --TCP--- windows2008 ---MCAST-UDP

 On average, the time it takes to do the TCP to UDP translation, using
 Winsock, is 240 microseconds. It can even be as high as 60,000
 microseconds.
 /current situation

 my idea
 1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
 box with two NICs. One for the TCP, the other for the multicast UDP.

You'll incur an extra penalty offloading to the kernel. Winsock is
already doing that, though.

 2. Put the TCP port in a promiscuous mode.

Why? You can just set up the right bits to listen to on the network,
and pull raw frames to be processed. Or, just let the network stack
behave as it should.

 3. Write my TCP-UDP logic directly into ether_input.c

Any reason to not use pf for this translation?

 /my idea

 Now for the questions:
 1. Am I on the right track? or in other words how crazy is my idea?

Pretty crazy. You may want to see if there's hardware accelerated or
on NIC TCP off-load options instead.

 2. What would be the latency? Can I achieve 50 microseconds between
 getting the interrupt and until sending the new packet through the
 NIC?

See above. You'll end up having to do some tuning.

 3. Which NIC/CPU/Memory should I use? Money is not a problem.

Custom order a few NICs, hire a developer to write a driver to offload
TCP/UDP on the NIC, and enable as little kernel interference as
possible.

Money's not a problem, right?



Re: Low latency High Frequency Trading

2012-11-08 Thread Ariel Burbaickij
If money is not a problem -- go buy high-frequency-trading-on-the-chip
solutions and have sub-microsecond resolution.

http://lmgtfy.com/?q=high+frequency+trading+FPGA

On Thu, Nov 8, 2012 at 6:36 PM, Johan Beisser j...@caustic.org wrote:

 On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter dans...@gmail.com wrote:
  Hi All,
 
  current situation
  A Windows 2008 server is receiving TCP traffic from a stock exchange
  and sends it, almost as is, using UDP multicast to automated high
  frequency traders.
 
  StockExchange --TCP--- windows2008 ---MCAST-UDP
 
  On average, the time it takes to do the TCP to UDP translation, using
  Winsock, is 240 microseconds. It can even be as high as 60,000
  microseconds.
  /current situation
 
  my idea
  1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
  box with two NICs. One for the TCP, the other for the multicast UDP.

 You'll incur an extra penalty offloading to the kernel. Winsock is
 already doing that, though.

  2. Put the TCP port in a promiscuous mode.

 Why? You can just set up the right bits to listen to on the network,
 and pull raw frames to be processed. Or, just let the network stack
 behave as it should.

  3. Write my TCP-UDP logic directly into ether_input.c

 Any reason to not use pf for this translation?

  /my idea
 
  Now for the questions:
  1. Am I on the right track? or in other words how crazy is my idea?

 Pretty crazy. You may want to see if there's hardware accelerated or
 on NIC TCP off-load options instead.

  2. What would be the latency? Can I achieve 50 microseconds between
  getting the interrupt and until sending the new packet through the
  NIC?

 See above. You'll end up having to do some tuning.

  3. Which NIC/CPU/Memory should I use? Money is not a problem.

 Custom order a few NICs, hire a developer to write a driver to offload
 TCP/UDP on the NIC, and enable as little kernel interference as
 possible.

 Money's not a problem, right?



Re: Low latency High Frequency Trading

2012-11-08 Thread Johan Beisser
On Thu, Nov 8, 2012 at 9:58 AM, Ariel Burbaickij
ariel.burbaic...@gmail.com wrote:
 If money is not a problem -- go buy high-frequency-trading-on-the-chip
 solutions and have sub-microsecond resolution.

 http://lmgtfy.com/?q=high+frequency+trading+FPGA

I'd love to see PF offloading on to something like that. Not that I
can justify the expense for my work, but it'd be useful.



Re: Low latency High Frequency Trading

2012-11-08 Thread Ariel Burbaickij
I know you have the impression I am getting caustic :-) but these
ideas are pretty obvious once you are in "money is not a problem"
territory, so:

http://en.wikipedia.org/wiki/Netronome

IXPs on steroids.


On Thu, Nov 8, 2012 at 7:01 PM, Johan Beisser j...@caustic.org wrote:

 On Thu, Nov 8, 2012 at 9:58 AM, Ariel Burbaickij
 ariel.burbaic...@gmail.com wrote:
  If money is not a problem -- go buy high-frequency-trading-on-the-chip
  solutions and have sub-microsecond resolution.
 
  http://lmgtfy.com/?q=high+frequency+trading+FPGA

 I'd love to see PF offloading on to something like that. Not that I
 can justify the expense for my work, but it'd be useful.



Re: Low latency High Frequency Trading

2012-11-08 Thread Dan Shechter
When I was saying money is not a problem, it was related to server
component costs... :)
Best regards,
Dan


On Thu, Nov 8, 2012 at 8:07 PM, Ariel Burbaickij
ariel.burbaic...@gmail.com wrote:
 I know you have the impression I am getting caustic :-) but these
 ideas are pretty obvious once you are in "money is not a problem"
 territory, so:

 http://en.wikipedia.org/wiki/Netronome

 IXPs on steroids.



 On Thu, Nov 8, 2012 at 7:01 PM, Johan Beisser j...@caustic.org wrote:

 On Thu, Nov 8, 2012 at 9:58 AM, Ariel Burbaickij
 ariel.burbaic...@gmail.com wrote:
  If money is not a problem -- go buy high-frequency-trading-on-the-chip
  solutions and have sub-microsecond resolution.
 
  http://lmgtfy.com/?q=high+frequency+trading+FPGA

 I'd love to see PF offloading on to something like that. Not that I
 can justify the expense for my work, but it'd be useful.



Re: Low latency High Frequency Trading

2012-11-08 Thread Dan Shechter
For unrelated reasons, I can't directly receive the TCP stream.

I must copy the TCP data from a running stream to another server. I
can use tap or just port-mirroring on the switch. So I can't use any
network stack or leverage any offloading.

I also need to modify the received data, and add a few application
headers before sending it as a multicast UDP stream.

Winsock is userland. What I want to do is in the kernel, even before
ip_input. I guess it should be faster.

I am looking at netFPGA too, but prefer to do this in software.





Best regards,
Dan


On Thu, Nov 8, 2012 at 7:36 PM, Johan Beisser j...@caustic.org wrote:
 On Thu, Nov 8, 2012 at 4:12 AM, Dan Shechter dans...@gmail.com wrote:
 Hi All,

 current situation
 A Windows 2008 server is receiving TCP traffic from a stock exchange
 and sends it, almost as is, using UDP multicast to automated high
 frequency traders.

 StockExchange --TCP--- windows2008 ---MCAST-UDP

 On average, the time it takes to do the TCP-to-UDP translation using
 Winsock is 240 microseconds, and it can spike as high as 60,000
 microseconds.
 /current situation

 my idea
 1. Use port mirroring to get the TCP data sent to a dedicated OpenBSD
 box with two NICs. One for the TCP, the other for the multicast UDP.

 You'll incur an extra penalty offloading to the kernel. Winsock is
 already doing that, though.

 2. Put the TCP port in a promiscuous mode.

 Why? You can just set up the right bits to listen to on the network,
 and pull raw frames to be processed. Or, just let the network stack
 behave as it should.

 3. Write my TCP-UDP logic directly into ether_input.c

 Any reason to not use pf for this translation?

 /my idea

 Now for the questions:
 1. Am I on the right track? or in other words how crazy is my idea?

 Pretty crazy. You may want to see if there are hardware-accelerated or
 on-NIC TCP offload options instead.

 2. What would be the latency? Can I achieve 50 microseconds between
 getting the interrupt and until sending the new packet through the
 NIC?

 See above. You'll end up having to do some tuning.

 3. Which NIC/CPU/Memory should I use? Money is not a problem.

 Custom order a few NICs, hire a developer to write a driver to offload
 TCP/UDP on the NIC, and enable as little kernel interference as
 possible.

 Money's not a problem, right?



Re: Low latency High Frequency Trading

2012-11-08 Thread Ariel Burbaickij
They are all available with a PCI Express interface, no worries, so you
will be able to plug them straight into your server.
Alternatively, how about going for the second option and making a living
in this business :-) ?

On Thu, Nov 8, 2012 at 7:09 PM, Dan Shechter dans...@gmail.com wrote:

 When I was saying money is not a problem, it was related to server
 component costs... :)
 Best regards,
 Dan


 On Thu, Nov 8, 2012 at 8:07 PM, Ariel Burbaickij
 ariel.burbaic...@gmail.com wrote:
  I know you have the impression I am getting caustic :-) but these
  ideas are pretty obvious once money is not a problem, so:
 
  http://en.wikipedia.org/wiki/Netronome
 
  IXPs on steroids.
 
 
 
  On Thu, Nov 8, 2012 at 7:01 PM, Johan Beisser j...@caustic.org wrote:
 
  On Thu, Nov 8, 2012 at 9:58 AM, Ariel Burbaickij
  ariel.burbaic...@gmail.com wrote:
   If money is not a problem -- go buy high-trading on the chip solutions
   and
   have sub-microsecond resolution.
  
   http://lmgtfy.com/?q=high+frequency+trading+FPGA
 
  I'd love to see PF offloading on to something like that. Not that I
  can justify the expense for my work, but it'd be useful.



Re: Low latency High Frequency Trading

2012-11-08 Thread Diana Eichert

take a look at Tilera TileGX boards
(you better hire a s/w developer.)



Re: Low latency High Frequency Trading

2012-11-08 Thread William Ahern
On Thu, Nov 08, 2012 at 08:08:05PM +0200, Dan Shechter wrote:
 For unrelated reasons, I can't directly receive the TCP stream.
 
 I must copy the TCP data from a running stream to another server. I
 can use a network tap or just port mirroring on the switch, so I can't
 use the normal network stack or leverage any offloading.
 
 I also need to modify the received data and add a few application
 headers before sending it as a multicast UDP stream.
 
 Winsock is userland. What I want to do is in the kernel, even before
 ip_input. I guess it should be faster.
 
 I am looking at netFPGA too, but prefer to do this in software.
 

You might want to try this:

http://info.iet.unipi.it/~luigi/netmap/

It's FreeBSD and Linux only, though.

The emerging solution for high performance traffic routers like this is to
have one or more threads loop in userspace over a memory mapped NIC buffer.
Most of these interfaces are highly proprietary. Netmap provides the
relative programmatic simplicity of a TAP-type interface with the zero-copy
performance of the mapped buffering.



Re: Low latency High Frequency Trading

2012-11-08 Thread Tomas Bodzar
On Thu, Nov 8, 2012 at 8:55 PM, Diana Eichert deich...@wrench.com wrote:
 take a look at Tilera TileGX boards
 (you better hire a s/w developer.)


Some company is already working on that
http://mail-index.netbsd.org/netbsd-users/2012/10/31/msg011803.html