Thanks Rajiv, looking forward to your prototype. Guozhang
On Mon, Jan 26, 2015 at 2:30 PM, Rajiv Kurian <ra...@signalfuse.com> wrote: > Hi Guozhang, > > I am a bit busy at work. When I get the change I'll definitely try to get a > proof of concept going. Not the kafka protocol, but just the buffering and > threading structures, maybe just write to another socket. I think it would > be useful just to get the queueing and buffer management going and prove > that it can be done in a zero copy way in a multi producer single consumer > environment. If that is working, then the single consumer can be the kafka > network sync thread. > > Thanks, > Rajiv > > On Fri, Jan 16, 2015 at 11:58 AM, Guozhang Wang <wangg...@gmail.com> > wrote: > > > Hi Rajiv, > > > > Thanks for this proposal, it would be great if you can upload some > > implementation patch for the CAS idea and show some memory usage / perf > > differences. > > > > Guozhang > > > > On Sun, Dec 14, 2014 at 9:27 PM, Rajiv Kurian <ra...@signalfuse.com> > > wrote: > > > > > Resuscitating this thread. I've done some more experiments and > profiling. > > > My messages are very tiny (currently 25 bytes) per message and creating > > > multiple objects per message leads to a lot of churn. The memory churn > > > through creation of convenience objects is more than the memory being > > used > > > by my objects right now. I could probably batch my messages further, to > > > make this effect less pronounced. I did some rather unscientific > > > experiments with a flyweight approach on top of the ByteBuffer for a > > simple > > > messaging API (peer to peer NIO based so not a real comparison) and the > > > numbers were very satisfactory and there is no garbage created in > steady > > > state at all. Though I don't expect such good numbers from actually > going > > > through the broker + all the other extra stuff that a real producer > would > > > do, I think there is great potential here. > > > > > > The general mechanism for me is this: > > > i) A buffer (I used Unsafe but I imagine ByteBuffer having similar > > > performance) is created per partition. > > > ii) A CAS loop (in Java 7 and less) or even better > unsafe.getAndAddInt() > > in > > > Java 8 can be used to claim a chunk of bytes on the per topic buffer. > > This > > > code can be invoked from multiple threads in a wait free manner > > (wait-free > > > in Java 8, since getAndAddInt() is wait-free). Once a region in the > > buffer > > > is claimed, it can be operated on using the flyweight method that we > > talked > > > about. If the buffer doesn't have enough space then we can drop the > > message > > > or move onto a new buffer. Further this creates absolutely zero objects > > in > > > steady state (only a few objects created in the beginning). Even if the > > > flyweight method is not desired, the API can just take byte arrays or > > > objects that need to be serialized and copy them onto the per topic > > buffers > > > in a similar way. This API has been validated in Aeron too, so I am > > pretty > > > confident that it will work well. For the zero copy technique here is a > > > link to Aeron API with zero copy - > > > https://github.com/real-logic/Aeron/issues/18. The regular one copies > > byte > > > arrays but without any object creation. > > > iii) The producer send thread can then just go in FIFO order through > the > > > buffer sending messages that have been committed using NIO to rotate > > > between brokers. We might need a background thread to zero out used > > buffers > > > too. > > > > > > I've left out some details, but again none of this very revolutionary - > > > it's mostly the same techniques used in Aeron. I really think that we > can > > > keep the API ga rbage free and wait-free (even in the multi producer > > case) > > > without compromising how pretty it looks - the total zero copy API will > > low > > > level, but it should only be used by advanced users. Moreover the usual > > > producer.send(msg, topic, partition) can use the efficient ByteBuffer > > > offset API internally without it itself creating any garbage. With the > > > technique I talked about there is no need for an intermediate queue of > > any > > > kind since the underlying ByteBuffer per partition acts as the queue. > > > > > > I can do more experiments with some real producer code instead of my > toy > > > code to further validate the idea, but I am pretty sure that both > > > throughput and jitter characteristics will improve thanks to lower > > > contention (wait-free in java 8 with a single getAndAddInt() operation > > for > > > sync ) and better cache locality (C like buffers and a few constant > > number > > > of objects per partition). If you guys are interested, I'd love to talk > > > more. Again just to reiterate, I don't think the API will suffer at > all - > > > most of this can be done under the covers. Additionally it will open up > > > things so that a low level zero copy API is possible. > > > > > > Thanks, > > > Rajiv > > > > > > > > > > > -- > > -- Guozhang > > > -- -- Guozhang