Re: Manual memory management for concurrent code

2018-01-30 Thread Nikolay Tsankov
Is this "bounded staleness" the same as lazySet/putOrdered?
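For readers unfamiliar with the lazySet/putOrdered semantics being asked about, here is a minimal sketch. The counter and value are illustrative; the point is the ordering guarantee of the store.

```java
import java.util.concurrent.atomic.AtomicLong;

public class LazySetDemo {
    // A single-writer sequence counter, the classic use case for lazySet.
    private static final AtomicLong sequence = new AtomicLong(0);

    public static void main(String[] args) {
        // lazySet performs an ordered store: earlier writes by this thread
        // cannot be reordered past it, but unlike set() it does not emit a
        // full StoreLoad fence, so the writer does not stall.
        sequence.lazySet(42);

        // Readers may observe the new value slightly later than with a
        // volatile set(), but never out of order with the writes that
        // preceded it.
        System.out.println(sequence.get());
    }
}
```

This is a per-store, writer-side optimization; whether it is the same mechanism as the paper's "bounded staleness" consensus is exactly the question posed above.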

On Tue, Jan 30, 2018 at 1:31 PM, Wojciech Kudla 
wrote:

> There's very interesting progress in that space happening lately. Some of
> that is being applied to the Linux kernel as new RCU implementation. Looks
> very promising. It's based on fast consensus using bounded staleness.
>
> Have a look here:
> https://lwn.net/Articles/745116/
> And the paper:
> http://ipads.se.sjtu.edu.cn/lib/exe/fetch.php?media=publications:consensus-tpds16.pdf
>
> On Sun, 28 Jan 2018, 19:35 Chris Vest,  wrote:
>
>> Tom Hart's thesis looks like a very comprehensive answer.
>>
>> Thanks.
>>
>> > On 28 Jan 2018, at 19.05, Duarte Nunes 
>> wrote:
>> >
>> > There's also Quiescent-State-Based Reclamation, which emerged in the
>> context of RCU. Tom Hart’s 2005 thesis[1] provides a pretty comprehensive
>> overview of these memory reclamation strategies.
>> >
>> > Another approach to consider would be a sharded design a la Seastar[2],
>> or some other approach leveraging the single writer principle (i.e.,
>> peer-to-peer communication based on SPSC queues) to decrease
>> synchronization overhead.
>> >
>> > [1] http://www.cs.toronto.edu/~tomhart/papers/tomhart_thesis.pdf
>> > [2] https://github.com/scylladb/seastar
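The single-writer, SPSC-queue design mentioned above can be sketched as a minimal lock-free ring buffer. This is illustrative only (real implementations such as Seastar's or JCTools' also pad against false sharing and cache the opposing index):

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer ring buffer: each index has
// exactly one writer thread, so no CAS loops are needed, only ordered stores.
public class SpscQueue<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(0); // next slot to read
    private final AtomicLong tail = new AtomicLong(0); // next slot to write

    public SpscQueue(int capacityPow2) {
        buffer = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    // Called only by the producer thread.
    public boolean offer(E e) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false; // full
        buffer[(int) (t & mask)] = e;
        tail.lazySet(t + 1); // publish the element with an ordered store
        return true;
    }

    // Called only by the consumer thread.
    @SuppressWarnings("unchecked")
    public E poll() {
        long h = head.get();
        if (h == tail.get()) return null; // empty
        int idx = (int) (h & mask);
        E e = (E) buffer[idx];
        buffer[idx] = null; // let the element be reclaimed
        head.lazySet(h + 1);
        return e;
    }

    public static void main(String[] args) {
        SpscQueue<Integer> q = new SpscQueue<>(8);
        for (int i = 0; i < 5; i++) q.offer(i);
        int sum = 0;
        for (Integer v; (v = q.poll()) != null; ) sum += v;
        System.out.println(sum);
    }
}
```

The synchronization cost per operation is one ordered store and one plain load, which is why peer-to-peer SPSC designs scale so well.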
>> >
>> > On Sun, Jan 28, 2018 at 5:45 PM Chris Vest 
>> wrote:
>> > Hi,
>> >
>> > I know of two ways to do manual memory management in multi-threading
>> code: hazard pointers and epochs.
>> >
>> > Which one is generally considered the higher performance option?
>> > Are there any other options that should be considered as well?
>> > Are there any spicy trade-offs one should make sure to factor into the
>> decision of which one to go with?
>> >
>> > Thanks,
>> > Chris.
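For context on the hazard-pointer half of the question, the protocol shape can be sketched as a toy with one reader and one hazard slot. This is purely pedagogical (in Java the GC makes it unnecessary, and real implementations keep one slot per thread and scan all slots before freeing):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

public class HazardPointerSketch {
    static final AtomicReference<Object> hazard = new AtomicReference<>();
    static final List<Object> retired = new ArrayList<>();

    // Reader: publish the pointer before dereferencing it, then re-check
    // that the source still holds it, in case a writer swapped it meanwhile.
    static Object protect(AtomicReference<Object> src) {
        Object o;
        do {
            o = src.get();
            hazard.set(o);          // announce
        } while (o != src.get());   // validate after announcing
        return o;
    }

    // Writer: never free directly, only retire for deferred reclamation.
    static void retire(Object o) { retired.add(o); }

    // Reclaim every retired node not currently announced as hazardous.
    static int scan() {
        int freed = 0;
        for (Iterator<Object> it = retired.iterator(); it.hasNext(); ) {
            if (it.next() != hazard.get()) { it.remove(); freed++; }
        }
        return freed;
    }

    public static void main(String[] args) {
        AtomicReference<Object> shared = new AtomicReference<>(new Object());
        Object o = protect(shared);
        shared.set(new Object());   // writer replaces the node
        retire(o);
        System.out.println(scan()); // still protected: nothing freed
        hazard.set(null);           // reader finished with the node
        System.out.println(scan()); // now reclaimable
    }
}
```

The trade-off versus epochs is visible even here: hazard pointers pay a store plus re-validation on every read, while epoch schemes amortize that cost but can delay reclamation arbitrarily if a thread stalls inside a critical region.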
>> >
>> > --
>> > You received this message because you are subscribed to the Google
>> Groups "mechanical-sympathy" group.
>> > To unsubscribe from this group and stop receiving emails from it, send
>> an email to mechanical-sympathy+unsubscr...@googlegroups.com.
>> > For more options, visit https://groups.google.com/d/optout.
>> >
>>
>>
>



Re: Master thesis in mechanical sympathy, Java performance.

2017-11-16 Thread Nikolay Tsankov
Recently watched an '85 lecture by Richard Feynman, where he explains the
computer as a filing system, where you have a clerk that picks a card,
reads it, does some calculations, maybe writes something on the card and
puts it back. So it is this back and forth motion, back and forth,
something like a wave, with the wavelength being the distance you have to
travel to the cabinet with the next card. Obviously then, if you decrease
the length, the frequency increases, so you process more cards in the same
time. If you have someone load a card box from the basement and deliver it
to the clerk's room, this would speed things up, as the clerk then has to
travel less - prefetchers. Very good analogy in my opinion, naturally
explains a lot of the hardware parts in a computer and their effect on the
speed of processing. So if I were writing a thesis, I would try to make an
analogy to some well understood physical or everyday processes and the data
processing done by a computer.
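The "distance to the cabinet" effect is easy to demonstrate in code. A minimal sketch (the array size is arbitrary): both loops below do identical work, but one walks memory in storage order and the other jumps a full row per access, defeating cache lines and the prefetcher.

```java
public class LocalityDemo {
    public static void main(String[] args) {
        int n = 1024;
        int[][] m = new int[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                m[i][j] = 1;

        // Row-major walk: cards are fetched in the order they sit in the
        // cabinet, so the hardware prefetcher can stream them in.
        long rowSum = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                rowSum += m[i][j];

        // Column-major walk: same arithmetic, but each access lands in a
        // different row's array, so locality and prefetching are lost.
        long colSum = 0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                colSum += m[i][j];

        System.out.println(rowSum == colSum); // same answer, different speed
    }
}
```

Timing the two loops (with proper JMH-style warmup, since a naive benchmark would mislead here) would make a nice opening experiment for such a thesis.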

On Thu, Nov 16, 2017 at 7:14 PM, ben.cot...@alumni.rutgers.edu <
ben.cot...@alumni.rutgers.edu> wrote:

> If appropriate (for both this group's and your Thesis' ambitions),
> consider a Thesis title that is respectfully dramatic (but that, foremost,
> honors your PhD advisors' expectations) and that, furthermore, abstracts
> the concepts presented in this (very excellent) forum to generic problem
> solving.  Agree?
>
> here's how I might title such a Thesis: "Musings on the Data Locality,
> Latency, and Caching Problem: A Mechanical Sympathy"
>
> From there? Identify categories of both Operator and Operand, and the
> *cost* of getting these things as close as possible to one another.
>
> good luck!
>
>
>
> On 11/16/2017 11:42 AM, John Hening wrote:
>
> Hi,
>
> I know that there are a lot of experts in Java oriented toward "mechanical
> sympathy" here. I am very interested in that subject; however, I am a
> beginner, though not clueless about it.
> I'm a bit familiar with processor architecture, lock-free and garbage-free
> techniques, and so on. My question is:
> Does anyone have an idea for a master's thesis in that area? I'm graduating
> from my university and would like to write a thesis on a subject that
> interests me.
>
> If someone has an idea, feel free to suggest something, even just a general
> idea. If you consider this post inadequate, feel free to give me a sign
> as well.
>
>
>
>
>



Re: Conversion from String to OutputStream without heap allocation

2017-07-19 Thread Nikolay Tsankov
I think Martin explained the reasoning quite well in his talk "High
Performance Managed Languages".


On Wed, Jul 19, 2017 at 10:28 AM, Avi Kivity  wrote:

> Out of curiosity, if you're doing heavy duty processing, why did you
> choose a garbage-collected language?
>
> On 07/19/2017 04:21 AM, David Ryan wrote:
>
>
> Hi all,
>
> My first post here. Not sure it's the best place for it, but hoping
> someone here might be able to assist. We're developers of a streaming
> application that does 100k+ messages per second of processing, so anything
> that allocates on the heap can cause GC pressure. We've been targeting
> removing allocations recently, and one of the difficult ones is the
> conversion from String to UTF-8 to OutputStream. The advised methods for
> String to UTF-8 all create byte[] as intermediary objects.
>
> We've been able to read Strings from streams using a ThreadLocal
> ByteBuffer and CharBuffer, which allocates only the char[] and the String
> object. For reference, we've done the following. In a one-minute JMC Flight
> Recorder session, char[] and String make up about 1 GiB of allocations,
> which is unavoidable because we're processing a lot of string data:
>
> public static final class StringBuffers
> {
> ByteBuffer buffer = ByteBuffer.allocate(512);
> CharBuffer charBuffer = CharBuffer.allocate(300);
> CharsetDecoder decoder = Charset.forName("UTF8").newDecoder();
> }
>
> public static final class U8Utf8MethodHandleReader extends
> AbstractReader
> {
> private final ThreadLocal<StringBuffers> buffers = new
> ThreadLocal<StringBuffers>()
> {
> @Override
> public StringBuffers initialValue()
> {
> return new StringBuffers();
> }
> };
>
> public U8Utf8MethodHandleReader(final MethodHandle setHandle)
> {
> super(setHandle);
> }
>
> @Override
> public void read(final Object o, final TypeInputStream in) throws
> Throwable
> {
> final int len = in.read();
>
> // Grab a thread local set of buffers to use temporarily.
> final StringBuffers buf = buffers.get();
>
> // get a reference to the buffers.
> final ByteBuffer b = buf.buffer;
> final CharBuffer c = buf.charBuffer;
>
> b.clear();
> c.clear();
>
> // read the stream into the byte buffer.
> in.getStream().read(b.array(), 0, len);
> b.limit(len);
>
> // decode the bytes into the char buffer.
> final CharsetDecoder decoder = buf.decoder;
> decoder.reset();
> decoder.decode(b, c, true);
>
> // flip the char buffer.
> c.flip();
>
> // get a copy of
> final String str = c.toString();
>
> // finally set the string value via method handle.
> setHandle.invoke(o, str);
> }
> }
>
> For writing Strings we've tried a similar method:
>
> public static final class StringBuffers
> {
> ByteBuffer buffer = ByteBuffer.allocate(512);
> CharsetEncoder encoder = Charset.forName("UTF8").newEncoder();
> }
>
> public static final class U8Utf8MethodHandleWriter extends
> AbstractWriter
> {
> private final ThreadLocal<StringBuffers> buffers = new
> ThreadLocal<StringBuffers>()
> {
> @Override
> public StringBuffers initialValue()
> {
> return new StringBuffers();
> }
> };
>
> public U8Utf8MethodHandleWriter(final MethodHandle getHandle)
> {
> super(getHandle);
> }
>
> @Override
> public void write(final Object o, final TypeOutputStream out)
> throws Throwable
> {
> // finally set the string value.
> final String str = (String) getHandle.invoke(o);
>
> final OutputStream os = out.getStream();
>
> // empty strings just write 0 for length.
> if (str == null)
> {
> os.write(0);
> return;
> }
>
> // Grab a thread local set of buffers to use temporarily.
> final StringBuffers buf = buffers.get();
>
> // get a reference to the buffers.
> final ByteBuffer b = buf.buffer;
>
> // this does allocate an object, but at least it isn't copying
> the buffer!
> final CharBuffer c = CharBuffer.wrap(str);
>
> // clear the byte buffer.
> b.clear();
>
> // decode the bytes into the char buffer.
> final CharsetEncoder encoder = buf.encoder;
> encoder.reset();
> encoder.encode(c, b, true);
>
> // flip the char buffer.
> b.flip();
>
> final int size = b.limit();
>
>
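One way to avoid the CharBuffer.wrap allocation flagged in the comment above is to copy the String's chars into a reused thread-local CharBuffer via String.getChars. A sketch under the same assumptions as the code above (buffer sizes are illustrative, and strings longer than the buffer would need a loop over encode calls):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReusedEncodeSketch {
    // One reusable set of buffers, as in the StringBuffers class above
    // (in the real code these would live in a ThreadLocal).
    static final CharBuffer chars = CharBuffer.allocate(512);
    static final ByteBuffer bytes = ByteBuffer.allocate(1024);
    static final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();

    static ByteBuffer encode(String str) {
        chars.clear();
        // Copy into the reused CharBuffer instead of CharBuffer.wrap(str),
        // so no wrapper object is allocated per call.
        str.getChars(0, str.length(), chars.array(), 0);
        chars.limit(str.length());

        bytes.clear();
        encoder.reset();
        encoder.encode(chars, bytes, true);
        encoder.flush(bytes);
        bytes.flip();
        return bytes;
    }

    public static void main(String[] args) {
        ByteBuffer b = encode("h\u00e9llo");
        byte[] out = new byte[b.remaining()];
        b.get(out);
        System.out.println(Arrays.equals(
                out, "h\u00e9llo".getBytes(StandardCharsets.UTF_8)));
    }
}
```

String.getChars does an array copy internally, so this trades one small copy for the per-call CharBuffer allocation; whether that wins depends on string length and allocation pressure, so it is worth measuring.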

Re: Why would SocketChannel be slower when sending a single msg instead of 1k msgs after proper warmup?

2017-04-20 Thread Nikolay Tsankov
Hi,

I was talking about your server spin-waiting. Not sure it applies in your
case, but from http://x86.renejeschke.de/html/file_module_x86_id_232.html

When executing a "spin-wait loop," a Pentium 4 or Intel Xeon processor
> suffers a severe performance penalty when exiting the loop because it
> detects a possible memory order violation. The PAUSE instruction provides a
> hint to the processor that the code sequence is a spin-wait loop. The
> processor uses this hint to avoid the memory order violation in most
> situations, which greatly improves processor performance. For this reason,
> it is recommended that a PAUSE instruction be placed in all spin-wait loops.
>

On second thought, this is probably far less impactful than the latency
spike you observe.
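For what it's worth, Java 9 later exposed exactly this hint as Thread.onSpinWait(), which HotSpot typically compiles to PAUSE on x86. A minimal sketch of a spin-wait loop using it (the sleep duration is arbitrary):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class SpinWaitDemo {
    static final AtomicBoolean ready = new AtomicBoolean(false);

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            try { Thread.sleep(10); } catch (InterruptedException ignored) {}
            ready.set(true);
        });
        producer.start();

        // Spin-wait loop: the hint tells the CPU this is a busy wait,
        // mitigating the memory-order-violation penalty on loop exit
        // described in the Intel manual excerpt above.
        while (!ready.get()) {
            Thread.onSpinWait();
        }
        producer.join();
        System.out.println("done");
    }
}
```

On pre-9 JVMs (as in this 2017 thread) the only options were JNI or accepting the penalty, which is presumably why the suggestion above was to test in C.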

On Thu, Apr 20, 2017 at 8:22 AM, J Crawford <latencyfigh...@mail.com> wrote:

> Hi Nikolay,
>
> Thanks for trying to help. Can you elaborate on  "speculative execution"
> and how do you think it could be affecting the socket latency?
>
> My tight loop for pausing is indeed working (the program actually "pauses"
> as expected) so not sure what you mean.
>
> Thanks again!
>
> -JC
>
> On Wednesday, April 19, 2017 at 1:15:24 AM UTC-5, Nikolay Tsankov wrote:
>>
>> Hi,
>>
>> Could it be caused by speculative execution/the tight wait loop? You can
>> probably test in C with a pause instruction in the loop...
>>
>> Best,
>> Nikolay
>>
>> On Tue, Apr 18, 2017 at 9:24 AM, Kirk Pepperdine <ki...@kodewerk.com>
>> wrote:
>>
>>> Some code written, I’ll take this offline
>>>
>>>
>>> On Apr 17, 2017, at 5:28 PM, J Crawford <latency...@mail.com> wrote:
>>>
>>>
>>> > I have some skeletal client/server code in C. It just needs to be
>>> morphed to your test case. I can’t see me getting that done today unless I
>>> get blocked on what I need to get done.
>>>
>>> Hello Kirk, I'm still banging my head trying to understand this latency
>>> issue. Did you have time to use your C code to try to reproduce this
>>> problem? I'm not a C programmer, but if you are busy I can try to adapt
>>> your skeletal client/server C code to the use-case in question.
>>>
>>> I'm currently clueless and unable to make progress. It happens on MacOS,
>>> Linux and Windows so it does not look like a OS-related issue. Looks more
>>> like a JVM or CPU issue.
>>>
>>> Thanks!
>>>
>>> -JC
>>>
>>>
>>> On Thursday, April 13, 2017 at 1:59:48 AM UTC-5, Kirk Pepperdine wrote:
>>>>
>>>>
>>>> Normally when I run into “can’t scale down” problems in Java you have
>>>> to be concerned about methods on the critical path not being hot enough to
>>>> be compiled. However I’d give this one a low probability because the
>>>> knock-on latency is typically 2-3x what you’d see under load. So, this
>>>> seems somehow connected to a buffer with a timer. Under load you get
>>>> fill-and-fire, and of course at scale-down it’s fire-on-timeout, because
>>>> you rarely if ever fill.
>>>>
>>>> Have you looked at this problem using Wireshark or a packet sniffer in
>>>> your network? Another trick is to directly instrument the Socket read,
>>>> write methods. You can do that with BCI or simply just hand modify the code
>>>> and preload it on the bootstrap class path.
>>>>
>>>> I have some skeletal client/server code in C. It just needs to be
>>>> morphed to your test case. I can’t see me getting that done today unless I
>>>> get blocked on what I need to get done.
>>>>
>>>> Kind regards,
>>>> Kirk
>>>>
>>>> On Apr 13, 2017, at 6:45 AM, J Crawford <latency...@mail.com> wrote:
>>>>
>>>> Very good idea, Mike. If I only knew C :) I'll try to hire a C coder on
>>>> UpWork.com <http://upwork.com/> or Elance.com <http://elance.com/> to
>>>> do that. It shouldn't be hard for someone who knows C network programming.
>>>> I hope...
>>>>
>>>> Thanks!
>>>>
>>>> -JC
>>>>
>>>> On Wednesday, April 12, 2017 at 11:37:28 PM UTC-5, mikeb01 wrote:
>>>>>
>>>>> Rewrite the test in C to eliminate the JVM as the cause of the
>>>>> slowdown?
>>>>>
>>>>> On 13 April 2017 at 16:31, J Crawford <latency...@mail.com> wrote:
>>>>>
>>>>

Re: Why would SocketChannel be slower when sending a single msg instead of 1k msgs after proper warmup?

2017-04-19 Thread Nikolay Tsankov
Hi,

Could it be caused by speculative execution/the tight wait loop? You can
probably test in C with a pause instruction in the loop...

Best,
Nikolay

On Tue, Apr 18, 2017 at 9:24 AM, Kirk Pepperdine  wrote:

> Some code written, I’ll take this offline
>
>
> On Apr 17, 2017, at 5:28 PM, J Crawford  wrote:
>
>
> > I have some skeletal client/server code in C. It just needs to be
> morphed to your test case. I can’t see me getting that done today unless I
> get blocked on what I need to get done.
>
> Hello Kirk, I'm still banging my head trying to understand this latency
> issue. Did you have time to use your C code to try to reproduce this
> problem? I'm not a C programmer, but if you are busy I can try to adapt
> your skeletal client/server C code to the use-case in question.
>
> I'm currently clueless and unable to make progress. It happens on MacOS,
> Linux and Windows so it does not look like a OS-related issue. Looks more
> like a JVM or CPU issue.
>
> Thanks!
>
> -JC
>
>
> On Thursday, April 13, 2017 at 1:59:48 AM UTC-5, Kirk Pepperdine wrote:
>>
>>
>> Normally when I run into “can’t scale down” problems in Java you have to
>> be concerned about methods on the critical path not being hot enough to be
>> compiled. However I’d give this one a low probability because the knock-on
>> latency is typically 2-3x what you’d see under load. So, this seems
>> somehow connected to a buffer with a timer. Under load you get
>> fill-and-fire, and of course at scale-down it’s fire-on-timeout, because
>> you rarely if ever fill.
>>
>> Have you looked at this problem using Wireshark or a packet sniffer in
>> your network? Another trick is to directly instrument the Socket read,
>> write methods. You can do that with BCI or simply just hand modify the code
>> and preload it on the bootstrap class path.
>>
>> I have some skeletal client/server code in C. It just needs to be
>> morphed to your test case. I can’t see me getting that done today unless I
>> get blocked on what I need to get done.
>>
>> Kind regards,
>> Kirk
>>
>> On Apr 13, 2017, at 6:45 AM, J Crawford  wrote:
>>
>> Very good idea, Mike. If I only knew C :) I'll try to hire a C coder on
>> UpWork.com  or Elance.com  to do
>> that. It shouldn't be hard for someone who knows C network programming. I
>> hope...
>>
>> Thanks!
>>
>> -JC
>>
>> On Wednesday, April 12, 2017 at 11:37:28 PM UTC-5, mikeb01 wrote:
>>>
>>> Rewrite the test in C to eliminate the JVM as the cause of the slowdown?
>>>
>>> On 13 April 2017 at 16:31, J Crawford  wrote:
>>>
 Ok, this is a total mystery. Tried a bunch of strategies with no luck:

 1. Checked the cpu frequency with i7z_64bit. No variance in the
 frequency.

 2. Disabled all power management. No luck.

 3. Changed TCP Congestion Control Algorithm. No luck.

 4. Set net.ipv4.tcp_slow_start_after_idle to false. No luck.

 5. Tested with UDP implementation. No luck.

 6. Placed the all sockets in blocking mode just for the heck of it. No
 luck, same problem.
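One more knob worth verifying in a single-small-message scenario like this is Nagle's algorithm, which coalesces small writes. A sketch of checking and disabling it on an NIO SocketChannel over loopback (addresses and the self-connect setup are illustrative):

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

public class NoDelayDemo {
    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            // Bind to an ephemeral loopback port for the demo.
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel client =
                         SocketChannel.open(server.getLocalAddress())) {
                // Disable Nagle's algorithm so small writes are sent
                // immediately rather than coalesced with later ones.
                client.setOption(StandardSocketOptions.TCP_NODELAY, true);
                System.out.println(
                        client.getOption(StandardSocketOptions.TCP_NODELAY));
            }
        }
    }
}
```

Since the problem reportedly also reproduces over UDP (item 5 above), Nagle alone cannot be the whole story, but it is cheap to rule out on both ends of the connection.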

 I'm out of pointers now and don't know where to run. This is an
 important latency problem that I must understand as it affects my trading
 system.

 Anyone who has any clue of what might be going on, please throw some
 light. Also, if you run the provided Server and Client code in your own
 environment/machine (over localhost/loopback) you will see that it does
 happen.

 Thanks!

 -JC

 On Wednesday, April 12, 2017 at 10:23:17 PM UTC-5, Todd L. Montgomery
 wrote:
>
> The short answer is that no congestion control algorithm is suited for
> low latency trading and in all cases, using raw UDP will be better for
> latency. Congestion control is about fairness. Latency in trading has
> nothing to do with fairness.
>
> The long answer is that to varying degrees, all congestion control
> must operate at high or complete utilization to probe. Those based on loss
> (all variants of CUBIC, Reno, etc.) must be operating in congestion
> avoidance or be in slow start. Those based on RTT (Vegas) or 
> RTT/Bottleneck
> Bandwidth (BBR) must be probing for more bandwidth to determine change in
> RTT (as a "replacement" for loss).
>
> So, the case of sending only periodically is somewhat antithetical to
> the operating point that all congestion control must operate at while
> probing. And that is the reason all appropriate congestion control
> algorithms I know of reset upon not operating at high utilization.
>
> You can think of it this way: the network can only sustain X
> msgs/sec, but X is a (seemingly random) nonlinear function of time. How do
> you determine X at any given time without operating at that point? You can
> not, that I know of, predict X without