Re: Large durable caches

2018-06-11 Thread Dmitriy Setrakyan
Ignite 2.5 has been released and can be downloaded from the Ignite
website:
https://ignite.apache.org/download.html

D.

On Wed, May 30, 2018 at 6:33 AM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> Ray,
>
> Which Ignite version are you running? You may be affected by [1], which
> gets worse the larger the data set is. Please wait for the Ignite 2.5
> release, which will be available shortly.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7638
>
On Fri, May 18, 2018 at 5:44, Ray wrote:
>
>> I ran into this issue as well.
>> I'm running tests on a six-node Ignite cluster, and the data load gets
>> stuck after 1 billion records have been ingested.
>> Can someone take a look at this issue, please?
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>
>


Re: Large durable caches

2018-05-30 Thread Alexey Goncharuk
Ray,

Which Ignite version are you running? You may be affected by [1], which
gets worse the larger the data set is. Please wait for the Ignite 2.5
release, which will be available shortly.

[1] https://issues.apache.org/jira/browse/IGNITE-7638

On Fri, May 18, 2018 at 5:44, Ray wrote:

> I ran into this issue as well.
> I'm running tests on a six-node Ignite cluster, and the data load gets
> stuck after 1 billion records have been ingested.
> Can someone take a look at this issue, please?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Re: Large durable caches

2018-05-18 Thread Dave Harvey
Early on, running 2.3, we hit a clear deadlock that I never root-caused,
where the cluster just stopped working.  At the time I was using the same
DataStreamer from multiple threads, and we had tuned up the buffer size
because of that; we were also running against EBS, perhaps with too-short
timeouts. We have not seen this on 2.4 with a DataStreamer per producer
thread, default parameters, and SSDs.  The problem seemed worse when I
followed the Ignite startup message about needing to set a message
buffer/size limit and specified one.
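
For reference, a minimal Java sketch of the pattern described above -- one
IgniteDataStreamer per producer thread, left at its default buffer and
parallelism settings. The cache name, key range, and value type are
illustrative, not taken from this thread:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;

    public class PerThreadLoader {
        // Each producer thread calls this with its own key range;
        // nothing is shared between threads.
        static void load(Ignite ignite, long firstKey, long lastKey) {
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("myCache")) {
                for (long k = firstKey; k < lastKey; k++)
                    streamer.addData(k, "value-" + k);
            } // close() flushes whatever is still buffered
        }
    }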

One thing still on my list, however, is to understand more about paired TCP
connections and why (or whether) they are the default.  Fundamentally, if
you are sending bi-directional request/response pairs over a single TCP
virtual circuit, there is an inherent deadlock risk: responses can get stuck
behind requests that are flow controlled.  With a single VC, the only
general solution is to assume unlimited memory, reading requests off the VC
and queuing them in memory so that the responses behind them can still be
drained.  You can bound the receiver's memory usage by limiting, at a higher
level, the total number of requests that may be outstanding, but as the node
count scales, the receiver would need more memory.  I've been assuming that
paired connections are meant to address this fundamental issue and prevent
requests from blocking responses, but I haven't gotten there yet.  My
impression is that paired connections are not the default.
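
For context, paired connections are controlled by TcpCommunicationSpi and are
indeed off by default; a minimal sketch of turning them on (no claim here
about whether it resolves the flow-control concern above):

    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

    public class PairedConnectionsConfig {
        static IgniteConfiguration config() {
            TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
            // Use separate TCP connections for outgoing and incoming traffic
            // between a pair of nodes.
            commSpi.setUsePairedConnections(true);
            return new IgniteConfiguration().setCommunicationSpi(commSpi);
        }
    }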



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-05-17 Thread Ray
I ran into this issue as well.
I'm running tests on a six-node Ignite cluster, and the data load gets stuck
after 1 billion records have been ingested.
Can someone take a look at this issue, please?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-03-26 Thread Larry
Hi Alexey.

Thanks for the feedback.  It will take me a little bit to set up the test
again.  My original test had a bunch of machines in cluster mode; I guess
to test this I can just use one node with a small data region instead.

On Wed, Mar 21, 2018 at 3:59 AM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> Hi Larry,
>
> Recently we've fixed ticket [1], which may cause major performance
> degradation when page replacement is present. Can you please try building
> Ignite from the master branch and check whether the issue is still present?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7638
>
> 2018-03-15 19:25 GMT+03:00 Larry :
>
>> Hi Alexey.
>>
>> Were there any findings?  Any updates would be helpful.
>>
>> Thanks,
>> -Larry
>>
>> On Thu, Mar 8, 2018 at 3:48 PM, Dmitriy Setrakyan 
>> wrote:
>>
>>> Hi Lawrence,
>>>
>>> I believe Alexey Goncharuk was working on improving this scenario.
>>> Alexey, can you provide some of your findings here?
>>>
>>> D.
>>>
>>> -- Forwarded message --
>>> From: lawrencefinn 
>>> Date: Mon, Mar 5, 2018 at 1:54 PM
>>> Subject: Re: Large durable caches
>>> To: user@ignite.apache.org
>>>
>>>
>>> Bump.  Can anyone verify this?  If Ignite cannot scale in this manner,
>>> that is fine; I'd just want to know whether what I am seeing makes sense.
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>>
>>>
>>
>


Re: Large durable caches

2018-03-23 Thread ilya.kasnacheev
Hello!

From your thread dump, you can see that writing to disk is underway:


"db-checkpoint-thread-#56" #91 prio=5 os_prio=0 tid=0x7f7f84247800
nid=0x12e6 runnable [0x7f7b201fe000]
   java.lang.Thread.State: RUNNABLE
at sun.nio.ch.FileDispatcherImpl.force0(Native Method)
at sun.nio.ch.FileDispatcherImpl.force(FileDispatcherImpl.java:76)
at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:388)
at
org.apache.ignite.internal.processors.cache.persistence.file.RandomAccessFileIO.force(RandomAccessFileIO.java:87)
at
org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.sync(FilePageStore.java:495)
at
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.doCheckpoint(GridCacheDatabaseSharedManager.java:2156)
at
org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager$Checkpointer.body(GridCacheDatabaseSharedManager.java:2010)
at
org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
at java.lang.Thread.run(Thread.java:745) 

Is it possible that these writes are over the network (e.g. on NAS) and take
forever to complete?

Regards,



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-03-21 Thread Alexey Goncharuk
Hi Larry,

Recently we've fixed ticket [1], which may cause major performance
degradation when page replacement is present. Can you please try building
Ignite from the master branch and check whether the issue is still present?

[1] https://issues.apache.org/jira/browse/IGNITE-7638

2018-03-15 19:25 GMT+03:00 Larry :

> Hi Alexey.
>
> Were there any findings?  Any updates would be helpful.
>
> Thanks,
> -Larry
>
> On Thu, Mar 8, 2018 at 3:48 PM, Dmitriy Setrakyan 
> wrote:
>
>> Hi Lawrence,
>>
>> I believe Alexey Goncharuk was working on improving this scenario.
>> Alexey, can you provide some of your findings here?
>>
>> D.
>>
>> -- Forwarded message --
>> From: lawrencefinn 
>> Date: Mon, Mar 5, 2018 at 1:54 PM
>> Subject: Re: Large durable caches
>> To: user@ignite.apache.org
>>
>>
>> Bump.  Can anyone verify this?  If Ignite cannot scale in this manner,
>> that is fine; I'd just want to know whether what I am seeing makes sense.
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>>
>>
>


Re: Large durable caches

2018-03-15 Thread Larry
Hi Alexey.

Were there any findings?  Any updates would be helpful.

Thanks,
-Larry

On Thu, Mar 8, 2018 at 3:48 PM, Dmitriy Setrakyan 
wrote:

> Hi Lawrence,
>
> I believe Alexey Goncharuk was working on improving this scenario. Alexey,
> can you provide some of your findings here?
>
> D.
>
> -- Forwarded message --
> From: lawrencefinn 
> Date: Mon, Mar 5, 2018 at 1:54 PM
> Subject: Re: Large durable caches
> To: user@ignite.apache.org
>
>
> Bump.  Can anyone verify this?  If Ignite cannot scale in this manner,
> that is fine; I'd just want to know whether what I am seeing makes sense.
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
>


Re: Large durable caches

2018-03-05 Thread lawrencefinn
Bump.  Can anyone verify this?  If Ignite cannot scale in this manner, that
is fine; I'd just want to know whether what I am seeing makes sense.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-02-23 Thread David Harvey
In our case, we use 1+ backups, so Ignite replicates the data to other
nodes.  My understanding: Ignite replicates changes at the K/V or record
level and sends them to the backup node's WAL; the backup node may differ
per partition.  The backup node applies the changes to its own storage and
also updates its local SQL indices (which only reference records on the
local node).  Ignite persistence writes data to the SSDs in pages, so if you
change 80 bytes it must write multiple 4K pages on the SSDs, while sending
only around those 80 bytes over the network.  So you need a lot more SSD
bandwidth than network bandwidth when using Ignite persistence with backups.
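
A minimal Java sketch of the setup being described -- native persistence plus
one backup copy per partition; the cache name and types are illustrative:

    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.configuration.CacheConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PersistentBackupConfig {
        static IgniteConfiguration config() {
            DataStorageConfiguration storage = new DataStorageConfiguration();
            storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);

            CacheConfiguration<Long, byte[]> cache = new CacheConfiguration<>("myCache");
            cache.setCacheMode(CacheMode.PARTITIONED);
            cache.setBackups(1); // each update is also applied on one backup node

            return new IgniteConfiguration()
                .setDataStorageConfiguration(storage)
                .setCacheConfiguration(cache);
        }
    }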

-DH

On Thu, Feb 22, 2018 at 6:52 PM, Raymond Wilson 
wrote:

> If you’re using local SSDs, these are not durable by definition. How are
> you ensuring your data survives termination of the instance, or a failure
> of the underlying physical host?
>
>
>
> *From:* David Harvey [mailto:dhar...@jobcase.com]
> *Sent:* Friday, February 23, 2018 10:03 AM
> *To:* user@ignite.apache.org
> *Subject:* Re: Large durable caches
>
>
>
> Using the local SSDs, which are radically faster than EBS.
>
>
>
> On Thu, Feb 22, 2018 at 10:34 AM, lawrencefinn  wrote:
>
> I could try a different AWS instance.  I'm running these tests on r4.8xlarge
> boxes, which are pretty beefy and "EBS-optimized".  I tried the same tests
> using IO1 disks at 20,000 IOPS but still had issues.
>
> Dave, with the i3 instances, were you using the local SSDs?  Or still using
> EBS?
>
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>
>
>
>
>


RE: Large durable caches

2018-02-22 Thread Raymond Wilson
If you’re using local SSDs, these are not durable by definition. How are
you ensuring your data survives termination of the instance, or a failure
of the underlying physical host?



*From:* David Harvey [mailto:dhar...@jobcase.com]
*Sent:* Friday, February 23, 2018 10:03 AM
*To:* user@ignite.apache.org
*Subject:* Re: Large durable caches



Using the local SSDs, which are radically faster than EBS.



On Thu, Feb 22, 2018 at 10:34 AM, lawrencefinn  wrote:

I could try a different AWS instance.  I'm running these tests on r4.8xlarge
boxes, which are pretty beefy and "EBS-optimized".  I tried the same tests
using IO1 disks at 20,000 IOPS but still had issues.

Dave, with the i3 instances, were you using the local SSDs?  Or still using
EBS?




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/







Re: Large durable caches

2018-02-22 Thread David Harvey
Using the local SSDs, which are radically faster than EBS.

On Thu, Feb 22, 2018 at 10:34 AM, lawrencefinn  wrote:

> I could try a different AWS instance.  I'm running these tests on r4.8xlarge
> boxes, which are pretty beefy and "EBS-optimized".  I tried the same tests
> using IO1 disks at 20,000 IOPS but still had issues.
>
> Dave, with the i3 instances, were you using the local SSDs?  Or still using
> EBS?
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>



Re: Large durable caches

2018-02-22 Thread lawrencefinn
I could try a different AWS instance.  I'm running these tests on r4.8xlarge
boxes, which are pretty beefy and "EBS-optimized".  I tried the same tests
using IO1 disks at 20,000 IOPS but still had issues.

Dave, with the i3 instances, were you using the local SSDs?  Or still using
EBS?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-02-21 Thread Dave Harvey
I fought with trying to get Ignite persistence to work well on AWS GP2
volumes, finally gave up, and moved to i3 instances, where the cost per
write IOP is much lower: an i3.8xlarge gets 720,000 4K write IOPS versus on
the order of 10,000 for about the same cost.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-02-21 Thread Ivan Rakov
Of course, there is another option that may help - trying better AWS 
instances.


Best Regards,
Ivan Rakov

On 22.02.2018 0:58, Ivan Rakov wrote:
[03:14:36,596][INFO][db-checkpoint-thread-#77][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=e2648942-961a-45c4-a7f9-41129e76b70f,
pages=801041, markPos=FileWALPointer [idx=2131, fileOffset=1035404,
len=60889, forceFlush=true], walSegmentsCleared=50, markDuration=913ms,
pagesWrite=3798ms, fsync=71652ms, total=76363ms]


From this message I can conclude that you have quite slow disks.
Checkpointing of 800K pages (3.2G of data) took 76 seconds, and fsync
took 71 seconds of that.
A freeze is expected when your disk bandwidth is far lower than your data
load bandwidth. A node can accept a limited amount of updates during a
checkpoint. If that limit is exceeded, all update operations will block
until the end of the current checkpoint.
Ignite has a write throttling mechanism (see
https://apacheignite.readme.io/docs/durable-memory-tuning) to avoid such
situations, but in 2.3 it's not tuned for the case when fsync takes most
of the checkpoint time. An improved version of throttling that adapts to
long fsync will be released in 2.5:
https://issues.apache.org/jira/browse/IGNITE-7533


However, I still can't say why your client disconnects from the server. The
only explanation I see is that it's an artifact of the AWS virtual
environment: excessive disk I/O during a checkpoint somehow affects the
quality of the network connection. Try decreasing the throughput of data
loading (your hardware can't handle more than 40MB/sec anyway) and see how
the system reacts.
You can achieve lower throughput by tuning your data streamer, or by not
using the data streamer at all.


Best Regards,
Ivan Rakov

On 21.02.2018 22:56, lawrencefinn wrote:
Here is a thread dump from when the client disconnects.  Looks like the only
thread doing something interesting is db-checkpoint-thread:

2018-02-21 19:31:36
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode):

"tcp-disco-client-message-worker-#36" #1128 prio=10 os_prio=0
tid=0x7f800c001800 nid=0x2d89 waiting on condition [0x7f7f37afb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00075d9d7790> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
        at java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6643)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

   Locked ownable synchronizers:
        - None

"tcp-disco-sock-reader-#34" #1126 prio=10 os_prio=0 tid=0x7f7f74001800
nid=0x2d88 runnable [0x7f7b2040]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x00075d812028> (a java.io.BufferedInputStream)
        at org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
        at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2338)
        at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
        at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
        at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
        at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
        at org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:39)
        at org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:119)
        at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
        at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9740)
        at org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5946)
        at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

   Locked ownable synchronizers:
        - None

Re: Large durable caches

2018-02-21 Thread Ivan Rakov

[03:14:36,596][INFO][db-checkpoint-thread-#77][GridCacheDatabaseSharedManager]
Checkpoint finished [cpId=e2648942-961a-45c4-a7f9-41129e76b70f,
pages=801041, markPos=FileWALPointer [idx=2131, fileOffset=1035404,
len=60889, forceFlush=true], walSegmentsCleared=50, markDuration=913ms,
pagesWrite=3798ms, fsync=71652ms, total=76363ms]


From this message I can conclude that you have quite slow disks.
Checkpointing of 800K pages (3.2G of data) took 76 seconds, and fsync
took 71 seconds of that.
A freeze is expected when your disk bandwidth is far lower than your data
load bandwidth. A node can accept a limited amount of updates during a
checkpoint. If that limit is exceeded, all update operations will block
until the end of the current checkpoint.
Ignite has a write throttling mechanism (see
https://apacheignite.readme.io/docs/durable-memory-tuning) to avoid such
situations, but in 2.3 it's not tuned for the case when fsync takes most
of the checkpoint time. An improved version of throttling that adapts to
long fsync will be released in 2.5:
https://issues.apache.org/jira/browse/IGNITE-7533
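
As a sanity check on the numbers above: 801,041 pages at the default 4 KiB
page size is roughly 3.2 GB, checkpointed in about 76 seconds, i.e. around
43 MB/s of effective disk bandwidth. For reference, a minimal sketch of
enabling the throttling mentioned here, assuming the 2.3+
DataStorageConfiguration API:

    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class ThrottlingConfig {
        static IgniteConfiguration config() {
            DataStorageConfiguration storage = new DataStorageConfiguration();
            storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            // Slow writers down gradually instead of letting updates hit the
            // hard "too many dirty pages" stop during a checkpoint.
            storage.setWriteThrottlingEnabled(true);
            return new IgniteConfiguration().setDataStorageConfiguration(storage);
        }
    }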


However, I still can't say why your client disconnects from the server. The
only explanation I see is that it's an artifact of the AWS virtual
environment: excessive disk I/O during a checkpoint somehow affects the
quality of the network connection. Try decreasing the throughput of data
loading (your hardware can't handle more than 40MB/sec anyway) and see how
the system reacts.
You can achieve lower throughput by tuning your data streamer, or by not
using the data streamer at all.


Best Regards,
Ivan Rakov

On 21.02.2018 22:56, lawrencefinn wrote:

Here is a thread dump from when the client disconnects.  Looks like the only
thread doing something interesting is db-checkpoint-thread:

2018-02-21 19:31:36
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode):

"tcp-disco-client-message-worker-#36" #1128 prio=10 os_prio=0
tid=0x7f800c001800 nid=0x2d89 waiting on condition [0x7f7f37afb000]
java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x00075d9d7790> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
 at
java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
 at
java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
 at
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6643)
 at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

Locked ownable synchronizers:
 - None

"tcp-disco-sock-reader-#34" #1126 prio=10 os_prio=0 tid=0x7f7f74001800
nid=0x2d88 runnable [0x7f7b2040]
java.lang.Thread.State: RUNNABLE
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
 at java.net.SocketInputStream.read(SocketInputStream.java:170)
 at java.net.SocketInputStream.read(SocketInputStream.java:141)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
 at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
 at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
 - locked <0x00075d812028> (a java.io.BufferedInputStream)
 at
org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
 at
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2338)
 at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
 at
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
 at
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
 at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
 at
org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:39)
 at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:119)
 at
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
 at
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9740)
 at
org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5946)
 at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

Locked ownable synchronizers:
 - None

"sys-#1062" #1125 prio=5 os_prio=0 tid=0x7f7f24031800 nid=0x2d86 waiting
on condition [0x7f7f3477b000]
java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.

Re: Large durable caches

2018-02-21 Thread lawrencefinn
Here is a thread dump from when the client disconnects.  Looks like the only
thread doing something interesting is db-checkpoint-thread:

2018-02-21 19:31:36
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.92-b14 mixed mode):

"tcp-disco-client-message-worker-#36" #1128 prio=10 os_prio=0
tid=0x7f800c001800 nid=0x2d89 waiting on condition [0x7f7f37afb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00075d9d7790> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at
java.util.concurrent.LinkedBlockingDeque.pollFirst(LinkedBlockingDeque.java:522)
at
java.util.concurrent.LinkedBlockingDeque.poll(LinkedBlockingDeque.java:684)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$MessageWorkerAdapter.body(ServerImpl.java:6643)
at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

   Locked ownable synchronizers:
- None

"tcp-disco-sock-reader-#34" #1126 prio=10 os_prio=0 tid=0x7f7f74001800
nid=0x2d88 runnable [0x7f7b2040]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
- locked <0x00075d812028> (a java.io.BufferedInputStream)
at
org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
at
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2338)
at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
at
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
at
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at
org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:39)
at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:119)
at
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
at
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9740)
at
org.apache.ignite.spi.discovery.tcp.ServerImpl$SocketReader.body(ServerImpl.java:5946)
at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)

   Locked ownable synchronizers:
- None

"sys-#1062" #1125 prio=5 os_prio=0 tid=0x7f7f24031800 nid=0x2d86 waiting
on condition [0x7f7f3477b000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0004c0045108> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
- None
"sys-#1062" #1125 prio=5 os_prio=0 tid=0x7f7f24031800 nid=0x2d86 waiting
on condition [0x7f7f3477b000]
   java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0004c0045108> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
at
java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
at
java.util.concurrent.ThreadPool

Re: Large durable caches

2018-02-21 Thread lawrencefinn
I'll give that tweaking a try.  It's hard to take a thread dump at exactly
the moment it freezes; do you think there is any harm in taking a thread
dump every 10 seconds or so?
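
For reference, periodic dumps can also be captured from inside the JVM via
ThreadMXBean; a minimal sketch (an external jstack <pid> loop every 10
seconds gives the same information):

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class PeriodicThreadDump {
        public static void main(String[] args) throws InterruptedException {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            while (true) {
                System.out.println("=== thread dump @ " + new java.util.Date() + " ===");
                // true, true -> include locked monitors and ownable synchronizers
                for (ThreadInfo info : threads.dumpAllThreads(true, true))
                    System.out.print(info); // ThreadInfo.toString() truncates very deep stacks
                Thread.sleep(10_000L);
            }
        }
    }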

I tried a new setup with more nodes (from 2 to 4) to test how that affects
this problem.  I saw fewer data streaming errors, but it appears that the
client disconnects pretty frequently.  The client is running on the same
machine as the server it is connecting to, so I don't see a real network
issue.  The disconnect times seem to overlap with checkpoints starting.
Here is the log from the client:

2018-02-21 03:14:19,720 [ERROR] from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi in
tcp-client-disco-sock-writer-#2%cabooseGrid% - Failed to send message: null
java.io.IOException: Failed to get acknowledge for message:
TcpDiscoveryClientMetricsUpdateMessage [super=TcpDiscoveryAbstractMessage
[sndNodeId=null, id=7a5d3c5b161-6b443ecd-f658-4782-8f8e-3ed6d6407fc1,
verifierNodeId=null, topVer=0, pendingIdx=0, failedNodes=null, isClient=true]]
at
org.apache.ignite.spi.discovery.tcp.ClientImpl$SocketWriter.body(ClientImpl.java:1276)
at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
2018-02-21 03:14:19,720 [ERROR] from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi in
tcp-client-disco-sock-reader-#3%cabooseGrid% - Failed to read message
[sock=Socket[addr=/127.0.0.1,port=47500,localport=59446],
locNodeId=6b443ecd-f658-4782-8f8e-3ed6d6407fc1,
rmtNodeId=93be80c3-c2b5-498b-a897-265c8bacb648]
org.apache.ignite.IgniteCheckedException: Failed to deserialize object with
given class loader: sun.misc.Launcher$AppClassLoader@28d93b30
at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:129)
at
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
at
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9740)
at
org.apache.ignite.spi.discovery.tcp.ClientImpl$SocketReader.body(ClientImpl.java:1001)
at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at
org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
at
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2338)
at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
at
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
at
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at
org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:39)
at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:119)
... 4 common frames omitted
2018-02-21 03:14:19,720 [ERROR] from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi in
tcp-client-disco-sock-reader-#3%cabooseGrid% - Connection failed
[sock=Socket[addr=/127.0.0.1,port=47500,localport=59446],
locNodeId=6b443ecd-f658-4782-8f8e-3ed6d6407fc1]
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:170)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
at
org.apache.ignite.marshaller.jdk.JdkMarshallerInputStreamWrapper.read(JdkMarshallerInputStreamWrapper.java:53)
at
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2338)
at
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2351)
at
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2822)
at
java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:804)
at java.io.ObjectInputStream.<init>(ObjectInputStream.java:301)
at
org.apache.ignite.marshaller.jdk.JdkMarshallerObjectInputStream.<init>(JdkMarshallerObjectInputStream.java:39)
at
org.apache.ignite.marshaller.jdk.JdkMars

Re: Large durable caches

2018-02-21 Thread Ivan Rakov
Decreasing perNodeBufferSize and setting perNodeParallelOperations to 
server_cpu_cores * 2 may help if GC load is high.
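
A minimal sketch of that client-side tuning; the cache name, buffer size, and
core count are placeholders rather than recommendations:

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteDataStreamer;

    public class TunedStreamer {
        static void load(Ignite ignite, int serverCpuCores) {
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("myCache")) {
                streamer.perNodeBufferSize(256);                        // smaller batches held on the client heap
                streamer.perNodeParallelOperations(serverCpuCores * 2); // cap on unacknowledged batches per node
                // ... streamer.addData(key, value) calls as before ...
            }
        }
    }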


Yes, it can be an AWS-related issue, either connectivity or disk I/O. There
have been reports about cluster segmentation due to network problems on AWS.
Please share full server/client logs and full thread dumps taken at the
moment of the freeze; that will certainly help the analysis.


Best Regards,
Ivan Rakov

On 20.02.2018 22:24, lawrencefinn wrote:

Should I decrease these?  One other thing to note: I'm monitoring GC, and the
GC times do not correlate with these issues (GC times are pretty low
anyway).  I honestly think that persisting to disk somehow causes things to
freeze up.  Could it be an AWS-related issue?  I'm using EBS IO1 volumes with
20,000 IOPS, one volume for persistence and one for the WAL.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/




Re: Large durable caches

2018-02-20 Thread lawrencefinn
Should I decrease these?  One other thing to note: I'm monitoring GC, and the
GC times do not correlate with these issues (GC times are pretty low
anyway).  I honestly think that persisting to disk somehow causes things to
freeze up.  Could it be an AWS-related issue?  I'm using EBS IO1 volumes with
20,000 IOPS, one volume for persistence and one for the WAL.
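
For reference, a minimal sketch of pointing the page store and the WAL at the
two separate volumes (mount points are hypothetical; assumes the 2.3+
DataStorageConfiguration API):

    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class SplitVolumesConfig {
        static IgniteConfiguration config() {
            DataStorageConfiguration storage = new DataStorageConfiguration();
            storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            storage.setStoragePath("/mnt/ignite-data");           // page store on the persistence volume
            storage.setWalPath("/mnt/ignite-wal");                // WAL on its own volume
            storage.setWalArchivePath("/mnt/ignite-wal/archive");
            return new IgniteConfiguration().setDataStorageConfiguration(storage);
        }
    }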



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-02-19 Thread Ivan Rakov
It's hard to determine the problem from these messages. I don't see
anything unhealthy regarding persistence - a checkpoint start is a regular
event.


There have been cases where excessive GC load on the client side seriously
affected the throughput/latency of data streaming. You may consider playing
with the following data streamer parameters:


perNodeBufferSize(int bufSize) - defines how many entries are buffered
before the buffer is sent to a server node
perNodeParallelOperations(int parallelOps) - defines how many buffers may be
sent to a node without an acknowledgement that they were processed


Best Regards,
Ivan Rakov

On 19.02.2018 22:24, lawrencefinn wrote:

Okay, I am trying to reproduce it.  It hasn't gotten stuck yet, but the
client got disconnected and reconnected recently.  I don't think it is
related to GC, because I am recording GC times and they do not jump up that
much.  Could the system get slow under a lot of I/O?  I see this in the
Ignite log:

[19:13:01,988][WARNING][grid-timeout-worker-#71][diagnostic] Found long
running cache future [startTime=19:11:56.656, curTime=19:13:01.911,
  fut=GridNearAtomicSingleUpdateFuture [reqState=Primary
[id=62a2a255-3320-4040-aa23-ffb86dec7586, opRes=false, expCnt=-1, rcvdCnt=0,
primaryRes=false, done=false, waitFor=null, rcvd=null],
super=GridNearAtomicAbstractUpdateFuture [remapCnt=100,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=14], remapTopVer=null,
err=null, futId=313296239, super=GridFutureAdapter [ignoreInterrupts=false,
state=INIT, res=null, hash=1229092316
[19:13:01,988][WARNING][grid-timeout-worker-#71][diagnostic] Found long
running cache future [startTime=19:11:39.917, curTime=19:13:01.911,
fut=GridNearAtomicSingleUpdateFuture [reqState=Primary
[id=62a2a255-3320-4040-aa23-ffb86dec7586, opRes=false, expCnt=-1, rcvdCnt=0,
primaryRes=false, done=false, waitFor=null, rcvd=null],
super=GridNearAtomicAbstractUpdateFuture [remapCnt=100,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=14], remapTopVer=null,
err=null, futId=312914655, super=GridFutureAdapter [ignoreInterrupts=false,
state=INIT, res=null, hash=15435296
[19:13:51,057][INFO][db-checkpoint-thread-#110][GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=77744626-04e6-4e17-bda7-23ecb50bbe19,
startPtr=FileWALPointer [idx=9600, fileOffset=35172819, len=124303,
forceFlush=true], checkpointLockWait=57708ms, checkpointLockHoldTime=64ms,
pages=3755135, reason='too many dirty pages']
[19:14:01,919][INFO][grid-timeout-worker-#71][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
 ^-- Node [id=62a2a255, uptime=01:42:41.752]
 ^-- H/N/C [hosts=2, nodes=3, CPUs=64]
 ^-- CPU [cur=77.83%, avg=39.11%, GC=0.13%]
 ^-- PageMemory [pages=5111642]
 ^-- Heap [used=11669MB, free=43.02%, comm=20480MB]
 ^-- Non heap [used=67MB, free=95.56%, comm=69MB]
 ^-- Public thread pool [active=0, idle=0, qSize=0]
 ^-- System thread pool [active=0, idle=6, qSize=0]
 ^-- Outbound messages queue [size=0]
[19:15:03,470][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
accepted incoming connection [rmtAddr=/127.0.0.1, rmtPort=33542]
[19:15:03,470][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
spawning a new thread for connection [rmtAddr=/127.0.0.1, rmtPort=33542]


My app log has:
2018-02-19 19:15:02,176 [WARN] from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi in
tcp-client-disco-reconnector-#5%cabooseGrid% - Timed out waiting for message
to be read (most probably, the reason is long GC pauses on remote node)
[curTimeout=5000, rmtAddr=/127.0.0.1:47500, rmtPort=47500]
2018-02-19 19:15:02,176 [ERROR] from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi in
tcp-client-disco-reconnector-#5%cabooseGrid% - Exception on joining: Failed
to deserialize object with given class loader:
sun.misc.Launcher$AppClassLoader@28d93b30
org.apache.ignite.IgniteCheckedException: Failed to deserialize object with
given class loader: sun.misc.Launcher$AppClassLoader@28d93b30
 at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:129)
 at
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
 at
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9740)
 at
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.readMessage(TcpDiscoverySpi.java:1590)
 at
org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:627)
 at
org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:524)
 at
org.apache.ignite.spi.discovery.tcp.ClientImpl.access$900(ClientImpl.java:124)
 at
org.apache.ignite.spi.discovery.tcp.ClientImpl$Reconnector.body(ClientImpl.java:1377)
 at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.net.SocketTimeoutException: Read timed out



--
Sent from: http://apache-ignite-users

Re: Large durable caches

2018-02-19 Thread lawrencefinn
Okay, I am trying to reproduce it.  It hasn't gotten stuck yet, but the
client got disconnected and reconnected recently.  I don't think it is
related to GC, because I am recording GC times and they do not jump up that
much.  Could the system get slow under a lot of I/O?  I see this in the
Ignite log:

[19:13:01,988][WARNING][grid-timeout-worker-#71][diagnostic] Found long
running cache future [startTime=19:11:56.656, curTime=19:13:01.911,
 fut=GridNearAtomicSingleUpdateFuture [reqState=Primary
[id=62a2a255-3320-4040-aa23-ffb86dec7586, opRes=false, expCnt=-1, rcvdCnt=0,
primaryRes=false, done=false, waitFor=null, rcvd=null],
super=GridNearAtomicAbstractUpdateFuture [remapCnt=100,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=14], remapTopVer=null,
err=null, futId=313296239, super=GridFutureAdapter [ignoreInterrupts=false,
state=INIT, res=null, hash=1229092316
[19:13:01,988][WARNING][grid-timeout-worker-#71][diagnostic] Found long
running cache future [startTime=19:11:39.917, curTime=19:13:01.911,
fut=GridNearAtomicSingleUpdateFuture [reqState=Primary
[id=62a2a255-3320-4040-aa23-ffb86dec7586, opRes=false, expCnt=-1, rcvdCnt=0,
primaryRes=false, done=false, waitFor=null, rcvd=null],
super=GridNearAtomicAbstractUpdateFuture [remapCnt=100,
topVer=AffinityTopologyVersion [topVer=3, minorTopVer=14], remapTopVer=null,
err=null, futId=312914655, super=GridFutureAdapter [ignoreInterrupts=false,
state=INIT, res=null, hash=15435296
[19:13:51,057][INFO][db-checkpoint-thread-#110][GridCacheDatabaseSharedManager]
Checkpoint started [checkpointId=77744626-04e6-4e17-bda7-23ecb50bbe19,
startPtr=FileWALPointer [idx=9600, fileOffset=35172819, len=124303,
forceFlush=true], checkpointLockWait=57708ms, checkpointLockHoldTime=64ms,
pages=3755135, reason='too many dirty pages']
[19:14:01,919][INFO][grid-timeout-worker-#71][IgniteKernal] 
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=62a2a255, uptime=01:42:41.752]
^-- H/N/C [hosts=2, nodes=3, CPUs=64]
^-- CPU [cur=77.83%, avg=39.11%, GC=0.13%]
^-- PageMemory [pages=5111642]
^-- Heap [used=11669MB, free=43.02%, comm=20480MB]
^-- Non heap [used=67MB, free=95.56%, comm=69MB]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=6, qSize=0]
^-- Outbound messages queue [size=0]
[19:15:03,470][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
accepted incoming connection [rmtAddr=/127.0.0.1, rmtPort=33542]
[19:15:03,470][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery
spawning a new thread for connection [rmtAddr=/127.0.0.1, rmtPort=33542]


My app log has:
2018-02-19 19:15:02,176 [WARN] from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi in
tcp-client-disco-reconnector-#5%cabooseGrid% - Timed out waiting for message
to be read (most probably, the reason is long GC pauses on remote node)
[curTimeout=5000, rmtAddr=/127.0.0.1:47500, rmtPort=47500]
2018-02-19 19:15:02,176 [ERROR] from
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi in
tcp-client-disco-reconnector-#5%cabooseGrid% - Exception on joining: Failed
to deserialize object with given class loader:
sun.misc.Launcher$AppClassLoader@28d93b30
org.apache.ignite.IgniteCheckedException: Failed to deserialize object with
given class loader: sun.misc.Launcher$AppClassLoader@28d93b30
at
org.apache.ignite.marshaller.jdk.JdkMarshaller.unmarshal0(JdkMarshaller.java:129)
at
org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:94)
at
org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:9740)
at
org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi.readMessage(TcpDiscoverySpi.java:1590)
at
org.apache.ignite.spi.discovery.tcp.ClientImpl.sendJoinRequest(ClientImpl.java:627)
at
org.apache.ignite.spi.discovery.tcp.ClientImpl.joinTopology(ClientImpl.java:524)
at
org.apache.ignite.spi.discovery.tcp.ClientImpl.access$900(ClientImpl.java:124)
at
org.apache.ignite.spi.discovery.tcp.ClientImpl$Reconnector.body(ClientImpl.java:1377)
at
org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
Caused by: java.net.SocketTimeoutException: Read timed out



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Large durable caches

2018-02-19 Thread Ivan Rakov

Hello,

Seems like the checkpoint-runner thread is idle. There's no active
checkpoint at the moment of the dump, and thus no work for the
checkpoint thread.
If the problem persists, please share a full thread dump. It would be
better to take 2-3 consecutive thread dumps when your data streamer stops.


Best Regards,
Ivan Rakov

On 19.02.2018 0:41, lawrencefinn wrote:

I am testing a very large durable cache that mostly cannot fit into memory.
I start loading in a lot of data via a data streamer.  At some point the
data becomes too large to fit into memory, so Ignite starts writing a lot to
disk during checkpoints.  But some time after that, the data streamer stops
streaming; it's stuck.  Doing a jstack on Ignite, I see the data streamer
threads are all stuck in a parked state.  Any thoughts?

"data-streamer-stripe-4-#37" #62 prio=5 os_prio=0 tid=0x7fe8c212d000
nid=0x1410 waiting on condition [0x7fe73d1d6000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
 at
org.apache.ignite.internal.util.StripedExecutor$StripeConcurrentQueue.take(StripedExecutor.java:651)
 at
org.apache.ignite.internal.util.StripedExecutor$Stripe.run(StripedExecutor.java:499)
 at java.lang.Thread.run(Thread.java:745)

The checkpoint thread seems to be parked too
"checkpoint-runner-#354" #397 prio=5 os_prio=0 tid=0x7fe49c04f000
nid=0x177c waiting on condition [0x7fdc54a1c000]
java.lang.Thread.State: WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for  <0x0002c08a6ef8> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
 at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
 at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
 at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
 - None


My configuration is as follows:

[Spring XML configuration omitted: the angle-bracketed content was stripped
by the mailing list archive.]






--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/