Re: data loss during rebalancing when using invokeAsync + EntryProcessor

2019-06-08 Thread Kamil Mišúth
It's quite easy to starve Ignite thread pools once you start using the 
asynchronous API and listeners extensively. I guess there would be no built-in 
starvation detection in Ignite otherwise...
What is worse, the starvation may manifest itself only under heavy load and 
only in a cluster.
When you couple Ignite with a high-throughput web server like Netty, it will 
pass all the load straight through onto Ignite threads. The design of Netty and 
the usual reactive stack naturally pushes you toward the Ignite async APIs, and 
everything will work for some time, at least on paper. That is, until you start 
to simulate heavier loads, like 500 requests per second per instance. Depending 
on the computations you do in the listeners, you may then create unresolvable 
graphs of computations that cannot make any progress. Also, some Ignite APIs do 
not have async counterparts, so you must locate calls to such APIs, make sure 
they never run on Ignite threads, and offload them to dedicated thread pools.
In the end, you need to offload both the blocking calls and the async 
listeners, so you need more threads for that.
You should stress your system with, say, Gatling at every iteration to make 
sure no developer has unknowingly introduced such computational dependencies 
into the code base.
Also, you must run your load tests against a cluster, not a single instance.
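
To illustrate the offloading idea, here is a minimal sketch that uses only 
java.util.concurrent, not Ignite itself. The small cachePool stands in for an 
Ignite thread pool and callbackPool for a dedicated application pool; both 
names, the pool sizes and the ListenerOffloadSketch class are made up for this 
example. The listener makes a nested, blocking call back into cachePool, which 
is the kind of dependency that can stall when the listener itself runs on 
cachePool. In Ignite, I believe the analogous hook is IgniteFuture.listenAsync 
or chainAsync with an explicit Executor, but please check the Javadoc for your 
version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ListenerOffloadSketch {

    /**
     * Simulates `requests` async cache operations on cachePool. Each listener
     * runs on callbackPool and blocks on a nested operation that again needs
     * a cachePool thread. Returns the sum of all results.
     */
    static int run(int requests, ExecutorService cachePool, ExecutorService callbackPool) {
        List<CompletableFuture<Integer>> results = new ArrayList<>();
        for (int i = 0; i < requests; i++) {
            final int key = i;
            // Stand-in for an async cache call such as invokeAsync.
            CompletableFuture<Integer> cacheOp =
                CompletableFuture.supplyAsync(() -> key * 2, cachePool);
            // The explicit executor keeps the blocking listener off cachePool,
            // so cachePool threads stay free to serve the nested operation.
            results.add(cacheOp.thenApplyAsync(
                v -> CompletableFuture.supplyAsync(() -> v + 1, cachePool).join(),
                callbackPool));
        }
        int sum = 0;
        for (CompletableFuture<Integer> f : results) {
            sum += f.join();
        }
        return sum;
    }

    /** Runs the simulation with a tiny "cache" pool and a larger callback pool. */
    static int demo(int requests) {
        ExecutorService cachePool = Executors.newFixedThreadPool(2);
        ExecutorService callbackPool = Executors.newFixedThreadPool(8);
        try {
            return run(requests, cachePool, callbackPool);
        } finally {
            cachePool.shutdown();
            callbackPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("sum of results: " + demo(50));
    }
}
```

If you pass cachePool instead of callbackPool to thenApplyAsync, both cachePool 
threads can end up blocked in join() waiting for nested work that no free 
thread can pick up, which is the starvation pattern described above.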

The race condition was just a race condition (hidden below several abstractions 
of business logic). You can make those with or without Ignite.

Kamil

On 5 June 2019 at 18:34:28 CEST, Loredana Radulescu Ivanoff wrote:
>That sounds very useful for a "what not to do" example. Could you please
>give a little more detail (in broad strokes) on how the business code could
>starve the Ignite thread pool? And if you were using entry processors, how
>come the operations were not executed atomically, i.e. what made the race
>condition possible?
>
>Thank you.
>
>On Wed, Jun 5, 2019 at 1:10 AM kimec.ethome.sk  wrote:
>
>> Hi Ilya,
>>
>> I have tracked down this issue to racy behavior in the business code
>> and Ignite thread pool starvation caused by the application code.
>>
>> Sorry for the false alarm.
>>
>> ---
>> Best regards,
>>
>> Kamil Mišúth
>>
>> On 2019-05-22 18:46, Ilya Kasnacheev wrote:
>> > Hello!
>> >
>> > Do you have a reproducer for this behavior? Have you tried the same
>> > scenario on 2.7? I doubt anyone will make the effort to debug 2.6.
>> >
>> > Regards,
>> >
>> > --
>> >
>> > Ilya Kasnacheev
>> >
>> > On Thu, 25 Apr 2019 at 18:59, kimec.ethome.sk wrote:
>> >
>> >> Greetings,
>> >>
>> >> we've been chasing a weird issue in a two-node cluster for a few days
>> >> now. We have a Spring Boot application bundled with an Ignite server
>> >> node.
>> >>
>> >> We use invokeAsync on a TRANSACTIONAL PARTITIONED cache with 1 backup.
>> >> We assume that each node in the two-node cluster has a copy of the
>> >> other node's data. In a way, this mimics a REPLICATED cache
>> >> configuration. Our business logic is written within an EntryProcessor.
>> >> The "business code" in the EntryProcessor is idempotent and the
>> >> arguments to the processor are fixed. At the end of the "invokeAsync"
>> >> call, i.e. when the IgniteFuture is resolved, we return the value
>> >> returned from the EntryProcessor via REST to the caller of our API.
>> >>
>> >> The problem occurs when one of the two nodes is restarted (triggering
>> >> rebalancing) and we simultaneously receive a call to our REST API
>> >> launching a business computation in the EntryProcessor.
>> >> The code in the EntryProcessor properly computes a new value that we
>> >> want to store in the cache. No exception is thrown, so we pass it on to
>> >> the REST caller as a return value, but when rebalancing finishes, the
>> >> value is no longer in the cache.
>> >> Yet the caller "saw" and stored the value we returned from our
>> >> EntryProcessor.
>> >>
>> >> We did experiment with various cache settings, but the problem simply
>> >> persists. In fact, we initially used a REPLICATED cache configuration,
>> >> but the behavior was pretty much the same.
>> >>
>> >> We have currently settled on a rather extreme configuration, but the
>> >> data is still lost during rebalancing from time to time. We are using
>> >> Ignite 2.6 and Gatling for REST load testing.
>> >> The load on the REST API, and consequently on Ignite, is not very high.
>> >>
>> >> setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
>> >> setCacheMode(CacheMode.PARTITIONED)
>> >> setBackups(1)
>> >> setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
>> >> setRebalanceMode(CacheRebalanceMode.SYNC)
>> >> setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE)
>> >> setAffinity(new RendezvousAffinityFunction().setPartitions(2))
>> >>
>> >> I would appreciate any pointers 
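
For readers who want to try the same setup, the setter calls quoted above could 
be assembled roughly as follows. This is only a sketch: the cache name 
"exampleCache", the String key/value types and the ReportedCacheConfig class 
are placeholders that do not come from the original post, and the snippet needs 
ignite-core on the classpath.

```java
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.cache.CacheWriteSynchronizationMode;
import org.apache.ignite.cache.PartitionLossPolicy;
import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
import org.apache.ignite.configuration.CacheConfiguration;

final class ReportedCacheConfig {

    /** Builds a cache configuration with the settings listed in the post. */
    static CacheConfiguration<String, String> build() {
        // Cache name and key/value types are hypothetical placeholders.
        return new CacheConfiguration<String, String>("exampleCache")
            .setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL)
            .setCacheMode(CacheMode.PARTITIONED)
            .setBackups(1)
            .setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC)
            .setRebalanceMode(CacheRebalanceMode.SYNC)
            .setPartitionLossPolicy(PartitionLossPolicy.READ_WRITE_SAFE)
            .setAffinity(new RendezvousAffinityFunction().setPartitions(2));
    }
}
```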

Re: unable to query the cache after restart

2019-06-08 Thread Павлухин Иван
Hi,

My first guess would be a problem with the persistence configuration for the
cache, but I have not found any problems in the config. I suggest checking
the directory with the database files to verify that the same directory
is used across node restarts and that it contains files for the cache
in question.
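
A rough way to automate that check is sketched below. It assumes that native 
persistence keeps each cache's files in a directory named "cache-<cacheName>" 
somewhere under the storage path (by default under the work directory's db 
folder, as far as I know); please verify that layout on your installation. The 
CacheFilesCheck class and its method are made up for this example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class CacheFilesCheck {

    /**
     * Returns true if a directory named "cache-" + cacheName exists anywhere
     * under storageRoot and directly contains at least one regular file.
     */
    static boolean hasCacheFiles(Path storageRoot, String cacheName) {
        String dirName = "cache-" + cacheName; // assumed naming convention
        try (Stream<Path> paths = Files.walk(storageRoot)) {
            return paths
                .filter(Files::isRegularFile)
                .map(Path::getParent)
                .anyMatch(parent -> parent != null
                    && parent.getFileName() != null
                    && parent.getFileName().toString().equals(dirName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: CacheFilesCheck <storage-dir> <cache-name>");
            return;
        }
        System.out.println(hasCacheFiles(Paths.get(args[0]), args[1]));
    }
}
```

Running it against the same storage directory before stopping the nodes and 
after restarting them should give the same answer if the files survived.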

On Thu, 6 Jun 2019 at 19:39, goutham manchikatla wrote:
>
> Hi,
> I will work on the reproducer project.
>
> I am using version 2.7. I also tried the Java SQL API:
>
> SqlFieldsQuery sql = new SqlFieldsQuery(query);
>
> QueryCursor<List<?>> cursor = cache.query(sql);
>
>
> On Thu, Jun 6, 2019 at 10:32 AM Ilya Kasnacheev wrote:
>>
>> Hello!
>>
>> Can you make a reproducer project which will exhibit this behavior? One 
>> which will fill enough data in cache so that this behavior is observable 
>> after restart.
>>
>> BTW, what's the version you are on?
>>
>> Have you tried scan query (via Java code)?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> On Thu, 6 Jun 2019 at 19:25, goutham manchikatla wrote:
>>>
>>> http://localhost:8080/ignite?user=ignite=ignite=qryexe=Account=10=lincs_cache=select%20*%20from%20lincs.account%20LIMIT%2010
>>>
>>> Yes, I tried it with DBeaver over JDBC, using the query
>>> SELECT * FROM LINCS.ACCOUNT LIMIT 10;
>>>
>>> Still the same behavior.
>>>
>>> On Thu, Jun 6, 2019 at 10:20 AM Ilya Kasnacheev wrote:

>>>> Hello!
>>>>
>>>> What's the query in question? Have you tried using e.g. sqlline to connect
>>>> via JDBC?
>>>>
>>>> Regards,
>>>> --
>>>> Ilya Kasnacheev
>>>>
>>>>
>>>> On Thu, 6 Jun 2019 at 19:15, goutham manchikatla wrote:
>
> Hi,
>
> I reproduced the behavior. I stopped the cache nodes and started them 
> again. I see the metadata, cache count, but no query response:
>
> {"successStatus":0,"sessionToken":"94DAD112C4E848E98663AF5883BBDDE2","response":[{"cacheName":"lincs_cache","types":["Account"],"keyClasses":{"Account":"java.lang.String"},"valClasses":{"Account":"com.domain.Account"},"fields":{"Account":{"ACCOUNTNUMBER":"java.lang.String","FIRSTNAME":"java.lang.String","LASTNAME":"java.lang.String","SERVADDRLINE1":"java.lang.String","SERVADDRLINE2":"java.lang.String","SERVADDRCITY":"java.lang.String","SERVADDRSTATE":"java.lang.String","SERVADDRZIP":"java.lang.String","BILLADDRLINE1":"java.lang.String","BILLADDRLINE2":"java.lang.String","BILLADDRCITY":"java.lang.String","BILLADDRSTATE":"java.lang.String","BILLADDRZIP":"java.lang.String","BILLINGSYSTEM":"java.lang.String"}},"indexes":{"Account":[]}}],"error":null}
>
> Record count:
>
> {"successStatus":0,"affinityNodeId":null,"sessionToken":"0BBB1DA51FA243298D378D1F2D2DFE80","response":121039244,"error":null}
>
> Query Output:
>
> {"successStatus":0,"sessionToken":"69E405FB1E93472FA3F06A1312E31597","error":null,"response":{"items":[],"last":true,"fieldsMetadata":[],"queryId":6}}
>
>
> I don't see any data in the response.
>
>
> On Thu, Jun 6, 2019 at 9:50 AM Ilya Kasnacheev wrote:
>>
>> Hello!
>>
>> Looks OK. Can you reproduce the behavior, or is it a one-time 
>> occurrence? What happens if you try to scan that cache? Anything 
>> suspicious in your logs?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> On Thu, 6 Jun 2019 at 18:30, goutham manchikatla wrote:
>>>
>>> Hi,
>>>
>>> I didn't change any code between restarts. Below is the configuration.
>>>
>>> [XML bean configuration; the markup did not survive, only these
>>> attribute fragments remain, in their original order:]
>>> class="org.apache.ignite.configuration.DataStorageConfiguration"
>>> class="org.apache.ignite.configuration.DataRegionConfiguration"
>>> class="org.apache.ignite.configuration.DataRegionConfiguration"
>>> value="true"
>>> value="RANDOM_LRU"
>>> value="#{1024L * 1024 * 1024}"
>>> class="org.apache.ignite.configuration.CacheConfiguration"
>>> value="500MB_Region"
>>>