Re: Cannot Run Spatial Query Example

2019-11-17 Thread Humphrey
Do you have a link to the file or the example?





Re: How to perform distributed compute in similar way to Spark vector UDF

2019-11-17 Thread camer314
Reading a little more in the Java docs about AffinityKey, I am thinking that,
much like vector UDF batch sizing, one way I could easily achieve my result
is to batch my rows under shared affinity keys. That is, the affinity key
changes every 100,000 rows, for example.

So, with a batch size of 10 for illustration, cache keys [0...9] would have
affinity key 0, keys [10...19] affinity key 1, and so on?
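
Something like this is what I have in mind (just a sketch against the
Apache.Ignite.Core 2.x API; the cache name "rows" is made up):

    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Cache.Affinity;

    public static class BatchLoader
    {
        private const int BatchSize = 100000;

        public static void Load(IIgnite ignite, string[] rows)
        {
            using (var streamer = ignite.GetDataStreamer<AffinityKey, string>("rows"))
            {
                for (long rowId = 0; rowId < rows.Length; rowId++)
                {
                    // Rows [0..99999] share affinity key 0, rows [100000..199999]
                    // share affinity key 1, etc., so each batch of rows is
                    // colocated on a single primary node.
                    streamer.AddData(new AffinityKey(rowId, rowId / BatchSize), rows[rowId]);
                }
            }
        }
    }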

If that is the case, may I suggest updating the .NET documentation for
Data Grid regarding Affinity Colocation, as it does not mention the use of
AffinityKey or go into anywhere near as much detail as the Java docs.








How to perform distributed compute in similar way to Spark vector UDF

2019-11-17 Thread camer314
I asked this question on StackOverflow. However, I probably put too much
weight on Spark there.

My question really is: how can I load a large CSV file into the cache and
send compute actions to the nodes that work in a similar way to a Pandas UDF?
That is, each action works on a subset of the data (rows).

In Ignite I imagine I could load the CSV into a cache using PARTITIONED mode
and then, using affinity compute, send functions to the nodes where the data
is, so each node processes only the data that exists on it. This seems like
a nice way to go: each node always processes locally, and the results of
those actions would be added back to the cache, so presumably those writes
would stay local as well.
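
The loading half might look something like this (a sketch assuming the
Apache.Ignite.Core 2.x API; the cache name "df" and a purely numeric CSV
are assumptions):

    using System.IO;
    using System.Linq;
    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Cache.Configuration;

    public static class CsvLoader
    {
        public static void Load(IIgnite ignite, string path)
        {
            var cache = ignite.GetOrCreateCache<long, double[]>(
                new CacheConfiguration("df") { CacheMode = CacheMode.Partitioned });

            using (var streamer = ignite.GetDataStreamer<long, double[]>(cache.Name))
            {
                long id = 0;
                foreach (var line in File.ReadLines(path))
                {
                    // Each row becomes an array of doubles keyed by an incrementing
                    // ID; the default affinity function spreads keys over the grid.
                    streamer.AddData(id++, line.Split(',').Select(double.Parse).ToArray());
                }
            }
        }
    }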

However, I am not entirely sure how the partitioning works. The affinity
examples all show the use of a single key value.

Is there a way to load a CSV into a cache in PARTITIONED mode, so that Ignite
distributes it evenly across the grid, and then run a compute job on every
node that works ONLY with the data in its own cache (something like the
sketch below), so that I won't need to care about keys?
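
This is the shape I am imagining: one broadcast job per node, each scanning
only the entries it owns (again a sketch against the Apache.Ignite.Core 2.x
API; LocalScanAction is a made-up name, and I am assuming GetLocalEntries
with CachePeekMode.Primary is the right way to restrict the scan):

    using System;
    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Cache;
    using Apache.Ignite.Core.Compute;
    using Apache.Ignite.Core.Resource;

    [Serializable]
    public class LocalScanAction : IComputeAction
    {
        // Injected by Ignite on the node where the action runs.
        [InstanceResource] private IIgnite _ignite;

        public void Invoke()
        {
            var cache = _ignite.GetCache<long, double[]>("df");

            // CachePeekMode.Primary limits the scan to entries this node owns,
            // so no row is processed twice across the grid.
            foreach (var entry in cache.GetLocalEntries(CachePeekMode.Primary))
            {
                // process entry.Key / entry.Value locally
            }
        }
    }

    // Usage: ignite.GetCompute().Broadcast(new LocalScanAction());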

For example, imagine a CSV file that is a matrix of numbers. My distributed
cache would really be a dataframe representation of that file. For argument's
sake, let's say my cache is keyed by an incrementing ID, with the data being
an array of doubles and the column names being A, B, C.

That ID key is really pretty irrelevant; it is meaningless to my
application.

Now let's say I wanted to perform the same maths on every row in that
dataframe, with the results becoming a new column in the cache.

If that formula were D = A * B * C, then D would become a new column.
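
With that cache layout, the per-node body of the broadcast action sketched
above might be something like this (Put-per-row is just for simplicity; a
data streamer would batch the write-backs):

    using System.Linq;
    using Apache.Ignite.Core.Cache;

    public static class RowUdf
    {
        // Widens every locally owned [A, B, C] row to [A, B, C, D],
        // where D = A * B * C.
        public static void AddColumnD(ICache<long, double[]> cache)
        {
            foreach (var entry in cache.GetLocalEntries(CachePeekMode.Primary))
            {
                var row = entry.Value;
                cache.Put(entry.Key, row.Concat(new[] { row[0] * row[1] * row[2] }).ToArray());
            }
        }
    }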

Ignoring Spark SQL, in Spark I could easily write a UDF that creates column
D from columns [A, B, C]. Spark doesn't care about keys or ID columns in
this instance; it just gives you a vector of data and you return a vector of
results.

So in Ignite, how can I replicate that behaviour most elegantly in (.NET)
code? Is something like the sketches above the right way to send compute to
the grid that collectively processes all rows without caring about the keys?





Re: Question about memory when uploading CSV using .NET DataStreamer

2019-11-17 Thread camer314
OK, yes, I see. It seems that with the code changes I made to produce the
example, memory consumption is much more in line with expectations, so I
guess it was a coding error on my part.

However, it seems strange that my client node, which has no cache, still
wants to hang onto over 1 GB of heap space even though it is using less than
100 MB. Is there no way to release that back?
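
The only mitigation I can think of is to cap the client's JVM heap up front,
since as far as I know the JVM rarely returns committed heap to the OS once
it has grown. In Ignite.NET that would be a sketch like this (heap sizes are
purely illustrative):

    using Apache.Ignite.Core;

    public static class ClientStart
    {
        public static void Main()
        {
            var cfg = new IgniteConfiguration
            {
                ClientMode = true,
                JvmInitialMemoryMb = 128,  // maps to -Xms128m
                JvmMaxMemoryMb = 256       // maps to -Xmx256m
            };

            using (var ignite = Ignition.Start(cfg))
            {
                // client work goes here
            }
        }
    }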





Re: recoveryBallotBoxes in MvccProcessorImpl memory leak?

2019-11-17 Thread mvkarp
The only other thing I can think of, if it happens through onDiscovery(), is
that curCrd.local() is somehow returning true. However, I am unable to find
exactly how local() is determined, since there appears to be a long call chain.

I know that the leaking server is on a different physical node and has a
completely different node ID (b-b-b-b-b-b) from the MVCC coordinator's
(mvccCrd=a--a-a-a).

Is there any way that curCrd.local() could be returning true on the leaking
server's JVM? I am trying to investigate how local() is determined and what
could cause it to be true.


Ivan Pavlukhin wrote
> But currently I suspect that you faced a leak in
> MvccProcessorImpl.onDiscovery on non-MVCC coordinator nodes. Do you
> think that there is another reason in your case?







Baseline topology - Data inconsistency

2019-11-17 Thread userx
Hi team,

I went through https://apacheignite.readme.io/docs/baseline-topology, and one
thing that remained a question for me is what happens to the consistency of
data in the following case:

1) Say 5 nodes, M1, M2, M3, M4 and M5, are started as part of the cluster
with backups = 1 and persistence enabled.
2) The cluster is activated with all the nodes, and all of the nodes form the
baseline topology. The cache is partitioned.
3) Say a cache c1 is created and we call c1.put("1", 1); say the entry with
key "1" is stored on M1 and the backup is stored on M5.
4) M5 goes down, but no rebalancing happens, since M5 is still part of the
baseline topology.
5) c1.put("1", 2) happens.
6) M5 comes back and loads its data from disk; still no rebalancing, because
it was a baseline node.
7) If a client calls c1.get("1"), what value does it get?

In short, what is the importance of baseline topology when it comes to this
kind of situation?
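
For concreteness, a scaled-down two-node sketch of the sequence I mean
(against the Apache.Ignite.Core 2.x API; instance names are made up, and
whether Get returns 2 right after the rejoin is exactly my question):

    using Apache.Ignite.Core;
    using Apache.Ignite.Core.Cache.Configuration;
    using Apache.Ignite.Core.Configuration;

    public static class BaselineScenario
    {
        private static IgniteConfiguration Cfg(string name) => new IgniteConfiguration
        {
            IgniteInstanceName = name,
            DataStorageConfiguration = new DataStorageConfiguration
            {
                DefaultDataRegionConfiguration = new DataRegionConfiguration
                {
                    Name = "default",
                    PersistenceEnabled = true
                }
            }
        };

        public static void Main()
        {
            var m1 = Ignition.Start(Cfg("M1"));
            Ignition.Start(Cfg("M5"));
            m1.GetCluster().SetActive(true);  // first activation fixes the baseline

            var c1 = m1.GetOrCreateCache<string, int>(
                new CacheConfiguration("c1") { Backups = 1 });

            c1.Put("1", 1);             // primary and backup spread over M1/M5
            Ignition.Stop("M5", true);  // M5 leaves; the baseline is unchanged
            c1.Put("1", 2);             // update lands on the surviving copy
            Ignition.Start(Cfg("M5"));  // M5 rejoins and recovers from disk
            var v = c1.Get("1");        // the value in question
        }
    }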






