Hello Apache Ignite Community,

I am currently working with Ignite and Spark; I'm specifically interested in
the Shared RDD functionality. I have a few questions and hope I can find
answers here.

Goal:
I am trying to have multiple processes share a single physical copy of a
dataset (i.e., the processes' virtual mappings resolve to the same physical
page frames). Is this achievable with Apache Ignite?

Specifications:
This is all running on Ubuntu 14.04 on an x86-64 machine, with Ignite 2.3.0.

I will first introduce the simpler case using only Apache Ignite, and then
talk about integration and data sharing with Spark. I appreciate the
assistance.

IGNITE NODES ONLY
Approach:
I am trying to utilize Ignite's Shared RDD. Since I also need my data
to persist after the Spark processes exit, I am deploying the Ignite cluster
independently with the following command and config:

'$IGNITE_HOME/bin/ignite.sh
$IGNITE_HOME/examples/config/spark/example-shared-rdd.xml'. 

I populate the Ignite nodes using:

'mvn exec:java
-Dexec.mainClass=org.apache.ignite.examples.spark.SharedRDDExample'. I
modified this example to populate only the SharedRDD cache (partitioned)
with 100,000 <int,int> pairs.
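For reference, the population logic in my modified example boils down to the
following sketch (the cache name and XML path follow the shipped example;
the 100,000 bound is my modification, and this of course needs the
standalone Ignite nodes running):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.ignite.spark.IgniteContext

// Spark context for the example driver.
val sc = new SparkContext(
  new SparkConf().setAppName("SharedRDDPopulate").setMaster("local"))

// IgniteContext pointing at the same XML config the standalone nodes use.
val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")

// The partitioned cache named "sharedRDD" in the config.
val sharedRDD = ic.fromCache[Integer, Integer]("sharedRDD")

// Populate 100,000 <int,int> pairs (identity values) over 10 partitions.
sharedRDD.savePairs(sc.parallelize(1 to 100000, 10).map(i => (i, i)))
```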

Finally, I observe the status of the Ignite cluster using:

'$IGNITE_HOME/bin/ignitevisorcmd.sh'

Results:
I can confirm an average of 50,000 <int,int> pairs per node, totaling
100,000 key-value pairs. The memory usage of my Ignite nodes also increases,
confirming that the RDD is populated. However, when I compare the page maps
of the two Ignite nodes, I see that they are oblivious to each other's
memory space and have different physical page mappings. Is it possible to
set up the Ignite nodes so that the nodes holding the Shared RDD caches
share the dataset through a single set of physical page mappings, without
duplication?
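For context, this is roughly how I compare the mappings (a sketch; the
`pgrep` pattern assumes ignite.sh launches the standard CommandLineStartup
main class, and note that /proc/<pid>/maps only shows virtual ranges --
confirming the physical frame numbers requires reading /proc/<pid>/pagemap
as root):

```shell
# Find the pids of the two standalone Ignite JVMs.
pids=$(pgrep -f 'org.apache.ignite.startup.cmdline.CommandLineStartup')

# Print the anonymous mappings of each node; differing layouts here already
# suggest private (non-shared) backing pages for the cache data.
for pid in $pids; do
  echo "=== Ignite node pid $pid ==="
  awk '$6 == "" { print $1, $2 }' "/proc/$pid/maps" | head -n 20
done
```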

SHARING AND INTEGRATION WITH SPARK (A more specific use case)
Approach:

In addition to the Ignite node deployment I described above (2 Ignite
nodes with example-shared-rdd.xml, populated using SharedRDDExample), I also
try the Shared RDD with Spark. I start the master with
'$SPARK_HOME/sbin/start-master.sh', and the workers with
'$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker
spark://master_host:master_port'

Here, I am trying to achieve a setup where multiple Spark workers all share
a dataset. More specifically, I need the multiple Spark workers/processes to
point at the same physical page mappings on startup (before writing). I
first open a spark-shell with the following command:

'$SPARK_HOME/bin/spark-shell
  --packages org.apache.ignite:ignite-spark:2.3.0
  --master spark://master_host:master_port
  --repositories http://repo.maven.apache.org/maven2/org/apache/ignite'

[When in the shell, I run the following Scala code]:

import org.apache.ignite.spark._
import org.apache.ignite.configuration._

// Same configuration file as the standalone Ignite nodes
val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
// The cache defined in the config is named "sharedRDD"
val sharedRDD = ic.fromCache[Integer, Integer]("sharedRDD")

When I observe the Ignite cluster *before* doing any read/write operations
on the Spark end, I see the 2 nodes I started, with about 50,000 key-value
pairs each. After running:

sharedRDD.filter(_._2 > 50000).count // a read followed by a count, as I
understand it

I observe that I now have *4* nodes with about 25,000 key-value pairs each.
2 of these nodes are the Ignite nodes I deployed standalone, and the other 2
are launched by the IgniteContext inside the Spark processes. This results
in a different dataset in each process, and with different page mappings in
each, it fails to achieve what I need.

In both cases (Ignite nodes only, and Ignite+Spark), I observe different
physical page mappings. While the dataset appears shared to the outside
world, it is not truly shared at the page level. Each node seems to hold its
own private set of key-value pairs, which it serves to requesters, giving
clients an illusion of sharing.

Is my understanding correct? If I am incorrect, how should I approach the
shared-dataset-multiple-processes setup with the same physical page mapping
using Ignite and SharedRDD (and Spark)?

Please let me know if you have any questions.

Sincerely,
Umur Darbaz
University of Illinois at Urbana-Champaign, Graduate Researcher


