One update to this thread: I realized that the redistribution from 2 nodes with ~50K keys each to 4 nodes with ~25K keys each was happening because I was not enforcing client mode on the Spark worker side. However, my question still stands:
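For reference, a minimal sketch of how client mode can be enforced via the Spring XML config. This assumes a hypothetical separate copy of example-shared-rdd.xml used only by the Spark side, so the standalone server nodes keep the default clientMode=false:

```xml
<!-- Hypothetical client-side variant of example-shared-rdd.xml, used only by
     the Spark workers. The standalone Ignite servers keep their own config. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Join the cluster as a client node: no cache data is stored here,
         so the Spark-side JVMs no longer become extra data-holding nodes. -->
    <property name="clientMode" value="true"/>
</bean>
```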
Does Ignite use shared memory (shmem) to manage the Shared RDD? Can I set up Ignite servers to share a dataset/in-memory cache using shared memory?

Sincerely,
Umur

UmurD wrote
> Val,
>
> I would like to make one correction. Data could also be shared with Linux
> shared memory (like shm). It does not have to be through copy-on-write
> with read-only mapped pages. A shared dataset in shared memory across
> different processes also fits my use case.
>
> Sincerely,
> Umur
> UmurD wrote
>> Hi Val,
>>
>> Thanks for the quick response.
>>
>> I am referring to how virtual and physical memory work.
>>
>> For more background, when a process is launched, it is allocated a
>> virtual address space. This virtual memory has a translation to the
>> physical memory on the machine. The pages allocated to a process have
>> different permissions (read vs. read-write); some of them are exclusively
>> mapped to the process they are assigned to, while others are shared.
>>
>> A good example of shared physical pages is a library (it does not have to
>> be a library; I'm only providing that as an example). If I launch two
>> identical processes on the same machine, the shared libraries used by
>> these processes will have the same physical address (after translating
>> from virtual to physical addresses). This is because the library might be
>> read-only, and there is no need for two copies of the same library if it
>> is only being read. The processes will not get their own copies until
>> they attempt to write to the shared page. When they do, this incurs a
>> page fault and the writing process is allocated its own (exclusive) copy
>> of the previously shared page for modification. This is called
>> Copy-On-Write (CoW).
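As background to the quoted discussion: JVM heap is private anonymous memory, so two Ignite JVMs will never share heap pages; the standard mechanism by which two processes can point at the same physical pages is a shared, file-backed mapping (e.g. a file on /dev/shm mapped MAP_SHARED). A minimal sketch of that mechanism in Java, using two independent mappings of one file (class name and file path are illustrative, not from the thread):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SharedMappingDemo {
    // Map the same file region twice (as two separate processes would) and
    // show that a write through one mapping is immediately visible through
    // the other: both mappings resolve to the same physical page-cache
    // pages, which is the MAP_SHARED behavior the thread asks about.
    static int demo() throws Exception {
        // A tmpfs path such as /dev/shm would keep this purely in memory.
        File f = File.createTempFile("shared-page", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile a = new RandomAccessFile(f, "rw");
             RandomAccessFile b = new RandomAccessFile(f, "rw")) {
            MappedByteBuffer m1 = a.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            MappedByteBuffer m2 = b.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            m1.putInt(0, 42);     // write through the first mapping
            return m2.getInt(0);  // read it back through the second mapping
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints 42
    }
}
```

The same visibility holds across JVMs mapping a common file. Ignite, by contrast, keeps each node's cache data in that node's own private heap/off-heap memory, which would be consistent with the differing per-process page maps observed later in the thread.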
>>
>> The case I am looking for specifically: when I launch 2 processes (say
>> Ignite, for the sake of the example) and load up a dataset to be shared,
>> I want these 2 processes to point to the same physical memory for the
>> shared dataset (until one of them tries to modify it, of course). In
>> other words, I want the loaded dataset to have the same physical address
>> translation from their respective virtual addresses. That is what I'm
>> referring to when I talk about identical physical page mappings.
>>
>> This is for a research project I am conducting, so performance and
>> functionality are unimportant. The physical mapping is the only critical
>> component.
>>
>> Sincerely,
>> Umur
>> vkulichenko wrote
>>> Umur,
>>>
>>> When you talk about "physical page mappings", what exactly are you
>>> referring to? Can you please elaborate a bit more on what you're trying
>>> to achieve, and why? What is the issue you're trying to solve?
>>>
>>> -Val
>>> UmurD wrote
>>>> Hello Apache Ignite Community,
>>>>
>>>> I am currently working with Ignite and Spark; I'm specifically
>>>> interested in the Shared RDD functionality. I have a few questions and
>>>> hope I can find answers here.
>>>>
>>>> Goal:
>>>> I am trying to have a single physical page with multiple sharers
>>>> (multiple processes mapping to the same physical page number) for a
>>>> dataset. Is this achievable with Apache Ignite?
>>>>
>>>> Specifications:
>>>> This is all running on Ubuntu 14.04 on an x86-64 machine, with
>>>> Ignite 2.3.0.
>>>>
>>>> I will first introduce the simpler case using only Apache Ignite, and
>>>> then talk about integration and data sharing with Spark. I appreciate
>>>> the assistance.
>>>>
>>>> IGNITE NODES ONLY
>>>> Approach:
>>>> I am trying to utilize the Shared RDD of Ignite.
>>>> Since I also need my data to persist after the Spark processes exit, I
>>>> am deploying the Ignite cluster independently with the following
>>>> command and config:
>>>>
>>>> '$IGNITE_HOME/bin/ignite.sh
>>>> $IGNITE_HOME/examples/config/spark/example-shared-rdd.xml'
>>>>
>>>> I populate the Ignite nodes using:
>>>>
>>>> 'mvn exec:java
>>>> -Dexec.mainClass=org.apache.ignite.examples.spark.SharedRDDExample'
>>>>
>>>> I modified this file to only populate the SharedRDD cache (partitioned)
>>>> with 100,000 <int,int> pairs.
>>>>
>>>> Finally, I observe the status of the Ignite cluster using:
>>>>
>>>> '$IGNITE_HOME/bin/ignitevisorcmd.sh'
>>>>
>>>> Results:
>>>> I can confirm that I have on average 50,000 <int,int> pairs per node,
>>>> totaling 100,000 key-value pairs. The memory usage of my Ignite nodes
>>>> also increases, confirming the populated RDD. However, when I compare
>>>> the page maps of the two Ignite nodes, I see that they are oblivious to
>>>> each other's memory space and have different physical page mappings. Is
>>>> it possible to set up the Ignite nodes so that the nodes holding the
>>>> Shared RDD caches share the datasets through a single set of physical
>>>> page mappings, without duplication?
>>>>
>>>> SHARING AND INTEGRATION WITH SPARK (a more specific use case)
>>>> Approach:
>>>> In addition to the Ignite node deployment I mentioned earlier (2 Ignite
>>>> nodes with example-shared-rdd.xml, populated using the
>>>> SharedRDDExample), I also try the Shared RDD with Spark. I deploy the
>>>> master with '$SPARK_HOME/sbin/start-master.sh', and workers are started
>>>> with '$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker
>>>> spark://master_host:master_port'.
>>>>
>>>> Here, I am trying to achieve a setup where I have multiple Spark
>>>> workers that all share a dataset.
>>>> More specifically, I need the multiple Spark workers/processes to be
>>>> pointing at the same physical page mappings on startup (before
>>>> writing). I first get into a spark-shell with the following command:
>>>>
>>>> '$SPARK_HOME/bin/spark-shell
>>>> --packages org.apache.ignite:ignite-spark:2.3.0
>>>> --master spark://master_host:master_port
>>>> --repositories http://repo.maven.apache.org/maven2/org/apache/ignite'
>>>>
>>>> When in the shell, I run the following Scala code:
>>>>
>>>> import org.apache.ignite.spark._
>>>> import org.apache.ignite.configuration._
>>>>
>>>> // The same configuration as the Ignite nodes
>>>> val ic = new IgniteContext(sc,
>>>>   "examples/config/spark/example-shared-rdd.xml")
>>>> // The cache I have in the config is named sharedRDD
>>>> val sharedRDD = ic.fromCache[Integer, Integer]("sharedRDD")
>>>>
>>>> When I observe the Ignite cluster *before* doing any read/write
>>>> operations on the Spark end, I see the 2 nodes I started with about
>>>> 50,000 key-value pairs each. After running:
>>>>
>>>> sharedRDD.filter(_._2 > 50000).count // a read-and-count operation
>>>>
>>>> I observe that I now have *4* nodes with about 25,000 key-value pairs
>>>> each. 2 of these nodes are the Ignite nodes I deployed standalone, and
>>>> the other 2 are launched from the IgniteContext in the Spark processes.
>>>> This leads to different datasets in each process, and the different
>>>> page mappings fail to achieve what I need.
>>>>
>>>> In both cases (Ignite nodes only, and Ignite+Spark), I observe
>>>> different physical page mappings. While the dataset appears shared to
>>>> the outside world, it is not truly shared at the page level. The nodes
>>>> seem to be getting their own private sets of key-value pairs which are
>>>> served to requesters, and an illusion of sharing is presented to
>>>> clients.
>>>>
>>>> Is my understanding correct?
>>>> If I am incorrect, how should I approach the
>>>> shared-dataset-multiple-processes setup with the same physical page
>>>> mappings using Ignite and SharedRDD (and Spark)?
>>>>
>>>> Please let me know if you have any questions.
>>>>
>>>> Sincerely,
>>>> Umur Darbaz
>>>> University of Illinois at Urbana-Champaign, Graduate Researcher
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
