Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD
Umur, No, it doesn't use shared memory, and I doubt that what you describe is even possible. However, I'm still not sure I understand the purpose of all this. What is your ultimate goal here? -Val -- Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD
One update to this thread: I realized that the 2 nodes-50K keys to 4 nodes-25K redistribution was happening because I was not enforcing client mode at the spark worker side. However, my question still stands: Does Ignite use shared memory (shmem) to manage the Shared RDD? Can I set up Ignite servers to share a dataset/in memory cache to use shared memory? Sincerely, Umur
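The redistribution mentioned in this update (2 nodes with about 50K keys each becoming 4 nodes with about 25K each) follows from simple arithmetic once the Spark-side nodes join as data-holding servers rather than clients. Below is a minimal sketch of that arithmetic. The modulo placement and the function name `keys_per_node` are assumptions made for illustration only; Ignite's actual affinity function works differently and only approximates an even split.

```python
from collections import Counter


def keys_per_node(num_keys, num_server_nodes):
    """Count keys per node under a simple modulo placement.

    Hypothetical placement rule, used only to show the arithmetic;
    it is not how Ignite assigns keys to nodes.
    """
    counts = Counter(key % num_server_nodes for key in range(num_keys))
    return [counts[node] for node in range(num_server_nodes)]


if __name__ == "__main__":
    # Two standalone server nodes hold the whole dataset.
    print(keys_per_node(100_000, 2))
    # Two more nodes join as servers (not clients), so the data is rebalanced.
    print(keys_per_node(100_000, 4))
```

Nodes that join in client mode hold no cache data, which is why the count per server stays at roughly 50K once client mode is enforced on the Spark side.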
Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD
Val, I would like to make one correction. The data could also be shared through Linux shared memory (such as shm). It does not have to be through copy-on-write with read-only mapped pages. A shared dataset in shared memory across different processes also fits my use case. Sincerely, Umur
Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD
Hi Val,

Thanks for the quick response.

I am referring to how virtual and physical memory work.

For more background: when a process is launched, it is allocated a virtual address space. This virtual memory is translated to the physical memory on your computer. The pages allocated to a process have different permissions (read-only vs. read-write); some of them are mapped exclusively to that process, while others are shared.

A good example of shared physical pages is a library (it does not have to be a library; I'm only providing that as an example). If I launch two identical processes on the same machine, the shared libraries used by these processes will have the same physical addresses (after translating from virtual to physical addresses). This is because the library pages are only being read, so there is no need for two copies of them. A process does not get its own copy until it attempts to write to the shared page. When it does, the write incurs a page fault and the process is allocated its own (exclusive) copy of the previously shared page for modification. This is called Copy-On-Write (CoW).

The case I am looking for specifically is this: when I launch 2 processes (say Ignite, for the sake of the example) and load up a dataset to be shared, I want these 2 processes to point to the same physical memory for the shared dataset (until one of them tries to modify it, of course). In other words, I want the loaded dataset to have the same physical address translation from the processes' respective virtual addresses. That is what I'm referring to when I talk about identical physical page mappings.

This is for a research project I am conducting, so performance and functionality are unimportant. The physical mapping is the only critical component.

Sincerely,
Umur
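The virtual-to-physical translation described in this message can be inspected on Linux through `/proc/<pid>/pagemap`, which holds one 64-bit entry per virtual page. The sketch below decodes such an entry, assuming the documented layout (bit 63 = page present, bit 62 = swapped, bits 0-54 = page frame number when present). The helper names are made up for illustration. Reading another process's pagemap needs suitable privileges, and newer kernels report the frame number as zero to unprivileged readers.

```python
import struct

PFN_MASK = (1 << 55) - 1   # bits 0-54 hold the page frame number
ENTRY_SIZE = 8             # each pagemap entry is 64 bits


def decode_pagemap_entry(entry):
    """Split a raw 64-bit pagemap entry into the fields used here."""
    present = bool((entry >> 63) & 1)
    swapped = bool((entry >> 62) & 1)
    pfn = (entry & PFN_MASK) if present else None
    return {"present": present, "swapped": swapped, "pfn": pfn}


def pagemap_offset(vaddr, page_size=4096):
    """Byte offset of the entry describing the page that contains vaddr."""
    return (vaddr // page_size) * ENTRY_SIZE


def read_pagemap_entry(pid, vaddr, page_size=4096):
    """Read and decode the pagemap entry for vaddr in process pid (Linux only)."""
    with open(f"/proc/{pid}/pagemap", "rb") as pagemap:
        pagemap.seek(pagemap_offset(vaddr, page_size))
        raw = pagemap.read(ENTRY_SIZE)
    if len(raw) != ENTRY_SIZE:
        raise ValueError("address is outside the readable pagemap range")
    # "=Q" is a native-endian unsigned 64-bit integer.
    return decode_pagemap_entry(struct.unpack("=Q", raw)[0])
```

Two processes share a physical page when the entries for their respective virtual addresses are both present and carry the same frame number.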
Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD
Umur, When you talk about "physical page mappings", what exactly are you referring to? Can you please elaborate a bit more on what you're trying to achieve and why? What is the issue you're trying to solve? -Val
Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD
Hello Apache Ignite Community,

I am currently working with Ignite and Spark; I'm specifically interested in the Shared RDD functionality. I have a few questions and hope I can find answers here.

Goal: I am trying to have a single physical page with multiple sharers (multiple processes mapping to the same physical page number) on a dataset. Is this achievable with Apache Ignite?

Specifications: This is all running on Ubuntu 14.04 on an x86-64 machine, with Ignite-2.3.0.

I will first introduce the simpler case using only Apache Ignite, and then talk about integration and data sharing with Spark. I appreciate the assistance.

IGNITE NODES ONLY

Approach: I am trying to utilize the Shared RDD of Ignite. Since I also need my data to persist after the Spark processes exit, I am deploying the Ignite cluster independently with the following command and config: '$IGNITE_HOME/bin/ignite.sh $IGNITE_HOME/examples/config/spark/example-shared-rdd.xml'. I populate the Ignite nodes using: 'mvn exec:java -Dexec.mainClass=org.apache.ignite.examples.spark.SharedRDDExample'. I modified this example to only populate the SharedRDD cache (partitioned) with 100,000 int,int pairs. Finally, I observe the status of the Ignite cluster using: '$IGNITE_HOME/bin/ignitevisorcmd.sh'

Results: I can confirm that I have on average 50,000 int,int pairs per node, totaling 100,000 key,value pairs. The memory usage of my Ignite nodes also increases, confirming that the RDD is populated. However, when I compare the page maps of both Ignite nodes, I see that they are oblivious to each other's memory space and have different physical page mappings.

Is it possible for me to set Ignite nodes up so that the nodes with the Shared RDD caches share the datasets with single physical page mappings, without duplicating?
SHARING AND INTEGRATION WITH SPARK (A more specific use case)

Approach: In addition to the Ignite node deployment I mentioned earlier (2 Ignite nodes with example-shared-rdd, populated using the SharedRDDExample), I also try the Shared RDD with Spark. I deploy the master with '$SPARK_HOME/sbin/start-master.sh', and workers are started with '$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master_host:master_port'

Here, I am trying to achieve a setup where I have multiple Spark workers that all share a dataset. More specifically, I need the multiple Spark workers/processes to be pointing at the same physical page mappings on startup (before writing).

I first start a spark-shell with the following command: '$SPARK_HOME/bin/spark-shell --packages org.apache.ignite:ignite-spark:2.3.0 --master spark://master_host:master_port --repositories http://repo.maven.apache.org/maven2/org/apache/ignite'

[When in the shell, I run the following scala code]:

    import org.apache.ignite.spark._
    import org.apache.ignite.configuration._
    // This is the same configuration as the Ignite nodes
    val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
    // The cache I have in the config is named sharedRDD.
    val sharedRDD = ic.fromCache[Integer, Integer]("sharedRDD")

When I observe the Ignite cluster *before* doing any read/write operations on the Spark end, I see the 2 nodes I started up, with about 50,000 key,value pairs each. After running:

    // Which should be a read and count command?
    sharedRDD.filter(_._2 > 5).count

I observe that I now have *4* nodes with about 25,000 key,value pairs each. 2 of these nodes are the Ignite nodes I deployed standalone, and the other 2 are launched from the context in the Spark processes. This leads to a different dataset in each process, and the different page mappings fail to achieve what I need.

In both cases (Ignite nodes only, and Ignite+Spark), I observe different physical page mappings.
While the dataset seems shared to the outside world, it is not truly shared at the page level. The nodes seem to be getting their own sets of private key,value pairs, which are served to requesters, and clients are given an illusion of sharing. Is my understanding correct?

If I am incorrect, how should I approach the shared-dataset-multiple-processes setup with the same physical page mapping using Ignite and SharedRDD (and Spark)?

Please let me know if you have any questions.

Sincerely,
Umur Darbaz
University of Illinois at Urbana-Champaign, Graduate Researcher