One update to this thread: I realized that the redistribution from 2 nodes with ~50K keys each to 4 nodes with ~25K keys each was happening because I was not enforcing client mode on the Spark worker side. However, my question still stands:
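For reference, a minimal sketch of how client mode can be enforced via the Spring XML config. This assumes a hypothetical separate copy of example-shared-rdd.xml used only by the Spark side, so the standalone server nodes keep the default clientMode=false:

```xml
<!-- Hypothetical client-side variant of example-shared-rdd.xml, used only by
     the Spark workers. The standalone Ignite servers keep their own config. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Join the cluster as a client node: no cache data is stored here,
         so the Spark-side JVMs no longer become extra data-holding nodes. -->
    <property name="clientMode" value="true"/>
</bean>
```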
Does Ignite use shared memory (shmem) to manage the Shared RDD? Can I set up Ignite servers to share a dataset/in-memory cache using shared memory?

Sincerely,
Umur

UmurD wrote
> Val,
>
> I would like to make one correction. Data could also be shared with Linux
> shared memory (like shm). It does not have to be through copy-on-write
> with read-only mapped pages. A shared dataset in shared memory across
> different processes also fits my use case.
>
> Sincerely,
> Umur
> UmurD wrote
>> Hi Val,
>>
>> Thanks for the quick response.
>>
>> I am referring to how virtual and physical memory work.
>>
>> For more background, when a process is launched, it is allocated a
>> virtual address space. This virtual memory has a translation to the
>> physical memory on the machine. The pages allocated to a process have
>> different permissions (read vs. read-write); some of them are exclusively
>> mapped to the process they are assigned to, while others are shared.
>>
>> A good example of shared physical pages is a library (it does not have to
>> be a library; I'm only providing that as an example). If I launch two
>> identical processes on the same machine, the shared libraries used by
>> these processes will have the same physical address (after translating
>> from virtual to physical addresses). This is because the library might be
>> read-only, and there is no need for two copies of the same library if it
>> is only being read. The processes will not get their own copies until
>> they attempt to write to the shared page. When they do, this incurs a
>> page fault and the writing process is allocated its own (exclusive) copy
>> of the previously shared page for modification. This is called
>> Copy-On-Write (CoW).
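As background to the quoted discussion: JVM heap is private anonymous memory, so two Ignite JVMs will never share heap pages; the standard mechanism by which two processes can point at the same physical pages is a shared, file-backed mapping (e.g. a file on /dev/shm mapped MAP_SHARED). A minimal sketch of that mechanism in Java, using two independent mappings of one file (class name and file path are illustrative, not from the thread):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SharedMappingDemo {
    // Map the same file region twice (as two separate processes would) and
    // show that a write through one mapping is immediately visible through
    // the other: both mappings resolve to the same physical page-cache
    // pages, which is the MAP_SHARED behavior the thread asks about.
    static int demo() throws Exception {
        // A tmpfs path such as /dev/shm would keep this purely in memory.
        File f = File.createTempFile("shared-page", ".bin");
        f.deleteOnExit();
        try (RandomAccessFile a = new RandomAccessFile(f, "rw");
             RandomAccessFile b = new RandomAccessFile(f, "rw")) {
            MappedByteBuffer m1 = a.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            MappedByteBuffer m2 = b.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            m1.putInt(0, 42);     // write through the first mapping
            return m2.getInt(0);  // read it back through the second mapping
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(demo()); // prints 42
    }
}
```

The same visibility holds across JVMs mapping a common file. Ignite, by contrast, keeps each node's cache data in that node's own private heap/off-heap memory, which would be consistent with the differing per-process page maps observed later in the thread.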
>>
>> The case I am looking for specifically: when I launch 2 processes (say
>> Ignite, for the sake of the example) and load up a dataset to be shared,
>> I want these 2 processes to point to the same physical memory for the
>> shared dataset (until one of them tries to modify it, of course). In
>> other words, I want the loaded dataset to have the same physical address
>> translation from their respective virtual addresses. That is what I'm
>> referring to when I talk about identical physical page mappings.
>>
>> This is for a research project I am conducting, so performance and
>> functionality are unimportant. The physical mapping is the only critical
>> component.
>>
>> Sincerely,
>> Umur
>> vkulichenko wrote
>>> Umur,
>>>
>>> When you talk about "physical page mappings", what exactly are you
>>> referring to? Can you please elaborate a bit more on what you're trying
>>> to achieve, and why? What is the issue you're trying to solve?
>>>
>>> -Val
>>> UmurD wrote
>>>> Hello Apache Ignite Community,
>>>>
>>>> I am currently working with Ignite and Spark; I'm specifically
>>>> interested in the Shared RDD functionality. I have a few questions and
>>>> hope I can find answers here.
>>>>
>>>> Goal:
>>>> I am trying to have a single physical page with multiple sharers
>>>> (multiple processes mapping to the same physical page number) for a
>>>> dataset. Is this achievable with Apache Ignite?
>>>>
>>>> Specifications:
>>>> This is all running on Ubuntu 14.04 on an x86-64 machine, with
>>>> Ignite 2.3.0.
>>>>
>>>> I will first introduce the simpler case using only Apache Ignite, and
>>>> then talk about integration and data sharing with Spark. I appreciate
>>>> the assistance.
>>>>
>>>> IGNITE NODES ONLY
>>>> Approach:
>>>> I am trying to utilize the Shared RDD of Ignite.
>>>> Since I also need my data to persist after the Spark processes exit, I
>>>> am deploying the Ignite cluster independently with the following
>>>> command and config:
>>>>
>>>> '$IGNITE_HOME/bin/ignite.sh
>>>> $IGNITE_HOME/examples/config/spark/example-shared-rdd.xml'
>>>>
>>>> I populate the Ignite nodes using:
>>>>
>>>> 'mvn exec:java
>>>> -Dexec.mainClass=org.apache.ignite.examples.spark.SharedRDDExample'
>>>>
>>>> I modified this file to only populate the SharedRDD cache (partitioned)
>>>> with 100,000 <int,int> pairs.
>>>>
>>>> Finally, I observe the status of the Ignite cluster using:
>>>>
>>>> '$IGNITE_HOME/bin/ignitevisorcmd.sh'
>>>>
>>>> Results:
>>>> I can confirm that I have on average 50,000 <int,int> pairs per node,
>>>> totaling 100,000 key-value pairs. The memory usage of my Ignite nodes
>>>> also increases, confirming the populated RDD. However, when I compare
>>>> the page maps of the two Ignite nodes, I see that they are oblivious to
>>>> each other's memory space and have different physical page mappings. Is
>>>> it possible to set up the Ignite nodes so that the nodes holding the
>>>> Shared RDD caches share the datasets through a single set of physical
>>>> page mappings, without duplication?
>>>>
>>>> SHARING AND INTEGRATION WITH SPARK (a more specific use case)
>>>> Approach:
>>>> In addition to the Ignite node deployment I mentioned earlier (2 Ignite
>>>> nodes with example-shared-rdd.xml, populated using the
>>>> SharedRDDExample), I also try the Shared RDD with Spark. I deploy the
>>>> master with '$SPARK_HOME/sbin/start-master.sh', and workers are started
>>>> with '$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker
>>>> spark://master_host:master_port'.
>>>>
>>>> Here, I am trying to achieve a setup where I have multiple Spark
>>>> workers that all share a dataset.
>>>> More specifically, I need the multiple Spark workers/processes to be
>>>> pointing at the same physical page mappings on startup (before
>>>> writing). I first get into a spark-shell with the following command:
>>>>
>>>> '$SPARK_HOME/bin/spark-shell
>>>> --packages org.apache.ignite:ignite-spark:2.3.0
>>>> --master spark://master_host:master_port
>>>> --repositories http://repo.maven.apache.org/maven2/org/apache/ignite'
>>>>
>>>> When in the shell, I run the following Scala code:
>>>>
>>>> import org.apache.ignite.spark._
>>>> import org.apache.ignite.configuration._
>>>>
>>>> // The same configuration as the Ignite nodes
>>>> val ic = new IgniteContext(sc,
>>>>   "examples/config/spark/example-shared-rdd.xml")
>>>> // The cache I have in the config is named sharedRDD
>>>> val sharedRDD = ic.fromCache[Integer, Integer]("sharedRDD")
>>>>
>>>> When I observe the Ignite cluster *before* doing any read/write
>>>> operations on the Spark end, I see the 2 nodes I started with about
>>>> 50,000 key-value pairs each. After running:
>>>>
>>>> sharedRDD.filter(_._2 > 50000).count // a read-and-count operation
>>>>
>>>> I observe that I now have *4* nodes with about 25,000 key-value pairs
>>>> each. 2 of these nodes are the Ignite nodes I deployed standalone, and
>>>> the other 2 are launched from the IgniteContext in the Spark processes.
>>>> This leads to different datasets in each process, and the different
>>>> page mappings fail to achieve what I need.
>>>>
>>>> In both cases (Ignite nodes only, and Ignite+Spark), I observe
>>>> different physical page mappings. While the dataset appears shared to
>>>> the outside world, it is not truly shared at the page level. The nodes
>>>> seem to be getting their own private sets of key-value pairs which are
>>>> served to requesters, and an illusion of sharing is presented to
>>>> clients.
>>>>
>>>> Is my understanding correct?
>>>> If I am incorrect, how should I approach the
>>>> shared-dataset-multiple-processes setup with the same physical page
>>>> mappings using Ignite and SharedRDD (and Spark)?
>>>>
>>>> Please let me know if you have any questions.
>>>>
>>>> Sincerely,
>>>> Umur Darbaz
>>>> University of Illinois at Urbana-Champaign, Graduate Researcher
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
