Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD

2018-01-30 Thread vkulichenko
Umur,

No, it doesn't use shared memory, and I doubt what you describe is even
possible. However, I'm still not sure I understand the purpose of all this.
What is your ultimate goal here?

-Val



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD

2018-01-26 Thread UmurD
One update to this thread: I realized that the redistribution from 2 nodes
with 50K keys each to 4 nodes with 25K keys each was happening because I was
not enforcing client mode on the Spark worker side. However, my question
still stands:

Does Ignite use shared memory (shmem) to manage the Shared RDD? Can I set up
Ignite servers so that a shared dataset or in-memory cache is kept in shared
memory?

Sincerely,
Umur



Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD

2018-01-25 Thread UmurD
Val,

I would like to make one correction. Data could also be shared with Linux
shared memory (like shm). It does not have to be through copy-on-write with
read-only mapped pages. A shared dataset in shared memory across different
processes also fits my use case.
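
As a rough illustration of the shared-memory variant described above (not
something Ignite does, per the rest of this thread), here is a minimal Scala
sketch using plain JVM file mappings. The assumptions are mine: two mappings
inside one JVM stand in for two processes, the object and file names are
hypothetical, and a temporary file is used where a real setup might use a
file under /dev/shm (tmpfs, normally kept in RAM on Linux). A READ_WRITE
mapping is shared, so a write through one mapping is visible through the
other.

```scala
import java.nio.MappedByteBuffer
import java.nio.channels.FileChannel
import java.nio.channels.FileChannel.MapMode
import java.nio.file.{Files, Path, StandardOpenOption}

object SharedMappingSketch {
  val Size = 4096 // one typical x86-64 page

  // Map the first Size bytes of the file as a shared, writable mapping.
  // The mapping stays valid after the channel is closed.
  def mapShared(path: Path): MappedByteBuffer = {
    val ch = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE)
    try ch.map(MapMode.READ_WRITE, 0, Size)
    finally ch.close()
  }

  // Write through one mapping and read the value back through another.
  def roundTrip(): Int = {
    // Hypothetical backing file; a file under /dev/shm would serve the same role.
    val path = Files.createTempFile("shared-dataset", ".bin")
    try {
      Files.write(path, new Array[Byte](Size)) // pre-size the file with zeros
      val writer = mapShared(path) // stands in for process A
      val reader = mapShared(path) // stands in for process B
      writer.putInt(0, 42)
      reader.getInt(0)
    } finally Files.deleteIfExists(path)
  }

  def main(args: Array[String]): Unit =
    println(roundTrip()) // prints 42
}
```

The sketch only shows the sharing semantics; it does not inspect physical
page frame numbers, which would still need to be checked separately.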

Sincerely,
Umur





Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD

2018-01-25 Thread UmurD
Hi Val,

Thanks for the quick response.

I am referring to how virtual and physical memory work.

For more background, when a process is launched, it is allocated a virtual
address space. This virtual memory has a translation to the physical memory
on your computer. The pages allocated to processes have different
permissions (read-only vs. read-write), and some of them are mapped
exclusively to the process they are assigned to, while others are shared.

A good example of shared physical pages is, say, a shared library (it does
not have to be a library; I'm only providing that as an example). If I
launch two identical processes on the same machine, the shared libraries
used by these processes will have the same physical address (after
translating from virtual to physical addresses). This is because the library
might be read-only, and there is no need for two copies of the same library
if it is only being read. The processes will not get their own copy until
they attempt to write to the shared page. When they do, this will incur a
page fault and the process will be allocated its own (exclusive) copy of
the previously shared page for modification. This is called Copy-On-Write
(CoW).
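
The copy-on-write behaviour described above can be observed from the JVM with
a private file mapping. The following is a minimal Scala sketch under my own
assumptions: the object and file names are hypothetical, and two PRIVATE
mappings in one JVM stand in for two processes. Both mappings initially read
the same file contents; a write through one of them lands in a private copy
and is not seen through the other.

```scala
import java.nio.channels.FileChannel
import java.nio.channels.FileChannel.MapMode
import java.nio.file.{Files, StandardOpenOption}

object CopyOnWriteSketch {
  // Returns (a before, b before, a after, b after) for the first byte.
  def run(): Seq[Int] = {
    val path = Files.createTempFile("cow-demo", ".bin") // hypothetical file
    try {
      Files.write(path, Array[Byte](1, 2, 3, 4))
      // A PRIVATE mapping requires a channel opened for reading and writing.
      val ch = FileChannel.open(path, StandardOpenOption.READ, StandardOpenOption.WRITE)
      try {
        val a = ch.map(MapMode.PRIVATE, 0, 4) // stands in for process A
        val b = ch.map(MapMode.PRIVATE, 0, 4) // stands in for process B
        val before = Seq(a.get(0).toInt, b.get(0).toInt)
        a.put(0, 99.toByte) // the write goes to a private copy of the page
        before ++ Seq(a.get(0).toInt, b.get(0).toInt)
      } finally ch.close()
    } finally Files.deleteIfExists(path)
  }

  def main(args: Array[String]): Unit =
    println(run().mkString(", ")) // prints 1, 1, 99, 1
}
```

This only demonstrates the visibility rules of a private mapping; confirming
that the untouched pages really share one physical frame would still require
looking at the page maps.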

The case I am looking for specifically is when I launch 2 processes (say
Ignite for the sake of the example), and load up a dataset to be shared, I
want these 2 processes to point to the same physical memory space for the
shared dataset (until one of them tries to modify it, of course). In other
words, I want the loaded dataset to have the same physical address
translation from their respective virtual addresses. That is what I'm
referring to when I talk about identical physical page mappings.

This is for a research project I am conducting, so performance or
functionality is unimportant. The physical mapping is the only critical
component.

Sincerely,
Umur







Re: Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD

2018-01-25 Thread vkulichenko
Umur,

When you talk about "physical page mappings", what exactly are you referring
to? Can you please elaborate a bit more on what you're trying to achieve and
why? What is the issue you're trying to solve?

-Val





Sharing Dataset Across Multiple Ignite Processes with Same Physical Page Mappings, SharedRDD

2018-01-25 Thread UmurD
Hello Apache Ignite Community,

I am currently working with Ignite and Spark; I'm specifically interested in
the Shared RDD functionality. I have a few questions and hope I can find
answers here.

Goal:
I am trying to have a single physical page with multiple sharers (multiple
processes map to the same physical page number) on a dataset. Is this
achievable with Apache Ignite?

Specifications:
This is all running on Ubuntu 14.04 on an x86-64 machine, with Ignite-2.3.0.

I will first introduce the simpler case using only Apache Ignite, and then
talk about integration and data sharing with Spark. I appreciate the
assistance.

IGNITE NODES ONLY
Approach:
I am trying to utilize the Shared RDD of Ignite. Since I also need my data
to persist after the Spark processes exit, I am deploying the Ignite cluster
independently with the following command and config:

'$IGNITE_HOME/bin/ignite.sh $IGNITE_HOME/examples/config/spark/example-shared-rdd.xml'.

I populate the Ignite nodes using:

'mvn exec:java -Dexec.mainClass=org.apache.ignite.examples.spark.SharedRDDExample'.
I modified this file to populate only the SharedRDD cache (partitioned) with
100,000 <int,int> pairs.

Finally, I observe the status of the Ignite cluster using:

'$IGNITE_HOME/bin/ignitevisorcmd.sh'

Results:
I can confirm that I have an average of 50,000 <int,int> pairs per node,
totaling 100,000 key,value pairs. The memory usage of my Ignite nodes also
increases, confirming the populated RDD. However, when I compare the page
maps of both Ignite nodes, I see that they are oblivious to each other's
memory space and have different physical page mappings. Is it possible for
me to set Ignite nodes up so that the nodes with the Shared RDD caches share
the datasets with single physical page mappings without duplicating?

SHARING AND INTEGRATION WITH SPARK (A more specific use case)
Approach:

In addition to the Ignite node deployment I mentioned earlier (2 Ignite
nodes with example-shared-rdd, populated using the SharedRDDExample), I also
try the Shared RDD with Spark. I deploy the master with
'$SPARK_HOME/sbin/start-master.sh', and workers are started with
'$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker
spark://master_host:master_port'

Here, I am trying to achieve a setup where I have multiple spark workers
that all share a dataset. More specifically, I need the multiple spark
workers/processes to be pointing at the same Physical Page Mappings on
startup (before writing). I first get in a spark-shell with the following
command:

'$SPARK_HOME/bin/spark-shell \
  --packages org.apache.ignite:ignite-spark:2.3.0 \
  --master spark://master_host:master_port \
  --repositories http://repo.maven.apache.org/maven2/org/apache/ignite'

[When in the shell, I run the following Scala code]:

import org.apache.ignite.spark._
import org.apache.ignite.configuration._

// This is the same configuration as the Ignite nodes.
val ic = new IgniteContext(sc, "examples/config/spark/example-shared-rdd.xml")
// The cache I have in the config is named sharedRDD.
val sharedRDD = ic.fromCache[Integer, Integer]("sharedRDD")

When I observe the Ignite cluster *before* doing any read/write operations
on the spark end, I see the 2 nodes I started up with about 50,000 key,value
pairs each. After running:

sharedRDD.filter(_._2 > 5).count // Which should be a read-and-count operation?

I observe that I now have *4* nodes with about 25,000 key,value pairs each.
2 of these nodes are the Ignite nodes I deployed standalone, and the other 2
are launched from the context in the Spark processes. This leads to
different datasets in each process and different page mappings, which fails
to achieve what I need.

In both cases (Ignite Nodes only, and Ignite+Spark), I observe different
physical page mappings. While the dataset seems shared to the outside world,
it is not truly shared at the page level. The nodes seem to be getting their
own set of private key,value pairs which are served to requesters, and a
sharing illusion is given to clients.

Is my understanding correct? If I am incorrect, how should I approach the
shared-dataset-multiple-processes setup with the same physical page mapping
using Ignite and SharedRDD (and Spark)?

Please let me know if you have any questions.

Sincerely,
Umur Darbaz
University of Illinois at Urbana-Champaign, Graduate Researcher


