Sure.

1. The first diagram illustrates the data-visibility aspect of the Spark integration. Given that a cache exists on the Ignite node, Spark creates a DataFrame from the IgniteRDD and performs an action (df.show()) on it. If, concurrently, changes are made to the cache on the Ignite node (either by another Spark application or by another application using the Ignite API), the question is: would the Spark worker be able to see those changes? My understanding from our discussion so far is that the df.show() action would not display the latest changes in the cache: the underlying IgniteRDD might be updated, but the DataFrame is another layer above it.
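In case it helps to make the first scenario concrete, here is a rough Scala sketch of what I mean. The cache name "personCache", the Person type, and the local-mode configuration are my own placeholders, not from the diagram:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

// Hypothetical value type held in the Ignite cache.
case class Person(id: Int, name: String)

object VisibilitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("visibility").getOrCreate()
    import spark.implicits._

    // IgniteContext takes a closure producing the node configuration.
    val ic = new IgniteContext(spark.sparkContext, () => new IgniteConfiguration())

    // IgniteRDD is a view over the cache ("personCache" is a placeholder name).
    val personRdd = ic.fromCache[Int, Person]("personCache")

    // Building the DataFrame adds another layer on top of the IgniteRDD.
    val df = personRdd.map(_._2).toDF()

    // <-- another client updates "personCache" via the Ignite API here -->

    // The question: does this action see the concurrent update,
    // or the state the DataFrame was built against?
    df.show()
  }
}
```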
2. The second diagram illustrates the locking and concurrency behavior of the Spark integration. Given that a cache exists on the Ignite node, Spark creates a DataFrame from the IgniteRDD and adds a new column to the data (the email column in the diagram). If, concurrently, changes are made to the cache on the Ignite node (either by another Spark application or by another application using the Ignite API), the questions are:

a. What happens when Spark persists the RDD back to the Ignite cache through the saveRDD() API? Would the changes made previously to the Ignite cache be lost?

b. What is the locking behavior when updating the Ignite cache? Would it lock all partitions of the cache, preventing read/write access, or can Ignite determine which partitions are going to be updated and lock only those?

Thanks.

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Apache-Spark-Ignite-Integration-tp8556p9502.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
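P.S. A rough sketch of the second scenario, again with placeholder names. Note the write-back method I have seen on IgniteRDD is savePairs (or saveValues) rather than saveRDD(), so I use that below; the email derivation is my own stand-in for "adding the email column":

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.sql.SparkSession

object SaveBackSketch {
  // Hypothetical enrichment: derive an email value from a name.
  def withEmail(name: String): String = s"$name <${name.toLowerCase}@example.com>"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("save-back").getOrCreate()
    val ic = new IgniteContext(spark.sparkContext, () => new IgniteConfiguration())

    // Placeholder cache mapping id -> name.
    val personRdd = ic.fromCache[Int, String]("personCache")

    // Derive the enriched values in Spark.
    val enriched = personRdd.map { case (id, name) => (id, withEmail(name)) }

    // <-- another client updates "personCache" via the Ignite API here -->

    // Write the pairs back: does this overwrite the concurrent update,
    // and which partitions are locked while it runs?
    personRdd.savePairs(enriched)
  }
}
```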
