vladhlinsky opened a new pull request #91: ATLAS-3655: Create 
'spark_application' type to avoid 'spark_process' from being updated for 
multiple operations
URL: https://github.com/apache/atlas/pull/91
 
 
   ## What changes were proposed in this pull request?
   
   Create a `spark_application` type to prevent `spark_process` from being 
updated for multiple operations. Currently, Spark Atlas Connector uses 
`spark_process` as a top-level type for a Spark session, so the same entity is 
updated by every operation within that session.
   
   The following statements:
   ```
   spark.sql("create table table_1(col1 int,col2 string)");
   spark.sql("create table table_2 as select * from table_1");
   ```
   result in the following, correct lineage:
   ```
   table_1 ------> spark_process1 -------> table_2
   ```
   However, executing similar statements in the same Spark session:
   ```
   spark.sql("create table table_3(col1 int,col2 string)"); 
   spark.sql("create table table_4 as select * from table_3");
   ```
   results in the same `spark_process` being updated, so the lineage now 
connects all four tables.
   The proposal is to create a `spark_application` entity and associate all 
`spark_process` entities created within that session with it.
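
The difference between the two behaviours can be sketched with a toy graph model. This is only an illustration, not Spark Atlas Connector code: the edge lists, the `reachable` helper, and the `application_of` mapping are hypothetical, and the sketch assumes that the link from a process to its application is not itself a lineage edge.

```python
from collections import defaultdict


def reachable(edges, start):
    """Return every node connected to `start`, ignoring edge direction."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen


def tables(nodes):
    """Keep only the table nodes, sorted for stable comparison."""
    return sorted(n for n in nodes if n.startswith("table_"))


# Current behaviour: a single spark_process per session, updated by each CTAS.
shared = [
    ("table_1", "spark_process"), ("spark_process", "table_2"),
    ("table_3", "spark_process"), ("spark_process", "table_4"),
]

# Proposed behaviour: one spark_process per operation.
per_operation = [
    ("table_1", "spark_process_1"), ("spark_process_1", "table_2"),
    ("table_3", "spark_process_2"), ("spark_process_2", "table_4"),
]

# Hypothetical stand-in for the process -> application association;
# kept outside the lineage edges on purpose (assumption).
application_of = {
    "spark_process_1": "spark_application",
    "spark_process_2": "spark_application",
}

shared_lineage = tables(reachable(shared, "table_1"))
split_lineage = tables(reachable(per_operation, "table_1"))
print(shared_lineage)  # all four tables end up in one connected lineage
print(split_lineage)   # only table_1 and table_2
```

In this model the session is still identifiable through `application_of`, while each process keeps its own inputs and outputs.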
   
   ## How was this patch tested?
   
   Manually, using a modified version of Spark Atlas Connector:
   - Installed and started Atlas.
   - Executed the following statements using spark-shell:
   
   ```
   spark.sql("create table table_1_17(col1 int,col2 string)");
   spark.sql("create table table_2_17 as select * from table_1_17");
   spark.sql("create table table_3_17(col1 int,col2 string)");
   spark.sql("create table table_4_17 as select * from table_3_17");
   ```
   
   - Verified that all 4 entities are connected in Atlas lineage.
   - Updated `1100-spark_model.json` with the proposed changes.
   - Executed similar statements once again:
   
   ```
   spark.sql("create table table_1_37(col1 int,col2 string)");
   spark.sql("create table table_2_37 as select * from table_1_37");
   spark.sql("create table table_3_37(col1 int,col2 string)");
   spark.sql("create table table_4_37 as select * from table_3_37");
   ```
   
   - Verified that two `spark_process` entities are created,
   and that both have the same `spark_application` entity as `application`.
   Each of these processes has its own lineage.
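
The manual checks above could also be partly scripted against the Atlas REST API. The following is a minimal sketch that only builds the request URLs. The host and port are assumptions for a local default installation, the helper names are hypothetical, and the endpoint paths reflect my understanding of the Atlas v2 REST API (basic search and lineage), so they should be checked against the deployed Atlas version.

```python
from urllib.parse import quote, urlencode

# Assumption: Atlas is running locally on its default port.
ATLAS_BASE = "http://localhost:21000"


def basic_search_url(type_name, base=ATLAS_BASE, limit=25):
    """URL listing entities of the given type (e.g. spark_process)."""
    query = urlencode({"typeName": type_name, "limit": limit})
    return f"{base}/api/atlas/v2/search/basic?{query}"


def lineage_url(guid, base=ATLAS_BASE, direction="BOTH", depth=3):
    """URL returning the lineage graph around the entity with this GUID."""
    query = urlencode({"direction": direction, "depth": depth})
    return f"{base}/api/atlas/v2/lineage/{quote(guid, safe='')}?{query}"


print(basic_search_url("spark_process"))
print(basic_search_url("spark_application"))
```

Fetching the first URL should list the `spark_process` entities created by the statements, and `lineage_url` can then be called with the GUID of each table to confirm that the lineages are separate.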

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
