vladhlinsky opened a new pull request #91: ATLAS-3655: Create 'spark_application' type to avoid 'spark_process' from being updated for multiple operations
URL: https://github.com/apache/atlas/pull/91

## What changes were proposed in this pull request?

Create a `spark_application` type so that `spark_process` is no longer updated for multiple operations.

Currently, Spark Atlas Connector uses `spark_process` as the top-level type for a Spark session, so the same entity is updated for every operation within that session. The following statements:
```
spark.sql("create table table_1(col1 int,col2 string)");
spark.sql("create table table_2 as select * from table_1");
```
result in the following correct lineage:
```
table_1 ------> spark_process1 -------> table_2
```
but executing similar statements in the same Spark session:
```
spark.sql("create table table_3(col1 int,col2 string)");
spark.sql("create table table_4 as select * from table_3");
```
results in the same `spark_process` being updated, and the lineage now connects all 4 tables.

The proposal is to create a `spark_application` entity and associate all `spark_process` entities created within that session with it.

## How was this patch tested?

Manually, using a modified version of Spark Atlas Connector:
- Installed and started Atlas.
- Executed the following statements using spark-shell:
```
spark.sql("create table table_1_17(col1 int,col2 string)");
spark.sql("create table table_2_17 as select * from table_1_17");
spark.sql("create table table_3_17(col1 int,col2 string)");
spark.sql("create table table_4_17 as select * from table_3_17");
```
- Verified that all 4 entities are connected in Atlas lineage.
- Updated `1100-spark_model.json` with the proposed changes.
- Once again executed similar statements:
```
spark.sql("create table table_1_37(col1 int,col2 string)");
spark.sql("create table table_2_37 as select * from table_1_37");
spark.sql("create table table_3_37(col1 int,col2 string)");
spark.sql("create table table_4_37 as select * from table_3_37");
```
- Verified that two `spark_process` entities are created, both referencing a single `spark_application` entity as `application`. Each of these processes has its own lineage.
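
The difference between the two models can be illustrated with a small, self-contained Scala sketch. It is only a toy model of the lineage graph and does not use Atlas or Spark Atlas Connector code. `LineageSketch`, `Process`, `Ctas`, `sessionLevel`, `perStatement`, and `downstream` are hypothetical names introduced for illustration; only the `spark_process` and `spark_application` type names and the `application` attribute come from the description above.

```scala
// Toy model only: these names are illustrative and are not part of Atlas
// or Spark Atlas Connector.
object LineageSketch {
  // A process entity with its input tables, output tables, and (optionally)
  // the application entity it is associated with.
  final case class Process(
      name: String,
      inputs: Set[String],
      outputs: Set[String],
      application: Option[String])

  // A "create table ... as select" statement reduced to source -> target.
  final case class Ctas(source: String, target: String)

  // Current behaviour as described above: one process per session, so every
  // statement adds its tables to the same entity.
  def sessionLevel(statements: Seq[Ctas]): Seq[Process] =
    if (statements.isEmpty) Seq.empty
    else Seq(Process(
      "spark_process_1",
      statements.map(_.source).toSet,
      statements.map(_.target).toSet,
      None))

  // Proposed behaviour: one process per statement, each associated with the
  // same application entity.
  def perStatement(statements: Seq[Ctas], application: String): Seq[Process] =
    statements.zipWithIndex.map { case (s, i) =>
      Process(s"spark_process_${i + 1}", Set(s.source), Set(s.target), Some(application))
    }

  // Tables that appear directly downstream of `table` in the lineage.
  def downstream(processes: Seq[Process], table: String): Set[String] =
    processes.filter(_.inputs.contains(table)).flatMap(_.outputs).toSet

  def main(args: Array[String]): Unit = {
    val statements = Seq(Ctas("table_1", "table_2"), Ctas("table_3", "table_4"))
    println(downstream(sessionLevel(statements), "table_1"))
    println(downstream(perStatement(statements, "spark_application_1"), "table_1"))
  }
}
```

In this sketch the session-level model reports `table_4` as downstream of `table_1` even though no statement relates them, while the per-statement model reports only `table_2` and still keeps both processes tied to one application.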