[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed
[ https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-34163:
-
Description:
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the Kafka messages are Avro. Some of the fields in the data are optional, and I have to perform a transformation only when those optional fields are present. So I tried to check whether a column exists with:
{code:python}
from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check the existence of a given column in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False
{code}
This works on a regular (batch) DataFrame: when a column is not present, the check returns False, so I can skip the transformation on the missing column. But on a streaming DataFrame *has_column* always returns True, so the transformation is executed and raises an exception.
What is the right approach to check for the existence of a column in a streaming DataFrame before performing a transformation? Why do streaming and batch DataFrames differ in behavior? How can I skip a transformation on a column that doesn't exist?

> Spark Structured Streaming - Kafka avro transformation on optional field
> Failed
> ---
>
> Key: SPARK-34163
> URL: https://issues.apache.org/jira/browse/SPARK-34163
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.7
> Reporter: Felix Kizhakkel Jose
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
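Since the columns of a streaming DataFrame are fixed by the declared Avro schema rather than by the observed data, one alternative is to test the optional field against the parsed Avro schema itself before building the query. This is a sketch under assumptions, not an official Spark API: the helper name `avro_has_field` and the example schema are illustrative.

```python
import json

def avro_has_field(schema, dotted_path):
    """Return True if a (possibly nested) field exists in a parsed Avro
    record schema, e.g. avro_has_field(schema, "address.zip")."""
    node = schema
    for part in dotted_path.split("."):
        # Optional Avro fields are unions like ["null", {...}]; unwrap them.
        if isinstance(node, list):
            node = next((b for b in node if b != "null"), None)
        if not isinstance(node, dict) or node.get("type") != "record":
            return False
        fields = {f["name"]: f["type"] for f in node.get("fields", [])}
        if part not in fields:
            return False
        node = fields[part]
    return True

# Illustrative schema with an optional nested record:
schema = json.loads("""
{"type": "record", "name": "Event", "fields": [
  {"name": "id", "type": "string"},
  {"name": "address", "type": ["null",
    {"type": "record", "name": "Address", "fields": [
      {"name": "zip", "type": "string"}]}]}
]}
""")
```

With this, the decision to apply a transformation is driven by the schema the stream was declared with, which is the same on every micro-batch.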
[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed
[ https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-34163:
-
Description:
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the Kafka messages are Avro. Some of the fields in the data are optional, and I have to perform a transformation only when those optional fields are present. So I tried to check whether a column exists with:
{code:python}
from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check the existence of a given column in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False
{code}
This works on a regular (batch) DataFrame: when a column is not present, the check returns False, so I can skip the transformation on the missing column. But on a streaming DataFrame *has_column* always returns True, so the transformation is executed and raises an exception.
What is the right approach to check for the existence of a column in a streaming DataFrame before performing a transformation?

> Spark Structured Streaming - Kafka avro transformation on optional field
> Failed
> ---
>
> Key: SPARK-34163
> URL: https://issues.apache.org/jira/browse/SPARK-34163
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.7
> Reporter: Felix Kizhakkel Jose
> Priority: Major
[jira] [Created] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed
Felix Kizhakkel Jose created SPARK-34163:

Summary: Spark Structured Streaming - Kafka avro transformation on optional field Failed
Key: SPARK-34163
URL: https://issues.apache.org/jira/browse/SPARK-34163
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.4.7
Reporter: Felix Kizhakkel Jose

Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the Kafka messages are Avro. Some of the fields in the data are optional, and I have to perform a transformation only when those optional fields are present. So I tried to check whether a column exists with:
{code:python}
from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check the existence of a given column in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False
{code}
This works on a regular (batch) DataFrame: when a column is not present, the check returns False, so I can skip the transformation on the missing column. But on a streaming DataFrame *has_column* always returns True, so the transformation is executed and raises an exception. What is the right approach to check for the existence of a column in a streaming DataFrame before performing a transformation?
[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175658#comment-17175658 ] Felix Kizhakkel Jose commented on SPARK-32583:
--
[~hyukjin.kwon] I couldn't find any tests for PySpark Structured Streaming in the PySpark code base. Am I missing something?

> PySpark Structured Streaming Testing Support
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
> Issue Type: Question
> Components: PySpark
> Affects Versions: 2.4.4
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> Hello,
> I have investigated a lot but couldn't find any help or resources on how to +{color:#172b4d}*test my PySpark Structured Streaming pipeline job*{color}+ (ingesting from Kafka topics to S3) and how to build Continuous Integration (CI)/Continuous Deployment (CD).
> 1. Is it possible to test (unit test, integration test) a PySpark structured stream?
> 2. How do I build Continuous Integration (CI)/Continuous Deployment (CD)?
[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support
[ https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175202#comment-17175202 ] Felix Kizhakkel Jose commented on SPARK-32583:
--
[~rohitmishr1484] It would have been nice if you could have pointed me to a solution.

> PySpark Structured Streaming Testing Support
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
> Issue Type: Question
> Components: PySpark
> Affects Versions: 2.4.4
> Reporter: Felix Kizhakkel Jose
> Priority: Major
[jira] [Created] (SPARK-32583) PySpark Structured Streaming Testing Support
Felix Kizhakkel Jose created SPARK-32583:

Summary: PySpark Structured Streaming Testing Support
Key: SPARK-32583
URL: https://issues.apache.org/jira/browse/SPARK-32583
Project: Spark
Issue Type: Question
Components: PySpark
Affects Versions: 2.4.4
Reporter: Felix Kizhakkel Jose

Hello,
I have investigated a lot but couldn't find any help or resources on how to +{color:#172b4d}*test my PySpark Structured Streaming pipeline job*{color}+ (ingesting from Kafka topics to S3) and how to build Continuous Integration (CI)/Continuous Deployment (CD).
1. Is it possible to test (unit test, integration test) a PySpark structured stream?
2. How do I build Continuous Integration (CI)/Continuous Deployment (CD)?
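One commonly suggested pattern, sketched here with illustrative names (this is not an official PySpark testing API): keep the per-record transformation logic in a pure function so it can be unit-tested on plain rows without Kafka or a running streaming query, and apply the same function inside the pipeline (for example via foreachBatch).

```python
# Sketch: the transformation knows nothing about streaming, so any test
# runner (pytest, unittest) can exercise it on plain dicts in CI.
def enrich(rows):
    """Illustrative transformation: drop events without a payload and
    mark the remaining events as valid."""
    out = []
    for row in rows:
        if not row.get("payload"):
            continue  # skip events with no payload
        enriched = dict(row)
        enriched["valid"] = True
        out.append(enriched)
    return out

# Unit-test-style check, runnable on any CI box without a Spark cluster:
sample = [{"id": 1, "payload": "a"}, {"id": 2, "payload": None}]
result = enrich(sample)
```

The streaming job itself then becomes thin glue (source, sink, trigger) around logic that CI already covers, which keeps the untested surface small.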
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162311#comment-17162311 ] Felix Kizhakkel Jose commented on SPARK-26345:
--
[~sha...@uber.com] I don't have permission to assign it to you. Someone on the committers list can probably assign it to you.

> Parquet support Column indexes
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can support this feature for better read performance.
> More details: https://issues.apache.org/jira/browse/PARQUET-1201
[jira] [Updated] (SPARK-31763) DataFrame.inputFiles() not Available
[ https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-31763:
-
Issue Type: Bug (was: New Feature)

> DataFrame.inputFiles() not Available
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> I have been trying to list the input files that compose my DataSet by using *PySpark*:
> {code}
> spark_session.read
>     .format(sourceFileFormat)
>     .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>     .inputFiles()
> {code}
> but I get an exception saying the inputFiles attribute is not present. I was able to use this functionality with Spark's Java API.
> *So is this something missing in PySpark?*
[jira] [Created] (SPARK-31763) DataFrame.inputFiles() not Available
Felix Kizhakkel Jose created SPARK-31763:

Summary: DataFrame.inputFiles() not Available
Key: SPARK-31763
URL: https://issues.apache.org/jira/browse/SPARK-31763
Project: Spark
Issue Type: New Feature
Components: PySpark
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose

I have been trying to list the input files that compose my DataSet by using *PySpark*:
{code}
spark_session.read
    .format(sourceFileFormat)
    .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
    .inputFiles()
{code}
but I get an exception saying the inputFiles attribute is not present. I was able to use this functionality with Spark's Java API.
*So is this something missing in PySpark?*
[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095697#comment-17095697 ] Felix Kizhakkel Jose commented on SPARK-31599:
--
Thank you [~gsomogyi]. But this is not an S3 issue. The issue is that I compacted the files in the bucket and deleted the non-compacted files, but didn't update the "_spark_metadata" folder. I can see that those write-ahead-log JSON files still contain the deleted file names. When I use Spark SQL to read the data, it first reads the write-ahead logs from "_spark_metadata" and then tries to read the files listed in them. So I am wondering: how can we update the "_spark_metadata" content (the write-ahead logs)?

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
> Issue Type: Bug
> Components: SQL, Structured Streaming
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> I have an S3 bucket to which data is streamed (in Parquet format) by the Spark Structured Streaming framework from Kafka. Periodically I run compaction on this bucket (a separate Spark job) and, on successful compaction, delete the non-compacted (Parquet) files. After that I get the following error in Spark jobs that read from the bucket:
> *Caused by: java.io.FileNotFoundException: No such file or directory: s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run compaction on Structured Streaming S3 buckets? Also, I need to delete the un-compacted files after successful compaction to save space.
[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094917#comment-17094917 ] Felix Kizhakkel Jose commented on SPARK-31599:
--
How do I do that?

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
> Issue Type: New Feature
> Components: SQL, Structured Streaming
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
[jira] [Updated] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
[ https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-31599:
-
Description:
I have an S3 bucket to which data is streamed (in Parquet format) by the Spark Structured Streaming framework from Kafka. Periodically I run compaction on this bucket (a separate Spark job) and, on successful compaction, delete the non-compacted (Parquet) files. After that I get the following error in Spark jobs that read from the bucket:
*Caused by: java.io.FileNotFoundException: No such file or directory: s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
How do we run compaction on Structured Streaming S3 buckets? Also, I need to delete the un-compacted files after successful compaction to save space.

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
> Issue Type: New Feature
> Components: SQL, Structured Streaming
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
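For what it's worth, a hedged sketch of rewriting the sink log after compaction. This relies on an internal detail of Spark's FileStreamSink log: each `_spark_metadata` log file starts with a version marker line (e.g. `v1`) followed by one JSON entry per line with a `path` field. The layout can differ across Spark versions, so treat this as illustrative and back up the folder before touching it.

```python
import json
import os
import tempfile

def prune_metadata_log(log_path, deleted_paths):
    """Rewrite one _spark_metadata log file, dropping entries whose
    'path' is in deleted_paths; returns the entries that were kept."""
    with open(log_path) as f:
        lines = f.read().splitlines()
    header, entries = lines[0], lines[1:]  # first line is the version marker
    kept = [e for e in entries if json.loads(e).get("path") not in deleted_paths]
    with open(log_path, "w") as f:
        f.write("\n".join([header] + kept) + "\n")
    return kept

# Demo on a throwaway log file (the paths are illustrative):
demo_log = os.path.join(tempfile.mkdtemp(), "0")
with open(demo_log, "w") as f:
    f.write('v1\n{"path": "s3a://bucket/a.parquet"}\n{"path": "s3a://bucket/b.parquet"}\n')
kept = prune_metadata_log(demo_log, {"s3a://bucket/a.parquet"})
```

In a real job you would run this over every log file in `_spark_metadata` (ideally while no reader or writer is active), since any of them may reference the deleted files.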
[jira] [Created] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction
Felix Kizhakkel Jose created SPARK-31599:

Summary: Reading from S3 (Structured Streaming Bucket) Fails after Compaction
Key: SPARK-31599
URL: https://issues.apache.org/jira/browse/SPARK-31599
Project: Spark
Issue Type: New Feature
Components: SQL, Structured Streaming
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose

I have an S3 bucket to which data is streamed (in Parquet format) by the Spark Structured Streaming framework from Kafka. Periodically I run compaction on this bucket (a separate Spark job) and, on successful compaction, delete the non-compacted (Parquet) files. After that I get the following error in Spark jobs that read from the bucket:
*Caused by: java.io.FileNotFoundException: No such file or directory: s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
How do we run compaction on Structured Streaming S3 buckets? Also, I need to delete the un-compacted files after successful compaction to save space.
[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"
[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071819#comment-17071819 ] Felix Kizhakkel Jose commented on SPARK-31072:
--
Hello, any updates or help will be much appreciated.

> Default to ParquetOutputCommitter even after configuring s3a committer as
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
> Issue Type: Bug
> Components: Java API
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> My application logs say it uses ParquetOutputCommitter when I use _*Parquet*_, even after I configure it to use "PartitionedStagingCommitter" with the following settings:
> * sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
> * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
> * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
> * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
> * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", false);
> Application log excerpt:
> {code}
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> {code}
> But when I use _*ORC*_ as the file format, with the same configuration as above, it correctly picks "PartitionedStagingCommitter":
> {code}
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter
> {code}
> So I am wondering why Parquet and ORC behave differently?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started this because, when I was trying to save data to S3 directly with partitionBy() on two columns, I was getting file-not-found exceptions intermittently.
> So how can I avoid this issue with *Parquet using Spark to S3 via s3a without S3Guard?*
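Spark's cloud-integration documentation lists two additional settings that the Parquet write path needs before the S3A committer factory takes effect; without them Parquet falls back to ParquetOutputCommitter exactly as the logs above show. A sketch of the combined configuration follows (the binding classes live in the optional spark-hadoop-cloud module, and whether that module is on the classpath of a given deployment is an assumption):

```python
# Sketch: settings from Spark's cloud-integration docs for routing Parquet
# writes through an S3A committer.
s3a_committer_conf = {
    "spark.hadoop.fs.s3a.committer.name": "partitioned",
    "spark.hadoop.fs.s3a.committer.staging.conflict-mode": "append",
    # Without the next two, the Parquet path keeps ParquetOutputCommitter:
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}

# These would be applied when building the session, e.g.:
# builder = SparkSession.builder
# for key, value in s3a_committer_conf.items():
#     builder = builder.config(key, value)
```

ORC does not need the Parquet-specific binding, which is consistent with the observation that ORC picks up PartitionedStagingCommitter while Parquet does not.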
[jira] [Commented] (SPARK-26345) Parquet support Column indexes
[ https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070982#comment-17070982 ] Felix Kizhakkel Jose commented on SPARK-26345:
--
I have created a Jira in parquet-mr for a vectorized API: https://issues.apache.org/jira/browse/PARQUET-1830. But per the discussion there, a short-term solution is: "As Spark already use some internal API of parquet-mr we can step forward and implement the page skipping mechanism that is implemented in parquet-mr." [~gszadovszky]. So I am updating this Jira to track a short-term solution that benefits from the column and offset index implementation in parquet-mr 1.11.

> Parquet support Column indexes
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
[jira] [Comment Edited] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
[ https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060408#comment-17060408 ] Felix Kizhakkel Jose edited comment on SPARK-31162 at 3/16/20, 6:13 PM:
--
I have seen the following in the API documentation:
{code:scala}
/**
 * Buckets the output by the given columns. If specified, the output is laid out on the file
 * system similar to Hive's bucketing scheme.
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark
 * 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T] = {
  this.numBuckets = Option(numBuckets)
  this.bucketColumnNames = Option(colName +: colNames)
  this
}
{code}
How can we specify that?

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of Spark's default Murmur3 hash when performing a Spark bucketBy operation. Per the discussion with [~maropu] and [~hyukjin.kwon], they suggested opening a new JIRA.
[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
[ https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060408#comment-17060408 ] Felix Kizhakkel Jose commented on SPARK-31162:
--
I have seen the following in the API documentation:
{code:scala}
/**
 * Buckets the output by the given columns. If specified, the output is laid out on the file
 * system similar to Hive's bucketing scheme.
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark
 * 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T] = {
  this.numBuckets = Option(numBuckets)
  this.bucketColumnNames = Option(colName +: colNames)
  this
}
{code}
How can we specify that?

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core, SQL
> Affects Versions: 2.4.5
> Reporter: Felix Kizhakkel Jose
> Priority: Major
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059927#comment-17059927 ]

Felix Kizhakkel Jose commented on SPARK-17495:
----------------------------------------------

[~maropu] [~tejasp] I have created a Jira to add a config parameter to choose Hive hash instead of Murmur3Hash. Could you take a look? https://issues.apache.org/jira/browse/SPARK-31162

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.2.0
>
> Spark internally uses Murmur3Hash for partitioning. This is different from
> the one used by Hive. For queries which use bucketing this leads to different
> results if one tries the same query on both engines. We want users to have
> backward compatibility, so that one can switch parts of applications across
> the engines without observing regressions.
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059925#comment-17059925 ]

Felix Kizhakkel Jose commented on SPARK-19256:
----------------------------------------------

Expose a configuration parameter to select Hive hashing during the bucketBy operation. This config is currently missing or undocumented.

> Hive bucketing support
> ----------------------
>
>                 Key: SPARK-19256
>                 URL: https://issues.apache.org/jira/browse/SPARK-19256
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>            Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support
> in Spark.
> Proposal:
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing
[jira] [Created] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
Felix Kizhakkel Jose created SPARK-31162:
--------------------------------------------

             Summary: Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
                 Key: SPARK-31162
                 URL: https://issues.apache.org/jira/browse/SPARK-31162
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core, SQL
    Affects Versions: 2.4.5
            Reporter: Felix Kizhakkel Jose

I couldn't find a configuration parameter to choose Hive hashing instead of Spark's default Murmur hash when performing a Spark bucketBy operation. Following a discussion with [~maropu] and [~hyukjin.kwon], it was suggested that I open a new JIRA.
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059893#comment-17059893 ]

Felix Kizhakkel Jose commented on SPARK-17495:
----------------------------------------------

[~maropu] [~hyukjin.kwon] So we need a new Jira for a configuration parameter that overrides Murmur hash and uses the Hive hash function for bucketing? Do we have a Jira to track it, or are you going to create one?

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059847#comment-17059847 ]

Felix Kizhakkel Jose commented on SPARK-17495:
----------------------------------------------

[~maropu] Is there a configuration parameter I can set to use Hive hash instead of Murmur hash (the Spark default)?

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059846#comment-17059846 ]

Felix Kizhakkel Jose commented on SPARK-17495:
----------------------------------------------

Oh, that's great. I will check.

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
[jira] [Commented] (SPARK-17495) Hive hash implementation
[ https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059681#comment-17059681 ]

Felix Kizhakkel Jose commented on SPARK-17495:
----------------------------------------------

Is this going to be available in the coming releases?

> Hive hash implementation
> ------------------------
>
>                 Key: SPARK-17495
>                 URL: https://issues.apache.org/jira/browse/SPARK-17495
>             Labels: bulk-closed
[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"
[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055435#comment-17055435 ]

Felix Kizhakkel Jose commented on SPARK-31072:
----------------------------------------------

Could you please provide some insights?

> Default to ParquetOutputCommitter even after configuring s3a committer as
> "partitioned"
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-31072
>                 URL: https://issues.apache.org/jira/browse/SPARK-31072
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.5
>            Reporter: Felix Kizhakkel Jose
>            Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I write _*Parquet*_,
> even after I configure "PartitionedStagingCommitter" with the following settings:
>
>     sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>         "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>     sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>     sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
>     sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>     sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", false);
>
> Application log excerpt:
>
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
>
> But when I use _*ORC*_ as the file format, the same configuration correctly picks "PartitionedStagingCommitter":
>
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter
>
> So I am wondering why Parquet and ORC have different behavior.
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started looking into this because, when saving data directly to S3 with a two-column partitionBy(), I was getting intermittent file-not-found exceptions.
> So how can I avoid this issue when writing *Parquet from Spark to S3 via s3a, without S3Guard?*
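For anyone hitting this: Parquet resolves its committer through Parquet-specific Spark SQL settings, so the s3a committer-factory options alone are not consulted on the Parquet path. The Hadoop S3A committer documentation describes two additional settings that bind Parquet writes to the path output committer. A sketch as a spark-defaults.conf fragment (this assumes the spark-hadoop-cloud module is on the classpath and that these class names exist in your distribution; verify against your Spark build):

```
# s3a committer selection (consulted by the factory-based path, e.g. ORC)
spark.hadoop.fs.s3a.committer.name                   partitioned
spark.hadoop.fs.s3a.committer.staging.conflict-mode  append

# extra bindings Parquet needs, since ParquetFileFormat does its own lookup
spark.sql.sources.commitProtocolClass     org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class  org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

With these set, SQLHadoopMapReduceCommitProtocol should report the binding committer rather than ParquetOutputCommitter, which may explain the Parquet/ORC difference observed above.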
[jira] [Commented] (SPARK-19256) Hive bucketing support
[ https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055007#comment-17055007 ]

Felix Kizhakkel Jose commented on SPARK-19256:
----------------------------------------------

[~chengsu] Any updates on this feature (23163 and 26164)?

*Note:* I have also seen that recent Hive releases introduced changes to their bucketing strategy, and big-data engines like Presto have incorporated those changes. Are we considering compatibility with these new Hive changes? Link to the Presto/Hive bucketing improvements: https://prestosql.io/blog/2019/05/29/improved-hive-bucketing.html

> Hive bucketing support
> ----------------------
>
>                 Key: SPARK-19256
>                 URL: https://issues.apache.org/jira/browse/SPARK-19256
[jira] [Updated] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"
[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Kizhakkel Jose updated SPARK-31072:
-----------------------------------------
    Summary: Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"  (was: Default to ParquetOutputCommitter even after configuring committer as "partitioned")

> Default to ParquetOutputCommitter even after configuring s3a committer as
> "partitioned"
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-31072
>                 URL: https://issues.apache.org/jira/browse/SPARK-31072
[jira] [Updated] (SPARK-31072) Default to ParquetOutputCommitter even after configuring committer as "partitioned"
[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Kizhakkel Jose updated SPARK-31072:
-----------------------------------------
    Summary: Default to ParquetOutputCommitter even after configuring committer as "partitioned"  (was: Default to ParquetOutputCommitter even after configuring setting committer as "partitioned")

> Default to ParquetOutputCommitter even after configuring committer as
> "partitioned"
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-31072
>                 URL: https://issues.apache.org/jira/browse/SPARK-31072
[jira] [Comment Edited] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"
[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053543#comment-17053543 ]

Felix Kizhakkel Jose edited comment on SPARK-31072 at 3/6/20, 3:48 PM:
-----------------------------------------------------------------------

[~steve_l], I have seen some issues you have addressed in this area (zero-rename commit with s3a, etc.); could you please give me some insights?

All, please provide some help on this issue.

was (Author: felixkjose): [~steve_l], I have seen some issues you have addressed in this area; could you please give me some insights? All, please provide some help on this issue.

> Default to ParquetOutputCommitter even after configuring setting committer as
> "partitioned"
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31072
>                 URL: https://issues.apache.org/jira/browse/SPARK-31072
[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"
[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053543#comment-17053543 ]

Felix Kizhakkel Jose commented on SPARK-31072:
----------------------------------------------

[~steve_l], I have seen some issues you have addressed in this area; could you please give me some insights?

All, please provide some help on this issue.

> Default to ParquetOutputCommitter even after configuring setting committer as
> "partitioned"
> ------------------------------------------------------------------------------------------
>
>                 Key: SPARK-31072
>                 URL: https://issues.apache.org/jira/browse/SPARK-31072
[jira] [Created] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"
Felix Kizhakkel Jose created SPARK-31072:
--------------------------------------------

             Summary: Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"
                 Key: SPARK-31072
                 URL: https://issues.apache.org/jira/browse/SPARK-31072
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.5
            Reporter: Felix Kizhakkel Jose

My program's logs say it uses ParquetOutputCommitter when I write _*Parquet*_, even after I configure "PartitionedStagingCommitter" with the following settings:

    sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
    sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
    sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
    sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
    sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", false);

Application log excerpt:

20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter

But when I use _*ORC*_ as the file format, the same configuration correctly picks "PartitionedStagingCommitter":

20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned to output data to s3a:
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter PartitionedStagingCommitter

So I am wondering why Parquet and ORC have different behavior. How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter? I started looking into this because, when saving data directly to S3 with a two-column partitionBy(), I was getting intermittent file-not-found exceptions. So how can I avoid this issue when writing *Parquet from Spark to S3 via s3a, without S3Guard?*
[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet
[ https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052476#comment-17052476 ]

Felix Kizhakkel Jose commented on SPARK-20901:
----------------------------------------------

Thank you [~dongjoon].

> Feature parity for ORC with Parquet
> -----------------------------------
>
>                 Key: SPARK-20901
>                 URL: https://issues.apache.org/jira/browse/SPARK-20901
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Dongjoon Hyun
>            Priority: Major
>
> This issue aims to track the feature parity for ORC with Parquet.
[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet
[ https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052384#comment-17052384 ]

Felix Kizhakkel Jose commented on SPARK-20901:
----------------------------------------------

Hi [~dongjoon], I was trying to choose between the ORC and Parquet formats (using AWS Glue/Spark). While researching, I came across this feature-parity umbrella, SPARK-20901. Which features are still not implemented for ORC to have complete parity with Parquet in Spark? Everything linked here appears to be either Resolved or Closed, so I am confused. Could you please provide some insights?

> Feature parity for ORC with Parquet
> -----------------------------------
>
>                 Key: SPARK-20901
>                 URL: https://issues.apache.org/jira/browse/SPARK-20901
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972605#comment-16972605 ] Felix Kizhakkel Jose commented on SPARK-29764: -- Is the reproducer still hard to follow? Should I trim it down further?
> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
> Issue Type: Bug
> Components: Java API, Spark Core, SQL
> Affects Versions: 2.4.4
> Reporter: Felix Kizhakkel Jose
> Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
> Hello,
> I have been doing a proof of concept for a data lake structure and analytics using Apache Spark.
> When I add java.time.LocalDateTime/java.time.LocalDate properties to my data model, the serialization to Parquet starts failing.
> *My Data Model:*
> @Data
> public class Employee {
>     private UUID id = UUID.randomUUID();
>     private String name;
>     private int age;
>     private LocalDate dob;
>     private LocalDateTime startDateTime;
>     private String phone;
>     private Address address;
> }
>
> *Serialization Snippet*
> public void serialize() {
>     List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 employee objects
>     Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
>     Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);
>     employeeDataset.write()
>         .mode(SaveMode.Append)
>         .parquet("/Users/felix/Downloads/spark.parquet");
> }
>
> *Exception Stack Trace:*
> java.lang.IllegalStateException: Failed to execute CommandLineRunner
> at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
> at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
> at
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
> at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
> at com.felix.Application.main(Application.java:45)
> Caused by: org.apache.spark.SparkException: Job aborted.
> at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
> at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
> at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
> at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
> at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
> at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
> at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
> at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
> at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
> at com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24)
> at com.felix.Application.run(Application.java:63)
> at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
> ... 6 more
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970451#comment-16970451 ] Felix Kizhakkel Jose commented on SPARK-29764: -- How do I get help now that the priority has been reduced? It seems no one is looking at this issue. Is there any SLA on getting help? [~apachespark]
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969301#comment-16969301 ] Felix Kizhakkel Jose commented on SPARK-29764: -- [~hyukjin.kwon] Could you please help me with this?
[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:46 PM: --- [~hyukjin.kwon] Sorry, I didn't know Critical is reserved for committers. Please find the attached code sample for the issue, where the dob and startDateTime fields in the Employee object cause Spark to fail to persist to a Parquet file: [^SparkParquetSampleCode.docx]. Please let me know if anything else is required; I am totally stuck on this issue.
was (Author: felixkjose): [~hyukjin.kwon] Sorry, I didn't know Critical is reserved for committers. Please find the attached code sample for the issue, where the dob and startDateTime fields in the Employee object cause Spark to fail to persist to a Parquet file: [^SparkParquetSampleCode.docx]
[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:45 PM: --- [~hyukjin.kwon] Sorry, I didn't know Critical is reserved for committers. Please find the attached code sample for the issue, where the dob and startDateTime fields in the Employee object cause Spark to fail to persist to a Parquet file: [^SparkParquetSampleCode.docx]
was (Author: felixkjose): [~hyukjin.kwon] Sorry, I didn't know Critical is reserved for committers. Please find the attached code sample for the issue, where the dob and startDateTime fields in the Employee object cause Spark to fail to persist to a Parquet file: [^SparkParquetSampleCode.docx]
[jira] [Updated] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Kizhakkel Jose updated SPARK-29764: - Attachment: SparkParquetSampleCode.docx
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411 ] Felix Kizhakkel Jose commented on SPARK-29764: -- [~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find the attached code sample for the issue I am talking [where I have dob and startDateTime field in Employee object - which is causing Spark fail to persist into parquet file. [^SparkParquetSampleCode.docx] > Error on Serializing POJO with java datetime property to a Parquet file > --- > > Key: SPARK-29764 > URL: https://issues.apache.org/jira/browse/SPARK-29764 > Project: Spark > Issue Type: Bug > Components: Java API, Spark Core, SQL >Affects Versions: 2.4.4 >Reporter: Felix Kizhakkel Jose >Priority: Major > Attachments: SparkParquetSampleCode.docx > > > Hello, > I have been doing a proof of concept for data lake structure and analytics > using Apache Spark. > When I add a java java.time.LocalDateTime/java.time.LocalDate properties in > my data model, the serialization to Parquet start failing. 
> *My Data Model:*
> @Data
> public class Employee {
>     private UUID id = UUID.randomUUID();
>     private String name;
>     private int age;
>     private LocalDate dob;
>     private LocalDateTime startDateTime;
>     private String phone;
>     private Address address;
> }
>
> *Serialization Snippet*
> {color:#0747a6}public void serialize() {
>     List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 Employee objects
>     Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
>     Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);
>     employeeDataset.write()
>         .mode(SaveMode.Append)
>         .parquet("/Users/felix/Downloads/spark.parquet");
> }{color}
>
> +*Exception Stack Trace:*+
> java.lang.IllegalStateException: Failed to execute CommandLineRunner
>     at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>     at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>     at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>     at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
>     at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
>     at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
>     at com.felix.Application.main(Application.java:45)
> Caused by: org.apache.spark.SparkException: Job aborted.
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>     at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>     at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
>     at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
>     at com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24)
>     at com.felix.Application.run(Application.java:63)
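Not from this thread, but relevant to the failure above: in Spark 2.4, `Encoders.bean` has no special handling for `java.time` types, while it does map `java.sql.Date` and `java.sql.Timestamp` to `DateType`/`TimestampType`. A minimal sketch of that workaround follows; `EmployeeRow`, its fields, and the `from` helper are hypothetical names, not part of the reporter's code:

```java
import java.sql.Date;
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical bridge bean: keep java.time types in the domain model, but
// expose java.sql.Date/Timestamp to Encoders.bean so Spark infers
// DateType/TimestampType instead of recursing into a nested group.
public class EmployeeRow {
    private Date dob;                 // inferred as DateType
    private Timestamp startDateTime;  // inferred as TimestampType

    public Date getDob() { return dob; }
    public void setDob(Date dob) { this.dob = dob; }
    public Timestamp getStartDateTime() { return startDateTime; }
    public void setStartDateTime(Timestamp t) { this.startDateTime = t; }

    public static EmployeeRow from(LocalDate dob, LocalDateTime start) {
        EmployeeRow r = new EmployeeRow();
        r.setDob(Date.valueOf(dob));               // lossless for a calendar date
        r.setStartDateTime(Timestamp.valueOf(start));
        return r;
    }

    public static void main(String[] args) {
        EmployeeRow r = from(LocalDate.of(1990, 1, 15),
                             LocalDateTime.of(2019, 11, 5, 9, 30));
        System.out.println(r.getDob());            // 1990-01-15
        System.out.println(r.getStartDateTime());  // 2019-11-05 09:30:00.0
    }
}
```

The bridge bean would then be passed to `sparkSession.createDataset(rows, Encoders.bean(EmployeeRow.class))` in place of `Employee`.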
[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967800#comment-16967800 ]

Felix Kizhakkel Jose commented on SPARK-29764:
----------------------------------------------

The Spark schema generated for the POJO is:

message spark_schema {
  optional group address {
    optional binary city (UTF8);
    optional binary streetName (UTF8);
    optional group zip {
      required int32 ext;
      required int32 zip;
    }
  }
  required int32 age;
  optional group dob {
    optional group chronology {
      optional binary calendarType (UTF8);
      optional binary id (UTF8);
    }
    required int32 dayOfMonth;
    optional binary dayOfWeek (UTF8);
    required int32 dayOfYear;
    optional group era {
      required int32 value;
    }
    required boolean leapYear;
    optional binary month (UTF8);
    required int32 monthValue;
    required int32 year;
  }
  optional group id {
    required int64 leastSignificantBits;
    required int64 mostSignificantBits;
  }
  optional binary name (UTF8);
  optional binary phone (UTF8);
  optional group startDateTime {
    required int32 dayOfMonth;
    optional binary dayOfWeek (UTF8);
    required int32 dayOfYear;
    required int32 hour;
    required int32 minute;
    optional binary month (UTF8);
    required int32 monthValue;
    required int32 nano;
    required int32 second;
    required int32 year;
  }
}

Also, I don't know why the date/time fields are not recognized as int96 (TimestampType) or int32 (DateType); instead each is represented as a group. I don't know whether that is the reason I get a NegativeArraySizeException when I have a large data set to persist to Parquet. Any help is very much appreciated.

> Error on Serializing POJO with java datetime property to a Parquet file
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29764
>                 URL: https://issues.apache.org/jira/browse/SPARK-29764
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API, Spark Core, SQL
>    Affects Versions: 2.4.4
>            Reporter: Felix Kizhakkel Jose
>            Priority: Blocker
>
> Hello,
> I have been doing a proof of concept for data lake structure and analytics using Apache Spark.
> When I add java.time.LocalDateTime / java.time.LocalDate properties to my data model, serialization to Parquet starts failing.
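The group layout in the schema above lines up with plain JavaBean introspection: `Encoders.bean` recurses into property types it does not recognize, and `java.time.LocalDate`'s getters are exactly the fields that appear inside the `dob` group. A small sketch (pure JDK, no Spark, added here for illustration) shows what introspection reports for `LocalDate`:

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.time.LocalDate;

// Lists the JavaBean properties of java.time.LocalDate. The names match the
// fields of the `dob` group in the Parquet schema (dayOfMonth, monthValue,
// year, leapYear, ...), which is why the date is written as a nested group
// rather than a DATE/INT96 column: the bean encoder treats LocalDate as just
// another nested bean.
public class LocalDateProps {
    public static void main(String[] args) throws IntrospectionException {
        for (PropertyDescriptor p :
                Introspector.getBeanInfo(LocalDate.class, Object.class).getPropertyDescriptors()) {
            System.out.println(p.getName() + " : " + p.getPropertyType().getSimpleName());
        }
    }
}
```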
[jira] [Updated] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
[ https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Felix Kizhakkel Jose updated SPARK-29764:
-----------------------------------------
Description:

Hello,
I have been doing a proof of concept for data lake structure and analytics using Apache Spark. When I add java.time.LocalDateTime / java.time.LocalDate properties to my data model, serialization to Parquet starts failing.

*My Data Model:*
@Data
public class Employee {
    private UUID id = UUID.randomUUID();
    private String name;
    private int age;
    private LocalDate dob;
    private LocalDateTime startDateTime;
    private String phone;
    private Address address;
}

*Serialization Snippet*
{color:#0747a6}public void serialize() {
    List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 Employee objects
    Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
    Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);
    employeeDataset.write()
        .mode(SaveMode.Append)
        .parquet("/Users/felix/Downloads/spark.parquet");
}{color}

+*Exception Stack Trace:*+
java.lang.IllegalStateException: Failed to execute CommandLineRunner
    at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
    at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
    at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
    at com.felix.Application.main(Application.java:45)
Caused by: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
    at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
    at com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24)
    at com.felix.Application.run(Application.java:63)
    at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
    ... 6 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[jira] [Created] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file
Felix Kizhakkel Jose created SPARK-29764:
-----------------------------------------

             Summary: Error on Serializing POJO with java datetime property to a Parquet file
                 Key: SPARK-29764
                 URL: https://issues.apache.org/jira/browse/SPARK-29764
             Project: Spark
          Issue Type: Bug
          Components: Java API, Spark Core, SQL
    Affects Versions: 2.4.4
            Reporter: Felix Kizhakkel Jose

Hello,
I have been doing a proof of concept for data lake structure and analytics using Apache Spark. When I add java.time.LocalDateTime / java.time.LocalDate properties to my data model, serialization to Parquet starts failing.