[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed

2021-01-19 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-34163:
-
Description: 
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages are Avro encoded.
Some of the fields in the data are optional, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:

{color:#0747A6}from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check whether the given column exists in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False{color}

This works on a normal (batch) DataFrame: when a column is not present, the 
check returns False, so I can skip the transformation on the missing column.

On a streaming DataFrame, however, *has_column* always returns True, so the 
transformation is executed and causes an exception. What is the right approach 
to check for the existence of a column in a streaming DataFrame before 
performing a transformation?

Why do streaming and batch DataFrames differ in behavior? How can I skip the 
transformation on a column if it doesn't exist?
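
For reference, below is a minimal sketch (not the reporter's code) of an 
alternative check that does not rely on catching AnalysisException: it inspects 
the DataFrame schema directly. The schema of a streaming DataFrame is fixed 
when the query is defined, so the same check should behave identically for 
batch and streaming DataFrames. The field names in the usage comment are 
hypothetical.

from pyspark.sql.types import StructType

def has_field(dataframe, field_path):
    """Return True if a (possibly nested) field exists in the DataFrame schema."""
    schema = dataframe.schema
    for part in field_path.split("."):
        # Stop as soon as the current level is not a struct or lacks the field.
        if not isinstance(schema, StructType) or part not in schema.fieldNames():
            return False
        schema = schema[part].dataType
    return True

# Hypothetical usage:
# if has_field(stream_df, "payload.optionalField"):
#     stream_df = stream_df.withColumn("present", stream_df["payload.optionalField"].isNotNull())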

  was:
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages are Avro encoded.
Some of the fields in the data are optional, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:

{color:#0747A6}from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check whether the given column exists in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False{color}

This works on a normal (batch) DataFrame: when a column is not present, the 
check returns False, so I can skip the transformation on the missing column.

On a streaming DataFrame, however, *has_column* always returns True, so the 
transformation is executed and causes an exception. What is the right approach 
to check for the existence of a column in a streaming DataFrame before 
performing a transformation?


> Spark Structured Streaming - Kafka avro transformation on optional field 
> Failed
> ---
>
> Key: SPARK-34163
> URL: https://issues.apache.org/jira/browse/SPARK-34163
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.7
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello All,
> I have a Spark Structured Streaming job that ingests data from Kafka, where 
> the messages are Avro encoded.
> Some of the fields in the data are optional, and I have to perform a 
> transformation only if those optional fields are present.
> So I tried to check whether a column exists with:
> {color:#0747A6}from pyspark.sql.utils import AnalysisException
> 
> def has_column(dataframe, col):
>     """
>     Check whether the given column exists in the given DataFrame.
> 
>     :param dataframe: the dataframe
>     :type dataframe: DataFrame
>     :param col: the column name
>     :type col: str
>     :return: True if the column exists, else False
>     :rtype: bool
>     """
>     try:
>         dataframe[col]
>         return True
>     except AnalysisException:
>         return False{color}
> This works on a normal (batch) DataFrame: when a column is not present, the 
> check returns False, so I can skip the transformation on the missing column.
> On a streaming DataFrame, however, *has_column* always returns True, so the 
> transformation is executed and causes an exception. What is the right 
> approach to check for the existence of a column in a streaming DataFrame 
> before performing a transformation?
> Why do streaming and batch DataFrames differ in behavior? How can I skip the 
> transformation on a column if it doesn't exist?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed

2021-01-19 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-34163:
-
Description: 
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages are Avro encoded.
Some of the fields in the data are optional, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:

{color:#0747A6}from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check whether the given column exists in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False{color}

This works on a normal (batch) DataFrame: when a column is not present, the 
check returns False, so I can skip the transformation on the missing column.

On a streaming DataFrame, however, *has_column* always returns True, so the 
transformation is executed and causes an exception. What is the right approach 
to check for the existence of a column in a streaming DataFrame before 
performing a transformation?

  was:
Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages are Avro encoded.
Some of the fields in the data are optional, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:

*from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check whether the given column exists in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False*

This works on a normal (batch) DataFrame: when a column is not present, the 
check returns False, so I can skip the transformation on the missing column.

On a streaming DataFrame, however, *has_column* always returns True, so the 
transformation is executed and causes an exception. What is the right approach 
to check for the existence of a column in a streaming DataFrame before 
performing a transformation?


> Spark Structured Streaming - Kafka avro transformation on optional field 
> Failed
> ---
>
> Key: SPARK-34163
> URL: https://issues.apache.org/jira/browse/SPARK-34163
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.7
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello All,
> I have a Spark Structured Streaming job that ingests data from Kafka, where 
> the messages are Avro encoded.
> Some of the fields in the data are optional, and I have to perform a 
> transformation only if those optional fields are present.
> So I tried to check whether a column exists with:
> {color:#0747A6}from pyspark.sql.utils import AnalysisException
> 
> def has_column(dataframe, col):
>     """
>     Check whether the given column exists in the given DataFrame.
> 
>     :param dataframe: the dataframe
>     :type dataframe: DataFrame
>     :param col: the column name
>     :type col: str
>     :return: True if the column exists, else False
>     :rtype: bool
>     """
>     try:
>         dataframe[col]
>         return True
>     except AnalysisException:
>         return False{color}
> This works on a normal (batch) DataFrame: when a column is not present, the 
> check returns False, so I can skip the transformation on the missing column.
> On a streaming DataFrame, however, *has_column* always returns True, so the 
> transformation is executed and causes an exception. What is the right 
> approach to check for the existence of a column in a streaming DataFrame 
> before performing a transformation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34163) Spark Structured Streaming - Kafka avro transformation on optional field Failed

2021-01-19 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-34163:


 Summary: Spark Structured Streaming - Kafka avro transformation on 
optional field Failed
 Key: SPARK-34163
 URL: https://issues.apache.org/jira/browse/SPARK-34163
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.7
Reporter: Felix Kizhakkel Jose


Hello All,
I have a Spark Structured Streaming job that ingests data from Kafka, where the 
messages are Avro encoded.
Some of the fields in the data are optional, and I have to perform a 
transformation only if those optional fields are present.
So I tried to check whether a column exists with:

*from pyspark.sql.utils import AnalysisException

def has_column(dataframe, col):
    """
    Check whether the given column exists in the given DataFrame.

    :param dataframe: the dataframe
    :type dataframe: DataFrame
    :param col: the column name
    :type col: str
    :return: True if the column exists, else False
    :rtype: bool
    """
    try:
        dataframe[col]
        return True
    except AnalysisException:
        return False*

This works on a normal (batch) DataFrame: when a column is not present, the 
check returns False, so I can skip the transformation on the missing column.

On a streaming DataFrame, however, *has_column* always returns True, so the 
transformation is executed and causes an exception. What is the right approach 
to check for the existence of a column in a streaming DataFrame before 
performing a transformation?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-11 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175658#comment-17175658
 ] 

Felix Kizhakkel Jose commented on SPARK-32583:
--

[~hyukjin.kwon] I couldn't find any tests for PySpark Structured Streaming under 
the pyspark code base. Am I missing something?

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't find any help or resources on how to 
> +{color:#172b4d}*test my PySpark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build Continuous Integration 
> (CI)/Continuous Deployment (CD) for it.
> 1. Is it possible to test (unit test, integration test) a PySpark structured 
> stream?
> 2. How do I build Continuous Integration (CI)/Continuous Deployment (CD)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17175202#comment-17175202
 ] 

Felix Kizhakkel Jose commented on SPARK-32583:
--

[~rohitmishr1484] It would have been nice if you could have pointed me to a 
solution.

> PySpark Structured Streaming Testing Support
> 
>
> Key: SPARK-32583
> URL: https://issues.apache.org/jira/browse/SPARK-32583
> Project: Spark
>  Issue Type: Question
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> Hello,
> I have investigated a lot but couldn't find any help or resources on how to 
> +{color:#172b4d}*test my PySpark Structured Streaming pipeline job*{color}+ 
> (ingesting from Kafka topics to S3) and how to build Continuous Integration 
> (CI)/Continuous Deployment (CD) for it.
> 1. Is it possible to test (unit test, integration test) a PySpark structured 
> stream?
> 2. How do I build Continuous Integration (CI)/Continuous Deployment (CD)?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32583) PySpark Structured Streaming Testing Support

2020-08-10 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-32583:


 Summary: PySpark Structured Streaming Testing Support
 Key: SPARK-32583
 URL: https://issues.apache.org/jira/browse/SPARK-32583
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Felix Kizhakkel Jose


Hello,

I have investigated a lot but couldn't find any help or resources on how to 
+{color:#172b4d}*test my PySpark Structured Streaming pipeline job*{color}+ 
(ingesting from Kafka topics to S3) and how to build Continuous Integration 
(CI)/Continuous Deployment (CD) for it.

1. Is it possible to test (unit test, integration test) a PySpark structured 
stream?

2. How do I build Continuous Integration (CI)/Continuous Deployment (CD)?
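
For illustration only, here is a minimal sketch of one possible approach (an 
assumption, not an official recommendation): run the streaming query against a 
local file source inside a pytest test, write to a memory sink, and use 
query.processAllAvailable() to block until all available input has been 
processed. All paths, names, and the schema below are hypothetical.

import tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, LongType, StringType

def test_streaming_transformation():
    spark = (SparkSession.builder
             .master("local[2]")
             .appName("structured-streaming-test")
             .getOrCreate())
    with tempfile.TemporaryDirectory() as src_dir:
        # Seed the directory that will act as the streaming source.
        spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
             .write.mode("overwrite").json(src_dir)

        schema = StructType([StructField("id", LongType()),
                             StructField("value", StringType())])
        stream_df = spark.readStream.schema(schema).json(src_dir)
        transformed = stream_df.filter("id > 1")   # the logic under test

        query = (transformed.writeStream
                 .format("memory")            # in-memory sink, queryable via SQL
                 .queryName("test_output")
                 .outputMode("append")
                 .start())
        query.processAllAvailable()           # block until the input is consumed
        rows = spark.sql("SELECT id FROM test_output").collect()
        query.stop()
        assert [r.id for r in rows] == [2]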



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-07-21 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162311#comment-17162311
 ] 

Felix Kizhakkel Jose commented on SPARK-26345:
--

[~sha...@uber.com] I don't have permission to assign it to you. Probably 
someone on the committers list can assign it to you.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can supports this feature for 
> good read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-21 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-31763:
-
Issue Type: Bug  (was: New Feature)

> DataFrame.inputFiles() not Available
> 
>
> Key: SPARK-31763
> URL: https://issues.apache.org/jira/browse/SPARK-31763
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have been trying to list the input files that compose my Dataset by using 
> *PySpark*:
> spark_session.read
>  .format(sourceFileFormat)
>  .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
>  *.inputFiles();*
> but I get an exception saying the inputFiles attribute is not present. I was 
> able to get the same functionality with the Spark Java API. 
> *So is this something missing in PySpark?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31763) DataFrame.inputFiles() not Available

2020-05-19 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-31763:


 Summary: DataFrame.inputFiles() not Available
 Key: SPARK-31763
 URL: https://issues.apache.org/jira/browse/SPARK-31763
 Project: Spark
  Issue Type: New Feature
  Components: PySpark
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose


I have been trying to list the input files that compose my Dataset by using 
*PySpark*:

spark_session.read
 .format(sourceFileFormat)
 .load(S3A_FILESYSTEM_PREFIX + bucket + File.separator + sourceFolderPrefix)
 *.inputFiles();*

but I get an exception saying the inputFiles attribute is not present. I was 
able to get the same functionality with the Spark Java API. 

*So is this something missing in PySpark?*
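
For what it's worth, a minimal sketch of a possible interim workaround (an 
assumption, not a supported API): the underlying JVM Dataset does expose 
inputFiles(), so it can be reached through the DataFrame's private _jdf handle 
until PySpark exposes the method natively. The format, bucket, and prefix below 
are hypothetical.

df = (spark_session.read
      .format("parquet")
      .load("s3a://my-bucket/source-prefix/"))

# _jdf is the underlying JVM Dataset; inputFiles() returns the files backing it.
input_files = list(df._jdf.inputFiles())
print(input_files)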



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-29 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17095697#comment-17095697
 ] 

Felix Kizhakkel Jose commented on SPARK-31599:
--

Thank you [~gsomogyi]. But this is not an S3 issue. The issue is that I have 
compacted the files in the bucket and deleted the non-compacted files, but did 
not update/modify the "_spark_metadata" folder. I can see that those 
write-ahead-log JSON files still contain the deleted file names, and when I use 
Spark SQL to read the data, it first reads the write-ahead logs from 
"_spark_metadata" and then tries to read the files listed in them. So I am 
wondering: how can we update the "_spark_metadata" content (the write-ahead logs)?

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket to which data is streamed (in Parquet format) from Kafka 
> by the Spark Structured Streaming framework. Periodically I run compaction on 
> this bucket (a separate Spark job) and, on successful compaction, delete the 
> non-compacted (Parquet) files. After that, Spark jobs that read from the 
> bucket fail with the following error:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run *compaction on Structured Streaming S3 buckets*? I also need to 
> delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17094917#comment-17094917
 ] 

Felix Kizhakkel Jose commented on SPARK-31599:
--

How do I do that? 

> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket to which data is streamed (in Parquet format) from Kafka 
> by the Spark Structured Streaming framework. Periodically I run compaction on 
> this bucket (a separate Spark job) and, on successful compaction, delete the 
> non-compacted (Parquet) files. After that, Spark jobs that read from the 
> bucket fail with the following error:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run *compaction on Structured Streaming S3 buckets*? I also need to 
> delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-31599:
-
Description: 
I have a S3 bucket which has data streamed (Parquet format) to it by Spark 
Structured Streaming Framework from Kafka. Periodically I try to run compaction 
on this bucket (a separate Spark Job), and on successful compaction delete the 
non compacted (parquet) files. After which I am getting following error on 
Spark jobs which read from that bucket:
 *Caused by: java.io.FileNotFoundException: No such file or directory: 
s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*

How do we run *_c_ompaction on Structured Streaming S3 bucket_s*. Also I need 
to delete the un-compacted files after successful compaction to save space.

  was:
I have an S3 bucket to which data is streamed (in Parquet format) from Kafka by 
the Spark Structured Streaming framework. Periodically I run compaction on this 
bucket (a separate Spark job) and, on successful compaction, delete the 
non-compacted (Parquet) files. After that, Spark jobs that read from the bucket 
fail with the following error:
*Caused by: java.io.FileNotFoundException: No such file or directory: 
s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*

How do we run *compaction on Structured Streaming S3 buckets*? I also need to 
delete the un-compacted files after successful compaction to save space.


> Reading from S3 (Structured Streaming Bucket) Fails after Compaction
> 
>
> Key: SPARK-31599
> URL: https://issues.apache.org/jira/browse/SPARK-31599
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Structured Streaming
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I have an S3 bucket to which data is streamed (in Parquet format) from Kafka 
> by the Spark Structured Streaming framework. Periodically I run compaction on 
> this bucket (a separate Spark job) and, on successful compaction, delete the 
> non-compacted (Parquet) files. After that, Spark jobs that read from the 
> bucket fail with the following error:
>  *Caused by: java.io.FileNotFoundException: No such file or directory: 
> s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*
> How do we run *compaction on Structured Streaming S3 buckets*? I also need to 
> delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31599) Reading from S3 (Structured Streaming Bucket) Fails after Compaction

2020-04-28 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-31599:


 Summary: Reading from S3 (Structured Streaming Bucket) Fails after 
Compaction
 Key: SPARK-31599
 URL: https://issues.apache.org/jira/browse/SPARK-31599
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Structured Streaming
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose


I have an S3 bucket to which data is streamed (in Parquet format) from Kafka by 
the Spark Structured Streaming framework. Periodically I run compaction on this 
bucket (a separate Spark job) and, on successful compaction, delete the 
non-compacted (Parquet) files. After that, Spark jobs that read from the bucket 
fail with the following error:
*Caused by: java.io.FileNotFoundException: No such file or directory: 
s3a://spark-kafka-poc/intermediate/part-0-05ff7893-8a13-4dcd-aeed-3f0d4b5d1691-c000.gz.parquet*

How do we run *compaction on Structured Streaming S3 buckets*? I also need to 
delete the un-compacted files after successful compaction to save space.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"

2020-03-31 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071819#comment-17071819
 ] 

Felix Kizhakkel Jose commented on SPARK-31072:
--

Hello,
Any updates or help would be much appreciated. 

> Default to ParquetOutputCommitter even after configuring s3a committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use the "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started this because, when I was trying to save data to S3 directly with 
> partitionBy() on two columns, I was getting file-not-found exceptions 
> intermittently.
> So how can I avoid this issue when *writing Parquet from Spark to S3 via s3a 
> without S3Guard?*
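
As a hedged aside (adapted to PySpark, based on my reading of the Spark 
cloud-integration docs rather than anything confirmed here): for Parquet the 
S3A committer apparently also has to be bound through the SQL commit protocol, 
otherwise ParquetFileFormat falls back to ParquetOutputCommitter. The binding 
classes below come from the optional spark-hadoop-cloud module, so treat this 
as an assumption to verify.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # S3A committer selection (same intent as the settings quoted above).
         .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
         .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
         .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
                 "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
         # Extra bindings Parquet appears to need (from spark-hadoop-cloud).
         .config("spark.sql.sources.commitProtocolClass",
                 "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
         .config("spark.sql.parquet.output.committer.class",
                 "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
         .getOrCreate())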



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26345) Parquet support Column indexes

2020-03-30 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17070982#comment-17070982
 ] 

Felix Kizhakkel Jose commented on SPARK-26345:
--

I have created a Jira in parquet-mr for a vectorized API: 
https://issues.apache.org/jira/browse/PARQUET-1830. But as per the discussion 
there, it seems the short-term solution is "As Spark already use some internal 
API of parquet-mr we can step forward and implement the page skipping mechanism 
that is implemented in parquet-mr." [~gszadovszky]

So I am updating this Jira to track a short-term solution that benefits from 
the column and offset index implementation in parquet-mr 1.11.

> Parquet support Column indexes
> --
>
> Key: SPARK-26345
> URL: https://issues.apache.org/jira/browse/SPARK-26345
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> Parquet 1.11.0 supports column indexing. Spark can supports this feature for 
> good read performance.
> More details:
> https://issues.apache.org/jira/browse/PARQUET-1201



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2020-03-16 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060408#comment-17060408
 ] 

Felix Kizhakkel Jose edited comment on SPARK-31162 at 3/16/20, 6:13 PM:


I have seen the following in the API documentation:

/**
 * Buckets the output by the given columns. *If specified, the output is laid 
 * out on the file system similar to Hive's bucketing scheme.*
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) 
 * starting with Spark 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T] = {
  this.numBuckets = Option(numBuckets)
  this.bucketColumnNames = Option(colName +: colNames)
  this
}

How can we specify that?
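
For context, a minimal usage sketch with hypothetical DataFrame, table, and 
column names: in Spark 2.4, bucketBy only takes effect together with 
saveAsTable(); a plain save() raises an AnalysisException. Which hash function 
is used is not selectable here, which is what the linked SPARK-31162 asks for.

(df.write
   .format("parquet")
   .bucketBy(8, "customer_id")      # number of buckets and bucketing column
   .sortBy("customer_id")
   .mode("overwrite")
   .saveAsTable("bucketed_customers"))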

 


was (Author: felixkjose):
I have seen following in the API documentation:



/**
 * Buckets the output by the given columns. *If specified, the output is laid 
out on the file*
 ** system similar to Hive's bucketing scheme.*
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) 
starting with Spark
 * 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): 
DataFrameWriter[T] = {
 this.numBuckets = Option(numBuckets)
 this.bucketColumnNames = Option(colName +: colNames)
 this
}

How can we specify that?

 

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of 
> Spark's default Murmur hash when performing a Spark bucketBy operation. 
> Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested 
> to open a new JIRA. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2020-03-16 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060408#comment-17060408
 ] 

Felix Kizhakkel Jose commented on SPARK-31162:
--

I have seen the following in the API documentation:

/**
 * Buckets the output by the given columns. *If specified, the output is laid 
 * out on the file system similar to Hive's bucketing scheme.*
 *
 * This is applicable for all file-based data sources (e.g. Parquet, JSON) 
 * starting with Spark 2.1.0.
 *
 * @since 2.0
 */
@scala.annotation.varargs
def bucketBy(numBuckets: Int, colName: String, colNames: String*): DataFrameWriter[T] = {
  this.numBuckets = Option(numBuckets)
  this.bucketColumnNames = Option(colName +: colNames)
  this
}

How can we specify that?

 

> Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing
> -
>
> Key: SPARK-31162
> URL: https://issues.apache.org/jira/browse/SPARK-31162
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> I couldn't find a configuration parameter to choose Hive hashing instead of 
> Spark's default Murmur hash when performing a Spark bucketBy operation. 
> Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested 
> to open a new JIRA. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2020-03-15 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059927#comment-17059927
 ] 

Felix Kizhakkel Jose commented on SPARK-17495:
--

[~maropu] [~tejasp] I have created a Jira to have a config parameter to choose 
Hive Hash instead of Murmur3Hash. Could you take a look?
https://issues.apache.org/jira/browse/SPARK-31162

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing support

2020-03-15 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059925#comment-17059925
 ] 

Felix Kizhakkel Jose commented on SPARK-19256:
--

Please expose a configuration parameter for Hive hashing during the bucketBy 
operation. This config is missing or not documented.

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31162) Provide Configuration Parameter to select/enforce the Hive Hash for Bucketing

2020-03-15 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-31162:


 Summary: Provide Configuration Parameter to select/enforce the 
Hive Hash for Bucketing
 Key: SPARK-31162
 URL: https://issues.apache.org/jira/browse/SPARK-31162
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose


I couldn't find a configuration parameter to choose Hive hashing instead of 
Spark's default Murmur hash when performing a Spark bucketBy operation. 
Following the discussion with [~maropu] and [~hyukjin.kwon], it was suggested 
to open a new JIRA. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2020-03-15 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059893#comment-17059893
 ] 

Felix Kizhakkel Jose commented on SPARK-17495:
--

[~maropu] [~hyukjin.kwon] So do we need a new Jira to provide a configuration 
parameter to override Murmur hash and use the Hive hash function for 
bucketing? Do we have a Jira to track it, or are you going to create one?

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2020-03-15 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059847#comment-17059847
 ] 

Felix Kizhakkel Jose commented on SPARK-17495:
--

[~maropu] Is there a configuration parameter I can use to select Hive hash 
instead of Murmur hash (the Spark default)?

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2020-03-15 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059846#comment-17059846
 ] 

Felix Kizhakkel Jose commented on SPARK-17495:
--

Oh, that's great. I will check.

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
> Fix For: 2.2.0
>
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17495) Hive hash implementation

2020-03-15 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17059681#comment-17059681
 ] 

Felix Kizhakkel Jose commented on SPARK-17495:
--

Is this going to be available in the coming releases?

> Hive hash implementation
> 
>
> Key: SPARK-17495
> URL: https://issues.apache.org/jira/browse/SPARK-17495
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Tejas Patil
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: bulk-closed
>
> Spark internally uses Murmur3Hash for partitioning. This is different from 
> the one used by Hive. For queries which use bucketing this leads to different 
> results if one tries the same query on both engines. For us, we want users to 
> have backward compatibility to that one can switch parts of applications 
> across the engines without observing regressions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"

2020-03-09 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055435#comment-17055435
 ] 

Felix Kizhakkel Jose commented on SPARK-31072:
--

Could you please provide some insights?

> Default to ParquetOutputCommitter even after configuring s3a committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use the "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started this because, when I was trying to save data to S3 directly with 
> partitionBy() on two columns, I was getting file-not-found exceptions 
> intermittently.
> So how can I avoid this issue when *writing Parquet from Spark to S3 via s3a 
> without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19256) Hive bucketing support

2020-03-09 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-19256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17055007#comment-17055007
 ] 

Felix Kizhakkel Jose commented on SPARK-19256:
--

[~chengsu] Any updates on this feature (23163 and 26164)? 

{color:#172b4d}*Note:*{color} I have also seen that recent Hive versions have 
introduced changes to their bucketing strategy, and big data engines like 
Presto have incorporated those changes. Are we considering compatibility with 
these new Hive changes?
Link to the Presto/Hive bucketing improvements: 
[https://prestosql.io/blog/2019/05/29/improved-hive-bucketing.html]

> Hive bucketing support
> --
>
> Key: SPARK-19256
> URL: https://issues.apache.org/jira/browse/SPARK-19256
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Priority: Minor
>
> JIRA to track design discussions and tasks related to Hive bucketing support 
> in Spark.
> Proposal : 
> https://docs.google.com/document/d/1a8IDh23RAkrkg9YYAeO51F4aGO8-xAlupKwdshve2fc/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-31072:
-
Summary: Default to ParquetOutputCommitter even after configuring s3a 
committer as "partitioned"  (was: Default to ParquetOutputCommitter even after 
configuring committer as "partitioned")

> Default to ParquetOutputCommitter even after configuring s3a committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use the "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started this because, when I was trying to save data to S3 directly with 
> partitionBy() on two columns, I was getting file-not-found exceptions 
> intermittently.
> So how can I avoid this issue when *writing Parquet from Spark to S3 via s3a 
> without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31072) Default to ParquetOutputCommitter even after configuring committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-31072:
-
Summary: Default to ParquetOutputCommitter even after configuring committer 
as "partitioned"  (was: Default to ParquetOutputCommitter even after 
configuring setting committer as "partitioned")

> Default to ParquetOutputCommitter even after configuring committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use the "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started this because, when I was trying to save data to S3 directly with 
> partitionBy() on two columns, I was getting file-not-found exceptions 
> intermittently.
> So how can I avoid this issue when *writing Parquet from Spark to S3 via s3a 
> without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053543#comment-17053543
 ] 

Felix Kizhakkel Jose edited comment on SPARK-31072 at 3/6/20, 3:48 PM:
---

[~steve_l], 
 I have seen some issues you have addressed in this area (zero-rename commits 
with s3a, etc.); could you please give me some insights?

All,
 Please provide some help on this issue.


was (Author: felixkjose):
[~steve_l], 
I have seen some issues you have addressed in this area, could you please give 
me some insights?

All,
Please provide some help on this issue.

> Default to ParquetOutputCommitter even after configuring setting committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use the "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above it correctly pick "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC has different behavior ?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started this because when I was trying to save data to S3 directly with 
> partitionBy() two columns -  I was getting  file not found exceptions 
> intermittently.  
> So how could I avoid this issue with *Parquet  using Spark to S3 using s3A 
> without s3aGuard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053543#comment-17053543
 ] 

Felix Kizhakkel Jose commented on SPARK-31072:
--

[~steve_l], 
I have seen some issues you have addressed in this area. Could you please give 
me some insights?

All,
Please provide some help on this issue.

> Default to ParquetOutputCommitter even after configuring setting committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program logs says it uses ParquetOutputCommitter when I use _*"Parquet"*_ 
> even after I configure to use "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above it correctly pick "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC has different behavior ?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started this because when I was trying to save data to S3 directly with 
> partitionBy() two columns -  I was getting  file not found exceptions 
> intermittently.  
> So how could I avoid this issue with *Parquet  using Spark to S3 using s3A 
> without s3aGuard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-31072:


 Summary: Default to ParquetOutputCommitter even after configuring 
setting committer as "partitioned"
 Key: SPARK-31072
 URL: https://issues.apache.org/jira/browse/SPARK-31072
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose


My program logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
even after I configure it to use "PartitionedStagingCommitter" with the 
following configuration:
 * 
sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
 "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
 * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
 * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
 * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
 * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
false);

Application logs stacktrace:

20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
_temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
_temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer 
class org.apache.parquet.hadoop.ParquetOutputCommitter

But when I use _*ORC*_ as the file format, with the same configuration as above, 
it correctly picks "PartitionedStagingCommitter":
20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 1
20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
_temporary folders under output directory:false, ignore cleanup failures: false
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned 
to output data to s3a:
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
PartitionedStagingCommitter**

So I am wondering why Parquet and ORC have different behavior.
How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?

I started looking into this because, when I was trying to save data to S3 
directly with partitionBy() on two columns, I was getting file-not-found 
exceptions intermittently.
So how can I avoid this issue with *Parquet, writing from Spark to S3 via s3a, 
without S3Guard?*
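
As far as I can tell, the difference comes from the Parquet write path in Spark SQL 
choosing its committer via spark.sql.parquet.output.committer.class (hence the "Using 
default output committer for Parquet" line above), so the 
mapreduce.outputcommitter.factory.scheme.s3a factory setting only takes effect for 
formats such as ORC. Below is a minimal sketch of the extra bindings described in 
Spark's cloud-integration documentation; it assumes the optional spark-hadoop-cloud 
module, which provides PathOutputCommitProtocol and BindingParquetOutputCommitter, is 
on the classpath for this Spark build, and the app name is hypothetical:

import org.apache.spark.sql.SparkSession;

// Sketch only: route Spark SQL's Parquet commits through the Hadoop
// PathOutputCommitter machinery so fs.s3a.committer.name=partitioned is honored.
SparkSession spark = SparkSession.builder()
        .appName("s3a-partitioned-committer-sketch")   // hypothetical app name
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
        .config("spark.sql.sources.commitProtocolClass",
                "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        .config("spark.sql.parquet.output.committer.class",
                "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .getOrCreate();

With those two spark.sql.* settings in place, the SQLHadoopMapReduceCommitProtocol 
lines should stop reporting ParquetOutputCommitter and the AbstractS3ACommitterFactory 
lines should appear, as in the ORC run.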



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet

2020-03-05 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052476#comment-17052476
 ] 

Felix Kizhakkel Jose commented on SPARK-20901:
--

Thank you [~dongjoon]. 

> Feature parity for ORC with Parquet
> ---
>
> Key: SPARK-20901
> URL: https://issues.apache.org/jira/browse/SPARK-20901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to track the feature parity for ORC with Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20901) Feature parity for ORC with Parquet

2020-03-05 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17052384#comment-17052384
 ] 

Felix Kizhakkel Jose commented on SPARK-20901:
--

Hi [~dongjoon],
I was trying to choose between the ORC and Parquet formats (using AWS Glue/Spark). 
While researching, I came across this feature-parity issue, SPARK-20901.
Which features are still not implemented for ORC to reach complete feature parity 
with Parquet in Spark? All of the linked issues listed here are either Resolved or 
Closed, so I am confused. Could you please provide some insights?

> Feature parity for ORC with Parquet
> ---
>
> Key: SPARK-20901
> URL: https://issues.apache.org/jira/browse/SPARK-20901
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This issue aims to track the feature parity for ORC with Parquet.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-12 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16972605#comment-16972605
 ] 

Felix Kizhakkel Jose commented on SPARK-29764:
--

So is the reproducer still hard to follow? Do I need to trim it down further?

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
>  ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
> task 0.0 

[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-08 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970451#comment-16970451
 ] 

Felix Kizhakkel Jose commented on SPARK-29764:
--

How do I get help once the priority has been reduced? It seems like no eyes are 
getting on this issue. Is there any SLA on getting help? [~apachespark]

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
>  ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to 
> stage 

[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-07 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969301#comment-16969301
 ] 

Felix Kizhakkel Jose commented on SPARK-29764:
--

[~hyukjin.kwon] Could you please help me with this?

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
>  ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
> task 0.0 in stage 0.0 (TID 0, 

[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-06 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411
 ] 

Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:46 PM:
---

[~hyukjin.kwon] Sorry, I didn't know Critical is reserved for committers. Please find 
the attached code sample for the issue I am describing (where I have dob and 
startDateTime fields in the Employee object), which is causing Spark to fail to 
persist to a Parquet file.
 [^SparkParquetSampleCode.docx]

 

Please let me know if anything else is required. I am totally stuck with this 
issue. 


was (Author: felixkjose):
[~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find 
the attached code sample for the issue I am talking [where I have dob and 
startDateTime field in Employee object - which is causing Spark fail to persist 
into parquet file.
 [^SparkParquetSampleCode.docx]

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> 

[jira] [Comment Edited] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-06 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411
 ] 

Felix Kizhakkel Jose edited comment on SPARK-29764 at 11/6/19 2:45 PM:
---

[~hyukjin.kwon] Sorry, I didn't know Critical is reserved for committers. Please find 
the attached code sample for the issue I am describing (where I have dob and 
startDateTime fields in the Employee object), which is causing Spark to fail to 
persist to a Parquet file.
 [^SparkParquetSampleCode.docx]


was (Author: felixkjose):
[~hyukjin.kwon] Sorry, I didn't know critical is for committers. Please find 
the attached code sample for the issue I am talking [where I have dob and 
startDateTime field in Employee object - which is causing Spark fail to persist 
into parquet file.
[^SparkParquetSampleCode.docx]

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at 

[jira] [Updated] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-06 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-29764:
-
Attachment: SparkParquetSampleCode.docx

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
>  ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to 
> stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost 
> task 0.0 in stage 0.0 (TID 0, localhost, executor driver): 
> 

[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-06 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968411#comment-16968411
 ] 

Felix Kizhakkel Jose commented on SPARK-29764:
--

[~hyukjin.kwon] Sorry, I didn't know Critical is reserved for committers. Please find 
the attached code sample for the issue I am describing (where I have dob and 
startDateTime fields in the Employee object), which is causing Spark to fail to 
persist to a Parquet file.
[^SparkParquetSampleCode.docx]

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Major
> Attachments: SparkParquetSampleCode.docx
>
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
> com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
> com.felix.Application.run(Application.java:63) at 
> 

[jira] [Commented] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-05 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16967800#comment-16967800
 ] 

Felix Kizhakkel Jose commented on SPARK-29764:
--

The Spark Schema generated for the POJO is:
message spark_schema {
 optional group address {
 optional binary city (UTF8);
 optional binary streetName (UTF8);
 optional group zip {
 required int32 ext;
 required int32 zip;
 }
 }
 required int32 age;
 optional group dob {
 optional group chronology {
 optional binary calendarType (UTF8);
 optional binary id (UTF8);
 }
 required int32 dayOfMonth;
 optional binary dayOfWeek (UTF8);
 required int32 dayOfYear;
 optional group era {
 required int32 value;
 }
 required boolean leapYear;
 optional binary month (UTF8);
 required int32 monthValue;
 required int32 year;
 }
 optional group id {
 required int64 leastSignificantBits;
 required int64 mostSignificantBits;
 }
 optional binary name (UTF8);
 optional binary phone (UTF8);
 optional group startDateTime {
 required int32 dayOfMonth;
 optional binary dayOfWeek (UTF8);
 required int32 dayOfYear;
 required int32 hour;
 required int32 minute;
 optional binary month (UTF8);
 required int32 monthValue;
 required int32 nano;
 required int32 second;
 required int32 year;
 }
}

Also, I don't know why it's not recognized as int96 (TimestampType) or int32 
(DateType); instead it's represented as a group. I don't know whether that's the 
reason I get a NegativeArraySizeException when I have a large data set to persist to 
Parquet. Any help is very much appreciated.
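
For what it's worth, that group layout is consistent with the bean encoder in Spark 2.4 
having no built-in mapping for java.time.LocalDate/java.time.LocalDateTime: 
Encoders.bean appears to treat them as ordinary nested beans (hence the chronology, 
era, dayOfWeek, etc. fields) rather than producing DateType/TimestampType columns. One 
possible workaround, sketched below under the assumption that the data model can be 
changed, is to expose java.sql.Date and java.sql.Timestamp on the bean, which 
Encoders.bean does map to DateType and TimestampType; the EmployeeRecord name and 
helper methods are hypothetical and not part of the attached sample:

import java.sql.Date;
import java.sql.Timestamp;
import java.time.LocalDate;
import java.time.LocalDateTime;

// Hypothetical replacement bean: java.sql.Date/Timestamp become DateType and
// TimestampType columns in the Parquet schema instead of nested groups.
public class EmployeeRecord {
    private Date dob;                 // written as a date column
    private Timestamp startDateTime;  // written as a timestamp column

    public Date getDob() { return dob; }
    public void setDob(Date dob) { this.dob = dob; }

    public Timestamp getStartDateTime() { return startDateTime; }
    public void setStartDateTime(Timestamp startDateTime) { this.startDateTime = startDateTime; }

    // Convenience conversions from the java.time values used in the original model.
    public static Date toSqlDate(LocalDate d) { return d == null ? null : Date.valueOf(d); }
    public static Timestamp toSqlTimestamp(LocalDateTime t) { return t == null ? null : Timestamp.valueOf(t); }
}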

> Error on Serializing POJO with java datetime property to a Parquet file
> ---
>
> Key: SPARK-29764
> URL: https://issues.apache.org/jira/browse/SPARK-29764
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core, SQL
>Affects Versions: 2.4.4
>Reporter: Felix Kizhakkel Jose
>Priority: Blocker
>
> Hello,
>  I have been doing a proof of concept for data lake structure and analytics 
> using Apache Spark. 
>  When I add a java java.time.LocalDateTime/java.time.LocalDate properties in 
> my data model, the serialization to Parquet start failing.
>  *My Data Model:*
> @Data
>  public class Employee
> { private UUID id = UUID.randomUUID(); private String name; private int age; 
> private LocalDate dob; private LocalDateTime startDateTime; private String 
> phone; private Address address; }
>  
>  *Serialization Snippet*
> {color:#0747a6}public void serialize(){color}
> {color:#0747a6}{ List inputDataToSerialize = 
> getInputDataToSerialize(); // this creates 100,000 employee objects 
> Encoder employeeEncoder = Encoders.bean(Employee.class); 
> Dataset employeeDataset = sparkSession.createDataset( 
> inputDataToSerialize, employeeEncoder ); employeeDataset.write() 
> .mode(SaveMode.Append) .parquet("/Users/felix/Downloads/spark.parquet"); 
> }{color}
> +*Exception Stack Trace:*
>  +
>  *java.lang.IllegalStateException: Failed to execute 
> CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
> CommandLineRunner at 
> org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
>  at 
> org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
>  at 
> org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
>  at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:316) at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
> at 
> org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
> at com.felix.Application.main(Application.java:45)Caused by: 
> org.apache.spark.SparkException: Job aborted. at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> 

[jira] [Updated] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-05 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-29764:
-
Description: 
Hello,
 I have been doing a proof of concept for data lake structure and analytics 
using Apache Spark. 
 When I add java.time.LocalDateTime/java.time.LocalDate properties to my data model, 
the serialization to Parquet starts failing.
 *My Data Model:*

@Data
public class Employee {
    private UUID id = UUID.randomUUID();
    private String name;
    private int age;
    private LocalDate dob;
    private LocalDateTime startDateTime;
    private String phone;
    private Address address;
}

 

 *Serialization Snippet*

{color:#0747a6}public void serialize() {
    List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 employee objects

    Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
    Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);

    employeeDataset.write()
        .mode(SaveMode.Append)
        .parquet("/Users/felix/Downloads/spark.parquet");
}{color}

+*Exception Stack Trace:*
 +
 *java.lang.IllegalStateException: Failed to execute 
CommandLineRunnerjava.lang.IllegalStateException: Failed to execute 
CommandLineRunner at 
org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
 at 
org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
 at 
org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:316) 
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186) 
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175) 
at com.felix.Application.main(Application.java:45)Caused by: 
org.apache.spark.SparkException: Job aborted. at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
 at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
 at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at 
org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
 at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) 
at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290) 
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229) at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566) at 
com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24) at 
com.felix.Application.run(Application.java:63) at 
org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
 ... 6 moreCaused by: org.apache.spark.SparkException: Job aborted due to stage 
failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 
in stage 0.0 (TID 0, localhost, executor driver): 
org.apache.spark.SparkException: Task failed while writing rows. at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
 at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 

[jira] [Updated] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-05 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-29764:
-
Description: 
Hello,
 I have been doing a proof of concept for data lake structure and analytics 
using Apache Spark. 
 When I add java.time.LocalDateTime/java.time.LocalDate properties to my data model, 
the serialization to Parquet starts failing.
 *My Data Model:*

@Data
public class Employee {
    private UUID id = UUID.randomUUID();
    private String name;
    private int age;
    private LocalDate dob;
    private LocalDateTime startDateTime;
    private String phone;
    private Address address;
}

 

 *Serialization Snippet*

public void serialize() {
    List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 employee objects

    Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
    Dataset<Employee> employeeDataset = sparkSession.createDataset(inputDataToSerialize, employeeEncoder);

    employeeDataset.write()
        .mode(SaveMode.Append)
        .parquet("/Users/felix/Downloads/spark.parquet");
}

*Exception Stack Trace:*

java.lang.IllegalStateException: Failed to execute CommandLineRunner
  at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
  at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
  at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
  at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
  at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
  at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
  at com.felix.Application.main(Application.java:45)
Caused by: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
  at com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24)
  at com.felix.Application.run(Application.java:63)
  at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
  ... 6 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at
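
(The trace is cut off in the archive.) A second possible workaround, again only a sketch: bypass bean inference for the date fields by declaring the schema explicitly and converting the POJOs to Rows. Everything in ExplicitSchemaWriter below is assumed for illustration; it relies only on the standard SparkSession.createDataFrame(List<Row>, StructType) and RowFactory APIs and on the Lombok-generated getters of Employee.

import static org.apache.spark.sql.types.DataTypes.*;

import java.sql.Date;
import java.sql.Timestamp;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class ExplicitSchemaWriter {

    // Schema declared up front: DATE and TIMESTAMP columns instead of inferred bean structs.
    private static final StructType SCHEMA = new StructType()
            .add("id", StringType)
            .add("name", StringType)
            .add("age", IntegerType)
            .add("dob", DateType)
            .add("startDateTime", TimestampType);

    public static void write(SparkSession sparkSession, List<Employee> employees, String path) {
        // Convert each POJO to a Row, translating java.time values to java.sql at the boundary.
        List<Row> rows = employees.stream()
                .map(e -> RowFactory.create(
                        e.getId().toString(),
                        e.getName(),
                        e.getAge(),
                        e.getDob() == null ? null : Date.valueOf(e.getDob()),
                        e.getStartDateTime() == null ? null : Timestamp.valueOf(e.getStartDateTime())))
                .collect(Collectors.toList());

        Dataset<Row> df = sparkSession.createDataFrame(rows, SCHEMA);
        df.write().mode(SaveMode.Append).parquet(path);
    }
}

This trades the convenience of Encoders.bean for full control over how dob and startDateTime become DATE/TIMESTAMP columns.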

[jira] [Created] (SPARK-29764) Error on Serializing POJO with java datetime property to a Parquet file

2019-11-05 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-29764:


 Summary: Error on Serializing POJO with java datetime property to 
a Parquet file
 Key: SPARK-29764
 URL: https://issues.apache.org/jira/browse/SPARK-29764
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core, SQL
Affects Versions: 2.4.4
Reporter: Felix Kizhakkel Jose


Hello,
I have been doing a proof of concept for a data lake structure and analytics using Apache Spark.
When I add java.time.LocalDateTime/java.time.LocalDate properties to my data model, serialization to Parquet starts failing.
*My Data Model:*

@Data
public class Employee {
 private UUID id = UUID.randomUUID();
 private String name;
 private int age;
 private LocalDate dob;
 private LocalDateTime startDateTime;
 private String phone;
 private Address address;
}

public void serialize() {
    List<Employee> inputDataToSerialize = getInputDataToSerialize(); // this creates 100,000 employee objects

    Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
    Dataset<Employee> employeeDataset = sparkSession.createDataset(
        inputDataToSerialize,
        employeeEncoder
    );

    employeeDataset.write()
        .mode(SaveMode.Append)
        .parquet("/Users/felix/Downloads/spark.parquet");
}
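
A small diagnostic that may help when reproducing this: print the schema the bean encoder derives before attempting the write, so it is visible how the LocalDate/LocalDateTime properties are being mapped. This is only a sketch and uses the public Encoder.schema() API; whether it pinpoints the failing field here is not verified.

import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Print the schema Encoders.bean() infers for the bean. If dob/startDateTime show up as
// nested structs rather than date/timestamp columns, that is the shape the Parquet writer
// will attempt to use when the job fails.
Encoder<Employee> employeeEncoder = Encoders.bean(Employee.class);
System.out.println(employeeEncoder.schema().treeString());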







*Exception Stack Trace:*
java.lang.IllegalStateException: Failed to execute CommandLineRunner
  at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:803)
  at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:784)
  at org.springframework.boot.SpringApplication.afterRefresh(SpringApplication.java:771)
  at org.springframework.boot.SpringApplication.run(SpringApplication.java:316)
  at org.springframework.boot.SpringApplication.run(SpringApplication.java:1186)
  at org.springframework.boot.SpringApplication.run(SpringApplication.java:1175)
  at com.felix.Application.main(Application.java:45)
Caused by: org.apache.spark.SparkException: Job aborted.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:170)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:131)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:155)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:676)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
  at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:290)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
  at com.felix.SparkParquetSerializer.serialize(SparkParquetSerializer.java:24)
  at com.felix.Application.run(Application.java:63)
  at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:800)
  ... 6 more
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:257)
  at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:177)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:123)
  at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:411)
  at