Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-05 Thread Wenchen Fan
I think many advanced Spark users already have custom Catalyst rules to deal with the query plan directly, so it makes a lot of sense to standardize the logical plan. However, instead of exploring possible operations ourselves, I think we should follow the SQL standard. ReplaceTable, RTAS:

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-05 Thread Ryan Blue
Thanks for responding! I’ve been coming up with a list of the high-level operations that are needed. I think all of them come down to 5 questions about what’s happening: - Does the target table exist? - If it does exist, should it be dropped? - If not, should it get created? - Should

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
Sure. Obviously, there is going to be some overlap as the project transitions to being part of mainline Spark development. As long as you are consciously working toward moving discussions into this dev list, then all is good. On Mon, Feb 5, 2018 at 1:56 PM, Matt Cheah wrote:

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Matt Cheah
I think in this case, the original design that was proposed before the document was implemented on the Spark on K8s fork, which we took some time to build separately before proposing that the fork be merged into the mainline. Specifically, the timeline of events was: We started building

Re: Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Mark Hamstra
That's good, but you should probably stop and consider whether the discussions that led up to this document's creation could have taken place on this dev list -- because if they could have, then they probably should have as part of the whole spark-on-k8s project becoming part of mainline Spark

Spark on Kubernetes Builder Pattern Design Document

2018-02-05 Thread Matt Cheah
Hi everyone, While we were building the Spark on Kubernetes integration, we realized that some of the abstractions we introduced for building the driver application in spark-submit, and building executor pods in the scheduler backend, could be improved for better readability and clarity. We

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
In that case, I'd recommend tracking down the node where the files were created and reporting it to EMR. On Mon, Feb 5, 2018 at 10:38 AM, Dong Jiang wrote: > Thanks for the response, Ryan. > > We have transient EMR cluster, and we do rerun the cluster whenever the > cluster

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
Hi, Ryan, Do you have any suggestions on how we could detect and prevent this issue? This is the second time we have encountered it. We have a wide table, with 134 columns in the file. The issue seems to impact only one column and is very hard to detect. It seems you have encountered this issue

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
Thanks for the response, Ryan. We have a transient EMR cluster, and we rerun the cluster whenever it fails. However, in this particular case, the cluster succeeded without reporting any errors. I was able to null out the corrupted column and recover the remaining 133 columns. I do

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
We ensure the bad node is removed from our cluster and reprocess to replace the data. We only see this once or twice a year, so it isn't a significant problem. We've discussed options for adding write-side validation, but it is expensive and still unreliable if you don't trust the hardware. rb
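
To make the write-side validation trade-off concrete, here is a minimal sketch of the general idea in plain Python. It is not Spark or Parquet code, and the function names and file name are hypothetical: the writer hashes the payload, writes it, reads it back, and compares digests.

```python
import hashlib
import os
import tempfile


def sha256_of(data: bytes) -> str:
    # Hex digest of the given bytes.
    return hashlib.sha256(data).hexdigest()


def write_with_readback(path: str, payload: bytes) -> bool:
    """Write payload to path, read it back, and compare digests.

    Hypothetical helper for illustration only.
    """
    expected = sha256_of(payload)
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to storage
    with open(path, "rb") as f:
        actual = sha256_of(f.read())
    return actual == expected


with tempfile.TemporaryDirectory() as tmp:
    ok = write_with_readback(os.path.join(tmp, "part-00000.bin"), b"example column data")
    print(ok)
```

The sketch also shows why such a check is expensive and only partly reliable: every file is read a second time, and both the digest and the read-back run on the same machine, so data that was already corrupted in memory before hashing would still pass.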

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
If you can still access the logs, then you should be able to find where the write task ran. Maybe you can get an instance ID and open a ticket with Amazon. Otherwise, it will probably start failing the HW checks when the instance hardware is reused, so I wouldn't worry about it. The _SUCCESS file

Re: Corrupt parquet file

2018-02-05 Thread Dong Jiang
Hi, Ryan, Many thanks for your quick response. We ran Spark on transient EMR clusters. Nothing in the log or EMR events suggests any issues with the cluster or the nodes. We also see the _SUCCESS file on S3. If we see the _SUCCESS file, does that suggest all data is good? How can we prevent

Re: Corrupt parquet file

2018-02-05 Thread Ryan Blue
Dong, We see this from time to time as well. In my experience, it is almost always caused by a bad node. You should try to find out where the file was written and remove that node as soon as possible. As far as finding out what is wrong with the file, that's a difficult task. Parquet's encoding

Re: Union in Spark context

2018-02-05 Thread Suchith J N
Thank you very much. I had overlooked the differences between the two. The public API part is understandable. Coming to the second part: I see that it creates an instance of UnionRDD with all RDDs as parents, thereby preventing a long lineage chain. Is my understanding correct? On 5 February 2018 at
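
The lineage point can be illustrated with a toy model in plain Python. This is a sketch of the idea only, not Spark's implementation: `Node`, `binary_union`, `nary_union`, and `depth` are hypothetical names standing in for RDDs and the two union styles.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy stand-in for an RDD: a name plus its parent nodes."""
    name: str
    parents: List["Node"] = field(default_factory=list)


def binary_union(a: Node, b: Node) -> Node:
    # Models a pairwise union: the result has exactly two parents.
    return Node("union", [a, b])


def nary_union(nodes: List[Node]) -> Node:
    # Models a union over a whole collection: every input is a direct parent.
    return Node("union", list(nodes))


def depth(node: Node) -> int:
    # Length of the longest parent chain below this node.
    if not node.parents:
        return 0
    return 1 + max(depth(p) for p in node.parents)


sources = [Node(f"rdd{i}") for i in range(100)]

# Chaining pairwise unions nests one union node inside the next.
chained = sources[0]
for s in sources[1:]:
    chained = binary_union(chained, s)

# A single n-ary union keeps all inputs one step away from the result.
flat = nary_union(sources)

print(depth(chained))  # 99
print(depth(flat))     # 1
```

In this model, folding 100 inputs together pairwise yields a chain 99 levels deep, while the n-ary form stays at depth 1 regardless of the number of inputs.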

Corrupt parquet file

2018-02-05 Thread Dong Jiang
Hi, We are running Spark 2.2.1 and generating Parquet files with something like the following pseudocode: df.write.parquet(...) We have recently noticed Parquet file corruption when reading the files in Spark or Presto, such as the following: Caused by: org.apache.parquet.io.ParquetDecodingException: Can not

Re: Union in Spark context

2018-02-05 Thread Mark Hamstra
First, the public API cannot be changed except when there is a major version change, and there is no way that we are going to do Spark 3.0.0 just for this change. Second, the change would be a mistake since the two different union methods are quite different. The method in RDD only ever works on

Re: Union in Spark context

2018-02-05 Thread 0xF0F0F0
There is one on RDD, but `SparkContext.union` prevents lineage from growing. Check https://stackoverflow.com/q/34461804 On February 5, 2018 5:04 PM, Suchith J N wrote: > Hi,

Union in Spark context

2018-02-05 Thread Suchith J N
Hi, This seems like a simple cleanup: why do we have union() on RDDs in SparkContext? Shouldn't it reside in RDD? There is one in RDD, but it seems like a wrapper around this. Regards, Suchith