Re: A new external catalog

2018-02-13 Thread Steve Loughran
On 13 Feb 2018, at 21:20, Tayyebi, Ameen wrote: Yes, I’m thinking about upgrading to these: 1.9.0, 1.11.272 (from: 1.7.3, 1.11.76). 272 is the earliest that has Glue. How about I let the build system run the tests and if things start breaking

Re: A new external catalog

2018-02-13 Thread Tayyebi, Ameen
Yes, I’m thinking about upgrading to these: 1.9.0, 1.11.272 (from: 1.7.3, 1.11.76). 272 is the earliest that has Glue. How about I let the build system run the tests and if things start breaking I fall back to shading Glue’s specific SDK? From: Steve Loughran Date:

Inefficient state management in stream to stream join in 2.3

2018-02-13 Thread Yogesh Mahajan
In 2.3, stream-to-stream joins (both inner and outer) are implemented using the symmetric hash join (SHJ) algorithm. That is a good choice, and I am sure you compared it with other families of algorithms such as XJoin and non-blocking sort-based algorithms like progressive merge join (PMJ

Re: A new external catalog

2018-02-13 Thread Steve Loughran
On 13 Feb 2018, at 19:50, Tayyebi, Ameen wrote: The biggest challenge is that I had to upgrade the AWS SDK to a newer version so that it includes the Glue client, since Glue is a new service. So far, I haven’t seen any jar hell issues, but

A new external catalog

2018-02-13 Thread Tayyebi, Ameen
Hello everyone, For those of you not familiar with AWS Glue Catalog, it’s a Hive Metastore implemented as a web service. The Glue service is composed of different components, but the one I’m interested in is the Catalog. Today, there’s a Hive metastore

Re: Regarding NimbusDS JOSE JWT jar 3.9 security vulnerability

2018-02-13 Thread PJ Fanning
Hi Sujith, I didn't find the nimbusds dependency in any spark 2.2 jars. Maybe I missed something. Could you tell us which spark jar has the nimbusds dependency?

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-13 Thread Sameer Agarwal
The issue with SPARK-23292 is that we currently run the python tests related to pandas and pyarrow with python 3 (which is already installed on all amplab jenkins machines). Since the code path is fully tested, we decided to not mark it as a blocker; I've reworded the title to better indicate

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-13 Thread Sean Owen
+1 from me. Again, licenses and sigs look fine. I built the source distribution with "-Phive -Phadoop-2.7 -Pyarn -Pkubernetes" and all tests passed. Remaining issues for 2.3.0, none of which are a Blocker: SPARK-22797 Add multiple column support to PySpark Bucketizer SPARK-23083 Adding

Re: redundant decision tree model

2018-02-13 Thread Alessandro Solimando
Thanks for your feedback Sean, I agree with you. I have logged a JIRA case (https://issues.apache.org/jira/browse/SPARK-23409). I will look at the code in more detail and see if I can come up with a PR to handle this. On 13 February 2018 at 12:00, Sean Owen wrote: > I

Re: Corrupt parquet file

2018-02-13 Thread Steve Loughran
On 12 Feb 2018, at 20:21, Ryan Blue wrote: I wouldn't say we have a primary failure mode that we deal with. What we concluded was that all the schemes we came up with to avoid corruption couldn't cover all cases. For example, what about when

Re: There is no space for new record

2018-02-13 Thread SNEHASISH DUTTA
Hi, Would it be possible to work around this with a Spark configuration tweak, since EMR only offers Spark versions up to 2.2.1? Regards, Snehasish On Tue, Feb 13, 2018 at 2:00 PM, Marco Gaido wrote: > You can check all the versions where the fix is available on

Re: redundant decision tree model

2018-02-13 Thread Sean Owen
I think the simple pruning you have in mind was just never implemented. That sort of pruning wouldn't help much if the nodes maintained a distribution over classes, as those are rarely identical, but they just maintain a single class prediction. After training, I see no value in keeping those

Re: redundant decision tree model

2018-02-13 Thread Alessandro Solimando
Hello Nick, thanks for the pointer, that's interesting. However, there seems to be a major difference from what I was discussing. The JIRA issue relates to overfitting and considerations of information gain, while what I propose is a much simpler "syntactic" pruning. Consider a fragment of the

Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it: https://issues.apache.org/jira/browse/SPARK-3155. It is probably still a useful feature to have for trees but the priority is not that high since it may not be that useful for the tree ensemble models. On Tue, 13 Feb 2018 at 11:52 Alessandro

redundant decision tree model

2018-02-13 Thread Alessandro Solimando
Hello community, I have recently manually inspected some decision trees computed with Spark (2.2.1, but the behavior is the same with the latest code on the repo). I have observed that the trees are always complete, even if an entire subtree leads to the same prediction in its different leaves.
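To make the redundancy concrete, here is a minimal sketch in plain Python, not Spark's actual tree classes. It assumes a binary tree in which every internal node has exactly two children and every node carries a single class prediction. The names Node, prune and count_nodes are hypothetical and chosen only for illustration. The idea is to collapse, bottom-up, any subtree whose leaves all predict the same class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical tree node: a single prediction plus optional children."""
    prediction: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def prune(node: Node) -> Node:
    """Return a copy of the tree in which every subtree whose leaves all
    share one prediction is replaced by a single leaf.

    Assumes every internal node has both a left and a right child.
    """
    if node.is_leaf:
        return Node(node.prediction)
    left = prune(node.left)
    right = prune(node.right)
    # After pruning the children, two sibling leaves with the same
    # prediction make the split above them redundant.
    if left.is_leaf and right.is_leaf and left.prediction == right.prediction:
        return Node(left.prediction)
    return Node(node.prediction, left, right)


def count_nodes(node: Node) -> int:
    """Count all nodes (internal and leaf) in the tree."""
    if node.is_leaf:
        return 1
    return 1 + count_nodes(node.left) + count_nodes(node.right)


# The left subtree splits, but both of its leaves predict class 1.0.
tree = Node(0.0, Node(1.0, Node(1.0), Node(1.0)), Node(0.0))
pruned = prune(tree)
```

In this example the left subtree collapses into one leaf and the root split is kept, because its two children still disagree. Predictions are unchanged for every input, since only splits that cannot affect the outcome are removed.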

Re: There is no space for new record

2018-02-13 Thread Marco Gaido
You can check all the versions where the fix is available on the JIRA SPARK-23376. Anyway it will be available in the upcoming 2.3.0 release. Thanks. On 13 Feb 2018 9:09 a.m., "SNEHASISH DUTTA" wrote: > Hi, > > In which version of Spark will this fix be available ? >

Re: There is no space for new record

2018-02-13 Thread SNEHASISH DUTTA
Hi, In which version of Spark will this fix be available? The deployment is on EMR. Regards, Snehasish On Fri, Feb 9, 2018 at 8:51 PM, Wenchen Fan wrote: > It should be fixed by https://github.com/apache/spark/pull/20561 soon. > > On Fri, Feb 9, 2018 at 6:16 PM, Wenchen