Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-07 Thread Alessandro Solimando
+1 (non-binding) I have been following this standardization effort and I think it is sound and it provides the needed flexibility via the option. Best regards, Alessandro On Mon, 7 Oct 2019 at 10:24, Gengliang Wang wrote: > Hi everyone, > > I'd like to call for a new vote on SPARK-28885 >
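
For context, a minimal sketch of how the option under vote can be exercised, assuming the spark.sql.storeAssignmentPolicy setting that this line of work introduces; the behavior noted in the comments reflects the documented intent, so treat the example as illustrative:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("store-assignment-policy-demo")
    .getOrCreate()

  // ANSI store assignment rejects unreasonable casts (e.g. string -> int)
  // at insertion time; LEGACY silently coerces them (often to NULL).
  spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

  spark.sql("CREATE TABLE target (i INT) USING parquet")
  // Expected to fail analysis under ANSI rules, but succeed (inserting NULL)
  // under LEGACY:
  // spark.sql("INSERT INTO target VALUES ('not-a-number')")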

Re: Thoughts on dataframe cogroup?

2019-02-20 Thread Alessandro Solimando
Hello, I fail to see how an equi-join on the key columns differs from the cogroup you propose. I think the accepted answer can shed some light: https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark Now you apply a UDF on each iterable,
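
For readers following along, a minimal sketch contrasting a plain equi-join with the typed cogroup already available on KeyValueGroupedDataset (assuming a SparkSession in scope as `spark`; the per-key logic is just a placeholder):

  import spark.implicits._

  val left  = Seq((1, "a"), (1, "b"), (2, "c")).toDS()
  val right = Seq((1, 10), (2, 20)).toDS()

  // Equi-join: one output row per matching pair of rows.
  val joined = left.joinWith(right, left("_1") === right("_1"))

  // Cogroup: the function receives, per key, all rows from both sides at once,
  // so arbitrary logic can run over the two iterators together.
  val cogrouped = left.groupByKey(_._1).cogroup(right.groupByKey(_._1)) {
    (key, leftRows, rightRows) => Iterator((key, leftRows.size, rightRows.size))
  }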

Re: [DISCUSS] Default values and data sources

2018-12-20 Thread Alessandro Solimando
Hello, I agree that Spark should check whether the underlying data source supports default values or not, and adjust its behavior accordingly. If we follow this direction, do you see the default-values capability in scope of the "DataSourceV2 capability API"? Best regards, Alessandro On Fri, 21
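
To make the question concrete, a purely hypothetical sketch of what such a capability flag could look like; none of the names below exist in Spark, they only illustrate the shape of the check being discussed:

  // Hypothetical trait, invented for illustration only.
  trait SupportsDefaultValues {
    /** Whether the source can persist and apply column default values. */
    def acceptsDefaultValues: Boolean
  }

  // Spark-side validation: fail fast when defaults are requested but the
  // source cannot honor them, otherwise delegate to the source.
  def validateDefaults(source: AnyRef, schemaHasDefaults: Boolean): Unit = source match {
    case s: SupportsDefaultValues if s.acceptsDefaultValues => ()
    case _ if schemaHasDefaults =>
      throw new IllegalArgumentException("data source does not support default values")
    case _ => ()
  }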

Re: Pushdown in DataSourceV2 question

2018-12-10 Thread Alessandro Solimando
filter (there >> can be many depending on your scenario, the format etc.). Instead of >> “discussing” between Spark and the data source, it is much less costly for >> Spark to check that the filters are consistently applied. >> >> On 09.12.2018 at 12:

Re: Pushdown in DataSourceV2 question

2018-12-09 Thread Alessandro Solimando
Hello, that's an interesting question, but after Frank's reply I am a bit puzzled. If there is no control over the pushdown status, how can Spark guarantee the correctness of the final query? Consider a filter pushed down to the data source: either Spark has to know whether it has been applied or not,
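
For reference, the DataSourceV2 reader contract of that era addresses exactly this point: SupportsPushDownFilters.pushFilters returns the filters the source did not fully handle, and Spark keeps evaluating those on top of the scan. A sketch of the idea (standalone, not implementing the full reader interface, so names other than Filter/EqualTo are illustrative):

  import org.apache.spark.sql.sources.{EqualTo, Filter}

  // A source that only understands equality filters. Everything it cannot
  // handle is returned as a residual, which Spark re-applies after the scan,
  // so correctness never depends on trusting the source blindly.
  class EqualityOnlyPushdown /* would mix in SupportsPushDownFilters */ {
    private var pushed: Array[Filter] = Array.empty

    def pushFilters(filters: Array[Filter]): Array[Filter] = {
      val (accepted, residual) = filters.partition(_.isInstanceOf[EqualTo])
      pushed = accepted
      residual
    }

    def pushedFilters(): Array[Filter] = pushed
  }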

Re: Array indexing functions

2018-11-20 Thread Alessandro Solimando
Hi Petar, I have implemented similar functions a few times through ad-hoc UDFs in the past, so +1 from me. Can you elaborate a bit more on how you practically implement those functions? Are they UDFs or "native" functions like those in the sql.functions package? I am asking because I wonder if/how
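
As an example of the ad-hoc UDF route mentioned above, a minimal sketch of an array indexing function (the name and the 1-based/0 convention are just illustrative choices):

  import org.apache.spark.sql.functions.{lit, udf}

  // Returns the 1-based position of an element in an array column, 0 if absent.
  // A "native" (catalyst) implementation would avoid the UDF ser/de overhead.
  val arrayPosition = udf { (xs: Seq[Int], x: Int) => xs.indexOf(x) + 1 }

  // Usage, assuming a DataFrame `df` with an array<int> column named "values":
  // df.select(arrayPosition(df("values"), lit(3)).as("pos"))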

Re: [DISCUSS] Syntax for table DDL

2018-10-02 Thread Alessandro Solimando
I agree with Ryan: a "standard" and more widely adopted syntax is usually a good idea, possibly with some slight improvements like "bulk deletion" of columns (especially because both the syntax and the semantics are clear), rather than staying with the Hive syntax at any cost. I am personally following
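
For concreteness, the kind of "bulk deletion" of columns referred to above, compared with issuing one statement per column (illustrative DDL only; the exact syntax accepted depends on the Spark version and the underlying catalog):

  // Single statement dropping several columns at once:
  spark.sql("ALTER TABLE db.events DROP COLUMNS (debug_payload, legacy_id)")

  // versus one statement per column:
  spark.sql("ALTER TABLE db.events DROP COLUMN debug_payload")
  spark.sql("ALTER TABLE db.events DROP COLUMN legacy_id")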

Re: [VOTE] SPIP: Standardize SQL logical plans

2018-07-18 Thread Alessandro Solimando
+1 (non-binding) On 18 July 2018 at 17:32, Xiao Li wrote: > +1 (binding) > > Like what Ryan and I discussed offline, the contents of the implementation > sketch are not part of this vote. > > Cheers, > > Xiao > > 2018-07-18 8:00 GMT-07:00 Russell Spitzer : > >> +1 (non-binding) >> >> On Wed, Jul 18,

Re: redundant decision tree model

2018-02-16 Thread Alessandro Solimando
if interested: https://github.com/apache/spark/pull/20632 On 13 February 2018 at 14:39, Alessandro Solimando < alessandro.solima...@gmail.com> wrote: > Thanks for your feedback Sean, I agree with you. > > I have logged a JIRA case (https://issues.apache.org/jira/browse/SPARK-2

Re: redundant decision tree model

2018-02-13 Thread Alessandro Solimando
value in > keeping those nodes. Whatever impurity gain the split managed on the > training data is 'lost' when the prediction is collapsed to a single class > anyway. > > Whether it's easy to implement in the code I don't know, but it's > straightforward conceptually. > > On Tue,

Re: redundant decision tree model

2018-02-13 Thread Alessandro Solimando
It is probably still a useful feature to have for trees, but the priority > is not that high since it may not be that useful for the tree ensemble > models. > > > On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando < > alessandro.solima...@gmail.com> wrote: > >> Hello com

redundant decision tree model

2018-02-13 Thread Alessandro Solimando
Hello community, I have recently manually inspected some decision trees computed with Spark (2.2.1, but the behavior is the same with the latest code on the repo). I have observed that the trees are always complete, even if an entire subtree leads to the same prediction in its different leaves.
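
A minimal sketch of the collapse rule being described, over a simplified tree type (Spark's internal Node/LeafNode constructors are private, so the types below are illustrative, and impurity/probability information is ignored here even though the real model may need to keep it):

  sealed trait SimpleNode
  case class Leaf(prediction: Double) extends SimpleNode
  case class Internal(left: SimpleNode, right: SimpleNode) extends SimpleNode

  // All distinct predictions reachable from a node.
  def predictions(n: SimpleNode): Set[Double] = n match {
    case Leaf(p)        => Set(p)
    case Internal(l, r) => predictions(l) ++ predictions(r)
  }

  // Collapse any subtree whose leaves all agree into a single leaf,
  // which removes the redundant splits without changing any prediction.
  def prune(n: SimpleNode): SimpleNode = n match {
    case leaf: Leaf => leaf
    case Internal(l, r) =>
      val ps = predictions(n)
      if (ps.size == 1) Leaf(ps.head) else Internal(prune(l), prune(r))
  }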

transformSchema method policy for "duplicated" column names

2018-01-13 Thread Alessandro Solimando
Hello everyone, after one month without any reply on Stack Overflow (https://stackoverflow.com/questions/47789265/inconsistency-in-handling-duplicate-names-in-dataframe-schema) I am trying to pose the question here. Context: I am refactoring some code of mine, transforming Scala methods with a
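
As an illustration of the policy in question, a small helper that detects duplicated column names in a schema up front, instead of letting them surface later as ambiguous-column errors (the case-sensitivity flag is an assumption for the sketch, not Spark's actual resolution rule):

  import org.apache.spark.sql.types.StructType

  def duplicateColumns(schema: StructType, caseSensitive: Boolean = false): Seq[String] = {
    val names =
      if (caseSensitive) schema.fieldNames.toSeq
      else schema.fieldNames.toSeq.map(_.toLowerCase)
    names.groupBy(identity).collect { case (name, occ) if occ.size > 1 => name }.toSeq
  }

  // A transformSchema implementation could call this and fail fast when the
  // result is non-empty, making the handling of duplicates explicit.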