Re: K Means Clustering Explanation

2018-03-02 Thread Alessandro Solimando
Hi Matt, similarly to what Christoph does, I first derive the cluster id for the elements of my original dataset, and then I use a classification algorithm (cluster ids being the classes here). For this method to be useful you need a "human-readable" model; tree-based models are generally a good c
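
For illustration, a minimal sketch of that approach with Spark ML (all names are made up; it assumes a DataFrame df that already has a vector column "features"):

  import org.apache.spark.ml.clustering.KMeans
  import org.apache.spark.ml.classification.DecisionTreeClassifier
  import org.apache.spark.sql.functions.col

  // Step 1: cluster the original dataset and keep the cluster id as a new column.
  val kmeans = new KMeans().setK(3).setFeaturesCol("features").setPredictionCol("clusterId")
  val clustered = kmeans.fit(df).transform(df)
    .withColumn("clusterId", col("clusterId").cast("double"))

  // Step 2: fit an interpretable classifier that predicts the cluster id;
  // the learned splits then act as a human-readable description of the clusters.
  val tree = new DecisionTreeClassifier().setFeaturesCol("features").setLabelCol("clusterId")
  println(tree.fit(clustered).toDebugString)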

Re: K Means Clustering Explanation

2018-03-04 Thread Alessandro Solimando
On 2 March 2018 at 15:42, Matt Hicks wrote: > Thanks Alessandro and Christoph. I appreciate the feedback, but I'm still > having issues determining how to actually accomplish this with the API. > > Can anyone point me to an example in code showing how to accomplish this? > >

Re: Union of multiple data frames

2018-04-06 Thread Alessandro Solimando
Hello Cesar, can you add some details like: number of columns, avg number of rows in the DFs, time spent to compute the plan with all the unions, and the time needed to perform the action? Thanks, Alessandro On 5 April 2018 at 23:22, Cesar wrote: > Thanks for your answers. > > The suggested met

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Alessandro Solimando
Hi Shiyuan, can you show us the output of "explain" over df (as a last step)? On 11 April 2018 at 19:47, Shiyuan wrote: > Variable name binding is a python thing, and Spark should not care how the > variable is named. What matters is the dependency graph. Spark fails to > handle this dependency
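
For reference, a minimal way to produce that output, assuming df is the final DataFrame in the script:

  // Prints the parsed, analyzed, optimized and physical plans for df.
  df.explain(true)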

Re: spark sql StackOverflow

2018-05-14 Thread Alessandro Solimando
Hi, I am not familiar with ATNConfigSet, but some thoughts that might help. How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time? Are these records ingested in real-time or are they loaded from a datastore? In the latter case the DB might be able t

Re: spark sql StackOverflow

2018-05-15 Thread Alessandro Solimando
text files that would be copied in some directory over and over > > Are you suggesting that I don't need to use spark-streaming? > > On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando wrote > --

Re: [Spark SQL] error in performing dataset union with complex data type (struct, list)

2018-06-03 Thread Alessandro Solimando
Hi Pranav, I don't have an answer to your issue, but what I generally do in these cases is to first try to simplify it to a point where it is easier to check what's going on, and then add back "pieces" one by one until I spot the error. In your case I can suggest to: 1) project the dataset to t
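
As a rough illustration of that "simplify and re-add pieces" idea (dataset and column names are made up):

  import org.apache.spark.sql.functions.col

  // Start from a projection with only a couple of simple columns and check
  // that the union works, then re-add the complex (struct/array) columns
  // one at a time until the offending one shows up.
  val simple = Seq("id", "plain_col")
  ds1.select(simple.map(col): _*).union(ds2.select(simple.map(col): _*)).printSchema()

  val withStruct = simple :+ "nested_struct_col"
  ds1.select(withStruct.map(col): _*).union(ds2.select(withStruct.map(col): _*)).printSchema()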

Re: Need to convert Dataset to HashMap

2018-09-28 Thread Alessandro Solimando
Hi, as a first attempt I would try to cache "freq", to be sure that the dataset is not re-loaded at each iteration later on. Btw, what's the original data format you are importing from? I also suspect that an appropriate case class rather than Row would help, instead of converting to Stri
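
A minimal sketch of the caching part (names are illustrative; collecting to a Map is only reasonable if the grouped result is small enough for the driver):

  // Cache the aggregated dataset so the source is not re-read later on.
  val freq = df.groupBy("key").count().cache()

  // Collect into a plain Scala Map on the driver (assumes a string key).
  val freqMap: Map[String, Long] =
    freq.collect().map(r => r.getString(0) -> r.getLong(1)).toMap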

Re: Need to convert Dataset to HashMap

2018-09-28 Thread Alessandro Solimando
Hi, sorry, indeed you have to cache the dataset before the groupby (otherwise it will be loaded from disk each time). For the case class you can have a look at the accepted answer here: https://stackoverflow.com/questions/45017556/how-to-convert-a-simple-dataframe-to-a-dataset-spark-scala-with-
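
Along the lines of that answer, a hedged sketch of the case-class route (field names are made up, and a SparkSession named spark is assumed to be in scope):

  import spark.implicits._

  case class Record(key: String, value: Double)

  // Typed Dataset instead of Row-based DataFrame, cached before the groupBy
  // so the underlying data is read only once.
  val ds = df.as[Record].cache()
  val counts = ds.groupBy($"key").count()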

Re: Re: spark-sql force parallel union

2018-11-21 Thread Alessandro Solimando
Hello, maybe I am overlooking the problem but I would go for something similar: def unionDFs(dfs: List[DataFrame]): DataFrame = { dfs.drop(1).foldRight(dfs.apply(0))((df1: DataFrame, df2: DataFrame) => df1 union df2) } (Would be better to keep dfs as-is and use an empty DF with the co
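
A more compact equivalent, as a sketch (it assumes the list is non-empty and that all DataFrames share the same schema):

  import org.apache.spark.sql.DataFrame

  // Fold the whole list into a single DataFrame with pairwise unions.
  def unionDFs(dfs: List[DataFrame]): DataFrame = dfs.reduce(_ union _)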

Re: Structured Streaming & Query Planning

2019-03-14 Thread Alessandro Solimando
Hello Paolo, generally speaking, query planning is mostly based on statistics and distributions of data values for the involved columns, which might significantly change over time in a streaming context, so for me it makes a lot of sense that it is run at every schedule, even though I understand yo

Re: Recover RFormula Column Names

2019-10-29 Thread Alessandro Solimando
Hello Andrew, a few years ago I had the same need and I found this SO answer to be the way to go. Here is an extract of my (Scala) code (which was doing other things on top); I have removed the irrelevant parts but without testing it, so it might not work out o
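
Since the original extract is cut off here, a hedged sketch of the metadata-based approach (column names are illustrative; df is assumed to already hold the label and raw feature columns):

  import org.apache.spark.ml.attribute.AttributeGroup
  import org.apache.spark.ml.feature.RFormula

  val formula = new RFormula().setFormula("label ~ .").setFeaturesCol("features")
  val transformed = formula.fit(df).transform(df)

  // The ML attribute metadata on the "features" column carries one entry
  // per generated feature, including its name.
  val featureNames: Seq[String] =
    AttributeGroup.fromStructField(transformed.schema("features"))
      .attributes
      .map(_.flatMap(_.name).toSeq)
      .getOrElse(Seq.empty)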

Re: Recover RFormula Column Names

2019-10-30 Thread Alessandro Solimando
as > the metadata is carried over. > > Andrew > > On Tue, Oct 29, 2019 at 5:26 AM Alessandro Solimando <alessandro.solima...@gmail.com> wrote: > >> Hello Andrew, >> a few years ago I had the same need and I found this SO answer >> <https://

Re: How to troubleshoot MetadataFetchFailedException: Missing an output location for shuffle 0

2019-12-16 Thread Alessandro Solimando
Hi Warren, it's often an exception stemming from an OOM at the executor level. If you are caching data, make sure you spill to disk if needed. You could also try to increase off-heap memory to alleviate the issue. Of course, giving the executor more memory also helps. Best regards, Alessandr
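
For illustration only, the kind of knobs being referred to (values are placeholders to be tuned per job):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  val spark = SparkSession.builder()
    .config("spark.executor.memory", "8g")            // more memory per executor
    .config("spark.memory.offHeap.enabled", "true")   // allow off-heap allocation
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()

  // When caching, prefer a storage level that can spill to disk:
  // df.persist(StorageLevel.MEMORY_AND_DISK)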