Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Jonathan Winandy
2019 at 6:55 AM Jonathan Winandy <jonathan.wina...@gmail.com> wrote: > >> Hello Snehasish >> >> If you are not using UDFs, you will have very similar performance with >> those languages on SQL. >> >> So it comes down to: >> * if you know python

Re: Benchmark Java/Scala/Python for Apache spark

2019-03-11 Thread Jonathan Winandy
Hello Snehasish If you are not using UDFs, you will have very similar performance with those languages on SQL. So it comes down to: * if you know python, go for python. * if you are used to the JVM, and are ready for a bit of a paradigm shift, go for Scala. Our team is using Scala, however we help

Re: Thoughts on dataframe cogroup?

2019-02-25 Thread Jonathan Winandy
For info, our team has defined its own cogroup on DataFrames in the past, on different projects, using different methods (RDD[Row] based, or union-all + collect_list based). I might be biased, but I find the approach very useful in projects to simplify and speed up transformations, and remove a lot of
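For reference, a minimal sketch of what the RDD[Row]-based variant can look like; the key-column parameter and the result shape are illustrative assumptions, not the team's actual code:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Key both DataFrames' underlying RDD[Row]s on a column and cogroup them:
// one output record per key, carrying all left rows and all right rows.
def cogroupByKey(left: DataFrame, right: DataFrame, key: String)
    : RDD[(Any, (Iterable[Row], Iterable[Row]))] = {
  val l = left.rdd.keyBy(_.getAs[Any](key))
  val r = right.rdd.keyBy(_.getAs[Any](key))
  l.cogroup(r)
}

Later Spark versions also expose a typed cogroup directly, via Dataset.groupByKey(...).cogroup(...), which covers much of the same ground.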

Re: Spark madness

2017-05-22 Thread Jonathan Winandy
Hi Saikat, You may be using the wrong mailing list for your question (=> spark user). If you want to make a single string, it's: red.collect.mkString("\n"). Be careful of driver explosion! Cheers, Jonathan On Fri, 19 May 2017, 05:21 Saikat Kanjilal, wrote: > One additional
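For context, the suggested pattern looks like this as a self-contained sketch (the variable names and sample data are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("mkstring").setMaster("local[*]"))
val lines = sc.parallelize(Seq("line1", "line2", "line3"))

// collect() pulls every element back to the driver; this is only safe for
// small data, hence the "driver explosion" warning above.
val joined: String = lines.collect().mkString("\n")
println(joined)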

Re:

2015-08-06 Thread Jonathan Winandy
Please tell me what you think. Have a nice day, Jonathan On 5 August 2015 at 19:18, Jonathan Winandy jonathan.wina...@gmail.com wrote: Hello ! You could try something like that : def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = { val context: SparkContext = rdd.sparkContext

Re:

2015-08-05 Thread Jonathan Winandy
Hello ! You could try something like that : def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = { val context: SparkContext = rdd.sparkContext val grp: String = Random.alphanumeric.take(10).mkString context.setJobGroup(grp, "exist") val count: Accumulator[Long] =
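The preview cuts off mid-function; a hedged reconstruction of the idea (Spark 1.x API) — count matches in an accumulator and cancel the job group once n matches have been seen — might look like this. The polling and cancellation details are an assumption, not the original code:

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.Random
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.rdd.RDD

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = {
  val context: SparkContext = rdd.sparkContext
  val grp: String = Random.alphanumeric.take(10).mkString
  val count: Accumulator[Long] = context.accumulator(0L)

  // Run the counting job in its own job group so it can be cancelled early.
  val job = Future {
    context.setJobGroup(grp, "exist")
    rdd.foreach(t => if (f(t)) count += 1L)
  }

  // Poll the partially-merged accumulator; stop the job once n matches are seen.
  while (!job.isCompleted && count.value < n) Thread.sleep(50)
  if (!job.isCompleted) context.cancelJobGroup(grp)

  count.value >= n
}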

Re: New Feature Request

2015-07-31 Thread Jonathan Winandy
Hello ! You could try something like that : def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = { rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n } It would work for large datasets and large values of n. Have a nice day, Jonathan On 31 July 2015 at 11:29, Carsten
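Restored as a complete snippet (Spark 1.x API) with an illustrative call; the timeout, threshold, and sample data are just examples:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Approximate whether more than n elements satisfy f, using countApprox
// instead of a full count.
def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean =
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n

val sc = new SparkContext(new SparkConf().setAppName("exists").setMaster("local[*]"))
val numbers = sc.parallelize(1 to 1000000)
println(exists(numbers)(_ % 7 == 0, n = 100)) // more than 100 multiples of 7?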

Re: Converting DataFrame to RDD of case class

2015-07-27 Thread Jonathan Winandy
Hello ! Can both methods be compared in terms of performance? I tried the pull request and it felt slow compared to manual mapping. Cheers, Jonathan On Mon, Jul 27, 2015, 8:51 PM Reynold Xin r...@databricks.com wrote: There is this pull request: https://github.com/apache/spark/pull/5713 We mean
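For reference, the "manual mapping" approach mentioned here is essentially the following kind of code; the case class and column names are illustrative, since the real schema isn't shown in the thread:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Illustrative target type.
case class Person(name: String, age: Int)

// Manual mapping: pull each field out of the Row by name.
def toPersons(df: DataFrame): RDD[Person] =
  df.rdd.map(row => Person(row.getAs[String]("name"), row.getAs[Int]("age")))

In later Spark versions the typed Dataset API (df.as[Person] with an implicit Encoder) covers the same need.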

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Jonathan Winandy
by default). Regards, Olivier 2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com: Ahoy ! Maybe you can get countByValue by using sql.GroupedData : // some DF val df: DataFrame = sqlContext.createDataFrame(sc.parallelize(List("A", "B", "B", "A")).map(Row.apply

Re: countByValue on dataframe with multiple columns

2015-07-20 Thread Jonathan Winandy
Ahoy ! Maybe you can get countByValue by using sql.GroupedData : // some DF val df: DataFrame = sqlContext.createDataFrame(sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)), StructType(List(StructField("n", StringType)))) df.groupBy("n").count().show() // generic def countByValueDf(df: DataFrame)
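Restored with the quoting the archive stripped, plus a guess at the generic helper the message was heading towards (one groupBy/count per column, mirroring RDD.countByValue column by column); it assumes an existing sc and sqlContext, as in a 2015-era spark-shell:

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// some DF
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row(_)),
  StructType(List(StructField("n", StringType))))

df.groupBy("n").count().show()

// generic: one value-count DataFrame per column
def countByValueDf(df: DataFrame): Map[String, DataFrame] =
  df.columns.map(c => c -> df.groupBy(c).count()).toMap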