Hello Snehasish
If you are not using UDFs, you will get very similar performance from those languages on Spark SQL.
So it comes down to:
* if you know Python, go for Python.
* if you are used to the JVM, and are ready for a bit of a paradigm shift, go for Scala.
Our team is using Scala; however, we help
For info, our team has defined its own cogroup on DataFrames in the past, on different projects and using different methods (RDD[Row] based, or union all + collect_list based).
I might be biased, but I find the approach very useful in projects to simplify and speed up transformations, and remove a lot of
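To make the second flavour concrete, here is a minimal sketch of a union + collect_list style cogroup on DataFrames; the column names (key, value) and the exact aggregation are assumptions, not the code from those projects:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, collect_list, lit, when}

def cogroupDf(left: DataFrame, right: DataFrame): DataFrame = {
  // Tag each side and union them so a single groupBy gathers both sides.
  val tagged = left.select(col("key"), col("value"), lit("l").as("side"))
    .union(right.select(col("key"), col("value"), lit("r").as("side")))
  // collect_list skips nulls, so `when` without `otherwise` keeps the sides apart.
  tagged.groupBy("key").agg(
    collect_list(when(col("side") === "l", col("value"))).as("left_values"),
    collect_list(when(col("side") === "r", col("value"))).as("right_values")
  )
}

The RDD[Row] based variant would instead drop to df.rdd, key the rows, call cogroup there, and rebuild a DataFrame from the result.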
Hi Saikat,
You may be using the wrong mailing list for your question (=> spark user).
If you want to make a single string, it's:
red.collect.mkString("\n")
Be careful of driver explosion!
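For a sense of what that warning means, a minimal, self-contained sketch (variable names made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("mkstring"))
val rdd = sc.parallelize(Seq("line1", "line2", "line3"))
// collect() materialises the whole RDD on the driver before mkString joins it,
// so this is only safe when the data is known to be small.
val single: String = rdd.collect().mkString("\n")
// rdd.toLocalIterator.mkString("\n") fetches one partition at a time instead,
// but the resulting string is still held entirely in driver memory.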
Cheers,
Jonathan
On Fri, 19 May 2017, 05:21 Saikat Kanjilal, wrote:
Please tell me what you think.
Have a nice day,
Jonathan
On 5 August 2015 at 19:18, Jonathan Winandy jonathan.wina...@gmail.com
wrote:
Hello !
You could try something like that :
import scala.util.Random
import org.apache.spark.{Accumulator, SparkContext}
import org.apache.spark.rdd.RDD

def exists[T](rdd: RDD[T])(f: T => Boolean, n: Long): Boolean = {
  val context: SparkContext = rdd.sparkContext
  val grp: String = Random.alphanumeric.take(10).mkString
  context.setJobGroup(grp, "exist")
  val count: Accumulator[Long] =
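The archived message is cut off at the accumulator; a hypothetical sketch of how this accumulator-and-job-group approach might continue (the continuation below is an assumption, not the original code):

  val count: Accumulator[Long] = context.accumulator(0L, "matches")
  try {
    // Count matching elements; a driver-side watcher thread could poll
    // count.value and call context.cancelJobGroup(grp) once n is reached.
    rdd.foreach { x => if (f(x)) count += 1L }
  } catch {
    case _: Exception => () // tolerate an early cancellation of the job group
  }
  context.clearJobGroup()
  count.value >= n
}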
Hello !
You could try something like that :
def exists[T](rdd: RDD[T])(f: T => Boolean, n: Int): Boolean = {
  rdd.filter(f).countApprox(timeout = 1).getFinalValue().low > n
}
It would work for large datasets and large values of n.
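For example, with hypothetical data:

val numbers = sc.parallelize(1L to 10000000L)
exists(numbers)(_ % 7 == 0, 100) // true: far more than 100 multiples of 7

(countApprox's timeout is in milliseconds, and getFinalValue() blocks until the exact count is available.)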
Have a nice day,
Jonathan
On 31 July 2015 at 11:29, Carsten
Hello !
Can both methods be compared in terms of performance? I tried the pull request and it felt slow compared to manual mapping.
Cheers,
Jonathan
On Mon, Jul 27, 2015, 8:51 PM Reynold Xin r...@databricks.com wrote:
There is this pull request: https://github.com/apache/spark/pull/5713
We mean
by default).
Regards,
Olivier
2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:
Ahoy !
Maybe you can get countByValue by using sql.GroupedData :
// some DF
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
  StructType(List(StructField("n", StringType
df.groupBy("n").count().show()
// generic
def countByValueDf(df: DataFrame)
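The archive cuts the generic helper off at its signature; a hypothetical completion, assuming the intent is to count each distinct row:

import org.apache.spark.sql.functions.col

def countByValueDf(df: DataFrame): DataFrame =
  df.groupBy(df.columns.map(col): _*).count()

// e.g. countByValueDf(df).show() prints every distinct row with its count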