Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
Hey sparkers, an aggregate function *count* in *org.apache.spark.sql.functions* package takes a *column* as an argument. Is this needed for something? I find it confusing that I need to supply a column there. It feels like it might be distinct count or something. This can be seen in latest documen

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
ing method in > sql/core/src/main/scala/org/apache/spark/sql/functions.scala : > > def count(e: Column): Column = withAggregateFunction { > > Did you notice this method ? > > def count(columnName: String): TypedColumn[Any, Long] = > > On Wed, Jun 22, 2

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Jakub Dubovsky
nt for `functions.count` is needed for per-column counting; > df.groupBy($"a").agg(count($"b")) > > // maropu > > On Thu, Jun 23, 2016 at 1:27 AM, Ted Yu wrote: > >> See the first example in: >> >> http://www.w3schools.com/sql/sql_func_c
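The resolution of this thread, reconstructed from the quoted replies: `count` takes a `Column` because it counts the non-null values of that specific column, which matters for per-column counting inside a `groupBy`. A minimal sketch only (assuming a DataFrame `df` with columns `a` and `b`, and a `spark` session in scope):

```scala
import org.apache.spark.sql.functions.count
import spark.implicits._  // for the $"..." column syntax

// count($"b") counts non-null values of b within each group of a;
// this per-column semantics is why count takes a Column argument
val perColumn = df.groupBy($"a").agg(count($"b"))

// the String overload mentioned in the thread returns a
// TypedColumn[Any, Long], convenient for typed Dataset aggregations
val typed = df.groupBy($"a").agg(count("b"))
```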

RDD of ImmutableList

2015-10-05 Thread Jakub Dubovsky
But I cannot think of a workaround, and I do not believe that using ImmutableList with an RDD is impossible. How is this solved?   Thank you in advance!    Jakub Dubovsky

Re: RDD of ImmutableList

2015-10-05 Thread Jakub Dubovsky
the class to scala class which would translate the data during (de) serialization?   Thanks!   Jakub Dubovsky -- Original message -- From: Igor Berman To: Jakub Dubovsky Date: 5. 10. 2015 20:11:35 Subject: Re: RDD of ImmutableList " kryo doesn't support guava&

Re: RDD of ImmutableList

2015-10-07 Thread Jakub Dubovsky
I did not realize that Scala's and Java's immutable collections use different APIs, which causes this. Thank you for the reminder. This makes some sense now... -- Original message -- From: Jonathan Coveney To: Jakub Dubovsky Date: 7. 10. 2015 1:29:34 Subject:
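Since Kryo does not handle Guava's ImmutableList out of the box, a common workaround is to register a custom serializer through a Kryo registrator. This is a sketch only; the registrator class name and the use of the third-party `kryo-serializers` library are assumptions, not something confirmed in the thread:

```scala
import com.esotericsoftware.kryo.Kryo
import de.javakaffee.kryoserializers.guava.ImmutableListSerializer
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator wiring Guava's ImmutableList into Kryo
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    ImmutableListSerializer.registerSerializers(kryo)
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")
```

The alternative mentioned in the thread is simply converting to Scala's own immutable collections before putting the data into an RDD.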

Does a driver jvm houses some rdd partitions?

2016-08-31 Thread Jakub Dubovsky
Hey all, I have a conceptual question which I have hard time finding answer for. Is the jvm where spark driver is running also used to run computations over rdd partitions and persist them? The answer is obvious for local mode (yes). But when it runs on yarn/mesos/standalone with many executors i

Re: Does a driver jvm houses some rdd partitions?

2016-09-01 Thread Jakub Dubovsky
> On 31 August 2016 at 14:53, Jakub Dubovsky > wrote: > >> Hey all, >> >> I have a conceptual question which I h
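One way to see where partitions are actually computed is to have each partition report its hostname. A minimal sketch under stated assumptions (a SparkContext `sc`, run on a multi-node cluster):

```scala
// Each partition emits the hostname of the JVM that computed it.
val hosts = sc
  .parallelize(1 to 100, numSlices = 4)
  .mapPartitions(_ => Iterator(java.net.InetAddress.getLocalHost.getHostName))
  .distinct()
  .collect()

// In local mode this is just the driver's host; on yarn/mesos/standalone
// these are executor hosts, and cached partitions live on the executors.
println(hosts.mkString(", "))
```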

Why there is no top method in dataset api

2016-09-01 Thread Jakub Dubovsky
Hey all, in the RDD api there is a very useful method called top. It finds the top n records according to a certain ordering without sorting all records. Very useful! There is no top method nor similar functionality in the Dataset api. Has anybody any clue why? Is there any specific reason for this? Any th

Re: Why there is no top method in dataset api

2016-09-05 Thread Jakub Dubovsky
-like > counterpart already that doesn't really need wrapping in a different > API. > > On Thu, Sep 1, 2016 at 12:53 PM, Jakub Dubovsky > wrote: > > Hey all, > > > > in RDD api there is very useful method called top. It finds top n > records > >

Re: Why there is no top method in dataset api

2016-09-13 Thread Jakub Dubovsky
omputation of "top N" on a Dataset, so I don't think this is > relevant. > > > ​orderBy + take is already the way to accomplish "Dataset.top". It works > on Datasets, and therefore DataFrames too, for the reason you give. I'm not > sure what you're askin
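The conclusion quoted above is that orderBy plus take/limit is the Dataset counterpart of `RDD.top`. A sketch (assuming a Dataset `ds` with a numeric column `score`):

```scala
import org.apache.spark.sql.functions.col

// RDD: rdd.top(n) avoids a full sort by keeping a bounded heap per partition.
// Dataset: orderBy + limit; Catalyst plans this as TakeOrderedAndProject,
// which similarly avoids a full global sort.
val topN = ds.orderBy(col("score").desc).limit(10)
```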

import sql.implicits._

2016-10-14 Thread Jakub Dubovsky
Hey community, I would like to *educate* myself about why all *sql implicits* (most notably conversion to Dataset API) are imported from *instance* of SparkSession and not using static imports. Having this design one runs into problems like this

Re: import sql.implicits._

2016-10-15 Thread Jakub Dubovsky
; seq => dataset​ >> >> On Fri, Oct 14, 2016 at 5:47 PM, Koert Kuipers wrote: >> >>> for example when do you Seq(1,2,3).toDF("a") it needs to get the >>> SparkSession from somewhere. by importing the implicits from >>> spark.impl
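The reason quoted above, illustrated: the implicit conversions need a live SparkSession to build a Dataset, so they are members of a session *instance* rather than a static import. A minimal sketch:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("implicits-demo")
  .getOrCreate()

// The import is from the instance: the implicits it brings in
// (behind toDS/toDF) close over this particular session.
import spark.implicits._

val df = Seq(1, 2, 3).toDF("a")  // needs spark to construct the DataFrame
```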

Dataset encoders for further types?

2016-12-15 Thread Jakub Dubovsky
efined case classes containing scala.collection.immutable.List(s). This does not work now because these lists are converted to ArrayType (Seq). This then fails constructor lookup with a seq-is-not-a-list error... This means that for now we are stuck with using RDDs. Thanks for any insights!

Re: Dataset encoders for further types?

2016-12-16 Thread Jakub Dubovsky
manually specify the > kryo > encoder > <http://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/sql/Encoders.html#kryo(scala.reflect.ClassTag)> > . > > On Thu, Dec 15, 2016 at 8:18 AM, Jakub Dubovsky < > spark.dubovsky.ja...@gmail.com> wrote: > >>
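The workaround quoted above is to fall back to a Kryo encoder for types the built-in encoders cannot handle, at the cost of storing the object as a single opaque binary column instead of structured fields. A sketch (assuming a `spark` session; `Record` is a hypothetical case class):

```scala
import org.apache.spark.sql.{Encoder, Encoders}

case class Record(items: scala.collection.immutable.List[Int])

// Kryo encoder: serializes the whole object to one binary column, so
// Catalyst cannot prune or push down fields, but it round-trips Lists.
implicit val recordEncoder: Encoder[Record] = Encoders.kryo[Record]

val ds = spark.createDataset(Seq(Record(List(1, 2, 3))))
```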

Number of partitions in Dataset aggregations

2017-03-01 Thread Jakub Dubovsky
. Any thoughts or pointers to relevant design documents appreciated... Thanks! Jakub Dubovsky
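For context on the question: unlike RDD aggregations, which take an explicit numPartitions argument, Dataset/DataFrame aggregations take their shuffle partition count from a session config. A sketch only (assuming an `rdd`, a DataFrame `df` with a `key` column, and a `spark` session; the value 64 is arbitrary):

```scala
// RDD API: partition count is an explicit parameter
val byKey = rdd.reduceByKey(_ + _, numPartitions = 64)

// Dataset API: groupBy/agg shuffles into spark.sql.shuffle.partitions
// (default 200); it is set on the session config instead
spark.conf.set("spark.sql.shuffle.partitions", "64")
val agg = df.groupBy($"key").count()
```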

Re: Including data nucleus tools

2014-12-20 Thread Jakub Dubovsky
Hi DB,   I cherry-picked the commit into branch-1.2 and it solved the problem. It solves the problem, but it had some unfinished bits and pieces around it and was therefore reverted for being late in the release process.   Jakub -- "Just out of my curiosity. Do you manually apply this patch and see if t