Thanks Sun, this explains why I was getting so many jobs running: my RDDs were empty.
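For anyone else who hits this, Sun's description of the algorithm (quoted below) can be sketched outside Spark. This is a rough simulation with plain Python lists standing in for partitions and hypothetical helper names; it is not Spark's actual code, and the growth factor is an assumption:

```python
def take_sparkr_style(partitions, n):
    """SparkR-style take(): one job per partition, one partition per fetch."""
    collected, jobs = [], 0
    for part in partitions:
        jobs += 1
        collected.extend(part[: n - len(collected)])
        if len(collected) >= n:
            break
    return collected, jobs

def take_scala_style(partitions, n, scale=4):
    """Scala-style take(): estimate and grow the number of partitions
    read in each round (growth factor is an assumption here)."""
    collected, jobs, scanned, num = [], 0, 0, 1
    while scanned < len(partitions) and len(collected) < n:
        jobs += 1
        batch = partitions[scanned : scanned + num]
        scanned += len(batch)
        for part in batch:
            collected.extend(part[: n - len(collected)])
        num *= scale  # fetch more partitions in the next round
    return collected, jobs

empty_rdd = [[] for _ in range(200)]  # 200 empty partitions
print(take_sparkr_style(empty_rdd, 1)[1])  # 200 jobs, one per partition
print(take_scala_style(empty_rdd, 1)[1])   # 5 jobs: reads 1, 4, 16, 64, 115
```

On an empty 200-partition RDD the one-partition-per-fetch variant launches 200 sequential jobs, which matches the behaviour I saw, while the estimating variant finishes in a handful of rounds.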
2016-03-02 10:29 GMT-03:00 Sun, Rui <rui....@intel.com>:

> This has nothing to do with object serialization/deserialization. It is
> expected behavior that take(1) will most likely run slower than count() on
> an empty RDD.
>
> This is all about the algorithm with which take() is implemented. take():
> 1. Reads one partition to get the elements.
> 2. If the fetched elements do not satisfy the limit, it estimates the
> number of additional partitions and fetches elements from them.
> take() repeats step 2 until it gets the desired number of elements or it
> has gone through all partitions.
>
> So take(1) on an empty RDD will go through all partitions sequentially.
>
> Compared with take(), count() also computes all partitions, but the
> computation runs in parallel across all partitions at once.
>
> The take() implementation in SparkR is less optimized than the one in
> Scala, as SparkR won't estimate the number of additional partitions but
> will read just one partition in each fetch.
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Wednesday, March 2, 2016 3:37 AM
> To: Dirceu Semighini Filho <dirceu.semigh...@gmail.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: SparkR Count vs Take performance
>
> Yeah, one surprising result is that you can't call isEmpty on an RDD of
> non-serializable objects. You can't do much with an RDD of
> non-serializable objects anyway, but they can exist as an intermediate
> stage.
>
> We could fix that pretty easily with a little copy and paste of the
> take() code; right now isEmpty is simple but has this drawback.
>
> On Tue, Mar 1, 2016 at 7:18 PM, Dirceu Semighini Filho <
> dirceu.semigh...@gmail.com> wrote:
> > Great, I hadn't noticed this isEmpty method.
> > Well, serialization has been a problem in this project; we have noticed
> > a lot of time being spent serializing and deserializing things to send
> > to and get from the cluster.
> >
> > 2016-03-01 15:47 GMT-03:00 Sean Owen <so...@cloudera.com>:
> >>
> >> There is an "isEmpty" method that basically does exactly what your
> >> second version does.
> >>
> >> I have seen it be unusually slow at times because it must copy one
> >> element to the driver, and it's possible that's slow. It still
> >> shouldn't be slow in general, and I'd be surprised if it's slower
> >> than a count in all but pathological cases.
> >>
> >> On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho
> >> <dirceu.semigh...@gmail.com> wrote:
> >> > Hello all,
> >> > I have a script that creates a dataframe from this operation:
> >> >
> >> > mytable <- sql(sqlContext,("SELECT ID_PRODUCT, ... FROM mytable"))
> >> >
> >> > rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe)
> >> > dFrame <-
> >> > join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT)
> >> >
> >> > After that, I filtered this dFrame with the following:
> >> >
> >> > filteredDF <- filterRDD(toRDD(dFrame),function (row) {row['COLUMN']
> >> > %in% c("VALUES", ...)})
> >> >
> >> > Now I need to know if the resulting dataframe is empty, and to do
> >> > that I tried these two snippets:
> >> > if(count(filteredDF) > 0)
> >> > and
> >> > if(length(take(filteredDF,1)) > 0)
> >> > I thought that the second one, using take, should run faster than
> >> > count, but that didn't happen.
> >> > The take operation creates one job per partition of my RDD (which
> >> > was 200), and this makes it run slower than the count.
> >> > Is this the expected behaviour?
> >> >
> >> > Regards,
> >> > Dirceu
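To summarize the emptiness check the thread converges on: count() touches every partition but does so in parallel, while a take(1)/isEmpty-style check can short-circuit at the first element found yet still visits every partition when the RDD is empty. A minimal sketch in plain Python (hypothetical helper names, lists standing in for partitions, no Spark involved):

```python
def count_all(partitions):
    # count()-style: computes every partition. In Spark this work runs in
    # parallel, so wall-clock cost is roughly one partition's work.
    return sum(len(p) for p in partitions)

def is_empty(partitions):
    # isEmpty()/take(1)-style: stop at the first element found. On an
    # empty RDD this still visits every partition, and in SparkR it does
    # so one sequential job at a time.
    for part in partitions:
        if len(part) > 0:
            return False
    return True

parts = [[] for _ in range(199)] + [["x"]]
print(count_all(parts) > 0, is_empty(parts))  # True False
```

So with 200 partitions and an empty result, the short-circuiting check degenerates to 200 sequential visits, which is why count() won here despite doing strictly more work per partition.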