If Spark did not read the whole dataset, how would it know the total number of records? And without knowing the total, how could it pick out 30%?
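For what it's worth, sample() evaluates each record individually, so the full file is scanned either way. A minimal sketch of the difference (Spark Scala; assumes one JSON document per line, an existing SparkContext `sc`, and made-up paths):

// Sketch only: `sc` is an existing SparkContext; paths are hypothetical.

// rdd.sample draws a Bernoulli trial per element, so Spark still has to
// read and parse every record before deciding whether to keep it:
val all = sc.textFile("hdfs:///data/all_jsons")
val sampled = all.sample(withReplacement = false, fraction = 0.3, seed = 42L)

// The read itself only shrinks if the input was split into many files up
// front, so that a subset of files can be selected before any I/O happens:
val subset = sc.textFile("hdfs:///data/json_parts/part-000[0-2]*")

In other words, there is nothing for Spark to push down here: the sampling predicate is a coin flip per record, not a condition a reader could evaluate without materializing the record first.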
> On May 31, 2016, at 00:45, pbaier <patrick.ba...@zalando.de> wrote:
>
> Hi all,
>
> I have the following use case:
> I have around 10k of jsons that I want to use for learning.
> The jsons are all stored in one file.
>
> For learning an ML model, however, I only need around 30% of the jsons
> (the rest is not needed at all).
> So, my idea was to load all data into an RDD and then use the rdd.sample
> method to get my fraction of the data.
> I implemented this, and in the end it took as long as loading the whole
> dataset.
> So I was wondering if Spark is still loading the whole dataset from disk
> and does the filtering afterwards?
> If this is the case, why does Spark not push down the filtering and load
> only a fraction of the data from disk?
>
> Cheers,
>
> Patrick