Behaviour of RDD sampling

2016-05-31 Thread pbaier
Hi all, I have the following use case: I have around 10k JSON documents that I want to use for learning. The JSON documents are all stored in one file. For learning an ML model, however, I only need around 30% of them (the rest is not needed at all). So, my idea was to load all data into an RDD and then use t
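
The post is cut off here, so the exact call the author had in mind is unknown. As a rough illustration of how fraction-based sampling behaves, here is a minimal plain-Python sketch, not Spark code. It assumes one JSON document per line (the post only says they are all in one file), and the helper `bernoulli_sample` and the generated records are hypothetical. Spark's `RDD.sample(withReplacement, fraction, seed)` with `withReplacement=False` likewise keeps each element independently with probability `fraction`, so the resulting size is only approximately 30% of the input.

```python
import json
import random


def bernoulli_sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`.

    Hypothetical helper: a plain-Python emulation of sampling without
    replacement, not a Spark API.
    """
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


# Hypothetical stand-in for the ~10k JSON documents, assumed one per line.
lines = [json.dumps({"id": i, "value": i * 2}) for i in range(10_000)]

# Sample roughly 30% of the lines, then parse only the kept ones.
sampled = bernoulli_sample(lines, fraction=0.3, seed=42)
docs = [json.loads(line) for line in sampled]

# The sample size is close to, but generally not exactly, 3000.
print(len(docs))
```

Sampling before parsing means only the kept documents are decoded. If an exact count is required rather than an approximate fraction, a fixed-size sample (in Spark, `RDD.takeSample`) is the usual alternative.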
