Re: Hi all,

2017-11-04 Thread אורן שמון
Hi Jean, We prepare the data for all the other jobs. We have many jobs that are scheduled at different times, but all of them need to read the same raw data. On Fri, Nov 3, 2017 at 12:49 PM Jean Georges Perrin wrote: > Hi Oren, > > Why don’t you want to use a GroupBy? You can cache

Bucket vs repartition

2017-10-31 Thread אורן שמון
Hi all, I have two Spark jobs: one is the pre-process job and the second is the process job. The process job needs to run a calculation for each user in the data. I want to avoid a shuffle such as groupBy, so I am thinking about saving the result of the pre-process job bucketed by user in Parquet, or re-partitioning by user and saving the

Read parquet files as buckets

2017-10-31 Thread אורן שמון
Hi all, I have Parquet files produced by another job, which saved them bucketed by userId. How can I read the files as buckets in a different job? I tried to read them, but the data did not come back bucketed (that is, with all rows for the same user in the same partition).

Hi all,

2017-10-31 Thread אורן שמון
I have two Spark jobs: one is the pre-process job and the second is the process job. The process job needs to run a calculation for each user in the data. I want to avoid a shuffle such as groupBy, so I am thinking about saving the result of the pre-process job bucketed by user in Parquet, or re-partitioning by user and saving the result.
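
To make the idea in the question concrete, here is a minimal sketch in plain Python (not Spark) of why bucketing by user can remove the need for a later shuffle: if the pre-process step places every row for a given user in the same bucket, the process step can compute per-user results one bucket at a time. Everything in it is an assumption for illustration: the names `bucket_of`, `write_buckets`, and `per_user_totals`, the bucket count, the sample rows, and the CRC32 hash, which only stands in for whatever hash function the real engine uses.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this sketch


def bucket_of(user_id, num_buckets=NUM_BUCKETS):
    # Stable hash of the user id. CRC32 is a stand-in for illustration;
    # it is not the hash function Spark itself uses.
    return zlib.crc32(user_id.encode("utf-8")) % num_buckets


def write_buckets(rows, num_buckets=NUM_BUCKETS):
    # "Pre-process" step: route each (user_id, value) row to its bucket.
    buckets = defaultdict(list)
    for user_id, value in rows:
        buckets[bucket_of(user_id, num_buckets)].append((user_id, value))
    return dict(buckets)


def per_user_totals(bucket_rows):
    # "Process" step: aggregate inside one bucket, with no data from others.
    totals = defaultdict(int)
    for user_id, value in bucket_rows:
        totals[user_id] += value
    return dict(totals)


# Made-up sample data.
rows = [("alice", 1), ("bob", 5), ("alice", 2), ("carol", 7), ("bob", 1)]
buckets = write_buckets(rows)

# Each bucket is processed independently; results are simply merged.
results = {}
for bucket_id, bucket_rows in buckets.items():
    results.update(per_user_totals(bucket_rows))

print(results)
```

Because a user's rows never span two buckets, merging the per-bucket results with a plain dictionary update is safe. Whether a real job actually skips the shuffle depends on the reader knowing about the bucket layout, which is the point of the "Read parquet files as buckets" question above.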