Re: Cannot control the number of bucket files when it is specified

2016-09-19 Thread Fridtjof Sander
I didn't follow all of this thread, but if you want exactly one bucket output file per RDD partition, you have to repartition (shuffle) your data on the bucket key. If you don't repartition (shuffle), you may have records with different bucket keys in the same RDD partition, leading to multiple output files per bucket.
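A minimal sketch of that advice, assuming a DataFrame df bucketed on prod_id into 256 buckets (df, the column, and the table name are illustrative placeholders, not from the thread):

import org.apache.spark.sql.functions.col

// Shuffle rows on the bucket key first. With Spark's default hash
// partitioning this lines up with bucketBy's hashing, so each task
// holds the rows of a single bucket and writes a single file for it.
val bucketed = df.repartition(256, col("prod_id"))

bucketed.write
  .bucketBy(256, "prod_id")
  .sortBy("prod_id")
  .saveAsTable("bucketed_table")  // bucketBy requires saveAsTable

Without the repartition, every task that happens to hold rows from a given bucket writes its own file for that bucket, so the file count scales with the number of tasks rather than the number of buckets.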

Re: Cannot control the number of bucket files when it is specified

2016-09-19 Thread Qiang Li
I tried the DataFrame writer with the coalesce or repartition API, but it could not meet my requirements: I still got far more files than the bucket number, and the Spark jobs became very slow after I added coalesce or repartition. I've gone back to Hive and use Hive to do the data conversion. Thanks.
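For context, the pattern being described is roughly the following (a sketch; all names are made up, and this is my reconstruction rather than the poster's actual code):

// coalesce(n) only merges existing RDD partitions; it does not
// co-locate rows by the bucket key. Each resulting task can therefore
// still see rows from many buckets and emit one file per bucket it sees.
df.coalesce(64)
  .write
  .partitionBy("year", "month")
  .bucketBy(256, "prod_id")
  .saveAsTable("converted_table")

A plain repartition(n) without the bucket column has the same problem, and both either add a shuffle or reduce parallelism, which would match the slowdown reported above.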

Re: Cannot control the number of bucket files when it is specified

2016-09-17 Thread Mich Talebzadeh
OK. You have an external table in Hive on S3 with partitions and buckets, say PARTITIONED BY (year int, month string) CLUSTERED BY (prod_id) INTO 256 BUCKETS STORED AS ORC. Within each partition you then have buckets on prod_id, spread equally across 256 hash partitions/buckets. The bucket is the hash-partitioning unit within each partition.
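Written out in full, DDL along those lines would look roughly like this (a sketch issued through Spark's SQL interface; the table name, the non-key columns, and the location are assumptions, only the PARTITIONED BY / CLUSTERED BY / STORED AS clauses come from the message):

// Sketch of the described Hive layout; everything not quoted above
// is illustrative.
spark.sql("""
  CREATE EXTERNAL TABLE sales (
    prod_id BIGINT,
    amount  DOUBLE
  )
  PARTITIONED BY (year INT, month STRING)
  CLUSTERED BY (prod_id) INTO 256 BUCKETS
  STORED AS ORC
  LOCATION 's3a://my-bucket/sales/'
""")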

Re: Cannot control the number of bucket files when it is specified

2016-09-17 Thread Qiang Li
I want to run a job that loads existing data from one S3 bucket, processes it, then stores it to another bucket with partitioning and bucketing (a data format conversion from TSV to Parquet with gzip). So both the source data and the results are in S3; the difference is the tools I use to process the data. First I process
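A sketch of that conversion pipeline (the paths, schema, and the partition/bucket columns are placeholders I've assumed):

// Read TSV from the source S3 bucket, write partitioned + bucketed
// Parquet (gzip-compressed) to the target S3 bucket.
val df = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("s3a://source-bucket/input/")

df.write
  .format("parquet")
  .option("compression", "gzip")
  .option("path", "s3a://target-bucket/output/")  // external location
  .partitionBy("year", "month")
  .bucketBy(256, "prod_id")
  .sortBy("prod_id")
  .saveAsTable("converted_table")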

Re: Cannot control the number of bucket files when it is specified

2016-09-17 Thread Mich Talebzadeh
It is difficult to guess what is happening with your data. First, when you say you use Spark to generate test data, is it selected randomly and then stored in a Hive (or similar) table? HTH, Dr Mich Talebzadeh. LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Cannot control the number of bucket files when it is specified

2016-09-17 Thread Qiang Li
Hi, I use Spark to generate data, and then we use Hive/Pig/Presto/Spark to analyze it. But I found that even when I use bucketBy and sortBy with a bucket number in Spark, the number of result files Spark generates under each partition is always far more than the bucket number, and then Presto cannot recognize the buckets.
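For reference, the write being described has this shape (a sketch; all names are illustrative). Spark writes up to one file per bucket per task, so with many upstream tasks each partition directory ends up with far more files than the requested bucket count:

// Bucketed, sorted, partitioned write. Without first repartitioning on
// the bucket column, every task writes its own file for every bucket
// whose rows it happens to hold.
df.write
  .partitionBy("year", "month")
  .bucketBy(256, "prod_id")
  .sortBy("prod_id")
  .saveAsTable("events_bucketed")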