I didn't follow all of this thread, but if you want exactly one
bucket output file per RDD partition, you have to repartition (shuffle)
your data on the bucket key.
If you don't repartition (shuffle), you may have records with different
bucket keys in the same RDD partition, leading to more output files than buckets.
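As a toy illustration (plain Python, not Spark; the numbers and names are made up), each write task emits one file per distinct bucket key it sees, so the total file count depends on how records are spread across tasks:

```python
from collections import defaultdict

NUM_BUCKETS = 4

def bucket_key(record):
    # Hive-style bucket assignment: hash of the bucket column mod bucket count
    return record % NUM_BUCKETS

def files_written(partitions):
    """partitions: one list of records per write task; each task
    writes one file per distinct bucket key it encounters."""
    return sum(len({bucket_key(r) for r in part}) for part in partitions)

records = list(range(16))

# Naive split into 4 tasks of consecutive records: every task
# sees every bucket key, so we get 4 x 4 = 16 files.
chunks = [records[i * 4:(i + 1) * 4] for i in range(4)]

# Repartitioned (shuffled) on the bucket key: each task sees
# exactly one key, so we get exactly NUM_BUCKETS files.
by_key = defaultdict(list)
for r in records:
    by_key[bucket_key(r)].append(r)
shuffled = list(by_key.values())

print(files_written(chunks))    # 16
print(files_written(shuffled))  # 4
```

The same arithmetic explains the "far more files than buckets" symptom: without a shuffle, the worst case is (number of tasks) × (number of buckets) files.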
I tried the DataFrame writer with the coalesce and repartition APIs, but
they did not meet my requirements: I still get far more files than the
bucket number, and the Spark jobs became very slow after I added coalesce
or repartition.
I've gone back to Hive and use Hive to do the data conversion.
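For reference, the Hive-side conversion can be a plain INSERT OVERWRITE into a bucketed table (table and column names below are hypothetical):

```sql
-- Assumes the target table was created with
-- CLUSTERED BY (prod_id) INTO 256 BUCKETS STORED AS PARQUET.
SET hive.enforce.bucketing = true;  -- needed on older Hive; on by default in Hive 2.x

INSERT OVERWRITE TABLE sales_parquet PARTITION (year, month)
SELECT prod_id, amount, year, month
FROM sales_tsv;
```

With bucketing enforced, Hive launches one reducer per bucket, so each partition ends up with exactly 256 files.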
Thanks.
On Sat, Sep 17, 2016
Ok
You have an external table in Hive on S3 with partition and bucket. say
..
PARTITIONED BY (year int, month string)
CLUSTERED BY (prod_id) INTO 256 BUCKETS
STORED AS ORC.
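Spelled out, the full DDL would be along these lines (table name, data columns, and S3 location are hypothetical):

```sql
CREATE EXTERNAL TABLE sales (
  prod_id BIGINT,
  amount  DOUBLE
)
PARTITIONED BY (year INT, month STRING)
CLUSTERED BY (prod_id) INTO 256 BUCKETS
STORED AS ORC
LOCATION 's3://my-bucket/sales/';  -- hypothetical path
```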
With this, rows within each partition are hashed on prod_id and spread
evenly across 256 buckets.
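For integer columns, Hive's bucket assignment reduces to value mod bucket count (a sketch; other column types go through Hive's own per-type hash function):

```python
NUM_BUCKETS = 256

def bucket_for(prod_id: int) -> int:
    # For integer columns Hive's hash is the value itself,
    # so the bucket is simply prod_id mod the bucket count.
    return prod_id % NUM_BUCKETS

print(bucket_for(1234))  # 210
print(bucket_for(1490))  # 210 -- same bucket, so same file within a partition
```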
I want to run a job to load existing data from one S3 bucket, process it,
then store it to another bucket with partitioning and bucketing (data
format conversion from TSV to Parquet with gzip). So the source data and
the results are both in S3; the difference is the tools I use to process
the data.
First I process
It is difficult to guess what is happening with your data.
First, when you say you use Spark to generate test data, is that data
selected randomly and then stored in a Hive (or similar) table?
HTH
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Hi,
I use Spark to generate data, then we use Hive/Pig/Presto/Spark to analyze
the data. But I found that even when I use bucketBy and sortBy with a
bucket number in Spark, the result files generated by Spark are always far
more numerous than the bucket number under each partition, and then Presto
can not recognize the