Bucketing is certainly helpful when you have a finite number of values in a
different column of a partitioned table.
However, bucketing means that when you load data into the table, it
can't be a straightforward LOAD DATA INPATH; you will need to run it via
Hive queries (which does not seem to
Hi Nitin/Prasan,
Thanks for your replies, I appreciate your help :)
Clustering looks to be quite close to what we want. However, one main gap is
that we need to fire a Hive query to populate the clusters. In our case, the
clustered data is already there, so the computation in a Hive query would be
redundant.
Hi Saumitra,
You might want to look into clustering within the partition. That is, partition
by "day", but cluster by "generated by" (within those partitions), and see if
that improves performance. Refer to the CLUSTER BY command in the Hive Language
Manual.
-Prasan
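
For readers following the thread, here is a minimal sketch of what "partition by day, cluster by generated_by" could look like as a table definition. It is only an illustration: the table name `analyze_clustered`, the `payload` column, the column types, and the bucket count of 32 are all assumptions and do not come from the thread. The Python helper simply assembles a HiveQL `CREATE TABLE ... PARTITIONED BY ... CLUSTERED BY ... INTO n BUCKETS` statement as a string.

```python
def build_clustered_ddl(table, columns, partition_col, cluster_col, num_buckets):
    """Build a HiveQL CREATE TABLE statement that is partitioned by one
    column and bucketed (clustered) by another.

    `columns` is a list of (name, hive_type) pairs and must include the
    clustering column but not the partition column.
    """
    names = [name for name, _ in columns]
    if cluster_col not in names:
        raise ValueError("clustering column must be a regular table column")
    if partition_col in names:
        raise ValueError("partition column must not be repeated in the column list")
    col_defs = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({col_defs}) "
        f"PARTITIONED BY ({partition_col} STRING) "
        f"CLUSTERED BY ({cluster_col}) INTO {num_buckets} BUCKETS"
    )


# Hypothetical table, columns and bucket count, chosen only for illustration.
ddl = build_clustered_ddl(
    table="analyze_clustered",
    columns=[("generated_by", "INT"), ("payload", "STRING")],
    partition_col="day",
    cluster_col="generated_by",
    num_buckets=32,
)
print(ddl)
```

Note that populating such a table normally goes through a Hive INSERT query so that rows land in the right bucket, which is the load-time cost raised elsewhere in this thread.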
On Mar 25, 2014, at 4:26
In general, when you have a large number of partitions, your Hive query
performance drops. This has been significantly addressed in current
releases, but performance issues are still seen. Sadly, I currently do not
have a large enough dataset where I need to create a large number of partitions.
This issue la
Hi Nitin,
We are not facing the small-files problem since the data is in S3. Also, we do
not want to merge files. Merging files and creating one large analyze table for,
say, one day would slow down queries fired on a specific day and *generated_by*.
Let me explain my problem in other words.
Right now we are over
See if this is what you are looking for: https://github.com/sskaje/hive_merge
On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <
saumitra.shahap...@vizury.com> wrote:
> Hello,
>
> We are using Hive to query S3 data. For one of our tables named analyze,
> we generate data hierarchica
Hello,
We are using Hive to query S3 data. For one of our tables, named analyze, we
generate data hierarchically. The first level of the hierarchy is the date and
the second level is a field named *generated_by*. E.g. for 20 March we may have
S3 directories such as
s3://analyze/20140320/111/
s3://analyze/20140320/222/