Re: Handling hierarchical data in Hive

2014-03-25 Thread Nitin Pawar
Bucketing is certainly helpful when you have a finite number of values in a column other than the partition column. However, bucketing means that when you load data into the table, it can't be a straightforward LOAD DATA INPATH; you will need to run it via Hive queries (which does not seem to

Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hi Nitin/Prasan, Thanks for your replies, I appreciate your help :) Clustering looks quite close to what we want. However, one main gap is that we would need to run a Hive query to populate the clusters. In our case, the clustered data is already there, so the computation in a Hive query would be redundant.

Re: Handling hierarchical data in Hive

2014-03-25 Thread Prasan Samtani
Hi Saumitra, You might want to look into clustering within the partition. That is, partition by "day", but cluster by "generated by" (within those partitions), and see if that improves performance. Refer to the CLUSTER BY command in the Hive Language Manual. -Prasan On Mar 25, 2014, at 4:26
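
A minimal sketch of what the suggestion above could look like as a table definition, written as a small Python helper that builds the Hive DDL string. The column list, the `payload` column, the bucket count of 32, and the helper name `bucketed_table_ddl` are all assumptions for illustration and do not come from the thread. As noted elsewhere in the thread, existing files are not bucketed just because the table is declared this way; the data would still have to be written through a Hive query.

```python
def bucketed_table_ddl(table, columns, partition_col, cluster_col, num_buckets, location):
    """Build a Hive CREATE TABLE statement that is partitioned and bucketed.

    All arguments are illustrative; nothing here is taken from a real schema.
    """
    # Hive buckets on a regular column, not on the partition column.
    if cluster_col not in dict(columns):
        raise ValueError("cluster column must be a regular (non-partition) column")
    cols = ", ".join(f"`{name}` {typ}" for name, typ in columns)
    return (
        f"CREATE EXTERNAL TABLE `{table}` ({cols}) "
        f"PARTITIONED BY (`{partition_col}` STRING) "
        f"CLUSTERED BY (`{cluster_col}`) INTO {num_buckets} BUCKETS "
        f"LOCATION '{location}'"
    )


ddl = bucketed_table_ddl(
    table="analyze",
    columns=[("generated_by", "STRING"), ("payload", "STRING")],  # hypothetical columns
    partition_col="day",
    cluster_col="generated_by",
    num_buckets=32,  # arbitrary choice for the sketch
    location="s3://analyze/",
)
print(ddl)
```

The table name is backtick-quoted because `analyze` is also a Hive keyword.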

Re: Handling hierarchical data in Hive

2014-03-25 Thread Nitin Pawar
In general, when you have a large number of partitions, Hive query performance drops. This has been significantly addressed in current releases, but performance issues still show up. Sadly, I currently do not have a dataset large enough to need a large number of partitions. This issue la

Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hi Nitin, We are not facing the small files problem since the data is in S3. Also, we do not want to merge files. Merging files and creating a large analyze table for, say, one day would slow down queries fired on a specific day and *generated_by*. Let me explain my problem in other words. Right now we are over

Re: Handling hierarchical data in Hive

2014-03-25 Thread Nitin Pawar
See if this is what you are looking for: https://github.com/sskaje/hive_merge

On Tue, Mar 25, 2014 at 4:21 PM, Saumitra Shahapure (Vizury) <saumitra.shahap...@vizury.com> wrote:
> Hello,
>
> We are using Hive to query S3 data. For one of our tables named analyze,
> we generate data hierarchica

Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
Hello, We are using Hive to query S3 data. For one of our tables, named analyze, we generate data hierarchically. The first level of the hierarchy is the date and the second level is a field named *generated_by*. E.g., for 20 March we may have S3 directories such as s3://analyze/20140320/111/ s3://analyze/20140320/222/
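
One way a date/*generated_by* directory layout like the one above could be registered with Hive is as two-level partitions, one per existing S3 directory. The Python sketch below only generates the ALTER TABLE statements. It assumes a table partitioned by a hypothetical `dt` column and by `generated_by`; the helper name `add_partition_statements` and the `layout` dictionary are illustrative and not from the original message. Whether this many partitions performs well is a separate question that the replies in this thread discuss.

```python
def add_partition_statements(table, bucket, layout):
    """Return one ALTER TABLE ... ADD PARTITION statement per S3 directory.

    layout maps a date string to the generated_by values present for that date.
    """
    statements = []
    for day in sorted(layout):
        for generated_by in sorted(layout[day]):
            # Directory naming follows the s3://<bucket>/<date>/<generated_by>/ pattern.
            location = f"s3://{bucket}/{day}/{generated_by}/"
            statements.append(
                f"ALTER TABLE `{table}` ADD IF NOT EXISTS "
                f"PARTITION (dt='{day}', generated_by='{generated_by}') "
                f"LOCATION '{location}'"
            )
    return statements


# The two directories mentioned in the message, for 20 March 2014.
layout = {"20140320": ["111", "222"]}
stmts = add_partition_statements("analyze", "analyze", layout)
for stmt in stmts:
    print(stmt + ";")
```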