Re: Re: sql mapjoin very slow

2015-08-28 Thread r7raul1...@163.com
I found a method in the HashMapWrapper class. I think Hive will use statistics to adjust the threshold automatically. public static int calculateTableSize(float keyCountAdj, int threshold, float loadFactor, long keyCount) { if (keyCount >= 0 && keyCountAdj != 0) { // We have statistics for the table.

Re: sql mapjoin very slow

2015-08-28 Thread Sergey Shelukhin
Can you check if this is actually being used in your case? From: r7raul1...@163.com Reply-To: user@hive.apache.org Date: Friday, August 28, 2015 at 00:53 To: user

Re: sql mapjoin very slow

2015-08-28 Thread Gopal Vijayaraghavan
> I have a question. I use Hive 1.1.0, so the default value of hive.stats.dbclass is fs. Does that mean statistics are stored in the local filesystem? Can anyone tell me the file path where the statistics are stored?
The statistics aren't stored in the file system long term - the final destination for stats is the metastore.

python libraries to execute or call hive queries

2015-08-28 Thread Giri P
Hi All, Can anyone suggest any Python libraries for calling Hive queries from Python scripts? What is the best practice for executing queries from Python: the Hive CLI, Beeline, JDBC, etc.? Thanks, Giri

Re: python libraries to execute or call hive queries

2015-08-28 Thread Gopal Vijayaraghavan
> Can anyone suggest any Python libraries to call Hive queries from Python scripts?
https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Python Though I suspect that's out of date. https://github.com/t3rmin4t0r/amplab-benchmark/blob/master/runner/run_query.py#L604 is
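
For the Beeline option mentioned in the question, one possible approach is to shell out to Beeline from Python and parse its tab-separated output. This is only a minimal sketch, not the method used in either link above: the helper names (build_beeline_command, parse_tsv, run_hive_query) are made up for illustration, and it assumes Python 3.7+, a `beeline` binary on the PATH, and a reachable HiveServer2. The JDBC URL and table name in the usage comment are placeholders.

```python
import subprocess


def build_beeline_command(jdbc_url, query, beeline_bin="beeline"):
    """Assemble the argument list for a non-interactive Beeline call."""
    return [
        beeline_bin,
        "-u", jdbc_url,          # JDBC URL of the HiveServer2 instance
        "-e", query,             # run this single query, then exit
        "--outputformat=tsv2",   # tab-separated rows, easy to split
        "--showHeader=false",    # omit the column-name row
        "--silent=true",         # suppress progress chatter
    ]


def parse_tsv(output):
    """Split tab-separated Beeline output into rows, skipping blank lines."""
    return [line.split("\t") for line in output.splitlines() if line.strip()]


def run_hive_query(jdbc_url, query):
    """Run the query through Beeline and return rows as lists of strings."""
    result = subprocess.run(
        build_beeline_command(jdbc_url, query),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return parse_tsv(result.stdout)


# Hypothetical usage (host, port, and table are placeholders):
# rows = run_hive_query("jdbc:hive2://localhost:10000/default",
#                       "SELECT id, COUNT(*) FROM events GROUP BY id")
```

Passing the command as a list avoids shell quoting problems with the query text, and check=True turns a failed query into an exception rather than an empty result.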

Join vs. Where...In

2015-08-28 Thread Raviv Murciano-Goroff
Hi, I often have the following situation: I have a small table with a list of unique IDs and a very large table of events associated with the IDs. I want to perform some aggregation including only events associated with IDs from the small table. Is there a rule of thumb for whether performing a

Re: UDF Configure method not getting called

2015-08-28 Thread Moore, Douglas
Writing side files from a map reduce job was more common a while ago. There are severe disadvantages to doing so and resulting complexities. One complexity is failure handling and retry, the other is speculative execution running multiple attempts over the same split. You say you want to look

Re: UDF Configure method not getting called

2015-08-28 Thread Rahul Sharma
So the use case is like this: We want to be able to let the user point us to any number of columns in a table and then run analysis on the values within that column irrespective of the type of column (simple, complex, datatypes etc). The analysis can be thought of as looking at all the values or a