Re: Gather Partition Locations

2019-11-11 Thread Gopal Vijayaraghavan
Hi, > I have a question about how to get the location for a bunch of partitions. ... > But in an enterprise environment I'm pretty sure this approach would not be > the best because the RDS (mysql or derby) is maybe not reachable or > I don't have the permission to it. That was the reason Hive

Re: Error: java.io.IOException: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError

2019-07-19 Thread Gopal Vijayaraghavan
Hi, > java.lang.NoSuchMethodError: > org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I > (state=,code=0) Are you rolling your own Hadoop install? https://issues.apache.org/jira/browse/HADOOP-14683 Cheers, Gopal

Re: Predicate Push Down Vs On Clause

2019-04-28 Thread Gopal Vijayaraghavan
> Yes both of these are valid ways of filtering data before join in Hive. This has several implementation specifics attached to it. If you're looking at Hive 1.1 or before, it might not work the same way as Vineet mentioned. In older versions Calcite rewrites aren't triggered, which prevented
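A hedged sketch of the two filter placements being compared (hypothetical tables a and b, not taken from the thread):

    -- filter expressed in the join condition
    SELECT a.id, b.val
    FROM a JOIN b ON (a.id = b.id AND b.dt = '2019-04-01');

    -- filter expressed in the WHERE clause
    SELECT a.id, b.val
    FROM a JOIN b ON (a.id = b.id)
    WHERE b.dt = '2019-04-01';

On versions where the Calcite rewrites fire, both forms should filter b before the join; on Hive 1.1 and earlier that equivalence is not guaranteed, as noted above.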

Re: Hive on Tez vs Impala

2019-04-22 Thread Gopal Vijayaraghavan
> I wish the Hive team to keep things more backward-compatible as well. Hive is > such an enormous system with a wide-spread impact so any > backward-incompatible change could cause an uproar in the community. The incompatibilities were not avoidable in a set of situations - a lot of those

Re: Hive on Tez vs Impala

2019-04-15 Thread Gopal Vijayaraghavan
Hi, >> However, we have built Tez on CDH and it runs just fine. Down that path you'll also need to deploy a slightly newer version of Hive as well, because Hive 1.1 is a bit ancient & has known bugs with the tez planner code. You effectively end up building the hortonworks/hive-release

Re: out of memory using Union operator and array column type

2019-03-11 Thread Gopal Vijayaraghavan
> I'll try the simplest query I can reduce it to  with loads of memory and see > if that gets anywhere. Other pointers are much appreciated. Looks like something I'm testing right now (to make the memory setting cost-based). https://issues.apache.org/jira/browse/HIVE-21399 A less

Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan
Hi, That looks like the TopN hash optimization didn't kick in, that must be a settings issue in the install. | Reduce Output Operator | | key expressions: _col0 (type: string) | | sort order: + | |

Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan
>I am running an older version of Hive on MR. Does it have it too? Hard to tell without an explain. AFAIK, this was fixed in Aug 2013 - how old is your build? Cheers, Gopal

Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan
> I expect the maps to do some sorting and limiting in parallel. That way the > reducer load would be small. I don’t think it does that. Can you tell me why?  They do. Which version are you running, is it Tez and do you have an explain for the plan? Cheers, Gopal

Re: Out Of Memory Error

2019-01-10 Thread Gopal Vijayaraghavan
>   ,row_number() over ( PARTITION BY A.dt,A.year, A.month, >A.bouncer,A.visitor_type,A.device_type order by A.total_page_view_time desc ) >as rank from content_pages_agg_by_month A The row_number() window function is a streaming function, so this should not consume a significant

Re: hive 3.1 mapjoin with complex predicate produce incorrect results

2018-12-22 Thread Gopal Vijayaraghavan
Hi, > Subject: Re: hive 3.1 mapjoin with complex predicate produce incorrect results ... > |                         0 if(_col0 is null, 44, _col0) (type: int) | > |                         1 _col0 (type: int)        | That rewrite is pretty neat, but I feel like the IF expression nesting is

Re: HiveServer2 performance references?

2018-10-15 Thread Gopal Vijayaraghavan
Hi, > I was looking at HiveServer2 performance going through Knox in KNOX-1524 and > found that HTTP mode is significantly slower. The HTTP mode does re-auth for every row before HIVE-20621 was fixed – Knox should be doing cookie-auth to prevent ActiveDirectory/LDAP from throttling this. I

Re: [feature request] auto-increment field in Hive

2018-09-15 Thread Gopal Vijayaraghavan
Hi, > It doesn't help if you need concurrent threads writing to a table but we are > just using the row_number analytic and a max value subquery to generate > sequences on our star schema warehouse. Yup, you're right the row_number doesn't help with concurrent writes - it doesn't even scale
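A minimal sketch of that row_number + max-value pattern, assuming hypothetical dim and staging tables (not taken from the thread):

    INSERT INTO TABLE dim
    SELECT m.max_id + ROW_NUMBER() OVER (ORDER BY s.natural_key) AS surrogate_key,
           s.natural_key,
           s.attr
    FROM staging s
    CROSS JOIN (SELECT COALESCE(MAX(surrogate_key), 0) AS max_id FROM dim) m;

The ROW_NUMBER() here funnels through a single reducer and assumes no other writer touches dim concurrently, which is exactly the limitation described above.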

Re: UDFClassLoader isolation leaking

2018-09-13 Thread Gopal Vijayaraghavan
Hi, > Hopefully someone can tell me if this is a bug, expected behavior, or > something I'm causing myself :) I don't think this is expected behaviour, but where the bug is what I'm looking into. > We have a custom StorageHandler that we're updating from Hive 1.2.1 to Hive > 3.0.0. Most

Re: Queries to custom serde return 'NULL' until hiveserver2 restart

2018-09-10 Thread Gopal Vijayaraghavan
>query the external table using HiveCLI (e.g. SELECT * FROM >my_external_table), HiveCLI prints out a table with the correct If the error is always on a "select *", then the issue might be the SerDe's handling of included columns. Check what you get for colNames =

Re: Not able to read Hive ACID table data created by Hive 2.1.1 in hive 2.3.3

2018-09-06 Thread Gopal Vijayaraghavan
> msck repair table ; msck repair does not work on ACID tables. In Hive 2.x, there is no way to move, replicate or rehydrate ACID tables from a cold store - the only way it works if you connect to the old metastore. Cheers, Gopal

Re: Problem in reading parquet data from 2 different sources(Hive + Glue) using hive tables

2018-08-29 Thread Gopal Vijayaraghavan
> Because I believe string should be able to handle integer as well.  No, because it is not a lossless conversion. Comparisons are lost. "9" > "11", but 9 < 11 Even float -> double is lossy (because of epsilon). You can always apply the Hive workaround suggested, otherwise you might find
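A quick illustration of the lossy comparison (hypothetical literals, Hive-style FROM-less SELECTs):

    SELECT '9' > '11';                             -- true: lexicographic string comparison
    SELECT 9 > 11;                                 -- false: numeric comparison
    SELECT CAST('9' AS INT) > CAST('11' AS INT);   -- false: explicit cast restores numeric semantics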

Re: Problem in reading parquet data from 2 different sources(Hive + Glue) using hive tables

2018-08-29 Thread Gopal Vijayaraghavan
Hi, > on some days parquet was created by hive 2.1.1 and on some days it was > created by using glue … > After some drill down i saw schema of columns inside both type of parquet > file using parquet tool and found different data types for some column ... > optional int32 action_date (DATE); >

Re: Improve performance of Analyze table compute statistics

2018-08-28 Thread Gopal Vijayaraghavan
> Will it be referring to orc metadata or it will be loading the whole file and > then counting the rows. Depends on the partial-scan setting or if it is computing full column stats (the full column stats does an nDV, which reads all rows). hive> analyze table compute statistics ...
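For reference, a sketch of the two kinds of stats collection mentioned (table name and partition spec are hypothetical):

    -- basic stats (row count, sizes); with ORC this can come from file metadata
    ANALYZE TABLE t PARTITION (dt='2018-08-28') COMPUTE STATISTICS;

    -- full column stats; the nDV computation reads every row
    ANALYZE TABLE t PARTITION (dt='2018-08-28') COMPUTE STATISTICS FOR COLUMNS;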

Re: Auto Refresh Hive Table Metadata

2018-08-10 Thread Gopal Vijayaraghavan
> By the way, if you want near-real-time tables with Hive, maybe you should > have a look at this project from Uber: https://uber.github.io/hudi/ > I don't know how mature it is yet, but I think it aims at solving that kind > of challenge. Depending on your hive setup, you don't need a

Re: Optimal approach for changing file format of a partitioned table

2018-08-06 Thread Gopal Vijayaraghavan
A Hive version would help to preface this, because it matters here (e.g. TEZ-3709 doesn't apply to hive-1.2). > I’m trying to simply change the format of a very large partitioned table from > Json to ORC. I’m finding that it is unexpectedly resource intensive, > primarily due to a
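A minimal sketch of the usual conversion path, assuming a hypothetical JSON-backed source table and an ORC target with the same partition column (details not from the thread):

    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE events_orc PARTITION (dt)
    SELECT col1, col2, dt
    FROM events_json;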

Re: Clustering and Large-scale analysis of Hive Queries

2018-08-03 Thread Gopal Vijayaraghavan
> I am interested in working on a project that takes a large number of Hive > queries (as well as their meta data like amount of resources used etc) and > find out common sub queries and expensive query groups etc. This was roughly the central research topic of one of the Hive CBO devs,

Re: Total length of orc clustered table is always 2^31 in TezSplitGrouper

2018-07-24 Thread Gopal Vijayaraghavan
> Search ’Total length’ in log sys_dag_xxx, it is 2147483648. This is the INT_MAX “placeholder” value for uncompacted ACID tables. This is because with ACIDv1 there is no way to generate splits against uncompacted files, so this gets “an empty bucket + unknown number of inserts + updates”

Re: Using snappy compresscodec in hive

2018-07-23 Thread Gopal Vijayaraghavan
> "TBLPROPERTIES ("orc.compress"="Snappy"); " That doesn't use the Hadoop SnappyCodec, but uses a pure-java version (which is slower, but always works). The Hadoop snappyCodec needs libsnappy installed on all hosts. Cheers, Gopal

Re: Hive generating different DAGs from the same query

2018-07-19 Thread Gopal Vijayaraghavan
> My conclusion is that a query can update some internal states of HiveServer2, > affecting DAG generation for subsequent queries. Other than the automatic reoptimization feature, there are two other potential suspects. The first would be to disable the in-memory stats cache's variance param,

Re: Cannot INSERT OVERWRITE on clustered table with > 8 buckets

2018-07-14 Thread Gopal Vijayaraghavan
> Or will a simple insert be automatically sorted as the table DDL mentions? A simple insert should do the sorting; older versions of Hive had the ability to disable that (which is a bad thing, so these settings are now hard-coded to true in Hive 3.x) -- set
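On releases where the behaviour is still switchable, the settings being referred to are roughly (a sketch for a pre-2.x install):

    SET hive.enforce.bucketing=true;
    SET hive.enforce.sorting=true;

    INSERT OVERWRITE TABLE bucketed_sorted_table
    SELECT * FROM staging;

In Hive 2.x/3.x these flags are effectively always on, so a plain INSERT honours the CLUSTERED BY / SORTED BY declaration.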

Re: Cannot INSERT OVERWRITE on clustered table with > 8 buckets

2018-07-13 Thread Gopal Vijayaraghavan
> I'm using Hive 1.2.1 with LLAP on HDP 2.6.5. Tez AM is 3GB, there are 3 > daemons for a total of 34816 MB. Assuming you're using Hive2 here (with LLAP) and LLAP kinda sucks for ETL workloads, but this is a different problem. > PARTITIONED BY (DATAPASSAGGIO string, ORAPASSAGGIO string) >

Re: Hive LLAP Macro and Window Function

2018-06-27 Thread Gopal Vijayaraghavan
> When LLAP Execution Mode is set to 'only' you can't have a macro and window > function in the same select statement. The "only" part isn't enforced for the simple select query, but is enforced for the complex one (the PTF one). > select col_1, col_2 from macro_bug where otrim(col_1) is not

Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception

2018-06-25 Thread Gopal Vijayaraghavan
> This is Hadoop 3.0.3 > java.lang.NoSuchMethodError: > org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I > (state=08S01,code=1) > Something is missing here! Is this specific to ORC tables? No, it is a Hadoop BUG. https://issues.apache.org/jira/browse/HADOOP-14683

Re: Hive storm streaming with s3 file system

2018-06-12 Thread Gopal Vijayaraghavan
> So transactional tables only work with hdfs. Thanks for the confirmation > Elliot. No, that's not what was said. Streaming ingest into transactional tables requires strong filesystem consistency and a flush-to-remote operation (hflush). S3 supports neither of those things and HDFS is not the

Re: issues with Hive 3 simple sellect from an ORC table

2018-06-08 Thread Gopal Vijayaraghavan
> It is 2.7.3 + > Error: java.io.IOException: java.lang.RuntimeException: ORC split generation > failed with exception: java.lang.NoSuchMethodError: > org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I > (state=,code=0)

Re: MERGE performances issue

2018-05-07 Thread Gopal Vijayaraghavan
> Then I am wondering if the merge statement is impracticable because > I am using it badly or because this feature is just not mature enough. Since you haven't mentioned a Hive version here, I'm going to assume you're running some variant of Hive 1.x & that has some fundamental physical planning

Re: insert overwrite to hive orc table in aws

2018-05-01 Thread Gopal Vijayaraghavan
> delta_000_000 ... > I am using Glue data catalog as metastore, so should there be any link up to > these tables from hive? That would be why transactions are returning as 0 (there is never a transaction 0), because it is not using a Hive standard metastore. You might not be able to

Re: Hive External Table with Zero Bytes files

2018-04-29 Thread Gopal Vijayaraghavan
> We are copying data from upstream system into our storage S3. As part of > copy, directories along with Zero bytes files are been copied. Is this exactly the same issue as the previous thread or a different one?

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-04 Thread Gopal Vijayaraghavan
so they're asking "where is the Hive bucketing spec". Is it just to read the code for that function? They were looking for something more explicit, I think. Thanks - Original Message - From: "Gopal Vijayaraghavan" <gop...@apache

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Gopal Vijayaraghavan
>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. > My understanding was that this was related to the number of buckets, but you > are relating it to ORC stripes. It is odd that no examples that I've seen > include the SORTED BY statement other than in relation to

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-02 Thread Gopal Vijayaraghavan
There's more here than Bucketing or Tez. > PARTITIONED BY(daydate STRING, epoch BIGINT) > CLUSTERED BY(r_crs_id) INTO 64 BUCKETS I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp. CLUSTERED BY does not CLUSTER BY, which it should (but it
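A hedged sketch of the DDL with an explicit SORTED BY clause added (column names borrowed from the quoted DDL, the rest hypothetical):

    CREATE TABLE fact (r_crs_id BIGINT, val DOUBLE)
    PARTITIONED BY (daydate STRING)
    CLUSTERED BY (r_crs_id) SORTED BY (r_crs_id) INTO 64 BUCKETS
    STORED AS ORC;

CLUSTERED BY only controls how rows hash into bucket files; SORTED BY is what actually orders rows within each bucket, which is the distinction being made above.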

Re: Changing compression format of existing table from snappy to zlib

2018-03-14 Thread Gopal Vijayaraghavan
Hi, > Would this also ensure that all the existing data compressed in snappy format > and the new data stored in zlib format can work in tandem with no disruptions > or issues to end users who query the table. Yes. Each file encodes its own compressor kind & readers use that. The writers
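A minimal sketch of the switch (hypothetical table name):

    ALTER TABLE t_orc SET TBLPROPERTIES ('orc.compress'='ZLIB');

Only files written after the change use ZLIB; existing Snappy files keep working because, as noted above, each ORC file records its own compression kind and readers pick it up per file.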

Re: Best way/tool to debug memory leaks in HiveServer2

2018-03-13 Thread Gopal Vijayaraghavan
> It also shows that the process is consuming more than 30GB. However, it is > not clear what is causing the process to consume more than 30GB. The Xmx only applies to the heap size, there's another factor that is usually ignored which are the network buffers and compression buffers used by

Re: Hive 1.2.1 (HDP) ArrayIndexOutOfBounds for highly compressed ORC files

2018-02-26 Thread Gopal Vijayaraghavan
Hi, > Caused by: java.lang.ArrayIndexOutOfBoundsException > at > org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453) In general HDP specific issues tend to get more attention on HCC, but this is a pretty old issue stemming from MapReduce being designed for fairly

Re: HQL parser internals

2018-02-16 Thread Gopal Vijayaraghavan
> However, ideally we wish to manipulate the original query as delivered by the > user (or as close to it as possible), and we’re finding that the tree has > been modified significantly by the time it hits the hook That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook - the

Re: Question on accessing LLAP as data cache from external containers

2018-02-02 Thread Gopal Vijayaraghavan
> For example, a Hive job may start Tez containers, which then retrieve data > from LLAP running concurrently. In the current implementation, this is > unrealistic That is how LLAP was built - to push work from Tez to LLAP vertex by vertex, instead of an all-or-nothing implementation. Here

Re: Hive performance issue with _ character in query

2018-01-18 Thread Gopal Vijayaraghavan
Hi, > I wanted to understand why hive has a performance issue with using _ > character in queries. This is somewhat of a missed optimization issue - the "%" impl uses a fast BoyerMoore algorithm and avoids converting from utf-8 bytes -> String.

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-27 Thread Gopal Vijayaraghavan
Hi, If you've got the 1st starvation fixed (with Hadoop 2.8 patch), all these configs + enable log4j2 async logging, you should definitely see a performance improvement. Here's the log patches, which need a corresponding LLAP config (& have to be disabled in HS2, for the progress bar to work)

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-24 Thread Gopal Vijayaraghavan
Hi, > In our test, we found the shuffle stage of LLAP is very slow. Whether need to > configure some related shuffle value or not? Shuffle is the one hit by the 2nd, 3rd and 4th resource starvation issues listed earlier (FDs, somaxconn & DNS UDP packet loss). > And we get the following log

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-22 Thread Gopal Vijayaraghavan
Hi, > With these configurations, the cpu utilization of llap is very low. Low CPU usage has been observed with LLAP due to RPC starvation. I'm going to assume that the build you're testing is a raw Hadoop 2.7.3 with no additional patches? Hadoop-RPC is single-threaded & has a single mutex

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-21 Thread Gopal Vijayaraghavan
Hi, > Please help us find whether we use the wrong configuration. Thanks for your > help. Since there are no details, I'm not sure what configuration you are discussing here. A first step would be to check if LLAP cache is actually being used (the LLAP IO in the explain), vectorization is

Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Gopal Vijayaraghavan
> Why jdbc read them as control symbols? Most likely this is already fixed by https://issues.apache.org/jira/browse/HIVE-1608 That pretty much makes the default as set hive.query.result.fileformat=SequenceFile; Cheers, Gopal

Re: Hive JDBC - Method not Supported

2017-11-01 Thread Gopal Vijayaraghavan
Hi, > org.apache.hive.jdbc.HiveResultSetMetaData.getTableName(HiveResultSetMetaData.java:102) https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java#L102 I don't think this issue is fixed in any release - this probably needs to go into a

Re: In reduce task,i have a join operation ,and i found "org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1" cast much long

2017-10-19 Thread Gopal Vijayaraghavan
> . I didn't see data skew for that reducer. It has similar amount of > REDUCE_INPUT_RECORDS as other reducers. … > org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000 rows for > join key [4092813312923569] The ratio of REDUCE_INPUT_RECORDS and REDUCE_INPUT_GROUPS is what is

Re: hive window function can only calculate the main table?

2017-10-09 Thread Gopal Vijayaraghavan
> ) t_result where formable = 't1' … > This SQL takes 29+ hours on an 11-computer cluster with 600G of memory. > In my opinion, the time is wasted in the `order by sampledate` and in `calculate > the table B’s record`. Is there a setting so that `table B`’s records do not > get the ‘avg_wfoy_b2’ column,

Re: Hive query starts own session for LLAP

2017-09-27 Thread Gopal Vijayaraghavan
> Now we need an explanation of "map" -- can you supply it? The "map" mode runs all tasks with a TableScan operator inside LLAP instances and all other tasks in Tez YARN containers. This is the LLAP + Tez hybrid mode, which introduces some complexity in debugging a single query. The "only"

Re: hive on spark - why is it so hard?

2017-09-26 Thread Gopal Vijayaraghavan
Hi, > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark > session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create > spark client. I get inexplicable errors with Hive-on-Spark unless I do a three step build. Build Hive first, use that version to

Re: Benchmarking Hive ACID functionality

2017-09-25 Thread Gopal Vijayaraghavan
> Are there any frameworks like TPC-DS to benchmark Hive ACID functionality? Are you trying to work on and improve Hive ACID? I have a few ACID micro-benchmarks like this https://github.com/t3rmin4t0r/acid2x-jmh so that I can test the inner loops of ACID without having any ORC data at all.

Re: Error when running TPCDS query with Hive+LLAP

2017-09-25 Thread Gopal Vijayaraghavan
> Caused by: > org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError: > VectorMapJoin Hash table loading exceeded memory limits. > estimatedMemoryUsage: 1644167752 noconditionalTaskSize: 463667612 > inflationFactor: 2.0 threshold: 927335232 effectiveThreshold: 927335232 Most

Re: Hive LLAP service is not starting

2017-09-11 Thread Gopal Vijayaraghavan
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: > /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory) LLAP expects to find a tarball where tez.lib.uris is - looks like you've got a directory? Cheers, Gopal

Re: ORC Transaction Table - Spark

2017-08-24 Thread Gopal Vijayaraghavan
> Or, is this an artifact of an incompatibility between ORC files written by > the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde? > 3. Is there a difference in the ORC file format spec. at play here? Nope, we're still defaulting to hive-0.12 format ORC files in Hive-2.x. We

Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread Gopal Vijayaraghavan
> COUNT(DISTINCT monthly_user_id) AS monthly_active_users, > COUNT(DISTINCT weekly_user_id) AS weekly_active_users, … > GROUPING_ID() AS gid, > COUNT(1) AS dummy There are two things which prevent Hive from optimizing multiple count distincts. Another aggregate like a count(1) or a Grouping sets
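As a rough illustration (hypothetical table t): a single distinct can be rewritten into a two-stage group-by, which is the optimization that the extra aggregates block:

    -- single count(distinct), rewritable as a group-by + count
    SELECT COUNT(*)
    FROM (SELECT monthly_user_id FROM t
          WHERE monthly_user_id IS NOT NULL
          GROUP BY monthly_user_id) x;

    -- multiple distincts plus count(1)/GROUPING_ID() in one query is the
    -- shape that currently prevents that rewrite, per the note above
    SELECT COUNT(DISTINCT monthly_user_id),
           COUNT(DISTINCT weekly_user_id),
           COUNT(1)
    FROM t;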

Re: Hive index + Tez engine = no performance gain?!

2017-08-22 Thread Gopal Vijayaraghavan
TL;DR - A Materialized view is a much more useful construct than trying to get limited indexes to work. That is a pretty lively project that has been going on for a while with Druid+LLAP https://issues.apache.org/jira/browse/HIVE-14486 > This seems out of the blue but my initial benchmarks

Re: Long time compiling query/explain.....

2017-08-14 Thread Gopal Vijayaraghavan
> Running Hive 2.2 w/ LLAP enabled (tried the same thing in Hive 2.3 w/ LLAP), > queries working but when we submit queries like the following (via our > automated test framework), they just seem to hang with Parsing > Command... Other queries seem to work fine. Any idea on what's going on

Re: LLAP Query Failed with no such method exception

2017-08-02 Thread Gopal Vijayaraghavan
Hi, > java.lang.Exception: java.util.concurrent.ExecutionException: > java.lang.NoSuchMethodError: > org.apache.hadoop.tracing.SpanReceiverHost.getInstance(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/tracing/SpanReceiverHost; There's a good possibility that you've built

Re: group by + two nulls in a row = bug?

2017-06-27 Thread Gopal Vijayaraghavan
>               cast(NULL as bigint) as malone_id, >               cast(NULL as bigint) as zpid, I ran this on master (with text vectorization off) and I get 20170626123 NULLNULL10 However, I think the backtracking for the columns is broken, somewhere - where both the nulls

Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan
> I guess I see different things. Having used all the tech. In particular for > large hive queries I see OOM simply SCANNING THE INPUT of a data directory, > after 20 seconds! If you've got an LLAP deployment you're not happy with - this list is the right place to air your grievances. I

Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan
> It is not that simple. The average Hadoop user has 6-7 years of data. They do > not have a "magic" convert everything button. They also have legacy processes > that don't/can't be converted. … > They do not want the "fastest format" they want "the fastest hive for their > data". I've yet

Re: Format dillema

2017-06-22 Thread Gopal Vijayaraghavan
> I kept hearing about vectorization, but later found out it was going to work > if i used ORC. Yes, it's a tautology - if you cared about performance, you'd use ORC, because ORC is the fastest format. And doing performance work to support folks who don't quite care about it, is not exactly

Re: Hive query on ORC table is really slow compared to Presto

2017-06-22 Thread Gopal Vijayaraghavan
> 1711647 -1032220119 Ok, so this is the hashCode skew issue, probably the one we already know about. https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4 String hashcode distribution is much better in master after that. Hopefully that fixes the distinct speed issue

Re: Format dillema

2017-06-20 Thread Gopal Vijayaraghavan
> 1) both do the same thing.  The start of this thread is the exact opposite - trying to suggest ORC is better for storage & wanting to use it. > As it relates the columnar formats, it is silly arms race. I'm not sure "silly" is the operative word - we've lost a lot of fragmentation of the

Re: Hive query on ORC table is really slow compared to Presto

2017-06-14 Thread Gopal Vijayaraghavan
> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds Ok, I misunderstood your gist. > While ip is more unique that id, ip runs many times faster than id. > > How can I debug this ? Nearly the same way - just replace "ip" with "id" in my

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Gopal Vijayaraghavan
Hi, I think this is worth fixing because this seems to be triggered by the data quality itself - so let me dig in a bit into a couple more scenarios. > hive.optimize.distinct.rewrite is True by default FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer now (which

Re: Migrating Variable Length Files to Hive

2017-06-02 Thread Gopal Vijayaraghavan
> We are looking at migrating  files(less than 5 Mb of data in total) with > variable record lengths from a mainframe system to hive. https://issues.apache.org/jira/browse/HIVE-10856 + https://github.com/rbheemana/Cobol-to-Hive/ came up on this list a while back. > Are there other

Re: question on setting up llap

2017-05-10 Thread Gopal Vijayaraghavan
> for the slider 0.92, the patch is already applied, right? Yes, except it has been refactored to a different place. https://github.com/apache/incubator-slider/blob/branches/branch-0.92/slider-agent/src/main/python/agent/NetUtil.py#L44 Cheers, Gopal

Re: question on setting up llap

2017-05-09 Thread Gopal Vijayaraghavan
> NetUtil.py:60 - [Errno 8] _ssl.c:492: EOF occurred in violation of protocol The error is directly related to the SSL verification error - TLSv1.0 vs TLSv1.2. JDK8 defaults to v1.2 and Python 2.6 defaults to v1.0. Python 2.7.9 + the patch in 0.92 might be needed to get this to work. AFAIK,

Re: question on setting up llap

2017-05-09 Thread Gopal Vijayaraghavan
> ERROR 2017-05-09 22:04:56,469 NetUtil.py:62 - SSLError: Failed to connect. > Please check openssl library versions. … > I am using hive 2.1.0, slider 0.92.0, tez 0.8.5 AFAIK, this was reportedly fixed in 0.92. https://issues.apache.org/jira/browse/SLIDER-942 I'm not sure if the fix in that

Re: Hive LLAP with Parquet format

2017-05-04 Thread Gopal Vijayaraghavan
Hi, > Does Hive LLAP work with Parquet format as well? LLAP does work with the Parquet format, but it does not work very fast, because the java Parquet reader is slow. https://issues.apache.org/jira/browse/PARQUET-131 + https://issues.apache.org/jira/browse/HIVE-14826 In particular to

Re: Hive Partitioned View query error

2017-04-24 Thread Gopal Vijayaraghavan
> But on Hue or JDBC interface to Hive Server 2, the following error occurs > while SELECT querying the view. You should be getting identical errors for HS2 and CLI, so that suggests you might be running different CLI and HS2 versions. > SELECT COUNT(1) FROM pk_test where ds='2017-04-20'; >

Re: How to create auto increment key for a table in hive?

2017-04-12 Thread Gopal Vijayaraghavan
> I'd like to note that Hive supports ACID (still at a very early stage) but > most often that is a feature that most people don't use for real production > systems. Yes, you need ACID to maintain multiple writers correctly. ACID does have a global primary key (which is not a single

Re: beeline connection to Hive using both Kerberos and LDAP with SSL

2017-04-07 Thread Gopal Vijayaraghavan
> Is there anyway one can enable both (Kerberos and LDAP with SSL) on Hive? I believe what you're looking for is Apache Knox SSO. And for LDAP users, Apache Ranger user-sync handles auto-configuration. That is how SSL+LDAP+JDBC works in the HD Cloud gateway [1]. There might be a similar

Re: Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Gopal Vijayaraghavan
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts; … > 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s] I'm hoping this is not rewriting to the approx_distinct() in Presto. > I got similar performance with Hive + LLAP too. This is a logical plan issue, so I don't know if LLAP helps a lot. A

Re: LLAP queries create a yarn app per query

2017-03-28 Thread Gopal Vijayaraghavan
> My bad. Looks like the thrift server is cycling through various AMs it > started when the thrift server was started. I think this is different from > either Hive 2.0.1 or LLAP.  This has roughly been possible since hive-1.0, if you follow any of the Tez BI tuning guides over the last 4

Re: Hive on Tez: Tez taking nX more containers than Mapreduce for union all

2017-03-17 Thread Gopal Vijayaraghavan
> We are using a query with union all and groupby and same table is read > multiple times in the union all subquery. … > When run with Mapreduce, the job is run in one stage consuming n mappers and > m reducers and all union all scans are done with the same job. The logical plans are identical

Re: [Hive on Tez] Running queries in tez non-session mode not working

2017-03-14 Thread Gopal Vijayaraghavan
> by setting tez.am.mode.session=false in hive-cli and hive-jdbc via > hive-server2. That setting does not work if you do "set tez.am.*" parameters (any tez.am params). Can you try doing hive --hiveconf tez.am.mode.session=false instead of a set; param and see if that works? Cheers,

Re: ODBC - NullPointerException when loading data

2017-02-02 Thread Gopal Vijayaraghavan
> Using Apache Hive 1.2.1, I get a NullPointerException when performing a > request through an ODBC driver. > The request is just a simple LOAD DATA request: Looks like the NPE is coming from the getResultMetaData() call, which returns the schema of the rows returned. LOAD is probably one of

Re: Compactions doesn't launch on Hive 1.1.0

2017-02-02 Thread Gopal Vijayaraghavan
> I tried reducing the check interval and launching it manually with the command "Alter > table tx_tbl compaction 'major';". Nothing helps. You can check the hive metastore log and confirm it also has the DbTxnManager set up & that it is triggering the compactions. Without a standalone metastore, the hive
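For reference, the standard syntax for requesting and then checking a compaction (table name as in the quoted message):

    ALTER TABLE tx_tbl COMPACT 'major';
    SHOW COMPACTIONS;

If nothing ever shows up in SHOW COMPACTIONS, that points at the metastore-side compactor (DbTxnManager / compactor threads) not being configured, which is the check suggested above.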

Re: Experimental results using TPC-DS (versus Spark and Presto)

2017-01-30 Thread Gopal Vijayaraghavan
> Gopal : (yarn logs -application $APPID) doesn't contain a line > containing HISTORY so it doesn't produce svg file. Should I turn on > some option to get the lines containing HISTORY in yarn application > log? There's a config option tez.am.log.level=INFO which controls how much data is

Re: Experimental results using TPC-DS (versus Spark and Presto)

2017-01-30 Thread Gopal Vijayaraghavan
> Hive LLAP shows better performance than Presto and Spark for most queries, > but it shows very poor performance on the execution of query 72. My suspicion would be the inventory x catalog_sales x warehouse join - assuming the column statistics are present and valid. If you could send the

Re: Hive Tez on External Table running on Single Mapper

2017-01-30 Thread Gopal Vijayaraghavan
> > 'skip.header.line.count'='1', Try removing that config option. I've definitely seen footer markers disabling file splitting, possibly header also does. Cheers, Gopal
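The property being discussed is a table-level one, e.g. (hypothetical DDL):

    CREATE EXTERNAL TABLE events_txt (id BIGINT, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/data/events'
    TBLPROPERTIES ('skip.header.line.count'='1');

Dropping that property (and stripping the header upstream instead) is a quick way to test whether it is what is pinning the scan to a single mapper.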

Re: Parquet tables with snappy compression

2017-01-25 Thread Gopal Vijayaraghavan
> Has there been any study of how much compressing Hive Parquet tables with > snappy reduces storage space or simply the table size in quantitative terms? http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet/20 Since SNAPPY is just LZ77, I would assume it would be

Re: Only External tables can have an explicit location

2017-01-25 Thread Gopal Vijayaraghavan
> Error 40003]: Only External tables can have an explicit location … > using hive 1.2. I got this error. This was definitely not a requirement > before Are you using Apache hive or some vendor fork? Some BI engines demand there be no aliasing for tables, so each table needs a unique location
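For comparison, the form that error message expects is (hypothetical table and location):

    CREATE EXTERNAL TABLE t (id BIGINT, msg STRING)
    STORED AS ORC
    LOCATION '/warehouse/external/t';

In stock Apache Hive 1.2 a managed table can also carry an explicit LOCATION, which is why the question about a vendor fork matters here.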

Re: Hive Tez on External Table running on Single Mapper

2017-01-23 Thread Gopal Vijayaraghavan
> We have 20 GB txt File, When we have created external table on top of 20 > Gb file, we see Tez is creating only one mapper. For an uncompressed file, that is very strange. Is this created as "STORED AS TEXTFILE" or some other strange format? Cheers, Gopal

Re: Cannot connect to Hive even from local

2017-01-22 Thread Gopal Vijayaraghavan
> !connect jdbc:hive2://localhost:1/default; -n hiveuser -p hivepassword ... > What's missing here? how do I fix it? Thank you very much Mostly, this is missing the actual protocol specs - this is something which is never a problem for real clusters because ZK load-balancing automatically
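For reference, a couple of typical connect strings (hostnames and ports hypothetical):

    -- direct connection to a single HiveServer2 on the default binary port
    jdbc:hive2://localhost:10000/default

    -- via ZooKeeper service discovery, as used on real clusters
    jdbc:hive2://zk1:2181,zk2:2181,zk3:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2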

Re: unable to create or grant privileges or roles under beeline

2017-01-20 Thread Gopal Vijayaraghavan
> I have been following the instructions under > https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization > in great detail with no success. … > Error: org.apache.spark.sql.catalyst.parser.ParseException: You're reading the docs for Apache Hive and trying to

Re: how to customize tez query app name

2017-01-20 Thread Gopal Vijayaraghavan
> So no one has a solution? … > “mapreduce.job.name” works for M/R queries, not Tez. Depends on the Hive version you're talking about. https://issues.apache.org/jira/browse/HIVE-12357 That doesn't help you with YARN, but only with the TezUI (since each YARN AM runs > 1 queries). For

Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Gopal Vijayaraghavan
> Sounds like VARCHAR and CHAR types were created for Hive to have ANSI SQL > Compliance. Otherwise they seem to be practically the same as String types. They are relatively identical in storage, except both are slower on the CPU in actual use (CHAR has additional padding code in the

Re: Vectorised Queries in Hive

2017-01-11 Thread Gopal Vijayaraghavan
> I have also noticed that this execution mode is only applicable to single > predicate search. It does not work with multiple predicates searches. Can > someone confirms this please? Can you explain what you mean? Vectorization supports multiple & nested AND+OR predicates - with some extra

Re: Zero Bytes Files importance

2017-01-03 Thread Gopal Vijayaraghavan
> Thanks Gopal. Yeah I'm using CloudBerry.  Storage is Azure. Makes sense, only an object store would have this. > Are you saying these _0,1,2,3 are directories? No, only the zero size "files". This is really for compat with regular filesystems. If you have /tmp/1/foo in an object

Re: Zero Bytes Files importance

2016-12-29 Thread Gopal Vijayaraghavan
> For any insert operation, there will be one Zero bytes file. I would like to > know importance of this Zero bytes file. They are directories. I'm assuming you're using S3A + screenshots from something like Bucket explorer. These directory entries will not be shown if you do something like

Re: Can Beeline handle HTTP redirect?

2016-12-22 Thread Gopal Vijayaraghavan
> I want to know whether Beeline can handle HTTP redirect or not. I was > wondering if some of Beeline experts can answer my question? Beeline uses the hive-jdbc driver, which is the one actually handling network connections. That driver, in turn, uses a standard

Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
> Actually, we don't have that many partitions - there are lot of gaps both in > days and time events as well. Your partition description sounded a lot like one of the FAQs from Mithun's talks, which is why I asked

Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
> The partition is by year/month/day/hour/minute. I have two directories - over > two years, and the total number of records is 50Million.  That's a million partitions with 50 rows in each of them? > I am seeing it takes more than 1hr to complete. Any thoughts, on what could > be the issue or
