Re: Optimizing limit 0 / null scan cases (HIVE-7203)

2016-08-19 Thread Gopal Vijayaraghavan
> CREATE TABLE T . AS The limit clause in the outermost query should prevent the entire query from executing. However, the CREATE TABLE expression and the UNION ALL are rather challenging in this matter. If you have queries which don't hit the NULL-scan fully, a BUG report would be

Re: hive throws ConcurrentModificationException when executing insert overwrite table

2016-08-16 Thread Gopal Vijayaraghavan
> Yes, Kylin generated the query. I'm using Kylin 1.5.3. I would report a bug to Kylin about DISTRIBUTE BY RAND(). This is what happens when a node which ran a Map task fails and the whole task is retried. Assume that the first attempt of the Map task0 wrote value1 into reducer-99, because

Re: Loading Sybase to hive using sqoop

2016-08-24 Thread Gopal Vijayaraghavan
> val d = HiveContext.read.format("jdbc").options( ... >> The sqoop job takes 7 hours to load 15 days of data, even while setting >>the direct load option to 6. Hive is using MR framework. In general, the jdbc implementations tend to react rather badly to large extracts like this - the

[DISCUSS] Writing Fast Vectorized UDFs

2016-09-06 Thread Gopal Vijayaraghavan
Hi, I'm writing this because I realize this stuff needs users to comment on before it gets set into an ABI release. Hive 2.2.0 will check if a user's UDF has a vectorized version before wrapping it row-by-row. @VectorizedExpressions(value = { VectorStringRot13.class })

Re: hive.root.logger influencing query plan?? so it's not so

2016-09-06 Thread Gopal Vijayaraghavan
> another case of a query hangin' in v2.1.0. I'm not sure that's a hang. If you can repro this, can you please do a jstack while it is "hanging" (like a jstack of hiveserver2 or cli)? I have a theory that you're hitting a slow path in HDFS remote read because of the following stacktrace.

Re: Beeline throws OOM on large input query

2016-09-06 Thread Gopal Vijayaraghavan
> 1) confirm your beeline java process is indeed running with expanded memory The OOM error is clearly coming from the HiveServer2 CBO codepath post beeline. at org.apache.calcite.rel.AbstractRelNode$1.explain_(AbstractRelNode.java:409) at

Re: Quota for rogue ad-hoc queries

2016-09-01 Thread Gopal Vijayaraghavan
> Are there any other ways? Are you running Tez? Tez heartbeats counters back to the AppMaster every few seconds, so the AppMaster has an accurate (but delayed) count of HDFS_BYTES_WRITTEN. Cheers, Gopal

Re: hive.root.logger influencing query plan?? so it's not so

2016-08-30 Thread Gopal Vijayaraghavan
> I have a query that hangs (never returns). However, when i turn on >logging to DEBUG level it works. I'm stumped. Slowing down things does make a lot of stuff work - logging does something more than slow things down, it actually introduces a synchronization point (global lock) for each log.

Re: load data Failed with exception java.lang.IndexOutOfBoundsException

2016-09-09 Thread Gopal Vijayaraghavan
> It will be ok if the file has more than two characters, that is a little > interesting. I cannot understand the result of function checkInputFormat is > OrcInputFormat, maybe that is just right. My guess is that it is trying to read the 3-letter string "ORC" from that file and failing.

Re: Help with 'alter is not possible' in metastore

2016-09-13 Thread Gopal Vijayaraghavan
> java.sql.SQLException: Error while processing statement: FAILED: Execution > Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Unable to > alter partition. alter is not possible > Altering works fine in a test table in our dev env. The logs here aren't too > helpful. The

Re: HDFS small files to Sequence file using Hive

2016-09-23 Thread Gopal Vijayaraghavan
> Is there a way to create an external table on a directory, extract 'key' as > file name and 'value' as file content and write to a sequence file table? Do you care that it is a sequence file? The HDFS HAR format was invented for this particular problem, check if the "hadoop archive" command

Re: on duplicate update equivalent?

2016-09-23 Thread Gopal Vijayaraghavan
> Dimensions change, and I'd rather do update than recreate a snapshot. Slowly changing dimensions are the common use-case for Hive's ACID MERGE. The feature you need is most likely covered by https://issues.apache.org/jira/browse/HIVE-10924. 2nd comment from that JIRA: "Once an hour, a set of
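
A minimal sketch of the slowly-changing-dimension upsert that ACID MERGE enables, assuming a hypothetical transactional target table dim_customer and staging table stg_customer (names not from the thread):

    -- requires an ACID (transactional) target table and a Hive release with MERGE support
    MERGE INTO dim_customer AS d
    USING stg_customer AS s
    ON d.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
    WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.city);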

Re: hive 2.1.0 + drop view

2016-08-26 Thread Gopal Vijayaraghavan
> NULL::character%20varying) ... > i want to say this is somehow related to a java version (we're using 8) >but i'm not sure. The "character varying" looks a lot like a Postgres issue to me (though character varying could be the real term for varchar in another DB). The hive-metastore.log

Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Gopal Vijayaraghavan
> anybody run up against this one? hive 2.1.0 + using a "not in" on a >list + the column is a partition key participant. The partition filters are run before the plan is generated. >AND etl_source_database not in ('foo') Is there a 'foo' in etl_source_database? > predicate:

Re: hive 2.1.0 and "NOT IN ( list )" and column is a partition_key

2016-08-25 Thread Gopal Vijayaraghavan
> not array_contains(array('foo'), partition_key) And this is why that works. https://issues.apache.org/jira/browse/HIVE-13951 :( Cheers, Gopal
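
For illustration, the two predicate shapes discussed in this thread, written against a hypothetical partitioned table events (only the column name comes from the thread):

    -- the NOT IN form on the partition key that behaved unexpectedly:
    SELECT COUNT(*) FROM events WHERE etl_source_database NOT IN ('foo');
    -- the equivalent array_contains rewrite that the poster found to work:
    SELECT COUNT(*) FROM events WHERE NOT array_contains(array('foo'), etl_source_database);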

Re: Hive on Tez CTAS query breaks

2016-11-09 Thread Gopal Vijayaraghavan
> If I run a query with CREATE TABLE AS, it breaks with the error below. > However, just running the query works if I don't try to create a table from > the results. It does not happen to all CTAS queries.  Not sure if that's related to Tez at all. Can try running it with set

Re: Hive Runtime Error processing row

2016-11-10 Thread Gopal Vijayaraghavan
> I'm running into the below error occasionally and I'm not 100% certain what's > going on. Does anyone have a hunch what might be happening here or where we > can dig for more ideas? Removed row contents but there are multiple columns. You can try a repro run by doing set

Re: Connect metadata

2016-10-25 Thread Gopal Vijayaraghavan
> Could someone provide me with a code snippet (preferably Java) that installs > the schema (through datanucleus) on my empty metastore (postgres) I wish it was that simple, but do not leave it to the Hive startup to create it - create it explicitly with schematool

Re: Question about partition pruning when there's a type mismatch

2016-11-28 Thread Gopal Vijayaraghavan
> I'm wondering why Hive tries to scan all partitions when the quotes are > omitted. Without the quotes, shouldn't 2016-11-28-00 get evaluated as an > arithmetic expression, then get cast to a string, and then partitioning > pruning still occur? The order of evaluation is different - String =
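
A hedged illustration, assuming a table logs partitioned by a string column dt:

    SELECT COUNT(*) FROM logs WHERE dt = '2016-11-28-00';  -- string-to-string comparison, partitions are pruned
    SELECT COUNT(*) FROM logs WHERE dt = 2016-11-28-00;    -- unquoted literal is parsed as arithmetic, so pruning can be lost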

Re: Hive on Tez CTAS query breaks

2016-11-11 Thread Gopal Vijayaraghavan
> Thanx for the suggestion. It works with the setting you suggested. > > What does this mean? Do I need to special case this query. You need to report a bug on https://issues.apache.org/jira/browse/HIVE because this needs to get fixed. > Turning off CBO cluster-wide won't be the right thing
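
The exact setting is cut off in the earlier reply, but given the remark about not disabling CBO cluster-wide, the per-session workaround presumably looked something like this (table names are placeholders):

    set hive.cbo.enable=false;
    CREATE TABLE tmp_copy AS SELECT * FROM src_table;
    set hive.cbo.enable=true;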

Re: a GROUP BY that is not fully grouping

2016-11-01 Thread Gopal Vijayaraghavan
> I've run into a GROUP BY that does not work reliably in the newer version: > the GROUP BY results are not always fully aggregated. Instead, I get lots of > duplicate + triplicate sets of group values. Seems like a Hive bug to me That does sound like a bug, but this information is not enough

Re: a GROUP BY that is not fully grouping

2016-11-02 Thread Gopal Vijayaraghavan
> Attached (assuming attachments actually work on this list) are three > explains: .. > TungstenAggregate(key=[artid#5,artorsec#0,page#11,geo#12, The explain plan shows you are not using Hive-on-Spark, but SparkSQL. > If you'd like any other info, or if you'd like me to test with other

Re: Hive/Tez local mode running out of memory

2016-11-01 Thread Gopal Vijayaraghavan
> I am not sure what is going on here. You can check the /tmp/$USER/hive.log and see what's happening in detail. Cheers, Gopal

Re: Hive/Tez local mode running out of memory

2016-11-01 Thread Gopal Vijayaraghavan
> Thanx for the reply. We don't override the log level. According to the docs, > looks like the default level is INFO. > Any other ideas? That at first glance looks like a broken install. A good approach would be to use a Tez cluster install instead of messing with a local mode runner

Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
> The partition is by year/month/day/hour/minute. I have two directories - over > two years, and the total number of records is 50Million.  That's a million partitions with 50 rows in each of them? > I am seeing it takes more than 1hr to complete. Any thoughts, on what could > be the issue or

Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
> Actually, we don't have that many partitions - there are lot of gaps both in > days and time events as well. Your partition description sounded a lot like one of the FAQs from Mithun's talks, which is why I asked

Re: Hive Stored Textfile to Stored ORC taking long time

2016-12-08 Thread Gopal Vijayaraghavan
> I have spark with only one worker (same for HDFS) so running now a standalone > server but with 25G and 14 cores on that worker. Which version of Hive was this? And was the input text file compressed with something like gzip? Cheers, Gopal

Re: Vectorised Queries in Hive

2017-01-11 Thread Gopal Vijayaraghavan
> I have also noticed that this execution mode is only applicable to single > predicate search. It does not work with multiple predicates searches. Can > someone confirms this please? Can you explain what you mean? Vectorization supports multiple & nested AND+OR predicates - with some extra

Re: Hive shell not using manually set tez container size

2016-12-01 Thread Gopal Vijayaraghavan
> set tez.task.resource.memory.mb to a different value than listed in > tez-site.xml, the query that's run doesn't seem to pick up the setting and > instead uses the one in the config file. Why not use the setting Hive uses in the submitted vertex? set hive.tez.container.size=? Cheers,
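
For example (sizes are illustrative; hive.tez.java.opts is usually adjusted together with the container size):

    set hive.tez.container.size=4096;
    set hive.tez.java.opts=-Xmx3276m;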

Re: Hive shell not using manually set tez container size

2016-12-04 Thread Gopal Vijayaraghavan
> even that setting is not being applied after the hive shell is started and a > query is executed.  Are you increasing it or decreasing it? Tez will reuse existing larger containers, instead of releasing them - reducing the parameter has almost no effect without a session restart. Also

Re: Bucketed table info

2016-11-30 Thread Gopal Vijayaraghavan
> If I have an orc table bucketed and sorted on a column, where does hive keep > the mapping from column value to bucket? Specifically, if I know the column > value, and need to find the specific hdfs file, is there an api to do this? The closest to an API is

Re: Zero Bytes Files importance

2017-01-03 Thread Gopal Vijayaraghavan
> Thanks Gopal. Yeah I'm using CloudBerry.  Storage is Azure. Makes sense, only an object store would have this. > Are you saying this _0,1,2,3 are directories ?. No, only the zero size "files". This is really for compat with regular filesystems. If you have /tmp/1/foo in an object

Re: Zero Bytes Files importance

2016-12-29 Thread Gopal Vijayaraghavan
> For any insert operation, there will be one Zero bytes file. I would like to > know importance of this Zero bytes file. They are directories. I'm assuming you're using S3A + screenshots from something like Bucket explorer. These directory entries will not be shown if you do something like

Re: Can Beeline handle HTTP redirect?

2016-12-22 Thread Gopal Vijayaraghavan
> I want to know whether Beeline can handle HTTP redirect or not. I was > wondering if some of the Beeline experts can answer my question? Beeline uses the hive-jdbc driver, which is the one actually handling network connections. That driver, in turn, uses a standard

Re: LLAP queries create a yarn app per query

2017-03-28 Thread Gopal Vijayaraghavan
> My bad. Looks like the thrift server is cycling through various AMs it > started when the thrift server was started. I think this is different from > either Hive 2.0.1 or LLAP.  This has roughly been possible since hive-1.0, if you follow any of the Tez BI tuning guides over the last 4

Re: Hive on Tez: Tez taking nX more containers than Mapreduce for union all

2017-03-17 Thread Gopal Vijayaraghavan
> We are using a query with union all and groupby and same table is read > multiple times in the union all subquery. … > When run with Mapreduce, the job is run in one stage consuming n mappers and > m reducers and all union all scans are done with the same job. The logical plans are identical

Re: [Hive on Tez] Running queries in tez non-session mode not working

2017-03-14 Thread Gopal Vijayaraghavan
> by setting tez.am.mode.session=false in hive-cli and hive-jdbc via > hive-server2. That setting does not work if you do "set tez.am.*" parameters (any tez.am params). Can you try doing hive --hiveconf tez.am.mode.session=false instead of a set; param and see if that works? Cheers,

Re: Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Gopal Vijayaraghavan
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts; … > 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s] I'm hoping this is not rewriting to the approx_distinct() in Presto. > I got similar performance with Hive + LLAP too. This is a logical plan issue, so I don't know if LLAP helps a lot. A
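
As a sketch of why the plan shape matters, the global COUNT(DISTINCT) can be rewritten by hand into a two-stage aggregation; this is only an illustration, not necessarily the rewrite Hive or Presto performs:

    -- the shape from the thread:
    SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
    -- hand-rewritten so the distinct work spreads across reducers
    -- (NULL ids would need extra handling to match COUNT(DISTINCT) exactly):
    SELECT SUM(cnt), COUNT(*) FROM (SELECT id, COUNT(*) AS cnt FROM accounts GROUP BY id) t;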

Re: How to create auto increment key for a table in hive?

2017-04-12 Thread Gopal Vijayaraghavan
> I'd like to remember that Hive supports ACID (in a very early stages yet) but > most often that is a feature that most people don't use for real production > systems. Yes, you need ACID to maintain multiple writers correctly. ACID does have a global primary key (which is not a single

Re: beeline connection to Hive using both Kerberos and LDAP with SSL

2017-04-07 Thread Gopal Vijayaraghavan
> Is there anyway one can enable both (Kerberos and LDAP with SSL) on Hive? I believe what you're looking for is Apache Knox SSO. And for LDAP users, Apache Ranger user-sync handles auto-configuration. That is how SSL+LDAP+JDBC works in the HD Cloud gateway [1]. There might be a similar

Re: Hive Partitioned View query error

2017-04-24 Thread Gopal Vijayaraghavan
> But on Hue or JDBC interface to Hive Server 2, the following error occurs > while SELECT querying the view. You should be getting identical errors for HS2 and CLI, so that suggests you might be running different CLI and HS2 versions. > SELECT COUNT(1) FROM pk_test where ds='2017-04-20'; >

Re: LLAP Query Failed with no such method exception

2017-08-02 Thread Gopal Vijayaraghavan
Hi, > java.lang.Exception: java.util.concurrent.ExecutionException: > java.lang.NoSuchMethodError: > org.apache.hadoop.tracing.SpanReceiverHost.getInstance(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/tracing/SpanReceiverHost; There's a good possibility that you've built

Re: Long time compiling query/explain.....

2017-08-14 Thread Gopal Vijayaraghavan
> Running Hive 2.2 w/ LLAP enabled (tried the same thing in Hive 2.3 w/ LLAP), > queries working but when we submit queries like the following (via our > automated test framework), they just seem to hang with Parsing > Command. Other queries seem to work fine. Any idea on what's going on

Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread Gopal Vijayaraghavan
> COUNT(DISTINCT monthly_user_id) AS monthly_active_users, > COUNT(DISTINCT weekly_user_id) AS weekly_active_users, … > GROUPING_ID() AS gid, > COUNT(1) AS dummy There are two things which prevent Hive from optimizing multiple count distincts. Another aggregate like a count(1) or a Grouping sets

Re: question on setting up llap

2017-05-09 Thread Gopal Vijayaraghavan
> ERROR 2017-05-09 22:04:56,469 NetUtil.py:62 - SSLError: Failed to connect. > Please check openssl library versions. … > I am using hive 2.1.0, slider 0.92.0, tez 0.8.5 AFAIK, this was reportedly fixed in 0.92. https://issues.apache.org/jira/browse/SLIDER-942 I'm not sure if the fix in that

Re: question on setting up llap

2017-05-10 Thread Gopal Vijayaraghavan
> for the slider 0.92, the patch is already applied, right? Yes, except it has been refactored to a different place. https://github.com/apache/incubator-slider/blob/branches/branch-0.92/slider-agent/src/main/python/agent/NetUtil.py#L44 Cheers, Gopal

Re: question on setting up llap

2017-05-09 Thread Gopal Vijayaraghavan
> NetUtil.py:60 - [Errno 8] _ssl.c:492: EOF occurred in violation of protocol The error is directly related to the SSL verification error - TLSv1.0 vs TLSv1.2. JDK8 defaults to v1.2 and Python 2.6 defaults to v1.0. Python 2.7.9 + the patch in 0.92 might be needed to get this to work. AFAIK,

Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Gopal Vijayaraghavan
Hi, I think this is worth fixing because this seems to be triggered by the data quality itself - so let me dig in a bit into a couple more scenarios. > hive.optimize.distinct.rewrite is True by default FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer now (which

Re: Format dillema

2017-06-20 Thread Gopal Vijayaraghavan
> 1) both do the same thing.  The start of this thread is the exact opposite - trying to suggest ORC is better for storage & wanting to use it. > As it relates the columnar formats, it is silly arms race. I'm not sure "silly" is the operative word - we've lost a lot of fragmentation of the

Re: Hive query on ORC table is really slow compared to Presto

2017-06-22 Thread Gopal Vijayaraghavan
> 1711647 -1032220119 Ok, so this is the hashCode skew issue, probably the one we already know about. https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4 String hashcode distribution is much better in master after that. Hopefully that fixes the distinct speed issue

Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan
> I guess I see different things. Having used all the tech. In particular for > large hive queries I see OOM simply SCANNING THE INPUT of a data directory, > after 20 seconds! If you've got an LLAP deployment you're not happy with - this list is the right place to air your grievances. I

Re: Format dillema

2017-06-22 Thread Gopal Vijayaraghavan
> I kept hearing about vectorization, but later found out it was going to work > if i used ORC. Yes, it's a tautology - if you cared about performance, you'd use ORC, because ORC is the fastest format. And doing performance work to support folks who don't quite care about it, is not exactly

Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan
> It is not that simple. The average Hadoop user has years 6-7 of data. They do > not have a "magic" convert everything button. They also have legacy processes > that don't/can't be converted. … > They do not want the "fastest format" they want "the fastest hive for their > data". I've yet

Re: Migrating Variable Length Files to Hive

2017-06-02 Thread Gopal Vijayaraghavan
> We are looking at migrating files (less than 5 Mb of data in total) with > variable record lengths from a mainframe system to hive. https://issues.apache.org/jira/browse/HIVE-10856 + https://github.com/rbheemana/Cobol-to-Hive/ came up on this list a while back. > Are there other

Re: Hive query on ORC table is really slow compared to Presto

2017-06-14 Thread Gopal Vijayaraghavan
> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds > SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds Ok, I misunderstood your gist. > While ip is more unique that id, ip runs many times faster than id. > > How can I debug this ? Nearly the same way - just replace "ip" with "id" in my

Re: Hive LLAP with Parquet format

2017-05-04 Thread Gopal Vijayaraghavan
Hi, > Does Hive LLAP work with Parquet format as well? LLAP does work with the Parquet format, but it does not work very fast, because the java Parquet reader is slow. https://issues.apache.org/jira/browse/PARQUET-131 + https://issues.apache.org/jira/browse/HIVE-14826 In particular to

Re: group by + two nulls in a row = bug?

2017-06-27 Thread Gopal Vijayaraghavan
>               cast(NULL as bigint) as malone_id, >               cast(NULL as bigint) as zpid, I ran this on master (with text vectorization off) and I get 20170626123 NULL NULL 10. However, I think the backtracking for the columns is broken somewhere - where both the nulls

Re: Hive LLAP service is not starting

2017-09-11 Thread Gopal Vijayaraghavan
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: > /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory) LLAP expects to find a tarball where tez.lib.uris is - looks like you've got a directory? Cheers, Gopal

Re: hive on spark - why is it so hard?

2017-09-26 Thread Gopal Vijayaraghavan
Hi, > org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark > session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create > spark client. I get inexplicable errors with Hive-on-Spark unless I do a three step build. Build Hive first, use that version to

Re: Benchmarking Hive ACID functionality

2017-09-25 Thread Gopal Vijayaraghavan
> Are there any frameworks like TPC-DS to benchmark Hive ACID functionality? Are you trying to work on and improve Hive ACID? I have a few ACID micro-benchmarks like this https://github.com/t3rmin4t0r/acid2x-jmh so that I can test the inner loops of ACID without having any ORC data at all.

Re: Error when running TPCDS query with Hive+LLAP

2017-09-25 Thread Gopal Vijayaraghavan
> Caused by: > org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError: > VectorMapJoin Hash table loading exceeded memory limits. > estimatedMemoryUsage: 1644167752 noconditionalTaskSize: 463667612 > inflationFactor: 2.0 threshold: 927335232 effectiveThreshold: 927335232 Most
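
The threshold in that error presumably tracks hive.auto.convert.join.noconditionaltask.size; a hedged per-query workaround could look like this (the byte value is illustrative):

    set hive.auto.convert.join.noconditionaltask.size=200000000;
    -- or skip the map-join conversion for the problem query entirely:
    set hive.auto.convert.join=false;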

Re: Hive query starts own session for LLAP

2017-09-27 Thread Gopal Vijayaraghavan
> Now we need an explanation of "map" -- can you supply it? The "map" mode runs all tasks with a TableScan operator inside LLAP instances and all other tasks in Tez YARN containers. This is the LLAP + Tez hybrid mode, which introduces some complexity in debugging a single query. The "only"
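
The modes described here correspond to hive.llap.execution.mode; a brief illustration (comments summarize the explanation above):

    set hive.llap.execution.mode=map;   -- TableScan work runs in LLAP, other vertices in Tez containers
    set hive.llap.execution.mode=only;  -- all work runs inside the LLAP daemons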

Re: ORC Transaction Table - Spark

2017-08-24 Thread Gopal Vijayaraghavan
> Or, is this an artifact of an incompatibility between ORC files written by > the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde? > 3. Is there a difference in the ORC file format spec. at play here? Nope, we're still defaulting to hive-0.12 format ORC files in Hive-2.x. We

Re: Hive index + Tez engine = no performance gain?!

2017-08-22 Thread Gopal Vijayaraghavan
TL;DR - A Materialized view is a much more useful construct than trying to get limited indexes to work. That is a pretty lively project which has been going on for a while with Druid+LLAP: https://issues.apache.org/jira/browse/HIVE-14486 > This seems out of the blue but my initial benchmarks

Re: hive window function can only calculate the main table?

2017-10-09 Thread Gopal Vijayaraghavan
> ) t_result where formable = 't1' … > This SQL takes 29+ hours on an 11-computer cluster with 600G memory. > In my opinion, the time is wasted in the `order by sampledate` and `calculate > the table B's record`. Is there a setting to avoid `table B`'s record not to > get 'avg_wfoy_b2' column,

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-27 Thread Gopal Vijayaraghavan
Hi, If you've got the 1st starvation issue fixed (with the Hadoop 2.8 patch), all these configs applied, and log4j2 async logging enabled, you should definitely see a performance improvement. Here are the log patches, which need a corresponding LLAP config (& have to be disabled in HS2, for the progress bar to work)

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-24 Thread Gopal Vijayaraghavan
Hi, > In our test, we found the shuffle stage of LLAP is very slow. Whether need to > configure some related shuffle value or not? Shuffle is the one hit by the 2nd, 3rd and 4th resource starvation issues listed earlier (FDs, somaxconn & DNS UDP packet loss). > And we get the following log

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-22 Thread Gopal Vijayaraghavan
Hi, > With these configurations, the cpu utilization of llap is very low. Low CPU usage has been observed with LLAP due to RPC starvation. I'm going to assume that the build you're testing is a raw Hadoop 2.7.3 with no additional patches? Hadoop-RPC is single-threaded & has a single mutex

Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-21 Thread Gopal Vijayaraghavan
Hi, > Please help us find whether we use the wrong configuration. Thanks for your > help. Since there are no details, I'm not sure what configuration you are discussing here. A first step would be to check if LLAP cache is actually being used (the LLAP IO in the explain), vectorization is

Re: Hive JDBC - Method not Supported

2017-11-01 Thread Gopal Vijayaraghavan
Hi, > org.apache.hive.jdbc.HiveResultSetMetaData.getTableName(HiveResultSetMetaData.java:102) https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java#L102 I don't think this issue is fixed in any release - this probably needs to go into a

Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Gopal Vijayaraghavan
> Why does jdbc read them as control symbols? Most likely this is already fixed by https://issues.apache.org/jira/browse/HIVE-1608 - that pretty much makes the default: set hive.query.result.fileformat=SequenceFile; Cheers, Gopal

Re: MERGE performances issue

2018-05-07 Thread Gopal Vijayaraghavan
> Then I am wondering if the merge statement is impracticable because > of bad use of myself or because this feature is just not mature enough. Since you haven't mentioned a Hive version here, I'm going to assume you're some variant of Hive 1.x & that has some fundamental physical planning

Re: issues with Hive 3 simple sellect from an ORC table

2018-06-08 Thread Gopal Vijayaraghavan
> It is 2.7.3 + > Error: java.io.IOException: java.lang.RuntimeException: ORC split generation > failed with exception: java.lang.NoSuchMethodError: > org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I > (state=,code=0)

Re: Hive storm streaming with s3 file system

2018-06-12 Thread Gopal Vijayaraghavan
> So transactional tables only work with hdfs. Thanks for the confirmation > Elliot. No, that's not what was said. Streaming ingest into transactional tables requires strong filesystem consistency and a flush-to-remote operation (hflush). S3 supports neither of those things and HDFS is not the

Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception

2018-06-25 Thread Gopal Vijayaraghavan
> This is Hadoop 3.0.3 > java.lang.NoSuchMethodError: > org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I > (state=08S01,code=1) > Something is missing here! Is this specific to ORC tables? No, it is a Hadoop BUG. https://issues.apache.org/jira/browse/HADOOP-1468

Re: insert overwrite to hive orc table in aws

2018-05-01 Thread Gopal Vijayaraghavan
> delta_000_000 ... > I am using Glue data catalog as metastore, so should there be any link up to > these tables from hive? That would be why transactions are returning as 0 (there is never a transaction 0), because it is not using a Hive standard metastore. You might not be able to

Re: Hive External Table with Zero Bytes files

2018-04-29 Thread Gopal Vijayaraghavan
> We are copying data from upstream system into our storage S3. As part of > copy, directories along with Zero bytes files are been copied. Is this exactly the same issue as the previous thread or a different one?

Re: In reduce task,i have a join operation ,and i found "org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1" cast much long

2017-10-19 Thread Gopal Vijayaraghavan
> . I didn't see data skew for that reducer. It has similar amount of > REDUCE_INPUT_RECORDS as other reducers. … > org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000 rows for > join key [4092813312923569] The ratio of REDUCE_INPUT_RECORDS and REDUCE_INPUT_GROUPS is what is

Re: Hive performance issue with _ character in query

2018-01-18 Thread Gopal Vijayaraghavan
Hi, > I wanted to understand why hive has a performance issue with using _ > character in queries. This is somewhat of a missed optimization issue - the "%" impl uses a fast BoyerMoore algorithm and avoids converting from utf-8 bytes -> String.
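
A rough illustration on an assumed table logs with a string column msg:

    SELECT COUNT(*) FROM logs WHERE msg LIKE '%error%';    -- '%'-only pattern: the fast BoyerMoore path over utf-8 bytes
    SELECT COUNT(*) FROM logs WHERE msg LIKE '%error_42%'; -- a '_' in the pattern falls back to the slower path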

Re: Question on accessing LLAP as data cache from external containers

2018-02-02 Thread Gopal Vijayaraghavan
> For example, a Hive job may start Tez containers, which then retrieve data > from LLAP running concurrently. In the current implementation, this is > unrealistic. That is how LLAP was built - to push work from Tez to LLAP vertex by vertex, instead of an all-or-nothing implementation. Here

Re: HQL parser internals

2018-02-16 Thread Gopal Vijayaraghavan
> However, ideally we wish to manipulate the original query as delivered by the > user (or as close to it as possible), and we’re finding that the tree has > been modified significantly by the time it hits the hook. That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook - the

Re: Clustering and Large-scale analysis of Hive Queries

2018-08-03 Thread Gopal Vijayaraghavan
> I am interested in working on a project that takes a large number of Hive > queries (as well as their meta data like amount of resources used etc) and > find out common sub queries and expensive query groups etc. This was roughly the central research topic of one of the Hive CBO devs,

Re: Auto Refresh Hive Table Metadata

2018-08-10 Thread Gopal Vijayaraghavan
> By the way, if you want near-real-time tables with Hive, maybe you should > have a look at this project from Uber: https://uber.github.io/hudi/ > I don't know how mature it is yet, but I think it aims at solving that kind > of challenge. Depending on your hive setup, you don't need a

Re: Optimal approach for changing file format of a partitioned table

2018-08-06 Thread Gopal Vijayaraghavan
A hive version would help to preface this, because that matters for this (like TEZ-3709 doesn't apply for hive-1.2). > I’m trying to simply change the format of a very large partitioned table from > Json to ORC. I’m finding that it is unexpectedly resource intensive, > primarily due to a

Re: Improve performance of Analyze table compute statistics

2018-08-28 Thread Gopal Vijayaraghavan
> Will it be referring to orc metadata or it will be loading the whole file and > then counting the rows. Depends on the partial-scan setting or if it is computing full column stats (the full column stats does an nDV, which reads all rows). hive> analyze table compute statistics ...
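
For example, on an assumed table sales:

    ANALYZE TABLE sales COMPUTE STATISTICS;             -- basic stats; may be served from file/ORC metadata depending on the partial-scan setting
    ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS; -- column stats: computes nDV, so every row is read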

Re: Hive generating different DAGs from the same query

2018-07-19 Thread Gopal Vijayaraghavan
> My conclusion is that a query can update some internal states of HiveServer2, > affecting DAG generation for subsequent queries. Other than the automatic reoptimization feature, there's two other potential suspects. First one would be to disable the in-memory stats cache's variance param,

Re: Cannot INSERT OVERWRITE on clustered table with > 8 buckets

2018-07-14 Thread Gopal Vijayaraghavan
> Or a simple insert will be automatically sorted as the table DDL mentions? A simple insert should do the sorting; older versions of Hive had the ability to disable that (which is a bad thing & therefore these settings are now just hard-configed to =true in Hive3.x) -- set
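
The settings being referred to are presumably hive.enforce.bucketing / hive.enforce.sorting; an illustrative insert into a hypothetical bucketed, sorted table:

    -- only needed on older releases; newer Hive treats these as always true
    set hive.enforce.bucketing=true;
    set hive.enforce.sorting=true;
    INSERT OVERWRITE TABLE bucketed_tbl PARTITION (dt='2018-07-14')
    SELECT id, name FROM staging_tbl;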

Re: Using snappy compresscodec in hive

2018-07-23 Thread Gopal Vijayaraghavan
> "TBLPROPERTIES ("orc.compress"="Snappy"); " That doesn't use the Hadoop SnappyCodec, but uses a pure-java version (which is slower, but always works). The Hadoop snappyCodec needs libsnappy installed on all hosts. Cheers, Gopal

Re: Total length of orc clustered table is always 2^31 in TezSplitGrouper

2018-07-24 Thread Gopal Vijayaraghavan
> Search ’Total length’ in log sys_dag_xxx, it is 2147483648. This is the INT_MAX “placeholder” value for uncompacted ACID tables. This is because with ACIDv1 there is no way to generate splits against uncompacted files, so this gets “an empty bucket + unknown number of inserts + updates”

Re: Not able to read Hive ACID table data created by Hive 2.1.1 in hive 2.3.3

2018-09-06 Thread Gopal Vijayaraghavan
> msck repair table ; msck repair does not work on ACID tables. In Hive 2.x, there is no way to move, replicate or rehydrate ACID tables from a cold store - the only way it works is if you connect to the old metastore. Cheers, Gopal

Re: Queries to custom serde return 'NULL' until hiveserver2 restart

2018-09-10 Thread Gopal Vijayaraghavan
>query the external table using HiveCLI (e.g. SELECT * FROM >my_external_table), HiveCLI prints out a table with the correct If the error is always on a "select *", then the issue might be the SerDe's handling of included columns. Check what you get for colNames =

Re: Problem in reading parquet data from 2 different sources(Hive + Glue) using hive tables

2018-08-29 Thread Gopal Vijayaraghavan
> Because I believe string should be able to handle integer as well.  No, because it is not a lossless conversion. Comparisons are lost. "9" > "11", but 9 < 11. Even float -> double is lossy (because of epsilon). You can always apply the Hive workaround suggested, otherwise you might find
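
The comparison difference mentioned above, runnable directly as literals:

    SELECT '9' > '11';  -- true: lexicographic string comparison
    SELECT 9 > 11;      -- false: numeric comparison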

Re: Problem in reading parquet data from 2 different sources(Hive + Glue) using hive tables

2018-08-29 Thread Gopal Vijayaraghavan
Hi, > on some days parquet was created by hive 2.1.1 and on some days it was > created by using glue … > After some drill down i saw schema of columns inside both type of parquet > file using parquet tool and found different data types for some column ... > optional int32 action_date (DATE); >

Re: Cannot INSERT OVERWRITE on clustered table with > 8 buckets

2018-07-13 Thread Gopal Vijayaraghavan
> I'm using Hive 1.2.1 with LLAP on HDP 2.6.5. Tez AM is 3GB, there are 3 > daemons for a total of 34816 MB. Assuming you're using Hive2 here (with LLAP) and LLAP kinda sucks for ETL workloads, but this is a different problem. > PARTITIONED BY (DATAPASSAGGIO string, ORAPASSAGGIO string) >

Re: Changing compression format of existing table from snappy to zlib

2018-03-14 Thread Gopal Vijayaraghavan
Hi, > Would this also ensure that all the existing data compressed in snappy format > and the new data stored in zlib format can work in tandem with no disruptions > or issues to end users who query the table. Yes. Each file encodes its own compressor kind & readers use that. The writers
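
In other words, something like the following switches the codec for newly written files only (table name assumed):

    ALTER TABLE events SET TBLPROPERTIES ("orc.compress"="ZLIB");
    -- files already written with Snappy keep their own compression kind and remain readable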

Re: Best way/tool to debug memory leaks in HiveServer2

2018-03-13 Thread Gopal Vijayaraghavan
> It also shows that the process is consuming more than 30GB. However, it is > not clear what is causing the process to consume more than 30GB. The Xmx only applies to the heap size, there's another factor that is usually ignored which are the network buffers and compression buffers used by

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-04 Thread Gopal Vijayaraghavan
… so they're asking "where is the Hive bucketing spec". Is it just to read the code for that function? They were looking for something more explicit, I think. Thanks - Original Message - From: "Gopal Vijayaraghavan" <gop...@apache

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-02 Thread Gopal Vijayaraghavan
There's more here than Bucketing or Tez. > PARTITIONED BY(daydate STRING, epoch BIGINT) > CLUSTERED BY(r_crs_id) INTO 64 BUCKETS I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp. CLUSTERED BY does not CLUSTER BY, which it should (but it

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Gopal Vijayaraghavan
>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. > My understanding was that this was related to the number of buckets, but you > are relating it to ORC stripes. It is odd that no examples that I've seen > include the SORTED BY statement other than in relation to
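
A sketch of the DDL shape under discussion, with the SORTED BY clause that the examples the poster saw tend to omit (column names reuse the ones quoted earlier; the rest is illustrative):

    CREATE TABLE fact_table (r_crs_id BIGINT, metric DOUBLE)
    PARTITIONED BY (daydate STRING)
    CLUSTERED BY (r_crs_id) SORTED BY (r_crs_id) INTO 64 BUCKETS
    STORED AS ORC;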

Re: Hive 1.2.1 (HDP) ArrayIndexOutOfBounds for highly compressed ORC files

2018-02-26 Thread Gopal Vijayaraghavan
Hi, > Caused by: java.lang.ArrayIndexOutOfBoundsException > at > org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453) In general HDP specific issues tend to get more attention on HCC, but this is a pretty old issue stemming from MapReduce being designed for fairly
