Re: Gather Partition Locations

2019-11-11 Thread Gopal Vijayaraghavan
Hi,

> I have a question about how to get the location for a bunch of partitions.
...
> But in an enterprise environment I'm pretty sure this approach would not be
> the best because the RDS (mysql or derby) is maybe not reachable or
> I don't have the permission to it.

That was the reason Hive shipped with metatool, though it remains fairly 
obscure outside of the devs.

hive --service metatool -executeJDOQL "select database.name + '.' + tableName 
from org.apache.hadoop.hive.metastore.model.MTable"

You need to join MPartition -> MStorageDescriptor on sd to get the locations 
out. 
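
For the locations themselves, something along these lines should work - a sketch only, 
assuming the MPartition model exposes its storage descriptor as a field named "sd" (that 
is how the join is expressed in JDOQL):

hive --service metatool -executeJDOQL "select partitionName, sd.location 
from org.apache.hadoop.hive.metastore.model.MPartition"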

Cheers,
Gopal






Re: Error: java.io.IOException: java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError

2019-07-19 Thread Gopal Vijayaraghavan
Hi,

> java.lang.NoSuchMethodError: 
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I 
> (state=,code=0)

Are you rolling your own Hadoop install?

https://issues.apache.org/jira/browse/HADOOP-14683

Cheers,
Gopal




Re: Predicate Push Down Vs On Clause

2019-04-28 Thread Gopal Vijayaraghavan
> Yes both of these are valid ways of filtering data before join in Hive.

This has several implementation specifics attached to it. If you're looking at 
Hive 1.1 or before, it might not work the same way as Vineet mentioned.

In older versions, Calcite rewrites weren't triggered, which prevented some of 
the PPD rewrites.

This became massively better after the pre-join transforms in Calcite were made 
the default, even without statistics.

I think this actually had a number of fixes in Hive 1.2.1, which also did 
column pruning through the join (i.e. the condition column is no longer part of 
the join values).

In Hive 3+, the column pruning is replaced by substitution, so you won't send the 
whole column through, but evaluate it to a boolean before joining (see TPC-DS 
query2).

> As long as the join is not outer and the ON condition is not on non-null 
> generating side of join Hive planner will try to push the predicate down to 
> table scan.

Since I learned it the hard way a few years back, here's an example from 
TPC-DS where the WHERE clause does something different from the ON clause.

https://github.com/apache/hive/blob/master/ql/src/test/queries/clientpositive/perf/query93.q

If the reason condition were in the ON clause, then the join would produce different 
results. Because it is in the WHERE clause, the 'Did not like the warranty' 
condition changes the join type during optimization.
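
As a simplified sketch of that difference (not the actual query93 text, just the TPC-DS 
store_returns/reason tables):

-- filter in the ON clause: returns with no matching reason still survive, NULL-extended
select sr.sr_item_sk, r.r_reason_desc
from store_returns sr
left outer join reason r
  on sr.sr_reason_sk = r.r_reason_sk
 and r.r_reason_desc = 'Did not like the warranty';

-- filter in the WHERE clause: the NULL-extended rows are discarded,
-- so the optimizer can legally turn the outer join into an inner join
select sr.sr_item_sk, r.r_reason_desc
from store_returns sr
left outer join reason r
  on sr.sr_reason_sk = r.r_reason_sk
where r.r_reason_desc = 'Did not like the warranty';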

Cheers,
Gopal




Re: Hive on Tez vs Impala

2019-04-22 Thread Gopal Vijayaraghavan


> I wish the Hive team to keep things more backward-compatible as well. Hive is 
> such an enormous system with a wide-spread impact so any 
> backward-incompatible change could cause an uproar in the community.

The incompatibilities were not avoidable in a number of situations - a lot of 
those features already existed in Hive 2, but were hidden away or deliberately 
disabled; turning them on is what makes Hive 3 what it is.

Here's a quick run-down of how the incompatibilities at the table level allow an 
end user to run more SQL:

https://www.slideshare.net/dbist/hive-3-a-new-horizon/10

The incompatibilities form the foundation for something like "How do I have 
Kafka streams offloaded to S3 cold data stores, but still query down to the 
last second without the small file problem?".

Cheers,
Gopal




Re: Hive on Tez vs Impala

2019-04-15 Thread Gopal Vijayaraghavan


Hi,

>> However, we have built Tez on CDH and it runs just fine.

Down that path you'll also need to deploy a slightly newer version of Hive, 
because Hive 1.1 is a bit ancient & has known bugs in the Tez planner code.

You effectively end up building the hortonworks/hive-release builds, by undoing 
the non-htrace tracing impl, applying the htrace one back, etc.

> Lol. I was hoping that the merger would unblock the "saltyness". 

Historically, I've unofficially supported folks using Tez on CDH in prod 
(assuming they buy me enough coffee), though I might have to discontinue that.

https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/vendor-repos.xml#L11

Cheers,
Gopal




Re: out of memory using Union operator and array column type

2019-03-11 Thread Gopal Vijayaraghavan


> I'll try the simplest query I can reduce it to  with loads of memory and see 
> if that gets anywhere. Other pointers are much appreciated.

Looks like something I'm testing right now (to make the memory setting 
cost-based).

https://issues.apache.org/jira/browse/HIVE-21399

A less "cost-based" solution would be to do 

set hive.map.aggr=false;

And disable the feature.

Cheers,
Gopal




Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan
Hi,

That looks like the TopN hash optimization didn't kick in; that must be a 
settings issue in the install.

|   Reduce Output Operator   |
| key expressions: _col0 (type: string) |
| sort order: +  |
| Statistics: Num rows: 1 Data size: 762813939712 Basic stats: PARTIAL Column stats: NONE |

https://github.com/apache/hive/commit/265ae7b4f81ec7cf19c6f0b59a13a3e7dfb942e4#diff-ea752552821a2ae5f3a33c6db210ef0a

I'd check if the configs for that are set up for you.
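
For reference, this is the knob that enables the TopN pushdown into the ReduceSink (a 
sketch - the exact default depends on your build):

set hive.limit.pushdown.memory.usage=0.1;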

Cheers,
Gopal






Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan


>I am running an older version of Hive on MR. Does it have it too?

Hard to tell without an explain.

AFAIK, this was fixed in Aug 2013 - how old is your build?

Cheers,
Gopal






Re: Hive Order By Question

2019-02-06 Thread Gopal Vijayaraghavan


> I expect the maps to do some sorting and limiting in parallel. That way the 
> reducer load would be small. I don’t think it does that. Can you tell me why? 

They do.

Which version are you running, is it Tez and do you have an explain for the 
plan?

Cheers,
Gopal  




Re: Out Of Memory Error

2019-01-10 Thread Gopal Vijayaraghavan
>   ,row_number() over ( PARTITION BY A.dt, A.year, A.month, A.bouncer, A.visitor_type, A.device_type order by A.total_page_view_time desc ) as rank 
> from content_pages_agg_by_month A

The row_number() window function is a streaming function, so this operation should not 
consume a significant amount of memory.

I suspect there's some issue with the build of Hive you are using that is keeping it 
from using less memory, but I can't be sure.

While the query is running, take a jstack of one of the TezChild instances and 
then you can possibly file a bug with your vendor & get a patch for the problem.

This particular function was improved significantly in Hive 3.0 by vectorizing 
the implementation natively:

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/ptf/VectorPTFEvaluatorRowNumber.java#L29
 
Cheers,
Gopal 



Re: hive 3.1 mapjoin with complex predicate produce incorrect results

2018-12-22 Thread Gopal Vijayaraghavan
Hi,

> Subject: Re: hive 3.1 mapjoin with complex predicate produce incorrect results
...
> |                         0 if(_col0 is null, 44, _col0) (type: int) |
> |                         1 _col0 (type: int)        |

That rewrite is pretty neat, but I feel like the IF expression nesting is what 
is broken here.

Can you run the same query with "set 
hive.vectorized.reuse.scratch.columns=false;" and see if this is a join 
expression column reuse problem.

If that does work, can you send out a 

explain vectorization detail ;

I'll eventually get back to my dev env in a week, but this looks like a 
low-level exec issue right now.

Cheers,
Gopal




Re: HiveServer2 performance references?

2018-10-15 Thread Gopal Vijayaraghavan
Hi,


> I was looking at HiveServer2 performance going through Knox in KNOX-1524 and 
> found that HTTP mode is significantly slower.


The HTTP mode did re-auth for every row before HIVE-20621 was fixed – Knox 
should be doing cookie-auth to prevent ActiveDirectory/LDAP from throttling 
this.

I fixed it for the open-source JDBC drivers, but I think the proprietary ODBC 
drivers are still broken (the C++ code doesn’t save cookies).

> I also didn't see any information about reducing amount of data over the wire 
> with compression.

It’s possible Knox doesn’t generate/forward the Accept-Encoding headers.

 

https://issues.apache.org/jira/browse/HIVE-17194


Cheers,

Gopal

 

From: Prasanth Jayachandran 
Reply-To: "user@hive.apache.org" 
Date: Monday, October 15, 2018 at 12:34 PM
To: "user@hive.apache.org" , "user@hive.apache.org" 

Subject: Re: HiveServer2 performance references?

 

One performance fix that is fixed recently and is related is HIVE-20621 

 

Thanks

Prasanth



On Mon, Oct 15, 2018 at 12:11 PM -0700, "Kevin Risden"  
wrote:

Are there any resources on HiveServer2 performance - specifically binary vs 
HTTP mode? I would also be interested in any information about enabling 
compression between HiveServer2 and a client like beeline.

 

I was looking at HiveServer2 performance going through Knox in KNOX-1524 and 
found that HTTP mode is significantly slower. I searched and didn't find any 
references to performance of HiveServer2 out of the box. I also didn't see any 
information about reducing amount of data over the wire with compression.

 

Any pointers would be very helpful. Thanks!

 

Kevin Risden



Re: [feature request] auto-increment field in Hive

2018-09-15 Thread Gopal Vijayaraghavan
Hi,

> It doesn't help if you need concurrent threads writing to a table but we are 
> just using the row_number analytic and a max value subquery to generate 
> sequences on our star schema warehouse.

Yup, you're right, the row_number() doesn't help with concurrent writes - it 
doesn't even scale beyond a single CPU core.

Because it is a dense sequence, there's a 1-reducer chokepoint for that OVER() 
clause, and it shows up as a significant performance issue as the total number 
of rows goes up.

However, that is good enough if you are doing it once and for a 
very small number of rows.
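
For reference, a minimal sketch of that row_number() + max-value pattern (hypothetical 
table and column names, not from this thread):

insert into table dim_customer
select
  s.name,
  s.country,
  row_number() over () + coalesce(m.max_sk, 0) as customer_sk
from staging_customer s
cross join (select max(customer_sk) as max_sk from dim_customer) m;

The empty OVER() is exactly the single-reducer chokepoint described above.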

> It has worked pretty well so far. To provide true sequence support would 
> require changes on the hive meta database side as well as locking so nothing 
> has been done on it in a long time

Since we added a sequence + locking in Hive ACID, there's a Surrogate Key 
prototype (built for Hive 3.0) which uses the Txn manager sequence to 
seed concurrent sequence numbering.

This should work for any format (Parquet, for instance), because all data 
formats can be made transactional in Hive 3 (though update/delete remains to be 
implemented for Parquet).

Here's a video describing that from one of the recent meetups discussing Hive 
3.x (around the 56minute mark).

https://www.youtube.com/watch?v=b_KUVK7Lq14#t=56m
 
This is not an auto_increment key, but the numbering is forward ordered.

This is mostly useful to turn a complex join type into a simple equi-join (also 
PK-FK, which CBO can optimize with), when offloading an OLTP system like Oracle 
into a historical Hive EDW system.

So, if you have something like the same business key with historical 
information (like a UPC with a different price based on day of sale/store 
location, or say a EUR:USD conversion rate by day), you can generate a surrogate 
key for each change and turn a query which has "item_id = item_id and 
sale_date between start_price_date and end_price_date and sale_location = 
price_state" into a simple = query on item_price_sk.

The dimension table does not grow in size due to this transformation, but it 
acquires a primary key column which is auto-generated on INSERT.
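
A hedged sketch of what that looks like on such a dimension table - the DEFAULT 
SURROGATE_KEY() shown here is the prototype UDF mentioned above, so treat the exact 
syntax and availability as build-dependent:

create table item_price_dim (
  item_price_sk bigint default surrogate_key(),
  item_id bigint,
  start_price_date date,
  end_price_date date,
  price_state string,
  price decimal(10,2)
)
stored as orc
tblproperties ('transactional'='true');

-- the surrogate key column is generated automatically on insert
insert into item_price_dim (item_id, start_price_date, end_price_date, price_state, price)
values (42, date '2018-01-01', date '2018-06-30', 'CA', 19.99);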

Cheers,
Gopal




Re: UDFClassLoader isolation leaking

2018-09-13 Thread Gopal Vijayaraghavan
Hi,

> Hopefully someone can tell me if this is a bug, expected behavior, or 
> something I'm causing myself :)

I don't think this is expected behaviour, but I'm still looking into where the 
bug is.

>  We have a custom StorageHandler that we're updating from Hive 1.2.1 to Hive 
> 3.0.0.  

Most likely this bug existed in Hive 1.2 as well, but the FetchTask conversion 
did not happen for these queries.

I'll probably test out your SerDe tomorrow, but I have two target cases to look 
into right now.

The first one is that this is related to a different issue I noticed with 
Hadoop-Common code (i.e a direct leak).

https://issues.apache.org/jira/browse/HADOOP-10513

The second one is that this is only broken with the Local FetchTask (which gets 
triggered when you run "select ... limit n").

> SELECT * FROM my_ext_table;

So to test those theories, I recommend trying out

set hive.fetch.task.conversion=none;

and running the same query so that the old Hive1 codepaths for reading from the 
SerDe get triggered.

Cheers,
Gopal




Re: Queries to custom serde return 'NULL' until hiveserver2 restart

2018-09-10 Thread Gopal Vijayaraghavan
>query the external table using HiveCLI (e.g. SELECT * FROM
>my_external_table), HiveCLI prints out a table with the correct

If the error is always on a "select *", then the issue might be the SerDe's 
handling of included columns.

Check what you get for  

colNames = Arrays.asList(tblProperties.getProperty(serdeConstants.LIST_COLUMNS).split(","));

Or to confirm it, try doing "Select col from table" instead of "*".

Cheers,
Gopal




Re: Not able to read Hive ACID table data created by Hive 2.1.1 in hive 2.3.3

2018-09-06 Thread Gopal Vijayaraghavan
> msck repair table ;

msck repair does not work on ACID tables.

In Hive 2.x, there is no way to move, replicate or rehydrate ACID tables from a 
cold store - the only way it works is if you connect to the old metastore.

Cheers,
Gopal




Re: Problem in reading parquet data from 2 different sources(Hive + Glue) using hive tables

2018-08-29 Thread Gopal Vijayaraghavan
> Because I believe string should be able to handle integer as well. 

No, because it is not a lossless conversion. Comparisons are lost.

"9" > "11", but 9 < 11

Even float -> double is lossy (because of epsilon).

You can always apply the Hive workaround suggested; otherwise you might find 
more information on the Parquet mailing lists.

Cheers,
Gopal




Re: Problem in reading parquet data from 2 different sources(Hive + Glue) using hive tables

2018-08-29 Thread Gopal Vijayaraghavan
Hi,

> on some days parquet was created by hive 2.1.1 and on some days it was 
> created by using glue
…
> After some drill down i saw schema of columns inside both type of parquet 
> file using parquet tool and found different data types for some column
...
> optional int32 action_date (DATE);
> optional binary action_date (UTF8);

Those two column types aren't convertible implicitly between each other, which 
is probably the problem.

You'll have to separate the files into different directories, create two 
external tables with the appropriate schemas, and write a view 
to do a UNION ALL.
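
A minimal sketch of that layout (hypothetical table names, locations and a single column - 
the two action_date types match the Parquet schemas quoted above):

create external table events_hive_written (action_date date)
stored as parquet location '/data/events/hive_written';

create external table events_glue_written (action_date string)
stored as parquet location '/data/events/glue_written';

create view events_all as
select action_date from events_hive_written
union all
select cast(action_date as date) as action_date from events_glue_written;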

If your goal is to have good performance for Hive, it is best to write the data 
from Hive & maintain the schema on write.

Cheers,
Gopal




Re: Improve performance of Analyze table compute statistics

2018-08-28 Thread Gopal Vijayaraghavan


> Will it be referring to orc metadata or it will be loading the whole file and 
> then counting the rows.

Depends on the partial-scan setting or whether it is computing full column stats 
(full column stats compute an NDV, which reads all rows).

hive> analyze table compute statistics ... partialscan;

https://issues.apache.org/jira/browse/HIVE-4177

AFAIK, this got removed in Hive 3.x (because we really want to autogather column 
stats on insert, not just the basic stats from this).
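
For reference, the insert-time autogather behaviour is driven by these settings (names as 
in recent releases - treat as a sketch):

set hive.stats.autogather=true;           -- basic stats (row counts, sizes) on insert
set hive.stats.column.autogather=true;    -- column stats (NDV etc.) on insert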

> Is there any place to cache this information so that I don't need to scan all 
> the files every time.

https://cwiki.apache.org/confluence/display/Hive/LLAP

Cheers,
Gopal




Re: Auto Refresh Hive Table Metadata

2018-08-10 Thread Gopal Vijayaraghavan


> By the way, if you want near-real-time tables with Hive, maybe you should 
> have a look at this project from Uber: https://uber.github.io/hudi/
> I don't know how mature it is yet, but I think it aims at solving that kind 
> of challenge.

Depending on your hive setup, you don't need a different backend to do 
near-real-time tables.

https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest

Prasanth has a benchmark for Hive 3.x, which is limited by HDFS bandwidth at 
the moment with 64 threads.

https://github.com/prasanthj/culvert

$ ./culvert -u thrift://localhost:9183 -db testing -table culvert -p 64 -n 
10
Total rows committed: 9210
Throughput: 1535000 rows/second

Cheers,
Gopal




Re: Optimal approach for changing file format of a partitioned table

2018-08-06 Thread Gopal Vijayaraghavan


A Hive version would help to preface this, because that matters here (e.g. 
TEZ-3709 doesn't apply to hive-1.2).

> I’m trying to simply change the format of a very large partitioned table from 
> Json to ORC. I’m finding that it is unexpectedly resource intensive, 
> primarily due to a shuffle phase with the partition key. I end up running out 
> of disk space in what looks like a spill to disk in the reducers. 

The "shuffle phase with the partition key" sounds like you have the dynamic 
sort partition enabled, which is necessary to avoid OOMs on the writer due to 
split generation complications (as you'll see below).

> However, the partitioning scheme is identical on both the source and the 
> destination so my expectation is a map only job that simply rencodes each 
> file.

That would've been nice to have, except the split generation in MR InputFormats 
will use the locality of files & split a single partition into multiple splits, 
then recombine them by hostname - so the splits aren't aligned along partitions.

> However, can anyone advise a more resource and time efficient approach?

If you don't have enough scratch space to store the same data 2x (at a 
minimum - the shuffle merge has a complete spill for every 100 inputs), it 
might be helpful to do this as separate jobs (i.e. relaunch AMs) so that you can 
delete all the scratch storage between the partitions.

The usual chunk size I use is around 30Tb per insert (this corresponds to 7 
years in my warehouse tests).

I have for loop scripts which go over the data & generate chunked insert 
scripts, but they are somewhat trivial to write for a different use-case.
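
A hedged sketch of what one such chunked insert looks like (hypothetical table and 
partition column names; run one chunk per job and relaunch the AM between chunks):

set hive.exec.dynamic.partition.mode=nonstrict;
set hive.optimize.sort.dynamic.partition=true;

insert overwrite table events_orc partition (dt)
select col1, col2, dt
from events_json
where dt >= '2010-01-01' and dt < '2011-01-01';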

The scratch-space issue is actually tied to some assumptions in this codepath 
(all the way from 2007), which optimizes for shuffle via a spinning disk, for 
the spill + merge (to cut down on IOPS). I hope I can rewrite it entirely with 
something like Apache Crail (to take advantage of NVRAM+RDMA) once there's no 
need for compatibility with spinning disks.

However, most of the next set of optimizations require a closer inspection of 
the counters from the task, cluster size and total data-size.

Cheers,
Gopal




Re: Clustering and Large-scale analysis of Hive Queries

2018-08-03 Thread Gopal Vijayaraghavan


> I am interested in working on a project that takes a large number of Hive 
> queries (as well as their meta data like amount of resources used etc) and 
> find out common sub queries and expensive query groups etc.

This was roughly the central research topic of one of the Hive CBO devs, except 
it was implemented for Pig (not Hive).

https://hal.inria.fr/hal-01353891
+
https://github.com/jcamachor/pigreuse

I think there's a lot of interest in this topic for ETL workloads and the goal 
is to pick this up as ETL becomes the target problem.

There's a recent SIGMOD paper which talks about the same sort of reuse.

https://www.microsoft.com/en-us/research/uploads/prod/2018/03/cloudviews-sigmod2018.pdf

If you are interested in looking into this using existing infra in Hive, I 
recommend looking at Zoltan's recent work which tracks query plans + runtime 
statistics from the RUNTIME_STATS table in the metastore.

You can debug through what this does by doing

"explain reoptimization  ;"

Cheers,
Gopal




Re: Total length of orc clustered table is always 2^31 in TezSplitGrouper

2018-07-24 Thread Gopal Vijayaraghavan
> Search ’Total length’ in log sys_dag_xxx, it is 2147483648.


This is the INT_MAX “placeholder” value for uncompacted ACID tables.

This is because with ACIDv1 there is no way to generate splits against 
uncompacted files, so this gets “an empty bucket + unknown number of inserts + 
updates” placeholder value.


Cheers,

Gopal



Re: Using snappy compresscodec in hive

2018-07-23 Thread Gopal Vijayaraghavan


> "TBLPROPERTIES ("orc.compress"="Snappy"); " 

That doesn't use the Hadoop SnappyCodec, but uses a pure-java version (which is 
slower, but always works).

The Hadoop snappyCodec needs libsnappy installed on all hosts.

Cheers,
Gopal





Re: Hive generating different DAGs from the same query

2018-07-19 Thread Gopal Vijayaraghavan
> My conclusion is that a query can update some internal states of HiveServer2, 
> affecting DAG generation for subsequent queries. 

Other than the automatic reoptimization feature, there's two other potential 
suspects.

First one would be to disable the in-memory stats cache's variance param, which 
might be triggering some residual effects.

hive.metastore.aggregate.stats.cache.max.variance

I set it to 0.0 when I suspect that feature is messing with the runtime plans 
or just disable the cache entirely with

set hive.metastore.aggregate.stats.cache.enabled=false;

Other than that, query24 is an interesting query.

It is probably one of the corner cases where the predicate push-down is actually 
hurting the shared work optimizer.

Also cross-check whether you have accidentally loaded store_sales with 
ss_item_sk(int) while the item i_item_sk is a bigint (type mismatches will 
trigger a slow join algorithm, but without any consistency issues).
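
If that does turn out to be the mismatch, a hedged workaround until the DDL is fixed is to 
cast the narrower side explicitly so the join keys agree:

select i.i_item_id, ss.ss_quantity
from store_sales ss
join item i
  on cast(ss.ss_item_sk as bigint) = i.i_item_sk;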

Cheers,
Gopal




Re: Cannot INSERT OVERWRITE on clustered table with > 8 buckets

2018-07-14 Thread Gopal Vijayaraghavan



> Or a simple insert will be automatically sorted as the table DDL mention?

A simple insert should do the sorting; older versions of Hive had the ability to 
disable that (which is a bad thing & therefore these settings are now just 
hard-coded to =true in Hive 3.x):

-- set hive.enforce.bucketing=true;
-- set hive.enforce.sorting=true; 

It will pick 8 reducers as the default count, which might not work for the # 
of partitions you have.

set hive.optimize.sort.dynamic.partition=true;

is what was used to fix these sort of reducer count issues when you are using 
bucketing + partitioning on a table (using bucketing without partitioning 
doesn't need that).

With every test run I end up inserting 3Tb or so into 2500 partitions using 
these settings.

https://github.com/hortonworks/hive-testbench/blob/hdp3/settings/load-partitioned.sql

Cheers,
Gopal





Re: Cannot INSERT OVERWRITE on clustered table with > 8 buckets

2018-07-13 Thread Gopal Vijayaraghavan


> I'm using Hive 1.2.1 with LLAP on HDP 2.6.5. Tez AM is 3GB, there are 3 
> daemons for a total of 34816 MB.

Assuming you're using Hive 2 here (with LLAP) - LLAP kinda sucks for ETL 
workloads, but that is a different problem.

> PARTITIONED BY (DATAPASSAGGIO string, ORAPASSAGGIO string)
> CLUSTERED BY (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR) INTO 8 BUCKETS 
> STORED AS ORC
...
> Total number of partitions is 137k.

20Gb divided by 137k makes for very poorly written ORC files, because I'd guess 
each file has too few rows (it will be much smaller than 1 HDFS block) - 
partitioning this fine is actually a compile-time performance problem.

You can make this insert work by changing the insert shuffle mechanism (run an 
explain with/without to see the difference).

set hive.optimize.sort.dynamic.partition=true; -- 
https://issues.apache.org/jira/browse/HIVE-6455

But I suspect you will be very disappointed by the performance of the read 
queries after this insert.

>  ,NPOOLNR decimal(4,0)
> ,NZUTRNR decimal(3,0)
> ,NKASSANR decimal(3,0)
> ,ID_TICKETTYPE decimal(5,0)
> ,ID_PERSONTYPE decimal(6,0)
> ,ID_TICKETPERSONTYPEDEF decimal(6,0)

That's also going to hurt - your schema raises a lot of red flags that I find 
in schemas from people who have just migrated to Hive.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/

In general, you need to fix the partition count, the bucketing structure (note 
that CLUSTERED BY does not "cluster" on its own; you need an additional "sorted 
by"), and the zero-scale decimals.
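
A hedged sketch of the direction those fixes point in (a subset of the columns above, with 
integer types instead of zero-scale decimals, coarser partitioning and an explicit SORTED BY):

create table tickets (
  id_tickettype int,
  id_persontype int,
  npoolnr       int,
  nkassanr      int,
  orapassaggio  string
)
partitioned by (datapassaggio string)
clustered by (id_tickettype, id_persontype, npoolnr, nkassanr)
sorted by (id_tickettype, id_persontype)
into 8 buckets
stored as orc;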

Can you try running with (& see what your query read-perf looks like)

https://gist.github.com/t3rmin4t0r/087b61f79514673c307bb9a88327a4db

Cheers,
Gopal




Re: Hive LLAP Macro and Window Function

2018-06-27 Thread Gopal Vijayaraghavan
> When LLAP Execution Mode is set to 'only' you can't have a macro and window 
> function in the same select statement. 

The "only" part isn't enforced for the simple select query, but is enforced for 
the complex one (the PTF one).

> select col_1, col_2 from macro_bug where otrim(col_1) is not null;

That becomes a non-LLAP case, because macro_bug is so small - to trigger this 
issue for that query, force it through the LLAP decider.

set hive.fetch.task.conversion=none;

I know that's not entirely helpful here, but the point is that the macros are 
not expanded in Hive at compile-time.

  Filter Operator
predicate: otrim(col_1) is not null (type: boolean)

is what I get in the plan, which LLAP recognizes as a "temporary function".

The actual expansion is actually evaluated at runtime by GenericUDFMacro, not 
expanded by the compiler.

If you're 100% sure the temporary function will work, you can take off the 
safety rails and try with 

set hive.llap.skip.compile.udf.check=true;

which will just let LLAP plan deserialization figure out if it has the necessary 
classes (or fail) & not care about the "name" of the UDF.

Cheers,
Gopal





Re: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split generation failed with exception

2018-06-25 Thread Gopal Vijayaraghavan
> This is Hadoop 3.0.3
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I 
> (state=08S01,code=1)
> Something is missing here! Is this specific to ORC tables?

No, it is a Hadoop BUG.

https://issues.apache.org/jira/browse/HADOOP-14683

Fixed in Hadoop 2.8.2+ & most likely this is happening because you have built 
something against Hadoop 2.7.3 (Parquet and ORC default to 2.7.3, so you 
need to rebuild them against Hadoop 3).

Cheers,
Gopal




Re: Hive storm streaming with s3 file system

2018-06-12 Thread Gopal Vijayaraghavan


> So transactional tables only work with hdfs. Thanks for the confirmation 
> Elliot.

No, that's not what was said.

Streaming ingest into transactional tables requires strong filesystem 
consistency and a flush-to-remote operation (hflush).

S3 supports neither of those things and HDFS is not the only filesystem which 
has both those features.

For example, see AdlFsOutputStream.java#L38.

Cheers,
Gopal






Re: issues with Hive 3 simple select from an ORC table

2018-06-08 Thread Gopal Vijayaraghavan


> It is 2.7.3
+
> Error: java.io.IOException: java.lang.RuntimeException: ORC split generation 
> failed with exception: java.lang.NoSuchMethodError: 
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I 
> (state=,code=0)

https://issues.apache.org/jira/browse/HADOOP-14683

Cheers,
Gopal 





Re: MERGE performances issue

2018-05-07 Thread Gopal Vijayaraghavan
> Then I am wondering if the merge statement is impracticable because 
> of bad use of myself or because this feature is just not mature enough.

Since you haven't mentioned a Hive version here, I'm going to assume you're on 
some variant of Hive 1.x, which has some fundamental physical planning issues 
that make an UPDATE + INSERT faster than an UPSERT.

This is because an UPDATE uses an inner join which is rotated around so that 
the smaller table can always be the hash table side, while UPSERT requires a 
LEFT OUTER where the join scales poorly when the big table side is the target 
table for merge (which is your case).
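
For reference, a minimal sketch of the UPSERT shape being discussed (hypothetical table 
and column names); a single MERGE like this plans as that outer join, with the large 
target table on the awkward side:

merge into target_acid as t
using staging as s
  on t.id = s.id
when matched then update set val = s.val
when not matched then insert values (s.id, s.val);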

I recommend you run "explain " and see the physical plan for the merge 
you're running (90% sure you have a shuffle join without vectorization).

However, tackling that directly is a bit hard, because ACIDv1 did not allow 
for easy predicate push-down when the table had pending UPDATEs (i.e. you 
couldn't skip rows in deltas when reading them).

So the 1st major thing that went into Hive 3.x was the new ACIDv2 
implementation, which allows the predicates to be applied directly for 100% 
performance (Spark might find it easier to read ACIDv2, but that is more broken 
with lock-handling rather than the format readers right now).

https://issues.apache.org/jira/browse/HIVE-11320

The next big thing that went in was the new Tez semi-join reduction which kicks 
in for MERGE, which turns the 1.5B join into a bloom filter scan first (this 
would be similar to the UPDATE). You will see this as much more useful if 
you're mirroring a data-set which has auto-incremental or natural order keys 
rather than randomized keys (like a UUID would be bad, but a customer_id 
autoinc would be good).

https://issues.apache.org/jira/browse/HIVE-15269

However, if your source table is an external table & it does not have column 
stats, there's no way today to inject this semi-join via a query hint, which is 
probably what's needed for this to work if your ingest is from CSV tables.

https://issues.apache.org/jira/browse/HIVE-19115

That's probably a trivial problem, but at that point what happens is that the 
semi-join + merge looks like this.

http://people.apache.org/~gopalv/semijoin-merge.png

So you might be ahead of the curve here, but the feedback from people using 
Hive1+Merge is going into Hive3+Merge fixes.

> Second bad point: Right now spark is not able to read an ACID table
> without Major compaction. Meaning, the table needs to be rebuild
> from scratch behind the scene.

If your goal is to do ML workloads in Spark, this might be interesting to keep 
track of .

https://issues.apache.org/jira/browse/HIVE-19305

This basically forms a 1:1 bridge between PySpark and Hive-ACID (or well, any 
other hive table).

Cheers,
Gopal




Re: insert overwrite to hive orc table in aws

2018-05-01 Thread Gopal Vijayaraghavan

> delta_000_000
...
> I am using Glue data catalog as metastore, so should there be any link up to 
> these tables from hive?

That would be why transactions are returning as 0 (there is never a transaction 
0), because it is not using a Hive standard metastore.

You might not be able to use transactional tables with a non-standard 
metastore, because DbTxnManager & DbLockManager might not be working without a 
DB.

So it looks like you will not be able to use transactional tables 
without a proper Hive metastore backed by RDS.

Cheers,
Gopal




Re: Hive External Table with Zero Bytes files

2018-04-29 Thread Gopal Vijayaraghavan

> We are copying data from upstream system into our storage S3. As part of 
> copy, directories along with Zero bytes files are been copied. 

Is this exactly the same issue as the previous thread or a different one?



Cheers,
Gopal




Re: Hive, Tez, clustering, buckets, and Presto

2018-04-04 Thread Gopal Vijayaraghavan
> so there asking "where is the Hive bucketing spec".  Is it just to read the 
> code for that function? 

This worked the other way around in time from writing a spec first - ACIDv1 
implemented streaming ingest via Storm, and it used an explicit "bucket_" naming 
prefix for the filename.

Since until the compaction runs the actual base files don't exist, the ACID 
bucketing implementation has to handle missing buckets as 0 rows in base file + 
possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work 
similarly, to allow bucket map-joins between ACID & non-ACID bucketed 
tables - particularly around the modulus for negative numbers, which was 
broken in Hive 1.0.

https://issues.apache.org/jira/browse/HIVE-12025

that's the place where this all got refactored so that joins & filters for 
bucketed tables work the same way for ACID & non-ACID tables.

Because of that, the spec now lives in the comments as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored 
as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/00_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/00_0_copy_1 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/01_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/01_0_copy_1 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/02_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/02_0_copy_1 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/03_0 |
| -rw-r--r--   3 hive hdfs   181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/03_0_copy_1 |

Even when all buckets are covered, Presto should be expecting more than one file 
per bucket.

I saw a JIRA comment which said "sort in file order and assign buckets"; you 
can see that is only applicable for the 1st insert into the table (& the regex 
will remove the copy numbering).

And oddly enough this week, I saw an academic paper with a negative analysis of 
Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross" <r...@oaktreepeak.com> wrote:

Gopal,

The Presto devs say they are willing to make the changes to adhere to the 
Hive bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."

so there asking "where is the Hive bucketing spec".  Is it just to read the 
code for that function?  They were looking for something more explicit, I think.

Thanks


Re: Hive, Tez, clustering, buckets, and Presto

2018-04-03 Thread Gopal Vijayaraghavan
>* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY.  
> My understanding was that this was related to the number of buckets, but you 
> are relating it to ORC stripes.  It is odd that no examples that I've seen 
> include the SORTED BY statement other than in relation to ORC indexes (that I 
> understand).  So the question is; regardless of whether efficient ORC stripes 
> are created (wouldn't I have to also specify 'orc.create.index’=’true’ for 
> this to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of data (i.e same device has 
natural progressions for metrics), which when combined with a columnar format & 
ordering within files produces better storage and runtimes together (which I 
guess is usually a trade-off).

Without a SORTED BY, the organizing function for the data-shuffle does not 
order in any specific way - the partition key for the shuffle is the modulus, 
while the order key is 0 bytes long, so it sorts by (modulus,) which for a 
quick-sort also loses the input order into the shuffle & each bucket file is 
produced in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing more 
predictable order during shuffle.

Then the key1 can be RLE encoded so that ORC vector impl will pass it on as 
key1x1024 repetitions & do 1000x fewer comparisons when filtering rows for 
integers.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5

was written as a warning to customers who use bucketing to try & solve 
performance problems, but have ended up with bucketing as their main problem.

Most of what I have written above was discussed a few years back and in 
general, bucketing on a high cardinality column + sorting on a low cardinality 
together has given good results to my customers.
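
A hedged DDL sketch of that combination (hypothetical table; device_id as the 
high-cardinality bucketing column, metric_name as the low-cardinality leading sort key):

create table device_metrics (
  device_id    bigint,
  metric_name  string,
  metric_value double,
  event_time   timestamp
)
clustered by (device_id)
sorted by (metric_name, device_id)
into 61 buckets
stored as orc;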

>I hadn't thought of the even number issue, not having looked at the 
> function; I had assumed that it was a hash, not a modulus; shame on me.  
> Reading the docs I see that hash is only used on string columns

Actually a hash is used in theory, but I entirely blame Java for it - the Java 
hash is an identity function for Integers.

scala> 42.hashCode
res1: Int = 42

scala> 42L.hashCode
res2: Int = 42

> Finally, I'm not sure that I got a specific answer to my original question, 
> which is can I force Tez to create all bucket files so Presto queries can 
> succeed?  Anyway, I will be testing today and the solution will either be to 
> forgo buckets completely or to simply rely on ORC indexes.

There's no config to do that today & Presto is already incompatible with Hive 
3.0 tables (Update/Delete support).

Presto could fix their fail-safe for bucketing implementation to actually trust 
the Hive bucketing spec & get you out of this mess - the bucketing contract for 
Hive is actual file name -> hash % buckets (Utilities::getBucketIdFromFile).

The file-count is a very flaky way to check if the table is bucketed correctly 
- either you trust the user to have properly bucketed the table or you don't 
use it. Failing to work on valid tables does look pretty bad, instead of soft 
fallbacks.

I wrote a few UDFs which was used to validate suspect tables and fix them for 
customers who had bad historical data, which was loaded with 
"enforce.bucketing=false" or for the short hive-0.13 period with HIVE-12945.

https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27

LLAP has a bucket pruning implementation if Presto wants to copy from it 
(LLAP's S3 BI mode goes further and caches column indexes in memory or SSD).

Optimizer: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
Runtime: 
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281

That actually does things according to the Hive bucketing contract where 
uncovered buckets are assumed to have 0 rows without a file present & not error 
out instead.

If you do have the ability to redeploy Hive, the change you are looking for is 
a 1-liner to enable.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248

Cheers,
Gopal




Re: Hive, Tez, clustering, buckets, and Presto

2018-04-02 Thread Gopal Vijayaraghavan

There's more here than Bucketing or Tez.

>  PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition 
for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). 
CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

>  If I perform the inserts with the mr engine, which is now deprecated, when 
> the partition is created there are always 64 bucket files, even if some are 0 
> length.  In this case I can query with Presto just fine.

The files should be at least 3 bytes long containing the 3 letters 'O','R','C'.

If you have a bucketing impl which is producing empty files because hashing 20k 
values % 64 produces 0 rows for some buckets, the skew is unnatural.

>  However, when the Tez engine is used, only buckets that contain data are 
> created, and Presto fails since the table properties in the metastore (64 
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3 byte files which actually cost real 
money to store (because storage is usually at the 4k block level, this uses 
1000x more storage).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table 
(in the >64 case, at least some files have a _Copy suffix, but the bucket 
prefix is consistent).

Presto just bails out when it finds something strange, because they do not 
trust the Hive bucketing impl - there is a really good reason why there are 2 
bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & 
hive.tez.bucket.pruning.compat).

> we'd really like to use buckets, since the the r_crs_id in production can 
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero sized buckets - you might want 
to run your hash values through this.

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is 
> always set to "true", but I can't seem to find a description of its intent.  
> Anyway, if it's supposed to force the creation of buckets it's not working 
> with the Tez engine.

Previously you could set "enforce.bucketing=false" and the inserts of data 
wouldn't use the fixed number of buckets & could play other tricks with the data 
layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit 
called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth 
explaining right now (but just that it is possible to use it, but not very 
easily).

> configuration so that r_crs_id was at least row optimized and sorted within 
> the ORC files. 

SORTED BY in the Table DDL should do the trick - I like to use a multi-dim 
sort, in some of these scenarios.

CLUSTERED BY(r_crs_id) 
SORTED BY(r_crs_id, id)
INTO 67 BUCKETS

If you want to know why I like primes, consider what happens when % 64 is done on even numbers.

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets 
have 0x data.

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0 sized files.

FYI any prime number other than 31 works nicer than a 2^n - 
https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal




Re: Changing compression format of existing table from snappy to zlib

2018-03-14 Thread Gopal Vijayaraghavan
 

Hi,

> Would this also ensure that all the existing data compressed in snappy format 
> and the new data stored in zlib format can work in tandem with no disruptions 
> or issues to end users who query the table.

Yes. 

Each file encodes its own compressor kind & readers use that. The writers use 
the table properties.
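
e.g. the switch itself is just a table property change (hypothetical table name):

alter table events set tblproperties ("orc.compress"="ZLIB");

Existing Snappy-written files stay readable; only files written after the change use ZLIB.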

 

Cheers,

Gopal



Re: Best way/tool to debug memory leaks in HiveServer2

2018-03-13 Thread Gopal Vijayaraghavan
> It also shows that the process is consuming more than 30GB. However, it is 
> not clear what is causing the process to consume more than 30GB.

The Xmx only applies to the heap size; there's another factor that is usually 
ignored, which is the network buffers and compression buffers used by Java.

Most of these are only collected on a full GC pause (or OS memory pressure, I 
think).

Try running a jmap -histo:live and see what happens to the extra buffers on 
that list.
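
For example (the pid here is a placeholder for the HiveServer2 process):

# -histo:live forces a full GC first, then prints live object counts per class
jmap -histo:live <hs2-pid> | head -n 30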

Cheers,
Gopal




Re: Hive 1.2.1 (HDP) ArrayIndexOutOfBounds for highly compressed ORC files

2018-02-26 Thread Gopal Vijayaraghavan
Hi,

> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1453)

In general HDP specific issues tend to get more attention on HCC, but this is a 
pretty old issue stemming from MapReduce being designed for fairly low-memory 
JVMs.

The io.sort.mb size is the reason for this crash, it has a wrap-around case 
where sort buffers which are > 1Gb trigger a corner case.

As odd as this might sound, if you have fewer splits the sort buffer wouldn't 
wrap around enough times to generate a -ve offset. 

You can lower the mapreduce.task.io.sort.mb to 1024Mb or lower as a slower 
workaround.

I ran into this issue in 2013 and started working on optimizing sort for larger 
buffers for MapReduce (MAPREDUCE-4755), but ended up rewriting the entire thing 
& then added it to Tez.

Cheers,
Gopal




Re: HQL parser internals

2018-02-16 Thread Gopal Vijayaraghavan

> However, ideally we wish to manipulate the original query as delivered by the 
> user (or as close to it as possible), and we’re finding that the tree has 
> been modified significantly by the time it hits the hook

That's CBO. It takes the Query -> AST -> Calcite Tree -> AST -> hook - the 
bushy join conversion is already done by the time the hook gets called.

We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer hook.

> Additionally we wish to track back ASTNodes to the character sequences in the 
> source HQL that were their origin (where sensible), and ultimately hope to be 
> able regenerate the query text from the AST.

I started work on a Hive-unparser a while back based on this class, but it is a 
world of verbose coding.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850

If you're doing active work on this, I'd like to help, because I need the AST 
-> query to debug CBO.

> The use case, if you are interested, is a mutation testing framework for HQL. 
> The testing of mutants is operational, but now we need to report on 
> survivors, hence the need to track back from specific query elements to 
> character sequences in the original query string.

This sounds a lot like the fuzzing random-query-gen used in Cloudera to have 
Impala vs Hive bug-for-bug compat.

https://cwiki.apache.org/confluence/download/attachments/27362054/Random%20Query%20Gen-%20Hive%20Meetup.pptx

Cheers,
Gopal




Re: Question on accessing LLAP as data cache from external containers

2018-02-02 Thread Gopal Vijayaraghavan

> For example, a Hive job may start Tez containers, which then retrieve data 
> from LLAP running concurrently. In the current implementation, this is 
> unrealistic

That is how LLAP was built - to push work from Tez to LLAP vertex by vertex, 
instead of an all-or-nothing implementation.

Here are the slides describing how that is plugged in LLAP from Hadoop Summit 
2015.

https://www.slideshare.net/Hadoop_Summit/llap-longlived-execution-in-hive/21

The flag in question is hive.llap.execution.mode - the most common use-case 
imagined for it was something like the mode=map, where only table-scan + all 
secure operators (i.e no temporary UDFs) are run inside LLAP (to take advantage 
of the cache).
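
For example, to run just the scan side in LLAP (both the flag and the value are as named above):

set hive.llap.execution.mode=map;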

LLAP can shuffle data to a Tez container, but it cannot shuffle data from a Tez 
container back into the daemon (& that's not very useful, since it won't be 
cached).

Here's the class that decides the hybrid execution tree & the plans the split 
between LLAP and Tez in the same query DAG.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/LlapDecider.java#L81

If you want to consume the LLAP cached rows from something like GPUs running 
Caffee, you can access LLAP cache via the SparkSQL data-source APIs.

https://github.com/hortonworks/spark-llap-release/blob/HDP-2.6.3.0-235-tag/examples/src/main/python/spark_llap_dsl.py

This is faster than directly reading off Cloud filesystems (because of LLAP's 
SSD cache), but even with a perf penalty on-prem it is very useful to restrict 
the access of the Spark ML[1] to certain columns (i.e you can extract lat/long, 
from a table which has other PII data) without having to make a complete copy 
of the data after projections to share from the EDW end of the shop to the ML 
side of it, even if the entire data-set is HDFS encrypted.

Cheers,
Gopal
[1] - https://hortonworks.com/blog/row-column-level-control-apache-spark/




Re: Hive performance issue with _ character in query

2018-01-18 Thread Gopal Vijayaraghavan
Hi,

> I wanted to understand why hive has a performance issue with using _ 
> character in queries.

This is somewhat of a missed optimization issue - the "%" impl uses a fast 
BoyerMoore algorithm and avoids converting from utf-8 bytes -> String.

http://mail-archives.apache.org/mod_mbox/hive-user/201608.mbox/%3cd3c8de1e.4bcab%25go...@hortonworks.com%3E

That's the old thread for the same issue.

Filed a bug, since it came up again - 
https://issues.apache.org/jira/browse/HIVE-18487

If you want to pick this up and put up a patch, I can review it.

Cheers,
Gopal




Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-27 Thread Gopal Vijayaraghavan
Hi,

If you've got the 1st starvation issue fixed (with the Hadoop 2.8 patch), applied all 
these configs & enabled Log4j2 async logging, you should definitely see a performance 
improvement.

Here are the log patches, which need a corresponding LLAP config (& have to be 
disabled in HS2, for the progress bar to work):

https://issues.apache.org/jira/browse/HIVE-13027
+
https://issues.apache.org/jira/browse/HIVE-16061

You might want to share the explain plans + a jstack, if you want me to narrow 
down on the issue.

Pick a simple query like Query55 or 52, to get a sane baseline.

> And For somaxconn, why the somaxconn value of the shuffle port (15551) is 50 
> not 16384? Thanks for your help.

Good catch - I think you might need more backports to match my GA build (which 
spikes ~80% CPU on 16 cores easily).

https://issues.apache.org/jira/browse/HIVE-16737

Cheers,
Gopal




Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-24 Thread Gopal Vijayaraghavan
Hi,
 
> In our test, we found the shuffle stage of LLAP is very slow. Whether need to 
> configure some related shuffle value or not? 

Shuffle is the one hit by the 2nd, 3rd and 4th resource starvation issues 
listed earlier (FDs, somaxconn & DNS UDP packet loss).

> And we get the following log from the LLAP daemon in shuffle stage:
> 2017-11-23T12:48:40,718 INFO  [New I/O worker #128 ()] 
> org.apache.hadoop.hive.llap.shufflehandler.ShuffleHandler: Setting connection 
> close header...

This is from a Tez setting (& is unexpected with LLAP). Enable keep-alive on 
the Tez client-side 

https://github.com/t3rmin4t0r/tez-autobuild/blob/llap/tez-site.xml.frag#L171

FYI, the whole autobuild repo was made to allow easy rebuilding of Hive-LLAP 
with some sort of basic settings for a machine with 256Gb RAM  & 32 HT cores.

> Now " hive.llap.execution.mode" have "auto,none,all,map,only" mode. About the 
> four mode, do you have some suggestions? Whether the "all" mode can gain the 
> best performance or not? And  how the "auto" and "only" mode work?

The fastest mode is "only" - "auto" was designed for hybrid execution of ETL 
queries (i.e small tasks run in LLAP, large tasks run in Tez) & slows down BI 
queries (i.e slower queries, higher throughput, assuming bulk ETL).

The "all" model was designed to throw as much work into LLAP as possible, 
without failing queries. The trouble with this mode is that the performance of 
queries will be unpredictable, if some part of them falls out of LLAP. The 
"only" mode fails those queries, so that you can report a bug or at least, know 
that the performance loss is because LLAP is not active.

Once you switch to the "only" mode, it is a good idea to disable the 
HybridGraceHashJoin, to get much more performance out of the engine. Because 
the HybridGrace is a stateful hashtable, it produces 1 hashtable per-thread to 
avoid race conditions, while disabling it produces 1 read-only stateless 
hashtable which ends up being interleaved across all NUMA zones, but built 
exactly once per LLAP daemon.
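
A sketch of those two settings together (the hybrid grace knob name is from memory, so 
double-check it against your build):

set hive.llap.execution.mode=only;
set hive.mapjoin.hybridgrace.hashtable=false;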

This will cause a reduction in CPU usage (which is a good thing, unlike the 
lock starvation problems).

Cheers,
Gopal  




Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-22 Thread Gopal Vijayaraghavan
Hi,

> With these configurations,  the cpu utilization of llap is very low.

Low CPU usage has been observed with LLAP due to RPC starvation.

I'm going to assume that the build you're testing is a raw Hadoop 2.7.3 with no 
additional patches?

Hadoop-RPC is single-threaded & has a single mutex lock in the 2.7.x branch, 
which is fixed in 2.8.

Can you confirm if you have backported either 

https://issues.apache.org/jira/browse/HADOOP-11772
or 
https://issues.apache.org/jira/browse/HADOOP-12475

to your Hadoop implementation?

The secondary IO starvation comes from a series of HDFS performance problems 
which are easily worked around. Here's the root-cause

https://issues.apache.org/jira/browse/HDFS-9146

The current workaround is to make sure that the HDFS & Hive user has a limits.d 
entry to allow it to open a large number of sockets (which are fds).

https://github.com/apache/ambari/blob/trunk/ambari-server/src/main/resources/common-services/HDFS/2.1.0.2.0/package/templates/hdfs.conf.j2
+
https://github.com/apache/ambari/blob/trunk/ambari-server/src/main/resources/common-services/HIVE/2.1.0.3.0/package/templates/hive.conf.j2

This increases the FD limit for Hive & HDFS users (YARN also needs it, in case 
of Tez due to shuffle being served out of the NodeManager).
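
A hedged sketch of what such a limits.d entry looks like (values are illustrative; the 
Ambari templates linked above carry the real defaults):

# /etc/security/limits.d/hive.conf
hive   -    nofile   32000
hive   -    nproc    16000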

After increasing the FDs, LLAP is fast enough to run through 128 socket 
openings within the Linux TCP MSL (60 seconds)

The RHEL default for somaxconn  is 128, which causes 120s timeouts when HDFS 
silently loses packets & forces the packet timeout to expire before retrying.

To know whether the problem has already happened, check the SNMP traps

# netstat -s | grep "overflow"

 times the listen queue of a socket overflowed

Or to know when the SYN flood issue has been worked around by the kernel with 
cookies.

# dmesg | grep cookies

After this, you get hit by the DNS starvation within LLAP where the DNS server 
traffic (port 53 UDP) gets lost (or the DNS server bans an IP due to massive 
number of packets).

This is a JDK internal detail, which ignores the DNS actual TTL values, which 
can be worked around by running nscd or sssd on the host to cache dns lookups 
without generating UDP network packets constantly.

If you need more detail on any of these, ask away. I've had to report and get 
backports for several of these issues into HDP (mostly because perf issues are 
not generally community backports & whatever has good workarounds remain off 
the priority lists).

Cheers,
Gopal





Re: Hive +Tez+LLAP does not have obvious performance improvement than HIVE + Tez

2017-11-21 Thread Gopal Vijayaraghavan
Hi,

>  Please help us find whether we use the wrong configuration. Thanks for your 
> help.

Since there are no details, I'm not sure what configuration you are discussing 
here.

A first step would be to check if LLAP cache is actually being used (the LLAP 
IO in the explain), vectorization is being used (llap, vectorized for tasks), 
that the column stats show as COMPLETE (instead of NONE).
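
A quick way to eyeball all three on your own table (table/column names below are 
placeholders, and the exact explain wording varies by version):

analyze table my_table compute statistics for columns;
explain select count(*) from my_table where my_col > 0;
-- look for "llap" / "vectorized" in the execution mode of the vertices,
-- "LLAP IO" on the table scans, and "Column stats: COMPLETE" in the statistics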

Here's some basic config defaults LLAP in an HDP install ships with 

https://github.com/apache/ambari/blob/trunk/ambari-server/src/main/resources/common-services/HIVE/2.1.0.3.0/configuration/hive-interactive-site.xml

You're probably in for a fairly long configuration journey - in the HDP 
install, we've got almost ~2x perf gains in some queries with by using Log4J2 
async logging (but only for LLAP, it is sync logging within HiveServer2).

These configs are all driven by the installer, because Hive only contains logj 
.template files in the release tarballs.

Cheers,
Gopal







Re: READING STRING, CONTAINS \R\N, FROM ORC FILES VIA JDBC DRIVER PRODUCES DIRTY DATA

2017-11-02 Thread Gopal Vijayaraghavan

> Why jdbc read them as control symbols?

Most likely this is already fixed by 

https://issues.apache.org/jira/browse/HIVE-1608

That pretty much makes the default as

set hive.query.result.fileformat=SequenceFile;

Cheers,
Gopal 





Re: Hive JDBC - Method not Supported

2017-11-01 Thread Gopal Vijayaraghavan
Hi,


> org.apache.hive.jdbc.HiveResultSetMetaData.getTableName(HiveResultSetMetaData.java:102)


https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveResultSetMetaData.java#L102

 

I don't think this issue is fixed in any release - this probably needs to go 
into a BUG in JIRA instead.

 

Cheers,

Gopal
 



Re: In reduce task,i have a join operation ,and i found "org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1" cast much long

2017-10-19 Thread Gopal Vijayaraghavan
> . I didn't see data skew for that reducer. It has similar amount of 
> REDUCE_INPUT_RECORDS as other reducers.
…
> org.apache.hadoop.hive.ql.exec.CommonJoinOperator: table 0 has 8000 rows for 
> join key [4092813312923569]


The ratio of REDUCE_INPUT_RECORDS and REDUCE_INPUT_GROUPS is what is relevant.

 

The row containers being spilled to disk means that at least 1 key in the join 
has > 1 values.

If you have Tez, this comes up when you run the SkewAnalyzer.

https://github.com/apache/tez/blob/master/tez-tools/analyzers/job-analyzer/src/main/java/org/apache/tez/analyzer/plugins/SkewAnalyzer.java#L41

 

Cheers,

Gopal



Re: hive window function can only calculate the main table?

2017-10-09 Thread Gopal Vijayaraghavan


> ) t_result where formable = ’t1'
…
> This sql using 29+ hours in 11 computers cluster within 600G memory.
> In my opinion, the time wasting in the `order by sampledate` and `calculate 
> the table B’s record`. Is there a setting to avoid `table B`’s record not to 
> get ‘avg_wfoy_b2’ column, in the `t_result` will filter 'formable = ’t1’' 

> hive(1.2.0)

https://issues.apache.org/jira/browse/HIVE-12808
+
https://issues.apache.org/jira/browse/HIVE-16797

These are the relevant fixes for the issue you're looking into, both went into 
hive 2.x branches.

Since you have nested queries for OVER() clauses, you might also need this 
un-committed patch

https://issues.apache.org/jira/browse/HIVE-17668

Cheers,
Gopal




Re: Hive query starts own session for LLAP

2017-09-27 Thread Gopal Vijayaraghavan

> Now we need an explanation of "map" -- can you supply it?

The "map" mode runs all tasks with a TableScan operator inside LLAP instances 
and all other tasks in Tez YARN containers. This is the LLAP + Tez hybrid mode, 
which introduces some complexity in debugging a single query.

The "only" mode is so far the best option since, the LlapDecider runs very late 
in the optimizer order the earlier optimizers need to hedge their bets on 
whether LLAP will finally be used for a vertex or not. The "only" mode sort of 
short-cuts that by assuring all optimizers that it is "LLAP or Bust!".

Cheers,
Gopal





Re: hive on spark - why is it so hard?

2017-09-26 Thread Gopal Vijayaraghavan
Hi,

> org.apache.hadoop.hive.ql.parse.SemanticException: Failed to get a spark 
> session: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to create 
> spark client.
 
I get inexplicable errors with Hive-on-Spark unless I do a three step build.

Build Hive first, use that version to build Spark, use that Spark version to 
rebuild Hive.

I have to do this to make it work because Spark contains Hive jars and Hive 
contains Spark jars in the class-path.

And specifically I have to edit the pom.xml files, instead of passing in params 
with -Dspark.version, because the installed pom files don't get replacements 
from the build args.

Cheers,
Gopal




Re: Benchmarking Hive ACID functionality

2017-09-25 Thread Gopal Vijayaraghavan

> Are there any frameworks like TPC-DS to benchmark Hive ACID functionality?

Are you trying to work on and improve Hive ACID?

I have a few ACID micro-benchmarks like this 

https://github.com/t3rmin4t0r/acid2x-jmh

so that I can test the inner loops of ACID without having any ORC data at all.

https://issues.apache.org/jira/browse/HIVE-17089

look for this patch in your branch, before you start working on performance 
improvements

Cheers,
Gopal




Re: Error when running TPCDS query with Hive+LLAP

2017-09-25 Thread Gopal Vijayaraghavan

> Caused by: 
> org.apache.hadoop.hive.ql.exec.mapjoin.MapJoinMemoryExhaustionError: 
> VectorMapJoin Hash table loading exceeded memory limits. 
> estimatedMemoryUsage: 1644167752 noconditionalTaskSize: 463667612 
> inflationFactor: 2.0 threshold: 927335232 effectiveThreshold: 927335232

Most likely the table does not have column statistics to allow for it to 
estimate join sizes correctly and run through hive's CBO.

Check if the explain plan says "Optimized by CBO".

Also, check if it says in_bloom_filter() on the store_sales scanner, because if 
the COLUMN STATS: COMPLETE is missing the bloom filters get disabled because 
they can't be sized from the row-counts.
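
If the stats are missing, gathering them is usually the first thing to try (a 
sketch - adjust the table name to your schema):

analyze table store_sales compute statistics;
analyze table store_sales compute statistics for columns;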

> query25 against a 25GB dataset (my instance memory size is 64GB)

This is an artificial error, which is setup so that no single query can 
overload a daemon.

With a single node + single query setup, you probably can just disable the 
checking.

set hive.llap.mapjoin.memory.monitor.check.interval=0;

Cheers,
Gopal




Re: Hive LLAP service is not starting

2017-09-11 Thread Gopal Vijayaraghavan
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: 
> /tmp/staging-slider-HHIwk3/lib/tez.tar.gz (Is a directory)

LLAP expects to find a tarball where tez.lib.uris is - looks like you've got a 
directory?

Cheers,
Gopal




Re: ORC Transaction Table - Spark

2017-08-24 Thread Gopal Vijayaraghavan
> Or, is this an artifact of an incompatibility between ORC files written by 
> the Hive 2.x ORC serde not being readable by the Hive 1.x ORC serde?  
> 3. Is there a difference in the ORC file format spec. at play here?

Nope, we're still defaulting to hive-0.12 format ORC files in Hive-2.x.

We haven't changed the format compatibility in 5 years, so we're due for a 
refresh soon.

> 5. What’s the mechanism that affects Spark here?

SparkSQL has never properly supported ACID, because to do this correctly Spark 
has to grab locks on the table and heartbeat the lock, to prevent a compaction 
from removing a currently used ACID snapshot.

AFAIK, there's no code in SparkSQL to handle transactions in Hive - this is not 
related to the format, it is related to the directory structure used to 
maintain ACID snapshots, so that you can delete a row without failing queries 
in progress.

However, that's mostly an operational issue for production. Off the raw 
filesystem (i.e not table),  I've used SparkSQL to read the ACID 2.x raw data 
to write a acidfsck which checks underlying structures by reading them as raw 
data, so that I can easily do tests like "There's only 1 delete for each 
ROW__ID" when ACID 2.x was in dev.

You can think of the ACID data as basically

Struct<operation, originalTransaction, bucket, rowId, currentTransaction>, Struct<row>

when reading it raw.

> 6. Any similar issues with Parquet format in Hive 1.x and 2.x?

Not similar - but a different set of Parquet incompatibilities are inbound, 
with parquet.writer.version=v2.

Cheers,
Gopal

 
 
 





Re: How to optimize multiple count( distinct col) in Hive SQL

2017-08-22 Thread Gopal Vijayaraghavan
> COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
> COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
…
> GROUPING_ID() AS gid,
> COUNT(1) AS dummy

There are two things which prevent Hive from optimizing multiple count distincts: 
another aggregate (like a count(1)), or grouping sets (like a ROLLUP/CUBE).

The multiple count distincts are rewritten into a ROLLUP internally by the CBO.

https://issues.apache.org/jira/browse/HIVE-10901

A single count distinct + other aggregates (like min,max,count,count_distinct 
in 1 pass) is fixed via 

https://issues.apache.org/jira/browse/HIVE-16654
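
That single-distinct shape looks roughly like this (table/column names are 
illustrative):

select count(1), min(user_id), max(user_id), count(distinct user_id) from events;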

There's no optimizer rule to combine both those scenarios.

https://issues.apache.org/jira/browse/HIVE-15045

There's a possibility that you're using the Hive-1.x release branch, where the CBO 
doesn't kick in unless column stats are present; in the Hive-2.x series you'll 
notice that some of these optimizations are not driven by a cost function and 
are always applied if CBO is enabled.

> is there any way to rewrite it to optimize the memory usage.

If you want it to run through very slowly without errors, you can try disabling 
all in-memory aggregations.

set hive.map.aggr=false;

Cheers,
Gopal




Re: Hive index + Tez engine = no performance gain?!

2017-08-22 Thread Gopal Vijayaraghavan

TL;DR - A Materialized view is a much more useful construct than trying to get 
limited indexes to work.

That is a pretty lively project which has been going on for a while with 
Druid+LLAP

https://issues.apache.org/jira/browse/HIVE-14486

> This seems out of the blue but my initial benchmarks have shown that there's 
> no performance gain when Hive index is used with Tez engine.

Indexes in Hive have been broken ever since columnar formats came in, because there 
is no concept of directly locating a row in another table from an existing table - 
file+offsets doesn't work for a columnar format in any practical sense.

Once you have a fast format, these indexes are only good enough to answer 
queries directly from an index by maintaining a more compact copy of data, 
which is really not what an index is supposed to do.

> I think that index is a perfect solution for non-ORC file format since you 
> can selectively build an index table and leverage Tez to only look at those 
> blocks and/or files that we need to scan

Since LLAP is a distributed system, the concept of "looking up indexes and then 
looking at blocks" does not work the same way it would work for a single 
machine DB with common memory across all threads - the index file scans cannot 
go and specify which other blocks to read, because there is no single frame of 
reference for this.

We can only get meaningful speedups if the index are co-located with the blocks 
on the same machine and can reference column groups (10k rows in each chunk), 
which is exactly what ORC+LLAP manages to do with ORC's bloom filter indexes.

> we still want to have fast ad-hoc query via Hive LLAP / Tez

LLAP can do fast ad-hoc queries at around ~100 million rows/sec on a single 
node, which is pretty fast without needing a separate index + a lookup loop.

In a production cluster, with constant ETL, the global indexes were actually 
lowering throughput, since they got stale every few minutes and rebuilding fell 
behind because rebuilds are lower-priority operations throughout.

Here's an experiment for you to try

CREATE INDEX idx_COUNTRY 
ON TABLE customer(c_birth_country)
AS 'org.apache.hadoop.hive.ql.index.AggregateIndexHandler'
WITH DEFERRED REBUILD
IDXPROPERTIES('AGGREGATES'='count(c_birth_country)')
STORED AS ORC;

This does build an index stored in ORC (why not?), though it can only be built 
with MR because the index handlers weren't implemented in Tez.

Tez can still use those indexes.

0: jdbc:hive2://localhost:10007/tpcds_bin_par> set hive.optimize.index.groupby=true;

0: jdbc:hive2://localhost:10007/tpcds_bin_par>
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)

Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2 vectorized
      File Output Operator [FS_8]
        Group By Operator [GBY_7] (rows=1060 width=9536)
          Output:["_col0","_col1"],aggregations:["sum(VALUE._col0)"],keys:KEY._col0
        <-Map 1 [SIMPLE_EDGE]
          SHUFFLE [RS_3]
            PartitionCols:_col0
            Group By Operator [GBY_2] (rows=2120 width=9536)
              Output:["_col0","_col1"],aggregations:["sum(_count_of_c_birth_country)"],keys:c_birth_country
              Select Operator [SEL_1] (rows=2120 width=9536)
                Output:["c_birth_country","_count_of_c_birth_country"]
                TableScan [TS_0] (rows=2120 width=9536)

Re: Long time compiling query/explain.....

2017-08-14 Thread Gopal Vijayaraghavan


> Running Hive 2.2 w/ LLAP enabled (tried the same thing in Hive 2.3 w/ LLAP), 
> queries working but when we submit queries like the following (via our 
> automated test framework), they just seem to hang with Parsing 
> CommandOther queries seem to work fine Any idea on what's going on or 
> how to debug?  I have debug log set and all we're seeing in the HiveServer2 
> logs are: 

Get a jstack - like 20 jstacks 1s apart.
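
Something along these lines works (HS2_PID is a placeholder for the HiveServer2 
process id):

for i in $(seq 1 20); do
  jstack "$HS2_PID" > /tmp/hs2-jstack.$i
  sleep 1
done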

I think this is similar to the parenthesis parsing bug fixed in 2.3.0.

https://issues.apache.org/jira/browse/HIVE-16074

Also getting a dummy table schema would help me run explain on my own builds.
 
Cheers,
Gopal





Re: LLAP Query Failed with no such method exception

2017-08-02 Thread Gopal Vijayaraghavan
Hi,

> java.lang.Exception: java.util.concurrent.ExecutionException: 
> java.lang.NoSuchMethodError: 
> org.apache.hadoop.tracing.SpanReceiverHost.getInstance(Lorg/apache/hadoop/conf/Configuration;)Lorg/apache/hadoop/tracing/SpanReceiverHost;

There's a good possibility that you've built hive-2.1.1 against Hadoop-2.6 
without changing the pom.xml?

> hadoop 2.7.3

https://github.com/apache/hadoop/commit/472fd563e4110adcb5e90bf2cb325a9fa4c8c440

-  public static SpanReceiverHost getInstance(Configuration conf) {
+  public static SpanReceiverHost get(Configuration conf, String confPrefix) {

-    this.spanReceiverHost = SpanReceiverHost.getInstance(conf);
+    this.spanReceiverHost =
+      SpanReceiverHost.get(conf, DFSConfigKeys.DFS_SERVER_HTRACE_PREFIX);

So, there's definitely a mismatch between the version inside LLAP tarball and 
the version inside the cluster.

Cheers,
Gopal





Re: group by + two nulls in a row = bug?

2017-06-27 Thread Gopal Vijayaraghavan
>               cast(NULL as bigint) as malone_id,
>               cast(NULL as bigint) as zpid,

I ran this on master (with text vectorization off) and I get

20170626    123    NULL    NULL    10

However, I think the backtracking for the columns is broken, somewhere - where 
both the nulls end up being represented by 1 column & that I think breaks text 
vectorization somewhere.

> Output:["_col0","_col1","_col2","_col3","_col4"],aggregations:["sum(VALUE._col0)"],keys:20170626,
>  123, KEY._col2, KEY._col2

See the repetition of _col2, while output has a _col3 (and _col4 is the 
aggregate result).

Hive-1.2 has similar issues (which I assume 2.1.0 has too).

Group By Operator
  aggregations: sum(COALESCE(10,0))
  keys: 20170626 (type: int), 123 (type: int), null (type: bigint), null (type: bigint)
  mode: hash
  outputColumnNames: _col0, _col1, _col2, _col3, _col4
  Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
  Reduce Output Operator
    key expressions: 20170626 (type: int), 123 (type: int), _col3 (type: bigint)
    sort order: +++
    Map-reduce partition columns: 20170626 (type: int), 123 (type: int), _col3 (type: bigint)
    Statistics: Num rows: 1 Data size: 32 Basic stats: COMPLETE Column stats: COMPLETE
    value expressions: _col3 (type: bigint)

_col4 should've been the value expression, not _col3 and _col2 should've been 
in the key expression + partition columns (because you're grouping by 3 
columns).

> what do you think? is it me? or is it hive?

Definitely Hive.

If you file a JIRA, please run against a 1-row ORC table and report the 
vectorization issue too.

A performant fix to the problem would be to fix this similarly to how I'm 
trying to fix views with PTF + filters (i.e the filter injects a constant into 
a window function).

https://issues.apache.org/jira/browse/HIVE-16541

Doing the same with the GroupBy would prevent constants from showing up in a 
group-by like this.

These can happen because of good engineering too, you don't end up writing a 
group-by with a "cast(null as bigint)" - you write a view with a groupby and 
then call it with a "where zpid is null and malone_id is null".
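
As a rough sketch of that pattern (the view/table names and the count are 
invented; only malone_id/zpid come from your query):

create view listing_agg as
select malone_id, zpid, count(1) as cnt
from listings
group by malone_id, zpid;

select * from listing_agg where zpid is null and malone_id is null;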

Cheers,
Gopal








Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan

> I guess I see different things. Having used all the tech. In particular for 
> large hive queries I see OOM simply SCANNING THE INPUT of a data directory, 
> after 20 seconds! 

If you've got an LLAP deployment you're not happy with - this list is the right 
place to air your grievances. I usually try help anybody who haven't quite 
figured their way around it & I'm always happy to debug performance issues 
across the board.

But, if your definition of "all the tech" is text files and gzip on MR, then 
Datanami has an article out today.

https://www.datanami.com/2017/06/22/hadoop-engines-compete-comcast-query-smackdown/

And my favourite quote is the side-bar image - MapReduce SQL query performance 
is “a dumpster fire”.

> "Magically" jk. Impala allow me to query those TEXT files in milliseconds, so 
> logical deduction says the format of the data ORC/TEXT can't be the most 
> important factor here.

You're right - the most important factor is the execution engine for Hive. That 
drives the logical plans and optimizations specific to those engines, which are 
definitely more significant.

Picking the best Hive engine is the most important factor, if you're doing ETL 
or BI  - Comcast got Parquet on LLAP to work fast as well.

https://www.slideshare.net/Hadoop_Summit/hadoop-query-performance-smackdown/20

Stinger wasn't just about ORC or Vectorization or ACID or Tez, but to be 
greater than sum of the parts for an EDW.

Your milliseconds comment reminded me of an off-hand comment made by the Yahoo 
Japan folks about the time they tried Impala out for their 700 node use-case.

https://hortonworks.com/blog/impala-vs-hive-performance-benchmark/

Impala did very well at low workloads. But low load is a very artificial 
condition in a well utilized cluster. Production clusters usually run at 
near-full utilization or they're leaving money on the table.

This is also a scale problem, running at 50% utilization at 2 nodes is very 
different from running at 50% on 50, in real dollars.

From their utilization tests, Yahoo Japan detailed their experience running 
LLAP with 300+ concurrent queries last year (with *NO* query failures, instead 
of "fast or die").

https://www.slideshare.net/HadoopSummit/achieving-100k-queries-per-hour-on-hive-on-tez/38

> 2 impala server (each node)

The problem with making up your mind based on a 2 node cluster, is that scale 
problems are somewhat like the tides - huge and massive, ignore it at your 
risk, but impossible to measure in a swimming pool.

If you want, I can go into details about what is different about 3-replica HDFS 
data on 700 nodes vs 2 nodes - also imagine that your last hour is "hot" in the 
BI dashboard (and that means 4 HDFS blocks are 85% of IO reqs, which in an MPP 
database model runs out of capacity at 0.5% disk IO utilization).

If you're happy with Impala at the single digit node scale, that's useful to 
know - but it does not extrapolate naturally.

Cheers,
Gopal




Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan

> It is not that simple. The average Hadoop user has years 6-7 of data. They do 
> not have a "magic" convert everything button. They also have legacy processes 
> that don't/can't be converted. 
…
> They do not want the "fastest format" they want "the fastest hive for their 
> data".

I've yet to run into that sort of luddite - maybe engineers can hold onto 
an opinion like that in isolation, but businesses are in general cost sensitive 
when it comes to storage & compute. 

The cynic in me says that if there are a few more down rounds, ORC adoption 
will suddenly skyrocket in companies which hoard data.

ORC has massive compression advantages over Text, especially for 
attribute+metric SQL data. A closer look at this is warranted.

Some of this stuff literally blows my mind - customer_demographics in TPC-DS is 
a great example of doing the impossible.

tpcds_bin_partitioned_orc_1000.customer_demographics [numFiles=1, 
numRows=1920800, totalSize=46194, rawDataSize=726062400]

which makes it 0.19 bit per-row (not byte, *BIT*).

Compare to Parquet (which is still far better than text)

tpcds_bin_partitioned_parq_1000.customer_demographics  [numFiles=1, 
numRows=1920800, totalSize=16813614, rawDataSize=17287200]

which uses 70 bits per-row.
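
(For anyone checking the arithmetic: 46,194 bytes * 8 / 1,920,800 rows ≈ 0.19 
bits per row for ORC, versus 16,813,614 * 8 / 1,920,800 ≈ 70 bits per row for 
Parquet.)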

So as companies "age" in their data over years, they tend to be very receptive 
to the idea of switching their years old data to ORC (and then use tiered HDFS 
etc).

Still no magic button, but apparently money is a strong incentive to solve hard 
problems.

> They get data dumps from potentially non sophisticated partners maybe using 
> S3 and csv and, cause maybe their partner uses vertica or redshift. I think 
> you understand this.

That is something I'm painfully aware of - after the first few months, the 
second request is "Can you do Change-Data-Capture, so that we can reload every 
30 mins? Can we do every 5 minutes?".

And that's why Hive ACID has got SQL MERGE statements, so that you can grab a 
ChangeLog and apply it over with an UPSERT/UPDATE LATEST. And unlike the old 
lock manager, writes don't lock out any readers.
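
A rough sketch of that MERGE shape, assuming a target accounts(id, balance) table 
and a changelog with an op column - the table/column names are invented and the 
WHEN clauses depend on how your changelog encodes deletes:

merge into accounts as t
using accounts_changelog as c
on t.id = c.id
when matched and c.op = 'D' then delete
when matched then update set balance = c.balance
when not matched then insert values (c.id, c.balance);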

Then as the warehouse gets bigger, "can you prevent the UPSERT from thrashing 
my cache & IO? Because the more data I have in the warehouse the longer the 
update takes."

And that's what the min-max SemiJoin reduction in Tez does (i.e the min/max 
from the changelog goes pushed into the ORC index on the target table scan, so 
that only the intersection is loaded into cache). We gather a runtime range 
from the updates and push it to the ACID base, so that we don't have to read 
data into memory that doesn't have any updates.

Also, if you have a sequential primary key on the OLTP side, this comes in as a 
~100x speed up for such a use-case … because ACID ORC has 
transaction-consistent indexes built-in.

> Suppose you have 100 GB text data in an S3 bucket, and say queying it takes 
> lets just say "50 seconds for a group by type query".
… 
> Now that second copy..Maybe I can do the same group by in 30 seconds. 

You're off by a couple of orders of magnitude - in fact, that was my last 
year's Hadoop Summit demo, 10 terabytes of Text on S3, converted to ORC + LLAP.

http://people.apache.org/~gopalv/LLAP-S3.gif (GIANT 38Mb GIF)

That's doing nearly a billion rows a second across 9 nodes, through a join + 
group-by - a year ago. You can probably hit 720M rows/sec with plain Text with 
latest LLAP on the same cluster today.

And with LLAP, adding S3 SSE (encrypted data on S3) adds a ~4% overhead for 
ORC, which is another neat trick. And with S3Guard, we have the potential to 
get the consistency needed for ACID.

The format improvements are foundational to the cost-effectiveness on the cloud 
- you can see the impact of the format on the IO costs when you use a non-Hive 
engine like AWS Athena with ORC and Parquet [1].

> 1) io bound 
> 2) have 10 seconds of startup time anyway. 

LLAP is memory bandwidth bound today, because the IO costs are so low & is 
hidden by async IO - the slowest part of LLAP for a BI query is the amount of 
time it takes to convert the cache structures into vectorized rows.

Part of it is due to the fact ORC is really tightly compressed and decode loops 
need to get more instructions-per-clock. Some parts of ORC decompression can be 
faster than a raw memcpy of the original data, because of cache access patterns 
(rather, writing sequentially to the same buffers again is faster than writing 
to a new location). If the focus on sequential writes makes you think of disks, 
this is why memory is treated as the new disk.

The startup time for something like Tableau is actually closer to 240ms (& that 
can come down to 90ms if we disable failure tolerance).

We've got sub-second SQL execution, sub-second compiles, sub-second submissions 
… with all of it adding up to a single or double digit seconds over a billion 
rows of data.

Cheers,
Gopal
[1] - http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html






Re: Format dillema

2017-06-22 Thread Gopal Vijayaraghavan

> I kept hearing about vectorization, but later found out it was going to work 
> if i used ORC. 

Yes, it's a tautology - if you cared about performance, you'd use ORC, because 
ORC is the fastest format.

And doing performance work to support folks who don't quite care about it, is 
not exactly "see a need, fill a need".

> Litterally years have come and gone and we are talking like 3.x is going to 
> vectorize text.

Literally years have gone by since the feature came into Hive. Though it might 
have crept up on you - if Vectorization had been enabled by default, it 
would've been immediately obvious.

HIVE-9937 is so old, that I'd say the first line towards Text vectorization 
came in in Q1 2015.

In the current master, you can get a huge boost out of it - if you want you can 
run BI over 100Tb of text.

https://www.slideshare.net/Hadoop_Summit/llap-building-cloudfirst-bi/27

> … where some not negligible part of the features ONLY work with ORC.

You've got it backwards - ORC was designed to support those features.

Parquet could be following ORC closely, but at least the Java implementation 
hasn't.

Cheers,
Gopal






Re: Hive query on ORC table is really slow compared to Presto

2017-06-22 Thread Gopal Vijayaraghavan

> 1711647 -1032220119

Ok, so this is the hashCode skew issue, probably the one we already know about.

https://github.com/apache/hive/commit/fcc737f729e60bba5a241cf0f607d44f7eac7ca4

String hashcode distribution is much better in master after that. Hopefully 
that fixes the distinct speed issue here.

> Turning off map side aggregations definitely helped the query on id . The 
> query time went to 1 minute from the earlier 3+ hours. 
> 
> Based on the output above, both id and name have a lot of collisions, but the 
> name query was fast earlier too which is interesting.

The String equals check has a fast-path for length == length, so equal width id 
columns and different width name columns might have very different performance 
characteristics.

The collisions also build a binary tree in each hash bucket (JEP 180), which is 
sensitive to the order of inserts for its CPU usage (balancing trees do a lot 
of rebalancing if you insert pre-sorted data into them).

All that code exists only if the map.aggr=true, if that is disabled then all 
data is shuffled to reducers using Murmur3 hash (Tez ReduceSinks are marked 
UNIFORM|AUTOPARALLEL, to indicate this).

Cheers,
Gopal




Re: Format dillema

2017-06-20 Thread Gopal Vijayaraghavan

> 1) both do the same thing. 

The start of this thread is the exact opposite - trying to suggest ORC is 
better for storage & wanting to use it.

> As it relates the columnar formats, it is silly arms race. 

I'm not sure "silly" is the operative word - we've lost a lot of fragmentation 
of the community and are down to 2 good choices, neither of them wrong.

Impala's original format was Trevni, which lives on in Avro docs. And there was 
RCFile - a sequence file format, which stored columnar data in a <key, value> pair. 
And then there was LazySimple SequenceFile, LazyBinary SequenceFile, Avro and 
Text with many SerDes.

Purely speculatively, we're headed into more fragmentation again, with people 
rediscovering that they need updates.

Uber's Hoodie is the Parquet fork, but for Spark, not Impala. While ORC ACID is 
getting much easier to update with MERGE statements and a deadlock aware txn 
manager.

> Parquet had C/C++ right off the bat of course because impala has to work in 
> C/C++.

I think that is the primary reason why the Java Parquet readers are still way 
behind in performance.

Nobody sane wants to work on performance tuning a data reader library in Java, 
when it is so much easier to do it in C++.

Doing C++ after tuning the format for optimal performance in Java8 makes a lot 
of sense, in hindsight. The marshmallow test is easier if you can't have a 
marshmallow now.

> 1) uses text file anyway because it is the ONLY format all tools support

I see this often, folks who just throw in plain text into S3 and querying it.

Hive 3.x branch has text vectorization and LLAP cache support for it, so 
hopefully the only relevant concern about Text will be the storage costs due to 
poor compression (& the lack of updates).

Cheers,
Gopal




Re: Hive query on ORC table is really slow compared to Presto

2017-06-14 Thread Gopal Vijayaraghavan

> SELECT COUNT(DISTINCT ip) FROM table - 71 seconds
> SELECT COUNT(DISTINCT id) FROM table - 12,399 seconds

Ok, I misunderstood your gist.

> While ip is more unique that id, ip runs many times faster than id.
>
> How can I debug this ?

Nearly the same way - just replace "ip" with "id" in my exploratory queries.
 
count(distinct hash(id)) from the table?

select count(1) as collisions, hash(id) from table group by hash(id) order by 
collisions desc limit 10;

And, if those show many collisions

set tez.runtime.io.sort.mb=640;
set hive.map.aggr=false;
set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e 
retries are more expensive, happy path is faster)

select count(distinct id) from ip_table;

Java's hashCode() implementation is pretty horrible (& Hive defaults to using 
it). If you're seeing a high collision count, I think I might know what's 
happening here.

Cheers,
Gopal




Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Gopal Vijayaraghavan
Hi,

I think this is worth fixing because this seems to be triggered by the data 
quality itself - so let me dig in a bit into a couple more scenarios.

> hive.optimize.distinct.rewrite is True by default

FYI, we're tackling the count(1) + count(distinct col) case in the Optimizer 
now (which came up after your original email).

https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.optimize.countdistinct

> On running the orcfiledump utility, I see that the column on which I want to 
> run the distinct query is encoded with a DIRECT encoding.  When I run 
> distinct on other columns in the table that are encoded with the dictionary 
> encoding, the query runs quickly. 

So the cut-off for dictionary encoding is that the value repeats at least ~2x 
in each stripe - so very unique patterns won't trigger this.

If the total # of rows of IP == total IP values, I don't expect it to be 
encoded as a dictionary.

Also interesting detail - I prefer to now store IPs as 2 bigint cols.

bigint ip1, bigint ip2

This was primarily driven by the crazy math required to join different 
contractions of the IPv6 formatting.

The two-colon (::) contractions are crazy when you want to join across different 
data sources, if you store the address as a text string. Maybe 2017 is the year of IPv6 :D.

> CLUSTERED BY (ip) INTO 16 BUCKETS

This is something that completely annoys me - CLUSTERED BY does not cluster, 
but that doesn't help you here since IP is unique.

You need SORTED BY (ip) to properly generate clusters in Hive.
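
i.e. something along these lines (a sketch - the column list is shortened):

create table ip_events (ip string, ts bigint)
clustered by (ip) sorted by (ip) into 16 buckets
stored as orc;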

> Running a count(distinct) query on master id took 3+ hours. It looks like the 
> CPU was busy when running this query.

Can you do me a favour and run some intermediate state data exploratory 
queries, because some part of the slowness is probably triggered due to the 
failure tolerance checkpoints.

count(distinct hash(ip)) from the table? 

select count(1) as collisions, hash(ip) from table group by hash(ip) order by 
collisions desc limit 10;

And, if those show many collisions

set tez.runtime.io.sort.mb=640;
set hive.map.aggr=false;
set tez.runtime.pipelined.shuffle=true; // this reduces failure tolerance (i.e 
retries are more expensive, happy path is faster)

select count(distinct ip) from ip_table;

Cheers,
Gopal 





Re: Migrating Variable Length Files to Hive

2017-06-02 Thread Gopal Vijayaraghavan

> We are looking at migrating  files(less than 5 Mb of data in total) with 
> variable record lengths from a mainframe system to hive.

https://issues.apache.org/jira/browse/HIVE-10856
+
https://github.com/rbheemana/Cobol-to-Hive/ 

came up on this list a while back.

> Are there other alternative or better approaches for this solution.

Does that JIRA discuss a potential solution to your problem?

Cheers,
Gopal





Re: question on setting up llap

2017-05-10 Thread Gopal Vijayaraghavan

> for the slider 0.92, the patch is already applied, right?

Yes, except it has been refactored to a different place.

https://github.com/apache/incubator-slider/blob/branches/branch-0.92/slider-agent/src/main/python/agent/NetUtil.py#L44

Cheers,
Gopal






Re: question on setting up llap

2017-05-09 Thread Gopal Vijayaraghavan

> NetUtil.py:60 - [Errno 8] _ssl.c:492: EOF occurred in violation of protocol

The error is directly related to the SSL verification error - TLSv1.0 vs 
TLSv1.2.

JDK8 defaults to v1.2 and Python 2.6 defaults to v1.0.

Python 2.7.9 + the patch in 0.92 might be needed to get this to work.

AFAIK, there's no way to disable SSL in Apache Slider.

Cheers,
Gopal




Re: question on setting up llap

2017-05-09 Thread Gopal Vijayaraghavan

> ERROR 2017-05-09 22:04:56,469 NetUtil.py:62 - SSLError: Failed to connect. 
> Please check openssl library versions. 
…
> I am using hive 2.1.0, slider 0.92.0, tez 0.8.5

AFAIK, this was reportedly fixed in 0.92.

https://issues.apache.org/jira/browse/SLIDER-942

I'm not sure if the fix in that patch will work on python 2.6.6.

Cheers,
Gopal




Re: Hive LLAP with Parquet format

2017-05-04 Thread Gopal Vijayaraghavan
Hi,


 > Does Hive LLAP work with Parquet format as well?

 

LLAP does work with the Parquet format, but it does not work very fast, because 
the java Parquet reader is slow.

https://issues.apache.org/jira/browse/PARQUET-131
+

https://issues.apache.org/jira/browse/HIVE-14826

In particular to your question, Parquet's columnar data reads haven't been 
optimized for Azure/S3/GCS.

There was a comparison of ORC vs Parquet for NYC taxi data and it found that 
for simple queries Parquet read ~4x more data over the network - your problem 
might be bandwidth related.

You might want to convert a small amount to ORC and see whether the BYTES_READ 
drops or not.

In my tests with a recent LLAP, Text data was faster on LLAP on S3 & Azure than 
Parquet, because Text has a vectorized reader & cache support.


Cheers,

Gopal



Re: Hive Partitioned View query error

2017-04-24 Thread Gopal Vijayaraghavan

> But on Hue or JDBC interface to Hive Server 2, the following error occurs 
> while SELECT querying the view.

You should be getting identical errors for HS2 and CLI, so that suggests you 
might be running different CLI and HS2 versions.

> SELECT COUNT(1) FROM pk_test where ds='2017-04-20';
> org.apache.hadoop.hive.ql.optimizer.calcite.rules.HiveRelFieldTrimmer.trimFields(org.apache.calcite.rel.core.Project,org.apache.calcite.util.ImmutableBitSet,java.util.Set)'
…
> Does anyone know whats going on here?

My  guess is that the CBO fieldtrimmer is getting confused that we're reading 0 
columns from a partitioned table.

Try doing 

SELECT COUNT(build) from pk_test where ds='2017-04-20';

If that works, but the count(1) fails - then we can dig a bit deeper into the 
problem.

Cheers,
Gopal




Re: How to create auto increment key for a table in hive?

2017-04-12 Thread Gopal Vijayaraghavan

> I'd like to remember that Hive supports ACID (in a very early stages yet) but 
> most often that is a feature that most people don't use for real production 
> systems.

Yes, you need ACID to maintain multiple writers correctly.

ACID does have a global primary key (which is not a single integer) - ROW__ID.

select ROW__ID, * from acid_table;

will return a unique value for each row.

Cheers,
Gopal





Re: beeline connection to Hive using both Kerberos and LDAP with SSL

2017-04-07 Thread Gopal Vijayaraghavan

> Is there anyway one can enable both (Kerberos and LDAP with SSL) on Hive?

I believe what you're looking for is Apache Knox SSO. And for LDAP users, 
Apache Ranger user-sync handles auto-configuration.

That is how SSL+LDAP+JDBC works in the HD Cloud gateway [1].

There might be a similar solution from CDH, if you go digging for it. 

Cheers,
Gopal
[1] - https://hortonworks.github.io/hdp-aws/security-network/#protected-gateway




Re: Hive query on ORC table is really slow compared to Presto

2017-04-04 Thread Gopal Vijayaraghavan
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
…
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]

I'm hoping this is not rewriting to the approx_distinct() in Presto.

> I got similar performance with Hive + LLAP too.

This is a logical plan issue, so I don't know if LLAP helps a lot.

A count + a count(distinct) is planned as a full shuffle of 100% of rows.

Run with 

set hive.tez.exec.print.summary=true;

And see the output row-count of Map 1.

> What can be done to get the hive query to run faster in hive?

Try with (see if it generates a Reducer 2 + Reducer 3, which is what the 
speedup comes from).

set hive.optimize.distinct.rewrite=true;

or try a rewrite

select id from accounts group by id having count(1) > 1;

Both approaches enable full-speed vectorization for the query.
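
If you need both aggregates from a single scan, a manual two-stage rewrite is 
another option (a sketch, modulo NULL-id handling - it is not the planner's 
exact rewrite):

select count(*) as distinct_ids, sum(cnt) as total_rows
from (select id, count(*) as cnt from accounts group by id) t;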

Cheers,
Gopal




Re: LLAP queries create a yarn app per query

2017-03-28 Thread Gopal Vijayaraghavan
> My bad. Looks like the thrift server is cycling through various AMs it 
> started when the thrift server was started. I think this is different from 
> either Hive 2.0.1 or LLAP. 

This has been roughly been possible since hive-1.0, if you follow any of the 
Tez BI tuning guides over the last 4 releases [1].

The issue is more of a "User left Tableau open and went for lunch" capacity 
issue, so there is no direct connectivity from the JDBC  connection -> Tez 
session.

This is the default with LLAP and this specific tuning was off previously when 
the same Hive was used for ETL, Reporting and BI interchangeably.

> Is there a setting I'm missing or is this expected behavior.

There's probably some value in removing the round-robin through the pool and 
switching over to the "latest used first" to help a single user avoid confusion.

The queue in TezSessionPoolManager has to be a deque, to allow that - so that 
a single user coming back from lunch has to only cold-start one AM before they 
get fast queries on LLAP.

Cheers,
Gopal
[1] - 
https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.4.3/bk_performance_tuning/content/section_set_up_interactive_queues_for_HDP_2-2.html




Re: Hive on Tez: Tez taking nX more containers than Mapreduce for union all

2017-03-17 Thread Gopal Vijayaraghavan

> We are using a query with union all and groupby and same table is read 
> multiple times in the union all subquery.
…
> When run with Mapreduce, the job is run in one stage consuming n mappers and 
> m reducers and all union all scans are done with the same job.

The logical plans are identical btw - MR effectively reads the same table again 
and again, unless the correlation optimizer is folding this.

I doubt that due to the unix_timestamp(). An explain would be useful.

> Hence if there are 50 union alls in a query, the 50n map vertex tasks are 
> launched which is huge.

Tez lets you scale the mappers up/down using split grouping parameters, so you 
can tweak it to scale down if you want to.

set tez.grouping.split-waves=0.1;

would try to shrink the width of the mappers.

An alternative is to use a CTE + materialization (HIVE-11752), but for that you 
need Hive2.

> http://pastebin.com/u7Rw6Hag

You can probably get a ~2x speedup by removing the UNIX_TIMESTAMP() and using 
CURRENT_TIMESTAMP instead.

Cheers,
Gopal




Re: [Hive on Tez] Running queries in tez non-session mode not working

2017-03-14 Thread Gopal Vijayaraghavan


> by setting tez.am.mode.session=false in hive-cli and hive-jdbc via 
> hive-server2.

That setting does not work if you do "set tez.am.*" parameters (any tez.am 
params).

Can you try doing 

hive --hiveconf tez.am.mode.session=false 

instead of a set; param and see if that works?

Cheers,
Gopal




Re: ODBC - NullPointerException when loading data

2017-02-02 Thread Gopal Vijayaraghavan
> Using Apache Hive 1.2.1, I get a NullPointerExcetion when performing a 
> request through an ODBC driver.
> The request is just a simple LOAD DATA request:

Looks like the NPE is coming from the getResultMetaData() call, which returns 
the schema of the rows returned.

LOAD is probably one of the few calls which have no schema for its result 
returns.

> Any help on this would be greatly appreciated.

File a JIRA (with the ODBC driver version + exact version of Hive), because it looks 
like the JDBC drivers don't trigger this codepath - specifically:

at 
org.apache.hive.service.cli.CLIService.getResultSetMetadata(CLIService.java:435)
at 
org.apache.hive.service.cli.thrift.ThriftCLIService.GetResultSetMetadata(ThriftCLIService.java:658)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetResultSetMetadata.getResult(TCLIService.java:1537)
at 
org.apache.hive.service.cli.thrift.TCLIService$Processor$GetResultSetMetadata.getResult(TCLIService.java:1522

for a LOAD DATA return path.

Cheers,
Gopal




Re: Compactions doesn't launch on Hive 1.1.0

2017-02-02 Thread Gopal Vijayaraghavan




> I try reduce check interval and launch it manuallty with command "Alter
> table tx_tbl compaction 'major';". Nothing helps.

You can check the hive metastore log and confirm it also has the DbTxnManager
set up & that it is triggering the compactions.

Without a standalone metastore, the hive ACID compactor threads are not 
triggered.
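
The metastore-side properties worth double-checking are roughly these (hive-site.xml 
on the standalone metastore; names per the Hive ACID docs, values illustrative):

hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.compactor.initiator.on=true
hive.compactor.worker.threads=2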

Cheers,
Gopal




Re: Experimental results using TPC-DS (versus Spark and Presto)

2017-01-30 Thread Gopal Vijayaraghavan

> Gopal : (yarn logs -application $APPID) doesn't contain a line
>  containing HISTORY so it doesn't produce svg file. Should I turn on
>  some option to get the lines containing HISTORY in yarn application
>  log?

There's a config option tez.am.log.level=INFO which controls who much data is 
written to the log there.

I think there's an interval type clause in the 72 query, which might be a 
problem.

> and d3.d_date > d1.d_date + 5

That might be doing UDFToDouble(d_date) > UDFToDouble(d_date) + 5, which will 
evaluate into 

NULL > NULL + 5

Because UDFToDouble("1997-01-01") is NULL.
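
If that is what's happening, one way to keep the comparison in the date domain is 
something like this (a sketch - check the actual column types in your copy of the 
schema first):

and d3.d_date > date_add(d1.d_date, 5)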

So, seeing your explain would go a long way in finding out what's going on.

The swimlane raw data is also somewhat interesting to me, because I also draw a 
different set of graphs from the same HISTORY data.

http://people.apache.org/~gopalv/q21_suppliers_who_kept_orders_waiting.svg

to locate bottlenecks in the system.

Cheers,
Gopal





Re: Experimental results using TPC-DS (versus Spark and Presto)

2017-01-30 Thread Gopal Vijayaraghavan

> Hive LLAP shows better performance than Presto and Spark for most queries, 
> but it shows very poor performance on the execution of query 72.

My suspicion will be the the inventory x catalog_sales x warehouse join - 
assuming the column statistics are present and valid.

If you could send the explain formatted plans and swimlanes for LLAP, I can 
probably debug this better.

https://github.com/apache/tez/blob/master/tez-tools/swimlanes/yarn-swimlanes.sh

Use the "submitted to " in this to get the diagram.

Cheers,
Gopal




Re: Hive Tez on External Table running on Single Mapper

2017-01-30 Thread Gopal Vijayaraghavan

> > 'skip.header.line.count'='1',

Try removing that config option.

I've definitely seen footer markers disabling file splitting, possibly header 
also does.

Cheers,
Gopal




Re: Parquet tables with snappy compression

2017-01-25 Thread Gopal Vijayaraghavan

> Has there been any study of how much compressing Hive Parquet tables with 
> snappy reduces storage space or simply the table size in quantitative terms?

http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet/20

Since SNAPPY is just LZ77, I would assume it would be useful in cases of 
Parquet leaves containing text with large common sub-chunks (like URLs or log 
data).

If you want to experiment with that corner case, the L_COMMENT field from TPC-H 
lineitem is a good compression-thrasher.

Cheers,
Gopal




Re: Only External tables can have an explicit location

2017-01-25 Thread Gopal Vijayaraghavan
> Error 40003]: Only External tables can have an explicit location
…
> using hive 1.2. I got this error. This was definitely not a requirement 
> before 

Are you using Apache hive or some vendor fork?

Some BI engines demand there be no aliasing for tables, so each table needs a 
unique location to avoid schema issues.

I tested this on master & HDP builds, the vanilla Apache doesn't have this 
restriction (if there is, an explicit version would be helpful).

$ hive --version
Hive 1.2.1000.2.5.3.0-37

hive> use testing;
OK
Time taken: 1.245 seconds
hive> create table foo (x int) location '/tmp/foo';
OK
Time taken: 0.59 seconds
hive> desc formatted foo;
OK
# col_name  data_type   comment 
 
x   int 
 
# Detailed Table Information 
Database:   testing  
Owner:  gopal
CreateTime: Wed Jan 25 15:38:27 EST 2017 
LastAccessTime: UNKNOWN  
Protect Mode:   None 
Retention:  0
Location:   hdfs://nnha/tmp/foo  
Table Type: MANAGED_TABLE
Table Parameters:
numFiles2   
totalSize   66  
transient_lastDdlTime   1485376707  
 
# Storage Information
SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
 
InputFormat:org.apache.hadoop.mapred.TextInputFormat 
OutputFormat:   
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat   
Compressed: No   
Num Buckets:-1   
Bucket Columns: []   
Sort Columns:   []   
Storage Desc Params: 
serialization.format1   
Time taken: 0.455 seconds, Fetched: 28 row(s)
hive> 

Cheers,
Gopal





Re: Hive Tez on External Table running on Single Mapper

2017-01-23 Thread Gopal Vijayaraghavan
> We have 20 GB txt File, When we have created external table on top of 20 
>  Gb file, we see Tez is creating only one mapper. 

For an uncompressed file, that is very strange. Is this created as "STORED AS 
TEXTFILE" or some other strange format?

Cheers,
Gopal





Re: Cannot connect to Hive even from local

2017-01-22 Thread Gopal Vijayaraghavan
> !connect jdbc:hive2://localhost:1/default; -n hiveuser -p hivepassword
...
> What's missing here? how do I fix it? Thank you very much

Mostly, this is missing the actual protocol specs - this is something which is 
never a problem for real clusters because ZK load-balancing automatically 
configures it.

The HS2 is most likely running in HTTP mode, which needs additional parameters

https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-ConnectionURLWhenHiveServer2IsRunninginHTTPMode

My paths end up looking like

jdbc:hive2://sandbox.hortonworks.com:1/tpcds_flat_orc_2;transportMode=http;httpPath=cliservice?

Cheers,
Gopal




Re: unable to create or grant privileges or roles under beeline

2017-01-20 Thread Gopal Vijayaraghavan
> I have been following the instructions under 
> https://cwiki.apache.org/confluence/display/Hive/SQL+Standard+Based+Hive+Authorization
>  in great detail with no success.
…
>  Error: org.apache.spark.sql.catalyst.parser.ParseException:

You're reading the docs for Apache Hive and trying to replicate it on Apache 
Spark.

Cheers,
Gopal





Re: how to customize tez query app name

2017-01-20 Thread Gopal Vijayaraghavan

> So no one has a solution?
…
> “mapreduce.job.name” works for M/R queries, not Tez. 

Depends on the Hive version you're talking about.

https://issues.apache.org/jira/browse/HIVE-12357

That doesn't help you with YARN, but only with the TezUI (since each YARN AM 
runs > 1 queries).

For something like an ETL workload, I suspect we can name CLI sessions, but not 
queries independently.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java#L314

Cheers,
Gopal





Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Gopal Vijayaraghavan

> Sounds like VARCHAR and CHAR types were created for Hive to have ANSI SQL 
> Compliance. Otherwise they seem to be practically the same as String types.

They are relatively identical in storage, except both are slower on the CPU in 
actual use (CHAR has additional padding code in the hot-path).

There is no constant form for those two types, so a string comparison like, 
say, varcharcol = 'NONE' would get promoted up as 

UDFToString(varcharcol) = 'NONE'

Resulting in all ORC/Parquet index pushdowns being turned off due to the cast 
on the column & if you run an explain and notice something similar, it will 
cause a significant performance loss.

In general, I see 2-3x performance degradation in case of CHAR/VARCHAR when 
doing constant filter operations & other issues when joining different sized 
ops (Varchar(3) x Varchar(4) would go this route).

The default String types are faster purely because they are the destination 
type for any up-conversion or constant-folding conversions.

Cheers,
Gopal




Re: Vectorised Queries in Hive

2017-01-11 Thread Gopal Vijayaraghavan


> I have also noticed that this execution mode is only applicable to single 
> predicate search. It does not work with multiple predicates searches. Can 
> someone confirms this please?

Can you explain what you mean?

Vectorization supports multiple & nested AND+OR predicates - with some extra 
SIMD efficiencies in place for constants or repeated values.

Cheers,
Gopal




Re: Zero Bytes Files importance

2017-01-03 Thread Gopal Vijayaraghavan
> Thanks Gopal. Yeah I'm using CloudBerry.  Storage is Azure. 

Makes sense, only an object store would have this.

> Are you saying this _0,1,2,3 are directories ?. 

No, only the zero size "files".

This is really for compat with regular filesystems.

If you have /tmp/1/foo in an object store that's a single key. That does not 
imply you'll find "/tmp" or "/tmp/1" in the object store keys.

A FileSystem however assumes parent directories are "real things", so any 
FileSystem abstraction has to maintain "/tmp", "/tmp/1/" and "/tmp/1/foo" to 
keep up the basic compatibility requirements of fs.exists("/tmp").

Cheers,
Gopal




Re: Zero Bytes Files importance

2016-12-29 Thread Gopal Vijayaraghavan

> For any insert operation, there will be one Zero bytes file. I would like to 
> know importance of this Zero bytes file.

They are directories.

I'm assuming you're using S3A + screenshots from something like Bucket explorer.

These directory entries will not be shown if you do something like "hadoop fs 
-ls s3a://…/"

I had a recent talk covering the specifics of S3 + Hive - 
https://www.slideshare.net/secret/3cfQbeo3cI6GpK/3

Cheers,
Gopal




Re: Can Beeline handle HTTP redirect?

2016-12-22 Thread Gopal Vijayaraghavan
> I want to know whether Beeline can handle HTTP redirect or not. I was 
> wondering if some of Beeline experts can answer my question?

Beeline uses the hive-jdbc driver, which is the one actually handling network 
connections.

That driver in turn, uses a standard org.apache.http.impl.client.HttpClients to 
talk to the servers.

You can probably dig through the code usage and find out if it is allowing 302 
or not (POST + 100 Continue is another question to answer by experimenting).

https://github.com/apache/hive/blob/master/jdbc/src/java/org/apache/hive/jdbc/HiveConnection.java#L348

Cheers,
Gopal





Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
 
> Actually, we don't have that many partitions - there are lot of gaps both in 
> days and time events as well.

Your partition description sounded a lot like one of the FAQs from Mithun's 
talks, which is why I asked

http://www.slideshare.net/Hadoop_Summit/hive-at-yahoo-letters-from-the-trenches/24

> But, I would like to understand when you say " time spent might partly be 
> query planning with million partitions"? I presume, this is in producing the 
> physical plan? -- does it spend time in allocating group of partition 
> directories to each map task

Yes, the physical planner is significant overhead, since all map-tasks get a 
list of all partitions and match each read against that list (partition schema 
can evolve, this is closer to O(n^2)) & the split-generation is bottlenecked by 
the total # of files involved in the operation (a slow O(n) operation is still 
slow).

The overhead in maintaining partitions is fairly high & the entire query 
planning will try to do something like a du -sh on each partition, if all the 
basic file statistics are missing etc.

Also, if you have >100k partitions, disabling the metadata optimizer 
(hive.optimize.metadataonly=false) would be a good thing, since that codepath 
is single threaded, while a compute-heavy full-table scan is much faster due to 
parallelism - it might burn more CPU, but it would come back in less than an 
hour.

You might want to opt for daily partitions & also run the stats gathering ops 
with "analyze table  compute statistics partialscan;" & "analyze table 
 compute statistics for columns" to speed up further queries.

At least in my experience with ORC + valid stats, a query like "select count(*) 
from table" should take <500 milliseconds.

Cheers,
Gopal





Re: Hive/TEZ/Parquet

2016-12-15 Thread Gopal Vijayaraghavan
> The partition is by year/month/day/hour/minute. I have two directories - over 
> two years, and the total number of records is 50Million.  

That's a million partitions with 50 rows in each of them?

> I am seeing it takes more than 1hr to complete. Any thoughts, on what could 
> be the issue or approach that can be taken to improve the performance?

Looks like you have over-partitioned your data massively - the 1 hour might be 
partly query planning with million partitions and the rest might be file-count 
related overheads.

At least in case of ORC, I recommend that the partitions contain at least 1 Gb 
of data & that if you really need to query down to finer levels, to use bloom 
filters (PARQUET-41 is not fixed yet, so YMMV) + sorted ordering.

http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4

Cheers,
Gopal



