Re: Bucket pruning

2015-03-12 Thread Gopal Vijayaraghavan
Hi,

No, and it's a shame because we're stuck on some compatibility details with
this.

The primary issue is the fact that the InputFormat is very generic and
offers no way to communicate StorageDescriptor or bucketing.

The split generation for something like SequenceFileInputFormat lives inside
MapReduce, where it has no idea about bucketing.

So InputFormat.getSplits(conf) returns something relatively arbitrary,
which contains a mixture of files when CombineInputFormat is turned on.

I have implemented this twice so far for ORC (for custom Tez jobs, with
huge wins) by using an MRv2 PathFilter over the regular OrcNewInputFormat
implementation, by turning off combine input and using Tez grouping
instead.

But that has proved to be very fragile for a trunk feature, since with
schema evolution of partitioned tables older partitions may be bucketed
with a different count from a newer partition - so the StorageDescriptor
for each partition has to be fetched across before we can generate a valid
PathFilter.

The SARGs are probably a better way to do this eventually as they can
implement IN_BUCKET(1,2) to indicate 1 of 2, instead of the "0_1"
PathFilter which is fragile.


Right now, the most fool-proof solution we've hit upon was to apply the
ORC bloom filter to the bucket columns, which is far safer as it does not
care about the DDL - it does a membership check on the actual metadata and
prunes deeper at the stripe level if it is sorted as well.
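
A minimal sketch of that approach, assuming the table/column names from
Daniel's question and a build where the orc.bloom.filter.columns table
property is available (hive-1.2+); this is an illustration, not the
original thread's DDL:

-- hypothetical bucketed table; the bloom filter is declared on the bucket column
CREATE TABLE testtble (
  bucket_col STRING,
  payload    STRING
)
CLUSTERED BY (bucket_col) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('orc.bloom.filter.columns'='bucket_col');

-- the point lookup can then skip stripes/row-groups whose bloom filter
-- rules out 'X', independent of the bucketing DDL
SELECT * FROM testtble WHERE bucket_col = 'X';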

That is somewhat neat since this doesn't need any new options for querying
- it automatically(*) kicks in for your query pattern.

Cheers,
Gopal
(*) - conditions apply - there's a threshold for file-size for these
filters to be evaluated during planning (to prevent HS2 from burning CPU).


From:  Daniel Haviv daniel.ha...@veracity-group.com
Reply-To:  user@hive.apache.org user@hive.apache.org
Date:  Thursday, March 12, 2015 at 2:36 AM
To:  user@hive.apache.org user@hive.apache.org
Subject:  Bucket pruning


Hi,
We created a bucketed table and when we select in the following way:
select * 
from testtble
where bucket_col ='X';

We observe that all of the table is being read and not just the
specific bucket.

Does Hive support such a feature ?


Thanks,
Daniel




Re: Executing HQL files from JAVA application.

2015-03-24 Thread Gopal Vijayaraghavan
Hi,

Any mechanism which bypasses schema layers for SQL is a bad idea.

See this example for how you can connect to HiveServer2 directly from Java
- 
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveSe
rver2Clients-JDBCClientSampleCode

Use the JDBC driver to access HiveServer2 through a first-class Java API.

If you find any performance issues with this method, let me know and I can
fix it.

Cheers,
Gopal

From:  Amal Gupta amal.gup...@aexp.com
Reply-To:  user@hive.apache.org user@hive.apache.org
Date:  Sunday, March 22, 2015 at 10:53 PM
To:  user@hive.apache.org user@hive.apache.org
Subject:  RE: Executing HQL files from JAVA application.


Hey Mich,

 
Got any clues regarding the failure of the code that I sent?
 
I was going through the project and the code again and I suspect
mismatched dependencies to be the culprits. I am currently trying to
re-align the dependencies as per the pom given on mvnrepository.com while
trying to see if a particular configuration succeeds.
 
Will keep you posted on my progress. Thanks again for all the help that
you are providing. :)
 
Regards,
Amal

 
From: Amal Gupta

Sent: Sunday, March 22, 2015 7:52 AM
To: user@hive.apache.org
Subject: RE: Executing HQL files from JAVA application.


 
Hi Mich,
 
:) A coincidence. Even I am new to hive. My test script which I am trying
to execute contains a drop and a create statement.
 
Script :-

use test_db;
DROP TABLE IF EXISTS demoHiveTable;
CREATE EXTERNAL TABLE demoHiveTable (
demoId string,
demoName string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'

STORED AS TEXTFILE LOCATION '/hive/';
 
 
Java Code: -
Not sure whether this will have an impact, but the code is part of a
Spring Batch Tasklet being triggered from the Batch-Context. This tasklet
runs in parallel with other tasklets.

public RepeatStatus execute(StepContribution arg0, ChunkContext arg1)
        throws Exception {
    String[] args = {
        "-d", BeeLine.BEELINE_DEFAULT_JDBC_DRIVER,
        "-u", "jdbc:hive2://server-name:1/test_db",
        "-n", "**", "-p", "**",
        "-f", "C://Work//test_hive.hql"};

    BeeLine beeline = new BeeLine();
    ByteArrayOutputStream os = new ByteArrayOutputStream();
    PrintStream beelineOutputStream = new PrintStream(os);
    beeline.setOutputStream(beelineOutputStream);
    beeline.setErrorStream(beelineOutputStream);
    beeline.begin(args, null);
    String output = os.toString("UTF-8");
    System.out.println(output);

    return RepeatStatus.FINISHED;
}
 
It will be great if you can share the piece of code that worked for you.
May be it will give me some pointers on how to go ahead.

 
Best Regards,
Amal

 
From: Mich Talebzadeh [mailto:m...@peridale.co.uk]

Sent: Sunday, March 22, 2015 2:58 AM
To: user@hive.apache.org
Subject: RE: Executing HQL files from JAVA application.


 
Hi Amal;
 
Coming from a relational database (Oracle, Sybase) background, I always
expect that a DDL statement like DROP TABLE has to run in its own
transaction and cannot be combined with a DML statement.
 
Now I suspect that when you run the command DROP TABLE IF EXISTS
TABLE_NAME; like below in beeline it works

0: jdbc:hive2://rhes564:10010/default> drop table if exists mytest;
No rows affected (0.216 seconds)
 
That runs in its own transaction so it works. However, I suspect in Java
that is not the case. Can you possibly provide your Java code to see what
exactly it is doing.
 
Thanks,
 
Mich
 
http://talebzadehmich.wordpress.com
 
Publications due shortly:
Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
Coherence Cache
 
NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
 recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Ltd, its
subsidiaries or their employees, unless expressly so stated. It is the
responsibility of the recipient to ensure
 that this email is virus free, therefore neither Peridale Ltd, its
subsidiaries nor their employees accept any responsibility.

 
From: Amal Gupta [mailto:amal.gup...@aexp.com]

Sent: 21 March 2015 18:16
To: user@hive.apache.org
Subject: RE: Executing HQL files from JAVA application.


 
Hi Mich,

 
Thank you for your response.  I was not aware of beeline. I have now
included this in my app and this looks a much better solution going
forward.  In the last
 couple of hours I have tried to work with beeline but have been facing
some issues.

 
1. I was able to run on the remote server command line a beeline command,
given below. This was successful.

beeline -u jdbc:hive2://server-name:1/test_db
org.apache.hive.jdbc.HiveDriver -n * -p ** -f

Re: ORDER BY clause in Hive

2015-03-29 Thread Gopal Vijayaraghavan
Hi Lefty,

Couldn't find the documentation for what
hive.groupby.orderby.position.alias=true does.

I suspect that might be what Mich was looking for (though I tend to write
the column names explicitly).
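
For reference, a minimal sketch of what that flag does, written against
Mich's sales table from the thread below (the usage shown is an
assumption, not a quote from the docs):

set hive.groupby.orderby.position.alias=true;

-- with the alias enabled, ORDER BY 3 resolves to the third select-list
-- column (Total_customer_amount) rather than being treated as a constant
SELECT cust_id AS Customer_ID,
       COUNT(amount_sold) AS Number_of_orders,
       SUM(amount_sold)   AS Total_customer_amount
FROM sales
GROUP BY cust_id
ORDER BY 3;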

Cheers,
Gopal

From:  Lefty Leverenz leftylever...@gmail.com
Reply-To:  user@hive.apache.org user@hive.apache.org
Date:  Sunday, March 29, 2015 at 8:32 PM
To:  user@hive.apache.org user@hive.apache.org
Subject:  Re: ORDER BY clause in Hive

I added information about this in the Order By
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy#Lang
uageManualSortBy-SyntaxofOrderBy  section of the wiki.  Thanks, Mich and
Gopal!

-- Lefty

On Sun, Mar 29, 2015 at 5:24 AM, Mich Talebzadeh m...@peridale.co.uk
wrote:
 Hi Lefty, Gopal,
  
 It appears that ORDER BY 3 is not interpreted as ORDER BY "the result set
 column three", which standard SQL (as evident from Oracle and Sybase) does.
  
 So I made it an ORDER BY from the result set explicitly as shown below and it
 worked OK
  
 SELECT
   rs.Customer_ID
 , rs.Number_of_orders
 , rs.Total_customer_amount
 , rs.Average_order
 , rs.Standard_deviation
 FROM
 (
 SELECT cust_id AS Customer_ID,
 COUNT(amount_sold) AS Number_of_orders,
 SUM(amount_sold) AS Total_customer_amount,
 AVG(amount_sold) AS Average_order,
 STDDEV(amount_sold) AS Standard_deviation
 FROM sales
 GROUP BY cust_id
 HAVING SUM(amount_sold) > 94000
 AND AVG(amount_sold) < STDDEV(amount_sold)
 ) rs
 ORDER BY
  rs.Total_customer_amount
 ;
  
  
 +-----------------+----------------------+----------------------------+----------------------+------------------------+
 | rs.customer_id  | rs.number_of_orders  | rs.total_customer_amount   | rs.average_order     | rs.standard_deviation  |
 +-----------------+----------------------+----------------------------+----------------------+------------------------+
 | 1743.0          | 238                  | 94786.123                  | 398.2610504201678    | 581.0439095219863      |
 | 429.0           | 231                  | 94819.416                  | 410.4736363636366    | 613.7057080691426      |
 | 2994.0          | 227                  | 94862.616                  | 417.89696035242315   | 623.1607772763742      |
 | 6395.0          | 268                  | 97010.478                  | 361.97940298507456   | 576.9120977984521      |
 | 12783.0         | 240                  | 97573.546                  | 406.556458332        | 590.4445500393804      |
 | 4974.0          | 235                  | 98006.162                  | 417.0474893617022    | 624.337482834059       |
 | 42167.0         | 266                  | 98585.962                  | 370.6239097744362    | 590.965120684093       |
 | 10747.0         | 256                  | 99578.087                  | 388.9769140624999    | 600.7615005975689      |
 | 11407.0         | 248                  | 103412.655                 | 416.9865322580643    | 622.221465710723       |
 +-----------------+----------------------+----------------------------+----------------------+------------------------+
 9 rows selected (209.699 seconds)
  
  
 Regards,
  
 
 Mich Talebzadeh
  
 http://talebzadehmich.wordpress.com
  
 Publications due shortly:
 Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
 Coherence Cache
  
 NOTE: The information in this email is proprietary and confidential. This
 message is for the designated recipient only, if you are not the intended
 recipient, you should destroy it immediately. Any information in this message
 shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries
 or their employees, unless expressly so stated. It is the responsibility of
 the recipient to ensure that this email is virus free, therefore neither
 Peridale Ltd, its subsidiaries nor their employees accept any responsibility.
  
 
 From: Mich Talebzadeh [mailto:m...@peridale.co.uk]
 Sent: 29 March 2015 00:11
 To: user@hive.apache.org
 Subject: ORDER BY clause in Hive
 
  
 Hi,
  
 Can someone point me to doc or otherwise to see if ORDER BY clause in Hive is
 working OK
  
 I have a simple aggregate query as follows:
  
 SELECT cust_id AS Customer_ID,
 COUNT(amount_sold) AS Number_of_orders,
 SUM(amount_sold) AS Total_customer_amount,
 AVG(amount_sold) AS Average_order,
 STDDEV(amount_sold) AS Standard_deviation
 FROM sales
 GROUP BY cust_id
 HAVING SUM(amount_sold) > 94000
 AND AVG(amount_sold) < STDDEV(amount_sold)
 ORDER BY 3;
  
 The original table and data are from Oracle sh.sales table
  
 Oracle comes back for this query with
  
 CUSTOMER_ID  NUMBER_OF_ORDERS  TOTAL_CUSTOMER_AMOUNT  AVERAGE_ORDER  STANDARD_DEVIATION
 -----------  ----------------  ---------------------  -------------  ------------------
        1743               238               94786.13      398.26105           582.26845
         429               231               94819.41     410.473636

Re: Predicate pushdown on HBase snapshots

2015-03-30 Thread Gopal Vijayaraghavan

Looking at the current implementation on trunk, hive's hbase integration
doesn't currently seem to support predicate pushdown for queries over
HBase snapshots. Does this seem like a reasonable feature to add?
It would be nice to have relative feature parity between queries running
over snapshots and queries running over live tables.

Are you suggesting taking advantage of the sorted order to seek to the key
mentioned in a SARG?

That particular method will be limited to simple filters on exactly one
key or perhaps with a few seeks, the more generic IN/BETWEEN SARGs.

But for that case, it will provide a significant boost.

Cheers,
Gopal




Re: Standard deviation (STDDEV) function calculation in Hive

2015-04-01 Thread Gopal Vijayaraghavan
Hi Lefty,

ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_pop", new GenericUDAFStd());
ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java:
system.registerGenericUDAF("stddev_samp", new GenericUDAFStdSample());

Looks like stddev() in hive is stddev_pop(), you can update the UDF docs to
match the FunctionRegistry aliases.
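
A quick sanity check, sketched against Mich's sales table from the quoted
mail below (column names assumed from that thread): if stddev() really
aliases the population form, the first two results should match and the
third should differ.

SELECT cust_id,
       stddev(amount_sold)      AS stddev_default,
       stddev_pop(amount_sold)  AS stddev_population,
       stddev_samp(amount_sold) AS stddev_sample
FROM sales
GROUP BY cust_id;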

Cheers,
Gopal

From:  Lefty Leverenz leftylever...@gmail.com
Reply-To:  user@hive.apache.org user@hive.apache.org
Date:  Tuesday, March 31, 2015 at 11:13 PM
To:  user@hive.apache.org user@hive.apache.org
Subject:  Re: Standard deviation (STDDEV) function calculation in Hive

Mich, the linked documentation is for Impala, not Hive.  (Perhaps Hive is
the same, I don't know.)  But the Hive documentation doesn't explain much:
Built-in Aggregate Functions (UDAF)
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#Languag
eManualUDF-Built-inAggregateFunctions(UDAF) .

-- Lefty

On Mon, Mar 30, 2015 at 5:26 PM, Mich Talebzadeh m...@peridale.co.uk
wrote:
 Hi,
  
 Basically, the standard deviation (STDDEV) is a measure that is used to
 quantify the amount of variation or dispersion of a set of data values. It is
 widely used in trading systems such as FX.
  
 STDDEV in Hive is explained here
 http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/v2-
 0-x/topics/impala_stddev.html  as below and I quote:
  
 The STDDEV_POP() and STDDEV_SAMP() functions compute the population standard
 deviation and sample standard deviation, respectively, of the input values.
 (STDDEV() is an alias for STDDEV_SAMP().) Both functions evaluate all input
 rows matched by the query. The difference is that STDDEV_SAMP() is scaled by
 1/(N-1) while STDDEV_POP() is scaled by 1/N.
  
 Fair enough, to me the common STDDEV is what Hive calls STDDEV_SAMP and
 according to the above it ought to be STDDEV_SAMP.
  
 However, when I work these out (and also use a straightforward calculation
 myself), it turns out that the alias seems to be to STDDEV_POP as opposed to
 STDDEV_SAMP!
  
 The following calculation shows this
  
 SELECT
   rs.Customer_ID
 , rs.Total_customer_amount
, rs.stddev
 , rs.sdddev_samp
 , rs.mystddev
 FROM
 (
 SELECT cust_id AS Customer_ID,
 COUNT(amount_sold) AS Number_of_orders,
 SUM(amount_sold) AS Total_customer_amount,
 AVG(amount_sold) AS Average_order,
 stddev(amount_sold) AS stddev,
 stddev_samp(amount_sold) AS sdddev_samp,
 CASE
 WHEN  COUNT (amount_sold) = 1
THEN  0
  ELSE  
 SQRT((SUM(POWER(amount_sold,2))-(COUNT(1)*POWER(AVG(amount_sold),2)))/(COUNT(1
 )-1))
  END AS mystddev
 FROM sales
 GROUP BY cust_id
 HAVING SUM(amount_sold) > 94000
 AND AVG(amount_sold) < stddev(amount_sold)
 ) rs
 ORDER BY
   rs.Total_customer_amount DESC
 ;
 +-----------------+---------------------------+--------------------+--------------------+--------------------+
 | rs.customer_id  | rs.total_customer_amount  | rs.stddev          | rs.sdddev_samp     | rs.mystddev        |
 +-----------------+---------------------------+--------------------+--------------------+--------------------+
 | 11407.0         | 103412.655                | 622.221465710723   | 623.4797510518939  | 623.4797510518938  |
 | 10747.0         | 99578.087                 | 600.7615005975689  | 601.9383117167412  | 601.9383117167412  |
 +-----------------+---------------------------+--------------------+--------------------+--------------------+
  
 OK so looking above, we notice that rs.sdddev_samp and rs.mystddev are
 practically identical, whereas what is referred to as rs.stddev in Hive is not
 the one used in industry?
  
 To show I ran the same in Oracle and the below is the result.
  
 Connected to:
 Oracle Database 11g Enterprise Edition Release 11.2.0.1.0 - 64bit Production
 With the Partitioning, OLAP, Data Mining and Real Application Testing options
  
 CUSTOMER_ID  TOTAL_CUSTOMER_AMOUNT             STDDEV           MYSTDDEV
 -----------  ---------------------  -----------------  -----------------
       11407              103412.66  623.4797510518940  623.4797510518940
       10747               99578.09  601.9383117167410  601.9383117167410
  
 So sounds like for one reason or other what is called STDDEV in Hive and
 aliased to STDDEV_SAMP is incorrect?
  
 Thanks,
  
 Mich Talebzadeh
  
 http://talebzadehmich.wordpress.com
  
 Publications due shortly:
 Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and
 Coherence Cache
  
 NOTE: The information in this email is proprietary and confidential. This
 message is for the designated recipient only, if you are not the intended
 recipient, you should destroy it immediately. Any information in this message
 shall not be understood as given or endorsed by Peridale Ltd, its 

Re: CombineHiveInputFormat does not call getSplits on custom InputFormat

2015-02-25 Thread Gopal Vijayaraghavan
Hi,

There's a special interface in hive-1.0, which gives more information to
the input format.

https://hive.apache.org/javadocs/r1.0.0/api/ql/org/apache/hadoop/hive/ql/io
/CombineHiveInputFormat.AvoidSplitCombination.html


But entirely skipping combination results in so many performance problems
that in Tez we are forced to abandon this approach and have Tez generate
grouped-splits on the application master (which basically calls
InputFormat::getSplits(), then groups them to get locality-aware splits).

This is differentiated by hive.tez.input.format instead of just via
hive.input.format.
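
For reference, a sketch of the two knobs in question (the values shown are
the stock defaults, listed only for illustration):

-- MR path: controls whether CombineHiveInputFormat wraps your InputFormat
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Tez path: split grouping happens in the AM, so plain HiveInputFormat is used
set hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;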

Cheers,
Gopal

On 2/19/15, 10:09 AM, Luke Lovett luke.lov...@10gen.com wrote:

I'm working on defining a custom InputFormat and OutputFormat for use
with Hive. I'd like tables using these IF/OF to be native tables, so
that I can LOAD DATA and INSERT INTO them. However, I'm finding that
with the default CombineHiveInputFormat, the getSplits method of my
InputFormat is not being called. If I set
hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;, then
getSplits is called.

What I want to know is:
- Is this difference in behavior between CombineHiveInputFormat and
HiveInputFormat intentional?
- Is there any way of forcing CombineHiveInputFormat to call getSplits
on my own InputFormat? I was reading through the code for
CombineHiveInputFormat, and it looks like it might only call my own
InputFormat's getSplits method if the table is non-native. I'm not sure
if I'm interpreting this correctly.
- Is it better to set hive.input.format to work around this, or to
create a StorageHandler and make non-native tables?

Thanks for any advice.




Re: Select distinct on partitioned column requires reading all the files?

2015-02-23 Thread Gopal Vijayaraghavan
Hi,

Are you sure you have

hive.optimize.metadataonly=true ?

I'm not saying it will complete instantaneously (possibly even be very slow,
due to the lack of a temp-table optimization of that), but it won't read any
part of the actual table.
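
A minimal sketch of the check, using the table and column names from the
question below:

set hive.optimize.metadataonly=true;
-- should be answered from the metastore partition list, not the table's files
select count(distinct partitioned_column_name) from my_partitioned_table;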

Cheers,
Gopal

From:  Stephen Boesch java...@gmail.com
Reply-To:  user@hive.apache.org user@hive.apache.org
Date:  Monday, February 23, 2015 at 10:26 PM
To:  user@hive.apache.org user@hive.apache.org
Subject:  Select distinct on partitioned column requires reading all the
files?


When querying a hive table according to a partitioning column, it would be
logical that a simple
select count(distinct partitioned_column_name) from my_partitioned_table
would complete almost instantaneously.
But we are seeing that both hive and impala are unable to execute this query
properly: they just read the entire table!
What do we need to do to ensure the above command executes rapidly?




Re: How efficient is memory allocation in tez.

2015-04-10 Thread Gopal Vijayaraghavan
 wasn't being limited by memory but I tried to get the memory usage of
each tez task down so it could spawn more tasks (but it didn't). Giving
tez more or less memory didn't really improve the performance.
 How would one go about finding out the limiting factor on the performance
of a job? Would job counters be the only way in tez?

Hive uses a lot of in-memory processing, particularly for broadcast JOINs
and group-by aggregations.

Cutting down memory is likely to have a tipping point where smaller JVMs
are much slower, due to having to shuffle more data. I usually use 4Gb
containers for everything.

YARN allocates memory according to the scheduler min-allocation, so
lowering tez container size might have wastage if it is not a multiple of
the min-alloc.

A great resource for monitoring this closely is the YARN JMX output - on
YARN NodeManager, there's http://workernode0:8042/jmx, which gives out
max/min/std-dev of memory usage of each container.

In hive-1.0 + tez, it works out how much memory was actually returned by
YARN and tries to resize Xmx to 80% of that. You can ask for a 513Mb
container and if you get a 1Gb container, the Xmx will be set to 800Mb
instead of assuming 400Mb based on the resource request.
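
As an illustration of that sizing rule (the numbers are made up, not a
recommendation):

-- ask for a multiple of yarn.scheduler.minimum-allocation-mb so nothing is wasted
set hive.tez.container.size=4096;
-- roughly 80% of the container, matching what hive-1.0 + tez would pick itself
set hive.tez.java.opts=-Xmx3276m;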

We're slowly building a generic Counter analysis kit (TEZ-2076) for Tez,
which tracks things slightly closer than just visually inspecting counters.

That I want to use to detect skewed keys, because we track keys/records
per-edge - so I get to find out if a reducer is slow because it had ~10x
values as the others, but ~1x keys.

Until then, the primary tool for Hive would be to look at the Swimlanes in
Tez.

https://github.com/apache/tez/tree/master/tez-tools/swimlanes


This generates something which looks like this

http://people.apache.org/~gopalv/query27.svg

It doesn't work with failed queries, but for successful queries this is
extremely useful as each box is a clickable link to that particular task's
log.

Also I debug Tez locality delay allocations, speedups during reuse etc
with that.

Cheers,
Gopal 









Re: merge small orc files

2015-04-20 Thread Gopal Vijayaraghavan
Hi,

How to set the configuration hive-site.xml to automatically merge small
orc file (output from mapreduce job) in hive 0.14 ?

Hive cannot add work-stages to a map-reduce job.

Hive follows hive.merge.mapfiles=true when Hive generates a plan, by adding
more work to the plan as a conditional task.

-rwxr-xr-x   1 root hdfs  29072 2015-04-20 15:23
/apps/hive/warehouse/coordinate/zone=2/part-r-0

This looks like it was written by an MRv2 Reducer and not by the Hive
FileSinkOperator, and handled by the MR OutputCommitter instead of the Hive
MoveTask.

But 0.14 has an option which helps - "hive.merge.orcfile.stripe.level". If
that is true (like your setting), then do

"alter table <table> concatenate"

which effectively concatenates ORC blocks (without decompressing them),
while maintaining metadata linkage of start/end offsets in the footer.
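
A minimal sketch of that path, assuming the partitioned layout shown above
(the table/partition names are illustrative):

set hive.merge.orcfile.stripe.level=true;
-- concatenates the small ORC files of one partition without decompressing them
alter table coordinate partition (zone=2) concatenate;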

Cheers,
Gopal




Re: Parsing and moving data to ORC from HDFS

2015-04-22 Thread Gopal Vijayaraghavan

 I have about 100 TB of data, approximately 180 billion events, in my
HDFS cluster. It is my raw data stored as GZIP files. At the time of
setup this was due to saving the data until we figured out what to do
with it.

 After attending @t3rmin4t0r's ORC 2015 session @hadoopsummit in Brussels
last week I was amazed by the results presented.

I run at the very cutting edge of the builds all the time :)

The bloom filters are there in hive-1.2.0 which is currently sitting in
svn/git today.


 I have decided I will move my raw-data into HIVE using ORC and zlib. How
would you guys recommend I would do that?

The best mechanism is always to write it via a Hive SQL ETL query.

The real question is how the events are exactly organized. Is it a flat
structure with something like a single line of JSON for each data item?

That is much easier to process than other data formats - the gzipped
data can be natively read by Hive without any trouble.

The Hive-JSON-Serde is very useful for that, because it allows you to read
random data out of the system - each "view" would be an external table
enforcing a schema onto a fixed data set (including maps/arrays).

You would create maybe 3-4 of these schema-on-read tables, then insert
into your ORC structures from those tables.

If you had binary data, then it would be much easier to write a converter
to JSON and then follow the same process instead of attempting a direct
ORC writer, if you want more than one view out of the same table using
external tables.
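
A minimal sketch of that schema-on-read-to-ORC flow (table names, columns
and paths are hypothetical; get_json_object is used here in place of the
Hive-JSON-Serde just for brevity):

-- external "view" over the raw gzipped text, one JSON document per line
CREATE EXTERNAL TABLE raw_events (json STRING)
STORED AS TEXTFILE
LOCATION '/data/raw/events';

-- target ORC table, zlib-compressed
CREATE TABLE events_orc (ts STRING, user_id STRING, action STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');

-- the ETL query: parse on read, write ORC
INSERT INTO TABLE events_orc
SELECT get_json_object(json, '$.ts'),
       get_json_object(json, '$.user_id'),
       get_json_object(json, '$.action')
FROM raw_events;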

 2) write a storm-topology to read the parsed_topic and stream them to
Hive/ORC.

You need to effectively do that to keep a live system running.

We've had some hiccups with the ORC feeder bolt earlier with the 2s ETL
speeds (see
https://github.com/apache/storm/tree/master/external/storm-hive).

That needs some metastore tweaking to work perfectly (tables to be marked
transactional etc), but nothing beyond config params.

 3) use spark instead of map-reduce. Only, I dont see any benefits in
doing so with this scenario.

The ORC writers in Spark (even if you merge the PR SPARK-2883) are really
slow because they are built against hive-13.x (which was my "before"
comparison in all my slides).

I really wish they'd merge those changes into a release, so that I could
make ORC+Spark fast.


Cheers,
Gopal 




Re: MapredContext not available when tez enabled

2015-04-21 Thread Gopal Vijayaraghavan

 
 A bit of digging shows that the GenericUDF#configure method was not called. So
in this case, is it possible to get counters through other means, or do we
have to implement the Counter concept ourselves?

You should be getting a TezContext object there (which inherits from
MapRedContext).

And the method should get called depending on a needConfigure() check - if
it is not getting called, that is very strange.

Cheers,
Gopal




Re: Parsing and moving data to ORC from HDFS

2015-04-22 Thread Gopal Vijayaraghavan

 In production we run HDP 2.2.4. Any thought when crazy stuff like bloom
filters might move to GA?

I'd say that it will be in the next release, considering it is already
checked into hive-trunk.

Bloom filters aren't too crazy today. They are written within the ORC file
right next to the row-index data, so there are no staleness issues with
this today, and after that they're fairly well-understood structures.

I'm working through "bad use" safety scenarios like someone searching for
"11" (as a string) in a data-set which contains doubles.

Hive FilterOperator casts this dynamically, but the ORC PPD has to do
those type promotions exactly as Hive would do in FilterOperator throughout
the bloom filter checks.

Calling something production-ready needs that sort of work, rather than
the feature's happy path of best performance.
 

 The data is single-line text events. Nothing fancy, no multiline or any
binary. Each event is 200 - 800 bytes long.
 The format of these events are in 5 types (from which application
produce them) and none are JSON. I wrote a small lib with 5 Java classes
 which interface parse(String raw) and return a JSONObject - utilized in
my Storm bolts.

You could define that as a regular 1 column TEXTFILE and use a non-present
character as a delimiter (like ^A), which means you should be able to do
something like

select x.a, x.b, x.c from (select parse_my_format(line) as x from
raw_text_table) t;

a UDF is massively easier to write than a SerDe.
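
A sketch of that one-column setup (the location is hypothetical;
raw_text_table and parse_my_format are the placeholder names used above):

-- one-column text table; ^A (\001) is a delimiter that never appears in the data
CREATE EXTERNAL TABLE raw_text_table (line STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE
LOCATION '/data/raw/text';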

I effectively do something similar with get_json_object() to extract 1
column out (FWIW, Tez SimpleHistoryLogging writes out a Hive table).
 

 So I need to write my own format reader, a custom SerDe - specifically
the Deserializer part? Then 5 schema-on-read external tables using my
custom SerDe.
...
 That doesn't sound too bad! I expect bugs :)

Well, the UDF returning a Struct is an alternative to writing a SerDe.

 This all is just to catch up and clean our historical, garbage bin of
data which piled up while we got Kafka - Storm - Elasticsearch running :-)

One problem at a time, I guess.

If any of this needs help, that's the sort of thing this list exists for.

Cheers,
Gopal








Re: merge small orc files

2015-04-21 Thread Gopal Vijayaraghavan


alter table table concatenate does not work? I have a dynamic
partitioned table (stored as orc). I tried to alter concatenate, but it
did not work. See my test result.

ORC fast concatenate does work on partitioned tables, but it doesn't work
on bucketed tables.

Bucketed tables cannot merge files, since the file count is capped by the
numBuckets parameter.

hive> dfs -ls
${hiveconf:hive.metastore.warehouse.dir}/orc_merge5a/st=0.8/;
Found 2 items
-rw-r--r--   3 patcharee hdfs534 2015-04-21 12:33
/apps/hive/warehouse/orc_merge5a/st=0.8/00_0
-rw-r--r--   3 patcharee hdfs533 2015-04-21 12:33
/apps/hive/warehouse/orc_merge5a/st=0.8/01_0

Is this a bucketed table?

When you look at it from the point of view of split generation and cluster
parallelism, bucketing is an anti-pattern, since in most query schemas it
significantly slows down the slowest task.

Making the fastest task faster isn't often worth it, if the overall query
time goes up.

Also if you want to, you can send me the yarn logs -applicationId app-id
and the desc formatted of the table, which will help me understand what's
happening better.

Cheers,
Gopal




Re: hive sql on tez run forever

2015-05-11 Thread Gopal Vijayaraghavan
Hi,

You’re correct - that is not a valid rewrite.

Both tables have to be shuffled across due to the OR clause with no
reductions.

Cheers,
Gopal

On 5/11/15, 10:43 AM, Eugene Koifman ekoif...@hortonworks.com wrote:

This isn’t a valid rewrite.
if a(x,y) has 1 row (1,2) and b(x,z) has 1 row (1,1) then the 1st query
will produce 1 row
but the 2nd query with subselects will not.

On 5/11/15, 10:13 AM, Gopal Vijayaraghavan gop...@apache.org wrote:

Hi,

 I change the sql where condition to (where t.update_time =
'2015-05-04'), the sql can return a result after a while. Because
t.update_time = '2015-05-04' can filter many rows at table scan. But why,
when I change the where condition to
 (where t.update_time = '2015-05-04' or length(t8.end_user_id) > 0), does the
sql run forever, as follows:


The OR clause is probably causing the problems.

We're probably not pushing the OR clauses down to the original table
scans.

This is most likely a hive PPD miss where you do something like

select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);

where it doesn't get planned as

select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.*
from b where b.z = 1) b1 where a1.x = b1.x;

instead gets planned as a full-scan JOIN, then a filter.

Can you spend some time and try to rewrite your case down to something
like the above queries?

If that works, then file a JIRA.

Cheers,
Gopal







Re: hive sql on tez run forever

2015-05-11 Thread Gopal Vijayaraghavan
Hi,

 I change the sql where condition to (where t.update_time =
'2015-05-04'), the sql can return a result after a while. Because
t.update_time = '2015-05-04' can filter many rows at table scan. But why,
when I change the where condition to
 (where t.update_time = '2015-05-04' or length(t8.end_user_id) > 0), does the
sql run forever, as follows:


The OR clause is probably causing the problems.

We're probably not pushing the OR clauses down to the original table
scans.

This is most likely a hive PPD miss where you do something like

select a.*,b.* from a,b where a.x = b.x and (a.y = 1 or b.z = 1);

where it doesn't get planned as

select a1.*, b1.* from (select a.* from a where a.y=1) a1, (select b.*
from b where b.z = 1) b1 where a1.x = b1.x;

instead gets planned as a full-scan JOIN, then a filter.

Can you spend some time and try to rewrite your case down to something
like the above queries?

If that works, then file a JIRA.

Cheers,
Gopal




Re: Analyze table compute statistics on wide table taking too long

2015-04-08 Thread Gopal Vijayaraghavan

 I'm happy to look into improving the Regex serde performance, any tips
on where I should start looking?.

There are three things off the top of my head.

First up, the matcher needs to be reused within a single scan. You can
also check the groupCount exactly once for a given pattern.

matcher.reset() offers performance benefits in the inner loop.

Second, Text does not implement CharSequence, which would be ideal to run
regex (zero-copy) over ASCII text (tblproperties, I guess).

Converting byte sequence to unicode points is mostly wasted CPU, I would
guess - Text::toString() is actually expensive.

This is not something I'm entirely certain of, since java Regex might have
fast-paths for String classes - to be experimented with before fixing it.

A ByteCharSequence could technically be implemented for utf-8 as well
(using ByteBuffer::getChar() instead) - but a really fast path for 7 bit
ASCII is mostly where RegexSerde needs help.

Finally, column projection and SerDe StatsProvidingRecordReader.

There is no reason to deserialize all columns that show up in the original
DDL - compute stats only cares about row-count, but which is effectively
skipping ALL of what a RegexSerde does.

You can find out which columns are being read and only extract those
groups.

That is a combination of ColumnProjectionUtils.isReadAllColumns(conf) and
ColumnProjectionUtils.getReadColumnIDs() from the operator conf.


And in case no columns are being read (like in ANALYZE or count(1)), skip
the regex checker entirely, generating merely how many Text instances were
encountered in total.

Does all of that make sense?

I haven't seen too much use of the RegexSerde btw, which is why these were
generally left on the backburner (the perf problem is limited to a single
"create table" off it into ORC and use the vectorized filters for
performance).

Cheers,
Gopal




Re: same query works with TEXTFILE and fails with ORC

2015-04-13 Thread Gopal Vijayaraghavan
 I'm getting an error in Hive when executing a query on a table in ORC
format.

This is not an ORC bug, this looks like a vectorization issue.

Can you try comparing both query plans ("explain <query>") for the
Execution mode: vectorized markers?

TextFile queries are not vectorized today, since you cannot find if any
column is marked as isRepeating=true in a row-major format.

 SELECT CONCAT(TO_DATE(datetime), '-'),   SUM(gpa)  FROM students_orc
GROUP BY CONCAT(TO_DATE(datetime), '-');

...
 Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unsuported
vector output type: StringGroup
at 
org.apache.hadoop.hive.ql.exec.vector.VectorColumnSetInfo.addKey(VectorCol
umnSetInfo.java:139)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorHashKeyWrapperBatch.compileKey
WrapperBatch(VectorHashKeyWrapperBatch.java:521)
at 
org.apache.hadoop.hive.ql.exec.vector.VectorGroupByOperator.initializeOp(V
ectorGroupByOperator.java:786)

The correct fix would be to handle this query pattern for vectorization
(or automatically disable vectorization, like it has to do for Unions).

Can you log a bug on Apache JIRA against the correct version of hive which
threw this error up?

Cheers,
Gopal



Re: External Table with unclosed orc files.

2015-04-14 Thread Gopal Vijayaraghavan

 What will Hive do if querying an external table containing orc files
that are still being written to?

Doing that directly won't work at all, because ORC files are only readable
after the Footer is written out, which won't be the case for any open files.

 I won't be able to test these scenarios till tomorrow and would like to
have some idea of what to expect this afternoon.

If I remember correctly, your previous question was about writing ORC from
Storm.

If you're on a recent version of Storm, I'd advise you to look at
storm-hive/

https://github.com/apache/storm/tree/master/external/storm-hive


Or alternatively, there's a "hortonworks trucking demo" which does a
partition insert instead.

Cheers,
Gopal




Re: External Table with unclosed orc files.

2015-04-14 Thread Gopal Vijayaraghavan

0.14. Acid tables have been a real pain for us. We don't believe they are
production ready. At least in our use cases, Tez crashes for assorted
reasons or only assigns 1 mapper to the partition. Having delta files and
no base files borks mapper assignments.

Some of the chicken-egg problems for those were solved recently in
HIVE-10114.

Then TEZ-1993 is coming out in the next version of Tez, into which we're
plugging in HIVE-7428 (no fix yet).

Currently delta-only splits have 0 bytes as the "file size", so they get grouped
together to make a 16Mb chunk (rather, a huge single 0-sized split).

Those patches are the effect of me shaving the yak from the "1 mapper"
issue.

After which the writer has to follow up on HIVE-9933 to get the locality
of files fixed.

name are left scattered about, borking queries. Latency is higher with
streaming than writing to an orc file in hdfs, forcing obscene quantities
of buckets and orc files smaller than any reasonable orc stripe / hdfs
block size. The compactor hangs seemingly at random for no reason we've
been able to discern.

I haven't seen these issues yet, but I am not dealing with a large volume
insert rate, so haven't produced latency issues there.

Since I work on Hive performance and I haven't seen too many bugs filed,
I haven't paid attention to the performance of ACID.

Please file bugs when you find them, so that it appears on the radar for
folks like me.

I'm poking about because I want a live stream into LLAP to work seamlessly
and return sub-second query results when queried (pre-cache/stage and merge
etc).

An orc file without a footer is junk data (or, at least, the last stripe
is junk data). I suppose my question should have been 'what will the hive
query do when it encounters this? Skip the stripe / file? Error out the
query? Something else?'

It should throw an exception, because that's a corrupt ORC file.

The trucking demo uses Storm without ACID - this is likely to get better
once we use Apache Falcon to move the data around.

Cheers,
Gopal




Re: How efficient is memory allocation in tez.

2015-04-06 Thread Gopal Vijayaraghavan

 I have a map join in which the smaller tables together are 200 MB and
trying to  have one block of main table be processed by one tez task.
...
 What am I missing and is this even the right way of approaching the
problem ?

You need to be more specific about the Hive version. Hive-13 needs ~6x the
amount of map-join memory for Tez compared to Hive-14.

Hive-1.0 branch is a bit better at estimating map-join sizes as well,
since it counts the memory overheads of JavaDataModel.

Hive-1.1 got a little worse, which will get fixed when we get to hive-1.2.

But for the 1.x line, the approx size of data that fits within a map-join
is (container Xmx - io.sort.mb)/3.

This plays into the NewRatio settings in JDK7 as well, make sure you have
set the new ratio to only 1/8th the memory instead of using 1/3rd default
(which means 30% of your memory cannot be used by the sort buffer or the
map-join since they are tenured data).

Also running "ANALYZE TABLE tbl compute statistics;" on the small tables
will fill in the uncompressed size fields so that we don't estimate
map-joins based on zlib sizes (which coincidentally is ~3x off).
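
Putting the above together as a sketch (the table name and values are
illustrative, not from the original mail):

-- container sized so (Xmx - io.sort.mb)/3 comfortably covers the ~200MB small tables
set hive.tez.container.size=4096;
-- fill in uncompressed-size statistics so the map-join is not estimated from zlib sizes
ANALYZE TABLE small_dim_table COMPUTE STATISTICS;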

And if you still keep getting heap errors, I can take a look at it if you
have a .hprof.bz2 file to share, and fix any corner cases we might've missed.

Cheers,
Gopal
PS: The current trunk implements a Grace HashJoin which is another
approach to the memory limit problem - a more traditional solution than
fixing mem sizes.




Re: Writing ORC Files

2015-04-07 Thread Gopal Vijayaraghavan

 addRow() is called in execute(). Does something look wrong with the call?
…

There is no need for an explicit flush, but addRow() is rather far below
the layers of useful abstraction (and barely any seatbelts).

Can you try logging the category/type fields of your inspector and make
sure it is returning a Struct object inspector?

HCat has a neater writer abstraction layer -
https://github.com/apache/hive/blob/trunk/hcatalog/core/src/test/java/org/a
pache/hive/hcatalog/data/TestReaderWriter.java#L141


That said, I'm working on ORC WriterImpl today, so if this is simple
enough to run/repro on a single VM, you can send me instructions about how
to run this example and I can debug into it.

Cheers,
Gopal














Re: Analyze table compute statistics on wide table taking too long

2015-04-07 Thread Gopal Vijayaraghavan

 The table also has a large Regex serde.

There are no stats fast paths for Regex SerDe.

The statistics computation is lifting each row into memory, parsing it and
throwing it away.

Most of your time would be spent in GC (check the GC time millis), due to
the huge expense of the Regex Serde.

For a direct comparison you could compute stats while turning it into
another format

set hive.stats.autogather=true;
create table tmp1 stored as orc as select * from oldtable;

Due to the nature of the columnar SerDes, that ETL would happen in
parallel to the compute stats off the same stream (i.e autogather).

That said, I have noticed performance issues with the RegexSerde, but
haven't bothered to fix it yet - maybe you'd want to take a shot at fixing
it?


Cheers,
Gopal




Re: Dataset for hive

2015-04-02 Thread Gopal Vijayaraghavan


 https://github.com/hortonworks/hive-testbench

 The official procedure to generate and upload the data has never worked
for me (and it looks like it's not a supported software), so it could be
a bit tricky to do it manually and on a single host.

I wrote the MapReduce jobs for that (tpcds-gen/tpch-gen) after waiting a
whole weekend for 1Tb of data to be generated on a single machine.

If you or anyone else has issues with it, I can take a look at it.

Cheers,
Gopal




Re: Order of Partition column and Non Partition column in the WHERE clause

2015-05-19 Thread Gopal Vijayaraghavan
 Would the order of partition column in the where clause matter for
performance?

No, unless you have more complex predicates than an AND.

There's one recent regression though -
https://issues.apache.org/jira/browse/HIVE-10122

Which release are you on?

 Also, how can I make sure that "partition pruning" is working as
intended when checking the execution plan?
 

explain extended <query>

shows all the partitions being read via the Path -> Partition section.
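
A sketch of that check (the table and predicate are hypothetical):

EXPLAIN EXTENDED
SELECT count(*)
FROM sales_part
WHERE ds = '2015-05-19';   -- pruned partitions show up under Path -> Partition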

Cheers,
Gopal




Re: Top N query

2015-06-04 Thread Gopal Vijayaraghavan

 I don't understand your solution. Could you please provide the exact
query?

...
 GROUP BY recommend_ratings2.movieid  ORDER BY Measure_0 DESC limit 2 ) T_0
ON ( recommend_ratings2.movieid = T_0.movieid_0)


Follow the docs on the mysql impl of the space-ship operator -
https://dev.mysql.com/doc/refman/5.0/en/comparison-operators.html#operator
_equal-to

The hive impl follows the same rules which is (a.x = b.x) OR (a.x is NULL
and b.x is NULL).

AFAIK, most of the complex optimizations are turned off in the presence of
null-safe operators.

Cheers,
Gopal









Re: When using ORC, how to confirm Stipes/row-groups are being skipped ?

2015-06-04 Thread Gopal Vijayaraghavan
Hi,

 While using ORC file format, I would like to see in the logs that
stripes and/or row-groups are being skipped based on my where clause.

There's no logging in the inner loop there.

 Is that info even outputted ? If so, what do I need to enable it ?

You can do a query run with the following to see the difference.

hive> set hive.tez.print.exec.summary=true;
hive> set hive.optimize.index.filter=false;
// run query
hive> set hive.optimize.index.filter=true;
// run query

You'll get numbers which will indicate how much row-filtering is
happening, since the input records count for the vertex will track the
actual records read off ORC.

For an example of what that does, see
http://www.slideshare.net/Hadoop_Summit/orc-2015-faster-better-smaller/21

If you have hive-1.2.0 builds, then you can also try setting the
TBLPROPERTIES for orc.bloom.filter.columns to use the new row indexes as
well.

For Strings, that should work much better than the current min-max index.
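
A sketch of wiring that in (table/column names are hypothetical; existing
files only pick up the filters once they are rewritten):

ALTER TABLE my_orc_table
  SET TBLPROPERTIES ('orc.bloom.filter.columns'='customer_id');

-- rewrite the data so the new files carry the bloom filter stream
INSERT OVERWRITE TABLE my_orc_table SELECT * FROM my_orc_table;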

Cheers,
Gopal




Re: Hive on Tez

2015-06-10 Thread Gopal Vijayaraghavan
Hi,

There's no relationship between number of containers and tasks - well, the
number of tasks is the maximum number of containers you can use.

You can run an entire vertex containing many task attempts in one container
if there are no more available - because of container reuse.

The memory/cpu settings are actually set up via a configuration parameter -
hive.tez.container.size.

The Vertex is expanded into multiple tasks - the number of map-tasks is
determined by the split-grouping
(tez.grouping.min-size/tez.grouping.split-waves) and the reducers are
estimated from the ReduceSink statistics (divided by
hive.exec.bytes.per.reducer).

Even the reducer number is not final, since the plan-time value is only the
max value for that - you can schedule 1009 reducers and end up only running
11, with Tez auto-reducer parallelism, which only merges adjacent reducers.

This is split between the Tez SplitGrouper, HiveSplitGenerator and
SetReducerParallelism.
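
For reference, the knobs named above, sketched with illustrative values
(not recommendations; the reducer-size setting is spelled here as
hive.exec.reducers.bytes.per.reducer):

set hive.tez.container.size=4096;            -- memory per container
set tez.grouping.min-size=16777216;          -- lower bound for grouped splits
set tez.grouping.split-waves=1.7;            -- waves of tasks per available capacity
set hive.exec.reducers.bytes.per.reducer=268435456;  -- reducer estimate divisor
set hive.tez.auto.reducer.parallelism=true;  -- lets Tez merge adjacent reducers at runtime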

Cheers,
Gopal

From:  Yunqi Zhang yu...@umich.edu
Reply-To:  user@hive.apache.org user@hive.apache.org
Date:  Tuesday, June 9, 2015 at 5:07 PM
To:  user@hive.apache.org user@hive.apache.org
Subject:  Hive on Tez

Hi guys,
 
I'm playing with the code that integrates Hive on Tez, and have a couple of
questions regarding the resource allocation.
 
To my understanding (correct me if I am wrong), Hive creates a DAG composed
of MapVertex and ReduceVertex, where each Vertex will later be translated to
task running on potentially multiple containers by Tez. I was wondering how
the resource requirement is determined (how many containers are needed for
each Vertex, and what are the requirements for CPU and memory, etc.) in the
current implementation, and where I can find the code corresponding to this.
 
Thank you!


Yunqi




Re: Top N query

2015-06-03 Thread Gopal Vijayaraghavan
Hi,

This particular case can be handled by a special case of the inner
equijoin.

 ( recommend_ratings2.movieid = T_0.movieid_0 or
((recommend_ratings2.movieid IS NULL) AND
 (T_0.movieid_0 IS NULL)))

Can be rewritten as a null-safe equi-join (i.e. where movieid <=>
movieid_0, which is not standard SQL semantics).

That however is a corner-case to work around folks who want OR joins to
just handle NULL == NULL as a match.

Cheers,
Gopal




Re: Read error : Varchar cannot be cast to string

2015-06-18 Thread Gopal Vijayaraghavan
Hi,

I don't think I have the privileges to move an ATLAS JIRA into HIVE.

And the pre-commit tests for HIVE do not run for ATLAS issues.

If you have access to the JIRA admin for that project, please move that
issue over into the right project.

Cheers,
Gopal

On 6/18/15, 4:11 AM, Devansh Srivastava
devansh.srivast...@datametica.com wrote:

Gopal,

I have raised the bug (ATLAS-27).

Thanks,
Devansh


From: Devansh Srivastava devansh.srivast...@datametica.com
Sent: Thursday, June 18, 2015 4:05 PM
To: user@hive.apache.org
Subject: Re: Read error : Varchar cannot be cast to string

Hi Gopal,

Yes, one of the partition column is having VARCHAR as datatype.

My target table has structure like this :--

CREATE EXTERNAL TABLE test_table(
  dob string COMMENT '',
  version_nbr int COMMENT '',
  record_status string COMMENT '',
  creation_timestamp timestamp COMMENT '')
PARTITIONED BY (
  src_sys_cd varchar(10) COMMENT '',batch_id string COMMENT '')
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS ORC
LOCATION
  '/test/test_table';

My source table has structure like below :--

CREATE EXTERNAL TABLE test_staging_table(
  dob string COMMENT '',
  version_nbr int COMMENT '',
  record_status string COMMENT '',
  creation_timestamp timestamp COMMENT ''
  src_sys_cd varchar(10) COMMENT '',
batch_id string COMMENT '')
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
STORED AS ORC
LOCATION
  '/test/test_staging_table';

We were loading data using pig script. Its a direct load, no
transformation needed. But when i was checking test_table's data in hive.
It is giving previously mentioned error.

With Regards
Devansh

From: Gopal Vijayaraghavan go...@hortonworks.com on behalf of Gopal
Vijayaraghavan gop...@apache.org
Sent: Wednesday, June 17, 2015 8:20 PM
To: user@hive.apache.org
Subject: Re: Read error : Varchar cannot be cast to string

Hi,

 Caused by: java.lang.ClassCastException:
org.apache.hadoop.hive.common.type.HiveVarchar cannot be cast to
java.lang.String
at
org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addPartitionColsToBatch(VectorizedRowBatchCtx.java:566)

Is it a partition column the one marked as a varchar?

Can you write a small test-case and post a bug about this?

I can take a look at this, looks like a simple missed call to toString().

Cheers,
Gopal




Re: Hive - Tez error with big join - Container expired.

2015-06-18 Thread Gopal Vijayaraghavan
 I have a pretty big Hive Query. I'm joining over 3 Hive-Tables which
have thousands of lines each. I'm grouping this join by several columns.

Hive-on-Tez shouldn't have any issue even with billions of lines on a JOIN.

 0 failed, info=[Containercontainer_1434357133795_0008_01_39 finished
while trying to launch. Diagnostics: [Container failed. Container expired
since it was unused]], TaskAttempt 1 failed,

Looks like your node manager is actually not spinning up a container that
was allocated (i.e allocation succeeded, but the task spin up failed).

Which YARN scheduler are you running (fair/capacity?) and do you have any
idea what the NodeManager logs say about trying to spin up this
container?

If I'm not wrong, you need to also check if the YARN user has a ulimit set
for the total number of processes on the NM nodes.


 
Cheers,
Gopal




Re: hive -e run tez query error

2015-06-26 Thread Gopal Vijayaraghavan

 perhaps deleteOnExit() is set somewhere

fs.cache.disable settings from hdfs-site.xml are usually to blame for that.

Till hive-1.0, HiveServer2 used to leak filesystem objects, so the cache
was disabled.

 2015-06-25 15:54:33,673 INFO FSNamesystem.audit: allowed=true
ugi=lujian (auth:SIMPLE)ip=/10.17.28.11 http://10.17.28.11/
cmd=delete  
src=/user/lujian/lujian/_tez_session_dir/abb91da9-ac07-4024-a09f-8622ee1ca
edf   dst=nullperm=null


But keeping that client cache disabled when running against trunk
generally kills queries all the time with occasional errors like these.

Cheers,
Gopal




Re: Hive indexing optimization

2015-06-26 Thread Gopal Vijayaraghavan
Hi,

Hive indexes won't really help you speed up that query right now, because
of the plan it generates due to the <= clauses.

 CREATE TABLE ipv4table
 AS
 SELECT logon.IP, ipv4.Country
 FROM
 (SELECT * FROM logontable WHERE isIpv4(IP)) logon
 LEFT OUTER JOIN
 (SELECT StartIp, EndIp, Country FROM ipv4geotable) ipv4 ON
 isIpv4(logon.IP)
 WHERE ipv4.StartIp <= logon.IP AND logon.IP <= ipv4.EndIp;

That's a cross-product join, which can't be distributed at all and will take
forever, even if you use Tez/LLAP.

Range join queries have partial distribution rewrites, but AFAIK none of
them use filter indexes in hive.

But before I suggest a rewrite, can you post the "explain" output for
the above query, so that I can check if this is indeed producing a
cross-product + filter?

Cheers,
Gopal










Re: nested join issue

2015-06-12 Thread Gopal Vijayaraghavan
Hi

 Thanks for investigating..  Trying to locate the patch that fixes this
between 1.1 and 2.0.0-SNAPSHOT. Any leads on what Jira this fix was part
of? Or what part of the code the patch is likely to be on?

git bisect is the only way usually to identify these things.

But before you hunt into the patches I suggest trying combinations of the
constant propagation, null-scan and identity projection remover
optimizations to see if there's a workaround in there.

An explain of the query added to a new JIRA would be good, to continue the
analysis.

Cheers,
Gopal




Re: Read error : Varchar cannot be cast to string

2015-06-17 Thread Gopal Vijayaraghavan
Hi,

 Caused by: java.lang.ClassCastException:
org.apache.hadoop.hive.common.type.HiveVarchar cannot be cast to
java.lang.String
at 
org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatchCtx.addPartitionCo
lsToBatch(VectorizedRowBatchCtx.java:566)

Is it a partition column the one marked as a varchar?

Can you write a small test-case and post a bug about this?

I can take a look at this, looks like a simple missed call to toString().

Cheers,
Gopal




Re: Empty Table in MR with union all (created in Tez)

2015-06-11 Thread Gopal Vijayaraghavan

 
 set hive.execution.engine=mr;
 
 select * from test_table;  --gives empty table
 
 Any ideas on why this must be happening? This issue does not occur when
I don't use a "union all" query.
 

What is the value of the parameter
mapreduce.input.fileinputformat.input.dir.recursive in your
mapred-site.xml?
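
If it is unset, a sketch of the likely workaround (assuming the recursive
flag is indeed the culprit):

set mapreduce.input.fileinputformat.input.dir.recursive=true;
set hive.execution.engine=mr;
select * from test_table;   -- MR now recurses into the union-all subdirectories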

Cheers,
Gopal




Re: nested join issue

2015-06-11 Thread Gopal Vijayaraghavan
Hi,

 I'm running into a peculiar issue with nested joins and outer select. I
see this error on 1.1.0 and 1.2.0 but not 0.13 which seems like a
regression.
...
 create table events (s string, st2 string, n int, timestamp int);


The issue does not seem to be happening in hive-2.0.0-SNAPSHOT, which
means it has already been fixed and possibly can be backported easily to
1.2.1.

Your test-cases threw a parse-exception when run as-is - naming a column
"timestamp" will kill you when you upgrade to the next version.

Cheers,
Gopal




Re: NoSuchMethodError when hive.execution.engine value its tez

2015-06-02 Thread Gopal Vijayaraghavan

 I am using *hive 1.0.0* and *apache tez 0.4.1*. When I configure hive to use
 tez I get an exception.

Use the tez.version that hive-1.0.0 is tested with - 0.5.2
(https://github.com/apache/hive/blob/branch-1.0/pom.xml#L155)

I suspect you'll get several build failures trying to build 1.0.0 against
the tez-0.4.x branch, since the APIs changed significantly during Tez's
promotion to a top-level project.

Cheers,
Gopal





Re: current_date function in hive

2015-06-02 Thread Gopal Vijayaraghavan
Hi,

 You may try to_date(FROM_UNIXTIME(UNIX_TIMESTAMP()))

That would be a very bad idea for query correctness.

The current_date UDF was introduced because of that particular
anti-pattern.

The unix_timestamp() is evaluated when a row is read, so each row gets a
slightly different timestamp when doing ETLs, and I have seen it give
completely incorrect results when used near midnight (i.e. a query kicked off
at 11:59, executed at 00:01 the next day).

Plus map-reduce failure handling mandates (logically) identical outputs on
map task retries or speculation.

I've recently pulled the current_timestamp() impl out into a reusable
UDF so that I can use it in hive-1.0.x installs as well.

https://github.com/t3rmin4t0r/current-timestamp

You can use that UDF JAR with hive-1.0+, the constant folding will replace
the value in-place during compilation.
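
For comparison, a small sketch of the built-in behaviour (assuming a
hive-1.2+ build where current_date/current_timestamp are built in):

-- both are fixed once at the start of the query, so every row, retry and
-- speculative attempt sees the same value
SELECT current_date, current_timestamp;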

Cheers,
Gopal




Re: schedule data ingestion to hive table using ftp

2015-06-30 Thread Gopal Vijayaraghavan
Hi,

 So, I want to schedule data ingestion to hive from ftp. I have to
schedule a job to check for files that are getting generated and when
they get generated, move it to hdfs.

There is no "best" way unfortunately.

The options start with Apache Oozie, the bog standard solution. Then
there's Falcon which uses Oozie to run things inside, but handles it
closer to hive's use-cases.

And there¹s the combination of Azkaban + Gobblin from Linkedin.


For those who prefer Python to Java, there's Luigi from Spotify.

If you're feeling really lazy, you can go through the NFS mount option in
HDFS, so that you can use regular cron to curl/sftp files straight into
that NFSv3 mount.

The last option is totally tied to unix cron, so it is not the best for
terabyte scale, but it's the one that is the easiest to fix when it breaks.

Cheers,
Gopal




Re: Limiting outer join

2015-07-06 Thread Gopal Vijayaraghavan

 In the following query, it is possible to limit the amount of entries
returned by an outer join to a single value? I want to obtain a single
country from ipv4geotable for each entry in logontable.

Yes, the PTF DENSE_RANK()/ROW_NUMBER() basically gives you that - you can
read the first row out of each logon.IP, except there's no way to force
which country wins over the other without an order by country in the
OVER() clause as well.
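
A rough sketch of that (the column names and the range predicate are
assumptions - adjust them to whatever you are joining on today):

select ip, country
from (
  select l.ip, g.country,
         row_number() over (partition by l.ip order by g.country) as rn
  from logontable l, ipv4geotable g
  where l.ip between g.range_start and g.range_end
) ranked
where rn = 1;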

That said, it will only get slower to produce 1 row per group: because of
the distributed nature of the SQL engine, the reduction of data happens
after an ordering shuffle.

You're doing range joins in a SQL engine without theta joins, and MapReduce
has no way to implement those at runtime (Tez has, with EdgeManager
plugins).

The easiest/traditional approach out of doing geo-IP lookups is a compact
UDF model without any joins at all.

There are some old threads discussing this as a built-in & some code
(with potential licensing issues) -
http://markmail.org/message/w54j4upwg2wbh3xg

Cheers,
Gopal




Re: Tez : Anyway to avoid creating subdirectories by Insert with "union all"?

2015-08-19 Thread Gopal Vijayaraghavan
 Is there anyway to avoid creating sub-directories? Or this is by design
and can not be changed?

This is because of the way file-formats generate hadoop file names without
collisions.

For instance, any change to that would break Parquet-MR for Tez. That's
why we generate a compatible, but colliding mapreduce.task.attempt.id
artificially for Tez jobs.

"Map 1" and "Map 2" would both have an attempt 0 of task 1, generating
colliding file names (0001_0).

The easy workaround is a "re-load" of the table.

insert overwrite table h1_passwords_target select * from
h1_passwords_target;


The slightly more complex one is to add a DISTRIBUTE BY & trigger a
reducer after the UNION ALL.
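
Roughly like this (the source tables and the distribute-by column are
stand-ins - any reasonably high-nDV column will do):

insert overwrite table h1_passwords_target
select * from (
  select * from src_a
  union all
  select * from src_b
) u
distribute by some_col;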

Cheers,
Gopal




Re: Question about PredicateTransitivePropagate

2015-08-19 Thread Gopal Vijayaraghavan
select * from t1 join t2 on t1.col = t2.col where t1.col = 1;
 Is rule PredicateTransitivePropagate supposed to propagate predicate
t1.col = 1 to t2 via join condition t1.col = t2.col?
 Assuming so, I found that the predicate t1.col = 1 has not been pushed
down to table scan of t1, thus PredicateTransitivePropagate wouldn't see
the predicate. Then I tried to put PredicateTransitivePropagate
 after PredicatePushDown, I saw  predicate t1.col = 1 was propagated to
t2.

Are you trying a recent build?

I ran the exact same query with tonight's hive-2.0 build, with two
temp-tables and got

create temporary table t1(x int, y int);
create temporary table t2(x int, y int);
explain  select * from t1 left  join t2 on t1.x = t2.x where (t1.x = 1 or
t1.x = 2) ;




Select Operator [SEL_6]
   outputColumnNames:[_col0,_col1,_col2,_col3]
   Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
   Map Join Operator [MAPJOIN_11]
   |  condition map:[{:Left Outer Join0 to 1}]
   |  keys:{Map 2:x (type: int),Map 1:x (type: int)}
   |  outputColumnNames:[_col0,_col1,_col5,_col6]
   |  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
   |-Map 2 [BROADCAST_EDGE]
   |  Reduce Output Operator [RS_3]
   | key expressions:x (type: int)
   | Map-reduce partition columns:x (type: int)
   | sort order:+
   | Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
   | value expressions:y (type: int)
   | Filter Operator [FIL_10]
   |predicate:((x = 1) or (x = 2)) (type: boolean)
   |Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
   |TableScan [TS_1]
   |   alias:t2
   |   Statistics:Num rows: 1 Data size: 0 Basic
stats: PARTIAL Column stats: NONE
   |-Filter Operator [FIL_9]
 predicate:((x = 1) or (x = 2)) (type: boolean)
 Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
 TableScan [TS_0]
alias:t1
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE


Cheers,
Gopal





Re: hive error when trying to write data to s3n

2015-08-04 Thread Gopal Vijayaraghavan

 Moving data to: 
s3n://access_key:secret_key@my_bucket/a/b/2015-07-30/.hive-staging_hiv
e_2015-08-04_18-38-47_649_1476668515119011800-1/-ext-1
 Failed with exception Wrong FS:
s3n://access_key:secret_key@my_bucket/a/b/2015-07-30/.hive-staging_hiv
e_2015-08-04_18-38-47_649_1476668515119011800-1/-ext-10002, expected:
hdfs://s
..
 FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.MoveTask

Looks like the insert overwrite isn't triggering the CopyTask - MoveTask
is the wrong operation for cross-FS operations.

You can try looking at the Explain plan and confirm that.

Cheers,
Gopal




Re: Running hive on tez locally

2015-08-10 Thread Gopal Vijayaraghavan

 How do I point Hive to tez libraries? Is it sufficient to populate
CLASSPATH environment variables with location of tez libraries ?

The easiest option there is to populate the hive-config.sh with these two
parameters

https://github.com/t3rmin4t0r/tez-autobuild/blob/master/Makefile#L138


Once you do that, the bin/hive will always load the Tez jars correctly.

Cheers,
Gopal









Re: Hive on Tez query failed with "wrong key class"

2015-07-27 Thread Gopal Vijayaraghavan



 From the java code which creates the sequence file, it has set the key
class to NullWritable.class:
 job.setOutputKeyClass(org.apache.hadoop.io.NullWritable.class);
...
 I think that caused the mismatch:
 wrong key class: org.apache.hadoop.io.BytesWritable is not class
org.apache.hadoop.io.NullWritable

In all likelihood, the exception you're hitting originates from here

https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-co
mmon/src/main/java/org/apache/hadoop/io/SequenceFile.java#L2328


 Anyone knows why Tez will check the key and value class when doing sort
stuff?

As I said in my earlier mail, if you can check the SequenceFile headers
and they look like my pasted pair, then we know it's the same as the known
issue.

Cheers,
Gopal




Re: limit clause + fetch optimization

2015-07-22 Thread Gopal Vijayaraghavan
Hi,

 For your first offer of testing a patch, unfortunately we tend to run
our production software on customers' Hadoop clusters, so we can't easily
patch their Hive instances.  But I'll still take you up on that if I find
some time to try it.

No worries, I'll get to a patch sooner or later. For now, as a workaround
to your problem, you can set your fetch task conversion threshold to a
high value.

hive.fetch.task.conversion.threshold


That's set to 1Gb by default, which is the input size threshold beyond
which it goes and runs a cluster job.
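
For example (the exact number is up to you - it just has to exceed the
input size of the tables you want fetched directly):

set hive.fetch.task.conversion.threshold=8589934592;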

Cheers,
Gopal








Re: limit clause + fetch optimization

2015-07-21 Thread Gopal Vijayaraghavan

 I've been experimenting with 'select *' and 'select * limit X' in
beeline and watching the hive-server2 log to understand when a M/R job is
triggered and when not.  It seems like whenever I set a limit, the job is
avoided, but with no limit, it is run.

https://issues.apache.org/jira/browse/HIVE-10156


It's sitting on my back-burner (I know the fix, but I'm working on the
LLAP branch).

 hive.limit.optimize.fetch.max

 That defaults to 50,000 and as I understand it, whenever I set limit to
above that number, a job should be triggered.  But I can set limit to
something very high (e.g. 10M) and no job runs.

That config belongs to a different optimization - the global limit case,
which works as follows.

Run the query with a 50k row sample of the input, then if it doesn't
produce enough rows, re-run the query with the full input data-set.

You will notice errors on your JDBC connections with that optimization
turned on (like HIVE-9382) and will get the log line "Retry query with a
different approach..." in the HS2 logs.

So I suggest not turning on the Global Limit optimization, if you're on
JDBC/ODBC.

Cheers,
Gopal
 




Re: limit clause + fetch optimization

2015-07-22 Thread Gopal Vijayaraghavan

 Just want to make sure I understand the behavior once that bug is
fixed...a 'select *' with no limit will run without a M/R job and instead
stream.  Is that correct?

Yes, that's the intended behaviour. I can help you get a fix in, if you
have some time to test out my WIP patches.

 That may incidently solve another bug I'm seeing: when you use JDBC
templates to set the limit (setMaxRows in Spring in my setup), it does
not avoid the M/R job (and no limit clause appears in the hive-server2
log).  Instead, the M/R job gets launched...I'm
 not sure if the jdbc framework subsequently would apply a limit, once
the job finishes.  I haven't spotted this issue in JIRA, I'd be happy to
file it if that's useful to you.

File a JIRA, would be very useful for me.

There's a lot of low-hanging fruit in the JDBC + Prepared Statement
codepath, so going over the issues & filing your findings would help me
pick up and knock them off one by one when I'm back.

Prasanth's github has some automated benchmarking tools for JDBC, which I
use heavily - https://github.com/prasanthj/jmeter-hiveserver2/tree/llap


There are some known issues which have a 2-3x perf degradation for the
simple query patterns you're running, like -
https://issues.apache.org/jira/browse/HIVE-10982

Cheers,
Gopal




Re: Hive on Tez query failed with "wrong key class"

2015-07-21 Thread Gopal Vijayaraghavan

 Query is a simple group-by on top of sequence table.
...
 java.io.IOException: java.io.IOException: wrong key class:
org.apache.hadoop.io.BytesWritable is not class
org.apache.hadoop.io.NullWritable

I have seen this issue when mixing Sequence files written by PIG with
Sequence files written by Hive - primarily because the data ingestion
wasn't done properly via HCatalog writers.

Last report, the first sequence file had as its header

M?.io.LongWritableorg.apache.hadoop.io.BytesWritable)org.apache.hadoop.io.
compress.SnappyCodec??


and the second one had

SEQ!org.apache.hadoop.io.LongWritableorg.apache.hadoop.io.Text)org.apache.h
adoop.io.compress.SnappyCodec?


You can cross-check the exception trace and make sure that the exception
is coming from the RecordReader as the k-v pairs change types between
files.

Primarily this doesn't happen in Hive-mr at the small scale, but it
happens for both MR and Tez.

To hit this via CombineInputFormat, you need a file which has been split
up between machines and two such files to generate a combined split of
mismatched schema.

Tez is more aggressive at splitting, since it relies on the file format
splits, not HDFS locations.

If you confirm that this is indeed the cause of the issue, I might have an
idea how to fix it.

Cheers,
Gopal 




Re: Optimizing UDF

2015-07-14 Thread Gopal Vijayaraghavan

 
 I'm trying to optimize a UDF that runs very slowly on Hive. The UDF
takes in a 5GB table and builds a large data structure out of it to
facilitate lookups. The 5GB input is loaded into the distributed cache
with an 'add file path' command, and the UDF builds
 the data structure a single time per instance (or so it should).

No, this builds it once per map attempt in MRv2, because each JVM is
killed after executing a single map attempt.

In Tez, however, you can build this once per container (usually a ~10x
perf improvement).

This has a fix in Tez, since the UDFs can only load it over the network
once per JVM init and you can hang onto that in the loaded GenericUDF
object (*not* a static, but a private final), which is held in the
TezCache as long as the task keeps running the same vertex.

That will be thrown away whenever the container switches over to running a
reducer, so the cache is transient.

Cheers,
Gopal




Re: Optimizing UDF

2015-07-14 Thread Gopal Vijayaraghavan


 
 I am already using Tez (sorry, forgot to mention this), and my goal is
indeed to build the instance once per container.

Put a log line in your UDF init() and check if it is being called multiple
times per container. If you're loading the data every time, then that
might be something to fix.

The other aspect is that there are GC pauses that can happen due to that,
and other such extraneous reasons for the slow-down.

But first, look at how many times you are loading the distributed cache
data per container.

Cheers,
Gopal




Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Gopal Vijayaraghavan

> so do you think if we want the same result from Hive and Spark or the
>other freamwork, how could we try this one ?

There's a special backwards compat slow codepath that gets triggered if
you do

set mapred.reduce.tasks=199; (or any number)

This will produce the exact same hash-code as the java hashcode for
Strings & Integers.

The bucket-id is determined by

(hashCode & Integer.MAX_VALUE) % numberOfBuckets

but this also triggers a non-stable sort on an entirely empty key, which
will shuffle the data so the output file's order bears no resemblance to
the input file's order.


Even with that setting, the only consistent layout produced by Hive is the
CLUSTER BY, which will sort on the same key used for distribution & uses
the java hashCode if the auto-parallelism is turned off by setting a fixed
reducer count.
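
As a sketch (table/column names are placeholders):

set mapred.reduce.tasks=199;
insert overwrite table output_copy
select * from input_table
cluster by key_col;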

Cheers,
Gopal




Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-27 Thread Gopal Vijayaraghavan

> I want to override partitionByHash function on Flink like the same way
>of DBY on Hive.
> I am working on implementing some benchmark system for these two systems,
>which could be a contribution to Hive as well.

I would be very disappointed if Flink fails to outperform Hive with a
Distribute BY, because the hive version is about 5-10x slower than it can
be with Tez.

Mapreduce forces a full sort of the output data, so the Hive version will
be potentially O(N*LOG(N)) by default while Flink should be able to do
O(N).


Assuming you don't turn on any of the compatibility modes, the hashCode
generated would be a murmur hash after encoding data into a byte[] using
BinarySortableSerDe & the data is then sorted using
key=(murmur_hash(byte[]) % n-reducers).


The reducers then pull the data, merge-sort using the disk which is
entirely wasted CPU.

If you or anyone's interested in fixing this for Tez, I have a JIRA open
to the fix the hash-only shuffle -
https://issues.apache.org/jira/browse/HIVE-11858

Cheers,
Gopal



Re: Using json_tuple for Nested json Arrays

2015-10-27 Thread Gopal Vijayaraghavan
Hi,

> If you have any tutorial for extracting data from complex nested json
>arrays (as the example given in my previous email), please send it.

90% of working with the real world is cleansing bad data. People
under-sell hive's flexibility in situations like this.


This is what I do 

hive> compile `
import org.apache.hadoop.hive.ql.exec.UDF \;
import groovy.json.JsonSlurper \;
import org.apache.hadoop.io.Text \;
public class JsonExtract extends UDF {
  public int evaluate(Text a){
def jsonSlurper = new JsonSlurper() \;
def obj = jsonSlurper.parseText(a.toString())\;
return  obj.val1\;
  }
} ` AS GROOVY NAMED json_extract.groovy;


hive> CREATE TEMPORARY FUNCTION json_extract as 'JsonExtract';


hive> select json_extract('{"val1": 2}') from date_dim limit 1;

select json_extract('{"val1": 2}') from date_dim limit 1
OK
2
Time taken: 0.13 seconds, Fetched: 1 row(s)


Caveats - this generates bytecode at runtime, so keep an eye on the

hive> list jars;

Because there's no real namespacing, naming your classes/functions the
same while developing can drive you crazy (a little).

Cheers,
Gopal










Re: Issue with job serialization formats mangling results

2015-10-23 Thread Gopal Vijayaraghavan


>I've then created ORC and Parquet versions of this same table.  The
>behavior remains... select * works, any filter creates horribly
>mangled results.
>
>To replace- throw this into a file:
>
>{"id":1,"order_id":8,"number":1,"broken":"#\n---\nstuff\nstuff2:
>\"stuff3\"\nstuff4: '730'\nstuff5: []\n","last":null}

You're trying to fix the issue on the wrong side of the problem, I think.

Try with

set 
hive.default.serde=org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
;
set hive.query.result.fileformat=SequenceFile;


Hopefully we'll have a newer & more compact format for results soon.

Cheers,
Gopal




Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Gopal Vijayaraghavan

> When applying [Distribute By] on Hive to the framework, the function
>should be partitionByHash on Flink. This is to spread out all the rows
>distributed by a hash key from Object Class in Java.

Hive does not use the Object hashCode - the identity hashCode is
inconsistent across JVMs, so Object.hashCode() is never used.

ObjectInspectorUtils::hashCode() is the hashcode used by the DBY in hive
(SORT BY uses a Random number generator).

Cheers,
Gopal



Re: export/import in hive failing with nested directory exception!

2015-11-12 Thread Gopal Vijayaraghavan
Hi,

>Thanks Gopal. Indeed table t is defined as ORC and transactional.
>
>Any reason why this should not work for transactional tables?

The committed transactions list is actually missing from the exported
metadata.

So the EXPORT as it exists today is a dirty read snapshot, which is not a
good thing when the data is continuously being streamed in.

I don't think IMPORT likes that (and why should it?).

Try with a fully compacted table.
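
Something like this (the export path is a placeholder), waiting for the
compaction to show up as finished before exporting:

alter table t compact 'major';
show compactions;
export table t to '/tmp/t_export';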

Cheers,
Gopal




Re: export/import in hive failing with nested directory exception!

2015-11-12 Thread Gopal Vijayaraghavan
> I am doing a simple export and import test from one database in Hive to
>another database in the same Hive instance. I thought this would have
>been straight forward.

Not if ACID transactions are involved.

> Copying file: 
>hdfs://rhes564:9000/user/hive/warehouse/asehadoop.db/t/delta_056_0
>56

This should be fine for non transactional tables.

Cheers,
Gopal
 




Re: [VOTE] Hive 2.0 release plan

2015-11-13 Thread Gopal Vijayaraghavan
(+user@)

+1.

Cheers,
Gopal

On 11/13/15, 5:54 PM, "Lefty Leverenz"  wrote:

>The Hive bylaws require this to be submitted on the user@hive mailing list
>(even though users don't get to vote).  See Release Plan in Actions
>.
>
>-- Lefty
...
>> > On Fri, Nov 13, 2015 at 1:38 PM, Sergey Shelukhin <
>> ser...@hortonworks.com>
>> > wrote:
>> >
>> >> Hi.
>> >> With no strong objections on DISCUSS thread, some issues raised and
>> >> addressed, and a reminder from Carl about the bylaws for the release
>> >> process, I propose we release the first version of Hive 2 (2.0), and
>> >> nominate myself as release manager.
>> >> The goal is to have the first release of Hive with aggressive set of
>>new
>> >> features, some of which are ready to use and some are at experimental
>> >> stage and will be developed in future Hive 2 releases, in line with
>>the
>> >> Hive-1-Hive-2 split discussion.
>> >> If the vote passes, the timeline to create a branch should be around
>>the
>> >> end of next week (to minimize merging in the wake of the release),
>>and
>> the
>> >> timeline to release would be around the end of November, depending on
>> the
>> >> issues found during the RC cutting process, as usual.
>> >>
>> >> Please vote:
>> >> +1 proceed with the release plan
>> >> +-0 don't care
>> >> -1 don't proceed with the release plan, for such and such reasons
>> >>
>> >> The vote will run for 3 days.
>> >>
>> >>
>>




Re: Cross join/cartesian product explanation

2015-11-10 Thread Gopal Vijayaraghavan

>I'm having trouble doing a cross join between two tables that are too big
>for a map-side join.

The actual query would help btw. Usually what is planned as a cross-join
can be optimized out into a binning query with a custom UDF.

In particular with 2-D geo queries with binning, which people tend to run
as cross-joins when they port PostGIS queries into Hive (ST_Contains).

>Trying to break down one table into small enough partitions and then
>unioning them together seems to give comparable performance to a cross
>join.
...
>Short of moving to a different execution engine, are there any
>performance improvements that can be made to lessen the pain of a cross
>join? 

No, with MapReduce you're mostly stuck running each part of the union one
after the other.

Since this is a simple fan-out, you can try the simple parallelization

set hive.exec.parallel=true;

Beware, this has known deadlocks as queries get more complex - for which
you need a real Acyclic Graph scheduler engine like Tez.

Drop me an off-list mail if you want to run Tez on recent CDH, EMR or MapR
releases.

>Also, could you please elaborate on your comment "The real trouble is
>that MapReduce cannot re-direct data at all (there's only shuffle
>edges)"? Thanks!

Mapreduce cannot do non pair-wise routing operations, since it violates
the direct assumption of map() partitioning (same applies to Spark's map()
operators).

Tez goes a bit out of the way to separate the control plane from the data
plane, so that you can do non-pairwise operations like auto-reducer
parallelism or splitting up pair-wise operations as m:n matching.

Here's slides from last year describing how that rewiring works.

http://www.slideshare.net/Hadoop_Summit/w-235phall1pandey/13

http://www.slideshare.net/Hadoop_Summit/pig-on-tez-low-latency-etl-with-big
-data/19


Cheers,
Gopal



Re: Cross join/cartesian product explanation

2015-11-06 Thread Gopal Vijayaraghavan

> Over the last few weeks I've been trying to use cross joins/cartesian
>products and was wondering why, exactly, this all gets sent to one
>reducer. All I've heard or read is that Hive can't/doesn't parallelize
>the job.

The hashcode of the shuffle key is 0, since you need to process every row
against every key - there's no possibility of dividing up the work.

Tez will actually have a cross-product edge (TEZ-2104), which is a
distributed cross-product proposal but wasn't picked up in the last Google
Summer of Code.

The real trouble is that MapReduce cannot re-direct data at all (there's
only shuffle edges).

> Does anyone have a workaround?

I use a staged partitioned table as a workaround for this, hashed on a
high nDV key - the goal of the Tez edge is to shuffle the data similarly
at runtime.

For instance, this python script makes a query with a 19x improvement in
distribution for a cross-product which generates 50+Gb of data from a
~10Mb input.

https://gist.github.com/t3rmin4t0r/cfb5bb4f7094d595c1e8


It is possible for Hive-Tez to actually generate UNION VertexGroups, but
it's much more efficient to do this as a edge with a custom EdgeManager,
since that opens up potentially implementing ThetaJoins in hive using that.

Cheers,
Gopal




Re: Merging small files

2015-10-16 Thread Gopal Vijayaraghavan

> Is there a more efficient way to have Hive merge small files on the
>files without running with two passes?

Not entirely an efficient way, but adding a shuffle stage usually works
much better as it gives you the ability to layout the files for better
vectorization.

Like for TPC-H, doing ETL with

create table lineitem as select * from lineitem sort by l_shipdate,
l_suppkey;

will produce fewer files (exactly as many as your reducer #) & compresses
harder due to the natural order of transactions (saves ~20Gb or so at 1000
scale).

Caveat: that is not more efficient in MRv2, only in Tez/Spark which can
run MRR pipelines as-is.

Cheers,
Gopal




Re: Strict mode and joins

2015-10-20 Thread Gopal Vijayaraghavan

> However reality is I just want it called 1 time, which is during planning,
>and if I flag it as deterministic this is exactly what happens so you can
>do this:

There was a new constant added in hive-1.2.0 named CURRENT_TIMESTAMP which
works the way most people want unix_timestamp() to work.

> AND article_meta.publish_timestamp > ((unix_timestamp() * 1000) - (1000
>* 60 * 60 * 24 * 2))

If if helps, I have a backport as a UDF on my github.

https://github.com/t3rmin4t0r/current-timestamp/blob/master/src/main/java/o
rg/notmysock/hive/udf/CurrentTimeStampUDF.java#L35


with that UDF, I expect it to constant fold

(to_unix_timestamp(current_timestamp()) * 1000)

into a single bigint.

Cheers,
Gopal








Re: Container is running beyond physical memory limits

2015-10-13 Thread Gopal Vijayaraghavan


> is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB
>physical memory used; 6.6 GB of 8 GB virtual memory used. Killing
>container.

You need to change the yarn.nodemanager.vmem-check-enabled=false on
*every* machine on your cluster & restart all NodeManagers.

The VMEM check made a lot of sense in the 32 bit days when the CPU forced
a maximum of 4Gb of VMEM per process (even with PAE).

Similarly it was a way to punish processes which swap out to disk, since
the pmem only tracks the actual RSS.

In the large RAM 64bit world, vmem is not a significant issue yet - I
think the addressing limit is 128 TB per process.

> <property>
>   <name>mapreduce.reduce.memory.mb</name>
>   <value>4096</value>
> </property>
...
> <property>
>   <name>mapreduce.reduce.java.opts</name>
>   <value>-Xmx6144m</value>
> </property>
 

That's the next failure point. 4Gb container with 6Gb limits. To produce
an immediate failure when checking configs, add

-XX:+AlwaysPreTouch -XX:+UseNUMA

to the java.opts.

Cheers,
Gopal
 




Re: Two Tables Join (One Big table and other 1gb size table)

2015-10-13 Thread Gopal Vijayaraghavan

> I tried doing stream table, but ran for long time like 3 hrs : Looks
>like only 1 reducer is working on it
...
> on (trim(p.pid)=trim(c.p_id) and p.source='XYZ');

In case that's devolving to a cross-product, it might be a miss in pushing
down the trim() to the TableScan.

Are you using hive-13? If you're using a version >1.0.0, can you see if
the query prints a warning about cross-products?

Cheers,
Gopal




Re: Container is running beyond physical memory limits

2015-10-13 Thread Gopal Vijayaraghavan

 
> Now I am rather confused about the following parameters (for example
> mapreduce.reduce versus
> mapreduce.map) and their correlation to each other

They have no relationship with each other. They are meant for two
different task types in MapReduce.

In general you run fewer reducers than mappers, so they are given more
memory per-task than mappers - most commonly it's ~2x of the other, but
they are not related in any way.

The ideal numbers to use for both are exact multiples of
yarn.scheduler.minimum-allocation-mb (since YARN rounds up to that
quantum).

For example, with a 1536 min-alloc, you're better off allocating 4608 &
getting -Xmx3686, since the 4096 ask will anyway pad up to 4608, losing
500Mb in the process.

This is very annoying & complex, so with Tez there's exactly 1 config &
you can just skip the -Xmx param for hive.tez.java.opts. Tez will inject
an Xmx after a container alloc returns (so that re-adjustment is
automatic).

> <property>
>   <name>mapreduce.map.memory.mb</name>
>   <value>4096</value>
> </property>
>
> <property>
>   <name>mapreduce.reduce.memory.mb</name>
>   <value>8192</value>
> </property>
>
> <property>
>   <name>mapreduce.map.java.opts</name>
>   <value>-Xmx3072m</value>
> </property>
>
> <property>
>   <name>mapreduce.reduce.java.opts</name>
>   <value>-Xmx6144m</value>
> </property>

Those configs are correct, the GC heap is approximately 80% of the
allocated container (the JVM uses non-GC buffers for operations like Zlib
decompression).


Cheers,
Gopal




Re: Limiting outer join

2015-07-07 Thread Gopal Vijayaraghavan

 Never mind, I got it working with UDF. I just pass the file location to
my evaluate function. Thanks! :)

Nice. Would be very interested in looking at performance of such a UDF, if
you have numbers before/after.

I suspect it will be a magnitude or more faster than the BETWEEN/JOIN
clauses.

Cheers,
Gopal




Re: Insert with dynamic partitioning from an ORC table fails

2015-07-08 Thread Gopal Vijayaraghavan


 I'm having a problem in Hive 0.13.0 using INSERT OVERWRITE with dynamic
partitioning, selecting from an ORC table to another ORC table (I don't
think the target table being ORC is significant).

I think for this to be triggered in 0.13, the src and destination have to
be ORC tables.

 Has anyone seen this error before? Is it fixed in a later version? I've
included reproduction steps below.

Seen a whole class of such errors, which were mostly fixed around the 1.0
release - can you try this on a build after HIVE-8226 was committed?

Cheers,
Gopal




Re: Changelog table

2015-11-17 Thread Gopal Vijayaraghavan

> If I have a series of entries that look like
...
> { "update", {"baz" : "bar" }}

Due to the way the split distribution works, you need a global ordering
key for each operation.

0, "ADD", "baz", ""
1, "SET", "baz", "bar"
2, "DEL", "baz", null

If you do not have updates coming in within a second, you could store a
timestamp.

Then you can write a windowing function for Hive to merge/order them.

select flatten_txns(op, key, value) over (partition by key order by ts)
from txns;

At this point, you're nearly reinventing what Hive's own
insert/update/delete statements do.

Except, compared to that, these updates are faster (since it's really an
unconditional SET).

Cheers,
Gopal




Re: Cross join/cartesian product explanation

2015-11-17 Thread Gopal Vijayaraghavan

>It's really a very simple query that I'm trying to run:
>select
...
>bloom_contains(a_id, b_id_bloom)

That's nearly impossible to optimize directly - there is no way to limit
the number of table_b rows which may match table a.

More than one bloom filter can successfully match a single row from a_id,
so this reduces to

for (a_id in a) {
  for (b_id_bloom in b) {
    if (bloom_contains(a_id, b_id_bloom)) {
      emitRow();
    }
  }
}


>The sizes of the tables I'm running against are small - roughly 50-100Mb
>- but this query would need to be expanded to run on a table that is
>>100Gb (table_b would likely max out around 100Mb).

As long as table_b is <1Gb, this will be executed in parallel across all
the splits (so there's an invisible for-parallel( s in splits) outside).

You need to increase the noconditional task size & the container size to
hold the entire table_b in memory.
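
The knobs for that would be something like (sizes here are just examples -
they need to comfortably exceed the serialized size of table_b):

set hive.auto.convert.join.noconditionaltask=true;
set hive.auto.convert.join.noconditionaltask.size=1073741824;
set hive.tez.container.size=4096;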

As you can probably tell, it would've been much much faster if hive
generated the dimension table loop outside for this case.

for (b_id_bloom in b) {
  for (a_id in split) {
    if (bloom_contains(a_id, b_id_bloom)) {
      emitRow();
    }
  }
}

So that each bloom filter is deserialized exactly once & reused.

I can think of one way to leverage hive UDTFs to generate something like
that, but it won't be any sort of standard SQL syntax (i.e lateral view).
I'll probably have to think more about it though.

But before that, do run this code with JVM opts -Xprof and check the
stdout to see if the bloom filter deserialization is hurting you or not.

Cheers,

Gopal




Re: Decomposing nested Hive statements with views

2015-09-14 Thread Gopal Vijayaraghavan

> We have many HQL scripts that select from nested sub-selects. In many
>cases the nesting can be a few levels deep:
...
> Such queries are difficult to test as they are complex monoliths. While
>we can decompose and modularise them at the column level with UDFs and
>macros, it is not as evident to me how best to break up the nested
>queries into separate components. I consider the options to be:
...
> * What is considered the best practice for modularising this type of
>query?
> * Is there a penalty for using views over a fully inlines query?
> * Are there any other options that I haven't considered?

I prefer using the CTE expressions, which are a SQL standard way to do
this.

The approaches really depend on your hive version & hadoop version.

Just like views, CTEs aren't materialized, but are rolled into the query
and duplicated for each invocation which might not be ideal for you.

But that is no different from repeating the sub-query manually and is much
cleaner.
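
A small sketch of the CTE form (names are made up):

with monthly as (
  select month_id, sum(amount) as total
  from sales
  group by month_id
)
select month_id, total
from monthly
where month_id >= '2014-01';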

So for a CTE referred to in many queries, I prefer prefixing my query
fragments with

set hive.exec.temporary.table.storage=memory;


create temporary table if not exists q_monthly_aggregate_2014 ...

when running a large number of them in the same session, that works much
better (ugly cubing).

The CTEs and views are duplicated wherever they are referred to, because
that lets you column prune or push table filters down into the ORC layers.
A 10 column view where you read 1 column out will only read that one
column in the stage reading it.

But the hive CBO team is working on fixing the general case of that, so
that repetitions can be identified by filters & spooled (via Apache
Calcite) -

There's also a patch from Navis recently to work around this temporarily -
https://issues.apache.org/jira/browse/HIVE-11752

Cheers,
Gopal




Re: Force users to specify partition indexes in queries

2015-09-29 Thread Gopal Vijayaraghavan

> If this is not doable in Hive at the moment, I am interested in writing
>a patch for it. I am not familiar with hive codebase so not sure how
>complex this is. Any hints or tips would be great and if I do need to
>write such a patch, I would be happy to contribute back to the source.

There are execution hooks in hive, which can be used to walk the operator
tree & look for a TableScan operator.

Here's an example of me doing something really horrible with a hook

https://github.com/t3rmin4t0r/captain-hook


You can have your hive hook look up the table + filter clause & then deny
a query.

Cheers,
Gopal




Re: mapjoin with left join

2015-09-22 Thread Gopal Vijayaraghavan

> select small.* from small s left join large l on s.id = l.id
> where l.id is null;
...
> We simply want to load the 81K rows in to RAM, then for each row in
>large, check the small hash table and if it the row in small is not in
>large, then add it to large.

That seems like a fair description of the problem with the approach, but
it is fairly hard to do left outers in a distributed fashion with
correctness.

Imagine I load partition #1 from the big table + the small table into a
task - how would I know partition #2 doesn't have a key from the small
hashtable. So the system cannot output any rows in the naive version of
this problem until it compares the small table to all parts of the big
table.

That I think was Sergey's point & I don't know of an easy way around that
yet.

> The business case is loading only new rows into a large fact table.  The
>new rows are the ones that are small in number.

That however is much easier - because that's better written as.

insert into large select * from small where id NOT IN (select id from
large);

FYI, we want people to stop writing these sort of queries once the MERGE
syntax lands in HIVE - https://issues.apache.org/jira/browse/HIVE-10924


The trick is to rewrite the NOT IN condition as two set operations - inner
join + remainder left outer.

Find all ids which already exist in bigtable with an inner map-join.
Remove all those ids from the small table & insert the remainder.
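
Spelled out, that is roughly (assuming id is the only join key):

insert into table large
select s.*
from small s
left outer join (
  select distinct small.id
  from small join large on small.id = large.id
) hit on s.id = hit.id
where hit.id is null;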

explain rewrite select remainder.* from small remainder where id not in
(select id from small, large where small.id = large.id);

That query, you will find, will run much faster than the query you're
currently using.

If you can, please send the "explain rewrite" & I can probably fine-tune
this approach further.

If you're using Tez, you might want to try out the custom vertex managers
as well

set hive.vectorized.execution.mapjoin.minmax.enabled=true;

set hive.vectorized.execution.mapjoin.native.fast.hashtable.enabled=true;
set hive.optimize.dynamic.partition.hashjoin=true;

Those params should allow you to scale up a map-join to ~100x the
available RAM (aimed at LLAP memory utilization).
 
Cheers,
Gopal




Re: sql mapjoin very slow

2015-08-28 Thread Gopal Vijayaraghavan
 I have a question. I use hive 1.1.0, so the hive.stats.dbclass default
value is fs, meaning statistics are stored in the local filesystem. Can
anyone tell me what file path is used to store the statistics?

The statistics aren't stored in the file system long term - the final
destination for stats is the metastore.

The earlier default stats implementation used MR Counters. With
stats.dbclass=fs, they're passed during ETL via the FileSystem, not the MR
counters.

You'll see something like this in the ETL phase, which is just a way to
write the target table + a new location where stats for the insert is
staged.

2015-08-28T01:44:35,581 INFO  [main]: parse.SemanticAnalyzer
(SemanticAnalyzer.java:genFileSinkPlan(6629)) - Set stats collection dir :
hdfs://

The StatsTask on the client side will read this file and update the
metastore.

That aside, you might want to check if you're accidentally joining on a
Double. That has been recently reported as a HashMap regression & can be
triggered when doing a

join string_col = int_col;

with an easy workaround - cast the smaller table's join key to the bigger
table's type.

Cheers,
Gopal






Re: python libraries to execute or call hive queries

2015-08-28 Thread Gopal Vijayaraghavan

 Can anyone suggest any python libraries to call hive queries from python
scripts ?

https://cwiki.apache.org/confluence/display/Hive/HiveClient#HiveClient-Pyth
on


Though I suspect that's out of date.

https://github.com/t3rmin4t0r/amplab-benchmark/blob/master/runner/run_query
.py#L604


is roughly the way to cut-paste that into working form (for hive-13),
though you've got to use the exact thrift version of the HiveServer2 you
run against.

Though, recently I've noticed the SQLAlchemy wrappers to be more
convenient 

https://github.com/dropbox/PyHive/blob/master/pyhive/sqlalchemy_hive.py


Irrespective of the method of access, the only consistent way to talk to
Hive is over the JDBC interaction layer (Thrift server).

Launching bin/hive via Subprocess will work, but I've found that reading
the results out with a regex has more parsing issues than I'd like.

Cheers,
Gopal




Re: NPE when reading Parquet using Hive on Tez

2016-01-04 Thread Gopal Vijayaraghavan

> select count(*) from alexa_parquet;

> Caused by: java.lang.NullPointerException
>at 
>org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.tokeni
>ze(TypeInfoUtils.java:274)
>at 
>org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.
>(TypeInfoUtils.java:293)
>at 
>org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfosFromTypeS
>tring(TypeInfoUtils.java:764)
>at 
>org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.getColum
>nTypes(DataWritableReadSupport.java:76)
>at 
>org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport.init(Dat
>aWritableReadSupport.java:220)
>at 
>org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.getSp
>lit(ParquetRecordReaderWrapper.java:256)

This might be an NPE triggered off by a specific case of the type parser.

I tested it out on my current build with simple types and it looks like
the issue needs more detail on the column types for a repro.

hive> create temporary table x (x int) stored as parquet;
hive> insert into x values(1),(2);
hive> select count(*) from x where x.x > 1;
Status: DAG finished successfully in 0.18 seconds
OK
1
Time taken: 0.792 seconds, Fetched: 1 row(s)
hive> 

Do you have INT96 in the schema?

> I'm currently evaluating Hive on Tez as an alternative to keeping the
>SparkSQL thrift sever running all the time locking up resources.

Tez has a tunable value in tez.am.session.min.held-containers (i.e
something small like 10).

And HiveServer2 can be made work similarly because spark
HiveThriftServer2.scala is a wrapper around hive's ThriftBinaryCLIService.






Cheers,
Gopal




Re: Hive HLL for appx count distinct

2015-12-30 Thread Gopal Vijayaraghavan
> In the hive-hll-udf, you seem to mention about RRD. Is that something
>supported by Hive?

No. RRDTool is what most people are replacing with Hive to store time
series data in.

Raw RRDTool files on a local disk have no availability model (i.e lose a
disk, you lose data).

The rollup concept however is very powerful for maintaining distinct
aggregates of a time-series (& throwing out the expired ones), which is
what my example was about:

last 30 days HLL + last 23 hours HLL + generate HLL over current_hour.

to count billions of distincts across them with a few megabytes of storage.
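
In SQL form that's roughly (the function and table names below are
placeholders - check the UDF jar for the real signatures):

select hll_count(hll_merge(day_sketch))
from daily_uniques
where dt >= date_sub(current_date, 30);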

This can be then further extended to build hundreds of bitsets per hour,
one for each tracked A/B experiment to collect stats on.

Cheers,
Gopal




Re: Hive Buckets and Select queries

2015-12-31 Thread Gopal Vijayaraghavan
> and when we issue the following query, its doing a "Full table scan"
>
>
> SELECT * FROM foo WHERE id=

Bucket pruning is available in Hive-2.0 right now (HIVE-11525).

Doc note pending, for 2016 when Hive-2.0 releases.

Cheers,
Gopal




Re: Hive HLL for appx count distinct

2015-12-30 Thread Gopal Vijayaraghavan

> I'm trying to explore the HLL UDF option to compute # of uniq users for
>each time range (week, month, yr, etc.) and wanted to know if
> its possible to just maintain HLL struct for each day and then use those
>to compute the uniqs for various time
> ranges using these per day structs instead of running the queries across
>all the data?

Yes, unions of raw HLL can be done (though not intersects).

https://github.com/t3rmin4t0r/hive-hll-udf


Or better yet, use the Yahoo sketches which work better than raw HLL.

http://yahooeng.tumblr.com/post/135390948446/data-sketches

+
http://datasketches.github.io/

+
https://github.com/DataSketches/sketches-hive


Cheers,
Gopal



Re: Is Hive Index officially not recommended?

2016-01-05 Thread Gopal Vijayaraghavan

 
 
> I am going to run the same query in Hive. However, I only see a table
>scan below and no mention of that index. May be I am missing something
>here?

Hive Indexes are an incomplete feature, because they are not maintained
over an ACID storage & demand FileSystem access to check for validity.

I'm almost sure there's a better implementation, which never made it to
Apache (read HIVE-417 & comments about HBase).


So far, in all my prod cases, they've slowed down queries more often than
speeding them up.

By default, the indexes are *not* used to answer queries.

In fact, the slowness was mostly attributed to the time spent making sure
the index was invalid.

You can flip those on if you want mostly up-to date results.

set hive.optimize.index.filter=true;
set hive.optimize.index.groupby=true;

set hive.index.compact.query.max.size=-1;

set hive.optimize.index.filter.compact.minsize=-1;

set hive.index.compact.query.max.entries=-1;

Things are going to change in Hive-2.0 though. The addition of isolated
transactions brings new light into the world of indexes.

I'll be chasing that down after LLAP, since the txn model offers
serializability markers and the LockManager + compactions offer a great
way to purge/update them per-partition. And the metastore-2.0 removes a
large number of scalability problems associated with metadata.

 
Cheers,
Gopal







Re: Is Hive Index officially not recommended?

2016-01-05 Thread Gopal Vijayaraghavan

>So in a nutshell in Hive if "external" indexes are not used for improving
>query response, what value they add and can we forget them for now?

The builtin indexes - those that write data as smaller tables - are only
useful in a pre-columnar world, where the indexes offer a huge reduction
in IO.

Part #1 of using hive indexes effectively is to write your own
HiveIndexHandler, with usesIndexTable=false;

And then write a IndexPredicateAnalyzer, which lets you map arbitrary
lookups into other range conditions.

Not coincidentally - we're adding an "ANALYZE TABLE ... CACHE METADATA"
which consolidates the "internal" index into an external store (HBase).

Some of the index data now lives in the HBase metastore, so that the
inclusion/exclusion of whole partitions can be done off the consolidated
index. 

https://issues.apache.org/jira/browse/HIVE-11676


The experience from BI workloads run by customers is that in general, the
lookup to the right "slice" of data is more of a problem than the actual
aggregate.

And that for a workhorse data warehouse, this has to survive even if
there's a non-stop stream of updates into it.

Cheers,
Gopal




Re: unique-id for the mapper task with tez execution engine

2015-12-22 Thread Gopal Vijayaraghavan
Hi,

(x-posts to bcc:)

On 12/22/15, 9:19 PM, "Amey Barve"  wrote:

>conf.get("mapreduce.task.id");
>
>Now I want to run same hive queries with tez engine and I want to know
>what
>should be my unique-id. Is there any property from configuration or other
>that can give me unique-id from the mapper task?
>
>I am using conf.get("mapreduce.task.partition") property, Is that
>guranteed
>to give unique - id for tez?

That's populated in Tez in pretty much the exact format used by MRv2.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hi
ve/ql/exec/tez/TezProcessor.java#L112


But beware, each mapper & reducer thinks it's the only one of its kind out
there, so there are collisions between DAG stages for the unique name.

Cheers,
Gopal




Re: unique-id for the mapper task with tez execution engine

2015-12-22 Thread Gopal Vijayaraghavan
Hi,

> So what do you suggest to get unique-id for mapper task with tez
>execution engine?
> 
> conf.get("mapreduce.task.partition");
>
> Is this correct?

Yes, that is correct - but it can only be unique within a Mapper vertex.

Tez plans sort of look like this for complex queries

http://people.apache.org/~gopalv/tpch-plans/q2_minimum_cost_supplier.svg


Every MapTezProcessor there will go from 0-n, for that config parameter.

So all those Map 1, Map 2 ... Map 17 will have a task.partition 0 with the
same value in the attempt id, because there's no room for a vertex-id in
there.

Cheers,
Gopal









Re: Null Representation in Hive tables

2015-12-27 Thread Gopal Vijayaraghavan

> Your best bet is to take the serde you are using and copy it and change
>the code to accept both null types

Or use a view with an IF().

For instance, I like the ease of generating groovy UDFs inline for
quick-n-dirty state machines.

We now do column pruning into views, so if you do not select that column
it should not run the UDF unnecessarily.

Most of the times I use a CTE, I really wish there was a "CREATE TEMPORARY
VIEW" for me to avoid prefixing all queries.

Cheers,
Gopal




Re: Low performance map join when join key types are different

2015-12-22 Thread Gopal Vijayaraghavan

> But why does disabling mapjoin give better performance when we don't use
>a cast to string (users are always lazy)?
> 
> Is the join key value comparison in the reduce stage quicker?

The HashMap is slower than the full-sort +
sorted-merge-join.


It shouldn't be, but it hits the worst-case performance for the Hashmap
impl because of a bug in DoubleWritable in Hadoop.

The effect is somewhat the same as

public int hashCode() {
   return 1;
}

Read the comments on - https://issues.apache.org/jira/browse/HADOOP-12217

Cheers,
Gopal








Re: Low performance map join when join key types are different

2015-12-22 Thread Gopal Vijayaraghavan

> We found that when we join on two different type keys , hive will
>convert all join key to Double.

This is because of type coercions for BaseCompare, so that String:Integer
comparisons with "<=" will work similarly to "=".

> b.id to double. When the conversion occurs, map join will become very
>slow.
...
> Does anyone how to solve it more effectively?

This is an issue that only affects mapreduce mode in Hive. The broadcast
joins in Tez switched to Murmur hash to avoid this issue (HIVE-6924 +
HIVE-7121).

As a workaround, you can insert explicit casts to String to make this
faster.
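
For example (table/column names are stand-ins for your own):

select /*+ MAPJOIN(b) */ a.*
from big_table a
join small_table b
  on a.string_key = cast(b.int_key as string);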

Cheers,
Gopal




Re: How to capture query log and duration

2015-11-20 Thread Gopal Vijayaraghavan


>Can you please also let me know what argument list this script want .
>
>I was trying following in HDP Sandbox , but did not get JSON outout

The JSON output is saved into a .zip file, if you hit ^C.

> https://gist.github.com/t3rmin4t0r/e4bf835f10271b9e466e

Look for a file named atsdump*.zip

Cheers,
Gopal




Re: How to capture query log and duration

2015-11-19 Thread Gopal Vijayaraghavan
> We would like to capture some information in our Hadoop Cluster.
> Can anybody please suggest how we can we  achieve this, any tools
>available already ? Or do we need to scrub any log ?

Apache Atlas is the standardized solution for deeper analytics into data
ownership/usage (look at the HiveHook in Atlas).

> 1. We want to know how many queries are run in everyday
> 2. What are the durations of those queries .
> 3. If any queries are failing in what step they are failing.

For a general use-case, you are probably already writing a lot of this
data.

https://gist.github.com/t3rmin4t0r/e4bf835f10271b9e466e

That only pulls the query text + plans in JSON (to automatically look for
bad plans), but the total event structure looks like this

{
    "domain": "DEFAULT",
    "entity": "gopal_20151119211930_bae04691-f46a-44c4-9116-bef8f854e49a",
    "entitytype": "HIVE_QUERY_ID",
    "events": [
        {
            "eventinfo": {},
            "eventtype": "QUERY_COMPLETED",
            "timestamp": 1447986004954
        },
        {
            "eventinfo": {},
            "eventtype": "QUERY_SUBMITTED",
            "timestamp": 1447985970564
        }
    ],
    "otherinfo": {
        "STATUS": true,
        "TEZ": true,
        "MAPRED": false,
        "QUERY": ...
    },
    "primaryfilters": {
        "requestuser": [
            "gopal"
        ],
        "user": [
            "gopal"
        ]
    }
}

I have seen at least one custom KafkaHook to feed hive query plans into a
Storm pipeline, but that was custom built to police the system after an
ad-hoc query produced a 4.5 petabyte join.

Cheers,
Gopal




Re: Building Spark to use for Hive on Spark

2015-11-18 Thread Gopal Vijayaraghavan


> I wanted to know  why is it necessary to remove the Hive jars from the
>Spark build as mentioned on this

Because SparkSQL was originally based on Hive & still uses Hive AST to
parse SQL.

The org.apache.spark.sql.hive package contains the parser which has
hard-references to the hive's internal AST, which is unfortunately
auto-generated code (HiveParser.TOK_TABNAME etc).

Every time Hive makes a release, those constants change in value and that
is private API because of the lack of backwards-compat, which is violated
by SparkSQL.

So Hive-on-Spark forces mismatched versions of Hive classes, because it's
a circular dependency of Hive(v1) -> Spark -> Hive(v2) due to the basic
laws of causality.

Spark cannot depend on a version of Hive that is unreleased and
Hive-on-Spark release cannot depend on a version of Spark that is
unreleased.

Cheers,
Gopal




Re: Hive UDF accessing https request

2016-01-10 Thread Gopal Vijayaraghavan

> javax.net.ssl.SSLHandshakeException:
>sun.security.validator.ValidatorException: PKIX path building failed:
>sun.security.provider.certpath.SunCertPathBuilderException: unable to
>find valid certification path to requested

There's a linux package named ca-certificates(-java) which might be
missing.

You can see what's in /etc/ssl/certs/java/ & make sure you have them.

Running external request workloads is not recommended - because for each
failure, all previous results are discarded.

Each retry will start from the 1st item and in general be wasteful & slow.

Cheers,
Gopal











Re: simple usage of stack UDTF causes a cast exception

2016-01-10 Thread Gopal Vijayaraghavan


> java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException:
>java.lang.ClassCastException:
>org.apache.hadoop.hive.serde2.lazy.LazyString cannot be cast to
>org.apache.hadoop.io.Text
...
>at 
>org.apache.hadoop.hive.serde2.objectinspector.primitive.WritableStringObje
>ctInspector.getPrimitiveWritableObject(WritableStringObjectInspector.java:
>41)>
>at 
>org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimp
>leSerDe.java:261)
>at 
>org.apache.hadoop.hive.serde2.DelimitedJSONSerDe.serializeField(DelimitedJ
>SONSerDe.java:72)

The UDTF seems to be forwarding the object as-is. From a quick look in the
UDTF, there's a convertIfNecessary in there - might need some analysis to
figure that out.

But looks like this is breaking the small data-size runs only. You can
lower the thresholds and try again.

set hive.fetch.task.conversion.threshold=-1;

Cheers,
Gopal 




Re: Using Hive table for twitter data

2016-06-09 Thread Gopal Vijayaraghavan

> Has anyone done recent load of twitter data into Hive table.

Not anytime recently, but the twitter corpus was heavily used to demo Hive.

Here's the original post on auto-learning schemas from an arbitrary
collection of JSON docs (like a MongoDB dump).

http://hortonworks.com/blog/discovering-hive-schema-in-collections-of-json-
documents/


Cheers,
Gopal




Re: Using Hive table for twitter data

2016-06-09 Thread Gopal Vijayaraghavan

> Any reason why that table in Hive cannot read data in?

No idea how you're loading data with flume, but it isn't doing it right.

>> PARTITIONED BY (datehour INT)

...

>> -rw-r--r--   2 hduser supergroup 433868 2016-06-09 09:52
>>/twitter_data/FlumeData.1465462333430

No ideas on how to get that to create partitions either.

Cheers,
Gopal




Re: Copying all Hive tables from Prod to UAT

2016-05-25 Thread Gopal Vijayaraghavan

> We are using HDP. Is there any feature in ambari

Apache Falcon handles data lifecycle management, not Ambari.
 
https://falcon.apache.org/0.8/HiveDR.html

Cheers,
Gopal




Re: Optimized Hive query

2016-06-14 Thread Gopal Vijayaraghavan

> You can see that you get identical execution plans for the nested query
>and the flatten one.

It wasn't always that way, though. Back when I started with Hive, before
Stinger, it didn't have the identity project remover.

To know if your version has this fix, try looking at

hive> set hive.optimize.remove.identity.project;


Cheers,
Gopal

 




Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Gopal Vijayaraghavan

> In short at the simplest set up what Resource Manager it works with?

Tez+Hive needs HDFS and YARN 2.6.0+ (preferably as close to an Apache
build as possible - CDH clusters need more work).

Hive2 needs Apache Slider 0.91 right now, to start the cache daemons on
YARN (see SLIDER-82).

> If so kindly specify both Hive and TEZ versions.

I maintain build scripts & configuration setups for Hive+Tez, for work

https://github.com/t3rmin4t0r/tez-autobuild/tree/llap


Both that & the master there builds Hive (2.1.0-SNAPSHOT) + Tez
(0.8.4-SNAPSHOT), this one has the
LLAP cache configurations turned on.

This is what I use to develop Hive before there are releases, and it will
allow each user on a shared cluster to maintain their own independent
private install of hive - if you look at something like the old Spotify
Hive query presentations, you'll see that more people have used that to
run their own private builds successfully :)

Purely out of laziness, the LLAP configurations in slider-gen.sh (i.e the
Xmx & cache values)
are configured exactly to match my dev cluster - 32 vcore + 256Gb RAM.

Cheers,
Gopal




Re: Anyone successfully deployed Hive on TEZ engine?

2016-05-30 Thread Gopal Vijayaraghavan
> I do not use any vendor's product., All my own set up, build and
>configure.

My autobuild scripts should serve as readable documentation for this,
since nearly everything's in a single Makefile with an install: target.

Or take the easy route with

$ make dist install

In case you use the llap branch, just do "set
hive.llap.execution.mode=none;" to use Tez.

> java version "1.8.0_77"
> Hadoop 2.6.0
...
> https://tez.apache.org/install.html

Looks good so far.

> Ok I just need to make it work as I have hive on spark engine as well.

You're missing 3 things approximately - if you read through the Makefile
in github.

First, a good tez-site.xml in the classpath (remember, tez.lib.uris needs
to be an HDFS path - for the rest, see the base file from autobuild).

I usually update Tez to ${fs.default.name}/user/gopal/tez/tez.tar.gz and I
do not use the minimal tarball, but the full dist tarball.

The fixed tarball means it hits all the good localization characteristics
of YARN, which can add up to minutes on a >250+ node cluster.

Second, put that in the classpath for Hive (append to
$INSTALL_ROOT/hive/bin/hive-config.sh)

> export 
>HADOOP_CLASSPATH="$INSTALL_ROOT/tez/*:$INSTALL_ROOT/tez/lib/*:$INSTALL_ROO
>T/tez/conf/:$HADOOP_CLASSPATH"

> export HADOOP_USER_CLASSPATH_FIRST=true


Replace $INSTALL_ROOT with wherever Tez is located.

Third, disable the hive-1.x jars coming from SparkSQL (append/create in
$INSTALL_ROOT/hive/conf/hive-env.sh)

> export HIVE_SKIP_SPARK_ASSEMBLY=true


After that, you can do

> hive --hiveconf hive.execution.engine=tez

to get Tez working (add --hiveconf tez.queue.name= to use queues).

Cheers,
Gopal




Re: My first TEZ job fails

2016-05-30 Thread Gopal Vijayaraghavan
> hduser@rhes564: /usr/lib/apache-tez-0.7.1-bin> hadoop jar
>./tez-examples-0.7.1.jar orderedwordcount /tmp/input/test.txt
>/tmp/out/test.log

Sure, you're missing file:/// - the defaultFS is most likely
hdfs://<namenode>:<port>/

The inputs and outputs without a scheme prefix will go to the defaultFS
configured in core-site.xml.

Cheers,
Gopal













Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-31 Thread Gopal Vijayaraghavan

> but this sounds to me (without testing myself) like adding caching
>capability to TEZ to bring it on par with SPARK.

Nope, that was the crux of the earlier email.

"Caching" seems to be catch-all term misused in that comparison.

>> There is a big difference between where LLAP & SparkSQL, which has to do
>> with access pattern needs.

On another note, LLAP can actually be used inside Spark as well, just use
LlapContext instead of HiveContext.





I even have a Postgres FDW for LLAP, which is mostly used for analytics
web dashboards which are hooked into Hive.

https://github.com/t3rmin4t0r/llap_fdw


LLAP can do 200-400ms queries, but Postgres can get to the sub 10ms when
it comes to slicing-dicing result sets <100k rows.

Cheers,
Gopal



