Re: Hive query on ORC table is really slow compared to Presto

2017-06-12 Thread Michael Segel
Silly question…

What about using COUNT() and a GROUP BY instead?

I’m going from memory… this may or may not work. You only want the row_id
in order to de-dupe, right?
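A minimal sketch of that idea (using the accounts/id names from later in this
thread purely as placeholders):

-- distinct count expressed as COUNT over a GROUP BY, instead of COUNT(DISTINCT)
SELECT COUNT(*) AS distinct_ids
FROM (SELECT id FROM accounts GROUP BY id) t;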

On Jun 12, 2017, at 3:59 PM, Premal Shah wrote:

Thanx Gopal.
Sorry, took me a few days to respond. Here are some findings.

hive.optimize.distinct.rewrite is True by default

I do see Reducer 2 + 3.

However, this might be worth mentioning. The distinct query on an ORC table 
takes a ton of time. I created a table with the TEXTFILE format from the ORC 
table and ran the same distinct query on it.  That query runs in a few seconds.
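For what it's worth, a TEXTFILE copy like that can be made with a simple CTAS;
this is only a sketch with made-up table names, not the actual statement used:

CREATE TABLE my_table_text STORED AS TEXTFILE AS SELECT * FROM my_table_orc;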

On running the orcfiledump utility, I see that the column on which I want to 
run the distinct query is encoded with a DIRECT encoding.  When I run distinct 
on other columns in the table that are encoded with the dictionary encoding, 
the query runs quickly.
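For reference, the orcfiledump utility can be invoked like this (a sketch; the
path is just a placeholder for one of the table's files):

hive --orcfiledump /path/to/warehouse/table_dir/000000_0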

This is the schema of another table

CREATE TABLE `ip_table`(
  `ip` string,
  `id` string,
  `name` string,
  `domain` string,
  `country` string)
CLUSTERED BY (ip)
INTO 16 BUCKETS
;

I've created a gist with the query execution to maintain formatting

https://gist.github.com/premal/81b1b00dfffcc8280e04ec177334cb0f

Running a count(distinct) query on master id took 3+ hours. It looks like the 
CPU was busy when running this query.
Running a count(distinct) on the name and domain columns took a few seconds.

This is the output of the ORC File Dump for 1 of the files

Encoding column 0: DIRECT
Encoding column 1: DIRECT_V2
Encoding column 2: DICTIONARY_V2[245849]
Encoding column 3: DICTIONARY_V2[199608]
Encoding column 4: DICTIONARY_V2[161352]
Encoding column 5: DICTIONARY_V2[188]

The table has a total of 66,768,600 rows

These are the count distinct values per column

ip      - 66,768,600
id      - 4,291,106
name    - 3,007,034
domain  - 2,245,715
country - 212

One thing to note is that the id is more or less a fixed-length string: it's
either 16 chars or 31. The other columns have more variety in their field
lengths. Not sure if that matters.

Thanx in advance.


On Tue, Apr 4, 2017 at 3:27 PM, Gopal Vijayaraghavan wrote:
> SELECT COUNT(*), COUNT(DISTINCT id) FROM accounts;
…
> 0:01 [8.59M rows, 113MB] [11M rows/s, 146MB/s]

I'm hoping this is not rewriting to the approx_distinct() in Presto.

> I got similar performance with Hive + LLAP too.

This is a logical plan issue, so I don't know if LLAP helps a lot.

A count + a count(distinct) is planned as a full shuffle of 100% of rows.

Run with

set hive.tez.exec.print.summary=true;

And see the output row-count of Map 1.

> What can be done to get the hive query to run faster in hive?

Try with the following (and see if it generates a Reducer 2 + Reducer 3, which
is where the speedup comes from).

set hive.optimize.distinct.rewrite=true;

or try a rewrite

select id from accounts group by id having count(1) > 1;

Both approaches enable full-speed vectorization for the query.
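For illustration, a hand rewrite of the combined query into that two-stage
shape (a sketch; it matches COUNT(*) plus COUNT(DISTINCT id) as long as id has
no NULLs):

SELECT SUM(cnt)  AS total_rows,
       COUNT(*)  AS distinct_ids
FROM (
  SELECT id, COUNT(*) AS cnt
  FROM accounts
  GROUP BY id
) t;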

Cheers,
Gopal





--
Regards,
Premal Shah.



Fwd: Pro and Cons of using HBase table as an external table in HIVE

2017-06-09 Thread Michael Segel
Sorry. Need to send via right email address.

Begin forwarded message:

From: Michael Segel <mse...@segel.com>
Subject: Re: Pro and Cons of using HBase table as an external table in HIVE
Date: June 9, 2017 at 7:37:22 AM CDT
To: user@hive.apache.org

Hey Edward,

Yes, that’s the gist of it.
However… if you can exclude data… your query in HBase could be faster.
Having said that…

I should have included hardware in to the equation… Also data locality could 
come in to play…  But that really would confuse the issue and the OP even more. 
;-)


-Mike

On Jun 9, 2017, at 7:14 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

Think about it like this: one system is scanning a local ORC file; the other is
using an HBase scanner (over the network) and scanning the data in its sstable
format.

On Fri, Jun 9, 2017 at 5:50 AM, Amey Barve <ameybarv...@gmail.com> wrote:
Hi Michael,

"If there is predicate pushdown, then you will be faster, assuming that the 
query triggers an implied range scan"
---> Does this bring results faster than plain Hive querying over ORC / Text
file formats?

In other words, is querying over plain Hive (ORC or Text) always faster than
through HiveStorageHandler?

Regards,
Amey

On 9 June 2017 at 15:08, Michael Segel <msegel_had...@hotmail.com> wrote:
The pro is that you have the ability to update a table without having to
worry about duplication of the row.  Tez is doing some form of compaction for
you that already exists in HBase.

The cons:

1) It’s slower. Reads from HBase have more overhead than just reading
a file.  Read Lars George’s book on what takes place when you do a read.

2) HBase is not a relational store. (You have to think about what that implies)

3) You need to query against your row key for best performance, otherwise it 
will always be a complete table scan.

HBase was designed to give you fast access for direct get() and limited range
scans.  Otherwise you have to perform full table scans.  This means that unless
you’re able to do a range scan, your full table scan will be slower than if you
did this on a flat file set.  Again, the reason you would want to use HBase is
if your data set is mutable.

You also have to trigger a range scan when you write your Hive query, and you
have to make sure that you’re querying off your row key.
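For reference, a minimal sketch of what that looks like in Hive (the table,
column family and column names here are made up; the storage handler class and
the hbase.columns.mapping property are the standard Hive/HBase integration
pieces):

CREATE EXTERNAL TABLE hbase_accounts (
  rowkey  STRING,
  name    STRING,
  country STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:country')
TBLPROPERTIES ('hbase.table.name' = 'accounts');

-- Filtering on the rowkey is what gives the storage handler a chance to do a
-- range scan instead of a full scan:
SELECT * FROM hbase_accounts WHERE rowkey >= 'a100' AND rowkey < 'a200';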

HBase was designed as a <key,value> store. Plain and simple.  If you don’t use
the key, you have to do a full table scan. So even though you are partitioning
on row key, you never use your partitions.  However, in Hive or Spark, you can
create an alternative partition pattern.  (e.g. your key is the transaction_id,
yet you partition on the month/year portion of the transaction_date.)

You can speed things up a little by using an inverted table as a secondary 
index. However this assumes that you want to use joins. If you have a single 
base table with no joins then you can limit your range scans based on making 
sure you are querying against the row key.  Note: This will mean that you have 
limited querying capabilities.

And yes, I’ve done this before but can’t share it with you.

HTH

P.S.
I haven’t tried Hive queries where you have what would be the equivalent of a 
get() .

In earlier versions of hive, the issue would be “SELECT * FROM foo where 
rowkey=BAR”  would still do a full table scan because of the lack of predicate 
pushdown.
This may have been fixed in later releases of hive. That would be your test 
case.   If there is predicate pushdown, then you will be faster, assuming that 
the query triggers an implied range scan.
This would be a simple thing. However keep in mind that you’re going to 
generate a map/reduce job (unless using a query engine like Tez) where you 
wouldn’t if you just wrote your code in Java.




> On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan 
> <ramasubramanian.naraya...@gmail.com> wrote:
>
> Hi,
>
> Can you please let us know the pros and cons of using an HBase table as an 
> external table in Hive?
>
> Will there be any performance degradation when using Hive over HBase instead 
> of using a direct Hive table?
>
> The table that I am planning to use in HBase will be a master table, like 
> account or customer. I want to achieve a Slowly Changing Dimension. Please 
> throw some light on that too if you have done any such implementations.
>
> Thanks and Regards,
> Rams







Re: Pro and Cons of using HBase table as an external table in HIVE

2017-06-09 Thread Michael Segel
No.

First, I apologize for my first response.  I guess it’s never a good idea to
check email at 4:00 in the morning before your first cup of coffee. ;-)
I went into a bit more detail, which may have confused the issue.

To answer your question…
“In other words, is querying over plain Hive (ORC or Text) always faster than
through HiveStorageHandler?”
No.

Not always.  It will depend on the data, the schema and the query.

HBase is a <KEY, VALUE> store, where the KEY is the rowkey.
HBase partitions its data based on the rowkey.

The rows are stored by rowkey in lexicographically sorted order. This creates a
physical index.

With respect to hive, if the query uses a filter against the rowkey, the 
resulting query will perform a range scan.

So…
SELECT *
FROM   someTable
WHERE  rowkey > aValue
AND    rowkey < bValue

This query will result in a range scan filter and, depending on the values of
aValue and bValue, should exclude a portion of your table.

If you were to store the data in an HDFS table, the odds are your partition
plan would not be on the rowkey. So your Hive query against this table would
not be able to exclude data. Depending on how much data could have been
excluded, this could be slower than the query against HBase.
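For contrast, a sketch of what the HDFS-side layout might look like (names are
made up); a query that filters only on the rowkey cannot prune these partitions,
so it scans all of them:

CREATE TABLE some_table_hdfs (
  rowkey  STRING,
  payload STRING
)
PARTITIONED BY (tx_month STRING)
STORED AS ORC;

SELECT * FROM some_table_hdfs WHERE rowkey > 'aValue' AND rowkey < 'bValue';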

There are some performance tuning tips that would help reduce the cost of the 
query.


Does this make sense?

Even though HBase could be slower, there are some reasons why you may want to
use HBase over ORC or Parquet.

The problem is that there isn’t a straightforward and simple answer. There are
a lot of factors that go into deciding what tools to use.
(e.g. if you’re a Cloudera Fanboi, you’d use Impala as your query engine,
Hortonworks would push Tez, and MapR is a bit more agnostic.)

And then you have to decide if you want to use Hive, or a query engine or 
something different altogether.





HAVING SAID THAT…
Depending on your query, you may want to consider secondary indexing.

When you do that… HBase could become faster. Again, it depends on the query and
the data.

HBase takes care of deduping data; Hive does not. So unless your data set is
immutable (new rows = new data … no updates) you will have to figure out how
to dedupe your data outside of Hive.

Note that this falls outside the scope of this discussion. The OP’s question is 
about a base table and doesn’t seem to involve any joins.

HTH

-Mike




On Jun 9, 2017, at 4:50 AM, Amey Barve <ameybarv...@gmail.com> wrote:

Hi Michael,

"If there is predicate pushdown, then you will be faster, assuming that the 
query triggers an implied range scan"
---> Does this bring results faster than plain Hive querying over ORC / Text
file formats?

In other words, is querying over plain Hive (ORC or Text) always faster than
through HiveStorageHandler?

Regards,
Amey

On 9 June 2017 at 15:08, Michael Segel <msegel_had...@hotmail.com> wrote:
The pro is that you have the ability to update a table without having to
worry about duplication of the row.  Tez is doing some form of compaction for
you that already exists in HBase.

The cons:

1) It’s slower. Reads from HBase have more overhead than just reading
a file.  Read Lars George’s book on what takes place when you do a read.

2) HBase is not a relational store. (You have to think about what that implies)

3) You need to query against your row key for best performance, otherwise it 
will always be a complete table scan.

HBase was designed to give you fast access for direct get() and limited range
scans.  Otherwise you have to perform full table scans.  This means that unless
you’re able to do a range scan, your full table scan will be slower than if you
did this on a flat file set.  Again, the reason you would want to use HBase is
if your data set is mutable.

You also have to trigger a range scan when you write your Hive query, and you
have to make sure that you’re querying off your row key.

HBase was designed as a <key,value> store. Plain and simple.  If you don’t use
the key, you have to do a full table scan. So even though you are partitioning
on row key, you never use your partitions.  However, in Hive or Spark, you can
create an alternative partition pattern.  (e.g. your key is the transaction_id,
yet you partition on the month/year portion of the transaction_date.)

You can speed things up a little by using an inverted table as a secondary 
index. However this assumes that you want to use joins. If you have a single 
base table with no joins then you can limit your range scans based on making 
sure you are querying against the row key.  Note: This will mean that you have 
limited querying capabilities.

And yes, I’ve done this before but can’t share it with you.

HTH

P.S.
I haven’t tried Hive queries where you have what would be the equivalent of a 
get() .

In earlier versio

Re: Pro and Cons of using HBase table as an external table in HIVE

2017-06-09 Thread Michael Segel
The pro is that you have the ability to update a table without having to
worry about duplication of the row.  Tez is doing some form of compaction for
you that already exists in HBase.

The cons:

1) It’s slower. Reads from HBase have more overhead than just reading
a file.  Read Lars George’s book on what takes place when you do a read.

2) HBase is not a relational store. (You have to think about what that implies) 

3) You need to query against your row key for best performance, otherwise it 
will always be a complete table scan. 

HBase was designed to give you fast access for direct get() and limited range
scans.  Otherwise you have to perform full table scans.  This means that unless
you’re able to do a range scan, your full table scan will be slower than if you
did this on a flat file set.  Again, the reason you would want to use HBase is
if your data set is mutable.

You also have to trigger a range scan when you write your Hive query, and you
have to make sure that you’re querying off your row key.

HBase was designed as a <key,value> store. Plain and simple.  If you don’t use
the key, you have to do a full table scan. So even though you are partitioning
on row key, you never use your partitions.  However, in Hive or Spark, you can
create an alternative partition pattern.  (e.g. your key is the transaction_id,
yet you partition on the month/year portion of the transaction_date.)

You can speed things up a little by using an inverted table as a secondary 
index. However this assumes that you want to use joins. If you have a single 
base table with no joins then you can limit your range scans based on making 
sure you are querying against the row key.  Note: This will mean that you have 
limited querying capabilities. 

And yes, I’ve done this before but can’t share it with you. 

HTH

P.S. 
I haven’t tried Hive queries where you have what would be the equivalent of a 
get() . 

In earlier versions of hive, the issue would be “SELECT * FROM foo where 
rowkey=BAR”  would still do a full table scan because of the lack of predicate 
pushdown. 
This may have been fixed in later releases of hive. That would be your test 
case.   If there is predicate pushdown, then you will be faster, assuming that 
the query triggers an implied range scan. 
This would be a simple thing. However keep in mind that you’re going to 
generate a map/reduce job (unless using a query engine like Tez) where you 
wouldn’t if you just wrote your code in Java. 




> On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan wrote:
> 
> Hi,
> 
> Can you please let us know the pros and cons of using an HBase table as an 
> external table in Hive?
> 
> Will there be any performance degradation when using Hive over HBase instead 
> of using a direct Hive table?
> 
> The table that I am planning to use in HBase will be a master table, like 
> account or customer. I want to achieve a Slowly Changing Dimension. Please 
> throw some light on that too if you have done any such implementations.
> 
> Thanks and Regards,
> Rams 



Re: Bug in ORC file code? (OrcSerde)?

2016-10-19 Thread Michael Segel
Just to follow up… 

This appears to be a bug in the Hive version of the code… it is fixed in the ORC
library…  NOTE: There are two different libraries.

Documentation is a bit lax… but in terms of design…

It’s better to do the build completely in the reducer, making the mapper code
cleaner.


> On Oct 19, 2016, at 11:00 AM, Michael Segel <msegel_had...@hotmail.com> wrote:
> 
> Hi, 
> Since I am not on the ORC mailing list… and since the ORC java code is in the 
> hive APIs… this seems like a good place to start. ;-)
> 
> 
> So… 
> 
> Ran into a little problem… 
> 
> One of my developers was writing a map/reduce job to read records from a 
> source and after some filter, write the result set to an ORC file. 
> There’s an example of how to do this at:
> http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html
> 
> So far, so good. 
> But now here’s the problem….  Large source data means many mappers, and with 
> the filter, the number of output rows is a small fraction of the input size. 
> So we want to write to a single reducer (an identity reducer) so that we get 
> only a single file. 
> 
> Here’s the snag. 
> 
> We were using the OrcSerde class to serialize the data and generate an Orc 
> row which we then wrote to the file. 
> 
> Looking at the source code for OrcSerde, OrcSerde.serialize() returns an 
> OrcSerdeRow.
> see: 
> http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java
> 
> OrcSerdeRow implements Writable and as we can see in the example code… for a 
> map only example… context.write(Text, Writable) works. 
> 
> However… if we attempt to make this into a Map/Reduce job, we run into a 
> problem at run time. The context.write() throws the following exception:
> "Error: java.io.IOException: Type mismatch in value from map: expected 
> org.apache.hadoop.io.Writable, received 
> org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow”
> 
> 
> The goal was to reduce the ORC rows and then write them out in the reducer. 
> 
> I’m curious as to why the context.write() fails? 
> The error is a bit cryptic since the OrcSerdeRow implements Writable… so the 
> error message doesn’t make sense. 
> 
> 
> Now the quick fix is to borrow the ArrayListWritable from Giraph and create 
> the list of fields into an ArrayListWritable and pass that to the reducer 
> which will then use that to generate the ORC file. 
> 
> Trying to figure out why the context.write() fails… when sending to the reducer 
> while it works if it’s a map-side write.
> 
> The documentation on the ORC site is … well… to be polite… lacking. ;-) 
> 
> I have some ideas why it doesn’t work, however I would like to confirm my 
> suspicions. 
> 
> Thx
> 
> -Mike
> 
> 



Bug in ORC file code? (OrcSerde)?

2016-10-19 Thread Michael Segel
Hi, 
Since I am not on the ORC mailing list… and since the ORC java code is in the 
hive APIs… this seems like a good place to start. ;-)


So… 

Ran into a little problem… 

One of my developers was writing a map/reduce job to read records from a source 
and after some filter, write the result set to an ORC file. 
There’s an example of how to do this at:
http://hadoopcraft.blogspot.com/2014/07/generating-orc-files-using-mapreduce.html

So far, so good. 
But now here’s the problem….  Large source data means many mappers, and with
the filter, the number of output rows is a small fraction of the input size.
So we want to write to a single reducer (an identity reducer) so that we get
only a single file.

Here’s the snag. 

We were using the OrcSerde class to serialize the data and generate an ORC row
which we then wrote to the file.

Looking at the source code for OrcSerde, OrcSerde.serialize() returns an
OrcSerdeRow.
see: 
http://grepcode.com/file/repo1.maven.org/maven2/co.cask.cdap/hive-exec/0.13.0/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java

OrcSerdeRow implements Writable and as we can see in the example code… for a 
map only example… context.write(Text, Writable) works. 

However… if we attempt to make this into a Map/Reduce job, we run into a
problem at run time. The context.write() throws the following exception:
 "Error: java.io.IOException: Type mismatch in value from map: expected 
org.apache.hadoop.io.Writable, received 
org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow”


The goal was to reduce the ORC rows and then write them out in the reducer. 

I’m curious as to why the context.write() fails? 
The error is a bit cryptic since the OrcSerdeRow implements Writable… so the 
error message doesn’t make sense. 


Now the quick fix is to borrow the ArrayListWritable from Giraph and create the
list of fields into an ArrayListWritable and pass that to the reducer which
will then use that to generate the ORC file. 

Trying to figure out why the context.write() fails… when sending to the reducer
while it works if it’s a map-side write.

The documentation on the ORC site is … well… to be polite… lacking. ;-) 

I have some ideas why it doesn’t work, however I would like to confirm my 
suspicions. 

Thx

-Mike




Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-07-11 Thread Michael Segel
Just a clarification. 

Tez is ‘vendor’ independent.  ;-) 

Yeah… I know…  Anyone can support it.  Only Hortonworks has stacked the deck in 
their favor. 

Drill could be in the same boat, although there are now more committers who are
not working for MapR. I’m not sure who outside of HW is supporting Tez. 

But I digress. 

Here in the Spark user list, I have to ask: how do you run Hive on Spark? Is the
execution engine (the Spark context) always running? (Client mode, I assume.)
Are the executors always running? Can you run multiple queries from multiple
users in parallel? 

These are some of the questions that should be asked and answered when 
considering how viable spark is going to be as the engine under Hive… 
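For context, which engine a Hive session uses is just a setting (a sketch; the
operational questions above still apply):

set hive.execution.engine=spark;   -- or tez, or mr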

Thx

-Mike

> On May 29, 2016, at 3:35 PM, Mich Talebzadeh wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to the Hive user group to see if anyone has managed to build a 
> vendor-independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from 
>> the TEZ user group kindly gave a hand, but I could not go very far (or maybe I 
>> did not make enough effort) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions.
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This by no means indicates that Spark is much better than MR, but it shows 
>>> that some very good results can be achieved using the Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the right database for the purpose, or one is better off with 
>>> something like Phoenix on HBase, well, the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what one wants is to use the fastest method to get the results. How 
>>> fast we need it is confined by our SLA agreements in production, and that keeps 
>>> us from unnecessary further work, as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, 

Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel

> On Jun 8, 2016, at 3:35 PM, Eugene Koifman  wrote:
> 
> if you split “create table test.dummy as select * from oraclehadoop.dummy;”
> into a create table statement followed by an insert into test.dummy as select…, 
> you should see the behavior you expect with Hive.
> The drop statement will block while the insert is running.
> 
> Eugene
> 

OK, assuming that is true… 

Then the DDL statement is blocked because Hive sees the table in use. 

If you can confirm this to be the case, and you can also confirm that with Spark
you can drop the table while the Spark job is still running, then you would have
a bug, since Spark in the Hive context either doesn’t set any locks or sets them
improperly.
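A quick way to check would be to look at what the lock manager reports while the
insert is still running (a sketch, using the table name from this thread):

use test;
show locks dummy extended;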

I would have to ask which version of Hive you built Spark against.
That could be another factor.

HTH

-Mike




Re: Creating a Hive table through Spark and potential locking issue (a bug)

2016-06-08 Thread Michael Segel
Doh! It would help if I used the right email address to send to the list… 


Hi, 

Lets take a step back… 

Which version of Hive? 

Hive recently added transaction support so you have to know your isolation 
level. 

Also, are you running Spark as your execution engine, or are you talking about a
Spark app running with a Hive context, where you then drop the table from within
a Hive shell while the Spark app is still running? 

And you also have two different things happening… you’re mixing a DDL with a
query.  How does Hive know you have another app reading from the table? 
I mean, what happens when you try a select * from foo; and in another shell try
dropping foo?  And if you want to simulate a m/r job, add something like an
order by 1 clause. 
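Spelling that test out as a sketch (same names as above):

-- shell 1: something that runs long enough to matter
select * from foo order by 1;

-- shell 2, while shell 1 is still running:
drop table foo;   -- with locking working properly, this should block or fail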

HTH

-Mike
> On Jun 8, 2016, at 2:36 PM, Michael Segel <mse...@segel.com> wrote:
> 
> Hi, 
> 
> Lets take a step back… 
> 
> Which version of Hive? 
> 
> Hive recently added transaction support so you have to know your isolation 
> level. 
> 
> Also, are you running Spark as your execution engine, or are you talking about 
> a Spark app running with a Hive context, where you then drop the table from within 
> a Hive shell while the Spark app is still running? 
> 
> You also have two different things happening… you’re mixing a DDL with a 
> query.  How does Hive know you have another app reading from the table? 
> I mean, what happens when you try a select * from foo; and in another shell 
> try dropping foo?  And if you want to simulate a m/r job, add something like 
> an order by 1 clause. 
> 
> HTH
> 
> -Mike



Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
And you have MapR supporting Apache Drill. 

So these are all alternatives to Spark, and it’s not necessarily an either/or
scenario. You can have both. 

> On May 30, 2016, at 12:49 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Yep, Hortonworks supports Tez for one reason or another, and I am hopefully 
> going to test it as the query engine for Hive, though I think Spark will 
> be faster because of its in-memory support.
> 
> Also, if you are independent then you are better off dealing with Spark and Hive 
> without the need to support another stack like Tez.
> 
> Cloudera supports Impala instead of Hive, but it is not something I have used.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 30 May 2016 at 20:19, Michael Segel <msegel_had...@hotmail.com> wrote:
> Mich, 
> 
> Most people use vendor releases because they need to have the support. 
> Hortonworks is the vendor who has the most skin in the game when it comes to 
> Tez. 
> 
> If memory serves, Tez isn’t going to be M/R but a local execution engine? 
> Then LLAP is the in-memory piece to speed up Tez? 
> 
> HTH
> 
> -Mike
> 
>> On May 29, 2016, at 1:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> thanks I think the problem is that the TEZ user group is exceptionally 
>> quiet. Just sent an email to the Hive user group to see if anyone has managed to 
>> build a vendor-independent version.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 29 May 2016 at 21:23, Jörn Franke <jornfra...@gmail.com 
>> <mailto:jornfra...@gmail.com>> wrote:
>> Well I think it is different from MR. It has some optimizations which you do 
>> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
>> 
>> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
>> integrated in the Hortonworks distribution. 
>> 
>> 
>> On 29 May 2016, at 21:43, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>>> Hi Jorn,
>>> 
>>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from 
>>> the TEZ user group kindly gave a hand, but I could not go very far (or maybe I 
>>> did not make enough effort) making it work.
>>> 
>>> That TEZ user group is very quiet as well.
>>> 
>>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>>> in-memory capability.
>>> 
>>> It would be interesting to see what version of TEZ works as execution 
>>> engine with Hive.
>>> 
>>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>>> Hive etc as I am sure you already know.
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 29 May 2016 at 20:19, Jörn Franke <jornfra...@gmail.com 
>>> <mailto:jornfra...@gmail.com>> wrote:
>>> Very interesting do you plan also a test with TEZ?
>>> 
>>> On 29 May 2016, at 13:40, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>>> 
>>>> Basically took the original table imported using Sqoop and created and 
>>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>>> as follows:
>>>> 
>>>>

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-30 Thread Michael Segel
Mich, 

Most people use vendor releases because they need to have the support. 
Hortonworks is the vendor who has the most skin in the game when it comes to 
Tez. 

If memory serves, Tez isn’t going to be M/R but a local execution engine? Then 
LLAP is the in-memory piece to speed up Tez? 

HTH

-Mike

> On May 29, 2016, at 1:35 PM, Mich Talebzadeh  
> wrote:
> 
> thanks I think the problem is that the TEZ user group is exceptionally quiet. 
> Just sent an email to the Hive user group to see if anyone has managed to build a 
> vendor-independent version.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 29 May 2016 at 21:23, Jörn Franke  > wrote:
> Well I think it is different from MR. It has some optimizations which you do 
> not find in MR. Especially the LLAP option in Hive2 makes it interesting. 
> 
> I think hive 1.2 works with 0.7 and 2.0 with 0.8 . At least for 1.2 it is 
> integrated in the Hortonworks distribution. 
> 
> 
> On 29 May 2016, at 21:43, Mich Talebzadeh  > wrote:
> 
>> Hi Jorn,
>> 
>> I started building apache-tez-0.8.2 but got a few errors. A couple of guys from 
>> the TEZ user group kindly gave a hand, but I could not go very far (or maybe I 
>> did not make enough effort) making it work.
>> 
>> That TEZ user group is very quiet as well.
>> 
>> My understanding is TEZ is MR with DAG but of course Spark has both plus 
>> in-memory capability.
>> 
>> It would be interesting to see what version of TEZ works as execution engine 
>> with Hive.
>> 
>> Vendors are divided on this (use Hive with TEZ) or use Impala instead of 
>> Hive etc as I am sure you already know.
>> 
>> Cheers,
>> 
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> 
>> On 29 May 2016 at 20:19, Jörn Franke > > wrote:
>> Very interesting do you plan also a test with TEZ?
>> 
>> On 29 May 2016, at 13:40, Mich Talebzadeh > > wrote:
>> 
>>> Hi,
>>> 
>>> I did another study of Hive using Spark engine compared to Hive with MR.
>>> 
>>> Basically took the original table imported using Sqoop and created and 
>>> populated a new ORC table partitioned by year and month into 48 partitions 
>>> as follows:
>>> 
>>> 
>>> ​ 
>>> Connections use JDBC via beeline. Now for each partition using MR it takes 
>>> an average of 17 minutes as seen below for each PARTITION..  Now that is 
>>> just an individual partition and there are 48 partitions. 
>>> 
>>> In contrast doing the same operation with Spark engine took 10 minutes all 
>>> inclusive. I just gave up on MR. You can see the StartTime and FinishTime 
>>> from below
>>> 
>>> 
>>> 
>>> This by no means indicates that Spark is much better than MR, but it shows 
>>> that some very good results can be achieved using the Spark engine.
>>> 
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> 
>>>  
>>> http://talebzadehmich.wordpress.com 
>>>  
>>> 
>>> On 24 May 2016 at 08:03, Mich Talebzadeh >> > wrote:
>>> Hi,
>>> 
>>> We use Hive as the database and use Spark as an all purpose query tool.
>>> 
>>> Whether Hive is the right database for the purpose, or one is better off with 
>>> something like Phoenix on HBase, well, the answer is it depends and your 
>>> mileage varies. 
>>> 
>>> So fit for purpose.
>>> 
>>> Ideally what one wants is to use the fastest method to get the results. How 
>>> fast we need it is confined by our SLA agreements in production, and that keeps 
>>> us from unnecessary further work, as we technologists like to play around.
>>> 
>>> So in short, we use Spark most of the time and use Hive as the backend 
>>> engine for data storage, mainly ORC tables.
>>> 
>>> We use Hive on Spark and with Hive 2 on Spark 1.3.1 for now we have a 
>>> combination that works. Granted it helps to use Hive 2 on Spark 1.6.1 but 
>>> at the moment it is one of my projects.
>>> 
>>> We do not use any vendor's products as it enables us to move away  from 
>>> being tied down after years of SAP, Oracle and MS dependency to yet another 
>>> vendor. Besides there is some politics going on 

Hive 14 performance and scalability?

2014-12-11 Thread Michael Segel
Hi, 

While I haven’t upgraded to HDP 2.2, I have to ask whether the transaction
processing introduced in Hive 14 has been tested at scale, in terms of both
users and data size. 

I am also curious how well it copes if you have a long-running transaction.
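For anyone trying it, the transactional path needs several settings switched on;
this is only a sketch of a minimal setup as I understand it, not something taken
from this thread:

set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;     -- metastore side
set hive.compactor.worker.threads=1;      -- metastore side
-- and the table itself must be bucketed, stored as ORC, and created with
-- TBLPROPERTIES ('transactional'='true')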


Thx

-Mike