Re: Hive, Tez, clustering, buckets, and Presto

2018-04-04 Thread Richard A. Bross
This is great information, Gopal, thank you.

I wish I had the time to create a comparison for our use case between bucketed 
ORC files on S3 and ORC files without buckets.  Unfortunately it's a 
chicken-and-egg issue, since I won't have enough data volume until we are in 
production, which should be shortly.

Thanks again.

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 7:31:31 PM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

> so they're asking "where is the Hive bucketing spec".  Is it just to read the 
> code for that function? 

This happened the other way around in time - the implementation came before any 
spec was written: ACIDv1 implemented streaming ingest via Storm, and it used an 
explicit "bucket_" naming prefix for the filename.

Since the actual base files don't exist until the compaction runs, the ACID 
bucketing implementation has to handle missing buckets as 0 rows in the base file 
plus possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work 
similarly, so that bucket map-joins can be done between ACID & non-ACID 
bucketed tables - particularly the modulus for negative numbers, which was 
broken in Hive 1.0.

https://issues.apache.org/jira/browse/HIVE-12025

That's where this all got refactored so that joins & filters for 
bucketed tables work the same way for ACID & non-ACID tables.

Because of that, the spec now lives in the code comments as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1

Even when all buckets are covered, Presto should expect more than one file per 
bucket.

I saw a JIRA comment which said "sort in file order and assign buckets"; you can 
see that is only applicable to the first insert into the table (& the regex 
strips the copy numbering).

And oddly enough this week, I saw an academic paper with a negative analysis of 
Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross"  wrote:

Gopal,

The Presto devs say they are willing to make the changes to adhere to the 
Hive bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the 
code for that function?  They were looking for something more explicit, I think.

Thanks

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER 
BY.  My understanding was that this was related to the number of buckets, but 
you are relating it to ORC stripes.  It is odd that no examples that I've seen 
include the SORTED BY statement other than in relation to ORC indexes (that I 
understand).  So the question is: regardless of whether efficient ORC stripes 
are created (wouldn't I have to also specify 'orc.create.index'='true' for this 
to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of the data (i.e. the same device 
has natural progressions in its metrics), which, when combined with a columnar 
format & ordering within files, produces better storage and runtimes together 
(two things that are usually a trade-off).

Without a SORTED BY, the organizing function for the data shuffle does not order 
rows in any specific way - the partition key for the shuffle is the modulus, 
while the order key is 0 bytes long.

Re: Hive, Tez, clustering, buckets, and Presto

2018-04-04 Thread Gopal Vijayaraghavan
> so they're asking "where is the Hive bucketing spec".  Is it just to read the 
> code for that function? 

This happened the other way around in time - the implementation came before any 
spec was written: ACIDv1 implemented streaming ingest via Storm, and it used an 
explicit "bucket_" naming prefix for the filename.

Since the actual base files don't exist until the compaction runs, the ACID 
bucketing implementation has to handle missing buckets as 0 rows in the base file 
plus possibly more rows in uncompacted deltas.

ACID's implementation has forced the two bucketing implementations to work 
similarly, so that bucket map-joins can be done between ACID & non-ACID 
bucketed tables - particularly the modulus for negative numbers, which was 
broken in Hive 1.0.
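
To make the negative-modulus point concrete, here is a rough illustration in
HiveQL (a sketch only - the exact hash and modulus Hive applies live in
ObjectInspectorUtils/Utilities, not in these built-in UDFs):

-- Sketch: the clustering column is hashed and reduced modulo the bucket count,
-- and negative hash values must still land in a valid bucket. Hive's own code
-- masks the sign bit (roughly (hash & Integer.MAX_VALUE) % numBuckets); pmod()
-- below is just the simplest way to show a non-negative result from SQL.
SELECT hash(-3)          AS hash_value,     -- hash() is Hive's built-in hash UDF
       pmod(hash(-3), 4) AS bucket_0_to_3;  -- always in [0, 3], even for negative hashes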

https://issues.apache.org/jira/browse/HIVE-12025

That's where this all got refactored so that joins & filters for 
bucketed tables work the same way for ACID & non-ACID tables.

Because of that, the spec now lives in the code comments as a regex.

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1283
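
For illustration, the same "file name -> bucket id" idea can be mimicked with
Hive's regexp_extract UDF (a sketch - the real pattern and the copy-file
handling are in Utilities::getBucketIdFromFile):

-- Both the original bucket file and its _copy_N variants share the numeric
-- prefix that identifies the bucket; the copy suffix is ignored.
SELECT cast(regexp_extract('000000_0',        '^([0-9]+)_', 1) AS INT) AS bucket_a,  -- 0
       cast(regexp_extract('000001_0_copy_1', '^([0-9]+)_', 1) AS INT) AS bucket_b;  -- 1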

> They were looking for something more explicit, I think.

I think a simple unit test will probably help them a bit more.

create external table bucketed (x int) clustered by (x) into 4 buckets stored as orc;
insert into bucketed values(1),(2),(3),(4);
insert into bucketed values(1),(2),(3),(4);

0: jdbc:hive2://localhost:2181/> dfs -ls /apps/hive/warehouse/bucketed;

-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000000_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000000_0_copy_1
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000001_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000001_0_copy_1
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000002_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000002_0_copy_1
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:13 /apps/hive/warehouse/bucketed/000003_0
-rw-r--r--   3 hive hdfs        181 2018-04-04 23:14 /apps/hive/warehouse/bucketed/000003_0_copy_1

Even when all buckets are covered, Presto should expect more than one file per 
bucket.

I saw a JIRA comment which said "sort in file order and assign buckets"; you can 
see that is only applicable to the first insert into the table (& the regex 
strips the copy numbering).

And oddly enough this week, I saw an academic paper with a negative analysis of 
Hive bucketing.

https://www.researchgate.net/publication/323997831_Partitioning_and_Bucketing_in_Hive-Based_Big_Data_Warehouses

Cheers,
Gopal

On 4/3/18, 1:42 PM, "Richard A. Bross"  wrote:

Gopal,

The Presto devs say they are willing to make the changes to adhere to the 
Hive bucket spec.  I quoted 

"Presto could fix their fail-safe for bucketing implementation to actually 
trust the Hive bucketing spec & get you out of this mess - the bucketing 
contract for Hive is actual file name -> hash % buckets 
(Utilities::getBucketIdFromFile)."

so they're asking "where is the Hive bucketing spec".  Is it just to read the 
code for that function?  They were looking for something more explicit, I think.

Thanks

- Original Message -
From: "Gopal Vijayaraghavan" 
To: user@hive.apache.org
Sent: Tuesday, April 3, 2018 3:15:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

>* I'm interested in your statement that CLUSTERED BY does not CLUSTER 
BY.  My understanding was that this was related to the number of buckets, but 
you are relating it to ORC stripes.  It is odd that no examples that I've seen 
include the SORTED BY statement other than in relation to ORC indexes (that I 
understand).  So the question is: regardless of whether efficient ORC stripes 
are created (wouldn't I have to also specify 'orc.create.index'='true' for this 
to have much of an effect)

ORC + bucketing has been something I've spent a lot of time with - a lot of 
this has to do with secondary characteristics of the data (i.e. the same device 
has natural progressions in its metrics), which, when combined with a columnar 
format & ordering within files, produces better storage and runtimes together 
(two things that are usually a trade-off).

Without a SORTED BY, the organizing function for the data shuffle does not 
order rows in any specific way - the partition key for the shuffle is the 
modulus, while the order key is 0 bytes long, so it sorts by (modulus,), which 
for a quick-sort also loses the input order into the shuffle, and each bucket 
file ends up in random order within itself.

An explicit sort with bucketing is what I recommend to most of the HDP 
customers who have performance problems with ORC.

This turns the shuffle key into (modulus, key1, key2) producing mo
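
For reference, a bucketed-and-sorted declaration along the lines recommended
above might look like the following (a sketch only; the table and column names
are made up):

CREATE TABLE events (
  device_id BIGINT,
  ts        TIMESTAMP,
  metric    DOUBLE
)
CLUSTERED BY (device_id) SORTED BY (device_id, ts) INTO 32 BUCKETS
STORED AS ORC;
-- With SORTED BY, the shuffle orders rows within each bucket by
-- (device_id, ts), so each bucket file comes out sorted instead of in the
-- random order described above.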

[SECURITY] CVE-2018-1284: Hive UDF series UDFXPathXXXX allow users to pass carefully crafted XML to access arbitrary files

2018-04-04 Thread Daniel Dai
CVE-2018-1284: Hive UDF series UDFXPathXXXX allow users to pass
carefully crafted XML to access arbitrary files

Severity: Important

Vendor: The Apache Software Foundation

Versions Affected: This vulnerability affects all versions from 0.6.0

Description: A malicious user might use any of the xpath UDFs
(xpath/xpath_string/xpath_boolean/xpath_number/xpath_double/xpath_float/xpath_long/xpath_int/xpath_short)
to expose the content of a file on the machine running HiveServer2,
owned by the HiveServer2 user (usually hive), if
hive.server2.enable.doAs=false.

Mitigation: Users who use xpath UDFs in HiveServer2 with
hive.server2.enable.doAs=false are recommended to upgrade to 2.3.3, or to
update UDFXPathUtil.java to the head of branch-2.3 and rebuild
hive-exec.jar: 
https://git1-us-west.apache.org/repos/asf?p=hive.git;a=blob;f=ql/src/java/org/apache/hadoop/hive/ql/udf/xml/UDFXPathUtil.java;hb=refs/heads/branch-2.3.
If these functions are not being used at present, you can also
disable their use by adding them to the value of the config
hive.server2.builtin.udf.blacklist.
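
For readers unsure whether their workloads touch this UDF family at all, this
is what a (harmless) call looks like; note that the blacklist property above is
a HiveServer2 startup setting (typically in hive-site.xml) rather than a
per-session SET:

-- Benign example of the affected UDF family:
SELECT xpath_string('<a><b>hello</b></a>', 'a/b');   -- returns 'hello'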


[SECURITY] CVE-2018-1282 JDBC driver is susceptible to SQL injection attack if the input parameters are not properly cleaned

2018-04-04 Thread Daniel Dai
CVE-2018-1282: JDBC driver is susceptible to SQL injection attack if
the input parameters are not properly cleaned

Severity: Important

Vendor: The Apache Software Foundation

Versions Affected: This vulnerability affects all versions of Hive
JDBC driver from 0.7.1

Description: This vulnerability in Hive allows carefully crafted arguments to be
used to bypass the argument escaping/cleanup that the JDBC driver does in its
PreparedStatement implementation.

Mitigation: It is recommended to upgrade prior versions of the Hive JDBC
driver to 2.3.3.
Note that the Hive JDBC driver is not backward compatible with HiveServer2,
which means a newer version of the Hive JDBC driver may not talk to an older
version of HiveServer2. In particular, Hive JDBC driver 2.3.3 won't talk
to HiveServer2 2.1.1 or prior. Users running Hive 2.1.1 or below
might need to upgrade all their Hive instances to 2.3.3.


As an alternative to the upgrade, take the following two actions in your
Hive JDBC client code/application when dealing with user-provided
input in PreparedStatement:
1. Avoid passing user input to PreparedStatement.setBinaryStream
2. Sanitize the user input for PreparedStatement.setString, by
replacing all occurrences of \' with '

Credit: This issue was discovered by Bear Giles of SnapLogic


[SECURITY] CVE-2018-1315 'COPY FROM FTP' statement in HPL/SQL can write to arbitrary location if the FTP server is compromised

2018-04-04 Thread Daniel Dai
CVE-2018-1315: 'COPY FROM FTP' statement in HPL/SQL can write to
arbitrary location if the FTP server is compromised

Severity: Moderate

Vendor: The Apache Software Foundation

Versions Affected: Hive 2.1.0 to 2.3.2

Description: When the 'COPY FROM FTP' statement is run using the HPL/SQL
extension to Hive, a compromised/malicious FTP server can cause the file to be
written to an arbitrary location on the cluster the command is run from.
This is because the FTP client code in HPL/SQL does not verify the destination
location of the downloaded file. This does not affect Hive CLI or
HiveServer2 users, as hplsql is a separate command-line script and needs to be
invoked differently.

Mitigation: Users who use HPL/SQL with Hive 2.1.0 through 2.3.2 should upgrade
to 2.3.3, which removes support for "COPY FROM FTP". Alternatively, the
usage of HPL/SQL can be disabled through
other means.

Credit: This issue was discovered by Danny Grander of Snyk


[ANNOUNCE] Apache Hive 2.3.3 Released

2018-04-04 Thread Daniel Dai
The Apache Hive team is proud to announce the release of Apache Hive
version 2.3.3.

The Apache Hive (TM) data warehouse software facilitates querying and
managing large datasets residing in distributed storage. Built on top
of Apache Hadoop (TM), it provides, among other things:

* Tools to enable easy data extract/transform/load (ETL)

* A mechanism to impose structure on a variety of data formats

* Access to files stored either directly in Apache HDFS (TM) or in other
  data storage systems such as Apache HBase (TM)

* Query execution via Apache Hadoop MapReduce, Apache Tez and Apache Spark
frameworks.

For Hive release details and downloads, please visit:
https://hive.apache.org/downloads.html

Hive 2.3.3 Release Notes are available here:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162&styleName=Text&projectId=12310843

We would like to thank the many contributors who made this release
possible.

Regards,

The Apache Hive Team


Re: [Announce] Hive-MR3: Hive running on top of MR3

2018-04-04 Thread Thai Bui
It would be interesting to see how this compares to Hive LLAP on Tez. Since
the LLAP daemons contain a queue of tasks that is shared amongst many Tez
AMs, it could have similar characteristics to the way MR3 shares the
containers between the AMs.

On Wed, Apr 4, 2018 at 10:06 AM Sungwoo Park  wrote:

> Hello Hive users,
>
> I am pleased to announce MR3 and Hive-MR3. Please visit the following
> webpage for everything on MR3 and Hive-MR3:
>
> https://mr3.postech.ac.kr/
> http://datamonad.com
>
> Here is a description of MR3 and Hive-MR3 from the webpage:
>
> MR3 is a new execution engine for Hadoop. Similar in spirit to Tez, it can
> be thought of as an enhancement of Tez with simpler design, better
> performance, and more features. MR3 is ready for production use as it
> supports all major features from Tez such as Kerberos-based security,
> authentication and authorization, fault-tolerance, and recovery. MR3 is
> implemented in Scala.
>
> Hive-MR3 is an extension of Hive that runs on top of MR3. In order to
> exploit new features in MR3, Hive-MR3 is built on a modified backend of
> Hive. In comparison with Hive-on-Tez, Hive-MR3 generally runs faster for
> sequential queries by virtue of the simple architectural design of the
> ApplicationMaster in MR3. In particular, it makes better use of
> computing resources and thus yields higher throughput for concurrent
> queries.
>
> --- Sungwoo Park
>
> --
Thai


[Announce] Hive-MR3: Hive running on top of MR3

2018-04-04 Thread Sungwoo Park
Hello Hive users,

I am pleased to announce MR3 and Hive-MR3. Please visit the following
webpage for everything on MR3 and Hive-MR3:

https://mr3.postech.ac.kr/
http://datamonad.com

Here is a description of MR3 and Hive-MR3 from the webpage:

MR3 is a new execution engine for Hadoop. Similar in spirit to Tez, it can
be thought of as an enhancement of Tez with simpler design, better
performance, and more features. MR3 is ready for production use as it
supports all major features from Tez such as Kerberos-based security,
authentication and authorization, fault-tolerance, and recovery. MR3 is
implemented in Scala.

Hive-MR3 is an extension of Hive that runs on top of MR3. In order to
exploit new features in MR3, Hive-MR3 is built on a modified backend of
Hive. In comparison with Hive-on-Tez, Hive-MR3 generally runs faster for
sequential queries by virtue of the simple architectural design of the
ApplicationMaster in MR3. In particular, it makes better use of
computing resources and thus yields higher throughput for concurrent
queries.

--- Sungwoo Park


ForeignKeysRequest Issue

2018-04-04 Thread Courtney Edwards
Hi,

In implementing the ThriftHiveMetastore.Iface we have run into some issues 
understanding the get_foreign_keys(ForeignKeysRequest request) method contract.  
We are receiving ForeignKeysRequests from the Hive CLI (query: 'describe 
formatted db1.table1').  The table has a FK constraint on another table.  
ForeignKeysRequest.getParent_db_name() returns null.  The documentation is 
sparse on what the `parent_db_name` and `foreign_db_name` fields represent, and 
after a number of educated guesses we felt it would be best to consult the 
mailing list.  If possible, could someone please explain these fields?

Thanks,
Courtney Edwards,
Software Engineer
Hotels.com – an Expedia Group Brand
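
For context, a minimal setup along the lines described in the post (a sketch;
the parent table and constraint names are made up) that leads the CLI to issue
a get_foreign_keys call would be:

CREATE TABLE db1.parent (id INT, PRIMARY KEY (id) DISABLE NOVALIDATE);
CREATE TABLE db1.table1 (
  id        INT,
  parent_id INT,
  CONSTRAINT fk_table1_parent FOREIGN KEY (parent_id)
    REFERENCES db1.parent (id) DISABLE NOVALIDATE
);
-- Describing the child table is what triggers the ForeignKeysRequest:
DESCRIBE FORMATTED db1.table1;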






Re: Building Datwarehouse Application in Spark

2018-04-04 Thread Richard A. Bross
Mahender,

To really address your question, I think you'd have to supply a bit more 
information, such as the kind of data that you want to save: RDBMS-type lookups, 
key/value/index-type lookups, insert velocity, etc.  This wide range of 
technologies is suited to different use cases, although they overlap in 
some areas.

In a previous position that I held, we used Spark on Cassandra to solve a 
similar problem.  The DataStax distribution puts Spark worker nodes directly on 
the Cassandra nodes.  Because Cassandra partitions the data across nodes based 
on a row key, it's a nice match.  If the key is chosen properly, the Spark nodes 
are typically accessing local data on the Cassandra nodes, meaning that there 
are very few shuffles for direct queries and that inserts go directly to the 
proper Cassandra nodes.  We had time-series data: our row keys were unique and 
our column keys were the timestamps.  In that case our queries were done 
directly with the Cassandra clients for the most part, with Spark SQL primarily 
used for ad-hoc queries.  

At my current position, we load raw data directly into Hive (using HiveQL) and 
then use Presto for queries.  That's our OLAP data store.  You can use any 
number of other tools to query Hive-created data stores as well.

Then we have another pipeline that takes the same raw data, uses Spark for the 
ETL, and then inserts the results into Aurora (MySQL).  The schema is designed 
for specific queries, so the Spark ETL transforms the data to fit that schema 
and allow efficient updates to those tables.  That's our OLTP data store, and we 
use standard SQL for queries.

Rick


- Original Message -
From: "Furcy Pin" 
To: user@hive.apache.org
Sent: Wednesday, April 4, 2018 6:58:58 AM
Subject: Re: Building Datwarehouse Application in Spark


Hi Mahender, 


Did you look at this? https://www.snappydata.io/blog/the-spark-database 


But I believe that most people handle this use case by either using: 
- Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...) if 
the data is not too big 
- Their favorite New-SQL storage (Cassandra, HBase) if the data is too big and 
needs to be distributed 


Spark generally makes it easy enough to query these other databases to allow 
you to perform analytics. 


Hive and Spark have been designed as OLAP tools, not OLTP. 
I'm not sure what features you are seeking for your SCD but they probably won't 
be part of Spark's core design. 


Hope this helps, 


Furcy 






On 4 April 2018 at 11:29, Mahender Sarangam < mahender.bigd...@outlook.com > 
wrote: 




Hi, 
Does anyone have a good architecture document or design principles for building 
a warehouse application using Spark? 


Is it better to create a Hive context and do the transformations with HQL, or 
to load the files directly into a DataFrame and do the data transformations 
there? 


We need to implement SCD Type 2 in Spark. Is there a good document/reference 
for building Type 2 warehouse objects? 


Thanks in advance 


/Mahender 


Re: Building Datwarehouse Application in Spark

2018-04-04 Thread Furcy Pin
Hi Mahender,

Did you look at this? https://www.snappydata.io/blog/the-spark-database

But I believe that most people handle this use case by either using:
- Their favorite regular RDBMS (mySQL, postgres, Oracle, SQL-Server, ...)
if the data is not too big
- Their favorite New-SQL storage (Cassandra, HBase) if the data is too big
and needs to be distributed

Spark generally makes it easy enough to query these other databases to
allow you to perform analytics.

Hive and Spark have been designed as OLAP tools, not OLTP.
I'm not sure what features you are seeking for your SCD but they probably
won't be part of Spark's core design.

Hope this helps,

Furcy



On 4 April 2018 at 11:29, Mahender Sarangam 
wrote:

> Hi,
> Does anyone have a good architecture document or design principles for building
> a warehouse application using Spark?
>
> Is it better to create a Hive context and do the transformations with HQL, or
> to load the files directly into a DataFrame and do the data transformations
> there?
>
> We need to implement SCD Type 2 in Spark. Is there a good
> document/reference for building Type 2 warehouse objects?
>
> Thanks in advance
>
> /Mahender
>


Building Datwarehouse Application in Spark

2018-04-04 Thread Mahender Sarangam
Hi,
Does anyone have a good architecture document or design principles for building 
a warehouse application using Spark?

Is it better to create a Hive context and do the transformations with HQL, or 
to load the files directly into a DataFrame and do the data transformations 
there?

We need to implement SCD Type 2 in Spark. Is there a good document/reference 
for building Type 2 warehouse objects?

Thanks in advance

/Mahender
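
For readers unfamiliar with the pattern being asked about, here is a rough
sketch of SCD Type 2 maintenance. It is shown with Hive's ACID MERGE
(available on transactional tables since Hive 2.2) rather than Spark, since
this is the Hive list; in Spark the same logic is usually expressed with
DataFrame joins. All table and column names are made up:

-- Staging table holding the latest extract (assumed to be loaded elsewhere):
CREATE TABLE stg_customer (customer_id BIGINT, address STRING) STORED AS ORC;

-- Dimension table keeps one "current" row per business key plus history rows.
-- MERGE needs an ACID table (ORC, bucketed, transactional=true in Hive 2.x),
-- and ACID must be enabled on the cluster (hive.txn.manager=DbTxnManager, etc.).
CREATE TABLE dim_customer (
  customer_id BIGINT,
  address     STRING,
  valid_from  DATE,
  valid_to    DATE,
  is_current  BOOLEAN
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Step 1: close the current row for any customer whose address changed, and
-- insert brand-new customers.
MERGE INTO dim_customer d
USING stg_customer s
ON d.customer_id = s.customer_id
WHEN MATCHED AND d.is_current AND d.address <> s.address THEN
  UPDATE SET is_current = false, valid_to = current_date
WHEN NOT MATCHED THEN
  INSERT VALUES (s.customer_id, s.address, current_date, NULL, true);

-- Step 2: insert the new "current" version for customers whose row was just
-- closed in step 1.
INSERT INTO dim_customer
SELECT s.customer_id, s.address, current_date, CAST(NULL AS DATE), true
FROM stg_customer s
LEFT JOIN dim_customer d
  ON d.customer_id = s.customer_id AND d.is_current
WHERE d.customer_id IS NULL;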