Re: Validating Hive statements

2020-02-19 Thread Elliot West
Hi,

If I recall correctly, not all script input is handled by the parser, and
the CLI takes care of some statements prior to parsing of SQL - comments
are an example. Also, there is some divergence between Beeline and the Hive
CLI. In HiveRunner we handled this by providing different CLI emulations:

https://github.com/klarna/HiveRunner/tree/master/src/main/java/com/klarna/hiverunner/sql/cli
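
For what it's worth, the kind of pre-filtering involved looks roughly like the
sketch below - purely illustrative, with a naive semicolon split and a
far-from-exhaustive list of CLI-handled prefixes (the HiveRunner emulations
linked above handle this more carefully):

---
import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;

public class ScriptValidator {

  // Statements consumed by the CLI rather than the SQL parser (illustrative only).
  private static final String[] CLI_ONLY_PREFIXES = {"SET", "ADD", "DELETE", "DFS", "RESET", "!"};

  public static void validate(String script) throws ParseException {
    ParseDriver parseDriver = new ParseDriver();
    // Naive split: will break on semicolons inside string literals.
    for (String statement : script.split(";")) {
      // Strip full-line '--' comments and surrounding whitespace.
      String cleaned = statement.replaceAll("(?m)^\\s*--.*$", "").trim();
      if (cleaned.isEmpty() || isCliOnly(cleaned)) {
        continue; // Not SQL; the CLI would consume this before the parser sees it.
      }
      parseDriver.parse(cleaned); // Throws ParseException if the statement is invalid.
    }
  }

  private static boolean isCliOnly(String statement) {
    String upper = statement.toUpperCase();
    for (String prefix : CLI_ONLY_PREFIXES) {
      if (upper.startsWith(prefix)) {
        return true;
      }
    }
    return false;
  }
}
---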

Elliot.

On Wed, 19 Feb 2020 at 13:55, Odon Copon  wrote:

> Hi,
> I was wondering what would be the easiest way to validate a Hive script
> with multiple query statements offline. I thought it was possible to do
> that with the following Java code, but it doesn't look like it's possible for
> all of them:
>
> ---
> import org.apache.hadoop.hive.ql.parse.ParseDriver;
>
> ParseDriver pd = new ParseDriver();
> pd.parse(query);
> ---
>
> With that tiny snippet, I'm able to validate some queries, but it fails with
> other statements like:
>  - ADD JAR ;
>
> or:
>  - SET =true;
>
> I would prefer not to omit those, and to be able to parse a script with multiple
> query statements.
> Are there any tips you could give me to help me with this? Currently, I'm
> splitting by semicolon and discarding lines with ADD and SET statements,
> but there must be something I'm missing.
>
> Thanks.
>


Apache Iceberg integration

2020-01-08 Thread Elliot West
Hello,

We're considering working on an integration of Apache Iceberg with Hive,
initially so that the latest snapshot of Iceberg tables can be queried via
Hive, but later to allow the writing of data using the Iceberg table format.

I wanted to first check for the existence and status of any similar efforts
so that we do not find ourselves duplicating work unnecessarily. I've
checked both the Iceberg and Hive projects and can find no issues that
suggest that such an integration is underway or planned (only HIVE-19457,
which I raised and which remains open).

If one or more efforts are underway, we'd certainly be open to contributing.
If not, we'd be keen to capture any thoughts from the community on
preferred or recommended technical approaches.

I see that some work occurred on MR In/Out formats
 which might serve as
a foundation, so we'll certainly be investigating those further.

Thanks,

Elliot.


Re: Gather Partition Locations

2019-11-12 Thread Elliot West
Hello,

We faced a similar problem. Additionally, we had job clients that were
difficult to integrate directly with the Thrift API, but that needed to resolve
file locations via the metastore. To handle this, we built a cut-down service
with a REST API that fronts the Hive metastore. The API is optimised for
this specific case of retrieving lists of file locations for some set of
partitions. A nice thing about this approach is that our users are able to
fetch file lists that are easy to parse, using very simple integrations (a
curl command for example).
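
The metastore side of this boils down to a bulk partition listing, which is a
single Thrift call rather than one DESCRIBE per partition. A minimal sketch of
doing that directly with the client (metastore URI, database, and table names
are placeholders) looks roughly like this:

---
import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;

public class PartitionLocations {
  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    conf.set("hive.metastore.uris", "thrift://metastore-host:9083"); // placeholder endpoint
    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // One call returns all partitions; (short) -1 means 'no limit'.
      List<Partition> partitions = client.listPartitions("my_db", "my_table", (short) -1);
      for (Partition partition : partitions) {
        System.out.println(partition.getSd().getLocation());
      }
    } finally {
      client.close();
    }
  }
}
---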

Cheers,

Elliot.

On Fri, 8 Nov 2019 at 15:44,  wrote:

> Hi,
> I have a question about how to get the location for a bunch of partitions.
> My current approach is to use the Hive query `DESCRIBE EXTENDED 
> PARTITION()`
>
> I'm getting back a JSON response (if I set the return type to JSON) which
> has the HDFS location in it.
>
> But if I have, let's say, 1000 partitions and every query needs 0.5 sec,
> I have to wait 500 sec.
>
> So my question is, do you have a single query to gather all locations?
> Or do you have a workaround to get the locations faster?
> I'm thinking about querying the metastore RDS directly, similar to
> http://www.openkb.info/2015/04/how-to-list-table-or-partition-location.html
> But in an enterprise environment I'm pretty sure this approach would not be
> the best, because the RDS (MySQL or Derby) may not be reachable, or
> I may not have permission to access it.
>
> Any other hints or ideas on how to get all partition locations from, let's
> say, an external table with custom partition locations in a performant way?
>
> Thanks
> Marko
>
>
>


Real world Hive transactional table use

2019-09-04 Thread Elliot West
Hello,

I'm looking for examples and experiences of using Hive ACID tables for
large-scale, cloud-based data lakes - specifically where the data is stored
in a file store such as S3. I'm assuming that such usage is growing, given
that transactional tables are the default managed table type in Hive 3.

Would anyone care to share their experiences?

Thanks,

Elliot.


Optimal approach for changing file format of a partitioned table

2018-08-04 Thread Elliot West
Hi,

I’m trying to simply change the format of a very large partitioned table
from JSON to ORC. I’m finding that it is unexpectedly resource intensive,
primarily due to a shuffle phase on the partition key. I end up running
out of disk space in what looks like a spill to disk in the reducers.
However, the partitioning scheme is identical on both the source and the
destination, so my expectation is a map-only job that simply re-encodes each
file.

I’m using INSERT OVERWRITE TABLE with dynamic partitioning. I suspect I
could resolve my issue by allocating more storage to the task nodes.
However, can anyone advise a more resource and time efficient approach?
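
For reference, the statement has roughly the following shape (table, column,
and partition names are placeholders). I suspect the reduce stage comes from
the sorted dynamic partition optimisation, although I've not confirmed that:

---
SET hive.exec.dynamic.partition.mode=nonstrict;
-- If enabled, this optimisation introduces a reduce/sort on the partition key;
-- disabling it typically gives a map-only plan, at the cost of one open writer
-- per partition per mapper.
SET hive.optimize.sort.dynamic.partition=false;

INSERT OVERWRITE TABLE orc_table PARTITION (part_col)
SELECT col_a, col_b, part_col
FROM json_table;
---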

Cheers,

Elliot.


Re: Hive storm streaming with s3 file system

2018-06-12 Thread Elliot West
I do not believe that S3 is currently a supported filesystem for
transactional tables. I believe there are plans to make it so.

On 12 June 2018 at 17:50, Abhishek Raj  wrote:

> Hi. I'm using HiveBolt from Apache Storm to stream
> 
> data into a transactional Hive table. It works great when the table is
> backed by HDFS, but starts throwing errors when the location is S3. The
> error I get is, "No filesystem for scheme: s3"
>
> Just wondering if it's possible to stream data into an S3-backed Hive table
> from Storm, or is HDFS the only filesystem supported?
>
> Any insights would be great.
>
> Thanks.
>


Re: Hive External Table on particular set of files.

2018-06-03 Thread Elliot West
On which type of file system are you storing the data? S3? HDFS? Other?

On Sun, 3 Jun 2018 at 08:26, Mahender Sarangam 
wrote:

> We are copying files from our upstream system which are in JSON GZ format.
> They follow a pattern for every daily slice, say YYYYMMDDHH
> (2018053100), and they maintain two folders, DATA and METADATA, where DATA
> holds the actual data and METADATA holds the row count of that day's data. We
> need to create an external table on top of the copied data that considers
> *.json.gz extension files only, excluding other file extensions. We don't
> want to copy files into another location since they are large in size. We
> also tried the INPUT__FILE__NAME virtual column, but it didn't work. Any
> suggestions for this scenario?
>


org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory

2018-05-14 Thread Elliot West
Hello,

I've been looking at Amazon's integration of their Glue service with Hive
in EMR and notice that they achieve this with:

   - An AWS Glue specific implementation of
   org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory
   (com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory)
   - A configuration property to set the HMS client factory
   implementation: hive.metastore.client.factory.class

However, despite searching on https://github.com/apache/hive, I cannot find
the o.a.h.h.q.m.HMSCF class, or the configuration key anywhere in vanilla
Apache Hive even though the packaging suggests it is not an Amazon addition
specific to EMR. I do find a similar class in
org.apache.hadoop.hive.ql.security.authorization.plugin however.

Can you tell me if this is an Amazon-specific construct, or whether my
code-searching abilities are failing me? If this has been added by Amazon to EMR
alone, are there any plans to add similar functionality back to the Apache
Hive project, and if so, when?

I'd be keen to see relevant GitHub links, JIRAs, etc.

Cheers,

Elliot.


Re: What does the ORC SERDE do

2018-05-13 Thread Elliot West
Hi Jörn,

I’m curious to know how the SerDe framework provides the means to deal with
partitions, table properties, and statistics? I was under the impression
that these were in the domain of the metastore and I’ve not found anything
in the SerDe interface related to these. I would appreciate if you could
point me in the direction of anything I’ve missed.

Thanks,

Elliot.

On Sun, 13 May 2018 at 15:42, Jörn Franke  wrote:

> For the details you can check the source code, but a SerDe needs to translate an
> object to a Hive object and vice versa. Usually this is very simple (simply
> passing the object through, or creating a HiveDecimal, etc.). It also provides an
> ObjectInspector that basically describes an object in more detail (e.g. to be
> processed by a UDF). For example, it can tell you the precision and scale of
> an object. In the case of ORC it also describes how a bunch of objects
> (vectorized) can be mapped to Hive objects and the other way around.
> Furthermore, it provides statistics and the means to deal with
> partitions as well as table properties (!= input/output format properties).
> Although it sounds complex, Hive provides most of the functionality, so
> implementing a SerDe is easy most of the time.
>
> > On 13. May 2018, at 16:34, 侯宗田  wrote:
> >
> > Hello, everyone.
> >   I know the JSON SerDe turns the fields in a row into JSON format, and the CSV SerDe
> > turns them into CSV format, using their SERDEPROPERTIES. But I wonder what the ORC
> > SerDe does when I choose to store as the ORC file format, and why there are
> > still escape and separator characters in the ORC SERDEPROPERTIES. The same goes for RC and Parquet. I
> > think they are just about how data is stored and compressed via their input and
> > output formats respectively, but I don’t know what their SerDes do. Can
> > anyone give some hints?
>
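
To make that division of responsibilities a bit more concrete, below is a
heavily stripped-down sketch of the Hive 2.x SerDe contract: a single string
column, no SERDEPROPERTIES handling, and Text as the serialized form. It is
illustrative only - the real ORC SerDe is much thinner than you might expect,
since the actual encoding and compression largely live behind the ORC
input/output formats.

---
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.serde2.AbstractSerDe;
import org.apache.hadoop.hive.serde2.SerDeException;
import org.apache.hadoop.hive.serde2.SerDeStats;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

/** A single string column named 'line'; purely illustrative. */
public class MinimalSerDe extends AbstractSerDe {

  private StructObjectInspector inspector;

  @Override
  public void initialize(Configuration conf, Properties tableProperties) {
    // Table and SERDEPROPERTIES (separators, escape characters, etc.) would be read here.
    inspector = ObjectInspectorFactory.getStandardStructObjectInspector(
        Arrays.asList("line"),
        Arrays.<ObjectInspector>asList(PrimitiveObjectInspectorFactory.javaStringObjectInspector));
  }

  @Override
  public ObjectInspector getObjectInspector() {
    // Describes the row shape (field names, types, precision/scale, ...) to the engine and UDFs.
    return inspector;
  }

  @Override
  public Object deserialize(Writable blob) {
    // File-format record -> Hive row object.
    return Arrays.asList(blob.toString());
  }

  @Override
  public Writable serialize(Object row, ObjectInspector rowInspector) throws SerDeException {
    // Hive row object -> file-format record.
    StructObjectInspector structInspector = (StructObjectInspector) rowInspector;
    List<Object> fields = structInspector.getStructFieldsDataAsList(row);
    return new Text(String.valueOf(fields.get(0)));
  }

  @Override
  public Class<? extends Writable> getSerializedClass() {
    return Text.class;
  }

  @Override
  public SerDeStats getSerDeStats() {
    return new SerDeStats(); // Raw/serialized size statistics could be reported here.
  }
}
---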


Re: Hive remote databases/tables proposal

2018-05-10 Thread Elliot West
The main problem I see with a SerDe based approach is that this abstraction
is not able to expose the needed set of metadata for the target table.
While the SerDe can return the schema (via getObjectInspector() I
presume), there is no provision for the delivery of available partitions,
or table and column statistics.

On a related note, I believe this might also preclude the SerDe from acting
as the main integration point for an Iceberg integration (
https://issues.apache.org/jira/browse/HIVE-19457), as this too will need to
pass additional metadata that is stored outside of the metastore and does
not fall into the scope of the SerDe interface.

The org.apache.hadoop.hive.ql.metadata.MetastoreClientFactory integration
point looks promising for both of these cases, but I can only find an
operational implementation of this in EMR.

Cheers,

Elliot.

On 27 April 2018 at 17:32, Elliot West <tea...@gmail.com> wrote:

> Hi Johannes,
>
> We did not. I presume that your suggestion is that my use case could be
> implemented as a storage handler, and not that we access remote Hive data
> via JDBC (and by implication, HS2)?
>
> I must confess that I hadn't considered this approach, likely because for
> some time I'd assumed that a storage handler could not also be the source
> of table metadata. However, lately I've been externalizing schemas with the
> AvroSerDe and so I now have practical experience that demonstrates that
> isn't the case.
>
> It's a very good idea and I'm keen to look into the practicalities.
>
> Thank you for your helpful reply.
>
> Elliot.
>
>
> On 26 April 2018 at 17:28, Johannes Alberti <johan...@altiscale.com>
> wrote:
>
>> Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler
>> and discussed the pros/cons/similarities of the qubole approach
>>
>> On Thu, Apr 26, 2018 at 4:01 AM, Elliot West <tea...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
>>> Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for
>>> federating multiple Hive clusters and providing the illusion of a unified
>>> data catalog from disparate instances.
>>>
>>> We’ve been running Waggle Dance in production for well over a year and
>>> it has formed a critical part of our data platform architecture and
>>> infrastructure. We believe that this type of functionality will be of
>>> increasing importance as Hadoop and Hive workloads migrate to the cloud.
>>> While Waggle Dance is one solution, significant benefits could be realized
>>> if these kinds of abilities were an integral part of the Hive platform.
>>>
>>> If this sounds of interest, I've created a proposal on the Hive wiki.
>>> I've outlined why we think such a feature is needed in Hive, the benefits
>>> gained by offering it as a built-in feature, and representation of a
>>> possible implementation. Our proposed implementation draws inspiration from
>>> the remote table features present in some traditional RDBMSes, which may
>>> already be familiar to you.
>>>
>>> https://cwiki.apache.org/confluence/pages/viewpage.action?pa
>>> geId=80452092
>>>
>>> Feedback gratefully accepted,
>>>
>>> Elliot.
>>>
>>> Senior Engineer
>>> Big Data Platform Team
>>> Hotels.com
>>>
>>
>>
>


Re: May 2018 Hive User Group Meeting

2018-05-02 Thread Elliot West
+1 for streaming or a recording. Content looks excellent.

On 2 May 2018 at 15:51, dan young  wrote:

> Looks like great talks, will this be streamed anywhere?
>
> On Wed, May 2, 2018, 8:48 AM Sahil Takiar  wrote:
>
>> Hey Everyone,
>>
>> The agenda for the meetup has been set and I'm excited to say we have
>> lots of interesting talks scheduled! Below is final agenda, the full list
>> of abstracts will be sent out soon. If you are planning to attend, please
>> RSVP on the meetup link so we can get an accurate headcount of attendees (
>> https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/).
>>
>> 6:30 - 7:00 PM Networking and Refreshments
>> 7:00PM - 8:20 PM Lightning Talks (10 min each) - 8 talks total
>>
>>- What's new in Hive 3.0.0 - Ashutosh Chauhan
>>- Hive-on-Spark at Uber: Efficiency & Scale - Xuefu Zhang
>>- Hive-on-S3 Performance: Past, Present, and Future - Sahil Takiar
>>- Dali: Data Access Layer at LinkedIn - Adwait Tumbde
>>- Parquet Vectorization in Hive - Vihang Karajgaonkar
>>- ORC Column Level Encryption - Owen O’Malley
>>- Running Hive at Scale @ Lyft - Sharanya Santhanam, Rohit Menon
>>- Materialized Views in Hive - Jesus Camacho Rodriguez
>>
>> 8:30 PM - 9:00 PM Hive Metastore Panel
>>
>>- Moderator: Vihang Karajgaonkar
>>- Participants:
>>   - Daniel Dai - Hive Metastore Caching
>>   - Alan Gates - Hive Metastore Separation
>>   - Rituparna Agrawal - Customer Use Cases & Pain Points of (Big)
>>   Metadata
>>
>> The Metastore panel will consist of a short presentation by each panelist
>> followed by a Q&A session driven by the moderator.
>>
>> On Tue, Apr 24, 2018 at 2:53 PM, Sahil Takiar 
>> wrote:
>>
>>> We still have a few slots open for lightning talks, so if anyone is
>>> interested in giving a presentation don't hesitate to reach out!
>>>
>>> If you are planning to attend the meetup, please RSVP on the Meetup link
>>> (https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/) so
>>> that we can get an accurate headcount for food.
>>>
>>> Thanks!
>>>
>>> --Sahil
>>>
>>> On Wed, Apr 11, 2018 at 5:08 PM, Sahil Takiar 
>>> wrote:
>>>
 Hi all,

 I'm happy to announce that the Hive community is organizing a Hive user
 group meeting in the Bay Area next month. The details can be found at
 https://www.meetup.com/Hive-User-Group-Meeting/events/249641278/

 The format of this meetup will be slightly different from previous
 ones. There will be one hour dedicated to lightning talks, followed by a
 group discussion on the future of the Hive Metastore.

 We are inviting talk proposals from Hive users as well as developers at
 this time. Please contact either myself (takiar.sa...@gmail.com),
 Vihang Karajgaonkar (vih...@cloudera.com), or Peter Vary (
 pv...@cloudera.com) with proposals. We currently have 5 openings.

 Please let me know if you have any questions or suggestions.

 Thanks,
 Sahil

>>>
>>>
>>>
>>> --
>>> Sahil Takiar
>>> Software Engineer
>>> takiar.sa...@gmail.com | (510) 673-0309
>>>
>>
>>
>>
>> --
>> Sahil Takiar
>> Software Engineer
>> takiar.sa...@gmail.com | (510) 673-0309
>>
>


Re: Hive remote databases/tables proposal

2018-04-27 Thread Elliot West
Hi Johannes,

We did not. I presume that your suggestion is that my use case could be
implemented as a storage handler, and not that we access remote Hive data
via JDBC (and by implication, HS2)?

I must confess that I hadn't considered this approach, likely because for
some time I'd assumed that a storage handler could not also be the source
of table metadata. However, lately I've been externalizing schemas with the
AvroSerDe and so I now have practical experience that demonstrates that
isn't the case.

It's a very good idea and I'm keen to look into the practicalities.

Thank you for your helpful reply.

Elliot.


On 26 April 2018 at 17:28, Johannes Alberti <johan...@altiscale.com> wrote:

> Did you guys look at https://github.com/qubole/Hive-JDBC-Storage-Handler
> and discussed the pros/cons/similarities of the qubole approach
>
> On Thu, Apr 26, 2018 at 4:01 AM, Elliot West <tea...@gmail.com> wrote:
>
>> Hello,
>>
>> At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
>> Dance <https://github.com/HotelsDotCom/waggle-dance>, a tool for
>> federating multiple Hive clusters and providing the illusion of a unified
>> data catalog from disparate instances.
>>
>> We’ve been running Waggle Dance in production for well over a year and it
>> has formed a critical part of our data platform architecture and
>> infrastructure. We believe that this type of functionality will be of
>> increasing importance as Hadoop and Hive workloads migrate to the cloud.
>> While Waggle Dance is one solution, significant benefits could be realized
>> if these kinds of abilities were an integral part of the Hive platform.
>>
>> If this sounds of interest, I've created a proposal on the Hive wiki.
>> I've outlined why we think such a feature is needed in Hive, the benefits
>> gained by offering it as a built-in feature, and representation of a
>> possible implementation. Our proposed implementation draws inspiration from
>> the remote table features present in some traditional RDBMSes, which may
>> already be familiar to you.
>>
>> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092
>>
>> Feedback gratefully accepted,
>>
>> Elliot.
>>
>> Senior Engineer
>> Big Data Platform Team
>> Hotels.com
>>
>
>


Hive remote databases/tables proposal

2018-04-26 Thread Elliot West
Hello,

At the 2018 DataWorks conference in Berlin, Hotels.com presented Waggle
Dance , a tool for federating
multiple Hive clusters and providing the illusion of a unified data catalog
from disparate instances.

We’ve been running Waggle Dance in production for well over a year and it
has formed a critical part of our data platform architecture and
infrastructure. We believe that this type of functionality will be of
increasing importance as Hadoop and Hive workloads migrate to the cloud.
While Waggle Dance is one solution, significant benefits could be realized
if these kinds of abilities were an integral part of the Hive platform.

If this sounds of interest, I've created a proposal on the Hive wiki. I've
outlined why we think such a feature is needed in Hive, the benefits gained
by offering it as a built-in feature, and a representation of a possible
implementation. Our proposed implementation draws inspiration from the
remote table features present in some traditional RDBMSes, which may
already be familiar to you.

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=80452092

Feedback gratefully accepted,

Elliot.

Senior Engineer
Big Data Platform Team
Hotels.com


org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory

2018-04-23 Thread Elliot West
Hello,

I'm looking for an abstraction to use for integrating with different
(non-Thrift) metadata catalog implementations. I know that AWS Glue manages
this, and so I have explored it in EMR (Hive 2.3.2) a little. I see that it uses
the "org.apache.hadoop.hive.ql.metadata.HiveMetaStoreClientFactory"
interface to do this. However, I cannot find this class anywhere in vanilla
Apache Hive.

Is this an Amazon-specific construct (if so, then why is it namespaced to
org.apache.hadoop.hive?), or are my code-searching abilities failing me?
Does this class exist in Apache Hive, and if so, where? (A link to GitHub
would be appreciated.)

Cheers,

Elliot.


Proposal: Apply SQL based authorization functions in the metastore.

2018-04-20 Thread Elliot West
Hello,

I’d like to propose that SQL based authorization (or something similar) be
applied and enforced also in the metastore service as part of the
initiative to extract HMS as an independent project. While any such
implementation cannot be ’system complete’ like HiveServer2 (HS2) (HMS has
no scope to intercept operations applied to table data, only metadata), it
would be a significant step forward for controlling the operations that can
be actioned by the many non-HS2 clients in the Hive ecosystem.

I believe this is a good time to consider this option as there is currently
much discussion in the Hive community on the future directions of HMS and
greater recognition that HMS is now seen as general data platform
infrastructure and not simply an internal Hive component.

Further details are below. I’d be grateful for any feedback, thoughts, and
suggestions on how this could move forward.

*Problem*
At this time, Hive’s SQL based authorization feature is the recommended
approach for controlling which operations may be performed on what by whom.
This feature is applied in the HS2 component. However, a large number of
platforms that integrate with Hive do not do so via HS2, instead talking to
the metastore service directly and so bypassing authorization. They can
perform destructive operations such as a table drop even though the
permissions declared in the metastore may explicitly forbid it as they are
able to circumvent the authorization logic in HS2.

In short, there seems to be a lack of encapsulation with authorization in
the metastore; HMS owns the metadata, is responsible for performing actions
on metadata, for maintaining permissions on what actions are permissible by
whom, and yet has no means to use the information it has to protect the
data it owns.

*Workarounds*
Common workarounds to this deficiency include falling back to storage based
authorization or running read only metastore instances. However, both of
these approaches have significant drawbacks:

   - File based auth does not function when using object stores such as S3
   and so is not usable in cloud deployments of Hive - a pattern that is
   seeing significant growth.
   - Read only metastores incur significant infrastructure and operational
   overheads, requiring a separate set of server instances, while delivering
   little functionality and blunt authorization capabilities. You cannot for
   example restrict a particular operation type, by a certain user, on a
   specific table. You are literally blocking all writes by directing
   different user groups to different network endpoints.

*Anti-patterns*
It might be tempting to simply suggest using HS2 for all access to Hive
data. However, while this is conceptually appealing, it’s not practical to
apply on large, rich, and diverse data platforms where tool
interoperability and broad compatibility are required. Additionally, it can
be argued that the API exposed by HS2, while useful for analytical tools,
is not fit for use by large ETL processes; for example: using a “SELECT *”
over JDBC as a source for a large Spark job doesn’t scale.

*High level implementation notes*
I believe that the HMS requires little (if any) refactoring to support the
implementation of SQL based auth in the metastore. It currently maintains
all of the necessary metadata that describes the authorization rules that
should be applied. It also has access to the principal wishing to perform a
certain action via the UGI mechanism. Finally, there is an existing hook
mechanism to intercept metadata operations and apply authorization.
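
To illustrate the existing hook point (only the mechanism, not the full
design): a listener registered via the hive.metastore.pre.event.listeners
property can veto metadata operations before they are applied. The sketch
below blocks DROP TABLE for all but a hypothetical privileged user; a real
implementation would instead consult the SQL based authorization metadata
that HMS already holds. Class and package names are from memory, so please
check them against your Hive version.

---
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.MetaStorePreEventListener;
import org.apache.hadoop.hive.metastore.api.InvalidOperationException;
import org.apache.hadoop.hive.metastore.api.MetaException;
import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
import org.apache.hadoop.hive.metastore.events.PreEventContext;
import org.apache.hadoop.security.UserGroupInformation;

/** Vetoes DROP TABLE for anyone other than a single (hypothetical) privileged principal. */
public class DropTableGuard extends MetaStorePreEventListener {

  public DropTableGuard(Configuration config) {
    super(config);
  }

  @Override
  public void onEvent(PreEventContext context)
      throws MetaException, NoSuchObjectException, InvalidOperationException {
    if (context.getEventType() == PreEventContext.PreEventType.DROP_TABLE) {
      String user;
      try {
        // The principal is available via the UGI mechanism mentioned above.
        user = UserGroupInformation.getCurrentUser().getShortUserName();
      } catch (Exception e) {
        throw new MetaException("Unable to resolve the calling principal: " + e.getMessage());
      }
      if (!"etl_admin".equals(user)) { // 'etl_admin' is a made-up privileged user
        throw new InvalidOperationException("User " + user + " may not drop tables");
      }
    }
  }
}
---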

In deployments that use HS2 exclusively, the proposed metastore resident
SQL based auth could either be disabled or used harmlessly in conjunction
with the HS2 implementation.

Thanks,

Elliot.

Elliot West
Senior Engineer
Data Platform Team
Hotels.com


Re: HQL parser internals

2018-03-19 Thread Elliot West
Hello again,

We're now testing our system against a corpus of Hive SQL statements in an
effort to quickly highlight edge cases, limitations, etc. We're finding that
org.apache.hadoop.hive.ql.parse.ParseUtils is stumbling on variables such
as ${hiveconf:varname}. Are variable substitutions handled prior to parsing
or within the parser itself? If in a pre-processing stage, are there any code
or utility classes within Hive that we can use as a reference, or to
provide this functionality?
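
To make the question concrete, the kind of pre-processing we'd otherwise have
to hand-roll looks something like the sketch below - a crude stand-in, with
the variable values supplied in a map:

---
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.hive.ql.parse.ParseDriver;
import org.apache.hadoop.hive.ql.parse.ParseException;

public class SubstituteThenParse {

  private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)}");

  /** Replace ${hiveconf:x} / ${hivevar:x} style references with supplied values. */
  static String substitute(String query, Map<String, String> variables) {
    Matcher matcher = VARIABLE.matcher(query);
    StringBuffer result = new StringBuffer();
    while (matcher.find()) {
      String name = matcher.group(1);
      String value = variables.get(name);
      if (value == null) {
        throw new IllegalArgumentException("No value supplied for ${" + name + "}");
      }
      matcher.appendReplacement(result, Matcher.quoteReplacement(value));
    }
    matcher.appendTail(result);
    return result.toString();
  }

  public static void main(String[] args) throws ParseException {
    Map<String, String> variables = new HashMap<>();
    variables.put("hiveconf:varname", "2018-03-19"); // made-up value

    String raw = "SELECT a FROM t WHERE dt = '${hiveconf:varname}'";
    new ParseDriver().parse(substitute(raw, variables)); // throws ParseException if invalid
  }
}
---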

Cheers,

Elliot.

On 19 February 2018 at 11:10, Elliot West <tea...@gmail.com> wrote:

> Thank you all for your rapid responses; some really useful information and
> pointers in there.
>
> We'll keep the list updated with our progress.
>
> On 18 February 2018 at 19:00, Dharmesh Kakadia <dhkaka...@gmail.com>
> wrote:
>
>> +1 for using ParseDriver for this. I have also used it to intercept and
>> augment the query AST.
>>
>> Also, I would echo others' sentiment that it's quite ugly. It would be
>> great if we could refactor/standardize this. That would make integrating
>> other systems a lot easier.
>>
>> Thanks,
>> Dharmesh
>>
>> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin <pin.fu...@gmail.com> wrote:
>>
>>> Hi Elliot,
>>>
>>> Actually, I have done quite similar work regarding Hive custom Parsing,
>>> you should have a look at my project: https://github.com/flaminem/flamy
>>>
>>> The Hive parsing related stuff is here: https://github.com/flami
>>> nem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>>> A good starting point to see how to parse queries is here:
>>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>>> /com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>>
>>>
>>> Basically, all you need is to use a org.apache.hadoop.hive.ql.pars
>>> e.ParseDriver.
>>>
>>> val pd: ParseDriver = new ParseDriver
>>> val tree: ASTNode = pd.parse(query, hiveContext)
>>>
>>> You then get the ASTNode, that you can freely parse and change.
>>> Also, I must say that it is quite ugly to manipulate, and the Presto
>>> Parser seems to be much better designed (but it is not the same syntax,
>>> unfortunately),
>>> I recommend to look at it to get better design ideas.
>>>
>>>
>>> If you want to enrich your Hive syntax like I did (I wanted to be able
>>> to parse ${VARS} in queries),
>>> you will not be able to use the HiveParser without some workaround.
>>> What I did was replacing these ${VARS} by strings "${VARS}" that the
>>> HiveParser would agree to parse,
>>> and that I could recognize afterwards...
>>>
>>> Also, if you are familiar with Scala, I recommend using it, it helps a
>>> lot...
>>>
>>> For instance, I have this class that transforms an AST back into a
>>> string query:
>>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>>> /com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>>> I could never have done something that good looking in Java...
>>>
>>> Finally this method helps a lot to understand how the hell the AST works:
>>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>>> /com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>>
>>> Make sure to write *tons* of unit tests too, you'll need them.
>>>
>>> Hope this helps,
>>>
>>> Furcy
>>>
>>>
>>>
>>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan <gop...@apache.org>
>>> wrote:
>>>
>>>>
>>>> > However, ideally we wish to manipulate the original query as
>>>> delivered by the user (or as close to it as possible), and we’re finding
>>>> that the tree has been modified significantly by the time it hits the hook
>>>>
>>>> That's CBO. It takes the Query - > AST -> Calcite Tree -> AST -> hook -
>>>> the bushy join conversion is already done by the time the hook gets called.
>>>>
>>>> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
>>>> hook.
>>>>
>>>> > Additionally we wish to track back ASTNodes to the character
>>>> sequences in the source HQL that were their origin (where sensible), and
>>>> ultimately hope to be able regenerate the query text from the AST.
>>>>
>>>> I started work on a Hive-unparser a while back based on this class, but
>>

Re: HQL parser internals

2018-02-19 Thread Elliot West
Thank you all for your rapid responses; some really useful information and
pointers in there.

We'll keep the list updated with our progress.

On 18 February 2018 at 19:00, Dharmesh Kakadia  wrote:

> +1 for using ParseDriver for this. I have also used it to intercept and
> augment the query AST.
>
> Also, I would echo others' sentiment that it's quite ugly. It would be great
> if we could refactor/standardize this. That would make integrating other
> systems a lot easier.
>
> Thanks,
> Dharmesh
>
> On Sat, Feb 17, 2018 at 12:07 AM, Furcy Pin  wrote:
>
>> Hi Elliot,
>>
>> Actually, I have done quite similar work regarding Hive custom Parsing,
>> you should have a look at my project: https://github.com/flaminem/flamy
>>
>> The Hive parsing related stuff is here: https://github.com/flami
>> nem/flamy/tree/master/src/main/scala/com/flaminem/flamy/parsing/hive
>> A good starting point to see how to parse queries is here:
>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>> /com/flaminem/flamy/parsing/hive/PopulateParserInfo.scala#L492
>>
>>
>> Basically, all you need is to use a org.apache.hadoop.hive.ql.pars
>> e.ParseDriver.
>>
>> val pd: ParseDriver = new ParseDriver
>> val tree: ASTNode = pd.parse(query, hiveContext)
>>
>> You then get the ASTNode, that you can freely parse and change.
>> Also, I must say that it is quite ugly to manipulate, and the Presto
>> Parser seems to be much better designed (but it is not the same syntax,
>> unfortunately),
>> I recommend to look at it to get better design ideas.
>>
>>
>> If you want to enrich your Hive syntax like I did (I wanted to be able to
>> parse ${VARS} in queries),
>> you will not be able to use the HiveParser without some workaround.
>> What I did was replacing these ${VARS} by strings "${VARS}" that the
>> HiveParser would agree to parse,
>> and that I could recognize afterwards...
>>
>> Also, if you are familiar with Scala, I recommend using it, it helps a
>> lot...
>>
>> For instance, I have this class that transforms an AST back into a string
>> query:
>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>> /com/flaminem/flamy/parsing/hive/ast/SQLFormatter.scala
>> I could never have done something that good looking in Java...
>>
>> Finally this method helps a lot to understand how the hell the AST works:
>> https://github.com/flaminem/flamy/blob/master/src/main/scala
>> /com/flaminem/flamy/parsing/hive/HiveParserUtils.scala#L593
>>
>> Make sure to write *tons* of unit tests too, you'll need them.
>>
>> Hope this helps,
>>
>> Furcy
>>
>>
>>
>> On 16 February 2018 at 21:20, Gopal Vijayaraghavan 
>> wrote:
>>
>>>
>>> > However, ideally we wish to manipulate the original query as delivered
>>> by the user (or as close to it as possible), and we’re finding that the
>>> tree has been modified significantly by the time it hits the hook
>>>
>>> That's CBO. It takes the Query - > AST -> Calcite Tree -> AST -> hook -
>>> the bushy join conversion is already done by the time the hook gets called.
>>>
>>> We need a Parser hook to hook it ahead of CBO, not a Semantic Analyzer
>>> hook.
>>>
>>> > Additionally we wish to track back ASTNodes to the character sequences
>>> in the source HQL that were their origin (where sensible), and ultimately
>>> hope to be able regenerate the query text from the AST.
>>>
>>> I started work on a Hive-unparser a while back based on this class, but
>>> it's a world of verbose coding.
>>>
>>> https://github.com/apache/hive/blob/master/ql/src/java/org/a
>>> pache/hadoop/hive/ql/optimizer/calcite/translator/ASTConverter.java#L850
>>>
>>> If you're doing active work on this, I'd like to help, because I need
>>> the AST -> query to debug CBO.
>>>
>>> > The use case, if you are interested, is a mutation testing framework
>>> for HQL. The testing of mutants is operational, but now we need to report
>>> on survivors, hence the need to track back from specific query elements to
>>> character sequences in the original query string.
>>>
>>> This sounds a lot like the fuzzing random-query-gen used in Cloudera to
>>> have Impala vs Hive bug-for-bug compat.
>>>
>>> https://cwiki.apache.org/confluence/download/attachments/273
>>> 62054/Random%20Query%20Gen-%20Hive%20Meetup.pptx
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>


HQL parser internals

2018-02-16 Thread Elliot West
Hello,

We need to be able to parse and manipulate an HQL query. To date we’ve been
intercepting and transforming the parse tree in the SemanticAnalyzerHook.
However, ideally we wish to manipulate the original query as delivered by
the user (or as close to it as possible), and we’re finding that the tree
has been modified significantly by the time it hits the hook. Additionally
we wish to track back ASTNodes to the character sequences in the source HQL
that were their origin (where sensible), and ultimately hope to be able to
regenerate the query text from the AST.

We are expecting to need to write our own parser to do this, but hope to
use the Hive grammar at least for the foundation. I wonder if anyone
familiar with this code might be able to suggest some good starting points.

The use case, if you are interested, is a mutation testing framework for
HQL. The testing of mutants is operational, but now we need to report on
survivors, hence the need to track back from specific query elements to
character sequences in the original query string.

Thanks,

Elliot.


Re: Proposal: File based metastore

2018-01-30 Thread Elliot West
Hi Ryan,

Is Hive support on the Iceberg roadmap? Presumably its
MetastoreClientFactory and storage API provide an integration point? Or is
there perhaps some architectural detail that makes this impractical?

I’m thinking not just of the ability to support Hive, but also the range of
tooling that already integrates with the metastore.

Cheers,

Elliot.

On Tue, 30 Jan 2018 at 18:17, Ryan Blue  wrote:

> Thanks, Owen.
>
> I agree, Iceberg addresses a lot of the problems that you're hitting here.
> It doesn't quite go as far as moving all metadata into the file system. You
> can do that in HDFS and implementations that support atomic rename, but not
> in S3 (Iceberg has an implementation of the HDFS one strategy). For S3 you
> need some way of making commits atomic, for which we are using a metastore
> that is far more light-weight. You could also use a ZooKeeper cluster for
> write-side locking, or maybe there are other clever ideas out there.
>
> Even if Iceberg is agnostic to the commit mechanism, it does almost all of
> what you're suggesting and does it in a way that's faster than the current
> metastore while providing snapshot isolation.
>
> rb
>
> On Mon, Jan 29, 2018 at 9:10 AM, Owen O'Malley 
> wrote:
>
>> You should really look at what the Netflix guys are doing on Iceberg.
>>
>> https://github.com/Netflix/iceberg
>>
>> They have put a lot of thought into how to efficiently handle tabular
>> data in S3. They put all of the metadata in S3 except for a single link to
>> the name of the table's root metadata file.
>>
>> Other advantages of their design:
>>
>>- Efficient atomic addition and removal of files in S3.
>>- Consistent schema evolution across formats
>>- More flexible partitioning and bucketing.
>>
>>
>> .. Owen
>>
>> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo 
>> wrote:
>>
>>> All,
>>>
>>> I have been bouncing around the earth for a while and have had the
>>> privilege of working at 4-5 places. On arrival each place was in a variety
>>> of states in their hadoop journey.
>>>
>>> One large company that I was at had a ~200 TB Hadoop cluster. They
>>> actually ran Pig, and their ops group REFUSED to support Hive, even though
>>> they had written thousands of lines of Pig macros to deal with selecting
>>> from a partition, or a Pig script file you would import so you would know
>>> what the columns of the data at location /x/y/z are.
>>>
>>> In another lifetime I was at a shop that used Scalding. Again, lots
>>> of custom effort there with Avro and Parquet, all to do things that Hive
>>> would do out of the box. Again, the biggest challenge is the Thrift service
>>> and metastore.
>>>
>>> In the cloud many people will use a bootstrap script
>>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
>>> or 'msck repair'
>>>
>>> The "rise of the cloud" has changed us all the metastore is being a
>>> database is a hard paradigm to support. Imagine for example I created data
>>> to an s3 bucket with hive, and another group in my company requires read
>>> only access to this data for an ephemeral request. Sharing the data is
>>> easy, S3 access can be granted, sharing the metastore and thrift services
>>> are much more complicated.
>>>
>>> So lets think out of the box:
>>>
>>>
>>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>>>
>>> Datastax was able to build a platform where the filesystem and the
>>> metastore were backed into Cassandra. Even though a HBase user would not
>>> want that, the novel thing about that approach is that the metastore was
>>> not "some extra thing in a database" that you had to deal with.
>>>
>>> What I am thinking is that for the user of s3, the metastore should be
>>> in s3. Probably in hidden files inside the warehouse/table directory(ies).
>>>
>>> Think of it as msck repair "on the fly" "
>>> https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
>>> "
>>>
>>> The implementation could be something like this:
>>>
>>> On startup, read hive.warehouse.dir and look for "_warehouse". That would help
>>> us locate the databases; in the databases we can locate tables, and with the
>>> tables we can locate partitions.
>>>
>>> This will of course scale horribly across tables with 9000
>>> partitions, but that would not be our use case. For all the people with
>>> "msck repair" in their bootstrap, this gives a much cleaner way of using Hive.
>>>
>>> The implementations could even be "Stacked": files first, metastore
>>> lookback second.
>>>
>>> It would also be wise to have a tool available in the CLI, "metastore
>>>  toJson", making it drop-dead simple to export the schema
>>> definitions.
>>>
>>> Thoughts?
>>>
>>>
>>>
>>
>
>
> --
> Ryan Blue
>


Re: Cannot create external table on S3; class S3AFileSystem not found

2017-12-09 Thread Elliot West
Which distribution are you using? Do you have hadoop-aws on the class path?
Is ‘/path/to/hadoop/install’ a literal value or a placeholder that you’re
using for the actual location?

Cheers,

Elliot.

On Sat, 9 Dec 2017 at 00:08, Scott Halgrim  wrote:

> Hi,
>
> I’ve been struggling with this for a few hours, hopefully somebody here
> can help me out.
>
> We have a lot of data in parquet format on S3 and we want to use Hive to
> query it. I’m running on ubuntu and we have a MySQL metadata store on AWS
> RDS.
>
> The command in the hive client I’m trying to run is:
>
> CREATE EXTERNAL TABLE
> my_schema.my_table
> (account_id INT,
> action VARCHAR(282),
> another_id INT
> yaid INT,
> `date` TIMESTAMP,
> deleted_at TIMESTAMP,
> id INT,
> lastchanged TIMESTAMP,
> thing_index DOUBLE,
> old_id INT,
> parent_id INT,
> running INT,
> other_id INT,
> another_thing VARCHAR(282),
> title VARCHAR(282),
> type_of VARCHAR(282))
> PARTITIONED BY (snapshot_date DATE)
> STORED AS parquet
> LOCATION 's3a://bucket/folder/foo_my_schema.my_table’;
>
>
> The error I get is:
>
> FAILED: SemanticException java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class
> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>
>
> I have this in my hive-site.xml file:
>
> <property>
> <name>hive.aux.jars.path</name>
> <value>/path/to/hadoop-install/hadoop-2.9.0/share/hadoop/tools/lib</value>
> </property>
>
>
>
> Another thing I tried was to create the external table without a location
> and then alter it to have the location:
>
> alter table my_schema.my_table set location
> "s3a://bucket/folder/foo_my_schema.my_table";
>
>
> And then I get a different error:
>
> FAILED: SemanticException Cannot connect to namenode, please check if
> host/port pair for s3a://bucket/folder/foo_my_schema.my_table is valid
>
>
> What could I be missing?
>
> Thanks!
>
> Scott
>


Re: Options for connecting to Apache Hive

2017-11-10 Thread Elliot West
Hi Jakob,

Assuming that your Hive deployment is running HiveServer2, you could issue
queries and obtain result sets via its Thrift API. Thrift has a broad set
of language implementations, including C IIRC. I believe this is also the
API used by Hive's JDBC connector, so it should be capable from a feature
set perspective.

Cheers - Elliot.

On 10 November 2017 at 10:28, Jakob Egger  wrote:

> Hi!
>
> I'm the developer of a database client tool, and I've received a request
> to add support for querying Apache Hive.
>
> (My tool lets the user execute SQL queries, and it allows browsing tables
> etc.)
>
> As a first step of evaluating this suggestion, I'm trying to find out if
> there is a convenient way to connect to Hive.
>
> From reading the documentation, it seems that the preferred way to connect
> seems to be using the JDBC driver. Since my app is not written in Java,
> this is probably not the way to go. Apart from that, I didn't find much on
> this topic in the docs.
>
> I have a few questions:
>
> 1) What ways are there to connect to Apache Hive?
>
> 2) Is there a C client library?
>
> 3) Is there any documentation on the wire protocol that Hive uses for
> client / server communication?
>
> I'd appreciate if someone who knows more about the project could point me
> in the right direction!
>
> Best regards,
> Jakob


Re: ACID update operation not working as expected

2017-10-18 Thread Elliot West
Did you manage to get any further with this?

On Fri, 6 Oct 2017 at 05:47, Manju A  wrote:

>
> > Hi Team,
> >
> >
> >
> > Using a Flume interceptor, I am reading messages from Kafka with key and
> > value pairs. The key is represented by an integer variable called pk in
> > the code below, and the value of the message is represented by the obj variable,
> > with Ctrl-A as the delimiter.
> >
> >
> >
> > For inserting into the Hive table, I am doing this:
> >
> > MutableRecord mutableRecordInput = new
> MutableRecord(pk,obj.split("\u0001"));
> >
> > MutableRecord mutableRecord = (MutableRecord)
> bucketIdResolver.attachBucketIdToRecord(mutableRecordInput);
> >
> > coordinator.insert(Collections. emptyList(), mutableRecord);
> >
> >
> >
> >
> >
> > For updating the records in the Hive table, with pk being the key, I am doing this:
> >
> > RecordIdentifier rd = new RecordIdentifier();
> >
> > MutableRecord mutableRecordInputup = new
> MutableRecord(pk,rd,obj.split("\u0001"));
> >
> > MutableRecord mutableRecordInputu = (MutableRecord)
> bucketIdResolver.attachBucketIdToRecord(mutableRecordInputup);
> >
> > coordinator.update(Collections. emptyList(),
> mutableRecordInputu);
> >
> >
> >
> >
> >
> > The inserts work fine. However, the updates are causing data inconsistency:
> > instead of updating, the first update actually inserts a new row, and
> > subsequent updates then update this inserted row, so it is not working
> > as expected. Please let me know if I am doing the updates in the right way.
> >
> > I am using Flume version 1.6 and Hive version 2.1, provided by MapR.
> >
> > Thanks and Regards,
> >
> > Manjunath Anand
>


Re: Hive and Schema Registry

2017-10-13 Thread Elliot West
We also use this feature of AvroSerDe and find it very useful. In our case
we copy the schema from our schema registry into S3 and reference it from
there. In effect, we listen to the internal topic used to store schemas by
our registry, and push to S3 whenever there is a new record. As well as
being automated, this also provides a degree of system decoupling; Hive
queries can be executed independently of the availability of our registry.
I appreciate that this is not a general solution.
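
The table side of this is just the standard avro.schema.url property pointing
at wherever the schema document was copied to. Roughly (bucket names, paths,
and the table itself are made up):

---
-- Use s3:// or s3a:// depending on your Hadoop/S3 filesystem configuration.
CREATE EXTERNAL TABLE my_events
STORED AS AVRO
LOCATION 's3://my-data-bucket/my_events/'
TBLPROPERTIES ('avro.schema.url'='s3://my-schema-bucket/my_events/latest.avsc');
---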

On the topic of this feature proposal: the addition of some JSON
filtering/extraction in the SerDe feels like a workaround for a very
specific design decision made in the Confluent API rather than a more
generally useful feature for the SerDe. Arguably it'd be more generally
useful if the Confluent API were amended/extended to return only the schema
document and not encapsulate it in a seemingly superfluous wrapper; that
way any system that can load an Avro schema from a URI could potentially
integrate with the registry with no Confluent-specific transformations.
However, I can understand any reluctance to make such a change.

Given that it would be possible to implement some fairly simple
workarounds, I don't think it's the responsibility of the Hive project to
bridge this gap.

All that said, if you are still keen on this approach, might I suggest
using a JSON Pointer to locate the relevant node in the returned document,
as this could be applied generally to many different JSON response
structures: https://tools.ietf.org/html/rfc6901

Elliot.

On Thu, 12 Oct 2017 at 14:52, Stephen Durfey  wrote:

> Recently my team has opened a discussion with Confluent [1] in regards to
> the schema registry being used to serve up avro schemas for the Hive
> AvroSerDe to make use of through 'avro.schema.url' config. Originally we
> were hoping to just get a REST endpoint that returns just the schema to
> avoid making any changes to the AvroSerDe. The confluent rest endpoint
> today returns the avro schema embedded as an attribute inside a json
> response [2] which makes it unusable by the AvroSerDe.
>
> I wanted to reach out to the community to talk about the possibility of
> enhancing the AvroSerDe to be able to make use of JSON responses returned
> from the configured URL. One of the possibilities mentioned in the
> confluent github issue was to add in a new (optional) configuration to
> identify the field within the JSON response and the AvroSerDe, and if set
> use that config to retrieve the schema from that attribute.
>
> We're open to other suggestions and would be happy to contribute the patch
> back to hive for whatever design is settled on.
>
> - Stephen
>
>
> [1] https://github.com/confluentinc/schema-registry/issues/629
> [2]
> https://docs.confluent.io/current/schema-registry/docs/api.html#get--subjects-(string-
> subject)-versions-(versionId- version)
>


Parameterized views

2017-10-02 Thread Elliot West
Hello,

Does any version of Hive support parameterized views? I'm thinking of
something like the following contrived example:

CREATE VIEW x AS
  SELECT a
  FROM y
  WHERE date = ${mydate}


I've not been able to get this to work in Hive 1.2.1 or 2.1.0, and wonder if
this is the intended behavior or if I'm not constructing the view correctly.

Is it perhaps better instead to push the filter columns up into the view, so
that they are accessible to the caller there, as follows?

CREATE VIEW x AS
  SELECT a, date
  FROM y


Would this have any limitations in comparison with the variable
substitution approach?

Thanks,

Elliot.


Re: Hive Table with Avro Union

2017-07-31 Thread Elliot West
Hi Nishanth,

While what you suggest is indeed feasible, it is not something that I'd
recommend for the following reasons:

   1. Consumers of the data will need to write conditional code in their
   HQL which will likely be difficult to write and maintain (although this
   might be unavoidable regardless).
   2. Support for the union type in the Hive query engine is incomplete
   [1], and allows you to only get string representations of the union branch
   values. These will be difficult to interrogate. Certainly the code
   in HIVE-15434 [2] can remedy this, but this has not been merged so you'll
   need to build and deploy yourself.
   3. Should your consumers later wish to query the table using some other
   data processing framework, they'll struggle to find support for reading the
   union type. Spark [3] and Flink are lacking IIRC.

If you really are unable to make the joins more performant then I suggest
you try some alternative data modeling approaches that do not require the
union type. Largely we can reference the mapping strategies employed to
represent class hierarchies in RDBMSes. In this context, you are already
using 'one table per type'. To consolidate, you could instead use a single
table with a discriminator field, or a single table with a nullable field
per type. Either of these approaches will of course require that you modify
your schema.
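
For illustration, a rough sketch of the 'discriminator field plus a nullable
field per type' shape (entirely made-up column names, and trimmed to two of
the six record types):

---
-- One row per record; 'record_type' discriminates, and only the matching
-- struct column is non-NULL for a given row.
CREATE TABLE consolidated_events (
  record_type STRING,
  order_event STRUCT<order_id: BIGINT, amount: DOUBLE>,
  click_event STRUCT<session_id: STRING, url: STRING>
)
STORED AS AVRO;

-- Consumers then branch on the discriminator rather than on a union:
SELECT order_event.order_id
FROM consolidated_events
WHERE record_type = 'order';
---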

(1) see warning here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-UnionTypesunionUnionTypes
(2) https://issues.apache.org/jira/browse/HIVE-15434
(3) https://issues.apache.org/jira/browse/SPARK-21529

Cheers - Elliot.

On 31 July 2017 at 20:21, Nishanth S  wrote:

> Hello All,
> I have a set of Avro schemas (6 of them) which do not have any relation
> between them. The data in them is relatively small and is stored as 6
> different Hive tables now. What I would want to do is convert them
> into a single Hive table using Avro unions. Is that something doable? Some
> of our queries have joins to these tables and it is affecting performance.
> I am guessing one Hive table will be a better approach. Can you chime in if
> you have done something similar? Any thoughts or pointers are highly
> appreciated.
>
> Thanks,
> Nishanth
>


Re: Hive federation service

2017-07-27 Thread Elliot West
Kurt,

You are correct, Waggle Dance makes no attempt to simplify access to the
underlying table data in the filestore(s). In practice we ensure that
consumers of a table in a given cluster have sufficient connectivity and
privilege to read the data referenced in the LOCATION properties. As you
suggest, the precise mechanisms for providing such access vary depending on
the location of both the consumer and the data. We find that implementing
these 'one off' configuration changes is a small price to pay for enabling
exploratory access to useful data that would otherwise remain isolated.

As mentioned in my original post, the motivations for Waggle Dance arose
during cloud migrations from a single on-premises cluster. Therefore, in
practice, we're not federating multiple HDFS backed Hive deployments; all
new deployments are storing their table data on S3 which perhaps simplifies
things a little.

We also use Data+Metadata replication to address our original problem of
data silos. While replication can reduce the complexity of the security and
network configuration somewhat, it does not avoid it completely.
Additionally, it can suffer from increased latency, increased storage cost,
and discourages exploration (you need to know what you want to replicate
first). We have built an in-house tool to perform a variety of Hive
replications, taking advantage of cloud platform specific features where we
can (e.g there are some optimal methods for moving data between S3
deployments). Replication is our preferred approach for frequently accessed
datasets, where repeated transfer costs can be far greater than duplicate
storage costs.

Thanks,

Elliot.

On 27 July 2017 at 16:40, Larson, Kurt <klar...@wbgames.com> wrote:

> Hi Carter and Elliot,
>
>
>
> First off:
>
>
>
> Carter, as the JDBC endpoint is serviced by the HiveServer2 service and
> not the Hive Metastore Service (HMS), I’d assume that the answer to your
> question is no and that you’d still need your own HiveServer2 to interact
> with the Waggle-Dance HMS proxy to process your JDBC API requests.
>
>
>
> Waggle-Dance question:
>
>
>
> As the Waggle-Dance diagram shows only the HMS Thrift API being federated,
> how is access provided to all the data that the LOCATION properties of Hive
> database objects point to?  It seems that Waggle-Dance goes to great
> lengths to navigate the network topology to get from the proxy to the
> remote HMSs.  However, there’s no mention of where the data is stored.
> Clearly, if all the remote HMSs store their data in a common service, like
> AWS S3 or Azure Blob Storage, it will be easier for the HMS proxy consumers
> to access it, but there may still be configuration challenges around multiple
> accounts and different permissions and roles.  If each remote HMS stores
> its data in a separate local distributed file system, like an HDFS cluster,
> or a mix of the two, there are additional network topology challenges similar
> to those of getting to the HMSs themselves.  Is there any solution or
> consideration for federated data access?
>
>
>
> Thanks!
>
> -Kurt
>
>
>
> *From:* Carter Shanklin [mailto:car...@hortonworks.com]
> *Sent:* Thursday, July 27, 2017 10:57 AM
> *To:* user@hive.apache.org
> *Subject:* Re: Hive federation service
>
>
>
> Elliot,
>
>
>
> Interesting stuff
>
>
>
> I have 3 questions
>
> 1. Can Waggle Dance deal with multiple kerberized Hadoop clusters?
>
> 2. Do you support 3 layers in the hierarchy (i.e. cluster.database.table)
> or 2 layers, with a requirement to avoid any possible name collisions in
> the mapping layer.
>
> 3. Is it compatible with JDBC? It wasn't clear to me since the diagrams
> all mention thrift.
>
>
>
> Thanks!
>
>
>
>
>
> *From: *Elliot West <tea...@gmail.com>
> *Reply-To: *"user@hive.apache.org" <user@hive.apache.org>
> *Date: *Thursday, July 27, 2017 at 06:21
> *To: *"user@hive.apache.org" <user@hive.apache.org>
> *Subject: *Hive federation service
>
>
>
> Hello,
>
>
>
> We've recently contributed our Hive federation service to the open source
> community:
>
>
>
> https://github.com/HotelsDotCom/waggle-dance
>
>
>
> Waggle Dance is a request routing Hive metastore proxy that allows tables
> to be concurrently accessed across multiple Hive deployments. It was
> created to tackle the appearance of the dataset silos that arose as our
> large organization gradually migrated from monolithic on-premises clusters,
> to cloud based platforms.
>
>
>
> In short, Waggle Dance enables a unified end point with which you can
> describe, query, and join tables that may exist in multiple distinct Hive
> deployments. Such deployments may exist in disparate regions, acc

Re: Hive federation service

2017-07-27 Thread Elliot West
Hi Carter,

1. In theory, under certain conditions, I believe so (note that my
Kerberos experience is very limited). Waggle Dance will simply forward the
Kerberos principals presented to the client. If these principals are
meaningful to the target cluster then my assumption is that this would
work. In our case not all of our target platforms support Kerberos yet, so
it's not something we've tried.

2. You are correct in identifying the layers in the namespace
hierarchy. As Hive has no cluster representation, we flatten the cluster
and database coordinates into a single database name coordinate. At first
we simply applied a cluster-specific prefix to all database names. However,
we found in our case that database name collisions were very infrequent
('default' being a notable exception), and that users disliked updating
database name references in existing HQL. Therefore we now also provide an
alternative strategy that only applies the cluster prefix to database names
that are overloaded in the context of the hierarchy.

3. To be clear, Waggle Dance operates at the Hive Metastore Thrift layer,
however JDBC also has some relevance in two areas. JDBC clients connecting
to Hive for the purposes of querying do so using HiveServer2 which in turn
uses the metastore API for metadata retrieval. If HiveServer2 is configured
to use a Waggle Dance instance as a metastore, these JDBC clients will also
benefit from the 'federated view'. The Hive metastore also uses JDBC when
persisting its internal model to relational stores. Normally, we'd be able
to ignore this as a backend implementation detail. However, we've seen that
some tools (GUI based schema explorers) bypass the Thrift based metastore
and interrogate the relational backing store directly, presumably as an
optimization. These tools don't see the 'federated view' as they 'step
over' our point of integration.

Hope this helps,

Elliot.

On 27 July 2017 at 15:56, Carter Shanklin <car...@hortonworks.com> wrote:

> Elliot,
>
> Interesting stuff
>
> I have 3 questions
> 1. Can Waggle Dance deal with multiple kerberized Hadoop clusters?
> 2. Do you support 3 layers in the hierarchy (i.e. cluster.database.table)
> or 2 layers, with a requirement to avoid any possible name collisions in
> the mapping layer.
> 3. Is it compatible with JDBC? It wasn't clear to me since the diagrams
> all mention thrift.
>
> Thanks!
>
>
> From: Elliot West <tea...@gmail.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Thursday, July 27, 2017 at 06:21
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Hive federation service
>
> Hello,
>
> We've recently contributed our Hive federation service to the open source
> community:
>
> https://github.com/HotelsDotCom/waggle-dance
>
>
> Waggle Dance is a request routing Hive metastore proxy that allows tables
> to be concurrently accessed across multiple Hive deployments. It was
> created to tackle the appearance of the dataset silos that arose as our
> large organization gradually migrated from monolithic on-premises clusters,
> to cloud based platforms.
>
> In short, Waggle Dance enables a unified end point with which you can
> describe, query, and join tables that may exist in multiple distinct Hive
> deployments. Such deployments may exist in disparate regions, accounts, or
> clouds (security and network permitting). Dataset access is not limited to
> the Hive query engine, and should work with any Hive metastore enabled
> platform. We've been successfully using it with Spark for example.
>
> More recently we've employed Waggle Dance to apply a simple security layer
> to cloud based platforms such as Qubole, Databricks, and EMR. These
> currently provide no means to construct cross platform authentication and
> authorization strategies. Therefore we use a combination of Waggle Dance
> and network configuration to restrict writes and destructive Hive
> operations to specific user groups and applications.
>
> We currently operate many disparate Hive metastore instances whose tables
> must be shared across the organization. Therefore we are committed to the
> ongoing development of this project. However, should such federation
> features have broader appeal, we'd be keen to see similar features
> integrated into Hive, perhaps in a more accessible form echoing existing
> remote table or database link features present in some traditional RDBMSes.
>
> All feedback appreciated, many thanks for your time,
>
> Elliot.
>


Hive federation service

2017-07-27 Thread Elliot West
Hello,

We've recently contributed our Hive federation service to the open source
community:

https://github.com/HotelsDotCom/waggle-dance


Waggle Dance is a request routing Hive metastore proxy that allows tables
to be concurrently accessed across multiple Hive deployments. It was
created to tackle the appearance of the dataset silos that arose as our
large organization gradually migrated from monolithic on-premises clusters,
to cloud based platforms.

In short, Waggle Dance enables a unified end point with which you can
describe, query, and join tables that may exist in multiple distinct Hive
deployments. Such deployments may exist in disparate regions, accounts, or
clouds (security and network permitting). Dataset access is not limited to
the Hive query engine, and should work with any Hive metastore enabled
platform. We've been successfully using it with Spark for example.

More recently we've employed Waggle Dance to apply a simple security layer
to cloud based platforms such as Qubole, Databricks, and EMR. These
currently provide no means to construct cross platform authentication and
authorization strategies. Therefore we use a combination of Waggle Dance
and network configuration to restrict writes and destructive Hive
operations to specific user groups and applications.

We currently operate many disparate Hive metastore instances whose tables
must be shared across the organization. Therefore we are committed to the
ongoing development of this project. However, should such federation
features have broader appeal, we'd be keen to see similar features
integrated into Hive, perhaps in a more accessible form echoing existing
remote table or database link features present in some traditional RDBMSes.

All feedback appreciated, many thanks for your time,

Elliot.


Re: Metastore integration testing with S3

2017-06-26 Thread Elliot West
Thanks for your reply Vihang,

In response to your questions:

I'm not using HS2. HMS is in remote mode. I supply the property key value
pairs as hiveconf arguments:

/bin/sh -c /root/hive/bin/hive --service metastore \
--hiveconf fs.s3a.endpoint=http://172.17.0.2:9000 \
--hiveconf fs.s3a.access.key= \
--hiveconf fs.s3a.secret.key=


The path style property is specified in hive-site.xml.

Thanks,

Elliot.

On 26 June 2017 at 19:18, Vihang Karajgaonkar <vih...@cloudera.com> wrote:

> Both the below configs should be accessible to the HMS and HS2. Is the HMS
> in the embedded mode or a remote? If its in remote mode can you make sure
> that these keys are present in the hive-site.xml of HMS?
>
> fs.s3a.access.key=
> fs.s3a.secret.key=
>
> On Mon, Jun 26, 2017 at 11:12 AM, Elliot West <tea...@gmail.com> wrote:
>
>> I'm trying to put together a metastore instance for the purposes of
>> creating an integration test environment. The system in question reads and
>> writes data into S3 and consequently manages Hive tables whose raw data
>> lives in S3. I've been successfully using Minio (https://www.minio.io)
>> to decouple other parts of the system from AWS so that it can stand alone,
>> but I've been having trouble getting this to play nicely with the metastore.
>>
>> As I understand it I can switch S3 endpoints on the S3AFileSystem and
>> must enable path style access. I'm using Hive 2.0.1 and Hadoop 2.8.0
>> deployed in a docker container with the following additional site config:
>>
>> fs.s3a.path.style.access=true
>> fs.s3a.endpoint=http://172.17.0.2:9000
>> fs.s3a.access.key=
>> fs.s3a.secret.key=
>>
>> And environment:
>>
>>
>> com.amazonaws.services.s3.enableV4=true
>>
>>
>> However, when trying to create a table in one of my Minio buckets I see
>> the following:
>>
>> hive> create external table x (id string) location 's3://mybucket/x/';
>> FAILED: Execution Error, return code 1 from 
>> org.apache.hadoop.hive.ql.exec.DDLTask.
>> MetaException(message:Got exception: 
>> org.apache.hadoop.fs.s3a.AWSS3IOException
>> doesBucketExist on mybucket: 
>> com.amazonaws.services.s3.model.AmazonS3Exception:
>> Bad Request (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad
>> Request; Request ID: null), S3 Extended Request ID: null: Bad Request
>> (Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request
>> ID: null))
>>
>>
>> I've also tried Hive 2.1.1 with no luck. Can anyone advise?
>>
>> Thanks,
>>
>> Elliot.
>>
>
>


Metastore integration testing with S3

2017-06-26 Thread Elliot West
I'm trying to put together a metastore instance for the purposes of
creating an integration test environment. The system in question reads and
writes data into S3 and consequently manages Hive tables whose raw data
lives in S3. I've been successfully using Minio (https://www.minio.io) to
decouple other parts of the system from AWS so that it can stand alone, but
I've been having trouble getting this to play nicely with the metastore.

As I understand it I can switch S3 endpoints on the S3AFileSystem and must
enable path style access. I'm using Hive 2.0.1 and Hadoop 2.8.0 deployed in
a docker container with the following additional site config:

fs.s3a.path.style.access=true
fs.s3a.endpoint=http://172.17.0.2:9000
fs.s3a.access.key=
fs.s3a.secret.key=

And environment:


com.amazonaws.services.s3.enableV4=true


However, when trying to create a table in one of my Minio buckets I see the
following:

hive> create external table x (id string) location 's3://mybucket/x/';
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:Got
exception: org.apache.hadoop.fs.s3a.AWSS3IOException doesBucketExist on
mybucket: com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request
(Service: Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request
ID: null), S3 Extended Request ID: null: Bad Request (Service: Amazon S3;
Status Code: 400; Error Code: 400 Bad Request; Request ID: null))


I've also tried Hive 2.1.1 with no luck. Can anyone advise?

Thanks,

Elliot.


Client agnostic metastore authorization in AWS

2017-04-24 Thread Elliot West
We’re operating long lived Hive metastore instances in AWS to provide a
metadata source of truth for our data processing pipelines. These pipelines
are not restricted to Hive SQL, but must use frameworks that can integrate
with the metastore (such as Spark). We’re storing data in S3. As these are
centralized shared resources, access control is a concern. Primarily this
to guard against user error rather than malicious intent (e.g. user
accidentally drops the wrong table). We’ve been experimenting with Kerberos
in EMR for strong authentication but have no definitive solution for
authorization as yet. We cannot use SQL based authorization as I believe
this is implemented at HiveServer2, which is not a useful integration point
for us. Additionally, there is no workable implementation for metastore
storage based authorization on S3. We’ve toyed with a
HiveMetastoreAuthorizationProvider that evaluates S3 bucket policies, but
this is fairly complex, especially when mapping principals to IAM entities.

More recently we came across DefaultHiveMetastoreAuthorizationProvider in
our travels which appears to implement legacy GRANT/REVOKE type controls in
both the Hive client and metastore layers. I’ve managed to get this working
solely in the metastore, and control access to metadata entities both via
the Hive CLI and the Thrift API. For example, I can craft some Thrift calls
to drop a table that correctly fail with one UGI principal, but succeed
with another. My metastore service configuration is like so, note that no
client side authorization plugin is configured:

hive.security.authorization.enabled=true
hive.security.authorization.createtable.owner.grants=ALL
hive.metastore.pre.event.listeners=org.apache.hadoop.hive.ql.security.authorization.AuthorizationPreEventListener
hive.security.metastore.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveMetastoreAuthorizationProvider
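
For concreteness, the check amounted to something like the following (a
sketch only: the principals, database, and table names are made up, and I'm
simply using the stock HiveMetaStoreClient):

import java.security.PrivilegedExceptionAction;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.security.UserGroupInformation;

public class DropTableAuthCheck {

  public static void main(String[] args) throws Exception {
    HiveConf conf = new HiveConf();
    // A principal with no grants; expected to be rejected by the
    // metastore-side authorization provider.
    dropAs("bob", conf);
    // The table owner; expected to succeed given createtable.owner.grants=ALL.
    dropAs("alice", conf);
  }

  private static void dropAs(String principal, HiveConf conf) throws Exception {
    UserGroupInformation ugi = UserGroupInformation.createRemoteUser(principal);
    ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
      IMetaStoreClient client = new HiveMetaStoreClient(conf);
      try {
        client.dropTable("my_db", "my_table");
        System.out.println(principal + ": drop succeeded");
      } catch (Exception e) {
        System.out.println(principal + ": drop rejected - " + e.getMessage());
      } finally {
        client.close();
      }
      return null;
    });
  }
}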


Aside from the documented shortcomings of this legacy authorization
implementation, can anyone suggest any additional pitfalls with this
configuration?

Thanks,

Elliot.


Re: VARCHAR or STRING fields in Hive

2017-01-16 Thread Elliot West
Internally it looks as though Hive simply represents CHAR/VARCHAR values
using a Java String and so I would not expect a significant change in
execution performance. The Hive JIRA suggests that these types were added
to 'support for more SQL-compliant behavior, such as SQL string comparison
semantics, max length, etc.' rather than for performance reasons.

   - https://issues.apache.org/jira/browse/HIVE-4844
   - https://issues.apache.org/jira/browse/HIVE-5191

In terms of storage I expect it depends on the underlying file format and
the types that these values are encoded to. Parquet does appear to support
the specific encoding of both CHAR/VARCHAR, however I'm skeptical that
there would be significant storage efficiencies gained by using the CHAR
types, over String for comparable values. I'd be keen to hear otherwise.

   - https://issues.apache.org/jira/browse/HIVE-7735

Thanks,

Elliot.

On 16 January 2017 at 15:37, Mich Talebzadeh 
wrote:

>
> Coming from DBMS background I tend to treat the columns in Hive similar to
> an RDBMS table. For example if a table created in Hive as Parquet I will
> use VARCHAR(30) for column that has been defined as VARCHAR(30) as source.
> If a column is defined as TEXT in RDBMS table I use STRING in Hive with a
> max size of 2GB I believe.
>
> My view is that it is more efficient storage wise to have Hive table
> created as VARCHAR as opposed to STRING.
>
> I have not really seen any performance difference if one uses VARCHAR or
> STRING. However, I believe there is a reason why one has VARCHAR in Hive as
> opposed to STRING.
>
> What is the thread view on this?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: tez + union stmt

2017-01-11 Thread Elliot West
Thank you.

On Wed, 11 Jan 2017 at 07:21, Chris Drome <cdr...@yahoo-inc.com> wrote:

> Elliot,
>
> Mithun already created the following ticket to track the issue:
>
> https://issues.apache.org/jira/browse/HIVE-15575
>
> chris
>
>
> On Tuesday, January 10, 2017 11:05 PM, Elliot West <tea...@gmail.com>
> wrote:
>
>
> Thanks Rohini,
>
> This is good to know. Could you perhaps raise an issue in the Hive JIRA?
>
> Thanks,
>
> Elliot.
>
> On Tue, 10 Jan 2017 at 22:55, Rohini Palaniswamy <rohini.adi...@gmail.com>
> wrote:
>
> The implementation in hive does look wrong. The concept of VertexGroups
> was added in Tez specifically for the case of union to support writing to
> same directory from different vertices. Sub-directories should not be
> required as a workaround.
>
> Regards,
> Rohini
>
>
> On Sun, Dec 25, 2016 at 10:58 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
> Thanks Elliot.  Nice christmas present.   Those settings in that
> stackoverflow link look to me to be exactly what i need to set for MR jobs
> to pick that data up that Tez created.
>
> Cheers,
> Stephen.
>
> On Sun, Dec 25, 2016 at 2:45 AM, Elliot West <tea...@gmail.com> wrote:
>
> I believe that tez will generate subfolders for unioned data. As far as I
> know, this is the expected behaviour and there is no alternative.
> Presumably this is to prevent multiple tasks from attempting to write the
> same file?
>
> We've experienced issues when switching from mr to tez; downstream jobs
> weren't expecting subfolders and had trouble reading previously accessible
> datasets.
>
> Apparently there are workarounds within Hive:
>
> http://stackoverflow.com/questions/39511585/hive-create-table-not-insert-data
>
> Merry Christmas,
>
> Elliot.
>
> On Sun, 25 Dec 2016 at 03:11, Rajesh Balamohan <rbalamo...@apache.org>
> wrote:
>
> Are there any exceptions in hive.log?. Is tmp_pv_v4* table part of the
> select query?
>
> Assuming you are creating the table in staging.db, it would have created
> the table location as staging.db/foo (as you have not specified the
> location).
>
> Adding user@hive.apache.org as this is hive related.
>
>
> ~Rajesh.B
>
> On Sun, Dec 25, 2016 at 12:08 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
> all,
>
> i'm running tez with the sql pattern:
>
> * create table foo as select * from (select... UNION select... UNION
> select...)
>
> in the logs the final step is this:
>
> * Moving data to directory hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4
> from hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/.hive-staging_hive_2016-12-24_10-05-40_048_4896412314807355668-899/-ext-10002
>
>
> when querying the table i got zero rows returned which made me curious. so
> i queried the hdfs location and see this:
>
>   $ hdfs dfs -ls hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4
>
>   Found 3 items
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:05 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/1
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/2
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/3
>
> and yes the data files are under these three dirs.
>
> so i ask... i'm not used to seeing sub-directories under the tablename
> unless the table is partitioned. is this legit? might there be some config
> settings i need to set to see this data via sql?
>
> thanks,
> Stephen.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: tez + union stmt

2017-01-10 Thread Elliot West
Thanks Rohini,

This is good to know. Could you perhaps raise an issue in the Hive JIRA?

Thanks,

Elliot.

On Tue, 10 Jan 2017 at 22:55, Rohini Palaniswamy <rohini.adi...@gmail.com>
wrote:

> The implementation in hive does look wrong. The concept of VertexGroups
> was added in Tez specifically for the case of union to support writing to
> same directory from different vertices. Sub-directories should not be
> required as a workaround.
>
> Regards,
> Rohini
>
>
> On Sun, Dec 25, 2016 at 10:58 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
> Thanks Elliot.  Nice christmas present.   Those settings in that
> stackoverflow link look to me to be exactly what i need to set for MR jobs
> to pick that data up that Tez created.
>
> Cheers,
> Stephen.
>
> On Sun, Dec 25, 2016 at 2:45 AM, Elliot West <tea...@gmail.com> wrote:
>
> I believe that tez will generate subfolders for unioned data. As far as I
> know, this is the expected behaviour and there is no alternative.
> Presumably this is to prevent multiple tasks from attempting to write the
> same file?
>
> We've experienced issues when switching from mr to tez; downstream jobs
> weren't expecting subfolders and had trouble reading previously accessible
> datasets.
>
> Apparently there are workarounds within Hive:
>
> http://stackoverflow.com/questions/39511585/hive-create-table-not-insert-data
>
> Merry Christmas,
>
> Elliot.
>
> On Sun, 25 Dec 2016 at 03:11, Rajesh Balamohan <rbalamo...@apache.org>
> wrote:
>
> Are there any exceptions in hive.log?. Is tmp_pv_v4* table part of the
> select query?
>
> Assuming you are creating the table in staging.db, it would have created
> the table location as staging.db/foo (as you have not specified the
> location).
>
> Adding user@hive.apache.org as this is hive related.
>
>
> ~Rajesh.B
>
> On Sun, Dec 25, 2016 at 12:08 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
> all,
>
> i'm running tez with the sql pattern:
>
> * create table foo as select * from (select... UNION select... UNION
> select...)
>
> in the logs the final step is this:
>
> * Moving data to directory hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4
> from hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/.hive-staging_hive_2016-12-24_10-05-40_048_4896412314807355668-899/-ext-10002
>
>
> when querying the table i got zero rows returned which made me curious. so
> i queried the hdfs location and see this:
>
>   $ hdfs dfs -ls hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4
>
>   Found 3 items
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:05 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/1
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/2
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/3
>
> and yes the data files are under these three dirs.
>
> so i ask... i'm not used to seeing sub-directories under the tablename
> unless the table is partitioned. is this legit? might there be some config
> settings i need to set to see this data via sql?
>
> thanks,
> Stephen.
>
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: tez + union stmt

2016-12-25 Thread Elliot West
I believe that tez will generate subfolders for unioned data. As far as I
know, this is the expected behaviour and there is no alternative.
Presumably this is to prevent multiple tasks from attempting to write the
same file?

We've experienced issues when switching from mr to tez; downstream jobs
weren't expecting subfolders and had trouble reading previously accessible
datasets.

Apparently there are workarounds within Hive:
http://stackoverflow.com/questions/39511585/hive-create-table-not-insert-data

Merry Christmas,

Elliot.

On Sun, 25 Dec 2016 at 03:11, Rajesh Balamohan 
wrote:

> Are there any exceptions in hive.log?. Is tmp_pv_v4* table part of the
> select query?
>
> Assuming you are creating the table in staging.db, it would have created
> the table location as staging.db/foo (as you have not specified the
> location).
>
> Adding user@hive.apache.org as this is hive related.
>
>
> ~Rajesh.B
>
> On Sun, Dec 25, 2016 at 12:08 AM, Stephen Sprague 
> wrote:
>
> all,
>
> i'm running tez with the sql pattern:
>
> * create table foo as select * from (select... UNION select... UNION
> select...)
>
> in the logs the final step is this:
>
> * Moving data to directory hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4
> from hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/.hive-staging_hive_2016-12-24_10-05-40_048_4896412314807355668-899/-ext-10002
>
>
> when querying the table i got zero rows returned which made me curious. so
> i queried the hdfs location and see this:
>
>   $ hdfs dfs -ls hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4
>
>   Found 3 items
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:05 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/1
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/2
>   drwxrwxrwx   - dwr supergroup  0 2016-12-24 10:06 hdfs://
> dwrnn1.sv2.trulia.com:8020/user/hive/warehouse/staging.db/tmp_pv_v4c__loc_4/3
>
> and yes the data files are under these three dirs.
>
> so i ask... i'm not used to seeing sub-directories under the tablename
> unless the table is partitioned. is this legit? might there be some config
> settings i need to set to see this data via sql?
>
> thanks,
> Stephen.
>
>
>
>
>
>
>
>


Re: hive2.1.0 one partition has two locations

2016-12-22 Thread Elliot West
I believe there is an issue with non-string type partition values. At some
point in the code path they are incorrectly compared as strings when a numeric
comparison should be used instead. Consequently, as '04' ≠ '4' you get two
different partitions. To work around this you should ensure that only one
numerical partition key format is used: always strip leading zeros. I've
had a look in the Hive JIRA and can find no related issues that haven't
been fixed for Hive 2.1.0, so perhaps this is a new find. I'd suggest
raising a new issue: https://issues.apache.org/jira/browse/HIVE
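
A trivial illustration of the mismatch (plain Java, purely to show why the
two partition specs don't collapse into a single partition):

public class PartitionValueComparison {
  public static void main(String[] args) {
    // Lexical comparison: the values differ, so two partitions appear.
    System.out.println("04".equals("4"));                                // false
    // Numeric comparison: equal, which is what you'd expect for an
    // int-typed partition column.
    System.out.println(Integer.parseInt("04") == Integer.parseInt("4")); // true
  }
}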

Thanks,

Elliot.

On Thu, 22 Dec 2016 at 07:49, 徐 鹏  wrote:

>
>
>
>
>
>
>
>
>
>
> Hi all:
>
>
>
>
>
>
>
>
>
>
>
>
> HQL:
>
>
> Alter table OperatingStat_R_View Add IF NOT EXISTS 
> partition(YEAR=2016,MONTH=12,DAY=04)
>
> LOCATION 
> '/data/ApplicationDep/TCRecSys_ApplyData/OperatingStat/Report/View/2016/12/04';
>
>
>
> INSERT OVERWRITE TABLE TCRecSys_ApplyData.OperatingStat_R_View PARTITION 
> (YEAR=2016,MONTH=12,DAY=04)
>
> SELECT
>
> '2016-12-04',
>
> EventID,
>
> PlatID,
>
> ProvinceID,
>
> CityID,
>
> RefID,
>
> '',
>
> '',
>
> '',
>
> COUNT(1),
>
> COUNT(DISTINCT DeviceID)
>
> FROM TCRecSys_ApplyData.OperatingStat_D_EventList
>
> WHERE YEAR=2016 AND MONTH=12 AND DAY=04 AND LabelID='01'
>
> GROUP BY EventID,PlatID,ProvinceID,CityID,RefID;
>
>
>
> expected result:
>
>
>
>
>
>
>
>- partition(YEAR=2016,MONTH=12,DAY=04) LOCATION '/data/ApplicationDep/
>TCRecSys_ApplyData/OperatingStat/Report/View/2016/12/04’
>
>
>
> actual result:
>
>
>
>
>
>
>
>- partition(YEAR=2016,MONTH=12,DAY=04) LOCATION '/data/ApplicationDep/
>TCRecSys_ApplyData/OperatingStat/Report/View/2016/12/04'
>- partition(YEAR=2016,MONTH=12,DAY=4) LOCATION '/data/ApplicationDep/
>TCRecSys_ApplyData/OperatingStat/Report/View/year=2016/month=12/day=4'
>
>
>
> desc format info:
>
>
>
>
>
>
> hive> desc formatted TCRecSys_ApplyData.OperatingStat_R_View partition 
> (year=2016,month=12,day=04);
>
> OK
>
> # col_name  data_type   comment
>
>
>
> createdate  string
>
> eventid string
>
> platid  string
>
> provinceid  string
>
> cityid  string
>
> refid   string
>
> def1string
>
> def2string
>
> def3string
>
> pv  int
>
> uv  int
>
>
>
> # Partition Information
>
> # col_name  data_type   comment
>
>
>
> yearint
>
> month   int
>
> day int
>
>
>
> # Detailed Partition Information
>
> Partition Value:[2016, 12, 04]
>
> Database:   tcrecsys_applydata
>
> Table:  operatingstat_r_view
>
> CreateTime: Mon Dec 05 10:46:27 CST 2016
>
> LastAccessTime: UNKNOWN
>
> Protect Mode:   None
>
> Location:   
> hdfs://hadoopcluster/data/ApplicationDep/TCRecSys_ApplyData/OperatingStat/Report/View/2016/12/04
>
> Partition Parameters:
>
> COLUMN_STATS_ACCURATE   false
>
> numFiles0
>
> numRows -1
>
> rawDataSize -1
>
> totalSize   0
>
> transient_lastDdlTime   1480905987
>
>
>
> # Storage Information
>
> SerDe Library:  org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>
> InputFormat:org.apache.hadoop.mapred.TextInputFormat
>
> OutputFormat:   
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
> Compressed: No
>
> Num Buckets:-1
>
> Bucket Columns: []
>
> Sort Columns:   []
>
> Storage Desc Params:
>
> field.delim ^
>
> serialization.format^
>
> Time taken: 0.485 seconds, Fetched: 48 row(s)
>
> hive> desc formatted TCRecSys_ApplyData.OperatingStat_R_View partition 
> (year=2016,month=12,day=4);
>
> OK
>
> # col_name  data_type   comment
>
>
>
> createdate  string
>
> eventid string
>
> platid  string
>
> provinceid  string
>
> cityid  string
>
> refid   string
>
> def1string
>
> def2string
>
> def3string
>
> pv  int
>
> uv  int
>
>
>
> # Partition Information
>
> # col_name  data_type   comment
>
>
>
> yearint
>
> month   int
>
> day int
>
>
>
> # Detailed Partition Information
>
> Partition Value:[2016, 12, 4]
>
> Database:   tcrecsys_applydata
>
> Table:  operatingstat_r_view
>
> CreateTime: Mon Dec 05 10:46:07 CST 2016
>
> LastAccessTime: UNKNOWN
>
> Protect Mode:   None
>
> 

Re: how to load ORC file into hive orc table

2016-12-17 Thread Elliot West
It looks as though your table is partitioned yet perhaps you haven't
accounted for this when adding the data? Firstly it is good practice (and
sometimes essential) to put the data into a partition folder of the form
"timestamp='<value>'". You may then need to add the partition
depending on how you are creating it. IIRC the Spark DataFrame/DataSet APIs
have good support for adding partitions to existing Hive tables although
there was a bug that prevented the creation of new partitioned tables when
I looked some time ago. If you are manually managing the partitions you may
need to issue an ADD PARTITION command using the Hive CLI:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AddPartitions

On Sat, 17 Dec 2016 at 08:07, 446463...@qq.com <446463...@qq.com> wrote:

>
> Hi All:
> I create a orc table in hive
>
> create table if not exists user_tag (
> rowkey STRING ,
> cate1 STRING ,
> cate2 STRING ,
> cate3 STRING ,
> cate4 STRING
> )
> PARTITIONED BY (timestamp STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS orc
> LOCATION '/user/hive/warehouse/kylinlabel.db/user_tag';
>
> and I generate a orc file in spark  and I put this file into path
> /user/hive/warehouse/kylinlabel.db/user_tag
> /user/hive/warehouse/kylinlabel.db/user_tag/part-r-
> 0-920282f9-4d68-4af8-81c5-69522df3d374.orc
> this is the file path.
> I find there is no data in user_tag table
> Why?
>
>
> --
>
> 446463...@qq.com
>
>


Re: Column names in ORC file

2016-12-15 Thread Elliot West
Possibly related to HIVE-4243 which was fixed in Hive 2.0.0:
https://issues.apache.org/jira/browse/HIVE-4243


On Thu, 15 Dec 2016 at 18:06, Daniel Haviv 
wrote:

> Hi,
> When I'm generating ORC files using spark the column names are written
> into the ORC file but when generated using Hive I get the following column
> names:
>
> _col107, _col33, _col23, _col102
>
>
> Is it possible to somehow configure hive to properly store the column names 
> like Spark?
>
>
> Thank you,
>
> Daniel
>
>
>
>


Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
I notice that HPL/SQL is not mentioned on the page I referenced, however I
expect that is another approach that you could use to modularise:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156
http://www.hplsql.org/doc

On 15 December 2016 at 17:17, Elliot West <tea...@gmail.com> wrote:

> Some options are covered here, although there is no definitive guidance as
> far as I know:
>
> https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#
> UnitTestingHiveSQL-Modularisation
>
> On 15 December 2016 at 17:08, Saumitra Shahapure <
> saumitra.offic...@gmail.com> wrote:
>
>> Hello,
>>
>> We are running and maintaining quite big and complex Hive SELECT query
>> right now. It's basically a single SELECT query which performs JOIN of
>> about ten other SELECT query outputs.
>>
>> A simplest way to refactor that we can think of is to break this query
>> down into multiple views and then join the views. There is similar
>> possibility to create intermediate tables.
>>
>> However creating multiple DDLs in order to maintain a single DML is not
>> very smooth. We would end up polluting metadata database by creating views
>> / intermediate tables which are used in just this ETL.
>>
>> What are the other efficient ways to maintain complex SQL queries written
>> in Hive? Are there better ways to break Hive query into multiple modules?
>>
>> -- Saumitra S. Shahapure
>>
>
>


Re: Maintaining big and complex Hive queries

2016-12-15 Thread Elliot West
Some options are covered here, although there is no definitive guidance as
far as I know:

https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation

On 15 December 2016 at 17:08, Saumitra Shahapure <
saumitra.offic...@gmail.com> wrote:

> Hello,
>
> We are running and maintaining quite big and complex Hive SELECT query
> right now. It's basically a single SELECT query which performs JOIN of
> about ten other SELECT query outputs.
>
> A simplest way to refactor that we can think of is to break this query
> down into multiple views and then join the views. There is similar
> possibility to create intermediate tables.
>
> However creating multiple DDLs in order to maintain a single DML is not
> very smooth. We would end up polluting metadata database by creating views
> / intermediate tables which are used in just this ETL.
>
> What are the other efficient ways to maintain complex SQL queries written
> in Hive? Are there better ways to break Hive query into multiple modules?
>
> -- Saumitra S. Shahapure
>


Re: Interrogating a uniontype

2016-11-24 Thread Elliot West
Thanks for your reply. We just did something similar to this, implementing
a UDF that returns a struct containing a field per union subtype. This data
structure is then easily navigated using standard HQL operators. I'm
surprised such a feature doesn't already exist.
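
For reference, a minimal sketch along the lines of what we implemented (our
own naming, with argument checking and error handling trimmed; treat it as
illustrative rather than production code):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.UnionObjectInspector;

// Spreads a uniontype value over a struct with one field per union
// alternative; only the field matching the union's tag is non-null.
public class UnionToStruct extends GenericUDF {

  private UnionObjectInspector unionInspector;

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    if (arguments.length != 1 || !(arguments[0] instanceof UnionObjectInspector)) {
      throw new UDFArgumentException("Expected a single uniontype argument");
    }
    unionInspector = (UnionObjectInspector) arguments[0];
    List<String> fieldNames = new ArrayList<>();
    List<ObjectInspector> fieldInspectors = new ArrayList<>();
    int tag = 0;
    for (ObjectInspector alternative : unionInspector.getObjectInspectors()) {
      fieldNames.add("tag_" + tag++);
      fieldInspectors.add(alternative);
    }
    return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldInspectors);
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {
    Object union = arguments[0].get();
    if (union == null) {
      return null;
    }
    int alternatives = unionInspector.getObjectInspectors().size();
    List<Object> struct = new ArrayList<>(alternatives);
    for (int i = 0; i < alternatives; i++) {
      struct.add(null);
    }
    struct.set(unionInspector.getTag(union), unionInspector.getField(union));
    return struct;
  }

  @Override
  public String getDisplayString(String[] children) {
    return "union_to_struct(" + children[0] + ")";
  }
}

Registered with CREATE TEMPORARY FUNCTION, the resulting struct can then be
navigated with the usual dot notation.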

On Wed, 23 Nov 2016 at 21:50, Per Ullberg <per.ullb...@klarna.com> wrote:

> Could you write a UDF that parses it and returns a json object. From there
> you can use the standard json support in Hive. I did something similar for
> Erlang structs about 3 years ago. I actually kept them on file and wrote a
> serde that exposed them as json objects.
>
> regards
> /Pelle
>
> On Wed, Nov 23, 2016 at 6:40 PM, Elliot West <tea...@gmail.com> wrote:
>
> Ah, I see that this can't be done with an array as there is no type common
> to all union indexes. Perhaps a struct with one field per indexed type?
>
> On Wed, 23 Nov 2016 at 17:29, Elliot West <tea...@gmail.com> wrote:
>
> Can anyone recommend a good approach for interrogating uniontype values in
> HQL? I note that the documentation states that the support for such types
> is limited to 'look-at-only' which I assume to mean that I may only dump
> out the value in its entirety, and extract sub-elements. Using the example
> below, is there anyway I can get to field 'b' of union index 3 to extract
> only the value 5?
>
> {0:1}
> {1:2.0}
> {2:["three","four"]}
> {3:{"a":5,"b":"five"}}
>
>
> If not possible with HQL, would it be possible to implement a UDF that can
> explode the type into something more navigable, like an array, struct, or
> map?
>
> Example when exploded as array:
>
>
> [1,null,null,null]
> [null,2.0,null,null]
> [null,null,["three","four"],null]
> [null,null,null,{"a":5,"b":"five"}]
>
> Has anyone done this?
>
> Thanks,
>
> Elliot.
>
>
>
>
> --
>
> *Per Ullberg*
> Data Vault Tech Lead
> Odin Uppsala
> +46 701612693 <+46+701612693>
>
> Klarna AB (publ)
> Sveavägen 46, 111 34 Stockholm
> Tel: +46 8 120 120 00 <+46812012000>
> Reg no: 556737-0431
> klarna.com
>
>


Re: Interrogating a uniontype

2016-11-23 Thread Elliot West
Ah, I see that this can't be done with an array as there is no type common
to all union indexes. Perhaps a struct with one field per indexed type?

On Wed, 23 Nov 2016 at 17:29, Elliot West <tea...@gmail.com> wrote:

> Can anyone recommend a good approach for interrogating uniontype values in
> HQL? I note that the documentation states that the support for such types
> is limited to 'look-at-only' which I assume to mean that I may only dump
> out the value in its entirety, and extract sub-elements. Using the example
> below, is there anyway I can get to field 'b' of union index 3 to extract
> only the value 5?
>
> {0:1}
> {1:2.0}
> {2:["three","four"]}
> {3:{"a":5,"b":"five"}}
>
>
> If not possible with HQL, would it be possible to implement a UDF that can
> explode the type into something more navigable, like an array, struct, or
> map?
>
> Example when exploded as array:
>
>
> [1,null,null,null]
> [null,2.0,null,null]
> [null,null,["three","four"],null]
> [null,null,null,{"a":5,"b":"five"}]
>
> Has anyone done this?
>
> Thanks,
>
> Elliot.
>
>


Interrogating a uniontype

2016-11-23 Thread Elliot West
Can anyone recommend a good approach for interrogating uniontype values in
HQL? I note that the documentation states that the support for such types
is limited to 'look-at-only' which I assume to mean that I may only dump
out the value in its entirety, and extract sub-elements. Using the example
below, is there any way I can get to field 'b' of union index 3 to extract
only the value 5?

{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}


If not possible with HQL, would it be possible to implement a UDF that can
explode the type into something more navigable, like an array, struct, or
map?

Example when exploded as array:


[1,null,null,null]
[null,2.0,null,null]
[null,null,["three","four"],null]
[null,null,null,{"a":5,"b":"five"}]

Has anyone done this?

Thanks,

Elliot.


Re: s3a and hive

2016-11-15 Thread Elliot West
My gut feeling is that this is not something you should do (except for
fun!). I'm fairly confident that somewhere in Hive, MR, or Tez, you'll hit
some code that requires consistent, atomic move/copy/list/overwrite
semantics from the warehouse filesystem. This is not something that the
vanilla S3AFileSystem can provide. Even if you get to the point where
everything appears functionally sound, I expect you'll encounter unusual
and inconsistent behavior if you use this in the long term.

Solutions to Hive on S3 include:

   - Use S3Guard (not yet available):
   https://issues.apache.org/jira/browse/HADOOP-13345
   - Use Hive on EMR with Amazon's S3 filesystem implementation and EMRFS.
   Note that this confusingly requires and overloads the 's3://' scheme.

Hope this helps, and please report back with any findings as we are doing
quite a bit of Hive in AWS too.

Cheers - Elliot.

On 15 November 2016 at 15:19, Stephen Sprague  wrote:

> no. permissions are good.  i believe the case to be that s3a does not have
> a "move" and/or "rename" semantic but i can't be the first one to encounter
> this. somebody out there has to have gone done this path way before me
> surely.
>
> searching the cyber i find this:
>
>https://issues.apache.org/jira/browse/HIVE-14270
>
> which is part of a even more work with s3 (see the related jira's that
> that jira comes under) especially the Hadoop Uber-Jira.
>
>
> so after digging though those jira's lemme ask:
>
> has anyone set hive.metastore.warehouse.dir to a s3a location with
> success?
>
> seems to me hive 2.2.0 and perhaps hadoop 2.7 or 2.8 are the only chances
> of success but i'm happy to be told i'm wrong.
>
> thanks,
> Stephen.
>
>
>
> On Mon, Nov 14, 2016 at 10:25 PM, Jörn Franke 
> wrote:
>
>> Is it a permission issue on the folder?
>>
>> On 15 Nov 2016, at 06:28, Stephen Sprague  wrote:
>>
>> so i figured i try and set hive.metastore.warehouse.dir=s3a://bucket/hive
>> and see what would happen.
>>
>> running this query:
>>
>> insert overwrite table omniture.hit_data_aws partition
>> (date_key=20161113) select * from staging.hit_data_aws_ext_20161113
>> limit 1;
>>
>> yields this error:
>>
>>Failed with exception java.io.IOException: rename for src path:
>> s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/.hive-
>> staging_hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>> to dest 
>> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>> returned false
>> FAILED: Execution Error, return code 1 from 
>> org.apache.hadoop.hive.ql.exec.MoveTask.
>> java.io.IOException: rename for src path: s3a://trulia-dwr-cluster-dev/h
>> ive/omniture.db/hit_data_aws/date_key=20161113/.hive-staging
>> _hive_2016-11-15_04-57-52_085_7825126612479617470-1/-ext-1/00_0
>> to dest 
>> path:s3a://trulia-dwr-cluster-dev/hive/omniture.db/hit_data_aws/date_key=20161113/00_0
>> returned false
>>
>>
>> is there any workaround?   i'm running hive 2.1.0 and hadoop version
>> 2.6.0-cdh5.7.1  .
>>
>>
>> thanks,
>> Stephen.
>>
>>
>


Re: Unit testing macros

2016-09-30 Thread Elliot West
Hi,

You can achieve this by storing the macro definition in a separate HQL file
and 'import' this as needed. Unfortunately such imports are interpreted by
your Hive client and the relevant command varies between client
implementations: '!run' in Beeline and 'SOURCE' in Hive CLI. I raised a
proposal to create a unified command that is compatible across clients but
this has yet to gain any traction:
https://issues.apache.org/jira/browse/HIVE-12703

Elliot.


On 30 September 2016 at 10:23, Staņislavs Rogozins <
stanislavs.rogoz...@gmail.com> wrote:

> The Unit testing wiki page
>  
> suggests
> using macros to 'extract and resuse the expressions applied to columns' and
> that they can be 'readily isolated for testing'. However as far as I'm
> aware, right now only temporary macros can be created, that stop existing
> outside of session where they were defined(https://cwiki.apache.
> org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/
> DropMacro), thus making it necessary to include the macro in every HQL
> script that uses it. How is it possible to isolate them for testing?
>


Re: Iterating over partitions using the metastore API

2016-08-04 Thread Elliot West
Thanks for your reply. I hadn't considered driving it from a list of
partition names.

To avoid the N+1 reads I am considering reading in batches like so:

   - Sorting the names
   - Taking every nth name (where n is the batch size) to use as a batch
   boundary.
   - Building a filter derived from boundary_name[n-1] and boundary_name[n].
   - Then selecting the batch using the filter and
   IMSC.listPartitionsWithFilter(...)

A drawback to this approach is that filters only support string key types
IIRC.
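
For what it's worth, I'm also considering a simpler variant that sidesteps
the filter entirely by fetching partitions by name in batches; a sketch
(the batch size and the callback are illustrative only):

import java.util.List;
import java.util.function.Consumer;

import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.thrift.TException;

public class PartitionBatcher {

  // Visits every partition of a table while only ever holding one batch of
  // Partition objects in memory.
  public static void forEachPartition(IMetaStoreClient client, String database,
      String table, int batchSize, Consumer<Partition> action) throws TException {
    List<String> names = client.listPartitionNames(database, table, (short) -1);
    for (int i = 0; i < names.size(); i += batchSize) {
      List<String> batch = names.subList(i, Math.min(i + batchSize, names.size()));
      for (Partition partition : client.getPartitionsByNames(database, table, batch)) {
        action.accept(partition);
      }
    }
  }
}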

Thanks,

Elliot.

On 4 August 2016 at 13:15, Furcy Pin <furcy@flaminem.com> wrote:

> Hi Elliot,
>
> I guess you can use IMetaStoreClient.listPartitionsNames instead, and
> then use IMetaStoreClient.getPartition for each partition.
> This might be slow though, as you will have to make 10 000 calls to get
> them.
>
> Another option I'd consider is connecting directly to the Hive metastore.
> This require a little more configuration (grant read-only access to your
> process to the metastore), and might make your implementation dependent
> on the metastore underlying implementation (mysql, postgres, derby),
> unless you use a ORM to query it.
> Anyway, you could ask the metastore directly via JDBC for all the
> partitions, and get java.sql.ResultSet that can be iterated over.
>
> Regards,
>
> Furcy
>
>
> On Thu, Aug 4, 2016 at 1:29 PM, Elliot West <tea...@gmail.com> wrote:
>
>> Hello,
>>
>> I have a process that needs to iterate over all of the partitions in a
>> table using the metastore API.The process should not need to know about the
>> structure or meaning of the partition key values (i.e. whether they are
>> dates, numbers, country names etc), or be required to know the existing
>> range of partition values. Note that the process only needs to know about
>> one partition at any given time.
>>
>> Currently I am naively using the IMetaStoreClient.listPartitions(String,
>> String, short) method to retrieve all partitions but clearly this is not
>> scalable for tables with many 10,000s of partitions. I'm finding that even
>> with relatively large heaps I'm running into OOM exceptions when the
>> metastore API is building the List return value. I've
>> experimented with using IMetaStoreClient.listPartitionSpecs(String,
>> String, int) but this too seems to have high memory requirements.
>>
>> Can anyone suggest how I can better iterate over partitions in a manner
>> that is more considerate of memory usage?
>>
>> Thanks,
>>
>> Elliot.
>>
>>
>


Iterating over partitions using the metastore API

2016-08-04 Thread Elliot West
Hello,

I have a process that needs to iterate over all of the partitions in a
table using the metastore API. The process should not need to know about the
structure or meaning of the partition key values (i.e. whether they are
dates, numbers, country names etc), or be required to know the existing
range of partition values. Note that the process only needs to know about
one partition at any given time.

Currently I am naively using the IMetaStoreClient.listPartitions(String,
String, short) method to retrieve all partitions but clearly this is not
scalable for tables with many 10,000s of partitions. I'm finding that even
with relatively large heaps I'm running into OOM exceptions when the
metastore API is building the List<Partition> return value. I've
experimented with using IMetaStoreClient.listPartitionSpecs(String, String,
int) but this too seems to have high memory requirements.

Can anyone suggest how I can better iterate over partitions in a manner
that is more considerate of memory usage?

Thanks,

Elliot.


Hive authentication and authorisation in AWS.

2016-07-13 Thread Elliot West
Hello,

I am attempting to set up a long-running, shared Hive metastore in AWS. The
intention is to have this serve as the core repository of metadata for
shared datasets across multiple AWS accounts. Users will be able to spin up
their own short-lived EMR clusters, Spark jobs, etc. and then locate the
data that they need using this metastore. The data will be stored on S3,
the metadata database will be provided using RDS MySQL or Aurora, and I
have the metastore service running on EC2 instances. I’m trying to
determine what would be the best way to both authenticate and authorize
users of the metastore in this scenario. Given that I’m no expert on user
identity management and security, I’m finding it rather difficult to make
headway.

On the subject of authentication, I’d ideally like to use the user’s global
IAM identity. However, I’m at a loss on where and how I can integrate this
with the metastore service. The metastore apparently supports Kerberos and
LDAP but I’m not sure how these fit into an AWS setting. I’d rather not
run a separate directory server that maintains a set of identities separate
from the IAM identities in accounts, although this seems to be a
possibility.

On the subject of authorisation, I suspect that storage based authorisation
will not work with S3. Hive appears to use the Hadoop FileSystem
abstraction to interrogate FS permissions and the S3 FileSystem
implementations do not appear to provide any visibility on S3 bucket
permissions. Additionally, SQL based authorization also appears to be
inappropriate for this use case as it requires HiveServer2 to enforce the
finer grained permissions (column access control for example). However, I
don’t want to force all users to access data via HiveServer2 as this then
mandates that they must use a client that supports HS2. At this point I
wonder whether I must implement my own metastore authorization hook that
interrogates the S3 bucket policy using AWS APIs.

Any suggestions or thoughts would be appreciated.

Thanks,

Elliot.


Re: Hive Metastore on Amazon Aurora

2016-07-11 Thread Elliot West
Hi Mich,

Correct. We have proof of concepts up and running with both MySQL on RDS
and Aurora. We'd be keen to hear of experiences of others with Aurora in a
Hive metastore database role, primarily as a sanity check. In answer to
your specific points:

   1. 30GB
   2. We don't intend to use ACID in this scenario (currently).

For this application we particularly value:

   - Compatibility (with Hive)
   - High availability
   - Scalability
   - Ease of management

Thanks, Elliot.

On 11 July 2016 at 15:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi  Elliot,
>
> Am I correct that you want to put your Hive metastore on Amazon? Is the
> metastore (database/schema) sitting on MySQL and you want to migrate
> your MySQL to the cloud now?
>
> Two questions that need to be verified
>
>
>1. How big is your current metadata
>2. Do you do a lot of transaction activity using ORC files with
>Insert/Update/Delete that need to communicate with metastore with heartbeat
>etc?
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 11 July 2016 at 13:58, Elliot West <tea...@gmail.com> wrote:
>
>> Hello,
>>
>> Is anyone running the Hive metastore database on Amazon Aurora?:
>> https://aws.amazon.com/rds/aurora/details/. My expectation is that it
>> should work nicely as it is derived from MySQL but I'd be keen to hear of
>> user's experiences with this setup.
>>
>> Many thanks,
>>
>> Elliot.
>>
>
>


Hive Metastore on Amazon Aurora

2016-07-11 Thread Elliot West
Hello,

Is anyone running the Hive metastore database on Amazon Aurora?:
https://aws.amazon.com/rds/aurora/details/. My expectation is that it
should work nicely as it is derived from MySQL but I'd be keen to hear of
user's experiences with this setup.

Many thanks,

Elliot.


Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Elliot West
Related to this, there exists an API in Hive to simplify the integration
of other frameworks with Hive's ACID feature:

See:
https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API

It contains code for maintaining heartbeats, handling locks and
transactions, and submitting mutations in a distributed environment.

We have used it to write to transactional tables from Cascading based
processes.

Elliot.


On 2 June 2016 at 09:54, Mich Talebzadeh  wrote:

>
> Hi,
>
> Spark does not support transactions because as I understand there is a
> piece in the execution side that needs to send heartbeats to Hive metastore
> saying a transaction is still alive". That has not been implemented in
> Spark yet to my knowledge."
>
> Any idea on the timelines when we are going to have support for
> transactions in Spark for Hive ORC tables. This will really be useful.
>
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Copying all Hive tables from Prod to UAT

2016-05-26 Thread Elliot West
Hello,

I've been looking at this recently for moving Hive tables from on-premise
clusters to the cloud, but the principle should be the same for your
use-case. If you wish to do this in an automated way, some tools worth
considering are:

   - Hive's built in replication framework:
   https://cwiki.apache.org/confluence/display/Hive/Replication
   - Hive's IMPORT/EXPORT primitives:
   https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ImportExport
   - AirBnB's ReAir replication tool:
   
https://medium.com/airbnb-engineering/reair-easy-to-use-tools-for-migrating-and-replicating-petabyte-scale-data-warehouses-5153f8a433da

Elliot.

On 8 April 2016 at 23:24, Ashok Kumar  wrote:

> Hi,
>
> Anyone has suggestions how to create and copy Hive and Spark tables from
> Production to UAT.
>
> One way would be to copy table data to external files and then move the
> external files to a local target directory and populate the tables in
> target Hive with data.
>
> Is there an easier way of doing so?
>
> thanks
>
>
>


Retrying metastore clients

2016-05-16 Thread Elliot West
Hello,

We have a fair amount of code that uses IMetaStoreClient implementations to
talk over the network to Hive metastore instances. I was under the impression
that the recommended way to create said clients was to use:

org.apache.hive.hcatalog.common.HCatUtil.getHiveClient(HiveConf)


However, I'm keen that our clients retry in the event of adverse network
conditions and I've recently discovered that clients constructed in this
manner do not appear to be of the retrying variety. In fact it seems as
though only the following factory method creates clients of this type:

org.apache.hadoop.hive.ql.metadata.Hive.getMSC()


But, in my particular use case I wish to copy data between two metastores
and this class seems to do a bunch of ThreadLocal stuff that would appear
to make it awkward to instantiate multiple, different IMetaStoreClient
instances. Therefore I think I'll need to instead call the following
directly:

org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(HiveConf,
HiveMetaHookLoader, String)


I'd be grateful for any advice regarding how I best create multiple, robust
clients that can collaborate in a single thread.
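
For reference, this is roughly what I have in mind, assuming the getProxy
overload above and a no-op hook loader (a sketch only):

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaHookLoader;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.RetryingMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.MetaException;

public class MetaStoreClientFactory {

  // Builds an independent retrying client from its own HiveConf, avoiding the
  // thread-local caching in org.apache.hadoop.hive.ql.metadata.Hive.
  public static IMetaStoreClient newRetryingClient(HiveConf conf) throws MetaException {
    HiveMetaHookLoader hookLoader = table -> null; // no hooks needed for plain metadata calls
    return RetryingMetaStoreClient.getProxy(conf, hookLoader,
        HiveMetaStoreClient.class.getName());
  }
}

The idea being that a source client and a replica client, each built from
its own HiveConf, could then be used side by side in the same thread.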

Thanks,

Elliot.


Re: How to find hive version using hive editor in hue ?

2016-02-18 Thread Elliot West
See set command usage here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli#LanguageManualCli-HiveInteractiveShellCommands

On Thursday, 18 February 2016, Abhishek Dubey 
wrote:

> Thanks Bennie, It worked !
>
> Would you like to tell us what else this command means, like what exactly
> it’s setting ?
>
>
>
>
>
> *Thanks & Regards,*
> *Abhishek Dubey*
>
>
>
> *From:* Bennie Schut [mailto:bennie.sc...@booking.com
> ]
> *Sent:* Thursday, February 18, 2016 5:12 PM
> *To:* user@hive.apache.org
> ; Abhishek Dubey
> 
> *Subject:* Re: How to find hive version using hive editor in hue ?
>
>
>
> Not directly but indirectly doing:
>
> set system:sun.java.command;
>
> That will likely give the the jar name which includes the version.
>
> On 18/02/16 08:12, Abhishek Dubey wrote:
>
> Thanks in advance..
>
>
>
> *Warm Regards,*
> *Abhishek Dubey*
>
>
>
>
>


Re: Is it ok to build an entire ETL/ELT data flow using HIVE queries?

2016-02-16 Thread Elliot West
I'd say that so long as you can achieve a similar quality of engineering as
is possible with other software development domains, then 'yes, it is ok'.

Specifically, our Hive projects are packaged as RPMs, built and released
with Maven, covered by suites of unit tests developed with HiveRunner, and
part of the same Jenkins CI process as other Java based projects.
Decomposing large processes into sensible units is not as easy as with
other frameworks so this may require more thought and care.

More information here:
https://cwiki.apache.org/confluence/display/Hive/Unit+testing+HQL

You have many potential alternatives depending on which languages you are
comfortable using: Pig, Flink, Cascading, Spark, Crunch, Scrunch, Scalding,
etc.

Elliot.

On Tuesday, 16 February 2016, Ramasubramanian <
ramasubramanian.naraya...@gmail.com> wrote:

> Hi,
>
> Is it ok to build an entire ETL/ELT data flow using HIVE queries?
>
> Data is stored in HIVE. We have transactional and reference data. We need
> to build a small warehouse.
>
> Need suggestion on alternatives too.
>
> Regards,
> Rams


Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-04 Thread Elliot West
Related to this and for the benefit of anyone who is using Hive: The issues
around testing and some possible approaches are summarised here:

https://cwiki.apache.org/confluence/display/Hive/Unit+testing+HQL


Ultimately there are no elegant solutions to the limitations correctly
described by Koert. However if you do choose to use Hive please be aware
that there are some good options out there for providing reasonable test
coverage of your production code. They aren't perfect by any means and are
certainly not at the level we've come to expect in other development
domains, but they are usable and therefore there is no excuse for not
writing tests! :-)

Elliot.


On 3 February 2016 at 04:49, Koert Kuipers  wrote:

> yeah but have you ever seen somewhat write a real analytical program in
> hive? how? where are the basic abstractions to wrap up a large amount of
> operations (joins, groupby's) into a single function call? where are the
> tools to write nice unit test for that?
>
> for example in spark i can write a DataFrame => DataFrame that internally
> does many joins, groupBys and complex operations. all unit tested and
> perfectly re-usable. and in hive? copy paste round sql queries? thats just
> dangerous.
>
> On Tue, Feb 2, 2016 at 8:09 PM, Edward Capriolo 
> wrote:
>
>> Hive has numerous extension points, you are not boxed in by a long shot.
>>
>>
>> On Tuesday, February 2, 2016, Koert Kuipers  wrote:
>>
>>> uuuhm with spark using Hive metastore you actually have a real
>>> programming environment and you can write real functions, versus just being
>>> boxed into some version of sql and limited udfs?
>>>
>>> On Tue, Feb 2, 2016 at 6:46 PM, Xuefu Zhang  wrote:
>>>
 When comparing the performance, you need to do it apple vs apple. In
 another thread, you mentioned that Hive on Spark is much slower than Spark
 SQL. However, you configured Hive such that only two tasks can run in
 parallel. However, you didn't provide information on how much Spark SQL is
 utilizing. Thus, it's hard to tell whether it's just a configuration
 problem in your Hive or Spark SQL is indeed faster. You should be able to
 see the resource usage in YARN resource manage URL.

 --Xuefu

 On Tue, Feb 2, 2016 at 3:31 PM, Mich Talebzadeh 
 wrote:

> Thanks Jeff.
>
>
>
> Obviously Hive is much more feature rich compared to Spark. Having
> said that in certain areas for example where the SQL feature is available
> in Spark, Spark seems to deliver faster.
>
>
>
> This may be:
>
>
>
> 1.Spark does both the optimisation and execution seamlessly
>
> 2.Hive on Spark has to invoke YARN that adds another layer to the
> process
>
>
>
> Now I did some simple tests on a 100Million rows ORC table available
> through Hive to both.
>
>
>
> *Spark 1.5.2 on Hive 1.2.1 Metastore*
>
>
>
>
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.805 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.358 seconds, Fetched 3 row(s)
>
> spark-sql> select * from dummy where id in (1, 5, 10);
>
> 1   0   0   63
> rMLTDXxxqXOZnqYRJwInlGfGBTxNkAszBGEUGELqTSRnFjRGbi   1
> xx
>
> 5   0   4   31
> vDsFoYAOcitwrWNXCxPHzIIIxwKpTlrsVjFFKUDivytqJqOHGA   5
> xx
>
> 10  99  999 188
> abQyrlxKzPTJliMqDpsfDTJUQzdNdfofUQhrKqXvRKwulZAoJe  10
> xx
>
> Time taken: 50.563 seconds, Fetched 3 row(s)
>
>
>
> So three runs returning three rows just over 50 seconds
>
>
>
> *Hive 1.2.1 on spark 1.3.1 execution engine*
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> select * from dummy where id in
> (1, 5, 10);
>
> INFO  :
>
> Query Hive on Spark job[4] stages:
>
> INFO  : 4
>
> INFO  

Re: Hive table over S3 bucket with s3a

2016-02-02 Thread Elliot West
When I last looked at this it was recommended to simply regenerate the key
as you suggest.
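
That said, one thing that may be worth double-checking before regenerating
keys (an assumption on my part, I haven't retried this on 0.14): the property
names must match the URL scheme (fs.s3n.awsAccessKeyId/fs.s3n.awsSecretAccessKey
for s3n://, fs.s3a.access.key/fs.s3a.secret.key for s3a://), and they need to
be visible to the metastore too, e.g. set in core-site.xml rather than only SET
in the CLI session. The DDL can then omit the credentials entirely, which
sidesteps the '/' problem:

-- Sketch only: assumes fs.s3a.access.key and fs.s3a.secret.key are already
-- set in the cluster configuration, so no credentials appear in the URI.
CREATE EXTERNAL TABLE IF NOT EXISTS s3_foo (
  key INT,
  value STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3a://my-bucket/foo/';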

On 2 February 2016 at 15:52, Terry Siu  wrote:

> Hi,
>
> I’m wondering if anyone has found a workaround for defining a Hive table
> over a S3 bucket when the secret access key has ‘/‘ characters in it. I’m
> using Hive 0.14 in HDP 2.2.4 and the statement that I used is:
>
>
> CREATE EXTERNAL TABLE IF NOT EXISTS s3_foo (
>
>   key INT, value STRING
>
> )
>
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t’
>
> LOCATION 's3a://:@/’;
>
>
> The following error is returned:
>
>
> FAILED: IllegalArgumentException The bucketName parameter must be
> specified.
>
>
> A workaround was to set the fs.s3a.access.key and fs.s3a.secret.key
> configuration and then change the location URL to be
> s3a:///. However, this produces the following error:
>
>
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask.
> MetaException(message:com.amazonaws.AmazonClientException: Unable to load
> AWS credentials from any provider in the chain)
>
>
> Has anyone found a way to create a Hive over S3 table when the key
> contains ‘/‘ characters or it just standard practice to simply regenerate
> the keys until IAM returns one that doesn’t have the offending characters?
>
>
> Thanks,
>
> -Terry
>


Re: Hive ExIm from on-premise HDP to Amazon EMR

2016-01-25 Thread Elliot West
Yes, we do use Falcon, but only a small fraction of the datasets we wish to
replicate are defined in this way. Could I perhaps just declare the feeds in
Falcon and not the processes that create them? Also, doesn't Falcon use Hive
ExIm/Replication to achieve this internally, and therefore might I still
encounter the same bug I am seeing now?

Thanks for your response.

On Sunday, 24 January 2016, Artem Ervits <dbis...@gmail.com> wrote:

> Have you looked at Apache Falcon?
> On Jan 8, 2016 2:41 AM, "Elliot West" <tea...@gmail.com> wrote:
>
>> Further investigation appears to show this going wrong in a copy phase of
>> the plan. The correctly functioning HDFS → HDFS import copy stage looks
>> like this:
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>> Copy
>>   source: hdfs://host:8020/staging/my_table/year_month=2015-12
>>   destination:
>> hdfs://host:8020/tmp/hive/hadoop/4f155e62-cec1-4b35-95e5-647ab5a74d3d/hive_2016-01-07_17-27-48_864_1838369633925145253-1/-ext-1
>>
>>
>> Whereas the S3 → S3 import copy stage shows an unexpected destination,
>> which was presumably meant to be a temporary location on the source file
>> system but is in fact simply the parent directory:
>>
>>
>> STAGE PLANS:
>>   Stage: Stage-1
>> Copy
>>   source: s3n://exports-bucket/my_table/year_month=2015-12
>>   destination: s3n://exports-bucket/my_table
>>
>>
>> These stage plans were obtained using:
>>
>> EXPLAIN
> >> IMPORT FROM 'source'
>> LOCATION 'destination';
>>
>>
>> I'm beginning to think that this is a bug and not something I can work
>> around, which is unfortunate as I'm not really in a position to deploy a
>> fixed version in the short term. That said, if you confirm that this is not
>> the intended behaviour, I'll raise a JIRA and possibly work on a fix.
>>
>> Thanks - Elliot.
>>
>>
> >> On 7 January 2016 at 16:53, Elliot West <tea...@gmail.com> wrote:
>>
>>> More information: This works if I move the export into EMR's HDFS and
>>> then import from there to a new location in HDFS. It does not work across
>>> FileSystems:
>>>
>>>- Import from S3 → EMR HDFS (fails in a similar manner to S3 → S3)
>>>- Import from EMR HDFS → S3 (complains that HDFS FileSystem was
>>>expected as the destination. Presumably the same FileSystem instance
>>>is used for the source and destination).
>>>
>>>
>>>
> >>> On 7 January 2016 at 12:17, Elliot West <tea...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Following on from my earlier post concerning syncing Hive data from an
>>>> on premise cluster to the cloud, I've been experimenting with the
>>>> IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to
>>>> Amazon EMR. I started out with some simple Exports/Imports as these can be
>>>> the core operations on which replication is founded. This worked fine with
>>>> some on-premise clusters running HDP-2.2.4.
>>>>
>>>>
>>>> // on cluster 1
>>>>
>>>> EXPORT TABLE my_table PARTITION (year_month='2015-12')
>>>> TO '/exports/my_table'
>>>> FOR REPLICATION ('1');
>>>>
>>>> // Copy from cluster1:/exports/my_table to cluster2:/staging/my_table
>>>>
>>>> // on cluster 2
>>>>
>>>> IMPORT FROM '/staging/my_table'
>>>> LOCATION '/warehouse/my_table';
>>>>
>>>> // Table created, partition created, data relocated to
>>>> /warehouse/my_table/year_month=2015-12
>>>>
>>>>
>>>> I next tried similar with HDP-2.2.4 → EMR (4.2.0) like so:
>>>>
>>>> // On premise HDP2.2.4
>>>> SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;
>>>>
>>>> EXPORT TABLE my_table PARTITION (year_month='2015-12')
>>>> TO 's3n://API_KEY:SECRET_KEY@exports-bucket/my_table'
>>>>
>>>> // on EMR
>>>> SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;
>>>>
>>>> IMPORT FROM 's3n://exports-bucket/my_table'
>>>> LOCATION 's3n://hive-warehouse-bucket/my_table'
>>>>
>>>>
>>>> The IMPORT behaviour I see is bizarre:
>>>>
>>>>1. Creates the folder 's3n://hive-warehouse/my_table'
>>>>2. Copies the part file from
>>>>'s3n://exports-bucket/my_table/year_month=2015-12' to
>>>>'s3n://exports-bucket/my_table' (i.e. to the parent)
>>>>3. Fails with: "ERROR exec.Task: Failed with exception checkPaths:
>>>>s3n://exports-bucket/my_table has nested
>>>>directorys3n://exports-bucket/my_table/year_month=2015-12"
>>>>
>>>> It is as if it is attempting to set the final partition location to
>>>> 's3n://exports-bucket/my_table' and not
>>>> 's3n://hive-warehouse-bucket/my_table/year_month=2015-12' as happens with
>>>> HDP → HDP.
>>>>
>>>> I've tried variations, specifying the partition on import, excluding
>>>> the location, all with the same result. Any thoughts or assistance would be
>>>> appreciated.
>>>>
>>>> Thanks - Elliot.
>>>>
>>>>
>>>>
>>>
>>


Re: Using s3 as warehouse on emr

2016-01-22 Thread Elliot West
Related to this, might it be better to use the s3a protocol instead of s3n?
https://wiki.apache.org/hadoop/AmazonS3

Additionally, can anyone advise when EMRFS is required when storing Hive
tables in S3?
http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-overview-arch.html#emr-arch-storage

On 22 January 2016 at 12:52, Zsolt Tóth  wrote:

> Hi,
>
> I'd like to use S3 as the hive warehouse on my emr 4.x cluster.
> I've set hive.metastore.warehouse.dir=s3n://testbucket/hive_warehouse and
> fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem (not sure if
> this is needed) in the hive-site.xml on the master node. Double checked the
> "set -v" output, the properties are correct.
>
> When I run a command like "create table test1 (x String);" in Hive CLI, it
> is created in the default warehouse dir (/user/hive/warehouse/) instead of
> s3n://...
>
> What am I missing here?
>
> Thanks!
> Zsolt
>


Re: Synchronizing Hive metastores across clusters

2016-01-21 Thread Elliot West
Following up on this: I've spent some time trying to evaluate the Hive
replication features but in truth it's more been an exercise in trying to
get them working! I thought I'd share my findings:

   - Conceptually this feature can sync (nearly) all Hive metadata and data
   changes between two clusters.
   - On the source cluster you require at least Hive 1.1.0
   (DbNotificationListener dependency).
   - On the destination cluster you require at least Hive 0.8.0 (IMPORT
   command dependency).
   - The environment in which you execute replication tasks requires at
   least Hive 1.2.0 (ReplicationTask dependency) although this is at the JAR
   level only (i.e. you do not need a 1.2.0 metastore running etc).
   - It is not an 'out of the box solution'; you must still write some kind
   of service that instantiates, schedules, and executes ReplicationTasks.
   This can be quite simple.
   - Exporting into S3 using Hive on EMR (AMI 4.2.0) is currently broken,
   but apparently work is underway to fix it.
   - Data inserted into Hive tables using HCatalog writers will not be
   automatically synced (HIVE-9577).
   - Mappings can be applied to destination database names, table names,
   and table and partition locations.
   - All tables at the destination are managed, even if they are external
   at the source.
   - The source and destination can be running different Hadoop
   distributions and use differing metastore database providers.
   - There is no real user level documentation.
   - It might be nice to add a Kafka based NotificationListener.

In summary it looks like quite a powerful and useful feature. However as
I'm currently running Hive 1.0.0 at my source I cannot use it in a
straightforward manner.

Thanks for your help.

Elliot.

On 18 December 2015 at 14:31, Elliot West <tea...@gmail.com> wrote:

> Eugene/Susanth,
>
> Thank you for pointing me in the direction of these features. I'll
> investigate them further to see if I can put them to good use.
>
> Cheers - Elliot.
>
> On 17 December 2015 at 20:03, Sushanth Sowmyan <khorg...@gmail.com> wrote:
>
>> Also, while I have not wiki-ized the documentation for the above, I
>> have uploaded slides from talks that I've given in hive user group
>> meetup on the subject, and also a doc that describes the replication
>> protocol followed for the EXIM replication that are attached over at
>> https://issues.apache.org/jira/browse/HIVE-10264
>>
>> On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan <khorg...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > I think that the replication work added with
>> > https://issues.apache.org/jira/browse/HIVE-7973 is exactly up this
>> > alley.
>> >
>> > Per Eugene's suggestion of MetaStoreEventListener, this replication
>> > system plugs into that and gets you a stream of notification events
>> > from HCatClient for the exact purpose you mention.
>> >
>> > There's some work still outstanding on this task, most notably
>> > documentation (sorry!) but please have a look at
>> > HCatClient.getReplicationTasks(...) and
>> > org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
>> > your implementation of  ReplicationTask.Factory to inject your own
>> > logic for how to handle the replication according to your needs.
>> > (currently there exists an implementation that uses Hive EXPORT/IMPORT
>> > to perform replication - you can look at the code for this, and the
>> > tests for these classes to see how that is achieved. Falcon already
>> > uses this to perform cross-hive-warehouse replication)
>> >
>> >
>> > Thanks,
>> >
>> > -Sushanth
>> >
>> > On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman
>> > <ekoif...@hortonworks.com> wrote:
>> >> Metastore supports MetaStoreEventListener and MetaStorePreEventListener
>> >> which may be useful here
>> >>
>> >> Eugene
>> >>
>> >> From: Elliot West <tea...@gmail.com>
>> >> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> >> Date: Thursday, December 17, 2015 at 8:21 AM
>> >> To: "user@hive.apache.org" <user@hive.apache.org>
>> >> Subject: Synchronizing Hive metastores across clusters
>> >>
>> >> Hello,
>> >>
>> >> I'm thinking about the steps required to repeatedly push Hive datasets
>> out
>> >> from a traditional Hadoop cluster into a parallel cloud based cluster.
>> This
>> >> is not a one off, it needs to be a constantly running sync process. As
>> new
>> >> tables 

Re: Writing hive column headers in 'Insert overwrite query'

2016-01-13 Thread Elliot West
Unfortunately there appears to be no nice way of doing this. I've seen
others achieve a work around by UNIONing with a table of the same
schema, containing a single row of the header names, and then finally
sorting by a synthesised rank column (see:
http://stackoverflow.com/a/25214480/74772).
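
Roughly, the shape of that workaround looks like this (a sketch only: the
output path, table and column names are made up, and the data columns are cast
to STRING so the header row unions cleanly with them):

-- Union a constant header row with the data and force it to sort first.
INSERT OVERWRITE DIRECTORY '/tmp/report'
SELECT sort_col, id, name
FROM (
  SELECT 0 AS sort_col, 'id' AS id, 'name' AS name
  FROM (SELECT id FROM source_table LIMIT 1) one_row
  UNION ALL
  SELECT 1 AS sort_col, CAST(id AS STRING) AS id, CAST(name AS STRING) AS name
  FROM source_table
) unioned
ORDER BY sort_col;
-- Note the synthetic sort_col still ends up in the output (or has to be
-- stripped downstream).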

I believe headers should really be an option on INSERT OVERWRITE DIRECTORY
as it is often a far neater way of generating reports for third party
consumption. Additionally it is not always possible to elegantly capture
the headers generated by the CLI option in some environments such as Oozie
scheduled tasks.

There are arguments against including headers in such Hive generated
datasets. However, these datasets are often reports at the end of a
pipeline whose format is mandated by the intended third party recipient. It
seems a shame to introduce yet another tool into the pipeline purely to
introduce a header row, we'd rather just do so from within Hive.

There is some background information on the current header implementation
here: https://issues.apache.org/jira/browse/HIVE-138

Cheers - Elliot.

On Wednesday, 13 January 2016, Sreenath  wrote:

> Hey,
>
> This will work but lets say i want to write the output to an HDFS location
> using INSERT OVERWRITE DIRECTORY '' , in this case even if we set
> *hive.cli.print.header=true *the headers doesn't get written . Is there a
> way to write the headers in this case
>
> On 13 January 2016 at 12:04, Ankit Bhatnagar  wrote:
>
>> r u looking for
>>
>> hive -e "*set hive.cli.print.header=true*; < query> " > output
>>
>>
>> On Tuesday, January 12, 2016 10:14 PM, Sreenath 
>> wrote:
>>
>>
>> Hi All,
>>
>> Is there a way we can write the hive column headers also along with the
>> output when we are overwriting a query's output to an HDFS or local
>> directory ?
>>
>>
>> --
>> Sreenath S Kamath
>> Bangalore
>> Ph No:+91-9590989106
>>
>>
>>
>
>
> --
> Sreenath S Kamath
> Bangalore
> Ph No:+91-9590989106
>


Re: Writing hive column headers in 'Insert overwrite query'

2016-01-13 Thread Elliot West
I created an issue in the Hive Jira related to this. You may wish to vote
on it or watch it if you believe it to be relevant.

https://issues.apache.org/jira/browse/HIVE-12860



On 13 January 2016 at 09:43, Elliot West <tea...@gmail.com> wrote:

> Unfortunately there appears to be no nice way of doing this. I've seen
> others achieve a work around by UNIONing with a table of the same
> schema, containing a single row of the header names, and then finally
> sorting by a synthesised rank column (see:
> http://stackoverflow.com/a/25214480/74772).
>
> I believe headers should really be an option on INSERT OVERWRITE DIRECTORY
> as it is often a far neater way of generating reports for third party
> consumption. Additionally it is not always possible to elegantly capture
> the headers generated by the CLI option in some environments such as Oozie
> scheduled tasks.
>
> There are arguments against including headers in such Hive generated
> datasets. However, these datasets are often reports at the end of a
> pipeline whose format is mandated by the intended third party recipient. It
> seems a shame to introduce yet another tool into the pipeline purely to
> introduce a header row, we'd rather just do so from within Hive.
>
> There is some background information on the current header implementation
> here: https://issues.apache.org/jira/browse/HIVE-138
>
> Cheers - Elliot.
>
> On Wednesday, 13 January 2016, Sreenath <sreenaths1...@gmail.com> wrote:
>
>> Hey,
>>
>> This will work but lets say i want to write the output to an HDFS
>> location using INSERT OVERWRITE DIRECTORY '' , in this case even if
>> we set *hive.cli.print.header=true *the headers doesn't get written . Is
>> there a way to write the headers in this case
>>
>> On 13 January 2016 at 12:04, Ankit Bhatnagar <ank...@yahoo-inc.com>
>> wrote:
>>
>>> r u looking for
>>>
>>> hive -e "*set hive.cli.print.header=true*; < query> " > output
>>>
>>>
>>> On Tuesday, January 12, 2016 10:14 PM, Sreenath <sreenaths1...@gmail.com>
>>> wrote:
>>>
>>>
>>> Hi All,
>>>
>>> Is there a way we can write the hive column headers also along with the
>>> output when we are overwriting a query's output to an HDFS or local
>>> directory ?
>>>
>>>
>>> --
>>> Sreenath S Kamath
>>> Bangalore
>>> Ph No:+91-9590989106
>>>
>>>
>>>
>>
>>
>> --
>> Sreenath S Kamath
>> Bangalore
>> Ph No:+91-9590989106
>>
>


Hive ExIm from on-premise HDP to Amazon EMR

2016-01-07 Thread Elliot West
Hello,

Following on from my earlier post concerning syncing Hive data from an on
premise cluster to the cloud, I've been experimenting with the
IMPORT/EXPORT functionality to move data from an on-premise HDP cluster to
Amazon EMR. I started out with some simple Exports/Imports as these can be
the core operations on which replication is founded. This worked fine with
some on-premise clusters running HDP-2.2.4.


// on cluster 1

EXPORT TABLE my_table PARTITION (year_month='2015-12')
TO '/exports/my_table'
FOR REPLICATION ('1');

// Copy from cluster1:/exports/my_table to cluster2:/staging/my_table

// on cluster 2

IMPORT FROM '/staging/my_table'
LOCATION '/warehouse/my_table';

// Table created, partition created, data relocated to
/warehouse/my_table/year_month=2015-12


I next tried similar with HDP-2.2.4 → EMR (4.2.0) like so:

// On premise HDP2.2.4
SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

EXPORT TABLE my_table PARTITION (year_month='2015-12')
TO 's3n://API_KEY:SECRET_KEY@exports-bucket/my_table'

// on EMR
SET hiveconf:hive.exim.uri.scheme.whitelist=hdfs,pfile,s3n;

IMPORT FROM 's3n://exports-bucket/my_table'
LOCATION 's3n://hive-warehouse-bucket/my_table'


The IMPORT behaviour I see is bizarre:

   1. Creates the folder 's3n://hive-warehouse/my_table'
   2. Copies the part file from
   's3n://exports-bucket/my_table/year_month=2015-12' to
   's3n://exports-bucket/my_table' (i.e. to the parent)
   3. Fails with: "ERROR exec.Task: Failed with exception checkPaths:
   s3n://exports-bucket/my_table has nested
   directorys3n://exports-bucket/my_table/year_month=2015-12"

It is as if it is attempting to set the final partition location to
's3n://exports-bucket/my_table' and not
's3n://hive-warehouse-bucket/my_table/year_month=2015-12' as happens with
HDP → HDP.

I've tried variations, specifying the partition on import, excluding the
location, all with the same result. Any thoughts or assistance would be
appreciated.

Thanks - Elliot.


Re: Attempt to do update or delete using transaction manager that does not support these operations. (state=42000,code=10294)

2015-12-22 Thread Elliot West
e t was created as transactional shown below, the
> table tt is not!
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> show create table t;
>
> +-+--+
>
> |   createtab_stmt|
>
> +-+--+
>
> | CREATE TABLE `t`(   |
>
> |   `owner` varchar(30),  |
>
> |   `object_name` varchar(30),|
>
> |   `subobject_name` varchar(30), |
>
> |   `object_id` bigint,   |
>
> |   `data_object_id` bigint,  |
>
> |   `object_type` varchar(19),|
>
> |   `created` timestamp,  |
>
> |   `last_ddl_time` timestamp,|
>
> |   `timestamp2` varchar(19), |
>
> |   `status` varchar(7),  |
>
> |   `temporary2` varchar(1),  |
>
> |   `generated` varchar(1),   |
>
> |   `secondary` varchar(1),   |
>
> |   `namespace` bigint,   |
>
> |   `edition_name` varchar(30),   |
>
> |   `padding1` varchar(4000), |
>
> |   `padding2` varchar(3500), |
>
> |   `attribute` varchar(32),  |
>
> |   `op_type` int,|
>
> |   `op_time` timestamp,  |
>
> |   `new_col` varchar(30))|
>
> | CLUSTERED BY (  |
>
> |   object_id)|
>
> | INTO 256 BUCKETS|
>
> | ROW FORMAT SERDE|
>
> |   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'   |
>
> | STORED AS INPUTFORMAT   |
>
> |   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
>
> | OUTPUTFORMAT|
>
> |   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'|
>
> | LOCATION|
>
> |   'hdfs://rhes564:9000/user/hive/warehouse/asehadoop.db/t'  |
>
> | TBLPROPERTIES ( |
>
> |   'COLUMN_STATS_ACCURATE'='false',  |
>
> |   'last_modified_by'='hduser',  |
>
> |   'last_modified_time'='1449831076',|
>
> |   'numFiles'='17',  |
>
> |   'numRows'='-1',   |
>
> |   'orc.bloom.filter.columns'='object_id',   |
>
> |   'orc.bloom.filter.fpp'='0.05',|
>
> |   'orc.compress'='SNAPPY',  |
>
> |   'orc.create.index'='true',|
>
> |   'orc.row.index.stride'='1',   |
>
> |   'orc.stripe.size'='268435456',|
>
> |   'rawDataSize'='-1',   |
>
> |   'totalSize'='64438212',   |
>
> |   'transactional'='true',   |
>
> |   'transient_lastDdlTime'='1449831076') |
>
> +-+--+
>
> 49 rows selected (0.06 seconds)
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is p

Re: Attempt to do update or delete using transaction manager that does not support these operations. (state=42000,code=10294)

2015-12-22 Thread Elliot West
Hi,

The input/output formats do not appear to be ORC, have you tried 'stored as
orc'? Additionally you'll need to set the property 'transactional=true' on
the table. Do you have the original create table statement?
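
For reference, the general shape needed for an ACID table is something like
the following (a sketch only; the bucketing column and bucket count are just
examples):

-- ACID tables must be bucketed, stored as ORC and flagged transactional.
CREATE TABLE tt (
  owner VARCHAR(30),
  object_name VARCHAR(30),
  object_id BIGINT
)
CLUSTERED BY (object_id) INTO 32 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- The session also needs the transactional lock manager:
-- SET hive.support.concurrency=true;
-- SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;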

Cheers - Elliot.

On Tuesday, 22 December 2015, Mich Talebzadeh  wrote:

> Hi,
>
>
>
> I am trying this code on table tt defined as an ORC table
>
>
>
> 0: jdbc:hive2://rhes564:10010/default> show create table tt;
>
> +-+--+
>
> | createtab_stmt  |
>
> +-+--+
>
> | CREATE TABLE `tt`(  |
>
> |   `owner` varchar(30),  |
>
> |   `object_name` varchar(30),|
>
> |   `subobject_name` varchar(30), |
>
> |   `object_id` bigint,   |
>
> |   `data_object_id` bigint,  |
>
> |   `object_type` varchar(19),|
>
> |   `created` timestamp,  |
>
> |   `last_ddl_time` timestamp,|
>
> |   `timestamp2` varchar(19), |
>
> |   `status` varchar(7),  |
>
> |   `temporary2` varchar(1),  |
>
> |   `generated` varchar(1),   |
>
> |   `secondary` varchar(1),   |
>
> |   `namespace` bigint,   |
>
> |   `edition_name` varchar(30),   |
>
> |   `padding1` varchar(4000), |
>
> |   `padding2` varchar(3500), |
>
> |   `attribute` varchar(32),  |
>
> |   `op_type` int,|
>
> |   `op_time` timestamp,  |
>
> |   `new_col` varchar(30))|
>
> | ROW FORMAT SERDE|
>
> |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
>
> | STORED AS INPUTFORMAT   |
>
> |   'org.apache.hadoop.mapred.TextInputFormat'|
>
> | OUTPUTFORMAT|
>
> |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'  |
>
> | LOCATION|
>
> |   'hdfs://rhes564:9000/user/hive/warehouse/asehadoop.db/tt' |
>
> | TBLPROPERTIES ( |
>
> |   'COLUMN_STATS_ACCURATE'='true',   |
>
> |   'numFiles'='546', |
>
> |   'numRows'='8299588',  |
>
> |   'rawDataSize'='18611443520',  |
>
> |   'totalSize'='18619743108',|
>
> |   'transient_lastDdlTime'='1450337155') |
>
> +-+--+
>
>
>
> The following delete statement:
>
>
>
> delete from tt where exists(select 1 from t where tt.object_id =
> t.object_id);
>
> Error: Error while compiling statement: FAILED: SemanticException [Error
> 10294]: Attempt to do update or delete using transaction manager that does
> not support these operations. (state=42000,code=10294)
>
>
>
>
>
>
>
> Mich Talebzadeh
>
>
>
> *Sybase ASE 15 Gold Medal Award 2008*
>
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf
>
> Author of the books* "A Practitioner’s Guide to Upgrading to Sybase ASE
> 15", ISBN 978-0-9563693-0-7*.
>
> co-author *"Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4*
>
> *Publications due shortly:*
>
> *Complex Event Processing in Heterogeneous Environments*, ISBN:
> 978-0-9563693-3-8
>
> *Oracle and Sybase, Concepts and Contrasts*, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Ltd, its subsidiaries nor their employees
> accept 

Re: Synchronizing Hive metastores across clusters

2015-12-18 Thread Elliot West
Eugene/Susanth,

Thank you for pointing me in the direction of these features. I'll
investigate them further to see if I can put them to good use.

Cheers - Elliot.

On 17 December 2015 at 20:03, Sushanth Sowmyan <khorg...@gmail.com> wrote:

> Also, while I have not wiki-ized the documentation for the above, I
> have uploaded slides from talks that I've given in hive user group
> meetup on the subject, and also a doc that describes the replication
> protocol followed for the EXIM replication that are attached over at
> https://issues.apache.org/jira/browse/HIVE-10264
>
> On Thu, Dec 17, 2015 at 11:59 AM, Sushanth Sowmyan <khorg...@gmail.com>
> wrote:
> > Hi,
> >
> > I think that the replication work added with
> > https://issues.apache.org/jira/browse/HIVE-7973 is exactly up this
> > alley.
> >
> > Per Eugene's suggestion of MetaStoreEventListener, this replication
> > system plugs into that and gets you a stream of notification events
> > from HCatClient for the exact purpose you mention.
> >
> > There's some work still outstanding on this task, most notably
> > documentation (sorry!) but please have a look at
> > HCatClient.getReplicationTasks(...) and
> > org.apache.hive.hcatalog.api.repl.ReplicationTask. You can plug in
> > your implementation of  ReplicationTask.Factory to inject your own
> > logic for how to handle the replication according to your needs.
> > (currently there exists an implementation that uses Hive EXPORT/IMPORT
> > to perform replication - you can look at the code for this, and the
> > tests for these classes to see how that is achieved. Falcon already
> > uses this to perform cross-hive-warehouse replication)
> >
> >
> > Thanks,
> >
> > -Sushanth
> >
> > On Thu, Dec 17, 2015 at 11:22 AM, Eugene Koifman
> > <ekoif...@hortonworks.com> wrote:
> >> Metastore supports MetaStoreEventListener and MetaStorePreEventListener
> >> which may be useful here
> >>
> >> Eugene
> >>
> >> From: Elliot West <tea...@gmail.com>
> >> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> >> Date: Thursday, December 17, 2015 at 8:21 AM
> >> To: "user@hive.apache.org" <user@hive.apache.org>
> >> Subject: Synchronizing Hive metastores across clusters
> >>
> >> Hello,
> >>
> >> I'm thinking about the steps required to repeatedly push Hive datasets
> out
> >> from a traditional Hadoop cluster into a parallel cloud based cluster.
> This
> >> is not a one off, it needs to be a constantly running sync process. As
> new
> >> tables and partitions are added in one cluster, they need to be synced
> to
> >> the cloud cluster. Assuming for a moment that I have the HDFS data
> syncing
> >> working, I'm wondering what steps I need to take to reliably ship the
> >> HCatalog metadata across. I use HCatalog as the point of truth as to
> when
> >> when data is available and where it is located and so I think that
> metadata
> >> is a critical element to replicate in the cloud based cluster.
> >>
> >> Does anyone have any recommendations on how to achieve this in
> practice? One
> >> issue (of many I suspect) is that Hive appears to store table/partition
> >> locations internally with absolute, fully qualified URLs, therefore
> unless
> >> the target cloud cluster is similarly named and configured some path
> >> transformation step will be needed as part of the synchronisation
> process.
> >>
> >> I'd appreciate any suggestions, thoughts, or experiences related to
> this.
> >>
> >> Cheers - Elliot.
> >>
> >>
>


Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
Hi Mich,

In your scenario is there any coordination of data syncing on HDFS and
metadata in HCatalog? I.e. could a situation occur where the replicated
metastore shows a partition as 'present' yet the data that backs the
partition in HDFS has not yet arrived at the replica filesystem? I imagine
one could avoid this by snapshotting the source metastore, then syncing
HDFS, and then finally shipping the snapshot to the replica(?).

Thanks - Elliot.

On 17 December 2015 at 16:57, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Sounds like one way replication of metastore. Depending on your metastore
> platform that could be achieved pretty easily.
>
>
>
> Mine is Oracle and I use Materialised View replication which is pretty
> good but no latest technology. Others would be GoldenGate or SAP
> replication server.
>
>
>
> HTH,
>
>
>
> Mich
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 17 December 2015 16:47
> *To:* user@hive.apache.org
> *Subject:* RE: Synchronizing Hive metastores across clusters
>
>
>
> Are both clusters in active/active mode or the cloud based cluster is
> standby?
>
>
>
> *From:* Elliot West [mailto:tea...@gmail.com <tea...@gmail.com>]
> *Sent:* 17 December 2015 16:21
> *To:* user@hive.apache.org
> *Subject:* Synchronizing Hive metastores across clusters
>
>
>
> Hello,
>
>
>
> I'm thinking about the steps required to repeatedly push Hive datasets out
> from a traditional Hadoop cluster into a parallel cloud based cluster. This
> is not a one off, it needs to be a constantly running sync process. As new
> tables and partitions are added in one cluster, they need to be synced to
> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
> working, I'm wondering what steps I need to take to reliably ship the
> HCatalog metadata across. I use HCatalog as the point of truth as to when
> data is available and where it is located and so I think that metadata
> is a critical element to replicate in the cloud based cluster.
>
>
>
> Does anyone have any recommendations on how to achieve this in practice?
> One issue (of many I suspect) is that Hive appears to store table/partition
> locations internally with absolute, fully qualified URLs, therefore unless
> the target cloud cluster is similarly named and configured some path
> transformation step will be needed as part of the synchronisation process.
>
>
>
> I'd appreciate any suggestions, thoughts, or experiences related to this.
>
>
>
> Cheers - Elliot.
>
>
>
>
>


Re: Synchronizing Hive metastores across clusters

2015-12-17 Thread Elliot West
Hi Mich,

Thanks for your reply. The cloud cluster is to be used for read-only
analytics, so effectively one-way, stand-by. I'll take a look at your
suggested technologies as I'm not familiar with them.

Thanks - Elliot.

On 17 December 2015 at 16:57, Mich Talebzadeh <m...@peridale.co.uk> wrote:

> Sounds like one way replication of metastore. Depending on your metastore
> platform that could be achieved pretty easily.
>
>
>
> Mine is Oracle and I use Materialised View replication which is pretty
> good but no latest technology. Others would be GoldenGate or SAP
> replication server.
>
>
>
> HTH,
>
>
>
> Mich
>
>
>
> *From:* Mich Talebzadeh [mailto:m...@peridale.co.uk]
> *Sent:* 17 December 2015 16:47
> *To:* user@hive.apache.org
> *Subject:* RE: Synchronizing Hive metastores across clusters
>
>
>
> Are both clusters in active/active mode or the cloud based cluster is
> standby?
>
>
>
> *From:* Elliot West [mailto:tea...@gmail.com <tea...@gmail.com>]
> *Sent:* 17 December 2015 16:21
> *To:* user@hive.apache.org
> *Subject:* Synchronizing Hive metastores across clusters
>
>
>
> Hello,
>
>
>
> I'm thinking about the steps required to repeatedly push Hive datasets out
> from a traditional Hadoop cluster into a parallel cloud based cluster. This
> is not a one off, it needs to be a constantly running sync process. As new
> tables and partitions are added in one cluster, they need to be synced to
> the cloud cluster. Assuming for a moment that I have the HDFS data syncing
> working, I'm wondering what steps I need to take to reliably ship the
> HCatalog metadata across. I use HCatalog as the point of truth as to when
> data is available and where it is located and so I think that metadata
> is a critical element to replicate in the cloud based cluster.
>
>
>
> Does anyone have any recommendations on how to achieve this in practice?
> One issue (of many I suspect) is that Hive appears to store table/partition
> locations internally with absolute, fully qualified URLs, therefore unless
> the target cloud cluster is similarly named and configured some path
> transformation step will be needed as part of the synchronisation process.
>
>
>
> I'd appreciate any suggestions, thoughts, or experiences related to this.
>
>
>
> Cheers - Elliot.
>
>
>
>
>


Re: Hotels.com

2015-11-30 Thread Elliot West
This looks like a phishing attempt. Please do no open it.

On 30 November 2015 at 10:32, @Sanjiv Singh  wrote:

> 
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Mon, Nov 30, 2015 at 1:16 PM, Amrit Jangid 
> wrote:
>
>> ??
>>
>> On Mon, Nov 30, 2015 at 11:56 AM, Roshini Johri 
>> wrote:
>>
>>> [image: Inline image 2]
>>>
>>> Roshini Johri
>>>
>>> [image: Inline image 1] 
>>>
>>>
>>>
>>> She borrowed the book from him many years ago and hasn't yet returned
>>> it. She advised him to come back at once. Someone I know recently combined
>>> Maple Syrup & buttered Popcorn thinking it would taste like caramel
>>> popcorn. It didn’t and they don’t recommend anyone else do it either. A
>>> glittering gem is not enough. They got there early, and they got really
>>> good seats.
>>>
>>
>>
>>
>> --
>>
>> Regards,
>> Amrit
>>
>>
>


Re: Bulk load in Hive transactions backed table

2015-11-18 Thread Elliot West
Are you loading new data (inserts) or mutating existing data
(update/delete) or both? And by 'transactions' are you referring to Hive
ACID transactional tables? If so:

For new data, I think you should be able to use:

INSERT INTO TABLE transactional_table SELECT ... FROM table_over_file_to_be_loaded
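
For example, something along these lines (a sketch; the external table layout
is an assumption about your files, and all names are made up):

-- Expose the files to be loaded as an external table...
CREATE EXTERNAL TABLE staged_records (
  id BIGINT,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/staged_records';

-- ...then load them into the ACID table with a single bulk insert
-- (with hive.support.concurrency=true and DbTxnManager set in the session).
INSERT INTO TABLE transactional_table
SELECT id, payload FROM staged_records;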


For updates, deletes, and inserts, there is the Mutation API, which allows you
to bulk mutate large volumes of records in a single transaction; however, it's
a Java API and you'd need to implement a job to invoke it:
http://htmlpreview.github.io/?https://github.com/apache/hive/blob/master/hcatalog/streaming/src/java/org/apache/hive/hcatalog/streaming/mutate/package.html

Also, there is a proposed HQL MERGE command that would allow you to do
this, but it has not been implemented as yet:
https://issues.apache.org/jira/browse/HIVE-10924

Elliot.


On 18 November 2015 at 10:57, Jagat Singh  wrote:

> Hi,
>
> Is it possible to do bulk load using files in hive table backed by
> transactions instead of update statements.
>
> Thanks
>


Re: hive transaction strange behaviour

2015-11-13 Thread Elliot West
It is the compaction process that creates the base files. Check your
configuration to ensure that compaction should be running. I believe the
compactor should run periodically. You can also request a compaction using
the appropriate ALTER TABLE HQL DDL command.
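
For example (a sketch; the table name and partition spec are made up):

-- Request a major compaction explicitly, then watch its progress.
ALTER TABLE my_acid_table PARTITION (dt='2015-11-13') COMPACT 'major';
SHOW COMPACTIONS;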

Elliot.

On Friday, 13 November 2015, Sanjeev Verma 
wrote:

> I have enabled hive transactions and am able to see the delta files created
> for some of the partitions, but I do not see any base file created yet. It
> seems strange to me seeing so many delta files without any base file.
> Could somebody let me know when the base file is created.
>
> Thanks
>


Re: query orc file by hive

2015-11-09 Thread Elliot West
Hi,

You can create a table and point the location property to the folder
containing your ORC file:

CREATE EXTERNAL TABLE orc_table (
  
)
STORED AS ORC
LOCATION '/hdfs/folder/containing/orc/file'
;


https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable

Thanks - Elliot.

On 9 November 2015 at 09:44, patcharee  wrote:

> Hi,
>
> How can I query an orc file (*.orc) by Hive? This orc file is created by
> other apps, like spark, mr.
>
> Thanks,
> Patcharee
>


Re: Locking when using the Metastore/HCatalog APIs.

2015-10-28 Thread Elliot West
Captured in HIVE-12285.

Thanks - Elliot.

On 28 October 2015 at 08:54, Elliot West <tea...@gmail.com> wrote:

> Perhaps, I'd expect one might wish to make multiple API calls when holding
> a lock, but a suitably implemented client may be able to manage this
> seamlessly. Then again a user may also wish to have explicit control of
> locking. No reason it couldn't offer both. I'll create a JIRA to capture
> this. As we are working with these APIs currently, I expect that we may be
> able to submit a patch.
>
> Thanks for your responses,
>
> Elliot.
>
>
> On Tuesday, 27 October 2015, Eugene Koifman <ekoif...@hortonworks.com>
> wrote:
>
>> Wouldn’t it make more sense for the api to acquire required locks
>> automatically?   That seems like a simpler user model.
>>
>> From: Alan Gates <alanfga...@gmail.com>
>> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
>> Date: Tuesday, October 27, 2015 at 11:34 AM
>> To: "user@hive.apache.org" <user@hive.apache.org>
>> Subject: Re: Locking when using the Metastore/HCatalog APIs.
>>
>> Answers inlined.
>>
>> Elliot West
>> October 22, 2015 at 6:40
>> I notice from the Hive locking wiki page
>> <https://cwiki.apache.org/confluence/display/Hive/Locking> that locks
>> may be acquired for a range of HQL DDL operations. I wanted to know how the
>> locking scheme is mapped/employed by equivalent operations in the
>> Metastore and HCatalog APIs. Consider the operation:
>>
>> alter table T1 drop partition P1
>>
>>
>> This apparently requires a shared lock on T1 and an exclusive lock on P1.
>> However, when following the invocation chain on the equivalent APIs I can
>> see no acquisition of such locks:
>>
>> HCatClient.dropPartitions(...)
>> IMetaStoreClient.dropPartition(...)
>>
>>
>> I notice that IMetaStoreClient exposes methods to lock and unlock
>> resources and I have used these when working with the ACID APIs. However I
>> can find nothing similar on the HCatalog API. As a user of these APIs:
>>
>>1. Do I need to acquire/release locks as declared on the wiki page
>>when using equivalent operations in these APIs?
>>
>> Yes.
>>
>>
>>1. If not, why not?
>>2. Ideally, should the HCatClient also expose lock/unlock methods
>>like the Metastore API?
>>
>> Yes.
>>
>>
>>1. If not, why not?
>>
>> Thank you - Elliot.
>>
>> Alan.
>>
>


Re: Locking when using the Metastore/HCatalog APIs.

2015-10-28 Thread Elliot West
Perhaps, I'd expect one might wish to make multiple API calls when holding
a lock, but a suitably implemented client may be able to manage this
seamlessly. Then again a user may also wish to have explicit control of
locking. No reason it couldn't offer both. I'll create a JIRA to capture
this. As we are working with these APIs currently, I expect that we may be
able to submit a patch.

Thanks for your responses,

Elliot.

On Tuesday, 27 October 2015, Eugene Koifman <ekoif...@hortonworks.com>
wrote:

> Wouldn’t it make more sense for the api to acquire required locks
> automatically?   That seems like a simpler user model.
>
> From: Alan Gates <alanfga...@gmail.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Tuesday, October 27, 2015 at 11:34 AM
> To: "user@hive.apache.org" <user@hive.apache.org>
> Subject: Re: Locking when using the Metastore/HCatalog APIs.
>
> Answers inlined.
>
> Elliot West <tea...@gmail.com>
> October 22, 2015 at 6:40
> I notice from the Hive locking wiki page
> <https://cwiki.apache.org/confluence/display/Hive/Locking> that locks may
> be acquired for a range of HQL DDL operations. I wanted to know how the
> locking scheme is mapped/employed by equivalent operations in the
> Metastore and HCatalog APIs. Consider the operation:
>
> alter table T1 drop partition P1
>
>
> This apparently requires a shared lock on T1 and an exclusive lock on P1.
> However, when following the invocation chain on the equivalent APIs I can
> see no acquisition of such locks:
>
> HCatClient.dropPartitions(...)
> IMetaStoreClient.dropPartition(...)
>
>
> I notice that IMetaStoreClient exposes methods to lock and unlock
> resources and I have used these when working with the ACID APIs. However I
> can find nothing similar on the HCatalog API. As a user of these APIs:
>
>1. Do I need to acquire/release locks as declared on the wiki page
>when using equivalent operations in these APIs?
>
> Yes.
>
>
>1. If not, why not?
>2. Ideally, should the HCatClient also expose lock/unlock methods like
>the Metastore API?
>
> Yes.
>
>
>1. If not, why not?
>
> Thank you - Elliot.
>
> Alan.
>


Re: the column names removed after insert select

2015-10-23 Thread Elliot West
Excellent news. Thanks.

On 23 October 2015 at 15:50, Prasanth Jayachandran <
pjayachand...@hortonworks.com> wrote:

> Hi
>
> This has been fixed recently
> https://issues.apache.org/jira/browse/HIVE-4243
>
> This used to be a problem with the way hive writes rows out. The
> ObjectInspectors sent out by hive’s filesink operator contains internal
> column names and not the names of the destination table.
>
> From the record reader’s perspective, the column names doesn’t really
> matter and column projections uses the column ids instead of names.
>
> Thanks
> Prasanth
>
> > On Oct 23, 2015, at 12:25 AM, patcharee 
> wrote:
> >
> > Hi
> >
> > I inserted a table from select (insert into table newtable select date,
> > hh, x, y from oldtable). After the insert the column names of the table
> > have been removed, see the output below when I use hive --orcfiledump
> >
> > - Type: struct<_col0:int,_col1:int,_col2:int,_col3:int>
> >
> > while it is supposed to be
> >
> > - Type: struct
> >
> > Any ideas how this happened and how I can fix it. Please suggest me.
> >
> > BR,
> > Patcharee
> >
>
>


Re: the column names removed after insert select

2015-10-23 Thread Elliot West
I was seeing something similar in the initial ORC delta file when inserting
rows into a newly created ACID table. Subsequent deltas had the correct
columns names.

On 23 October 2015 at 08:25, patcharee  wrote:

> Hi
>
> I inserted a table from select (insert into table newtable select date,
> hh, x, y from oldtable). After the insert the column names of the table
> have been removed, see the output below when I use hive --orcfiledump
>
> - Type: struct<_col0:int,_col1:int,_col2:int,_col3:int>
>
> while it is supposed to be
>
> - Type: struct
>
> Any ideas how this happened and how I can fix it. Please suggest me.
>
> BR,
> Patcharee
>


Access to wiki (documenting locking requirements).

2015-10-22 Thread Elliot West
Hi,

May I have access to edit the wiki? My confluence user name is 'teabot'.

I've been looking briefly at ALTER TABLE CONCATENATE and noticed that the
operation isn't listed on the Hive/Locking wiki page
 even though it
acquires an exclusive lock.
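
It's easy to observe with something like the following (a sketch; the table is
made up, and CONCATENATE only applies to RCFile/ORC backed tables):

-- Session 1: trigger the file merge.
ALTER TABLE my_orc_table PARTITION (dt='2015-10-22') CONCATENATE;

-- Session 2: the exclusive lock is visible while it runs.
SHOW LOCKS my_orc_table;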

Thanks - Elliot.


Locking when using the Metastore/HCatalog APIs.

2015-10-22 Thread Elliot West
I notice from the Hive locking wiki page
 that locks may
be acquired for a range of HQL DDL operations. I wanted to know how the
locking scheme is mapped/employed by equivalent operations in the
Metastore and HCatalog APIs. Consider the operation:

alter table T1 drop partition P1


This apparently requires a shared lock on T1 and an exclusive lock on P1.
However, when following the invocation chain on the equivalent APIs I can
see no acquisition of such locks:

HCatClient.dropPartitions(...)
IMetaStoreClient.dropPartition(...)


I notice that IMetaStoreClient exposes methods to lock and unlock resources
and I have used these when working with the ACID APIs. However I can find
nothing similar on the HCatalog API. As a user of these APIs:

   1. Do I need to acquire/release locks as declared on the wiki page when
   using equivalent operations in these APIs?
   2. If not, why not?
   3. Ideally, should the HCatClient also expose lock/unlock methods like
   the Metastore API?
   4. If not, why not?

Thank you - Elliot.


Strange HiveConf behaviour

2015-10-08 Thread Elliot West
I've been writing some unit tests around some extensions to Hive recently
and came across an issue that left me perplexed. If I execute:

System.err.println(new HiveConf().get(FileSystem.FS_DEFAULT_NAME_KEY,
FileSystem.DEFAULT_FS));


this prints:

core-site.xml


I suspect I need to provide a better configuration. However, in the absence
of a correct configuration, is this the intended behaviour or a bug? If
intended, can someone explain the purpose of returning this value in this
scenario?

Thanks - Elliot.


Re: Better way to do UDF's for Hive

2015-10-01 Thread Elliot West
Perhaps a macro?

CREATE TEMPORARY MACRO state_from_city (city STRING)
  /* HQL column logic */ ...;
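
For example, a toy version (the city-to-state mapping here is obviously made
up):

CREATE TEMPORARY MACRO state_from_city(city STRING)
  CASE city
    WHEN 'Austin'    THEN 'TX'
    WHEN 'Sao Paulo' THEN 'SP'
    ELSE NULL
  END;

SELECT
  tbl.id AS id,
  tbl.city AS city,
  state_from_city(tbl.city) AS state
FROM my_db.my_table tbl;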


On 1 October 2015 at 14:11, Daniel Lopes  wrote:

> Hi,
>
> I'd like to know the good way to do a a UDF for a single field, like
>
> SELECT
>   tbl.id AS id,
>   tbl.name AS name,
>   tbl.city AS city,
>   state_from_city(tbl.city) AS state
> FROM
>   my_db.my_table tbl;
>
> *Native Java*? *Python *over *Hadoop* *Streaming*?
>
> I prefer Python, but I don't know how to do in a good way.
>
> Thanks,
>
> *Daniel Lopes, B.Eng*
> Data Scientist - BankFacil
> CREA/SP 5069410560
> 
> Mob +55 (18) 99764-2733 
> Ph +55 (11) 3522-8009
> http://about.me/dannyeuu
>
> Av. Nova Independência, 956, São Paulo, SP
> Bairro Brooklin Paulista
> CEP 04570-001
> https://www.bankfacil.com.br
>
>


Re: Decomposing nested Hive statements with views

2015-09-15 Thread Elliot West
On 15 September 2015 at 00:09, Gopal Vijayaraghavan 
wrote:

> CTE


Thank you for the in depth reply Gopal. I've just had a quick try out of
CTEs but can't see how they address my original problem of decomposing a
query into separate independent units. It seems that the CTE definition ('
with' clause) must always be immediately followed by the use of the CTE ('
select' for example). Therefore, the CTE queries must still be inlined into
the parent queries, creating one monolithic query.

My goal is to create separate independent processing units (queries) that
can then be composed into larger queries. Effectively something like:

WITH a AS (SELECT ... FROM A);
// I would now like to be able to test 'a'

// This can be in a separate file imported with 'source'



WITH b AS (SELECT ... FROM B);
// I would now like to be able to test 'b'
// This can be in a separate file imported with 'source'


// Compose 'a' and 'b'

SOURCE a.hql;

SOURCE b.hql;

WITH c AS (
SELECT ... FROM (
SELECT ... FROM a WHERE ...
) A LEFT JOIN (
SELECT ... FROM b
) B ON (...)
);
// I would now like to be able to test the composition 'c'
// that uses both 'a' and 'b'.
// This can be in a separate file imported with 'source'


Thanks - Elliot.


Re: Organising Hive Scripts

2015-09-14 Thread Elliot West
Hi Charles,

You can also split out column level logic using Hive macros. These also
allow re-use of said logic:


hive> create temporary macro MYSIGMOID(x DOUBLE)
> 2.0 / (1.0 + exp(-x));
OK

hive> select MYSIGMOID(1.0) from dual;
OK

1.4621171572600098


Cheers - Elliot.

On 11 September 2015 at 20:51, Charles Mean  wrote:

> Great Dmitry,
>
> It will certainly help me a lot.
> I will give it a try, thank you very much for your help.
>
> On Fri, Sep 11, 2015 at 4:34 PM, Dmitry Tolpeko 
> wrote:
>
>> Charles,
>>
>> Not sure what you can do in Hive CLI right now, but consider a new Hive
>> HPL/SQL component that will be included to new Hive versions and that
>> currently you can compile and run separately, see
>> https://github.com/apache/hive/tree/master/hplsql or www.hplsql.org
>>
>> It supports include files, user defined procedures and functions,
>> exception based exception handling, flow of control statements and so on.
>> It is new, under active development so any feedback is welcome.
>>
>> Dmitry
>>
>> On Fri, Sep 11, 2015 at 8:32 PM, Charles Mean 
>> wrote:
>>
>>> Hello,
>>>
>>> I am working with a huge hive script that I would like to improve its
>>> organisation to a better maintenance in the future.
>>> Looking into this issue, I did not found any kind of include or
>>> something to split my script into some smaller parts.
>>> So, Is there some sort of pattern that is implemented to improve script
>>> structure ?
>>>
>>
>>
>


Decomposing nested Hive statements with views

2015-09-14 Thread Elliot West
Hello,

We have many HQL scripts that select from nested sub-selects. In many cases
the nesting can be a few levels deep:

SELECT ... FROM (
SELECT ... FROM (
SELECT ... FROM (
SELECT ... FROM a WHERE ...
) A LEFT JOIN (
SELECT ... FROM b
) B ON (...)
) ab FULL OUTER JOIN (
SELECT ... FROM c WHERE ...
  ) C ON (...)
) abc LEFT JOIN (
  SELECT ... FROM d WHERE ...
) D ON (...)
GROUP BY ...;


Such queries are difficult to test as they are complex monoliths. While we
can decompose and modularise them at the column level with UDFs and macros,
it is not as evident to me how best to break up the nested queries into
separate components. I consider the options to be:

1. Make each statement a top level query that has an associated table

CREATE TABLE Y AS
SELECT ... FROM X ...

CREATE TABLE Z AS
SELECT ... FROM Y ...


The problems with this approach as I see it are that:

   - The resultant process will generate far more HDFS I/O as each query
   will write a table which is then read by the parent query.
   - As each query results in an isolated job, Hive will not be able to
   optimise 'across queries'. I do not know how Hive's optimiser functions so
   this is no more than a naïve assumption.

2. Encapsulate each query as a view. In this way the statements are defined
in isolation and merely describe a processing step rather than the
processed data itself.

CREATE VIEW Y AS
SELECT ... FROM X ...

CREATE VIEW Z AS
SELECT ... FROM Y ...


I'd expect that a given query would result in the same execution plan
regardless of whether it was delivered as a single inlined statement or as
a composition of views. However, for some time I've heard that views
perform poorly and should be avoided.
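
One cheap way to check that assumption would be to compare plans on a trivial
case (a sketch; table and column names are made up):

CREATE VIEW y AS
SELECT id, value FROM x WHERE value > 0;

-- Plan for the view-based form...
EXPLAIN SELECT id, count(*) FROM y GROUP BY id;

-- ...versus the equivalent inlined form; ideally these are identical.
EXPLAIN SELECT id, count(*)
FROM (SELECT id, value FROM x WHERE value > 0) sub
GROUP BY id;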

Clearly I should try out a few queries concretely, but with respect to this
issue can someone illuminate me on:

   - What is considered the best practice for modularising this type of
   query?
   - Is there a penalty for using views over a fully inlines query?
   - Are there any other options that I haven't considered?

Many thanks - Elliot.


Hive Macros roadmap

2015-09-11 Thread Elliot West
Hi,

I noticed some time ago the Hive Macro feature. To me at least this seemed
like an excellent addition to HQL, allowing the user to encapsulate complex
column logic as an independent HQL, reusable macro while avoiding the
complexities of Java UDFs. However, few people seem to be aware of them or
use them. If you are unfamiliar with macros they look like this:

hive> create temporary macro MYSIGMOID(x DOUBLE)
> 2.0 / (1.0 + exp(-x));
OK

hive> select MYSIGMOID(1.0) from dual;
OK

1.4621171572600098


As far as I can tell, they are no longer documented on the Hive wiki. There
is a tiny reference to them in the O'Reilly 'Programming Hive' book (page
185). Can anyone advise me on the following:

   - Are there are plans to keep or remove this functionality?
   - Are there are plans to document this functionality?
   - Aside from limitations of HQL are there compelling reasons not to use
   macros?

Thanks - Elliot.


Hive Concurrency support

2015-08-23 Thread Elliot West
Shared locks are all that is required to insert data into transactional
tables. Multiple clients can hold a shared lock simultaneously. Each client
will write using uniquely assigned transaction ids so that their work is
isolated from one another. It should actually be possible for multiple
clients to insert into the same partition concurrently.

See slide 12 in:
http://www.slideshare.net/mobile/Hadoop_Summit/w-525210-comalley

Thanks - Elliot.

On Friday, 21 August 2015, Suyog Parlikar suyogparli...@gmail.com wrote:

 Thanks Elliot,

 For the immediate reply.

 But as per hive locking mechanism,
 While inserting data to a partition hive acquires exclusive lock on that
 partition and shared lock on the entire table.

 How is it possible to insert data into a different partition of the same
 table while having shared lock on the table which does not allow write
 operation.

 Please correct me if my understanding about the same is wrong.
 (I am using hql inserts only for these operations)

 Thanks,
 Suyog
 On Aug 21, 2015 7:28 PM, Elliot West tea...@gmail.com wrote:

 I presume you mean into different partitions of a table at the same
 time? This should be possible. It is certainly supported by the streaming
 API, which is probably where you want to look if you need to insert large
 volumes of data to multiple partitions concurrently. I can't see why it
 would not also be possible with HQL INSERTs.

 On Friday, 21 August 2015, Suyog Parlikar suyogparli...@gmail.com
 wrote:

 Can we insert data in different partitions of a table at a time.

 Waiting for inputs .

 Thanks in advance.

 - suyog




Re: Hive Concurrency support

2015-08-23 Thread Elliot West
Is the table configured to be transactional?

https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-TableProperties
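
For reference, the usual minimum is something like this (a sketch based on the
wiki page above; the bucketing column and count are just examples):

-- Client/session side:
SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
-- (and hive.enforce.bucketing=true on older releases)

-- The table itself must be bucketed, ORC and flagged transactional:
CREATE TABLE my_txn_table (id INT, value STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');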

On Monday, 24 August 2015, Suyog Parlikar suyogparli...@gmail.com wrote:

 Hello Elliot,

 Thanks for clarification.

 But I am still not able to understand the hive working.

 My cluster has following properties.

 Txn.manager- dummyTxnManager

 Cocurrency - true

 Actually I am trying to insert data into two different partitions of a
 table at the same time.

 When I check the locks present on the table ,it shows the shared
 lock.Which does not allow the writes on the table.

 So I wanted to understand that ,

 Does hive execute these two insert operations sequentially or it executes
 it in parallel .

 Thanks,
 Suyog
 On Aug 23, 2015 4:23 PM, Elliot West tea...@gmail.com wrote:

 Shared locks are all that is required to insert data into transactional
 tables. Multiple clients can hold a shared lock simultaneously. Each client
 will write using uniquely assigned transaction ids so that their work is
 isolated from one another. It should actually be possible for multiple
 clients to insert into the same partition concurrently.

 See slide 12 in:
 http://www.slideshare.net/mobile/Hadoop_Summit/w-525210-comalley

 Thanks - Elliot.

 On Friday, 21 August 2015, Suyog Parlikar suyogparli...@gmail.com
 wrote:

 Thanks Elliot,

 For the immediate reply.

 But as per hive locking mechanism,
 While inserting data to a partition hive acquires exclusive lock on that
 partition and shared lock on the entire table.

 How is it possible to insert data into a different partition of the same
 table while having shared lock on the table which does not allow write
 operation.

 Please correct me if my understanding about the same is wrong.
 (I am using hql inserts only for these operations)

 Thanks,
 Suyog
 On Aug 21, 2015 7:28 PM, Elliot West tea...@gmail.com wrote:

 I presume you mean into different partitions of a table at the same
 time? This should be possible. It is certainly supported by the streaming
 API, which is probably where you want to look if you need to insert large
 volumes of data to multiple partitions concurrently. I can't see why it
 would not also be possible with HQL INSERTs.

 On Friday, 21 August 2015, Suyog Parlikar suyogparli...@gmail.com
 wrote:

 Can we insert data in different partitions of a table at a time.

 Waiting for inputs .

 Thanks in advance.

 - suyog




Re: Hive Concurrency support

2015-08-21 Thread Elliot West
I presume you mean into different partitions of a table at the same
time? This should be possible. It is certainly supported by the streaming
API, which is probably where you want to look if you need to insert large
volumes of data to multiple partitions concurrently. I can't see why it
would not also be possible with HQL INSERTs.
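
For illustration, here is a minimal sketch of one client writing to a single
partition with the streaming API; a second client would do exactly the same
against a different partition (and its own connection) at the same time. The
metastore URI, table, and column names are made up for the example, and error
handling is omitted:

import java.util.Arrays;
import org.apache.hive.hcatalog.streaming.DelimitedInputWriter;
import org.apache.hive.hcatalog.streaming.HiveEndPoint;
import org.apache.hive.hcatalog.streaming.StreamingConnection;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

// One endpoint/connection per partition; each client writes within its own
// transactions, so writers on different partitions can proceed concurrently.
HiveEndPoint endPoint = new HiveEndPoint("thrift://metastore:9083", "default",
    "test_table", Arrays.asList("Asia", "India"));
StreamingConnection connection = endPoint.newConnection(true);
DelimitedInputWriter writer =
    new DelimitedInputWriter(new String[] { "id", "message" }, ",", endPoint);

TransactionBatch batch = connection.fetchTransactionBatch(10, writer);
batch.beginNextTransaction();
batch.write("1,x".getBytes());
batch.commit();
batch.close();
connection.close();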

On Friday, 21 August 2015, Suyog Parlikar suyogparli...@gmail.com wrote:

 Can we insert data in different partitions of a table at a time.

 Waiting for inputs .

 Thanks in advance.

 - suyog



Re: delta file compact take no effect

2015-06-11 Thread Elliot West
What do you see if you issue:

SHOW COMPACTIONS;
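
If it helps, a rough JDBC sketch for checking this programmatically (the
connection URL and credentials below are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// List the requested/running/finished compactions known to the metastore.
Connection con = DriverManager.getConnection(
    "jdbc:hive2://localhost:10000/default", "hive", "");
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery("SHOW COMPACTIONS");
while (rs.next()) {
  System.out.println(rs.getString(1) + "\t" + rs.getString(2) + "\t" + rs.getString(3));
}
rs.close();
stmt.close();
con.close();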

On Thursday, 11 June 2015, r7raul1...@163.com r7raul1...@163.com wrote:


 I use Hive 1.1.0 on Hadoop 2.5.0.
 After I do some update operations on table u_data_txn,
 my table creates many delta files like:
 drwxr-xr-x - hdfs hive 0 2015-02-06 22:52
 /user/hive/warehouse/u_data_txn/delta_001_001
 -rw-r--r-- 3 hdfs supergroup 346453 2015-02-06 22:52
 /user/hive/warehouse/u_data_txn/delta_001_001/bucket_0
 -rw-r--r-- 3 hdfs supergroup 415924 2015-02-06 22:52
 /user/hive/warehouse/u_data_txn/delta_001_001/bucket_1
 drwxr-xr-x - hdfs hive 0 2015-02-06 22:58
 /user/hive/warehouse/u_data_txn/delta_002_002
 -rw-r--r-- 3 hdfs supergroup 807 2015-02-06 22:58
 /user/hive/warehouse/u_data_txn/delta_002_002/bucket_0
 -rw-r--r-- 3 hdfs supergroup 779 2015-02-06 22:58
 /user/hive/warehouse/u_data_txn/delta_002_002/bucket_1
 drwxr-xr-x - hdfs hive 0 2015-02-06 22:59
 /user/hive/warehouse/u_data_txn/delta_003_003
 -rw-r--r-- 3 hdfs supergroup 817 2015-02-06 22:59
 /user/hive/warehouse/u_data_txn/delta_003_003/bucket_0
 -rw-r--r-- 3 hdfs supergroup 767 2015-02-06 22:59
 /user/hive/warehouse/u_data_txn/delta_003_003/bucket_1
 drwxr-xr-x - hdfs hive 0 2015-02-06 23:01
 /user/hive/warehouse/u_data_txn/delta_004_004
 -rw-r--r-- 3 hdfs supergroup 817 2015-02-06 23:01
 /user/hive/warehouse/u_data_txn/delta_004_004/bucket_0
 -rw-r--r-- 3 hdfs supergroup 779 2015-02-06 23:01
 /user/hive/warehouse/u_data_txn/delta_004_004/bucket_1
 drwxr-xr-x - hdfs hive 0 2015-02-06 23:03
 /user/hive/warehouse/u_data_txn/delta_005_005
 -rw-r--r-- 3 hdfs supergroup 817 2015-02-06 23:03
 /user/hive/warehouse/u_data_txn/delta_005_005/bucket_0
 -rw-r--r-- 3 hdfs supergroup 779 2015-02-06 23:03
 /user/hive/warehouse/u_data_txn/delta_005_005/bucket_1
 drwxr-xr-x - hdfs hive 0 2015-02-10 21:34
 /user/hive/warehouse/u_data_txn/delta_006_006
 -rw-r--r-- 3 hdfs supergroup 821 2015-02-10 21:34
 /user/hive/warehouse/u_data_txn/delta_006_006/bucket_0
 drwxr-xr-x - hdfs hive 0 2015-02-10 21:35
 /user/hive/warehouse/u_data_txn/delta_007_007
 -rw-r--r-- 3 hdfs supergroup 821 2015-02-10 21:35
 /user/hive/warehouse/u_data_txn/delta_007_007/bucket_0
 drwxr-xr-x - hdfs hive 0 2015-03-24 01:16
 /user/hive/warehouse/u_data_txn/delta_008_008
 -rw-r--r-- 3 hdfs supergroup 1670 2015-03-24 01:16
 /user/hive/warehouse/u_data_txn/delta_008_008/bucket_0
 -rw-r--r-- 3 hdfs supergroup 1767 2015-03-24 01:16
 /user/hive/warehouse/u_data_txn/delta_008_008/bucket_1





 *I try ALTER TABLE u_data_txn COMPACT 'MAJOR'; the deltas still exist. Then
 I try ALTER TABLE u_data_txn COMPACT 'MINOR'; the deltas still exist. How do
 I merge the delta files?*

 *My config is:*
 <property>
   <name>hive.support.concurrency</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.enforce.bucketing</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.exe.dynamic.partition.mode</name>
   <value>nonstrict</value>
 </property>
 <property>
   <name>hive.txn.manager</name>
   <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
 </property>
 <property>
   <name>hive.compactor.initiator.on</name>
   <value>true</value>
 </property>
 <property>
   <name>hive.compactor.worker.threads</name>
   <value>4</value>
 </property>
 --
 r7raul1...@163.com



Re: ACID ORC file reader issue with uncompacted data

2015-05-18 Thread Elliot West
Thanks for the reply Alan.

I see your point regarding the multiple delta directories and why it would
not make sense to include one of them as a leaf. However, it seems that
with this scheme one cannot abstractly work with such paths. One must have
knowledge of the underlying format to understand why some paths have 'part'
leaves and others do not. Conversely it could be argued that Cascading is
making an incorrect assumption about the structure of paths (i.e. there
will always have a 'part' leaf.)

Regarding your suggestion, unfortunately I think this is out of my control
as I have to work with the way in which Cascading operates. We configure
Cascading to use OrcInputFormat and internally it calls
OrcInputFormat.getSplits(). This returns OrcSplits and cascading then calls
getPath on whichever splits are returned. Ideally this mechanism has to
work for any InputFormat and any InputSplit type that returns a path.

Currently I have a patch to massage the paths before they are used but this
places ORC/ACID specific code in what should be a very general purpose
cascading class:
https://github.com/HotelsDotCom/cascading-hive/blob/acid/src/main/java/cascading/tap/hive/HivePartitionTap.java#L105

Conceptually it seems odd that the path has a different number of
elements in the event that the dataset is ACID, contains one or more
operations, and has not yet been compacted. It feels like an internal
implementation detail that is leaked into the publicly used path. In this
scenario would it not instead be possible for the path to contain an empty
base folder or perhaps a faux place-holder part leaf:

warehouse/test_table/continent=Asia/country=India/base_000 // contains
nothing, ignored by ORC
warehouse/test_table/continent=Asia/country=India/deltas_only  // not
actually a folder on disk, ignored by ORC


In this way any framework using elements within the split path can happily
be oblivious to the leaf structure and specifically ORC+ACID. They can rely
on split paths for a given data set always containing a fixed number of
elements.

Cheers - Elliot.

On 14 May 2015 at 18:27, Alan Gates alanfga...@gmail.com wrote:

 Ok, I think I understand now.  I also get why OrcSplit.getPath returns
 just up to the partition keys and not the delta directories.  In most cases
 there will be more than one delta directory, so which one would it pick?

 It seems you already know the file type you are working on before you call
 this (since you're calling OrcSplit.getPath rather than
 FileSplit.getPath).  The best way forward might be to make a utility method
 in Hive that takes the file type and the result of getPath and then returns
 you the partition keys.  That way you're not left putting ORC specific code
 in Cascading.

 Alan.

   Elliot West tea...@gmail.com
  May 1, 2015 at 3:04
 Yes and no :-) We're initially using OrcFile.createReader to create a
 Reader so that we can obtain the schema (StructTypeInfo) from the file. I
 don't believe this is possible with OrcInputFormat.getReader(?):

 Reader orcReader = OrcFile.createReader(path,
 OrcFile.readerOptions(conf));

 ObjectInspector inspector = orcReader.getObjectInspector();
 StructTypeInfo typeInfo = (StructTypeInfo)
 TypeInfoUtils.getTypeInfoFromObjectInspector(inspector);


 In the case of transactional datasets we've worked around this by
 generating the StructTypeInfo from schema data retrieved from the meta
 store, as we need to interact with the meta store anyway to correctly read
 the data. Even if OrcFile.createReader were to transparently read delta-only
 datasets, it wouldn't get us much further currently as the delta files lack
 the correct column names and the Reader would thus return an unusable
 StructTypeInfo.

 The org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() issue is
 currently our biggest pain point as it requires us to place Orc+Atomic
 specific code in what should be a general framework. To illustrate the
 problem further, somewhere in cascading there is some code that extracts
 partition keys from split paths. It extracts keys by chopping off the
 'part' leaf and removing the preceding parent:

 *Text etc:*
 OrcSplit.getPath() returns:
 'warehouse/test_table/continent=Asia/country=India/part-01'
 Partition keys derived as: 'continent=Asia/country=India' (CORRECT)

 *Orc base+delta:*
 OrcSplit.getPath() returns:
 warehouse/test_table/continent=Asia/country=India/base_006'
 Partition keys derived as: 'continent=Asia/country=India' (CORRECT)

 *Orc delta only etc:*
 OrcSplit.getPath() returns:
 warehouse/test_table/continent=Asia/country=India
 Partition keys derived as: 'continent=Asia' (INCORRECT)

 Cheers - Elliot.





 On 30 April 2015 at 17:40, Alan Gates alanfga...@gmail.com wrote:

 Are you using OrcInputFormat.getReader to get a reader?  If so, it should
 take care of these anomalies for you and mask your need to worry about
 delta versus base files.

 Alan.

   Elliot West tea...@gmail.com
  April 29, 2015 at 9:40
 Hi

Re: ACID ORC file reader issue with uncompacted data

2015-05-01 Thread Elliot West
Yes and no :-) We're initially using OrcFile.createReader to create a
Reader so that we can obtain the schema (StructTypeInfo) from the file. I
don't believe this is possible with OrcInputFormat.getReader(?):

Reader orcReader = OrcFile.createReader(path, OrcFile.readerOptions(conf));

ObjectInspector inspector = orcReader.getObjectInspector();
StructTypeInfo typeInfo = (StructTypeInfo)
TypeInfoUtils.getTypeInfoFromObjectInspector(inspector);


In the case of transactional datasets we've worked around this by
generating the StructTypeInfo from schema data retrieved from the meta
store, as we need to interact with the meta store anyway to correctly read
the data. Even if OrcFile.createReader were to transparently read delta-only
datasets, it wouldn't get us much further currently as the delta files lack
the correct column names and the Reader would thus return an unusable
StructTypeInfo.
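
For reference, that metastore-derived approach looks roughly like this (a
sketch only; the database and table names are placeholders and error handling
is omitted):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.serde2.typeinfo.StructTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils;

// Build the StructTypeInfo from the schema held in the metastore rather than
// from the (unreliable) delta file footers.
IMetaStoreClient metaStore = new HiveMetaStoreClient(new HiveConf());
List<FieldSchema> fields = metaStore.getFields("default", "test_table");
List<String> names = new ArrayList<String>();
List<TypeInfo> types = new ArrayList<TypeInfo>();
for (FieldSchema field : fields) {
  names.add(field.getName());
  types.add(TypeInfoUtils.getTypeInfoFromTypeString(field.getType()));
}
StructTypeInfo typeInfo =
    (StructTypeInfo) TypeInfoFactory.getStructTypeInfo(names, types);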

The org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() issue is currently
our biggest pain point as it requires us to place Orc+Atomic specific code
in what should be a general framework. To illustrate the problem further,
somewhere in cascading there is some code that extracts partition keys from
split paths. It extracts keys by chopping off the 'part' leaf and removing
the preceding parent:

*Text etc:*
OrcSplit.getPath() returns:
'warehouse/test_table/continent=Asia/country=India/part-01'
Partition keys derived as: 'continent=Asia/country=India' (CORRECT)

*Orc base+delta:*
OrcSplit.getPath() returns:
warehouse/test_table/continent=Asia/country=India/base_006'
Partition keys derived as: 'continent=Asia/country=India' (CORRECT)

*Orc delta only etc:*
OrcSplit.getPath() returns:
warehouse/test_table/continent=Asia/country=India
Partition keys derived as: 'continent=Asia' (INCORRECT)

Cheers - Elliot.





On 30 April 2015 at 17:40, Alan Gates alanfga...@gmail.com wrote:

 Are you using OrcInputFormat.getReader to get a reader?  If so, it should
 take care of these anomalies for you and mask your need to worry about
 delta versus base files.

 Alan.

   Elliot West tea...@gmail.com
  April 29, 2015 at 9:40
 Hi,

 I'm implementing a tap to read Hive ORC ACID date into Cascading jobs and
 I've hit a couple of issues for a particular scenario. The case I have is
 when data has been written into a transactional table and a compaction has
 not yet occurred. This can be recreated like so:

 CREATE TABLE test_table ( id int, message string )
   PARTITIONED BY ( continent string, country string )
   CLUSTERED BY (id) INTO 1 BUCKETS
   STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

 INSERT INTO TABLE test_table
 PARTITION (continent = 'Asia', country = 'India')
 VALUES (1, 'x'), (2, 'y'), (3, 'z');


 This results in a dataset that contains only a delta file:


 warehouse/test_table/continent=Asia/country=India/delta_060_060/bucket_0


 I'm assuming that this scenario is valid - a user might insert new data
 into a table and want to read it back at a time prior to the first
 compaction. I can select the data back from this table in Hive with no
 problem. However, for a number of reasons I'm finding it rather tricky to
 do so programmatically. At this point I should mention that reading base
 files or base+deltas is trouble free. The issues I've encountered are as
 follows:

1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
ReaderOptions) fails if the directory specified by the path ('
warehouse/test_table/continent=Asia/country=India' in this case)
contains only a delta. Specifically it attempts to access
'delta_060_060' as if it were a file and therefore fails. It
appears to function correctly if the directory also contains a base. We use
this method to extract the typeInfo from the ORCFile and build a mapping
between the user's declared fields.
2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is seemingly
inconsistent in that it returns the path of the base if present, otherwise
the parent. This presents issues within cascading (and I assume other
frameworks) that expect the paths returned by splits to be at the same
depth and for them to contain some kind of 'part' file leaf. In my example
the path returned is 'warehouse/test_table/continent=Asia/country=India',
if I had also had a base I'd have seen '
warehouse/test_table/continent=Asia/country=India/base_006'.
3. The footers of the delta files do not contain the true field names
of the table. In my example I see '_col0:int,_col1:string' where I'd
    expect 'id:int,message:string'. A base file, if present, correctly
declares the field names. We chose to access values by field name rather
than position so that users of our reader do not need to declare the full
schema to read partial data, however this behaviour trips this up.

 I have (horrifically :) worked around issues 1 and 2 in my own code and
 have some ideas to circumvent 3

ACID ORC file reader issue with uncompacted data

2015-04-29 Thread Elliot West
Hi,

I'm implementing a tap to read Hive ORC ACID date into Cascading jobs and
I've hit a couple of issues for a particular scenario. The case I have is
when data has been written into a transactional table and a compaction has
not yet occurred. This can be recreated like so:

CREATE TABLE test_table ( id int, message string )
  PARTITIONED BY ( continent string, country string )
  CLUSTERED BY (id) INTO 1 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional' = 'true');

INSERT INTO TABLE test_table
PARTITION (continent = 'Asia', country = 'India')
VALUES (1, 'x'), (2, 'y'), (3, 'z');


This results in a dataset that contains only a delta file:

warehouse/test_table/continent=Asia/country=India/delta_060_060/bucket_0


I'm assuming that this scenario is valid - a user might insert new data
into a table and want to read it back at a time prior to the first
compaction. I can select the data back from this table in Hive with no
problem. However, for a number of reasons I'm finding it rather tricky to
do so programmatically. At this point I should mention that reading base
files or base+deltas is trouble free. The issues I've encountered are as
follows:

   1. org.apache.hadoop.hive.ql.io.orc.OrcFile.createReader(Path,
   ReaderOptions) fails if the directory specified by the path ('
   warehouse/test_table/continent=Asia/country=India' in this case)
   contains only a delta. Specifically it attempts to access
   'delta_060_060' as if it were a file and therefore fails. It
   appears to function correctly if the directory also contains a base. We use
   this method to extract the typeInfo from the ORCFile and build a mapping
   between the user's declared fields.
   2. org.apache.hadoop.hive.ql.io.orc.OrcSplit.getPath() is seemingly
   inconsistent in that it returns the path of the base if present, otherwise
   the parent. This presents issues within cascading (and I assume other
   frameworks) that expect the paths returned by splits to be at the same
   depth and for them to contain some kind of 'part' file leaf. In my example
   the path returned is 'warehouse/test_table/continent=Asia/country=India',
   if I had also had a base I'd have seen '
   warehouse/test_table/continent=Asia/country=India/base_006'.
   3. The footers of the delta files do not contain the true field names of
   the table. In my example I see '_col0:int,_col1:string' where I'd expect
    'id:int,message:string'. A base file, if present, correctly declares the
   field names. We chose to access values by field name rather than position
   so that users of our reader do not need to declare the full schema to read
   partial data, however this behaviour trips this up.

I have (horrifically :) worked around issues 1 and 2 in my own code and
have some ideas to circumvent 3 but I wanted to get a feeling as to whether
I'm going against the tide and if my life might be easier if I approached
this another way.

Thanks - Elliot.


Transactional table read lifecycle

2015-04-17 Thread Elliot West
Hi, I'm working on a Cascading Tap that reads the data that backs a
transactional Hive table. I've successfully utilised the in-built
OrcInputFormat functionality to read and merge the deltas with the base and
optionally pull in the RecordIdentifiers. However, I'm now considering what
other steps I may need to take to collaborate with an active Hive instance
that could be writing to or compacting the table as I'm trying to read it.

I recently became aware of the need to obtain a list of valid transaction
IDs but now wonder if I must also acquire a read lock for the table? I'm
thinking that the set of interactions for reading this data may look
something like:


   1. Obtain ValidTxnList from the meta store:
   org.apache.hadoop.hive.metastore.IMetaStoreClient.getValidTxns()

   2. Set the ValidTxnList in the Configuration:
   conf.set(ValidTxnList.VALID_TXNS_KEY, validTxnList.toString());

   3. Acquire a read lock:
   org.apache.hadoop.hive.metastore.IMetaStoreClient.lock(LockRequest)

   4. Use OrcInputFormat to read the data

   5. Finally, release the lock:
   org.apache.hadoop.hive.metastore.IMetaStoreClient.unlock(long)


Can you advise on whether the lock is needed, whether this is the correct
way of managing the lock, and whether there are any other steps I need to take
to appropriately interact with the data underpinning a 'live' transactional
table?
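
For concreteness, steps 1-3 and 5 might look roughly like this (a sketch only;
the lock component, user, and host values are illustrative and I may well be
holding the API wrong):

import java.util.Collections;
import org.apache.hadoop.hive.common.ValidTxnList;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.IMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.LockComponent;
import org.apache.hadoop.hive.metastore.api.LockLevel;
import org.apache.hadoop.hive.metastore.api.LockRequest;
import org.apache.hadoop.hive.metastore.api.LockResponse;
import org.apache.hadoop.hive.metastore.api.LockType;

HiveConf conf = new HiveConf();
IMetaStoreClient client = new HiveMetaStoreClient(conf);

// 1 & 2: fetch the valid transaction list and expose it to the reader
ValidTxnList validTxns = client.getValidTxns();
conf.set(ValidTxnList.VALID_TXNS_KEY, validTxns.toString());

// 3: acquire a shared read lock on the table
LockComponent component =
    new LockComponent(LockType.SHARED_READ, LockLevel.TABLE, "default");
component.setTablename("test_table");
LockRequest request =
    new LockRequest(Collections.singletonList(component), "reader", "host-01");
LockResponse response = client.lock(request);

// 4: read the data with OrcInputFormat ...

// 5: release the lock
client.unlock(response.getLockid());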

Thanks - Elliot.


Re: Adding update/delete to the hive-hcatalog-streaming API

2015-04-01 Thread Elliot West
Hi Alan,

Regarding the streaming changes, I've raised an issue and submitted patches
here: https://issues.apache.org/jira/browse/HIVE-10165

Thanks - Elliot.

On 26 March 2015 at 23:20, Alan Gates alanfga...@gmail.com wrote:



   Elliot West tea...@gmail.com
  March 26, 2015 at 15:58
 Hi Alan,

 Yes, this is precisely our situation. The issues I'm having with the
 current API are that I cannot intercept the creation of the
 OrcRecordUpdater to set the recordIdColumn in the AcidOutputFormat.Options
 instance. Additionally, I cannot extend the TransactionBatch interface to
 expose further operations and instead have to encode the operation type in
 the row parameter and decode it in the RecordWriter implementation -
 possible but not very elegant.

 Yes, you'd have to make changes in the API itself to make this possible,
 which is fine.  I'm happy to review patches for this.


 Regarding the merge functionality, is this a new suggested feature of
 Hive? Is there a JIRA I could track?

 Not yet, it's just discussions I've had with users like you who'd be
 interested in merge, though from the SQL side rather than streaming.  I am
 not aware of anyone working on it at the moment.

 Alan.


 Thanks - Elliot.


 On Thursday, March 26, 2015, Alan Gates alanfga...@gmail.com wrote:
Alan Gates alanfga...@gmail.com
  March 26, 2015 at 15:30
  Are you saying that when the records arrive you don't know updates from
 inserts and you're already doing processing to determine that?  If so, this
 is exactly the case we'd like to hit with the merge functionality.

 If you're already scanning the existing ORC file and obtaining the unique
 identifier (which is the triple (rowid, txnid, bucketid)) and determining
 which records have changed then you could certainly change the streaming
 interface to enable passing the records through  in bulk (along with
 operation markers to indicate insert/update/delete) and write those in a
 delta file in one pass.

 Alan.

   Elliot West tea...@gmail.com
  March 26, 2015 at 15:10
 Hi, thanks for your quick reply.

 I see your point, but in my case would I not have the required
 RecordIdentifiers available as I'm already reading the entire partition to
 determine which records have changed?  Admittedly Hive will not reveal
 the ROW__IDs to me but I assume (incorrectly perhaps) that these are in the
 underlying ORCFile and therefore can be read by my Cascading process. In
 this scenario (where I have the relevant recIds to hand) why would it then
 be necessary to perform additional scans when issuing the deletes/updates?

 Apologies if I'm misunderstanding something.

 Thanks - Elliot.

 On Thursday, March 26, 2015, Alan Gates alanfga...@gmail.com wrote:
   Alan Gates alanfga...@gmail.com
  March 26, 2015 at 14:48
  The missing piece for adding update and delete to the streaming API is a
 primary key.  Updates and deletes in SQL work by scanning the table or
 partition where the record resides.  This is assumed to be ok since we are
 not supporting transactional workloads and thus update/deletes are assumed
 to be infrequent.  But a need to scan for each update or delete will not
 perform adequately in the streaming case.

 I've had a few discussions with others recently who are thinking of adding
 merge like functionality, where you would upload all changes to a temp
 table and then in one scan/transaction apply those changes.  This is a
 common way to handle these situations for data warehouses, and is much
 easier than adding a primary key concept to Hive.

 Alan.

   Elliot West tea...@gmail.com
  March 26, 2015 at 14:08
 Hi,

 I'd like to ascertain if it might be possible to add 'update' and 'delete'
 operations to the hive-hcatalog-streaming API. I've been looking at the API
 with interest for the last week as it appears to have the potential to help
 with some general data processing patterns that are prevalent where I work.
 Ultimately, we continuously load large amounts of data into Hadoop which is
 partitioned by some time interval - usually hour, day, or month depending
 on the data size. However, the records that reside in this data can change.
 We often receive some new information that mutates part of an existing
 record already stored in a partition in HDFS. Typically the amount of
 mutations is very small compared to the number of records in each
 partition.

 To handle this currently we re-read and re-write all partitions that could
 potentially be affected by new data. In practice a single hour's worth of
 new data can require the reading and writing of 1 month's worth of
 partitions. By storing the data in a transactional Hive table I believe
 that we can instead issue updates and deletes for only the affected rows.
 Although we do use Hive for analytics on this data, much of the processing
 that generates and consumes the data is performed using Cascading.
 Therefore I'd like to be able to read and write the data via an API which
 we'd aim to integrate

Re: Interpretation of transactional table base file format

2015-03-30 Thread Elliot West
Ok, so both the source and Javadoc for
org.apache.hadoop.hive.ql.io.orc.OrcInputFormat answer most of these
questions.

Apologies for the spam.

Thanks - Elliot.

On 30 March 2015 at 11:52, Elliot West tea...@gmail.com wrote:

 I've been looking at the structure of the ORCFiles that back transaction
 tables in Hive. After a compaction I was surprised to find that the base
 file structure is identical to the delta structure:

   struct<
     operation:int,
     originalTransaction:bigint,
     bucket:int,
     rowId:bigint,
     currentTransaction:bigint,
     row:struct<
       // row fields
     >
   >

 This raises a few questions:

- How should I interpret the operation and originalTransaction values
in these compacted rows?
- Are the values in the operation and originalTransaction fields
required for the application of later deltas?
- Does this structure in any way inhibit the ability to perform partial
reads of the row data (i.e. specific columns)?
- How does this structure relate to the RecordIdentifier class which
contains only a subset of the meta-data fields, and the
AcidInputFormat.Options.recordIdColumn() which seems to imply a meta
data column that is alongside the row columns, not the nested structure
that we see in practice.

 I suppose that I might find the answers to some of these myself by simply
 reading in the data with the appropriate input format, which leads me to my
 final question: is there already an input format available that will
 seamlessly and transparently apply any deltas on read (for consuming the
 data in an M/R job for example).

 Apologies for so many questions.

 Thanks - Elliot.



Interpretation of transactional table base file format

2015-03-30 Thread Elliot West
I've been looking at the structure of the ORCFiles that back transaction
tables in Hive. After a compaction I was surprised to find that the base
file structure is identical to the delta structure:

  struct<
    operation:int,
    originalTransaction:bigint,
    bucket:int,
    rowId:bigint,
    currentTransaction:bigint,
    row:struct<
      // row fields
    >
  >

This raises a few questions:

   - How should I interpret the operation and originalTransaction values in
   these compacted rows?
   - Are the values in the operation and originalTransaction fields
   required for the application of later deltas?
   - Does this structure in any way inhibit the ability to perform partial
   reads of the row data (i.e. specific columns)?
   - How does this structure relate to the RecordIdentifier class which
   contains only a subset of the meta-data fields, and the
   AcidInputFormat.Options.recordIdColumn() which seems to imply a meta
   data column that is alongside the row columns, not the nested structure
   that we see in practice.

I suppose that I might find the answers to some of these myself by simply
reading in the data with the appropriate input format, which leads me to my
final question: is there already an input format available that will
seamlessly and transparently apply any deltas on read (for consuming the
data in an M/R job for example).
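
For context, what I'm imagining is something along these lines (purely
illustrative - I don't yet know whether OrcInputFormat behaves this way; the
path is made up and validTxns would come from the metastore via
IMetaStoreClient.getValidTxns()):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.common.ValidTxnList;
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// Point the ORC input format at the partition directory and have it merge
// the base and delta files on read.
JobConf job = new JobConf();
job.setInputFormat(OrcInputFormat.class);
FileInputFormat.setInputPaths(job,
    new Path("/user/hive/warehouse/test_table/continent=Asia/country=India"));
job.set(ValidTxnList.VALID_TXNS_KEY, validTxns.toString());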

Apologies for so many questions.

Thanks - Elliot.


Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi, thanks for your quick reply.

I see your point, but in my case would I not have the required
RecordIdentifiers available as I'm already reading the entire partition to
determine which records have changed?  Admittedly Hive will not reveal
the ROW__IDs to me but I assume (incorrectly perhaps) that these are in the
underlying ORCFile and therefore can be read by my Cascading process. In
this scenario (where I have the relevant recIds to hand) why would it then
be necessary to perform additional scans when issuing the deletes/updates?

Apologies if I'm misunderstanding something.

Thanks - Elliot.

On Thursday, March 26, 2015, Alan Gates alanfga...@gmail.com wrote:

 The missing piece for adding update and delete to the streaming API is a
 primary key.  Updates and deletes in SQL work by scanning the table or
 partition where the record resides.  This is assumed to be ok since we are
 not supporting transactional workloads and thus update/deletes are assumed
 to be infrequent.  But a need to scan for each update or delete will not
 perform adequately in the streaming case.

 I've had a few discussions with others recently who are thinking of adding
 merge like functionality, where you would upload all changes to a temp
 table and then in one scan/transaction apply those changes.  This is a
 common way to handle these situations for data warehouses, and is much
 easier than adding a primary key concept to Hive.

 Alan.

  Elliot West tea...@gmail.com
  March 26, 2015 at 14:08
 Hi,

 I'd like to ascertain if it might be possible to add 'update' and 'delete'
 operations to the hive-hcatalog-streaming API. I've been looking at the API
 with interest for the last week as it appears to have the potential to help
 with some general data processing patterns that are prevalent where I work.
 Ultimately, we continuously load large amounts of data into Hadoop which is
 partitioned by some time interval - usually hour, day, or month depending
 on the data size. However, the records that reside in this data can change.
 We often receive some new information that mutates part of an existing
 record already stored in a partition in HDFS. Typically the amount of
 mutations is very small compared to the number of records in each
 partition.

 To handle this currently we re-read and re-write all partitions that could
 potentially be affected by new data. In practice a single hour's worth of
 new data can require the reading and writing of 1 month's worth of
 partitions. By storing the data in a transactional Hive table I believe
 that we can instead issue updates and deletes for only the affected rows.
 Although we do use Hive for analytics on this data, much of the processing
 that generates and consumes the data is performed using Cascading.
 Therefore I'd like to be able to read and write the data via an API which
 we'd aim to integrate into a Cascading Tap of some description. Our
 Cascading processes could determine the new, updated, and deleted records
 and then use the API to stream these changes to the transactional Hive
 table.

 We have most of this working in a proof of concept, but as
 hive-hcatalog-streaming does not expose the delete/update methods of the
 OrcRecordUpdater we've had to hack together something unpleasant based on
 the original API.

 As a first step I'd like to check if there is any appetite for adding such
 functionality to the API or if this goes against the original motivations
 of the project? If this suggestion sounds reasonable then I'd be keen to
 help move this forward.

 Thanks - Elliot.





Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi,

I'd like to ascertain if it might be possible to add 'update' and 'delete'
operations to the hive-hcatalog-streaming API. I've been looking at the API
with interest for the last week as it appears to have the potential to help
with some general data processing patterns that are prevalent where I work.
Ultimately, we continuously load large amounts of data into Hadoop which is
partitioned by some time interval - usually hour, day, or month depending
on the data size. However, the records that reside in this data can change.
We often receive some new information that mutates part of an existing
record already stored in a partition in HDFS. Typically the amount of
mutations is very small compared to the number of records in each
partition.

To handle this currently we re-read and re-write all partitions that could
potentially be affected by new data. In practice a single hour's worth of
new data can require the reading and writing of 1 month's worth of
partitions. By storing the data in a transactional Hive table I believe
that we can instead issue updates and deletes for only the affected rows.
Although we do use Hive for analytics on this data, much of the processing
that generates and consumes the data is performed using Cascading.
Therefore I'd like to be able to read and write the data via an API which
we'd aim to integrate into a Cascading Tap of some description. Our
Cascading processes could determine the new, updated, and deleted records
and then use the API to stream these changes to the transactional Hive
table.

We have most of this working in a proof of concept, but as
hive-hcatalog-streaming does not expose the delete/update methods of the
OrcRecordUpdater we've had to hack together something unpleasant based on
the original API.
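
To make the suggestion concrete, the kind of extension we have in mind is
along these lines (purely illustrative - this interface does not exist today
and the method signatures are only a guess):

import org.apache.hive.hcatalog.streaming.StreamingException;
import org.apache.hive.hcatalog.streaming.TransactionBatch;

// Hypothetical: a TransactionBatch variant that exposes mutation operations
// alongside the existing write(byte[]) insert path.
public interface MutatingTransactionBatch extends TransactionBatch {
  void update(byte[] record) throws StreamingException, InterruptedException;
  void delete(byte[] record) throws StreamingException, InterruptedException;
}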

As a first step I'd like to check if there is any appetite for adding such
functionality to the API or if this goes against the original motivations
of the project? If this suggestion sounds reasonable then I'd be keen to
help move this forward.

Thanks - Elliot.


Re: Adding update/delete to the hive-hcatalog-streaming API

2015-03-26 Thread Elliot West
Hi Mich,

Yes, we have a timestamp on each record. Our processes effectively group by
a key and order by time stamp.

Cheers - Elliot.


Updates/deletes with OrcRecordUpdater

2015-03-20 Thread Elliot West
Hi,

I'm trying to use the insert, update and delete methods on OrcRecordUpdater
to programmatically mutate an ORC-based Hive table (1.0.0). I've got
inserts working correctly but I'm hitting a problem with deletes and
updates. I get an NPE which I have traced back to what seems like a missing
recIdField(?).

java.lang.NullPointerException
at
org.apache.hadoop.hive.serde2.lazy.objectinspector.LazySimpleStructObjectInspector.getStructFieldData(LazySimpleStructObjectInspector.java:103)
at
org.apache.hadoop.hive.ql.io.orc.OrcRecordUpdater.addEvent(OrcRecordUpdater.java:296)
at
org.apache.hadoop.hive.ql.io.orc.OrcRecordUpdater.delete(OrcRecordUpdater.java:330)


I've tried specifying a location for the field using
AcidOutputFormat.Options.recordIdColumn(0) but this fails due to an
ObjectInspector mismatch. I'm not sure if I should be creating this field
as part of my table definition or not. Currently I'm constructing the table
with some code based on that located in the storm-hive project:

  Table tbl = new Table();
  tbl.setDbName(databaseName);
  tbl.setTableName(tableName);
  tbl.setTableType(TableType.MANAGED_TABLE.toString());
  StorageDescriptor sd = new StorageDescriptor();
  sd.setCols(getTableColumns(colNames, colTypes));
  sd.setNumBuckets(1);
  sd.setLocation(dbLocation + Path.SEPARATOR + tableName);
  if (partNames != null && partNames.length != 0) {
    tbl.setPartitionKeys(getPartitionKeys(partNames));
  }

  tbl.setSd(sd);

  sd.setBucketCols(new ArrayList<String>(2));
  sd.setSerdeInfo(new SerDeInfo());
  sd.getSerdeInfo().setName(tbl.getTableName());
  sd.getSerdeInfo().setParameters(new HashMap<String, String>());
  sd.getSerdeInfo().getParameters().put(serdeConstants.SERIALIZATION_FORMAT, "1");
  // Not sure if this does anything?
  sd.getSerdeInfo().getParameters().put("transactional", Boolean.TRUE.toString());

  sd.getSerdeInfo().setSerializationLib(OrcSerde.class.getName());
  sd.setInputFormat(OrcInputFormat.class.getName());
  sd.setOutputFormat(OrcOutputFormat.class.getName());

  Map<String, String> tableParams = new HashMap<String, String>();
  // Not sure if this does anything?
  tableParams.put("transactional", Boolean.TRUE.toString());
  tbl.setParameters(tableParams);
  client.createTable(tbl);
  try {
    if (partVals != null && partVals.size() > 0) {
      addPartition(client, tbl, partVals);
    }
  } catch (AlreadyExistsException e) {
  }

I don't really know enough about Hive and ORCFile internals to work out
where I'm going wrong so any help would be appreciated.
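
For completeness, the updater itself is being obtained along these lines (a
sketch of my attempt - the path, transaction id, and row ObjectInspector are
placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.AcidOutputFormat;
import org.apache.hadoop.hive.ql.io.RecordUpdater;
import org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat;

AcidOutputFormat.Options options = new AcidOutputFormat.Options(conf)
    .inspector(rowInspector)            // ObjectInspector for the row struct
    .bucket(0)
    .minimumTransactionId(txnId)
    .maximumTransactionId(txnId)
    .recordIdColumn(0);                 // my attempt at locating the record id field
RecordUpdater updater =
    new OrcOutputFormat().getRecordUpdater(new Path(partitionLocation), options);
// insert(...) works fine; update(...)/delete(...) produce the NPE above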

Thanks - Elliot.