If you want to read those tables directly in something other than Hive,
yes, you need to fetch the valid writeid list for each table you're
reading from the metastore.  If you want to avoid merging data in, take a
look at Hive's streaming ingest, which lets you write data into Hive
without merges, though it supports only insert, not update.
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest
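
To make that concrete, here is a rough sketch of what an external reader has
to do with the valid writeid list: parse it, then skip any delta directory
whose writeid range isn't fully committed.  The serialized string format and
the directory-name parsing below are assumptions modeled on Hive's
ValidReaderWriteIdList and the delta_<min>_<max> naming convention, not a
drop-in implementation:

```python
# Sketch of how an external reader might decide which delta directories
# are visible, given a table's valid writeid list from the metastore.
# The serialized format ("db.table:hwm:minOpen:openIds:abortedIds") is an
# assumption based on Hive's ValidReaderWriteIdList, not a stable API.

def parse_valid_writeids(serialized):
    """Parse the assumed 'db.table:hwm:minOpen:open,ids:aborted,ids' form."""
    parts = serialized.split(":")
    high_watermark = int(parts[1])
    open_ids = {int(x) for x in parts[3].split(",") if x}
    aborted_ids = {int(x) for x in parts[4].split(",") if x}
    return high_watermark, open_ids | aborted_ids

def is_delta_readable(dirname, high_watermark, invalid_ids):
    """A delta_<min>_<max> dir is readable only if every writeid in its
    range is committed: at or below the high watermark and neither open
    nor aborted."""
    name = dirname.replace("delete_delta_", "").replace("delta_", "")
    min_wid, max_wid = (int(x) for x in name.split("_")[:2])
    if max_wid > high_watermark:
        return False
    return all(w not in invalid_ids for w in range(min_wid, max_wid + 1))

hwm, invalid = parse_valid_writeids("db.tbl:10:8:8:9")
print([d for d in ["delta_1_5", "delta_6_8", "delta_10_10", "delta_11_11"]
       if is_delta_readable(d, hwm, invalid)])  # → ['delta_1_5', 'delta_10_10']
```

The same check would apply to delete_delta directories; compaction is what
eventually folds all of this into base files that naive readers can handle.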

Alan.

On Mon, Mar 11, 2019 at 9:14 AM David Morin <morin.david....@gmail.com>
wrote:

> Hi,
>
> I've just implemented a pipeline to synchronize data between MySQL and
> Hive (transactional + bucketed) on an HDP cluster.
> I've used ORC files, but without ACID properties.
> We then created external tables over the HDFS directories that contain
> these delta ORC files, and MERGE INTO queries are executed periodically
> to merge the data into the Hive target table.
> It works pretty well, but we want to avoid these merge queries.
> It's not really clear to me at the moment, but thanks for your links; I'm
> going to delve into that point.
> To summarize: if I want to avoid these queries, I have to get the valid
> transaction list for each table from the Hive Metastore and then read all
> the related files.
> Is that correct?
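
As a point of reference, the periodic merge step described above might look
roughly like the sketch below.  Table, column, and op-flag names are made
up, and `run_query` stands in for whatever client actually submits the
HiveQL (e.g. a JDBC or PyHive cursor):

```python
# Hypothetical sketch of the periodic MERGE described above.  Table,
# column, and op-flag names are invented; run_query stands in for a
# real Hive client connection.
MERGE_SQL = """\
MERGE INTO target_table AS t
USING delta_external_table AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET val = s.val
WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.val)"""

def run_merge(run_query):
    """Submit the merge statement through any callable that runs HiveQL."""
    run_query(MERGE_SQL)

# Example with a stub client that just records the statement:
submitted = []
run_merge(submitted.append)
print(submitted[0].startswith("MERGE INTO"))  # → True
```

Avoiding this step entirely means doing the delta/base reconciliation on the
read side instead, which is what the writeid discussion is about.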
>
> Thanks,
> David
>
>
> Le dim. 10 mars 2019 à 01:45, Nicolas Paris <nicolas.pa...@riseup.net> a
> écrit :
>
>> Thanks Alan for the clarifications.
>>
>> Hive has made such improvements that it has lost its old friends in the
>> process. I hope one day all the friends will speak together again: Pig,
>> Spark, and Presto reading and writing ACID side by side.
>>
>> On Sat, Mar 09, 2019 at 02:23:48PM -0800, Alan Gates wrote:
>> > There's only been one significant change in ACID that requires different
>> > implementations.  In ACID v1, delta files contained inserts, updates,
>> > and deletes.  In ACID v2, delta files are split so that inserts are
>> > placed in one file and deletes in another; an update becomes an insert
>> > plus a delete.  This change was put into Hive 3, so you have to upgrade
>> > your ACID tables when upgrading from Hive 2 to 3.
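
In other words, a v2 reader has to merge insert deltas against delete deltas
at read time.  A minimal illustration of that reconciliation (the data
shapes are invented; real ACID files key rows by a ROW__ID of original
transaction/bucket/rowId, which the tuples below only approximate):

```python
# Sketch of the ACID v2 read-side merge described above: insert deltas
# carry rows keyed by (writeid, bucket, rowid); delete deltas carry just
# those keys.  A reader drops any inserted row whose key appears in a
# delete delta.  Data shapes are illustrative, not Hive's file format.

def apply_deletes(insert_rows, delete_keys):
    """insert_rows: {(writeid, bucket, rowid): row}; delete_keys: set of
    the same key tuples taken from delete deltas."""
    return {k: v for k, v in insert_rows.items() if k not in delete_keys}

inserts = {(1, 0, 0): "alice", (1, 0, 1): "bob", (2, 0, 0): "carol"}
deletes = {(1, 0, 1)}  # bob was deleted in a later transaction
print(sorted(apply_deletes(inserts, deletes).values()))  # → ['alice', 'carol']
```

An update shows up as one key in a delete delta plus a fresh row in an
insert delta, so this one operation covers both cases.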
>> >
>> > You can see info on ACID v1 at
>> > https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions
>> >
>> > You can get a start on understanding ACID v2 with
>> > https://issues.apache.org/jira/browse/HIVE-14035, which has the design
>> > documents.  I don't guarantee the implementation completely matches the
>> > design, but you can at least get an idea of the intent and follow the
>> > JIRA stream from there to see what was implemented.
>> >
>> > Alan.
>> >
>> > On Sat, Mar 9, 2019 at 3:25 AM Nicolas Paris <nicolas.pa...@riseup.net>
>> wrote:
>> >
>> >     Hi,
>> >
>> >     > The issue is that outside readers don't understand which records
>> >     > in the delta files are valid and which are not. Theoretically all
>> >     > this is possible, as outside clients could get the valid
>> >     > transaction list from the metastore and then read the files, but
>> >     > no one has done this work.
>> >
>> >     I guess each Hive version (1, 2, 3) differs in how it manages delta
>> >     files, doesn't it?  This means Pig or Spark would need to implement
>> >     three different ways of dealing with Hive.
>> >
>> >     Is there any documentation that would help a developer implement
>> >     those specific connectors?
>> >
>> >     Thanks
>> >
>> >
>> >     On Wed, Mar 06, 2019 at 09:51:51AM -0800, Alan Gates wrote:
>> >     > Pig is in the same place as Spark: the tables need to be
>> >     > compacted first.  The issue is that outside readers don't
>> >     > understand which records in the delta files are valid and which
>> >     > are not.
>> >     >
>> >     > Theoretically all this is possible, as outside clients could get
>> >     > the valid transaction list from the metastore and then read the
>> >     > files, but no one has done this work.
>> >     >
>> >     > Alan.
>> >     >
>> >     > On Wed, Mar 6, 2019 at 8:28 AM Abhishek Gupta
>> >     > <abhila...@gmail.com> wrote:
>> >     >
>> >     >     Hi,
>> >     >
>> >     >     Do Hive ACID tables in Hive version 1.2 have the capability
>> >     >     of being read into Apache Pig using HCatLoader, or into Spark
>> >     >     using SQLContext?
>> >     >     For Spark, it seems it is only possible to read ACID tables
>> >     >     if the table is fully compacted, i.e. no delta folders exist
>> >     >     in any partition.  Details in the following JIRA:
>> >     >
>> >     >     https://issues.apache.org/jira/browse/SPARK-15348
>> >     >
>> >     >     However, I wanted to know whether it is supported at all in
>> >     >     Apache Pig to read ACID tables in Hive.
>> >     >
>> >
>> >     --
>> >     nicolas
>> >
>>
>> --
>> nicolas
>>
>
