Joel,
Another option you have is to use the Storm HDFS bolt to stream data
into Hive external tables. The external tables are then loaded into ORC
history tables for long-term storage. We use this in an HDP cluster with
a similar load, so I know it works. :)
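
If it helps, the pattern looks roughly like this (table and column names
are made up for illustration, adjust to your schema):

  -- External table over the directory the Storm HDFS bolt writes to
  CREATE EXTERNAL TABLE events_staging (
    event_time STRING,
    payload    STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION '/data/storm/events';

  -- Long-term history table stored as ORC
  CREATE TABLE events_history (
    event_time STRING,
    payload    STRING
  )
  STORED AS ORC;

  -- Periodic bulk load (cron/Oozie); no streaming transactions involved
  INSERT INTO TABLE events_history
  SELECT event_time, payload FROM events_staging;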

I'm with Jörn on this one. My impression of Hive transactions is that they
are a new feature that isn't totally ready for production.
Thanks,
Kit

On Aug 24, 2016 3:07 AM, "Joel Victor" <joelsvic...@gmail.com> wrote:

> @Jörn: If I understood correctly, even later versions of Hive won't be
> able to handle these kinds of workloads?
>
> On Wed, Aug 24, 2016 at 1:26 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> I think Hive, especially these old versions, has not been designed for
>> this. Why not store them in HBase and run an Oozie job regularly that puts
>> them all into Hive as ORC or Parquet in a bulk job?
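>>
>> Roughly what I have in mind (all names here are just illustrative):
>>
>>   -- Hive external table mapped onto the HBase table Storm writes to
>>   CREATE EXTERNAL TABLE events_hbase (rowkey STRING, payload STRING)
>>   STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>   WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:payload')
>>   TBLPROPERTIES ('hbase.table.name' = 'events');
>>
>>   -- Bulk job scheduled by Oozie: copy into an ORC (or Parquet) table
>>   INSERT INTO TABLE events_orc
>>   SELECT rowkey, payload FROM events_hbase;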
>>
>> On 24 Aug 2016, at 09:35, Joel Victor <joelsvic...@gmail.com> wrote:
>>
>> Currently I am using Apache Hive 0.14, which ships with HDP 2.2. We are
>> trying to perform streaming ingestion with it.
>> We are using the Storm Hive bolt, and we have 7 tables into which we are
>> trying to insert. The RPS (requests per second) of our bolts ranges from
>> 5000 to 7000, and our commit policies are configured accordingly, i.e.
>> 100k events or 15 seconds.
>>
>> We see many commitTxn exceptions due to serialization errors in the
>> metastore (we are using PostgreSQL 9.5 as the metastore database).
>> The serialization errors cause the topology to start lagging in terms of
>> events processed, as it retries the batches that have failed.
>>
>> I have already backported HIVE-10500
>> <https://issues.apache.org/jira/browse/HIVE-10500> to 0.14, and there
>> isn't much improvement.
>> I went through most of the JIRAs about transactions and found the
>> following: HIVE-11948 <https://issues.apache.org/jira/browse/HIVE-11948>
>> and HIVE-13013 <https://issues.apache.org/jira/browse/HIVE-13013>. I
>> would like to backport them to 0.14.
>> Going through the patches gives me the impression that I mostly need to
>> update the queries and transaction isolation levels; I've sketched my
>> understanding below.
>> Do these patches also require me to update the schema in the metastore?
>> Please also let me know if there are any other patches that I missed.
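>>
>> For reference, my reading of the patches is roughly the following pattern
>> (illustrative pseudo-SQL, not the exact metastore queries; table and
>> column names are from my memory of the schema and may be off):
>>
>>   -- Before: the whole metastore transaction ran at SERIALIZABLE, so
>>   -- concurrent committers abort with serialization failures.
>>   -- After: READ COMMITTED plus an explicit row lock, so concurrent
>>   -- committers queue on the lock instead of aborting.
>>   BEGIN;
>>   SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
>>   SELECT NTXN_NEXT FROM NEXT_TXN_ID FOR UPDATE; -- serialize writers
>>   -- ... update TXNS / TXN_COMPONENTS and commit the Hive txn ...
>>   COMMIT;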
>>
>> I would also like to know whether Apache Hive 1.2.1 or later can handle
>> concurrent inserts into the same or different tables from multiple
>> clients without many serialization errors in the Hive metastore.
>>
>> -Joel
>>
>>
>
