Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Mich Talebzadeh Sat, 12 Mar 2016 11:38:24 -0800

Hi,

Thanks for the input. I use Hive 2 and still have this issue.




   1. Hive version 2
   2. Hive on Spark engine 1.3.1
   3. Spark 1.5.2


I have added Hive user group  to this as well. So hopefully we may get some
resolution.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 12 March 2016 at 19:25, Timur Shenkao <t...@timshenkao.su> wrote:

> Hi,
>
> I have suffered from Hive Streaming , Transactions enough, so I can share
> my experience with you.
>
> 1) It's not a problem of Spark. It happens because of "peculiarities" /
> bugs of Hive Streaming.  Hive Streaming, transactions are very raw
> technologies. If you look at Hive JIRA, you'll see several critical bugs
> concerning Hive Streaming, transactions. Some of them are resolved in Hive
> 2+ only. But Cloudera & Hortonworks ship their distributions with outdated
> & buggy Hive.
> So use Hive 2+. Earlier versions of Hive didn't run compaction at all.
>
> 2) In Hive 1.1, I  issue the following lines
> ALTER TABLE default.foo COMPACT 'MAJOR';
> SHOW COMPACTIONS;
>
> My manual compaction was shown but it was never fulfilled.
>
> 3) If you use Hive Streaming, it's not recommended or even forbidden to
> insert rows into Hive Streaming tables manually. Only the process that
> writes to such table should insert incoming rows sequentially. Otherwise
> you'll get unpredictable behaviour.
>
> 4) Ordinary Hive tables are catalogs with text, ORC, etc. files.
> Hive Streaming / transactional tables are catalogs that have numerous
> subcatalogs with "delta" prefix. Moreover, there are files with
> "flush_length" suffix in some delta subfolders. "flush_length" files have 8
> bytes length. The presence of "flush_length" file in some subfolder means
> that Hive writes updates to this subfolder right now. When Hive fails or is
> restarted, it begins to write into new delta subfolder with new
> "flush_length" file. And old "flush_length" file (that was used before
> failure) still remains.
> One of the goal of compaction is to delete outdated "flush_length" files.
> Not every application / library can read such folder structure or knows
> details of Hive Streaming / transactions implementation. Most of the
> software solutions still expect ordinary Hive tables as input.
> When they encounter subcatalogs or special files "flush_length" file,
> applications / libraries either "see nothing" (return 0 or empty result
> set) or stumble over "flush_length" files (return unexplainable errors).
>
> For instance, Facebook Presto couldn't read subfolders by default unless
> you activate special parameters. But it stumbles over "flush_length" files
> as Presto expect legal ORC files not 8-byte-length text files in folders.
>
> So, I don't advise you to use Hive Streaming, transactions right now in
> real production systems (24 / 7 /365) with hundreds millions of events a
> day.
>
> On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh <sanjiv.is...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I am facing this issue on HDP setup on which COMPACTION is required only
>> once for transactional tables to fetch records with Spark SQL.
>> On the other hand, Apache setup doesn't required compaction even once.
>>
>> May be something got triggered on meta-store after compaction, Spark SQL
>> start recognizing delta files.
>>
>> Let know me if needed other details to get root cause.
>>
>> Try this,
>>
>> *See complete scenario :*
>>
>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>> STORED AS ORC TBLPROPERTIES ('transactional'='true');
>> hive> insert into default.foo values(10);
>>
>> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
>> because data is still in delta files
>>
>> Now run major compaction:
>>
>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>
>> scala> sqlContext.table("default.foo").count // Gives 1
>>
>> hive> insert into foo values(20);
>>
>> scala> sqlContext.table("default.foo").count* // Gives 2 , no compaction
>> required.*
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>
>

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

Reply via email to