Hi All,

We are using for Spark SQL:
- Hive : 1.2.1
- Spark : 1.3.1
- Hadoop : 2.7.1

Let me know if any other details are needed to debug the issue.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Sun, Mar 13, 2016 at 1:07 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Thanks for the input. I use Hive 2 and still have this issue:
>
> 1. Hive version 2
> 2. Hive on Spark engine 1.3.1
> 3. Spark 1.5.2
>
> I have added the Hive user group to this as well, so hopefully we may get
> some resolution.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 12 March 2016 at 19:25, Timur Shenkao <t...@timshenkao.su> wrote:
>
>> Hi,
>>
>> I have suffered from Hive Streaming and transactions enough, so I can
>> share my experience with you.
>>
>> 1) It's not a problem of Spark. It happens because of "peculiarities" /
>> bugs of Hive Streaming. Hive Streaming and transactions are very raw
>> technologies. If you look at the Hive JIRA, you'll see several critical
>> bugs concerning Hive Streaming and transactions. Some of them are
>> resolved in Hive 2+ only, but Cloudera & Hortonworks ship their
>> distributions with outdated & buggy Hive.
>> So use Hive 2+. Earlier versions of Hive didn't run compaction at all.
>>
>> 2) In Hive 1.1, I issued the following statements:
>>
>> ALTER TABLE default.foo COMPACT 'MAJOR';
>> SHOW COMPACTIONS;
>>
>> My manual compaction was shown, but it was never carried out.
>>
>> 3) If you use Hive Streaming, it's not recommended (or is even forbidden)
>> to insert rows into Hive Streaming tables manually. Only the process that
>> writes to such a table should insert incoming rows, sequentially.
>> Otherwise you'll get unpredictable behaviour.
>>
>> 4) Ordinary Hive tables are directories with text, ORC, etc. files.
>> Hive Streaming / transactional tables are directories that contain
>> numerous subdirectories with a "delta" prefix. Moreover, some delta
>> subfolders contain files with a "flush_length" suffix; a "flush_length"
>> file is 8 bytes long. The presence of a "flush_length" file in a
>> subfolder means that Hive is writing updates to that subfolder right now.
>> When Hive fails or is restarted, it begins writing into a new delta
>> subfolder with a new "flush_length" file, and the old "flush_length" file
>> (the one in use before the failure) still remains.
>> One of the goals of compaction is to delete outdated "flush_length" files.
>> Not every application / library can read such a folder structure or knows
>> the details of the Hive Streaming / transactions implementation. Most
>> software solutions still expect ordinary Hive tables as input.
>> When they encounter the subdirectories or the special "flush_length"
>> files, applications / libraries either "see nothing" (return 0 or an
>> empty result set) or stumble over the "flush_length" files (return
>> unexplainable errors).
>>
>> For instance, Facebook Presto couldn't read subfolders by default unless
>> you activated special parameters, and it stumbles over "flush_length"
>> files, as Presto expects legal ORC files in the folders, not 8-byte text
>> files.
>>
>> So, I don't advise you to use Hive Streaming and transactions right now
>> in real production systems (24 / 7 / 365) with hundreds of millions of
>> events a day.
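A quick way to check whether a table is in the state described above is to list its warehouse directory and look for delta_* subdirectories and *_flush_length side files. Below is a minimal spark-shell sketch; the warehouse path is an assumption (it depends on hive.metastore.warehouse.dir), so adjust it for your cluster.

scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> // Assumed location of default.foo; substitute your own warehouse dir.
scala> val tableDir = new Path("/user/hive/warehouse/foo")
scala> // Print the table's subdirectories (delta_*) and the files inside them,
scala> // including any *_flush_length files left behind by Hive Streaming.
scala> fs.listStatus(tableDir).foreach { s =>
     |   println(s.getPath)
     |   if (s.isDirectory) fs.listStatus(s.getPath).foreach(f => println("  " + f.getPath))
     | }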
>>
>> On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am facing this issue on an HDP setup, on which COMPACTION is required
>>> (only once) for transactional tables before Spark SQL can fetch records.
>>> On the other hand, the Apache setup doesn't require compaction even once.
>>>
>>> Maybe something gets triggered in the metastore after compaction, and
>>> Spark SQL then starts recognizing the delta files.
>>>
>>> Let me know if any other details are needed to get to the root cause.
>>>
>>> Try this; see the complete scenario:
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>>       STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into default.foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count
>>> // Gives 0, which is wrong, because the data is still in delta files
>>>
>>> Now run a major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>> hive> insert into foo values(20);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 2, no compaction required
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
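For what it's worth, in the scenario above it may also be worth ruling out stale metadata on the Spark side: HiveContext exposes a refreshTable call that drops Spark's cached metadata and file listing for a table. Whether it changes the behaviour here is only an assumption, and depending on the current database the unqualified name "foo" may be needed instead; this is a sketch, not a confirmed fix.

scala> // After the MAJOR compaction has finished in Hive:
scala> sqlContext.refreshTable("default.foo")  // assumes sqlContext is a HiveContext, as in the thread
scala> sqlContext.table("default.foo").count   // re-check the count against the compacted base files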