Hi All,

We are using for Spark SQL:
- Hive : 1.2.1
- Spark : 1.3.1
- Hadoop : 2.7.1

Let me know if any other details are needed to debug the issue.

Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Sun, Mar 13, 2016 at 1:07 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi,
>
> Thanks for the input. I use Hive 2 and still have this issue:
>
> 1. Hive version 2
> 2. Hive on Spark engine 1.3.1
> 3. Spark 1.5.2
>
> I have added the Hive user group to this as well, so hopefully we may get
> some resolution.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 12 March 2016 at 19:25, Timur Shenkao <t...@timshenkao.su> wrote:
>
>> Hi,
>>
>> I have suffered from Hive Streaming and transactions enough, so I can
>> share my experience with you.
>>
>> 1) It's not a problem of Spark. It happens because of "peculiarities" /
>> bugs of Hive Streaming. Hive Streaming and transactions are very raw
>> technologies. If you look at the Hive JIRA, you'll see several critical
>> bugs concerning Hive Streaming and transactions. Some of them are
>> resolved in Hive 2+ only, but Cloudera & Hortonworks ship their
>> distributions with outdated & buggy Hive.
>> So use Hive 2+. Earlier versions of Hive didn't run compaction at all.
>>
>> 2) In Hive 1.1, I issued the following statements:
>>
>> ALTER TABLE default.foo COMPACT 'MAJOR';
>> SHOW COMPACTIONS;
>>
>> My manual compaction was shown, but it was never carried out.
>>
>> 3) If you use Hive Streaming, it's not recommended (or is even forbidden)
>> to insert rows into Hive Streaming tables manually. Only the process that
>> writes to such a table should insert incoming rows, sequentially.
>> Otherwise you'll get unpredictable behaviour.
>>
>> 4) Ordinary Hive tables are directories with text, ORC, etc. files.
>> Hive Streaming / transactional tables are directories that contain
>> numerous subdirectories with a "delta" prefix. Moreover, some delta
>> subfolders contain files with a "flush_length" suffix; a "flush_length"
>> file is 8 bytes long. The presence of a "flush_length" file in a
>> subfolder means that Hive is writing updates to that subfolder right now.
>> When Hive fails or is restarted, it begins writing into a new delta
>> subfolder with a new "flush_length" file, and the old "flush_length" file
>> (the one in use before the failure) still remains.
>> One of the goals of compaction is to delete outdated "flush_length" files.
>> Not every application / library can read such a folder structure or knows
>> the details of the Hive Streaming / transactions implementation. Most
>> software solutions still expect ordinary Hive tables as input.
>> When they encounter the subdirectories or the special "flush_length"
>> files, applications / libraries either "see nothing" (return 0 or an
>> empty result set) or stumble over the "flush_length" files (return
>> unexplainable errors).
>>
>> For instance, Facebook Presto couldn't read subfolders by default unless
>> you activated special parameters, and it stumbles over "flush_length"
>> files, as Presto expects legal ORC files in the folders, not 8-byte text
>> files.
>>
>> So, I don't advise you to use Hive Streaming and transactions right now
>> in real production systems (24 / 7 / 365) with hundreds of millions of
>> events a day.
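A quick way to check whether a table is in the state described above is to list its warehouse directory and look for delta_* subdirectories and *_flush_length side files. Below is a minimal spark-shell sketch; the warehouse path is an assumption (it depends on hive.metastore.warehouse.dir), so adjust it for your cluster.

scala> import org.apache.hadoop.fs.{FileSystem, Path}
scala> val fs = FileSystem.get(sc.hadoopConfiguration)
scala> // Assumed location of default.foo; substitute your own warehouse dir.
scala> val tableDir = new Path("/user/hive/warehouse/foo")
scala> // Print the table's subdirectories (delta_*) and the files inside them,
scala> // including any *_flush_length files left behind by Hive Streaming.
scala> fs.listStatus(tableDir).foreach { s =>
     |   println(s.getPath)
     |   if (s.isDirectory) fs.listStatus(s.getPath).foreach(f => println("  " + f.getPath))
     | }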
>>
>> On Sat, Mar 12, 2016 at 11:24 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I am facing this issue on an HDP setup, on which COMPACTION is required
>>> (only once) for transactional tables before Spark SQL can fetch records.
>>> On the other hand, the Apache setup doesn't require compaction even once.
>>>
>>> Maybe something gets triggered in the metastore after compaction, and
>>> Spark SQL then starts recognizing the delta files.
>>>
>>> Let me know if any other details are needed to get to the root cause.
>>>
>>> Try this; see the complete scenario:
>>>
>>> hive> create table default.foo(id int) clustered by (id) into 2 buckets
>>>       STORED AS ORC TBLPROPERTIES ('transactional'='true');
>>> hive> insert into default.foo values(10);
>>>
>>> scala> sqlContext.table("default.foo").count
>>> // Gives 0, which is wrong, because the data is still in delta files
>>>
>>> Now run a major compaction:
>>>
>>> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 1
>>>
>>> hive> insert into foo values(20);
>>>
>>> scala> sqlContext.table("default.foo").count // Gives 2, no compaction required
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
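For what it's worth, in the scenario above it may also be worth ruling out stale metadata on the Spark side: HiveContext exposes a refreshTable call that drops Spark's cached metadata and file listing for a table. Whether it changes the behaviour here is only an assumption, and depending on the current database the unqualified name "foo" may be needed instead; this is a sketch, not a confirmed fix.

scala> // After the MAJOR compaction has finished in Hive:
scala> sqlContext.refreshTable("default.foo")  // assumes sqlContext is a HiveContext, as in the thread
scala> sqlContext.table("default.foo").count   // re-check the count against the compacted base files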