Try this:

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files

Now run a major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no compaction required
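If the two counts ever disagree, one way to cross-check from the same spark-shell session is to run the same count through HiveServer2, since Hive merges the base and delta files on read while the direct table read does not. A rough sketch, assuming the Hive JDBC driver is on the classpath and HiveServer2 listens on myhost:10000 (host, database and credentials are illustrative):

  import java.sql.DriverManager

  val direct = sqlContext.table("default.foo").count()            // Spark's direct ORC read

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
  val rs = conn.createStatement().executeQuery("select count(*) from foo")
  rs.next()
  val viaHive = rs.getLong(1)                                      // HiveServer2 merges base + delta
  conn.close()

  println(s"direct=$direct viaHiveServer2=$viaHive")               // a mismatch points at un-compacted deltas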
Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:

> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure if it's different.
>
> Please walk through the commands below in the same order to understand the sequence.
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files
>
> Now run a major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>
>> Hi Varadharajan,
>>
>> That is the point: Spark SQL is able to recognize delta files. See the directory structure below, ONE BASE (43 records) and one DELTA (created after the last insert), and I am able to see the last insert through Spark SQL.
>>
>> *See the complete scenario below:*
>>
>> *Steps:*
>>
>>    - Inserted 43 records in the table.
>>    - Ran a major compaction on the table:
>>      *alter table mytable COMPACT 'major';*
>>    - Disabled auto compaction on the table:
>>      *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>>    - Inserted 1 record in the table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_0000087
>> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>
>> *SPARK JDBC:*
>>
>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>> +------+
>> | _c0  |
>> +------+
>> | 44   |
>> +------+
>> 1 row selected (1.196 seconds)
>>
>> *HIVE JDBC:*
>>
>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>> +------+--+
>> | _c0  |
>> +------+--+
>> | 44   |
>> +------+--+
>> 1 row selected (0.121 seconds)
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>
>>> Hi Sanjiv,
>>>
>>> Yes, if we make use of Hive JDBC we should be able to retrieve all the rows, since it is Hive that processes the query. But I think the problem with Hive JDBC is that there are two layers of processing: Hive first, and then Spark over the result set. The other issue is that performance is limited by that single HiveServer2 node and the network.
>>>
>>> But if we make use of the sqlContext.table function in Spark to access Hive tables, it is supposed to read the files directly from HDFS, skipping the Hive layer. It doesn't read the delta files and only reads the contents of the base folder. Only after a major compaction are the delta files merged into the base folder and made visible to Spark SQL.
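As a side note on the point above: a quick way to tell whether a direct read could be missing rows is to look for delta_* directories under the table's HDFS location. A rough sketch, assuming a spark-shell session; the warehouse path is the one from this thread and is purely illustrative:

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val tableDir = new Path("/apps/hive/warehouse/mydb.db/mytable")
  val names = fs.listStatus(tableDir).map(_.getPath.getName)
  val deltas = names.filter(_.startsWith("delta_"))   // un-compacted transactional writes
  val bases  = names.filter(_.startsWith("base_"))    // data a direct read will see
  println(s"base dirs: ${bases.length}, delta dirs: ${deltas.length}")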
>>>
>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Can you elaborate on this (quoted from your previous mail):
>>>> "I observed that Hive's transactional storage structure does not work with Spark yet"
>>>>
>>>> If it is about the delta files created after each transaction, which Spark would not be able to recognize: I have a table *mytable* (ORC, BUCKETED, NON-SORTED) that has already been through lots of inserts, updates and deletes. I can see the delta files created in HDFS (see below), and I am still able to fetch consistent records through both Spark JDBC and Hive JDBC.
>>>>
>>>> No compaction has been triggered for that table.
>>>>
>>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob : +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>
>>>>> Auto compaction, if enabled, is triggered based on the volume of changes; it doesn't automatically run after every insert. I think it's possible to lower the thresholds, but that might hurt performance by a big margin. As of now, we run a compaction after the batch insert completes.
>>>>>
>>>>> The only other way to solve this problem as of now is to use the Hive JDBC API.
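A rough sketch of that "compact after the batch insert completes" step, driven over Hive JDBC from the batch job itself (assumptions: the Hive JDBC driver is on the classpath, HiveServer2 runs at myhost:10000, and the database/table names are illustrative):

  import java.sql.DriverManager

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb", "hive", "")
  val stmt = conn.createStatement()
  // Queues a major compaction; the Hive compactor worker threads run it asynchronously,
  // so the merged base directory shows up a little later.
  stmt.execute("ALTER TABLE mytable COMPACT 'major'")
  stmt.close()
  conn.close()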
>>>>>
>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>
>>>>>> Compaction would have been triggered automatically, since the following properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION* property has not been set on these tables.
>>>>>>
>>>>>>   <property>
>>>>>>     <name>hive.compactor.initiator.on</name>
>>>>>>     <value>true</value>
>>>>>>   </property>
>>>>>>
>>>>>>   <property>
>>>>>>     <name>hive.compactor.worker.threads</name>
>>>>>>     <value>1</value>
>>>>>>   </property>
>>>>>>
>>>>>> The documentation is confusing sometimes.
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob : +091 9990-447-339
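To check whether the compaction initiator has actually picked anything up, rather than relying on the hive-site.xml settings alone, something like the following should work (same assumptions as the earlier sketches; SHOW COMPACTIONS lists queued, running and recently completed compactions):

  import java.sql.DriverManager

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
  val rs = conn.createStatement().executeQuery("SHOW COMPACTIONS")
  val cols = rs.getMetaData.getColumnCount
  while (rs.next()) {
    // Each row describes one compaction request: database, table, type (MAJOR/MINOR), state, ...
    println((1 to cols).map(i => rs.getString(i)).mkString("\t"))
  }
  conn.close()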
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I was burned by this issue a couple of weeks back. It also means that after every insert job, a compaction should be run so that the new rows become accessible from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Varadharajan,
>>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>>
>>>>>>>> Yes, it is a transactional table; see *show create table* below.
>>>>>>>>
>>>>>>>> The table has hardly 3 records, and after triggering a minor compaction on the table, it started showing results in Spark SQL.
>>>>>>>>
>>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>>
>>>>>>>> > *show create table hivespark;*
>>>>>>>> CREATE TABLE `hivespark`(
>>>>>>>>   `id` int,
>>>>>>>>   `name` string)
>>>>>>>> CLUSTERED BY (
>>>>>>>>   id)
>>>>>>>> INTO 32 BUCKETS
>>>>>>>> ROW FORMAT SERDE
>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>> STORED AS INPUTFORMAT
>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>> OUTPUTFORMAT
>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>> LOCATION
>>>>>>>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>> TBLPROPERTIES (
>>>>>>>>   'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>   'last_modified_by'='root',
>>>>>>>>   'last_modified_time'='1455859079',
>>>>>>>>   'numFiles'='37',
>>>>>>>>   'numRows'='3',
>>>>>>>>   'rawDataSize'='0',
>>>>>>>>   'totalSize'='11383',
>>>>>>>>   'transactional'='true',
>>>>>>>>   'transient_lastDdlTime'='1455864121');
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob : +091 9990-447-339
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Is the transactional attribute set on your table? I observed that Hive's transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of "desc extended <tablename>" in the hive console.
>>>>>>>>>
>>>>>>>>> If you need to access a transactional table, consider doing a major compaction first and then try accessing the table.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have observed that Spark SQL is not returning records for Hive bucketed ORC tables on HDP.
>>>>>>>>>>
>>>>>>>>>> In Spark SQL I am able to list all the tables, but queries on the Hive bucketed tables do not return any records.
>>>>>>>>>>
>>>>>>>>>> I have also tried the same with non-bucketed Hive tables, and that works fine.
>>>>>>>>>>
>>>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>>>
>>>>>>>>>> Let me know if you need any other details.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Sanjiv Singh
>>>>>>>>>> Mob : +091 9990-447-339
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                    -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in