This is the scenario I'm mentioning. I'm not using Spark JDBC; I'm not sure if that behaves differently.
Please walk through the commands below in the same order to understand the sequence.

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:

> Hi Varadharajan,
>
> That is the point: Spark SQL is able to recognize delta files. See the
> directory structure below, ONE BASE (43 records) and one DELTA (created
> after the last insert). And I am able to see the last insert through
> Spark SQL.
>
> *See the complete scenario below:*
>
> *Steps:*
>
> - Inserted 43 records in the table.
> - Ran major compaction on the table:
>   *alter table mytable COMPACT 'major';*
> - Disabled auto compaction on the table:
>   *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
> - Inserted 1 record in the table.
>
> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
> drwxrwxrwx - root hdfs 0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_0000087
> drwxr-xr-x - root hdfs 0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>
> *SPARK JDBC:*
>
> 0: jdbc:hive2://myhost:9999> select count(*) from mytable;
> +------+
> | _c0  |
> +------+
> | 44   |
> +------+
> 1 row selected (1.196 seconds)
>
> *HIVE JDBC:*
>
> 1: jdbc:hive2://myhost:10000> select count(*) from mytable;
> +------+--+
> | _c0  |
> +------+--+
> | 44   |
> +------+--+
> 1 row selected (0.121 seconds)
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>
>> Hi Sanjiv,
>>
>> Yes. If we make use of Hive JDBC, we should be able to retrieve all the
>> rows, since it is Hive that processes the query. But I think one problem
>> with Hive JDBC is that there are two layers of processing: Hive first,
>> and then Spark over the result set. Another is that performance is
>> limited to that single HiveServer2 node and its network.
>>
>> But if we make use of the sqlContext.table function in Spark to access
>> Hive tables, it is supposed to read the files directly from HDFS,
>> skipping the Hive layer. However, it doesn't read delta files and only
>> reads the contents of the base folder. Only after a major compaction
>> are the delta files merged into the base folder and become visible to
>> Spark SQL.
>>
>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>
>>> Hi Varadharajan,
>>>
>>> Can you elaborate on this (you quoted it in a previous mail):
>>> "I observed that hive transaction storage structure do not work with
>>> spark yet"
>>>
>>> If it is related to the delta files created after each transaction,
>>> and Spark not being able to recognize them: I have a table *mytable*
>>> (ORC, BUCKETED, NON-SORTED) that has already seen lots of inserts,
>>> updates and deletes. I can see the delta files created in HDFS (see
>>> below), and I am still able to fetch consistent records through both
>>> Spark JDBC and Hive JDBC.
>>>
>>> No compaction has been triggered for that table.
>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>
>>> drwxrwxrwx - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
>>>
>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>
>>>> Actually, auto compaction, if enabled, is triggered based on the
>>>> volume of changes; it doesn't automatically run after every insert. I
>>>> think it's possible to reduce the thresholds, but that might reduce
>>>> performance by a big margin. As of now, we run compaction after the
>>>> batch insert completes.
>>>>
>>>> The only other way to solve this problem as of now is to use the Hive
>>>> JDBC API.
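For reference, the thresholds mentioned above are controlled by compactor properties in hive-site.xml. A minimal sketch, assuming Hive 0.14+; the values shown are the usual defaults and may differ across versions:

<property>
  <name>hive.compactor.delta.num.threshold</name>
  <!-- number of delta directories that triggers a minor compaction -->
  <value>10</value>
</property>
<property>
  <name>hive.compactor.delta.pct.threshold</name>
  <!-- delta-to-base size ratio that triggers a major compaction -->
  <value>0.1</value>
</property>

Lowering these makes compaction fire more often, which is exactly the performance trade-off warned about above.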
>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>
>>>>> Compaction would have been triggered automatically, as the following
>>>>> properties are already set in *hive-site.xml*, and the
>>>>> *NO_AUTO_COMPACTION* property has not been set for these tables.
>>>>>
>>>>> <property>
>>>>>   <name>hive.compactor.initiator.on</name>
>>>>>   <value>true</value>
>>>>> </property>
>>>>> <property>
>>>>>   <name>hive.compactor.worker.threads</name>
>>>>>   <value>1</value>
>>>>> </property>
>>>>>
>>>>> The documentation is disappointing sometimes.
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob : +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>
>>>>>> Yes, I was burned by this issue a couple of weeks back. This also
>>>>>> means that after every insert job, a compaction has to be run for
>>>>>> the new rows to be visible from Spark. It's sad that this issue is
>>>>>> not documented / mentioned anywhere.
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Varadharajan,
>>>>>>>
>>>>>>> Thanks for your response.
>>>>>>>
>>>>>>> Yes, it is a transactional table; see *show create table* below.
>>>>>>>
>>>>>>> The table has hardly 3 records, and after triggering a major
>>>>>>> compaction on it, it started showing results in Spark SQL.
>>>>>>>
>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>
>>>>>>> > *show create table hivespark;*
>>>>>>> CREATE TABLE `hivespark`(
>>>>>>>   `id` int,
>>>>>>>   `name` string)
>>>>>>> CLUSTERED BY (
>>>>>>>   id)
>>>>>>> INTO 32 BUCKETS
>>>>>>> ROW FORMAT SERDE
>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>> STORED AS INPUTFORMAT
>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>> OUTPUTFORMAT
>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>> LOCATION
>>>>>>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>> TBLPROPERTIES (
>>>>>>>   'COLUMN_STATS_ACCURATE'='true',
>>>>>>>   'last_modified_by'='root',
>>>>>>>   'last_modified_time'='1455859079',
>>>>>>>   'numFiles'='37',
>>>>>>>   'numRows'='3',
>>>>>>>   'rawDataSize'='0',
>>>>>>>   'totalSize'='11383',
>>>>>>>   'transactional'='true',
>>>>>>>   'transient_lastDdlTime'='1455864121');
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob : +091 9990-447-339
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is the transactional attribute set on your table? I observed that
>>>>>>>> the Hive transactional storage structure does not work with Spark
>>>>>>>> yet. You can confirm this by looking at the transactional
>>>>>>>> attribute in the output of "desc extended <tablename>" in the
>>>>>>>> Hive console.
>>>>>>>>
>>>>>>>> If you need to access a transactional table, consider doing a
>>>>>>>> major compaction first and then try accessing the table.
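For reference, a minimal sketch of that check-and-compact sequence (the table name is illustrative, and compaction runs asynchronously, so it may take a moment to finish):

hive> desc extended mytable;
-- look for transactional=true among the table parameters
hive> ALTER TABLE mytable COMPACT 'major';
hive> show compactions;
-- wait until the request for mytable is no longer in the initiated/working state

scala> sqlContext.table("default.mytable").count // now includes rows that were previously delta-only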
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have observed that Spark SQL is not returning records for Hive
>>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>>
>>>>>>>>> In Spark SQL, I am able to list all tables, but queries on Hive
>>>>>>>>> bucketed tables do not return records.
>>>>>>>>>
>>>>>>>>> I have also tried the same with non-bucketed Hive tables; that
>>>>>>>>> works fine.
>>>>>>>>>
>>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>>
>>>>>>>>> Let me know if you need other details.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Sanjiv Singh
>>>>>>>>> Mob : +091 9990-447-339

--
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
-By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in
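The workaround this thread converges on (trigger a major compaction after each batch insert, via the Hive JDBC API, so that sqlContext.table sees the new rows) might look roughly like the sketch below. The host, port, database, table name, and credentials are placeholders; error handling and compaction polling are omitted:

import java.sql.DriverManager

object CompactAfterBatchInsert {
  def main(args: Array[String]): Unit = {
    // Connect to HiveServer2 over JDBC (placeholder host/port/db)
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb", "root", "")
    try {
      val stmt = conn.createStatement()
      // Ask Hive to fold the accumulated delta files into a new base
      // directory. Compaction is asynchronous; production code should
      // poll SHOW COMPACTIONS before reading from Spark.
      stmt.execute("ALTER TABLE mytable COMPACT 'major'")
      stmt.close()
    } finally {
      conn.close()
    }
    // Once the compaction has completed, Spark SQL sees all rows:
    //   sqlContext.table("mydb.mytable").count
  }
}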