Hi Varadharajan,
That is the point: Spark SQL is able to recognize delta files. See the directory structure below, with ONE base (43 records) and one delta (created after the last insert), and I am able to see the last insert through Spark SQL.

*See the complete scenario below:*

*Steps:*
- Inserted 43 records into the table.
- Ran a major compaction on the table:
  *alter table mytable COMPACT 'major';*
- Disabled auto compaction on the table:
  *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
- Inserted 1 record into the table.

> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*

drwxrwxrwx   - root hdfs   0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_0000087
drwxr-xr-x   - root hdfs   0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088

*SPARK JDBC:*

0: jdbc:hive2://myhost:9999> select count(*) from mytable;
+------+
| _c0  |
+------+
| 44   |
+------+
1 row selected (1.196 seconds)

*HIVE JDBC:*

1: jdbc:hive2://myhost:10000> select count(*) from mytable;
+------+--+
| _c0  |
+------+--+
| 44   |
+------+--+
1 row selected (0.121 seconds)
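For anyone trying to reproduce the comparison, below is roughly the same check done directly from spark-shell instead of through the thrift servers. This is only a minimal sketch: it assumes a Spark 1.x build with Hive support (where the predefined sqlContext is already a HiveContext) and reuses the placeholder database and table names from above.

  // spark-shell sketch (Spark 1.x with Hive support); sqlContext is already a HiveContext there.
  // For a standalone app you would build it yourself from an existing SparkContext:
  //   import org.apache.spark.sql.hive.HiveContext
  //   val sqlContext = new HiveContext(sc)

  sqlContext.sql("USE mydb")

  // Path 1: sqlContext.table() reads the ORC files under the table location directly,
  // bypassing HiveServer2; this is the path that reportedly misses delta files.
  val directCount = sqlContext.table("mytable").count()

  // Path 2: the same count expressed as SQL on the HiveContext.
  val sqlCount = sqlContext.sql("SELECT COUNT(*) FROM mytable").collect().head.getLong(0)

  println(s"table(): $directCount, sql(): $sqlCount")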
Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:

> Hi Sanjiv,
>
> Yes. If we make use of Hive JDBC, we should be able to retrieve all the rows, since it is Hive which processes the query. But I think the problem with Hive JDBC is that there are two layers of processing: first Hive, and then Spark on the result set. The other issue is that performance is limited to that single HiveServer2 node and its network.
>
> But if we make use of the sqlContext.table function in Spark to access Hive tables, it is supposed to read the files directly from HDFS, skipping the Hive layer. However, it doesn't read the delta files and only reads the contents of the base folder. Only after a major compaction are the delta files merged into the base folder and become visible to Spark SQL.
>
> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>
>> Hi Varadharajan,
>>
>> Can you elaborate on this (which you mentioned in a previous mail):
>> "I observed that hive transaction storage structure do not work with spark yet"
>>
>> If this is about the delta files created after each transaction and Spark not being able to recognize them: I have a table *mytable* (ORC, BUCKETED, NON-SORTED) on which I have already done lots of inserts, updates and deletes. I can see the delta files created in HDFS (see below), and I am still able to fetch consistent records through both Spark JDBC and Hive JDBC.
>>
>> No compaction has been triggered for that table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>
>> drwxrwxrwx   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_0000060
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>> drwxr-xr-x   - root hdfs   0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>
>>> Actually, auto compaction (if enabled) is triggered based on the volume of changes; it doesn't automatically run after every insert. I think it is possible to lower the thresholds, but that might reduce performance by a big margin. As of now, we run a compaction after the batch insert completes.
>>>
>>> The only other way to solve this problem as of now is to use the Hive JDBC API.
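For reference, a minimal sketch of that Hive JDBC route from Scala, querying HiveServer2 directly so that Hive (rather than Spark's own ORC reader) resolves the base and delta files. The host, port, database and empty credentials below are placeholders taken from this thread, and the hive-jdbc driver plus its dependencies are assumed to be on the classpath.

  import java.sql.DriverManager

  // Register the HiveServer2 JDBC driver and open a connection (placeholders below).
  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb", "user", "")
  try {
    // Hive executes the query, so ACID base + delta files are all accounted for.
    val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM mytable")
    while (rs.next()) println(s"count via Hive JDBC = ${rs.getLong(1)}")
  } finally {
    conn.close()
  }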
>>>
>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>
>>>> Compaction would have been triggered automatically, as the following properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION* property has not been set for these tables.
>>>>
>>>> <property>
>>>>   <name>hive.compactor.initiator.on</name>
>>>>   <value>true</value>
>>>> </property>
>>>>
>>>> <property>
>>>>   <name>hive.compactor.worker.threads</name>
>>>>   <value>1</value>
>>>> </property>
>>>>
>>>> The documentation is misleading sometimes.
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob : +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>
>>>>> Yes, I was burned by this issue a couple of weeks back. It also means that after every insert job, a compaction has to be run before the new rows are visible from Spark. It's sad that this issue is not documented / mentioned anywhere.
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>
>>>>>> Hi Varadharajan,
>>>>>>
>>>>>> Thanks for your response.
>>>>>>
>>>>>> Yes, it is a transactional table; see the *show create table* output below.
>>>>>>
>>>>>> The table has hardly 3 records, and only after triggering a major compaction on it did it start showing results in Spark SQL.
>>>>>>
>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>
>>>>>> > *show create table hivespark;*
>>>>>>
>>>>>> CREATE TABLE `hivespark`(
>>>>>>   `id` int,
>>>>>>   `name` string)
>>>>>> CLUSTERED BY (
>>>>>>   id)
>>>>>> INTO 32 BUCKETS
>>>>>> ROW FORMAT SERDE
>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>> STORED AS INPUTFORMAT
>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>> OUTPUTFORMAT
>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>> LOCATION
>>>>>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>> TBLPROPERTIES (
>>>>>>   'COLUMN_STATS_ACCURATE'='true',
>>>>>>   'last_modified_by'='root',
>>>>>>   'last_modified_time'='1455859079',
>>>>>>   'numFiles'='37',
>>>>>>   'numRows'='3',
>>>>>>   'rawDataSize'='0',
>>>>>>   'totalSize'='11383',
>>>>>>   'transactional'='true',
>>>>>>   'transient_lastDdlTime'='1455864121');
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob : +091 9990-447-339
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is the transactional attribute set on your table? I observed that the Hive transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of "desc extended <tablename>" in the Hive console.
>>>>>>>
>>>>>>> If you need to access a transactional table, consider doing a major compaction first and then try accessing the table.
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have observed that Spark SQL is not returning records for Hive bucketed ORC tables on HDP.
>>>>>>>>
>>>>>>>> On Spark SQL, I am able to list all tables, but queries on Hive bucketed tables do not return any records.
>>>>>>>>
>>>>>>>> I have also tried the same with non-bucketed Hive tables, and those work fine.
>>>>>>>>
>>>>>>>> The same setup works on a plain Apache installation.
>>>>>>>>
>>>>>>>> Let me know if you need any other details.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob : +091 9990-447-339
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> M. Varadharajan
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>               -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>
>>>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>               -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>               -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>               -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in