Hi Varadharajan,

That is the point: Spark SQL is able to recognize delta files. See the
directory structure below, one base (43 records) and one delta (created
after the last insert). And I am able to see the last insert through Spark
SQL.


*See the complete scenario below:*

*Steps:*

   - Inserted 43 records into the table.
   - Ran a major compaction on the table:
   - *alter table mytable COMPACT 'major';*
   - Disabled auto compaction on the table:
   - *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
   - Inserted 1 record into the table.


> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
/apps/hive/warehouse/mydb.db/mytable/base_0000087
drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
/apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088

*SPARK JDBC:*

0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
+------+
| _c0  |
+------+
| 44   |
+------+
1 row selected (1.196 seconds)

*HIVE JDBC:*

1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
+------+--+
| _c0  |
+------+--+
| 44   |
+------+--+
1 row selected (0.121 seconds)


Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
srinath...@gmail.com> wrote:

> Hi Sanjiv,
>
> Yes. If we make use of Hive JDBC we should be able to retrieve all the
> rows, since it is Hive that processes the query. But I think the problem
> with Hive JDBC is that there are two layers of processing, Hive first and
> then Spark over the result set. Another issue is that performance is
> limited to that single HiveServer2 node and its network.
>
> But if we make use of the sqlContext.table function in Spark to access
> Hive tables, it is supposed to read the files directly from HDFS, skipping
> the Hive layer. However, it doesn't read delta files and only reads the
> contents of the base folder. Only after a major compaction are the delta
> files merged into the base folder and made visible to Spark SQL.
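>
> For illustration, a minimal spark-shell (Spark 1.x) sketch of that
> direct-read path; the database and table names are placeholders:
>
>     import org.apache.spark.sql.hive.HiveContext
>
>     val sqlContext = new HiveContext(sc)  // sc comes from the spark-shell
>     // Reads the table's ORC files straight from HDFS, bypassing
>     // HiveServer2; rows still sitting in delta_* dirs are missed
>     // until a major compaction folds them into a new base_*.
>     val df = sqlContext.table("mydb.mytable")
>     println(df.count())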
>
> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com>
> wrote:
>
>> Hi Varadharajan,
>>
>> Can you elaborate on this (quoted from your previous mail):
>> "I observed that Hive's transactional storage structure does not work
>> with Spark yet"
>>
>>
>> If it is related to the delta files created after each transaction, and
>> Spark not being able to recognize them: I have a table *mytable* (ORC,
>> bucketed, non-sorted) that has already seen lots of inserts, updates and
>> deletes. I can see the delta files created in HDFS (see below), yet I am
>> still able to fetch consistent records through both Spark JDBC and Hive
>> JDBC.
>>
>> No compaction has been triggered for that table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>
>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> Actually, auto compaction, if enabled, is triggered based on the volume
>>> of changes; it doesn't automatically run after every insert. I think it's
>>> possible to lower the thresholds, but that might reduce performance by a
>>> big margin. As of now, we run compaction after the batch insert completes.
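>>>
>>> If memory serves, the relevant knobs are the compactor thresholds in
>>> *hive-site.xml*; the names and defaults below are from memory, so please
>>> verify them against your Hive version:
>>>
>>>     <!-- number of delta dirs that triggers a minor compaction -->
>>>     <property>
>>>       <name>hive.compactor.delta.num.threshold</name>
>>>       <value>10</value>
>>>     </property>
>>>     <!-- delta/base size ratio that triggers a major compaction -->
>>>     <property>
>>>       <name>hive.compactor.delta.pct.threshold</name>
>>>       <value>0.1</value>
>>>     </property>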
>>>
>>> The only other way to solve this problem as of now is to use the Hive
>>> JDBC API.
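>>>
>>> As a sketch of that route in Scala (the host, database and table names
>>> below follow the placeholders used elsewhere in this thread):
>>>
>>>     import java.sql.DriverManager
>>>
>>>     // HiveServer2 runs the query itself and merges base_* with delta_*,
>>>     // so freshly inserted rows are visible even without compaction.
>>>     Class.forName("org.apache.hive.jdbc.HiveDriver")
>>>     val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb")
>>>     val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM mytable")
>>>     while (rs.next()) println(rs.getLong(1))
>>>     conn.close()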
>>>
>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com>
>>> wrote:
>>>
>>>> Compaction should have been triggered automatically, as the following
>>>> properties are already set in *hive-site.xml*, and the
>>>> *NO_AUTO_COMPACTION* property has not been set for these tables.
>>>>
>>>>
>>>>     <property>
>>>>       <name>hive.compactor.initiator.on</name>
>>>>       <value>true</value>
>>>>     </property>
>>>>     <property>
>>>>       <name>hive.compactor.worker.threads</name>
>>>>       <value>1</value>
>>>>     </property>
>>>>
>>>>
>>>> The documentation is disappointing sometimes.
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>> srinath...@gmail.com> wrote:
>>>>
>>>>> Yes, I was burned by this issue a couple of weeks back. This also
>>>>> means that after every insert job, a compaction should be run before the
>>>>> new rows are accessible from Spark. Sad that this issue is not documented
>>>>> / mentioned anywhere.
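>>>>>
>>>>> That post-insert step could be issued over Hive JDBC; a hedged Scala
>>>>> sketch (host, database and table names are placeholders):
>>>>>
>>>>>     import java.sql.DriverManager
>>>>>
>>>>>     Class.forName("org.apache.hive.jdbc.HiveDriver")
>>>>>     val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb")
>>>>>     // Queue a major compaction so the delta_* dirs are merged into a
>>>>>     // new base_*, which Spark's direct HDFS reads can then see.
>>>>>     conn.createStatement().execute("ALTER TABLE mytable COMPACT 'major'")
>>>>>     conn.close()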
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi Varadharajan,
>>>>>>
>>>>>> Thanks for your response.
>>>>>>
>>>>>> Yes, it is a transactional table; see the *show create table* output
>>>>>> below.
>>>>>>
>>>>>> The table has hardly 3 records, and after triggering a major compaction
>>>>>> on it, it started showing results in Spark SQL.
>>>>>>
>>>>>>
>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>
>>>>>>
>>>>>> > *show create table hivespark;*
>>>>>>
>>>>>>   CREATE TABLE `hivespark`(
>>>>>>     `id` int,
>>>>>>     `name` string)
>>>>>>   CLUSTERED BY (
>>>>>>     id)
>>>>>>   INTO 32 BUCKETS
>>>>>>   ROW FORMAT SERDE
>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>   STORED AS INPUTFORMAT
>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>   OUTPUTFORMAT
>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>   LOCATION
>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>   TBLPROPERTIES (
>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>     'last_modified_by'='root',
>>>>>>     'last_modified_time'='1455859079',
>>>>>>     'numFiles'='37',
>>>>>>     'numRows'='3',
>>>>>>     'rawDataSize'='0',
>>>>>>     'totalSize'='11383',
>>>>>>     'transactional'='true',
>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>> srinath...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Is the transactional attribute set on your table? I observed that
>>>>>>> Hive's transactional storage structure does not work with Spark yet. You
>>>>>>> can confirm this by looking at the transactional attribute in the output
>>>>>>> of "desc extended <tablename>" in the hive console.
>>>>>>>
>>>>>>> If you need to access a transactional table, consider doing a major
>>>>>>> compaction and then try accessing the table again.
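>>>>>>>
>>>>>>> For illustration, the same check can be scripted from a Spark 1.x
>>>>>>> spark-shell; the table name is a placeholder and the DESCRIBE output
>>>>>>> format may vary by version:
>>>>>>>
>>>>>>>     import org.apache.spark.sql.hive.HiveContext
>>>>>>>
>>>>>>>     val hc = new HiveContext(sc)  // sc comes from the spark-shell
>>>>>>>     // Look for transactional=true among the printed table parameters.
>>>>>>>     hc.sql("DESCRIBE EXTENDED mytable").collect().foreach(println)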
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>> sanjiv.is...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>> I have observed that Spark SQL is not returning records for Hive
>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In Spark SQL, I am able to list all tables, but queries on Hive
>>>>>>>> bucketed tables are not returning records.
>>>>>>>>
>>>>>>>> I have also tried the same for non-bucketed Hive tables; they work
>>>>>>>> fine.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>
>>>>>>>> Let me know if you need any other details.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>> M. Varadharajan
>>>>>>>
>>>>>>> ------------------------------------------------
>>>>>>>
>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>
>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>
