This is the scenario I'm referring to. I'm not using Spark JDBC; I'm not
sure if that behaves differently.

Please walk through the commands below, in order, to follow the sequence.

hive> create table default.foo(id int) clustered by (id) into 2 buckets
STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong
because data is still in delta files

Now run a major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1
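
Note that ALTER TABLE ... COMPACT only queues a compaction request, which is
processed asynchronously, so it is worth confirming the compaction has
actually finished before re-running the count. A minimal check (the warehouse
path is an assumption based on my setup; yours may differ):

hive> SHOW COMPACTIONS;
-- wait until the request for default.foo is no longer "initiated"/"working"

> hadoop fs -ls /apps/hive/warehouse/foo
-- after the major compaction a base_N directory should appear; the old
-- delta directories are removed once the cleaner runs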


On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is...@gmail.com>
wrote:

> Hi Varadharajan,
>
>
> That is the point: Spark SQL *is* able to recognize delta files. See the
> directory structure below, with one BASE (43 records) and one DELTA
> (created after the last insert). And I am able to see the last insert
> through Spark SQL.
>
>
> *See below complete scenario :*
>
> *Steps:*
>
>    - Inserted 43 records into the table.
>    - Ran a major compaction on the table:
>    - *alter table mytable COMPACT 'major';*
>    - Disabled auto-compaction on the table:
>    - *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>    - Inserted 1 record into the table.
>
>
> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43
> /apps/hive/warehouse/mydb.db/mytable/base_0000087
> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02
> /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>
> *SPARK JDBC :*
>
> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
> +------+
> | _c0  |
> +------+
> | 44   |
> +------+
> 1 row selected (1.196 seconds)
>
> *HIVE JDBC :*
>
> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
> +------+--+
> | _c0  |
> +------+--+
> | 44   |
> +------+--+
> 1 row selected (0.121 seconds)
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>
> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <
> srinath...@gmail.com> wrote:
>
>> Hi Sanjiv,
>>
>> Yes. If we use Hive JDBC, we should be able to retrieve all the rows,
>> since it is Hive that processes the query. But I think the problem with
>> Hive JDBC is that there are two layers of processing: Hive executes the
>> query, and then Spark works with the result set. Another issue is that
>> performance is limited by that single HiveServer2 node and the network.
>>
>> If we instead use the sqlContext.table function in Spark to access Hive
>> tables, it reads the files directly from HDFS, skipping the Hive layer.
>> But it does not read the delta files; it only reads the contents of the
>> base folder. Only after a major compaction, when the delta files have been
>> merged into the base folder, do the new rows become visible to Spark SQL.
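>>
>> For illustration, a minimal sketch of the two access paths from the Spark
>> shell (the HiveServer2 URL, user, and table name here are placeholders,
>> and the hive-jdbc driver must be on the classpath):
>>
>> scala> import java.sql.DriverManager
>> scala> val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "user", "")
>> scala> val rs = conn.createStatement().executeQuery("select count(*) from mytable")
>> scala> rs.next(); rs.getLong(1) // Hive executes the query, so rows in delta files are included
>>
>> scala> sqlContext.table("default.mytable").count // Spark reads the ORC base files itself; delta rows are missed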
>>
>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com>
>> wrote:
>>
>>> Hi Varadharajan,
>>>
>>> Can you elaborate on this point from your previous mail:
>>> "I observed that Hive's transactional storage structure does not work
>>> with Spark yet"?
>>>
>>>
>>> If it is about the delta files created after each transaction, and Spark
>>> not being able to recognize them: I have a table *mytable* (ORC,
>>> BUCKETED, NON-SORTED) that has already been through lots of inserts,
>>> updates, and deletes. I can see the delta files created in HDFS (see
>>> below), yet I am still able to fetch consistent records through both
>>> Spark JDBC and Hive JDBC.
>>>
>>> No compaction has been triggered for that table.
>>>
>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>
>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
>>> /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>
>>>
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob :  +091 9990-447-339
>>>
>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <
>>> srinath...@gmail.com> wrote:
>>>
>>>> Actually, auto-compaction (if enabled) is triggered based on the volume
>>>> of changes; it doesn't automatically run after every insert. I think it's
>>>> possible to lower the thresholds, but that might reduce performance by a
>>>> big margin. As of now, we run compaction after the batch insert completes.
>>>>
>>>> The only other way to solve this problem as of now is to use the Hive
>>>> JDBC API.
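>>>>
>>>> For reference, the volume thresholds I mentioned are controlled by
>>>> properties along these lines in hive-site.xml (the values shown are the
>>>> defaults as I understand them; lowering them makes compaction fire more
>>>> often, at a cost):
>>>>
>>>>     <property>
>>>>       <!-- number of delta directories that triggers a minor compaction -->
>>>>       <name>hive.compactor.delta.num.threshold</name>
>>>>       <value>10</value>
>>>>     </property>
>>>>     <property>
>>>>       <!-- delta/base size ratio that triggers a major compaction -->
>>>>       <name>hive.compactor.delta.pct.threshold</name>
>>>>       <value>0.1</value>
>>>>     </property>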
>>>>
>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com
>>>> > wrote:
>>>>
>>>>> Compaction should have been triggered automatically, as the following
>>>>> properties are already set in *hive-site.xml*, and the
>>>>> *NO_AUTO_COMPACTION* property has not been set on these tables.
>>>>>
>>>>>
>>>>>     <property>
>>>>>       <name>hive.compactor.initiator.on</name>
>>>>>       <value>true</value>
>>>>>     </property>
>>>>>     <property>
>>>>>       <name>hive.compactor.worker.threads</name>
>>>>>       <value>1</value>
>>>>>     </property>
>>>>>
>>>>>
>>>>> The documentation is disappointing sometimes.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob :  +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>>>>> srinath...@gmail.com> wrote:
>>>>>
>>>>>> Yes, I was burned by this issue a couple of weeks back. It also means
>>>>>> that after every insert job, a compaction has to be run before the new
>>>>>> rows are accessible from Spark. It's sad that this issue is not
>>>>>> documented or mentioned anywhere.
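>>>>>>
>>>>>> For instance, the insert job can queue the compaction itself over Hive
>>>>>> JDBC once it finishes (connection details and table name are
>>>>>> placeholders):
>>>>>>
>>>>>> scala> import java.sql.DriverManager
>>>>>> scala> val stmt = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "user", "").createStatement()
>>>>>> scala> stmt.execute("ALTER TABLE mytable COMPACT 'major'") // queues a major compaction request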
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <
>>>>>> sanjiv.is...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Varadharajan,
>>>>>>>
>>>>>>> Thanks for your response.
>>>>>>>
>>>>>>> Yes, it is a transactional table; see the *show create table* output
>>>>>>> below.
>>>>>>>
>>>>>>> The table has barely 3 records, and after triggering a major compaction
>>>>>>> on it, Spark SQL started showing results.
>>>>>>>
>>>>>>>
>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>
>>>>>>>
>>>>>>> > *show create table hivespark;*
>>>>>>>
>>>>>>>   CREATE TABLE `hivespark`(
>>>>>>>     `id` int,
>>>>>>>     `name` string)
>>>>>>>   CLUSTERED BY (
>>>>>>>     id)
>>>>>>>   INTO 32 BUCKETS
>>>>>>>   ROW FORMAT SERDE
>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>   STORED AS INPUTFORMAT
>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>   OUTPUTFORMAT
>>>>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>   LOCATION
>>>>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>   TBLPROPERTIES (
>>>>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>>>>     'last_modified_by'='root',
>>>>>>>     'last_modified_time'='1455859079',
>>>>>>>     'numFiles'='37',
>>>>>>>     'numRows'='3',
>>>>>>>     'rawDataSize'='0',
>>>>>>>     'totalSize'='11383',
>>>>>>>     'transactional'='true',
>>>>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>>>>
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob :  +091 9990-447-339
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>>>>> srinath...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is the transactional attribute set on your table? I observed that
>>>>>>>> Hive's transactional storage structure does not work with Spark yet.
>>>>>>>> You can confirm this by looking at the transactional attribute in the
>>>>>>>> output of "desc extended <tablename>" in the Hive console.
>>>>>>>>
>>>>>>>> If you need to access a transactional table, consider running a major
>>>>>>>> compaction and then try accessing the table again.
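>>>>>>>>
>>>>>>>> For example, something along these lines from the Hive console (the
>>>>>>>> table name is a placeholder):
>>>>>>>>
>>>>>>>> hive> desc extended mytable;
>>>>>>>> -- look for transactional=true among the table parameters
>>>>>>>> hive> ALTER TABLE mytable COMPACT 'major';
>>>>>>>> hive> SHOW COMPACTIONS;
>>>>>>>> -- re-query from Spark once the request is no longer pending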
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <
>>>>>>>> sanjiv.is...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I have observed that Spark SQL is not returning records for Hive
>>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>>
>>>>>>>>> In Spark SQL, I am able to list all the tables, but queries on Hive
>>>>>>>>> bucketed tables are not returning records.
>>>>>>>>>
>>>>>>>>> I have also tried the same on non-bucketed Hive tables; those work
>>>>>>>>> fine.
>>>>>>>>>
>>>>>>>>> The same setup works on a plain Apache installation.
>>>>>>>>>
>>>>>>>>> Let me know if you need any other details.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Sanjiv Singh
>>>>>>>>> Mob :  +091 9990-447-339
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks,
>>>>>>>> M. Varadharajan
>>>>>>>>
>>>>>>>> ------------------------------------------------
>>>>>>>>
>>>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>>>
>>>>>>>> My Journal :- http://varadharajan.in
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> M. Varadharajan
>>>>>>
>>>>>> ------------------------------------------------
>>>>>>
>>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>>
>>>>>> My Journal :- http://varadharajan.in
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> M. Varadharajan
>>>>
>>>> ------------------------------------------------
>>>>
>>>> "Experience is what you get when you didn't get what you wanted"
>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>
>>>> My Journal :- http://varadharajan.in
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> M. Varadharajan
>>
>> ------------------------------------------------
>>
>> "Experience is what you get when you didn't get what you wanted"
>>                -By Prof. Randy Pausch in "The Last Lecture"
>>
>> My Journal :- http://varadharajan.in
>>
>
>


-- 
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
               -By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in
