This is the scenario I'm mentioning. I'm not using Spark JDBC; I'm not sure if that behaves differently.
Please walk through the commands below in the same order to understand the sequence.

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files

Now run major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:

> Hi Varadharajan,
>
> That is the point: Spark SQL is able to recognize delta files. See the
> directory structure below, ONE BASE (43 records) and one DELTA (created
> after the last insert). And I am able to see the last insert through
> Spark SQL.
>
> *See the complete scenario below:*
>
> *Steps:*
>
> - Inserted 43 records in the table.
> - Ran major compaction on the table:
>   *alter table mytable COMPACT 'major';*
> - Disabled auto compaction on the table:
>   *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
> - Inserted 1 record in the table.
>
> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
> drwxrwxrwx - root hdfs 0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_0000087
> drwxr-xr-x - root hdfs 0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>
> *SPARK JDBC:*
>
> 0: jdbc:hive2://myhost:9999> select count(*) from mytable;
> +------+
> | _c0  |
> +------+
> | 44   |
> +------+
> 1 row selected (1.196 seconds)
>
> *HIVE JDBC:*
>
> 1: jdbc:hive2://myhost:10000> select count(*) from mytable;
> +------+--+
> | _c0  |
> +------+--+
> | 44   |
> +------+--+
> 1 row selected (0.121 seconds)
>
> Regards
> Sanjiv Singh
> Mob : +091 9990-447-339
>
> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>
>> Hi Sanjiv,
>>
>> Yes. If we make use of Hive JDBC, we should be able to retrieve all the
>> rows, since it is Hive that processes the query. But I think one problem
>> with Hive JDBC is that there are two layers of processing: Hive first,
>> and then Spark over the result set. Another is that performance is
>> limited to that single HiveServer2 node and its network.
>>
>> But if we make use of the sqlContext.table function in Spark to access
>> Hive tables, it is supposed to read the files directly from HDFS,
>> skipping the Hive layer. However, it doesn't read delta files and only
>> reads the contents of the base folder. Only after a major compaction
>> are the delta files merged into the base folder and become visible to
>> Spark SQL.
>>
>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>
>>> Hi Varadharajan,
>>>
>>> Can you elaborate on this (you quoted it in a previous mail):
>>> "I observed that hive transaction storage structure do not work with
>>> spark yet"
>>>
>>> If it is related to the delta files created after each transaction,
>>> and Spark not being able to recognize them: I have a table *mytable*
>>> (ORC, BUCKETED, NON-SORTED) that has already seen lots of inserts,
>>> updates and deletes. I can see the delta files created in HDFS (see
>>> below), and I am still able to fetch consistent records through both
>>> Spark JDBC and Hive JDBC.
>>>
>>> No compaction has been triggered for that table.
>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>
>>> drwxrwxrwx - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>> drwxr-xr-x - root hdfs 0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>
>>> Regards
>>> Sanjiv Singh
>>> Mob : +091 9990-447-339
>>>
>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>
>>>> Actually, auto compaction, if enabled, is triggered based on the
>>>> volume of changes; it doesn't automatically run after every insert. I
>>>> think it's possible to reduce the thresholds, but that might reduce
>>>> performance by a big margin. As of now, we run compaction after the
>>>> batch insert completes.
>>>>
>>>> The only other way to solve this problem as of now is to use the Hive
>>>> JDBC API.
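For reference, the thresholds mentioned above are controlled by compactor properties in hive-site.xml. A minimal sketch, assuming Hive 0.14+; the values shown are the usual defaults and may differ across versions:

<property>
  <name>hive.compactor.delta.num.threshold</name>
  <!-- number of delta directories that triggers a minor compaction -->
  <value>10</value>
</property>
<property>
  <name>hive.compactor.delta.pct.threshold</name>
  <!-- delta-to-base size ratio that triggers a major compaction -->
  <value>0.1</value>
</property>

Lowering these makes compaction fire more often, which is exactly the performance trade-off warned about above.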
>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>
>>>>> Compaction would have been triggered automatically, as the following
>>>>> properties are already set in *hive-site.xml*, and the
>>>>> *NO_AUTO_COMPACTION* property has not been set for these tables.
>>>>>
>>>>> <property>
>>>>>   <name>hive.compactor.initiator.on</name>
>>>>>   <value>true</value>
>>>>> </property>
>>>>> <property>
>>>>>   <name>hive.compactor.worker.threads</name>
>>>>>   <value>1</value>
>>>>> </property>
>>>>>
>>>>> The documentation is disappointing sometimes.
>>>>>
>>>>> Regards
>>>>> Sanjiv Singh
>>>>> Mob : +091 9990-447-339
>>>>>
>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>
>>>>>> Yes, I was burned by this issue a couple of weeks back. This also
>>>>>> means that after every insert job, a compaction has to be run for
>>>>>> the new rows to be visible from Spark. It's sad that this issue is
>>>>>> not documented / mentioned anywhere.
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Varadharajan,
>>>>>>>
>>>>>>> Thanks for your response.
>>>>>>>
>>>>>>> Yes, it is a transactional table; see *show create table* below.
>>>>>>>
>>>>>>> The table has hardly 3 records, and after triggering a major
>>>>>>> compaction on it, it started showing results in Spark SQL.
>>>>>>>
>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>
>>>>>>> > *show create table hivespark;*
>>>>>>> CREATE TABLE `hivespark`(
>>>>>>>   `id` int,
>>>>>>>   `name` string)
>>>>>>> CLUSTERED BY (
>>>>>>>   id)
>>>>>>> INTO 32 BUCKETS
>>>>>>> ROW FORMAT SERDE
>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>> STORED AS INPUTFORMAT
>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>> OUTPUTFORMAT
>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>> LOCATION
>>>>>>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>> TBLPROPERTIES (
>>>>>>>   'COLUMN_STATS_ACCURATE'='true',
>>>>>>>   'last_modified_by'='root',
>>>>>>>   'last_modified_time'='1455859079',
>>>>>>>   'numFiles'='37',
>>>>>>>   'numRows'='3',
>>>>>>>   'rawDataSize'='0',
>>>>>>>   'totalSize'='11383',
>>>>>>>   'transactional'='true',
>>>>>>>   'transient_lastDdlTime'='1455864121');
>>>>>>>
>>>>>>> Regards
>>>>>>> Sanjiv Singh
>>>>>>> Mob : +091 9990-447-339
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is the transactional attribute set on your table? I observed that
>>>>>>>> the Hive transactional storage structure does not work with Spark
>>>>>>>> yet. You can confirm this by looking at the transactional
>>>>>>>> attribute in the output of "desc extended <tablename>" in the
>>>>>>>> Hive console.
>>>>>>>>
>>>>>>>> If you need to access a transactional table, consider doing a
>>>>>>>> major compaction first and then try accessing the table.
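For reference, a minimal sketch of that check-and-compact sequence (the table name is illustrative, and compaction runs asynchronously, so it may take a moment to finish):

hive> desc extended mytable;
-- look for transactional=true among the table parameters
hive> ALTER TABLE mytable COMPACT 'major';
hive> show compactions;
-- wait until the request for mytable is no longer in the initiated/working state

scala> sqlContext.table("default.mytable").count // now includes rows that were previously delta-only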
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have observed that Spark SQL is not returning records for Hive
>>>>>>>>> bucketed ORC tables on HDP.
>>>>>>>>>
>>>>>>>>> In Spark SQL, I am able to list all tables, but queries on Hive
>>>>>>>>> bucketed tables do not return records.
>>>>>>>>>
>>>>>>>>> I have also tried the same with non-bucketed Hive tables; that
>>>>>>>>> works fine.
>>>>>>>>>
>>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>>
>>>>>>>>> Let me know if you need other details.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Sanjiv Singh
>>>>>>>>> Mob : +091 9990-447-339

--
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
-By Prof. Randy Pausch in "The Last Lecture"

My Journal :- http://varadharajan.in
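The workaround this thread converges on (trigger a major compaction after each batch insert, via the Hive JDBC API, so that sqlContext.table sees the new rows) might look roughly like the sketch below. The host, port, database, table name, and credentials are placeholders; error handling and compaction polling are omitted:

import java.sql.DriverManager

object CompactAfterBatchInsert {
  def main(args: Array[String]): Unit = {
    // Connect to HiveServer2 over JDBC (placeholder host/port/db)
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb", "root", "")
    try {
      val stmt = conn.createStatement()
      // Ask Hive to fold the accumulated delta files into a new base
      // directory. Compaction is asynchronous; production code should
      // poll SHOW COMPACTIONS before reading from Spark.
      stmt.execute("ALTER TABLE mytable COMPACT 'major'")
      stmt.close()
    } finally {
      conn.close()
    }
    // Once the compaction has completed, Spark SQL sees all rows:
    //   sqlContext.table("mydb.mytable").count
  }
}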