Try this:

hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into default.foo values(10);

scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files

Now run a major compaction:

hive> ALTER TABLE default.foo COMPACT 'MAJOR';

scala> sqlContext.table("default.foo").count // Gives 1

hive> insert into foo values(20);

scala> sqlContext.table("default.foo").count // Gives 2, no compaction required
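If the two counts ever disagree, one way to cross-check from the same spark-shell session is to run the same count through HiveServer2, since Hive merges the base and delta files on read while the direct table read does not. A rough sketch, assuming the Hive JDBC driver is on the classpath and HiveServer2 listens on myhost:10000 (host, database and credentials are illustrative):

  import java.sql.DriverManager

  val direct = sqlContext.table("default.foo").count()            // Spark's direct ORC read

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
  val rs = conn.createStatement().executeQuery("select count(*) from foo")
  rs.next()
  val viaHive = rs.getLong(1)                                      // HiveServer2 merges base + delta
  conn.close()

  println(s"direct=$direct viaHiveServer2=$viaHive")               // a mismatch points at un-compacted deltas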
Regards
Sanjiv Singh
Mob : +091 9990-447-339

On Tue, Feb 23, 2016 at 2:02 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:

> This is the scenario I'm mentioning. I'm not using Spark JDBC; not sure if it's different.
>
> Please walk through the commands below in the same order to understand the sequence.
>
> hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
> hive> insert into foo values(10);
>
> scala> sqlContext.table("default.foo").count // Gives 0, which is wrong because the data is still in delta files
>
> Now run a major compaction:
>
> hive> ALTER TABLE default.foo COMPACT 'MAJOR';
>
> scala> sqlContext.table("default.foo").count // Gives 1
>
> On Tue, Feb 23, 2016 at 12:35 PM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>
>> Hi Varadharajan,
>>
>> That is the point: Spark SQL is able to recognize delta files. See the directory structure below, ONE BASE (43 records) and one DELTA (created after the last insert), and I am able to see the last insert through Spark SQL.
>>
>> *See the complete scenario below:*
>>
>> *Steps:*
>>
>>    - Inserted 43 records in the table.
>>    - Ran a major compaction on the table:
>>      *alter table mytable COMPACT 'major';*
>>    - Disabled auto compaction on the table:
>>      *alter table mytable set TBLPROPERTIES("NO_AUTO_COMPACTION"="true");*
>>    - Inserted 1 record in the table.
>>
>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:43 /apps/hive/warehouse/mydb.db/mytable/base_0000087
>> drwxr-xr-x   - root hdfs          0 2016-02-23 12:02 /apps/hive/warehouse/mydb.db/mytable/delta_0000088_0000088
>>
>> *SPARK JDBC:*
>>
>> 0: jdbc:hive2://myhost:9999> select count(*) from mytable ;
>> +------+
>> | _c0  |
>> +------+
>> | 44   |
>> +------+
>> 1 row selected (1.196 seconds)
>>
>> *HIVE JDBC:*
>>
>> 1: jdbc:hive2://myhost:10000> select count(*) from mytable ;
>> +------+--+
>> | _c0  |
>> +------+--+
>> | 44   |
>> +------+--+
>> 1 row selected (0.121 seconds)
>>
>> Regards
>> Sanjiv Singh
>> Mob : +091 9990-447-339
>>
>> On Tue, Feb 23, 2016 at 12:04 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>
>>> Hi Sanjiv,
>>>
>>> Yes, if we make use of Hive JDBC we should be able to retrieve all the rows, since it is Hive that processes the query. But I think the problem with Hive JDBC is that there are two layers of processing: Hive first, and then Spark over the result set. The other issue is that performance is limited by that single HiveServer2 node and the network.
>>>
>>> But if we make use of the sqlContext.table function in Spark to access Hive tables, it is supposed to read the files directly from HDFS, skipping the Hive layer. It doesn't read the delta files and only reads the contents of the base folder. Only after a major compaction are the delta files merged into the base folder and made visible to Spark SQL.
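As a side note on the point above: a quick way to tell whether a direct read could be missing rows is to look for delta_* directories under the table's HDFS location. A rough sketch, assuming a spark-shell session; the warehouse path is the one from this thread and is purely illustrative:

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val tableDir = new Path("/apps/hive/warehouse/mydb.db/mytable")
  val names = fs.listStatus(tableDir).map(_.getPath.getName)
  val deltas = names.filter(_.startsWith("delta_"))   // un-compacted transactional writes
  val bases  = names.filter(_.startsWith("base_"))    // data a direct read will see
  println(s"base dirs: ${bases.length}, delta dirs: ${deltas.length}")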
>>>
>>> On Tue, Feb 23, 2016 at 11:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Can you elaborate on this (quoted from your previous mail):
>>>> "I observed that Hive's transactional storage structure does not work with Spark yet"
>>>>
>>>> If it is about the delta files created after each transaction, which Spark would not be able to recognize: I have a table *mytable* (ORC, BUCKETED, NON-SORTED) that has already been through lots of inserts, updates and deletes. I can see the delta files created in HDFS (see below), and I am still able to fetch consistent records through both Spark JDBC and Hive JDBC.
>>>>
>>>> No compaction has been triggered for that table.
>>>>
>>>> > *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*
>>>> drwxrwxrwx   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/base_0000060
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:38 /apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:39 /apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:40 /apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
>>>> drwxr-xr-x   - root hdfs          0 2016-02-23 11:41 /apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob : +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>
>>>>> Auto compaction, if enabled, is triggered based on the volume of changes; it doesn't automatically run after every insert. I think it's possible to lower the thresholds, but that might hurt performance by a big margin. As of now, we run a compaction after the batch insert completes.
>>>>>
>>>>> The only other way to solve this problem as of now is to use the Hive JDBC API.
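A rough sketch of that "compact after the batch insert completes" step, driven over Hive JDBC from the batch job itself (assumptions: the Hive JDBC driver is on the classpath, HiveServer2 runs at myhost:10000, and the database/table names are illustrative):

  import java.sql.DriverManager

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/mydb", "hive", "")
  val stmt = conn.createStatement()
  // Queues a major compaction; the Hive compactor worker threads run it asynchronously,
  // so the merged base directory shows up a little later.
  stmt.execute("ALTER TABLE mytable COMPACT 'major'")
  stmt.close()
  conn.close()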
>>>>>
>>>>> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>
>>>>>> Compaction would have been triggered automatically, since the following properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION* property has not been set on these tables.
>>>>>>
>>>>>>   <property>
>>>>>>     <name>hive.compactor.initiator.on</name>
>>>>>>     <value>true</value>
>>>>>>   </property>
>>>>>>
>>>>>>   <property>
>>>>>>     <name>hive.compactor.worker.threads</name>
>>>>>>     <value>1</value>
>>>>>>   </property>
>>>>>>
>>>>>> The documentation is confusing sometimes.
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob : +091 9990-447-339
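To check whether the compaction initiator has actually picked anything up, rather than relying on the hive-site.xml settings alone, something like the following should work (same assumptions as the earlier sketches; SHOW COMPACTIONS lists queued, running and recently completed compactions):

  import java.sql.DriverManager

  Class.forName("org.apache.hive.jdbc.HiveDriver")
  val conn = DriverManager.getConnection("jdbc:hive2://myhost:10000/default", "hive", "")
  val rs = conn.createStatement().executeQuery("SHOW COMPACTIONS")
  val cols = rs.getMetaData.getColumnCount
  while (rs.next()) {
    // Each row describes one compaction request: database, table, type (MAJOR/MINOR), state, ...
    println((1 to cols).map(i => rs.getString(i)).mkString("\t"))
  }
  conn.close()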
>>>>>>
>>>>>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, I was burned by this issue a couple of weeks back. It also means that after every insert job, a compaction should be run so that the new rows become accessible from Spark. Sad that this issue is not documented / mentioned anywhere.
>>>>>>>
>>>>>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Varadharajan,
>>>>>>>>
>>>>>>>> Thanks for your response.
>>>>>>>>
>>>>>>>> Yes, it is a transactional table; see *show create table* below.
>>>>>>>>
>>>>>>>> The table has hardly 3 records, and after triggering a minor compaction on the table, it started showing results in Spark SQL.
>>>>>>>>
>>>>>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>>>>>
>>>>>>>> > *show create table hivespark;*
>>>>>>>> CREATE TABLE `hivespark`(
>>>>>>>>   `id` int,
>>>>>>>>   `name` string)
>>>>>>>> CLUSTERED BY (
>>>>>>>>   id)
>>>>>>>> INTO 32 BUCKETS
>>>>>>>> ROW FORMAT SERDE
>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>>>>> STORED AS INPUTFORMAT
>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>>>>> OUTPUTFORMAT
>>>>>>>>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>>>>> LOCATION
>>>>>>>>   'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>>>>> TBLPROPERTIES (
>>>>>>>>   'COLUMN_STATS_ACCURATE'='true',
>>>>>>>>   'last_modified_by'='root',
>>>>>>>>   'last_modified_time'='1455859079',
>>>>>>>>   'numFiles'='37',
>>>>>>>>   'numRows'='3',
>>>>>>>>   'rawDataSize'='0',
>>>>>>>>   'totalSize'='11383',
>>>>>>>>   'transactional'='true',
>>>>>>>>   'transient_lastDdlTime'='1455864121');
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sanjiv Singh
>>>>>>>> Mob : +091 9990-447-339
>>>>>>>>
>>>>>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <srinath...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Is the transactional attribute set on your table? I observed that Hive's transactional storage structure does not work with Spark yet. You can confirm this by looking at the transactional attribute in the output of "desc extended <tablename>" in the hive console.
>>>>>>>>>
>>>>>>>>> If you need to access a transactional table, consider doing a major compaction first and then try accessing the table.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I have observed that Spark SQL is not returning records for Hive bucketed ORC tables on HDP.
>>>>>>>>>>
>>>>>>>>>> In Spark SQL I am able to list all the tables, but queries on the Hive bucketed tables do not return any records.
>>>>>>>>>>
>>>>>>>>>> I have also tried the same with non-bucketed Hive tables, and that works fine.
>>>>>>>>>>
>>>>>>>>>> The same works on a plain Apache setup.
>>>>>>>>>>
>>>>>>>>>> Let me know if you need any other details.
>>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Sanjiv Singh
>>>>>>>>>> Mob : +091 9990-447-339
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                    -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in