Hi Varadharajan,

Could you elaborate on this statement (quoted from your previous mail):
"I observed that Hive's transactional storage structure does not work with
Spark yet"


If this refers to the delta files created after each transaction, and Spark
not being able to recognize them: I have a table *mytable* (ORC, bucketed,
non-sorted) on which I have already done lots of inserts, updates, and
deletes. I can see the delta files created in HDFS (listing below), yet I am
still able to fetch consistent records through both Spark JDBC and Hive JDBC.

No compaction has been triggered for that table.
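
For completeness, this is how I verified that no compaction ran (a sketch;
SHOW COMPACTIONS is available in Hive 0.13+ and lists compaction requests
across the warehouse):

    -- mytable does not appear in this output, so no compaction was triggered:
    SHOW COMPACTIONS;

    -- A compaction can also be requested manually when needed:
    -- ALTER TABLE mytable COMPACT 'minor';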

> *hadoop fs -ls /apps/hive/warehouse/mydb.db/mytable*

drwxrwxrwx   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/base_0000060
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000061_0000061
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000062_0000062
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000063_0000063
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000064_0000064
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000065_0000065
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000066_0000066
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000067_0000067
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000068_0000068
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000069_0000069
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000070_0000070
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000071_0000071
drwxr-xr-x   - root hdfs          0 2016-02-23 11:38
/apps/hive/warehouse/mydb.db/mytable/delta_0000072_0000072
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000073_0000073
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000074_0000074
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000075_0000075
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000076_0000076
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000077_0000077
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000078_0000078
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000079_0000079
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000080_0000080
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000081_0000081
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000082_0000082
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000083_0000083
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000084_0000084
drwxr-xr-x   - root hdfs          0 2016-02-23 11:39
/apps/hive/warehouse/mydb.db/mytable/delta_0000085_0000085
drwxr-xr-x   - root hdfs          0 2016-02-23 11:40
/apps/hive/warehouse/mydb.db/mytable/delta_0000086_0000086
drwxr-xr-x   - root hdfs          0 2016-02-23 11:41
/apps/hive/warehouse/mydb.db/mytable/delta_0000087_0000087



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Mon, Feb 22, 2016 at 1:38 PM, Varadharajan Mukundan <srinath...@gmail.com
> wrote:

> Actually, auto compaction (if enabled) is triggered based on the volume of
> changes; it doesn't automatically run after every insert. I think it's
> possible to reduce the thresholds, but that might reduce performance by a
> big margin. As of now, we run compaction after the batch insert completes.
>
> The only other way to solve this problem as of now is to use Hive JDBC API.
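>
> The thresholds I mean are these (a sketch; the property names come from the
> Hive compactor configuration, and the values shown are the usual defaults,
> which may vary by version):
>
>     <!-- Compact once a table/partition accumulates this many delta dirs -->
>     <property>
>       <name>hive.compactor.delta.num.threshold</name>
>       <value>10</value>
>     </property>
>
>     <!-- Or once delta data exceeds this fraction of the base size -->
>     <property>
>       <name>hive.compactor.delta.pct.threshold</name>
>       <value>0.1</value>
>     </property>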
>
> On Mon, Feb 22, 2016 at 11:39 AM, @Sanjiv Singh <sanjiv.is...@gmail.com>
> wrote:
>
>> Compaction should have been triggered automatically, as the following
>> properties are already set in *hive-site.xml*, and the *NO_AUTO_COMPACTION*
>> property has not been set on these tables.
>>
>>
>>     <property>
>>
>>       <name>hive.compactor.initiator.on</name>
>>
>>       <value>true</value>
>>
>>     </property>
>>
>>     <property>
>>
>>       <name>hive.compactor.worker.threads</name>
>>
>>       <value>1</value>
>>
>>     </property>
>>
>>
>> The documentation lets you down sometimes.
>>
>>
>>
>>
>> Regards
>> Sanjiv Singh
>> Mob :  +091 9990-447-339
>>
>> On Mon, Feb 22, 2016 at 9:49 AM, Varadharajan Mukundan <
>> srinath...@gmail.com> wrote:
>>
>>> Yes, I was burned by this issue a couple of weeks back. This also
>>> means that after every insert job, a compaction should be run to make the
>>> new rows visible to Spark. Sadly, this issue is not documented / mentioned
>>> anywhere.
>>>
>>> On Mon, Feb 22, 2016 at 9:27 AM, @Sanjiv Singh <sanjiv.is...@gmail.com>
>>> wrote:
>>>
>>>> Hi Varadharajan,
>>>>
>>>> Thanks for your response.
>>>>
>>>> Yes, it is a transactional table; see the *show create table* output below.
>>>>
>>>> The table has barely 3 records, and after triggering a major compaction on
>>>> it, it started showing results in Spark SQL.
>>>>
>>>>
>>>> > *ALTER TABLE hivespark COMPACT 'major';*
>>>>
>>>>
>>>> > *show create table hivespark;*
>>>>
>>>>   CREATE TABLE `hivespark`(
>>>>
>>>>     `id` int,
>>>>
>>>>     `name` string)
>>>>
>>>>   CLUSTERED BY (
>>>>
>>>>     id)
>>>>
>>>>   INTO 32 BUCKETS
>>>>
>>>>   ROW FORMAT SERDE
>>>>
>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
>>>>
>>>>   STORED AS INPUTFORMAT
>>>>
>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
>>>>
>>>>   OUTPUTFORMAT
>>>>
>>>>     'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
>>>>
>>>>   LOCATION
>>>>
>>>>     'hdfs://myhost:8020/apps/hive/warehouse/mydb.db/hivespark'
>>>>   TBLPROPERTIES (
>>>>
>>>>     'COLUMN_STATS_ACCURATE'='true',
>>>>
>>>>     'last_modified_by'='root',
>>>>
>>>>     'last_modified_time'='1455859079',
>>>>
>>>>     'numFiles'='37',
>>>>
>>>>     'numRows'='3',
>>>>
>>>>     'rawDataSize'='0',
>>>>
>>>>     'totalSize'='11383',
>>>>
>>>>     'transactional'='true',
>>>>
>>>>     'transient_lastDdlTime'='1455864121') ;
>>>>
>>>>
>>>> Regards
>>>> Sanjiv Singh
>>>> Mob :  +091 9990-447-339
>>>>
>>>> On Mon, Feb 22, 2016 at 9:01 AM, Varadharajan Mukundan <
>>>> srinath...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Is the transactional attribute set on your table? I observed that Hive's
>>>>> transactional storage structure does not work with Spark yet. You can
>>>>> confirm this by looking at the transactional attribute in the output of
>>>>> "desc extended <tablename>" in the Hive console.
>>>>>
>>>>> If you need to access a transactional table, consider doing a major
>>>>> compaction first and then try accessing the table.
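>>>>>
>>>>> For example (a sketch, using a hypothetical table name):
>>>>>
>>>>>     -- In the Hive console: the table is transactional if the table
>>>>>     -- parameters in the output include 'transactional'='true'.
>>>>>     desc extended mytable;
>>>>>
>>>>>     -- A major compaction rewrites the base and delta files into a
>>>>>     -- new base directory, which Spark can then read:
>>>>>     ALTER TABLE mytable COMPACT 'major';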
>>>>>
>>>>> On Mon, Feb 22, 2016 at 8:57 AM, @Sanjiv Singh <sanjiv.is...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I have observed that Spark SQL does not return records for Hive
>>>>>> bucketed ORC tables on HDP.
>>>>>>
>>>>>> In Spark SQL, I am able to list all tables, but queries on Hive
>>>>>> bucketed tables return no records.
>>>>>>
>>>>>> I have also tried the same with non-bucketed Hive tables; they work fine.
>>>>>>
>>>>>> The same setup works fine on a plain Apache installation.
>>>>>>
>>>>>> Let me know if you need any other details.
>>>>>>
>>>>>> Regards
>>>>>> Sanjiv Singh
>>>>>> Mob :  +091 9990-447-339
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Thanks,
>>>>> M. Varadharajan
>>>>>
>>>>> ------------------------------------------------
>>>>>
>>>>> "Experience is what you get when you didn't get what you wanted"
>>>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>>>
>>>>> My Journal :- http://varadharajan.in
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Thanks,
>>> M. Varadharajan
>>>
>>> ------------------------------------------------
>>>
>>> "Experience is what you get when you didn't get what you wanted"
>>>                -By Prof. Randy Pausch in "The Last Lecture"
>>>
>>> My Journal :- http://varadharajan.in
>>>
>>
>>
>
>
> --
> Thanks,
> M. Varadharajan
>
> ------------------------------------------------
>
> "Experience is what you get when you didn't get what you wanted"
>                -By Prof. Randy Pausch in "The Last Lecture"
>
> My Journal :- http://varadharajan.in
>
