Hi Nitin,

I confirm that there is something odd here.

I did the following test:

create table test_orc (id int, name string, dept string) stored as ORC;
insert into table test_orc values (1, 'abc', 'xyz');
insert into table test_orc values (2, 'def', 'xyz');
insert into table test_orc values (3, 'pqr', 'xyz');
insert into table test_orc values (4, 'ghi', 'xyz');


I ended up with 4 files on hdfs:

000000_0
000000_0_copy_1
000000_0_copy_2
000000_0_copy_3
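
(The listing above is just what hdfs dfs -ls shows on the table's
directory; the path below assumes the default warehouse location, so
adjust it to your setup.)

hdfs dfs -ls /user/hive/warehouse/test_orc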


Then I renamed 000000_0_copy_2 to part-00000, and I still got COUNT(*) = 4
with Hive.
So this is not a file-name issue.

I then removed one of the files, and I got this:

> SELECT COUNT(1) FROM test_orc ;
+------+--+
| _c0  |
+------+--+
| 4    |
+------+--+

> SELECT * FROM test_orc ;
+--------------+----------------+----------------+--+
| test_orc.id  | test_orc.name  | test_orc.dept  |
+--------------+----------------+----------------+--+
| 1            | abc            | xyz            |
| 2            | def            | xyz            |
| 4            | ghi            | xyz            |
+--------------+----------------+----------------+--+
3 rows selected (0.162 seconds)

So my guess is that when Hive inserts data, it keeps the number of rows of
the table somewhere in the metastore (as a table statistic), and answers
COUNT(*) from that value instead of scanning the files.
However, if the files are modified by something other than Hive itself
(either manually or with Spark), you end up with an inconsistency.
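
If that is what happens, the stale value should be visible in the table
parameters. This is just a sketch (numRows is the basic-stats property Hive
stores; the exact output layout may differ between versions):

DESCRIBE FORMATTED test_orc;
-- under "Table Parameters", look for something like:
--   numRows     4
--   totalSize   ...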

So I guess we can call it a bug:

Hive should detect that the files changed and invalidate its pre-calculated
count.
Optionally, Spark should play nicely with Hive and update the count when
inserting.
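
In the meantime, the workaround should be either to recompute the
statistics so that they match the files again, or to stop Hive from
answering COUNT(*) from the stats. Both are standard Hive commands/settings,
but I have not re-run them against the broken table, so treat this as a
sketch:

-- refresh the basic stats stored in the metastore
ANALYZE TABLE test_orc COMPUTE STATISTICS;

-- or force COUNT(*) to actually scan the files
SET hive.compute.query.using.stats=false;
SELECT COUNT(*) FROM test_orc;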

I don't know if this bug has already been reported, and I tested on Hive
1.1.0, so perhaps it is already solved in later releases.

Regards,

Furcy


On Mon, Aug 22, 2016 at 9:34 AM, Nitin Kumar <nk94.nitinku...@gmail.com>
wrote:

> Hi!
>
> I've noticed that hive has problems registering new data records if the
> same table is written to using both the hive terminal and spark sql. The
> problem is demonstrated through the commands listed below:
>
> ====================================================================
> hive> use default;
> hive> create table test_orc (id int, name string, dept string) stored as
> ORC;
> hive> insert into table test_orc values (1, 'abc', 'xyz');
> hive> insert into table test_orc values (2, 'def', 'xyz');
> hive> select count(*) from test_orc;
> OK
> 2
> hive> select distinct(name) from test_orc;
> OK
> abc
> def
>
> *** files in hdfs path in warehouse for the created table ***
>
>
> [inline screenshot omitted]
>
> >>> data_points = [(3, 'pqr', 'xyz'), (4, 'ghi', 'xyz')]
> >>> column_names = ['identity_id', 'emp_name', 'dept_name']
> >>> data_df = sqlContext.createDataFrame(data_points, column_names)
> >>> data_df.show()
>
> +-----------+--------+---------+
> |identity_id|emp_name|dept_name|
> +-----------+--------+---------+
> |          3|     pqr|      xyz|
> |          4|     ghi|      xyz|
> +-----------+--------+---------+
>
> >>> data_df.registerTempTable('temp_table')
> >>> sqlContext.sql('insert into table default.test_orc select * from
> temp_table')
>
> *** files in hdfs path in warehouse for the created table ***
>
> [inline screenshot omitted]
> hive> select count(*) from test_orc; (Does not launch map-reduce job)
> OK
> 2
> hive> select distinct(name) from test_orc; (Launches map-reduce job)
> abc
> def
> ghi
> pqr
>
> hive> create table test_orc_new like test_orc stored as ORC;
> hive> insert into table test_orc_new select * from test_orc;
> hive> select count(*) from test_orc_new;
> OK
> 4
> ==================================================================
>
> Even if I restart the hive services I cannot get the proper count output
> from hive. This problem only occurs if the table is written to using both
> hive and spark. If only spark is used to insert records into the table
> multiple times, the count query in the hive terminal works perfectly fine.
>
> This problem occurs for tables stored with different storage formats as
> well (textFile etc.)
>
> Is this because of the different naming conventions used by hive and spark
> to write records to hdfs? Or maybe it is not a recommended practice to
> write tables using different services?
>
> Your thoughts and comments on this matter would be highly appreciated!
>
> Thanks!
> Nitin
>
>
>
