Hi, I am testing real-time data streaming from an Oracle schema table to a Hive table via SAP Replication Server (SRS).
This works fine and SRS uses DIRECT LOAD to ingest data into the Hive table. By default the batch load into Hive is set to 10K rows, as shown below:

-- Direct Load Materialization Configuration Parameters --
-- Parameter "mat_load_tran_size", Default: 10000, specifies the optimal transaction size or batch size for the initial copying of primary data to the replicate table during direct load materialization. --

A sample bulk load of updates to table "t" is shown below; it runs in batches of 10K rows:

T. 2017/03/26 18:25:33. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:25:33. (136): 'Bulk update table 'rs_ut_136_1' (10000 rows affected)'
T. 2017/03/26 18:25:49. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:25:49. (136): 'Bulk update table 'rs_ut_136_1' (10000 rows affected)'
T. 2017/03/26 18:26:06. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:26:06. (136): 'Bulk update table 'rs_ut_136_1' (10000 rows affected)'

Now the issue is that each bulk load creates a new file under the table directory, like below:

-rwxr-xr-x 2 hduser supergroup 11128103 2017-03-26 16:01 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1539
-rwxr-xr-x 2 hduser supergroup     2245 2017-03-25 23:37 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_154
-rwxr-xr-x 2 hduser supergroup 11128103 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1540
-rwxr-xr-x 2 hduser supergroup 11128118 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1541
-rwxr-xr-x 2 hduser supergroup 11128118 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1542
-rwxr-xr-x 2 hduser supergroup 11128093 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1543
-rwxr-xr-x 2 hduser supergroup 11128093 2017-03-26 16:03 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1544

So I was wondering: what are the best ways of compacting these files? And is there any detriment when the number of these files grows very high, say into the thousands? I have put a rough sketch of the sort of compaction I had in mind at the end of this mail.

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
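P.S. For reference, this is the sort of compaction I had in mind. It is only a rough sketch: it assumes the target table is asehadoop.t (the table shown in the listing above), the CONCATENATE route only applies if the table is stored as ORC or RCFile, and the merge sizes are illustrative values rather than anything I have tuned:

-- Option 1: rewrite the table in place and let Hive merge the small files.
-- Replication into the table would need to be paused while this runs,
-- otherwise rows loaded in the meantime could be lost by the overwrite.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;   -- trigger a merge pass when average output file size < ~128 MB
SET hive.merge.size.per.task=268435456;        -- aim for ~256 MB merged files

INSERT OVERWRITE TABLE asehadoop.t
SELECT * FROM asehadoop.t;

-- Option 2: if the table is stored as ORC (or RCFile), the small files can be
-- concatenated without a full rewrite of the data.
ALTER TABLE asehadoop.t CONCATENATE;

Either way I assume something like this would have to be scheduled periodically, since SRS keeps producing new files as it loads.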