Hi, I am testing real-time data streaming from an Oracle schema table to a Hive table via SAP Replication Server (SRS).
This works fine and SRS uses DIRECT LOAD to ingest data into the Hive table. By default the batch load into Hive is set to 10K rows, as shown below:

-- Direct Load Materialization Configuration Parameters --
-- Parameter "mat_load_tran_size", Default: 10000, specifies the optimal transaction size or batch size for the initial copying of primary data to the replicate table during direct load materialization. --

A sample bulk load of updates to table "t" is shown below; it runs in batches of 10K rows:

T. 2017/03/26 18:25:33. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:25:33. (136): 'Bulk update table 'rs_ut_136_1' (10000 rows affected)'
T. 2017/03/26 18:25:49. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:25:49. (136): 'Bulk update table 'rs_ut_136_1' (10000 rows affected)'
T. 2017/03/26 18:26:06. (136): Command sent to 'hiveserver2.asehadoop':
T. 2017/03/26 18:26:06. (136): 'Bulk update table 'rs_ut_136_1' (10000 rows affected)'

Now the issue is that each bulk load creates a new file under the table directory, like below:

-rwxr-xr-x 2 hduser supergroup 11128103 2017-03-26 16:01 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1539
-rwxr-xr-x 2 hduser supergroup     2245 2017-03-25 23:37 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_154
-rwxr-xr-x 2 hduser supergroup 11128103 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1540
-rwxr-xr-x 2 hduser supergroup 11128118 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1541
-rwxr-xr-x 2 hduser supergroup 11128118 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1542
-rwxr-xr-x 2 hduser supergroup 11128093 2017-03-26 16:02 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1543
-rwxr-xr-x 2 hduser supergroup 11128093 2017-03-26 16:03 /user/hive/warehouse/asehadoop.db/t/000000_0_copy_1544

So I was wondering: what are the best ways of compacting these files? And is there any detriment when the number of these files grows very high, say into the thousands? I have put a rough sketch of the sort of compaction I had in mind at the end of this mail.

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
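P.S. For reference, this is the sort of compaction I had in mind. It is only a rough sketch: it assumes the target table is asehadoop.t (the table shown in the listing above), the CONCATENATE route only applies if the table is stored as ORC or RCFile, and the merge sizes are illustrative values rather than anything I have tuned:

-- Option 1: rewrite the table in place and let Hive merge the small files.
-- Replication into the table would need to be paused while this runs,
-- otherwise rows loaded in the meantime could be lost by the overwrite.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=134217728;   -- trigger a merge pass when average output file size < ~128 MB
SET hive.merge.size.per.task=268435456;        -- aim for ~256 MB merged files

INSERT OVERWRITE TABLE asehadoop.t
SELECT * FROM asehadoop.t;

-- Option 2: if the table is stored as ORC (or RCFile), the small files can be
-- concatenated without a full rewrite of the data.
ALTER TABLE asehadoop.t CONCATENATE;

Either way I assume something like this would have to be scheduled periodically, since SRS keeps producing new files as it loads.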