Hi, I'm new to Hive and have recently been asked to do some data warehousing work with it. The business data lives in a SQL Server database, and I need to extract it from the tables in that database into HDFS first, and then into Hive.
The current design is as follows. I first use Apache NiFi to extract all the data from the SQL Server tables and write it into HDFS. I need a layer of tables (the JSON data layer) whose data is extracted by day, so the data is written into HDFS folders like /user/hive/table1/20170410/ (this folder contains all the data extracted on 20170410). From these folders I can create external tables partitioned by the extraction day.

The next layer, the ODS layer, then has two partition levels. The first is the extraction day (the date string of when the data was extracted, e.g. 20170410), and the second is the business day, whose value comes from a specific business time field in the SQL Server tables. For example, table1 has a field called applicationTime that records when a record was last updated; the business day is the date string of that value, e.g. 20160413. Whenever an extraction run starts, I use a Hive INSERT OVERWRITE statement to refresh all the data under the current extraction-day partition (this erases the second-level business-date partitions under it and refills them with the new data, producing a fresh set of business-date partitions). A sketch of the Hive DDL and load statement I have in mind is below.

I discussed this ETL design with a friend, but he thinks the partition levels in Hive should be logically related, e.g. the first level is day and the second is hour, whereas the two partition levels of my ODS layer are unrelated. However, I need the extraction-date partition to group all the data extracted on a given day, and I also need the business-date partition so that I can query the tables by a specific business date. He also said that two unrelated partition levels are error prone and won't improve query performance. So I'm wondering whether what he said is true.
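For reference, here is a minimal sketch of what I mean, assuming hypothetical table and column names (json_table1, ods_table1, id, payload) and a JSON SerDe for the raw layer; only applicationTime and the folder layout come from the actual setup described above.

```sql
-- JSON data layer: external table over the per-day extraction folders,
-- partitioned by the extraction day only.
CREATE EXTERNAL TABLE IF NOT EXISTS json_table1 (
  id STRING,
  applicationTime TIMESTAMP,
  payload STRING
)
PARTITIONED BY (extract_date STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/hive/table1';

-- Register a new day's folder as a partition after NiFi writes it.
ALTER TABLE json_table1
  ADD IF NOT EXISTS PARTITION (extract_date = '20170410')
  LOCATION '/user/hive/table1/20170410';

-- ODS layer: two partition levels, extraction day then business day.
CREATE TABLE IF NOT EXISTS ods_table1 (
  id STRING,
  applicationTime TIMESTAMP,
  payload STRING
)
PARTITIONED BY (extract_date STRING, business_date STRING)
STORED AS ORC;

-- Allow dynamic partitioning so business_date partitions are derived
-- from applicationTime at insert time (nonstrict is not strictly needed
-- here since extract_date is static, but it is commonly set).
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Refresh one extraction day: overwrites the business_date partitions
-- produced by this query under the given extract_date.
INSERT OVERWRITE TABLE ods_table1
PARTITION (extract_date = '20170410', business_date)
SELECT
  id,
  applicationTime,
  payload,
  date_format(applicationTime, 'yyyyMMdd') AS business_date
FROM json_table1
WHERE extract_date = '20170410';
```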