Hi, I'm new to Hive, and I recently need to do some data warehousing
work with it.
The business data lives in a SQL Server database, and I need to extract
the data from its tables first into HDFS and then into Hive.

The current design is like this: I first use Apache NiFi to extract all
data from the SQL Server tables and write the extracted data into HDFS.
I need a layer of tables (the JSON data layer) whose data is extracted
by day, so the data is written into HDFS folders like
/user/hive/table1/20170410/ (this folder contains all the data
extracted on 20170410). From these folders I can create external tables
partitioned by the extraction day, roughly as sketched below.
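
Something along these lines is what I have in mind for the JSON layer
(table and column names here are just placeholders, and I'm assuming
the JSON SerDe that ships with HCatalog):

CREATE EXTERNAL TABLE json_table1 (
  id BIGINT,
  applicationTime STRING
)
PARTITIONED BY (extraction_date STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '/user/hive/table1';

-- register the folder NiFi wrote for a given extraction day
ALTER TABLE json_table1 ADD IF NOT EXISTS
  PARTITION (extraction_date = '20170410')
  LOCATION '/user/hive/table1/20170410';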

Then I will have two partition levels for the next ODS layer: the first
partition is the extraction day (the date string of when the data was
extracted, e.g. 20170410), and the second is the business day, whose
value comes from a specific business time field of the SQL Server
tables (e.g. table1 has a field called applicationTime that records
when a record was last updated; the business day is the date string of
that value, e.g. 20160413).
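
So the ODS table for table1 would look something like this (again,
placeholder names; ORC is just my guess for the storage format):

CREATE TABLE ods_table1 (
  id BIGINT,
  applicationTime STRING
)
PARTITIONED BY (extraction_date STRING, business_date STRING)
STORED AS ORC;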

Then, whenever an extraction runs, I will use Hive's INSERT OVERWRITE
to replace all data of the tables under the current extraction day
partition (this erases the second-level business date partitions for
that day and refills them with the new data, again spread across a
bunch of business date partitions).
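
The overwrite step would then be roughly like this (assuming dynamic
partitioning for the business date, and that applicationTime is a
standard 'yyyy-MM-dd HH:mm:ss' timestamp string):

SET hive.exec.dynamic.partition = true;

-- replace everything under the current extraction day; the
-- business_date partitions are derived dynamically from applicationTime
INSERT OVERWRITE TABLE ods_table1
  PARTITION (extraction_date = '20170410', business_date)
SELECT
  id,
  applicationTime,
  date_format(applicationTime, 'yyyyMMdd') AS business_date
FROM json_table1
WHERE extraction_date = '20170410';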

I have talked to one of my friends about this ETL design, but he thinks
the partition levels in Hive should be logically related, e.g. the
first-level partition is day and the second-level partition is hour,
whereas the two partition levels of the ODS layer in my design are
unrelated. However, I need the extraction date partition to gather all
the data extracted each day, and I also need the business date
partition so that I can query the tables by a specific business date.
He also said that two unrelated partition levels are error prone and
won't improve query performance. So I'm wondering whether what he said
is true.
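
For example, this is the kind of query I want the business date
partition to speed up (placeholder names again); my understanding is
that Hive would prune down to the matching business_date
sub-partitions:

SELECT *
FROM ods_table1
WHERE business_date = '20160413';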
