Hi Edward,
Thanks for raising this discussion. Reading data from an RDBMS is tricky, and
we need a very clear design and architecture before implementing it.
There was one thread/JIRA about reading data from Oracle directly, but it
was eventually dropped because many tools can already handle this by
extracting data from Oracle and loading it into Hive.
The concern here is that most RDBMSs are not yet optimized for direct reads
by a distributed system, for example hundreds of Hadoop nodes reading data
from MySQL, Oracle, or others directly. The network is a concern as well.
From the beginning, we decided to use Hive as the protocol between
upstream and Kylin. This has been a good model so far, since users can
leverage any ETL tool to land source data in Hive and then build cubes
on it. Even if Kylin supported reading data from an RDBMS, what about
transform? What about load? It would bring the ETL parts into Kylin's
scope, which is not a good idea, I think.
But reading from an RDBMS is a valid way to extend the input sources beyond
Hive, and not only RDBMSs but also SparkSQL, Impala, Drill, and other
SQL-on-Hadoop engines.
How about building a light tool for this requirement? It could be an
extension tool for users to leverage.
Thanks.
Luke
Best Regards!
---------------------
Luke Han
On Wed, Feb 3, 2016 at 9:45 AM, Edward Zhang <[email protected]>
wrote:
> Hi Kylin Community,
>
> I had a discussion with Shaofeng (@Shaofengshi) on JIRA KYLIN-1351 about
> making Kylin support RDBMS as a data source. But we want to get more input
> from the community to see how important and urgent this feature is. Please
> do respond and provide your suggestions if you are in need of this feature
> or are interested in developing it.
>
> Though Kylin today supports pluggable data sources, this RDBMS feature is
> not trivial in that we need to take care of the following problems.
>
> 1. Independent dictionary especially for data type mapping.
> Hive has a different data type system from RDBMS. Kylin dictionary should
> infer column type from Hive schema today, but we need to make sure the
> dictionary is independent of the data source so that RDBMS schema can be
> stored in the Kylin dictionary.
>
> 2. Pipeline
> Do we import data from RDBMS to Hive or directly read data from RDBMS?
> If the destination is Hive, we may reuse the current Hive MR cubing job, but
> we need to take care of the RDBMS-to-Hive conversion.
> If Kylin directly reads data from RDBMS, we need to write a new MR or Spark
> job.
>
> 3. Consistency
> Normally RDBMS supports data insert/update/delete, how does Kylin handle
> that?
>
> 4. Read continuously
> Do we require that RDBMS fact table always has a timestamp field which
> Kylin uses for reading records continuously?
>
> 5. Cube modeling
> Is current cube modeling feature independent enough to support RDBMS
> modeling?
>
> 6. Sharding
> Normally RDBMS can support complicated join queries across multiple tables.
> Here the reason to use Kylin is probably that the source table is sharded
> into many child tables, and Kylin can query across all the shards at once
> after the data is imported into Kylin.
>
> Thanks
> Edward
>