Re: [Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Sudhir . Kumar Wed, 03 Feb 2016 01:17:18 -0800

Hello Edward,

One of the big advantages of Kylin talking to an RDBMS would be in building a 
unified data architecture. But how would data blending from multiple sources be 
done in Kylin? The advantage of RDBMS support would come only if data blending 
is enabled in Kylin. Also, as Luke mentions, there are tools available that can 
handle ETL to Hive. And as a data strategy, organizations would like to 
eventually keep their data in HDFS.


In my opinion, reading from an RDBMS would be a good-to-have feature and does 
not seem to be urgent.

Thanks,

Sudhir

"We must accept finite disappointment, but never lose infinite hope." - Martin 
Luther King Jr.


From: Luke Han <[email protected]>
Reply-To: [email protected]
Date: Wednesday, February 3, 2016 at 8:32 AM
To: [email protected]
Cc: [email protected]
Subject: Re: [Discuss] KYLIN-1351 Kylin to support RDBMS as data source

Hi Edward,
     Thanks for raising this discussion. Reading data from RDBMSs is tricky, 
and we have to come up with a very clear design and architecture before 
implementing it.

     There was one thread/JIRA about reading data from Oracle directly, but it 
was finally dropped since there are already many tools that can handle this: 
extracting data from Oracle and loading it into Hive.

     The concern here is that most RDBMSs are not yet optimized for distributed 
systems to read from directly, for example, hundreds of Hadoop nodes reading 
data from MySQL, Oracle or others at the same time. The network is also a 
concern.

     From the beginning, we decided to use Hive as the protocol between 
upstream and Kylin. This has been a good model so far, since users can leverage 
any ETL tool to do this job: landing source data in Hive and then building 
cubes on it. Even if Kylin supported reading data from RDBMSs, what about 
transform? What about load? It would bring ETL into Kylin's scope, which is not 
a good idea, I think.

      But reading from RDBMSs is a valid way to extend the input sources beyond 
Hive today: not only RDBMSs but also SparkSQL, Impala, Drill and other 
SQL-on-Hadoop engines.
      How about building a light tool for this requirement? It could be an 
extension tool for users to leverage.

      Thanks.
Luke





Best Regards!
---------------------

Luke Han

On Wed, Feb 3, 2016 at 9:45 AM, Edward Zhang <[email protected]> wrote:
Hi Kylin Community,

I had a discussion with Shaofeng (@Shaofengshi) on JIRA KYLIN-1351 about making
Kylin support RDBMS as a data source. But we want to get more input from the
community to see how important and urgent this feature is. Please do respond
and provide your suggestions if you need this feature or are interested in
developing it.

Though Kylin today supports pluggable data sources, this RDBMS feature is not
trivial in that we need to take care of the following problems.

1. Independent dictionary especially for data type mapping.
Hive has a different data type system from RDBMS. The Kylin dictionary infers
column types from the Hive schema today, but we need to make sure the
dictionary is independent of the data source so that an RDBMS schema can be
stored in the Kylin dictionary.

2. Pipeline
Do we import data from RDBMS to Hive, or read data directly from RDBMS?
If the destination is Hive, we may reuse the current Hive MR cubing job, but we
need to take care of the RDBMS-to-Hive conversion.
If Kylin reads data directly from RDBMS, we need to write a new MR or Spark
job.

3. Consistency
Normally an RDBMS supports data insert/update/delete. How does Kylin handle
that?

4. Read continuously
Do we require that the RDBMS fact table always has a timestamp field which
Kylin uses for reading records continuously?

5. Cube modeling
Is the current cube modeling feature independent enough to support RDBMS
modeling?

6. Sharding
Normally an RDBMS can support complicated join queries across multiple tables.
Here the reason to use Kylin is probably that the source table is sharded
into many child tables, and Kylin can query across all the shards at once
after the data is imported into Kylin.
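
To make point 4 concrete, here is a minimal sketch of timestamp-based
continuous reading. It is an illustration only, not Kylin code: the table
sales_fact, the column event_ts and the function read_increment are all
hypothetical names, and an in-memory SQLite database stands in for the RDBMS.

```python
import sqlite3


def read_increment(conn, last_seen_ts):
    """Fetch rows newer than last_seen_ts and return them with the new watermark.

    Assumes (hypothetically) a fact table `sales_fact` with an integer
    `event_ts` column that only grows for newly inserted rows.
    """
    rows = conn.execute(
        "SELECT id, amount, event_ts FROM sales_fact "
        "WHERE event_ts > ? ORDER BY event_ts",
        (last_seen_ts,),
    ).fetchall()
    # Advance the watermark to the largest timestamp read, or keep the old one.
    new_watermark = rows[-1][2] if rows else last_seen_ts
    return rows, new_watermark


# SQLite in memory stands in for the source RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (id INTEGER PRIMARY KEY, amount REAL, event_ts INTEGER)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(1, 10.0, 100), (2, 20.0, 200)],
)

# First read picks up everything after watermark 0.
first_batch, watermark = read_increment(conn, 0)

# A later insert is picked up by the next read, and nothing is read twice.
conn.execute("INSERT INTO sales_fact VALUES (3, 30.0, 300)")
second_batch, watermark2 = read_increment(conn, watermark)
```

Note that a filter like this only sees new rows. An update or delete on an
already-read row is missed unless the timestamp is also changed, which is
exactly the consistency question raised in point 3.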

Thanks
Edward
