Re: Kylin streaming questions

Xiaoxiang Yu Wed, 26 Jun 2019 02:40:28 -0700

Hi Andras, Shaofeng,
  I will update this information as soon as possible.
  Regarding the segment overlap problem, I ran a test in my environment and 
everything appears to work well. Since the segment ranges created by Kylin's 
streaming coordinator look like "201906290000_201906290100", if you want to 
build a segment, I think you should either use a range that exactly matches an 
existing segment (such as "201906290000_201906290100") or a range that merges 
several existing segments (such as "201906290100_201906290300").
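
To make the range rule above concrete, here is a minimal Python sketch. It 
assumes segment names follow the "yyyyMMddHHmm_yyyyMMddHHmm" pattern shown in 
the examples; the helper names parse_range and is_buildable are hypothetical, 
and this is not Kylin's own validation logic, only an illustration of "exact 
match or merge of adjacent existing segments".

```python
from datetime import datetime

FMT = "%Y%m%d%H%M"  # assumed format, taken from the examples in this mail


def parse_range(name):
    """Split 'yyyyMMddHHmm_yyyyMMddHHmm' into (start, end) datetimes."""
    start, end = name.split("_")
    return datetime.strptime(start, FMT), datetime.strptime(end, FMT)


def is_buildable(requested, existing):
    """Return True if `requested` equals one existing segment range or
    exactly covers a contiguous run of existing segment ranges."""
    req_start, req_end = parse_range(requested)
    if req_start >= req_end:
        return False
    cursor = req_start
    for seg_start, seg_end in sorted(parse_range(s) for s in existing):
        if seg_end <= req_start or seg_start >= req_end:
            continue  # segment lies entirely outside the requested range
        if seg_start != cursor or seg_end > req_end:
            return False  # partial overlap with a segment, or a gap
        cursor = seg_end
    return cursor == req_end


existing = [
    "201906290000_201906290100",
    "201906290100_201906290200",
    "201906290200_201906290300",
]
print(is_buildable("201906290000_201906290100", existing))  # exact match
print(is_buildable("201906290100_201906290300", existing))  # merge of two
print(is_buildable("201906290030_201906290130", existing))  # cuts segments
```

The first two calls describe ranges aligned with the coordinator's segments; 
the third cuts through two of them, which is the kind of range to avoid.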

-----------------
Best wishes to you ! 
From ：Xiaoxiang Yu

At 2019-06-26 12:00:38, "ShaoFeng Shi" <[email protected]> wrote:

Hi Xiaoxiang,

Thank you for the detailed information. Could you please record these 
limitations as JIRA issues (if not yet)? Thanks.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]

Xiaoxiang Yu <[email protected]> wrote on Tue, Jun 25, 2019 at 11:42 PM:

Hi Andras,
    I am glad to see that you have a strong understanding of Kylin's 
Realtime OLAP. Most of your conclusions are correct; here is my understanding:
    1)  Currently there is no documentation on how to use lambda mode. We will 
publish some after the 3.0.0-beta release (maybe this weekend or a week 
later?).
    2)  The Hive table must have the same name as the streaming table and 
should be located in Hive's "default" namespace. The column names should match 
exactly and the data types should be compatible.
    3)  If you want to build a segment from data in Hive, you have to build it 
through the REST API.
    4)  The cube build engine must be MapReduce; Spark is not supported at the 
moment.
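
For point 3, a build request could be prepared roughly as follows. This is a 
sketch under assumptions: it uses the cube build endpoint from Kylin's general 
REST API documentation (PUT /kylin/api/cubes/{cubeName}/build with startTime, 
endTime and buildType), and assumes lambda-mode cubes accept the same call. 
The host, cube name and credentials are placeholders, and the helper names are 
hypothetical.

```python
import base64
import json
import urllib.request
from datetime import datetime, timezone


def to_millis(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


def make_build_request(base_url, cube, start, end, user, password):
    """Prepare (but do not send) a cube build request.

    Assumption: the standard build endpoint also applies to cubes in
    lambda mode; check this against your Kylin version.
    """
    body = json.dumps({
        "startTime": to_millis(start),
        "endTime": to_millis(end),
        "buildType": "BUILD",
    }).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/kylin/api/cubes/{cube}/build",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder host, cube name and credentials, for illustration only.
request = make_build_request(
    "http://localhost:7070",
    "my_lambda_cube",
    datetime(2019, 6, 29, 0, 0, tzinfo=timezone.utc),
    datetime(2019, 6, 29, 1, 0, tzinfo=timezone.utc),
    "some_user",
    "some_password",
)
print(request.get_method(), request.full_url)
# To actually submit it: urllib.request.urlopen(request)
```

The start and end times here are aligned to an hourly segment boundary in UTC; 
adjust them to the segment ranges your streaming coordinator actually creates.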

-----------------
Best wishes to you ! 
From ：Xiaoxiang Yu

At 2019-06-25 17:20:55, "Andras Nagy" <[email protected]> wrote:

Hi ShaoFeng,

Thanks a lot for the pointer on the lambda mode, yes, that's exactly what I 
need :)

Is there perhaps documentation on this? For now, I was trying to get this 
working 'empirically' and finally succeeded, but some of my conclusions may be 
wrong. This is what I concluded:

- hive table must have the same name as the streaming table (name given to the 
data source)
- cube can't be built from UI (to build the historic segments from the data in 
hive), but it can be built using the REST API
- cube build engine must be mapreduce. For Spark as build engine I got 
exception "Cannot adapt to interface org.apache.kylin.engine.spark.ISparkOutput"
- endTime must be non-overlapping with the streaming data. When I had overlap, 
the streaming data coming from kafka did not show up in the output, I guess 
this is what you meant by "the segments from Hive will overwrite the segments 
from Kafka".

Are these correct conclusions? Is there anything else I should be aware of?

Many thanks,
Andras

On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <[email protected]> wrote:

Hello Andras,

Kylin's realtime-OLAP feature supports a "Lambda" mode (mentioned in 
https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which means 
you can define a fact table whose data comes from both Kafka and Hive. The 
only requirement is that all the cube columns appear in both the Kafka data 
and the Hive data. I think that may fit your need. The cube can be built from 
Kafka and, at the same time, from Hive; the segments from Hive will overwrite 
the segments from Kafka (as Hive data is usually more accurate). When querying 
the cube, Kylin first queries the historical segments and then the real-time 
segments (adding the max time of the historical segments as a condition).
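
The query behaviour described above can be pictured with a small toy model in 
Python. This is not Kylin code: the function name, the data layout, and the 
use of plain hour numbers as timestamps are all assumptions made for 
illustration, and whether the boundary comparison is inclusive is also assumed.

```python
def query_lambda_cube(historical_segments, realtime_rows):
    """Toy model of a lambda-mode query.

    historical_segments: list of (start, end, rows) built from Hive,
        where rows is a list of (event_time, value) tuples.
    realtime_rows: list of (event_time, value) tuples built from Kafka.
    """
    # The latest end time covered by any historical segment.
    hist_end = max((end for _, end, _ in historical_segments), default=None)

    # Historical segments are read first, unfiltered.
    result = [row for _, _, rows in historical_segments for row in rows]

    # Real-time data is only read from the end of the historical data on,
    # so Kafka rows inside the historical range are effectively overwritten.
    for event_time, value in realtime_rows:
        if hist_end is None or event_time >= hist_end:
            result.append((event_time, value))
    return result


historical = [(0, 2, [(0, "hive-a"), (1, "hive-b")])]
realtime = [(1, "kafka-b"), (2, "kafka-c")]
print(query_lambda_cube(historical, realtime))
```

In this model the Kafka row at time 1 is dropped because the Hive segment 
already covers it, which matches the statement that segments from Hive 
overwrite segments from Kafka.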

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]

Andras Nagy <[email protected]> wrote on Mon, Jun 24, 2019 at 11:29 PM:

Dear Ma,

Thanks for your reply.

Slightly related to my original question on the hybrid model, I was wondering 
if it's possible to combine a batch and a streaming cube. I realized this is 
not possible, as a hybrid model can only be created from cubes of the same 
model (and a model points to either a batch or a streaming datasource).

The use case would be this:
- we have a large amount of streaming data in Kafka that we would like to 
process with Kylin streaming
- Kafka retention is only a few days, so if we need to change anything in the 
cubes (e.g. introduce a new metric or dimension which has been present in the 
events, but not in the cube definition), we can only reprocess a few days' 
worth of data in the streaming model
- the raw events are also written to a data lake for long-term storage
- the data written to the data lake could be used to feed the historic data 
into a batch kylin model (and cubes)
- I'm looking for a way to combine these, so if we want to change anything in 
the cubes, we can recalculate them for the historic data as well

Is there a way to achieve this with current Kylin? (Without implementing a 
custom query layer that combines the two cubes.)

Best regards,
Andras

On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <[email protected]> wrote:

Hi Andras,

Currently, consuming from specified offsets is not supported; only consuming 
from startOffset or latestOffset is. If you want to consume from startOffset, 
you need to set the configuration kylin.stream.consume.offsets.latest to false 
on the cube's overrides page.

If you do need to start from specified offsets, please create a JIRA request, 
but I think it is hard for users to know which offsets should be set for all 
partitions.

At 2019-06-13 22:34:59, "Andras Nagy" <[email protected]> wrote:

Dear Ma,

Thank you very much!

>1)yes, you can specify a configuration in the new cube, to consume data from 
>start offset
That is, an offset value for each partition of the topic? That would be good - 
could you please point me where to do this in practice, or point me to what I 
should read? (I haven't found it on the cube designer UI - perhaps this is 
something that's only available on the API?)

Many thanks,
Andras

On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <[email protected]> wrote:

Hi Andras,
1) Yes, you can specify a configuration in the new cube to consume data from 
the start offset.

2) It should work, but I haven't tested it yet.

3) As I remember, we currently use the Kafka 1.0 client library, so it is 
better to use that version or later. I'm sure that versions before 0.9.0 
cannot work, but I'm not sure whether 0.9.x works.

Ma Gang
Email: [email protected]

On 06/13/2019 18:01, Andras Nagy wrote:
Greetings,

I have a few questions related to the new streaming (real-time OLAP) 
implementation.

1) Is there a way to have data reprocessed from kafka? E.g. I change a cube 
definition and drop the cube (or add a new cube definition) and want to have 
data that is still available on kafka to be reprocessed to build the changed 
cube (or new cube)? Is this possible?

2) Does the hybrid model work with streaming cubes (to combine two cubes)?

3) What is minimum kafka version required? The tutorial asks to install Kafka 
1.0, is this the minimum required version?

Thank you very much,
Andras
