Hi Xiaoxiang,

Thank you for the detailed information. Could you please record these limitations as JIRA issues (if they are not already filed)? Thanks.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]


Xiaoxiang Yu <[email protected]> wrote on Tue, Jun 25, 2019 at 11:42 PM:

> Hi, Andras
>
> I am glad to see that you have a strong understanding of Kylin's
> real-time OLAP. Most of your conclusions are correct; the following is my
> understanding:
> 1) Currently there is no documentation that describes how to use lambda
> mode; we will publish one after the 3.0.0-beta release (maybe this weekend
> or a week later?).
> 2) The Hive table must have the same name as the streaming table, and it
> should be located in the "default" database of Hive. The column names must
> match exactly and the data types should be compatible.
> 3) If you want to build a segment whose data comes from Hive, you have to
> trigger the build through the REST API (see the sketch below).
> 4) The cube build engine must be MapReduce; Spark is not supported at the
> moment.
>
> -----------------
> Best wishes to you!
> From: Xiaoxiang Yu
>
> At 2019-06-25 17:20:55, "Andras Nagy" <[email protected]>
> wrote:
>
> Hi ShaoFeng,
>
> Thanks a lot for the pointer on the lambda mode, yes, that's exactly what
> I need :)
>
> Is there perhaps documentation on this? For now, I was trying to get this
> working 'empirically' and finally succeeded, but some of my conclusions may
> be wrong. This is what I concluded:
>
> - the Hive table must have the same name as the streaming table (the name
> given to the data source)
> - the cube can't be built from the UI (to build the historic segments from
> the data in Hive), but it can be built using the REST API
> - the cube build engine must be MapReduce. With Spark as the build engine I
> got the exception "Cannot adapt to interface
> org.apache.kylin.engine.spark.ISparkOutput"
> - endTime must be non-overlapping with the streaming data. When I had
> overlap, the streaming data coming from Kafka did not show up in the
> output; I guess this is what you meant by "the segments from Hive will
> overwrite the segments from Kafka".
>
> Are these correct conclusions? Is there anything else I should be aware of?
>
> Many thanks,
> Andras
>
> On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <[email protected]>
> wrote:
>
>> Hello Andras,
>>
>> Kylin's real-time OLAP feature supports a "Lambda" mode (mentioned in
>> https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
>> means you can define a fact table whose data comes from both Kafka and
>> Hive. The only requirement is that all the cube columns appear in both the
>> Kafka data and the Hive data. I think that may fit your need. The cube can
>> be built from Kafka, and at the same time it can also be built from Hive;
>> the segments from Hive will overwrite the segments from Kafka (as usually
>> the Hive data is more accurate). When querying the cube, Kylin will first
>> query the historical segments, and then the real-time segments (adding the
>> max time of the historical segments as a condition).
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Email: [email protected]
>>
>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> Join Kylin user mail group: [email protected]
>> Join Kylin dev mail group: [email protected]
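
A quick illustration of point 3) above, and of Andras's observation that the
Hive-backed (historical) segments can only be built through the REST API: the
sketch below triggers such a build. The host, credentials, cube name and time
range are placeholders, and the exact endpoint and payload are worth checking
against the REST API documentation of the Kylin version you run.

# Minimal sketch: trigger a historical (Hive-backed) segment build for a
# lambda-mode cube through Kylin's REST API. Host, credentials, cube name
# and time range below are placeholders.
import requests

KYLIN_HOST = "http://localhost:7070"   # placeholder
CUBE_NAME = "my_streaming_cube"        # placeholder
AUTH = ("ADMIN", "KYLIN")              # placeholder credentials

# Build a segment for a past time range (epoch milliseconds, UTC). As noted
# above, it should not overlap with the data already covered by Kafka.
payload = {
    "startTime": 1560384000000,   # 2019-06-13 00:00:00 UTC
    "endTime": 1560470400000,     # 2019-06-14 00:00:00 UTC
    "buildType": "BUILD",
}

resp = requests.put(
    f"{KYLIN_HOST}/kylin/api/cubes/{CUBE_NAME}/build",
    json=payload,
    auth=AUTH,
    headers={"Content-Type": "application/json;charset=utf-8"},
)
resp.raise_for_status()
print(resp.json())   # the build job created by Kylin

As Andras notes above, the time range has to stay clear of the range served by
the real-time (Kafka) segments, otherwise the Hive segment overwrites them.
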

>>
>> Andras Nagy <[email protected]> wrote on Mon, Jun 24, 2019 at 11:29 PM:
>>
>>> Dear Ma,
>>>
>>> Thanks for your reply.
>>>
>>> Slightly related to my original question on the hybrid model, I was
>>> wondering if it's possible to combine a batch and a streaming cube. I
>>> realized this is not possible, as a hybrid model can only be created from
>>> cubes of the same model (and a model points to either a batch or a
>>> streaming data source).
>>>
>>> The use case would be this:
>>> - we have a large amount of streaming data in Kafka that we would like
>>> to process with Kylin streaming
>>> - Kafka retention is only a few days, so if we need to change anything
>>> in the cubes (e.g. introduce a new metric or dimension which has been
>>> present in the events, but not in the cube definition), we can only
>>> reprocess a few days' worth of data in the streaming model
>>> - the raw events are also written to a data lake for long-term storage
>>> - the data written to the data lake could be used to feed the historic
>>> data into a batch Kylin model (and cubes)
>>> - I'm looking for a way to combine these, so if we want to change
>>> anything in the cubes, we can recalculate them for the historic data as
>>> well
>>>
>>> Is there a way to achieve this with current Kylin? (Without implementing
>>> a custom query layer that combines the two cubes.)
>>>
>>> Best regards,
>>> Andras
>>>
>>> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <[email protected]> wrote:
>>>
>>>> Hi Andras,
>>>>
>>>> Currently it doesn't support consuming from specified offsets; it only
>>>> supports consuming from the start offset or the latest offset. If you
>>>> want to consume from the start offset, you need to set the configuration
>>>> kylin.stream.consume.offsets.latest to false on the cube's overrides
>>>> page (see the example below).
>>>>
>>>> If you do need to start from specified offsets, please create a JIRA
>>>> request, but I think it is hard for users to know what offsets should be
>>>> set for all partitions.
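
For concreteness, the override Ma Gang mentions is a single key/value pair on
the cube's overrides page:

    kylin.stream.consume.offsets.latest=false

With this set, a newly created streaming cube starts consuming from the
earliest offsets still retained in Kafka instead of the latest ones.
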

>>>>
>>>> At 2019-06-13 22:34:59, "Andras Nagy" <[email protected]>
>>>> wrote:
>>>>
>>>> Dear Ma,
>>>>
>>>> Thank you very much!
>>>>
>>>> > 1) yes, you can specify a configuration in the new cube, to consume
>>>> > data from the start offset
>>>> That is, an offset value for each partition of the topic? That would be
>>>> good - could you please point me to where to do this in practice, or
>>>> point me to what I should read? (I haven't found it on the cube designer
>>>> UI - perhaps this is something that's only available through the API?)
>>>>
>>>> Many thanks,
>>>> Andras
>>>>
>>>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <[email protected]> wrote:
>>>>
>>>>> Hi Andras,
>>>>>
>>>>> 1) yes, you can specify a configuration in the new cube, to consume
>>>>> data from the start offset
>>>>>
>>>>> 2) It should work, but I haven't tested it yet
>>>>>
>>>>> 3) as I remember, currently we use the Kafka 1.0 client library, so it
>>>>> is better to use that version or later; I'm sure that versions before
>>>>> 0.9.0 cannot work, but I'm not sure whether 0.9.x works or not
>>>>>
>>>>> Ma Gang
>>>>> Email: [email protected]
>>>>>
>>>>> On 06/13/2019 18:01, Andras Nagy <[email protected]> wrote:
>>>>>
>>>>> Greetings,
>>>>>
>>>>> I have a few questions related to the new streaming (real-time OLAP)
>>>>> implementation.
>>>>>
>>>>> 1) Is there a way to have data reprocessed from Kafka? E.g. I change a
>>>>> cube definition and drop the cube (or add a new cube definition) and
>>>>> want to have the data that is still available on Kafka reprocessed to
>>>>> build the changed cube (or the new cube)? Is this possible?
>>>>>
>>>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>>>> cubes)?
>>>>>
>>>>> 3) What is the minimum Kafka version required? The tutorial asks to
>>>>> install Kafka 1.0; is this the minimum required version?
>>>>>
>>>>> Thank you very much,
>>>>> Andras
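
Closing with a sketch of the lambda-mode table layout Xiaoxiang describes near
the top of this thread, which is also where the discussion ends up for
reprocessing history that has already expired from Kafka (question 1 above):
keep the raw events in a Hive table that mirrors the streaming table, and build
those ranges into historical segments via the REST API. The table and column
names below are made-up placeholders; only the naming and location rules come
from the thread.

    -- Must carry the same name as the streaming table and live in Hive's
    -- "default" database; column names must match the streaming table
    -- exactly and the data types must be compatible.
    CREATE TABLE default.user_events (
      event_time  TIMESTAMP,
      user_id     STRING,
      country     STRING,
      amount      DOUBLE
    );
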
