Hi Xiaoxiang,

Thank you for the detailed information. Could you please record these
limitations as JIRA issues (if not already done)? Thanks.

Best regards,

Shaofeng Shi 史少锋
Apache Kylin PMC
Email: [email protected]

Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
Join Kylin user mail group: [email protected]
Join Kylin dev mail group: [email protected]




Xiaoxiang Yu <[email protected]> wrote on Tue, Jun 25, 2019 at 11:42 PM:

>
> Hi, Andras
>     I am glad to see that you have a strong understanding of Kylin's
> Real-time OLAP. Most of your conclusions are correct; the following is my
> understanding:
>     1)  Currently there is no documentation that talks about how to use
> lambda mode; we will publish one after the 3.0.0-beta release (maybe this
> weekend or a week later?).
>     2)  The Hive table must have the same name as the streaming table, and
> should be located in the "default" database of Hive. The column names
> should match exactly and the data types should be compatible.
>     3)  If you want to build a segment whose data comes from Hive, you
> have to build it via the REST API (see the sketch below).
>     4)  The cube build engine must be MapReduce; Spark is not supported at
> the moment.
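>
> To illustrate point 3: below is a rough sketch of triggering a build over
> the Hive-backed time range with the cube build REST endpoint, assuming the
> standard build API applies here. The host, credentials, cube name and time
> range are placeholders, so please adjust them to your environment.
>
>     # sketch: trigger a segment build for the historical (Hive) range
>     import requests
>
>     kylin = "http://localhost:7070/kylin/api"
>     payload = {
>         "startTime": 1546300800000,  # 2019-01-01 00:00 UTC, epoch millis
>         "endTime": 1561420800000,    # 2019-06-25 00:00 UTC, keep before the Kafka data
>         "buildType": "BUILD",
>     }
>     resp = requests.put(
>         kylin + "/cubes/my_streaming_cube/build",  # hypothetical cube name
>         json=payload,
>         auth=("ADMIN", "KYLIN"),  # default credentials; change in production
>         headers={"Content-Type": "application/json;charset=utf-8"},
>     )
>     resp.raise_for_status()
>     print(resp.json())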
>
>
> -----------------
> Best wishes to you!
> From: Xiaoxiang Yu
>
> At 2019-06-25 17:20:55, "Andras Nagy" <[email protected]>
> wrote:
>
> Hi ShaoFeng,
>
> Thanks a lot for the pointer on the lambda mode, yes, that's exactly what
> I need :)
>
> Is there perhaps documentation on this? For now, I was trying to get this
> working 'empirically' and finally succeeded, but some of my conclusions may
> be wrong. This is what I concluded:
>
> - the Hive table must have the same name as the streaming table (the name
> given to the data source)
> - the cube can't be built from the UI (to build the historic segments from
> the data in Hive), but it can be built using the REST API
> - the cube build engine must be MapReduce. With Spark as the build engine I
> got the exception "Cannot adapt to interface
> org.apache.kylin.engine.spark.ISparkOutput"
> - endTime must be non-overlapping with the streaming data. When I had an
> overlap, the streaming data coming from Kafka did not show up in the
> output; I guess this is what you meant by "the segments from Hive will
> overwrite the segments from Kafka".
>
> Are these correct conclusions? Is there anything else I should be aware of?
>
> Many thanks,
> Andras
>
> On Tue, Jun 25, 2019 at 9:19 AM ShaoFeng Shi <[email protected]>
> wrote:
>
>> Hello Andras,
>>
>> Kylin's real-time OLAP feature supports a "Lambda" mode (mentioned in
>> https://kylin.apache.org/blog/2019/04/12/rt-streaming-design/), which
>> means you can define a fact table whose data comes from both Kafka and
>> Hive. The only requirement is that all the cube columns appear in both the
>> Kafka data and the Hive data. I think that may fit your need. The cube can
>> be built from Kafka, and in the meanwhile it can also be built from Hive;
>> the segments from Hive will overwrite the segments from Kafka (as Hive
>> data is usually more accurate). When querying the cube, Kylin will first
>> query the historical segments, and then the real-time segments (adding the
>> max time of the historical segments as a condition).
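>>
>> To illustrate the idea, here is a rough conceptual sketch of how a query
>> window could be split between the two kinds of segments. This is only an
>> illustration, not Kylin's actual implementation; the function name and
>> segment structure are made up for the example.
>>
>>     # conceptual sketch only: split a query window between historical
>>     # (Hive-built) segments and real-time (Kafka) segments
>>     def split_time_range(historical_segments, query_start, query_end):
>>         # historical segments answer everything up to their max end time;
>>         # real-time receivers answer only what is newer than that
>>         hist_max = max((seg["end"] for seg in historical_segments),
>>                        default=query_start)
>>         historical_range = (query_start, min(query_end, hist_max))
>>         realtime_range = (max(query_start, hist_max), query_end)
>>         return historical_range, realtime_range
>>
>>     # example: historical segments end at 2019-06-25 00:00 UTC,
>>     # the query asks for all of June 2019 (times in epoch millis)
>>     segments = [{"start": 1546300800000, "end": 1561420800000}]
>>     print(split_time_range(segments, 1559347200000, 1561939200000))
>>     # -> ((1559347200000, 1561420800000), (1561420800000, 1561939200000))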
>>
>>
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>> Apache Kylin PMC
>> Email: [email protected]
>>
>> Apache Kylin FAQ: https://kylin.apache.org/docs/gettingstarted/faq.html
>> Join Kylin user mail group: [email protected]
>> Join Kylin dev mail group: [email protected]
>>
>>
>>
>>
>> Andras Nagy <[email protected]> wrote on Mon, Jun 24, 2019 at 11:29 PM:
>>
>>> Dear Ma,
>>>
>>> Thanks for your reply.
>>>
>>> Slightly related to my original question on the hybrid model, I was
>>> wondering if it's possible to combine a batch and a streaming cube. I
>>> realized this is not possible, as a hybrid model can only be created from
>>> cubes of the same model (and a model points to either a batch or a
>>> streaming datasource).
>>>
>>> The use case would be this:
>>> - we have a large amount of streaming data in Kafka that we would like
>>> to process with Kylin streaming
>>> - Kafka retention is only a few days, so if we need to change anything
>>> in the cubes (e.g. introduce a new metric or dimension which has been
>>> present in the events, but not in the cube definition), we can only
>>> reprocess a few days worth of data in the streaming model
>>> - the raw events are also written to a data lake for long-term storage
>>> - the data written to the data lake could be used to feed the historic
>>> data into a batch kylin model (and cubes)
>>> - I'm looking for a way to combine these, so if we want to change
>>> anything in the cubes, we can recalculate them for the historic data as well
>>>
>>> Is there a way to achieve this with current Kylin? (Without implementing
>>> a custom query layer that combines the two cubes.)
>>>
>>> Best regards,
>>> Andras
>>>
>>> On Fri, Jun 14, 2019 at 6:43 AM Ma Gang <[email protected]> wrote:
>>>
>>>> Hi Andras,
>>>>
>>>> Currently it doesn't support consuming from specified offsets; it only
>>>> supports consuming from the start offset or the latest offset. If you want
>>>> to consume from the start offset, you need to set the
>>>> configuration kylin.stream.consume.offsets.latest to false on the cube's
>>>> overrides page.
>>>>
>>>> If you do need to start from specified offsets, please create a JIRA
>>>> request, but I think it is hard for a user to know what offsets should
>>>> be set for all the partitions.
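>>>>
>>>> For reference, that override is a plain key/value entry added on the
>>>> cube's configuration overwrites page in the cube designer (I am writing
>>>> the page name from memory, so check your Kylin version):
>>>>
>>>>     kylin.stream.consume.offsets.latest=false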
>>>>
>>>> At 2019-06-13 22:34:59, "Andras Nagy" <[email protected]>
>>>> wrote:
>>>>
>>>> Dear Ma,
>>>>
>>>> Thank you very much!
>>>>
>>>> >1)yes, you can specify a configuration in the new cube, to consume
>>>> data from start offset
>>>> That is, an offset value for each partition of the topic? That would be
>>>> good - could you please point me where to do this in practice, or point me
>>>> to what I should read? (I haven't found it on the cube designer UI -
>>>> perhaps this is something that's only available on the API?)
>>>>
>>>> Many thanks,
>>>> Andras
>>>>
>>>>
>>>>
>>>> On Thu, Jun 13, 2019 at 1:14 PM Ma Gang <[email protected]> wrote:
>>>>
>>>>> Hi Andras,
>>>>> 1) Yes, you can specify a configuration in the new cube to consume
>>>>> data from the start offset.
>>>>>
>>>>> 2) It should work, but I haven't tested it yet.
>>>>>
>>>>> 3) As I remember, we currently use the Kafka 1.0 client library, so it is
>>>>> better to use that version or later. I'm sure that versions before 0.9.0
>>>>> cannot work, but I'm not sure whether 0.9.x works or not.
>>>>>
>>>>>
>>>>>
>>>>> Ma Gang
>>>>> Email: [email protected]
>>>>>
>>>>>
>>>>> On 06/13/2019 18:01, Andras Nagy <[email protected]> wrote:
>>>>> Greetings,
>>>>>
>>>>> I have a few questions related to the new streaming (real-time OLAP)
>>>>> implementation.
>>>>>
>>>>> 1) Is there a way to have data reprocessed from Kafka? E.g. I change a
>>>>> cube definition and drop the cube (or add a new cube definition) and want
>>>>> data that is still available in Kafka to be reprocessed to build the
>>>>> changed cube (or the new cube). Is this possible?
>>>>>
>>>>> 2) Does the hybrid model work with streaming cubes (to combine two
>>>>> cubes)?
>>>>>
>>>>> 3) What is the minimum Kafka version required? The tutorial asks to
>>>>> install Kafka 1.0; is this the minimum required version?
>>>>>
>>>>> Thank you very much,
>>>>> Andras
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
