Alright, I have checkpoints saving implemented that way. I will apply this
same pattern to jobgraphs.


On Thu, Feb 15, 2018 at 11:13 AM, Fabian Hueske <fhue...@gmail.com> wrote:

> Hi,
>
> all data is stored in a distributed file system or object store (HDFS, S3,
> Ceph, ...) and ZooKeeper only stores pointers to that data.
>
> Cheers, Fabian
>
> 2018-02-15 11:08 GMT+01:00 Krzysztof Białek <krzysiek.bia...@gmail.com>:
>
>> Alright, just came across the first real-life problem with my Consul HA
>> implementation.
>> In Consul KV store there is a limit of 512kB per node and JobGraph of one
>> of my apps exceeded it.
>> In ZK there seems to be similar zNode Limit = 1MB
>> How did you workaround it? Or maybe I serialize the JobGraph wrong?
>>
>> On Thu, Feb 15, 2018 at 8:47 AM, Krzysztof Białek <
>> krzysiek.bia...@gmail.com> wrote:
>>
>>> I have very little experience with ZK and cannot explain the differences
>>> between ZK and Consul by myself. However there are some comparisions
>>> available:
>>> * https://www.consul.io/intro/vs/zookeeper.html - done by Consul so may
>>> be biased
>>> * https://www.slideshare.net/IvanGlushkov/zookeeper-vs-consul-41882991
>>> * https://jakon.me/2017/01/consul-deployment-orchestration/
>>>
>>> Regarding testing - I did basic failover scenarios on my workstation
>>> with 2 JobManagers, 2 TaskManagers and WindowJoin example app with
>>> checkpointing and restarting turned on.
>>> I was running the cluster no longer than for few hours.
>>>
>>> For now I'd like to open Flink for alternative HA backends (
>>> https://issues.apache.org/jira/browse/FLINK-8660)
>>>
>>>
>>> On Wed, Feb 14, 2018 at 1:47 PM, Chesnay Schepler <ches...@apache.org>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> I don't know anything about Consul but the prospect of having other
>>>> options beside Zookeeper is very interesting. It's rather surprising how
>>>> little you had to modify existing classes to get this to work.
>>>>
>>>> It may take a bit until someone provides proper feedback as the
>>>> community is currently prepping 2 releases (1.4.1 and 1.5), please don't be
>>>> discouraged by this.
>>>>
>>>> I saw that your branch was based on the 1.4 version. In 1.5 we reworked
>>>> the distributed architecture of Flink (in an initiative commonly referred
>>>> to as FLIP-6) which may affect your work.
>>>>
>>>> 2 things to note from my side:
>>>> It would also be helpful if you could explain the differences between
>>>> ZK and Consul and how they stack up in terms of guarantees etc. .
>>>> How did you test your solution so far? (Like how long was a cluster
>>>> running, what failure scenarios)
>>>>
>>>>
>>>> On 13.02.2018 21:38, Krzysztof Białek wrote:
>>>>
>>>> I'd like to get your opinion about this idea. I found related JIRA issue
>>>>  FLINK-2366, but it seems to be dead. To attract your attention I copy
>>>> my comment here.
>>>>
>>>> As an experiment I've implemented Flink HA on top of Consul. The
>>>> implementation is working fine in the "lab" but is not battle tested yet.
>>>> The source code is available at https://github.com/kbialek/
>>>> flink/tree/feature/consul (flink-runtime package
>>>> org.apache.flink.runtime.consul)
>>>>
>>>> Why?. Generally I'd like to keep as less moving parts as possible. We
>>>> do not have Zookeeper running, but Consul is already in place. And in the
>>>> end freedom of choice is a good thing.
>>>>
>>>> It would be great to see built-in Consul support in Flink someday, but
>>>> if it is not expected then I suggest a little refactoring to open
>>>> possibility to configure HighAvailabilityServicesFactory. As far as I
>>>> can see this should be enough to inject any HA implementation.
>>>>
>>>> Regards,
>>>> Krzysztof
>>>>
>>>>
>>>>
>>>
>>
>

Reply via email to