Re: Flink on Kubernetes - Session vs Job cluster mode and storage

2020-02-28 Thread Hao Sun
Sounds good. Thank you! Hao Sun On Thu, Feb 27, 2020 at 6:52 PM Yang Wang wrote: > Hi Hao Sun, > > I just post the explanation to the user ML so that others could also have > the same problem. > > Gven the job graph is fetched from the jar, do we still need Zookeeper for >

Re: Flink 1.8.0 S3 Checkpoints fail with java.security.ProviderException: Could not initialize NSS

2019-10-10 Thread Hao Sun
I saw similar issue when using alpine linux. https://pkgs.alpinelinux.org/package/v3.3/main/x86/nss Installing this package fixed my problem Hao Sun On Thu, Oct 10, 2019 at 3:46 PM Austin Cawley-Edwards < austin.caw...@gmail.com> wrote: > Hi there, > > I'm getting the followin

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-10-10 Thread Hao Sun
Yep I know that option. That's where get me confused as well. In a HA setup, where do I supply this option (allowNonRestoredState)? This option requires a savepoint path when I start a flink job I remember. And HA does not require the path Hao Sun On Thu, Oct 10, 2019 at 11:16 AM Yun Tang

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-24 Thread Hao Sun
I think I overlooked it. Good point. I am using Redis to save the path to my savepoint, I might be able to set a TTL to avoid such issue. Hao Sun On Tue, Sep 24, 2019 at 9:54 AM Yuval Itzchakov wrote: > Hi Hao, > > I think he's exactly talking about the usecase where the JM/T

Re: Challenges Deploying Flink With Savepoints On Kubernetes

2019-09-24 Thread Hao Sun
case well, I do not see a need to start from checkpoint after a bug fix. >From what I know, currently you can use checkpoint as a savepoint as well Hao Sun On Tue, Sep 24, 2019 at 7:48 AM Yuval Itzchakov wrote: > AFAIK there's currently nothing implemented to solve this problem, but >

Re: [ANNOUNCE] Rong Rong becomes a Flink committer

2019-07-11 Thread Hao Sun
Congratulations Rong. On Thu, Jul 11, 2019, 11:39 Xuefu Z wrote: > Congratulations, Rong! > > On Thu, Jul 11, 2019 at 10:59 AM Bowen Li wrote: > >> Congrats, Rong! >> >> >> On Thu, Jul 11, 2019 at 10:48 AM Oytun Tez wrote: >> >> > Congratulations Rong! >> > >> > --- >> > Oytun Tez >> > >> >

Re: Graceful Task Manager Termination and Replacement

2019-07-11 Thread Hao Sun
I have a common interest in this topic. My k8s recycle hosts, and I am facing the same issue. Flink can tolerate this situation, but I am wondering if I can do better On Thu, Jul 11, 2019, 12:39 Aaron Levin wrote: > Hello, > > Is there a way to gracefully terminate a Task Manager beyond just

Re: [VOTE] How to Deal with Split/Select in DataStream API

2019-07-04 Thread Hao Sun
Personally I prefer 3) to keep split/select and correct the behavior. I feel side output is kind of overkill for such a primitive function, and I prefer simple APIs like split/select. Hao Sun On Thu, Jul 4, 2019 at 11:20 AM Xingcan Cui wrote: > Hi folks, > > Two weeks ago, I started

Re: status on FLINK-7129

2019-04-23 Thread Hao Sun
+1 On Tue, Apr 23, 2019, 05:18 Vishal Santoshi wrote: > +1 > > On Tue, Apr 23, 2019, 4:57 AM kant kodali wrote: > >> Thanks all for the reply. I believe this is one of the most important >> feature that differentiates flink from other stream processing engines as >> others don't even have CEP

Re: java.io.IOException: NSS is already initialized

2019-04-17 Thread Hao Sun
I think I found the root cause https://bugs.alpinelinux.org/issues/10126 I have to re-install nss after apk update/upgrade Hao Sun On Sun, Nov 11, 2018 at 10:50 AM Ufuk Celebi wrote: > Hey Hao, > > 1) Regarding Hadoop S3: are you using the repackaged Hadoop S3 > dependency f

Re: inconsistent module descriptor for hadoop uber jar (Flink 1.8)

2019-04-16 Thread Hao Sun
I am using sbt and sbt-assembly. In build.sbt libraryDependencies ++= Seq("org.apache.flink" % "flink-shaded-hadoop2-uber" % "2.8.3-1.8.0") Hao Sun On Tue, Apr 16, 2019 at 12:07 AM Gary Yao wrote: > Hi, > > Can you describe how to reproduce this? &g

inconsistent module descriptor for hadoop uber jar (Flink 1.8)

2019-04-15 Thread Hao Sun
-uber-2.8.3-1.8.0.pom': bad revision: expected='2.8.3-1.8.0' found='2.4.1-1.8.0'; Is this a bug? Hao Sun

Re: Is there a better way to debug ClassNotFoundException from FlinkUserCodeClassLoaders?

2019-01-04 Thread Hao Sun
Thanks Congxian for the tip. Arthas looks great Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Fri, Jan 4, 2019 at 5:42 PM Congxian Qiu wrote: > Hi, Hao Sun > > For debugging the `ClassNotFoundException`, maybe the Arthas[1] tool can > help. > > [1] Arthas &

Re: Is there a better way to debug ClassNotFoundException from FlinkUserCodeClassLoaders?

2019-01-03 Thread Hao Sun
om/sbt/sbt-assembly to assemble the fat jar. There might be some issue, or config issue with that as well. I am reading this article, it is a good start for me as well https://heapanalytics.com/blog/engineering/missing-scala-class-noclassdeffounderror Hao Sun Team Lead 1019 Market St. 7F San Fra

Re: Is there a better way to debug ClassNotFoundException from FlinkUserCodeClassLoaders?

2019-01-02 Thread Hao Sun
ang.Object createInstance(java.lang.Object[]); public com.zendesk.fraudprevention.examples.ConnectedStreams$$anon$90$$anon$45(com.zendesk.fraudprevention.examples.ConnectedStreams$$anon$90, org.apache.flink.api.common.typeutils.TypeSerializer[]); } Hao Sun Team Lead 1019 Market St. 7F San Francisco

Is there a better way to debug ClassNotFoundException from FlinkUserCodeClassLoaders?

2019-01-02 Thread Hao Sun
) at org.apache.flink.streaming.api.graph.StreamConfig.getChainedOutputs(StreamConfig.java:321) ... 5 more Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103

Re: Kafka consumer, is there a way to filter out messages using key only?

2018-12-27 Thread Hao Sun
...) based on the message key, that would allow you >> later to filter it out. So assuming the Optional solution the result of >> KeyedDeserializationSchema#deserialize could be Optional.empty() for >> invalid keys and Optional.of(deserializedValue) for valid keys. >> >>

Kafka consumer, is there a way to filter out messages using key only?

2018-12-18 Thread Hao Sun
there is a KeyedDeserializationSchema, but can I use it to filter data? Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103

Re: How many times Flink initialize an operator?

2018-12-11 Thread Hao Sun
Ok, thanks for the clarification. Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Tue, Dec 11, 2018 at 2:38 PM Ken Krugler wrote: > It’s based the parallelism of that operator, not the number of > TaskManagers. > > E.g. you can have an operator with a parall

How many times Flink initialize an operator?

2018-12-11 Thread Hao Sun
[I,O]).addSink(discard) Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-06 Thread Hao Sun
Thanks for the tip! I did change the jobGraph this time. Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Thu, Dec 6, 2018 at 2:47 AM Till Rohrmann wrote: > Hi Hao, > > if Flink tries to recover from a checkpoint, then the JobGraph should not > be modified and the s

Re: Flink 1.7 job cluster (restore from checkpoint error)

2018-12-05 Thread Hao Sun
Till, Flink is automatically trying to recover from a checkpoint not savepoint. How can I get allowNonRestoredState applied in this case? Hao Sun Team Lead 1019 Market St. 7F San Francisco, CA 94103 On Wed, Dec 5, 2018 at 10:09 AM Till Rohrmann wrote: > Hi Hao, > > I think you need t

Flink 1.7 job cluster (restore from checkpoint error)

2018-12-04 Thread Hao Sun
nStates(StateAssignmentOperation.java:77) at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedState(CheckpointCoordinator.java:1049) at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1152) at org.apache.flink.runtime.jobmaster.JobMaster.(JobMaster.java:296) at org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:157) == Can somebody help out? Thanks Hao Sun

Re: Flink 1.7 RC missing flink-scala-shell jar

2018-11-14 Thread Hao Sun
> though. > > > On 14.11.2018 03:44, Hao Sun wrote: > > I do not see flink-scala-shell jar under flink opt directory. To run > scala shell, do I have to include the flink-scala-shell jar in my program > jar? > Why the error is saying Could not find or load main class >

Re: Flink 1.7 RC missing flink-scala-shell jar

2018-11-13 Thread Hao Sun
: > Hi, > > Till is the release manager for 1.7, so ping him here. > > Best, > tison. > > > Hao Sun 于2018年11月14日周三 上午3:07写道: > >> Sorry I mean the scala-2.12 version is missing >> >> On Tue, Nov 13, 2018 at 10:58 AM Hao Sun wrote: &

Re: Flink 1.7 RC missing flink-scala-shell jar

2018-11-13 Thread Hao Sun
Sorry I mean the scala-2.12 version is missing On Tue, Nov 13, 2018 at 10:58 AM Hao Sun wrote: > I can not find the jar here: > > https://repository.apache.org/content/repositories/orgapacheflink-1191/org/apache/flink/ > > Here is the error: > bash-4.4# ./bin/start-scala-shel

Flink 1.7 RC missing flink-scala-shell jar

2018-11-13 Thread Hao Sun
I can not find the jar here: https://repository.apache.org/content/repositories/orgapacheflink-1191/org/apache/flink/ Here is the error: bash-4.4# ./bin/start-scala-shell.sh local Error: Could not find or load main class org.apache.flink.api.scala.FlinkShell I think somehow I have to include the

Re: How to run Flink 1.6 job cluster in "standalone" mode?

2018-11-12 Thread Hao Sun
. > > Maybe you can tell us what wrong behavior you observe? > > Btw. Flink's metrics can also already be quite helpful. > > Regards, > Timo > > Am 07.11.18 um 14:15 schrieb Hao Sun: > > "Standalone" here I mean job-mananger + taskmanager on the same JVM.

Re: Where is the "Latest Savepoint" information saved?

2018-11-11 Thread Hao Sun
tive is the web UI checkpointing tab. It shows the latest > checkpoint used for restore of the job. You should see your savepoint > there. > > Best, > > Ufuk > > > On Sun, Nov 11, 2018 at 7:45 PM Hao Sun wrote: > > > > This is great, I will try option 3 and

Re: Where is the "Latest Savepoint" information saved?

2018-11-11 Thread Hao Sun
/rest_api.html > <https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html> > [3] > https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html#job-cancellation > <https://ci.apache.org/projects/flink/flink-docs-release-1.6/monito

Re: java.io.IOException: NSS is already initialized

2018-11-10 Thread Hao Sun
pointing is working with the hadoop flavour. On Fri, Nov 9, 2018 at 2:02 PM Ufuk Celebi wrote: > Hey Hao Sun, > > - Is this an intermittent failure or permanent? The logs indicate that > some checkpoints completed before the error occurs (e.g. checkpoint > numbers are greater than

Re: Where is the "Latest Savepoint" information saved?

2018-11-09 Thread Hao Sun
ease-1.6/monitoring/rest_api.html > <https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html> > > Best, > Paul Lam > > > 在 2018年11月9日,13:55,Hao Sun 写道: > > Since this save point path is very useful to application updates, where is > this information stored? Can we keep it in ZK or S3 for retrieval? > > > > >

The heartbeat of TaskManager with id ... timed out.

2018-11-08 Thread Hao Sun
I am running Flink 1.7 on K8S. I am not sure how to debug this issue. I turned on debug on JM/TM. I am not sure this part is related or not. How could an Actor suddenly disappear? = 2018-11-09 04:47:19,480 DEBUG org.apache.flink.runtime.rest.handler.legacy.metrics.MetricFetcher - Query

Re: java.io.IOException: NSS is already initialized

2018-11-08 Thread Hao Sun
id > On 03/11/2018 03:09, Hao Sun wrote: > > Same environment, new error. > > I can run the same docker image with my local Mac, but on K8S, this gives > me this error. > I can not think of any difference between local Docker and K8S Docker. > > Any hint will be helpful.

How to run Flink 1.6 job cluster in "standalone" mode?

2018-11-07 Thread Hao Sun
"Standalone" here I mean job-mananger + taskmanager on the same JVM. I have an issue to debug on our K8S environment, I can not reproduce it in local docker env or Intellij. If JM and TM are running in different VMs, it makes things harder to debug. Or is there a way to debug a job running on JM

Re: Flink 1.6 job cluster mode, job_id is always 00000000000000000000000000000000

2018-11-05 Thread Hao Sun
Thanks all. On Mon, Nov 5, 2018 at 2:05 AM Ufuk Celebi wrote: > On Sun, Nov 4, 2018 at 10:34 PM Hao Sun wrote: > > Thanks that also works. To avoid same issue with zookeeper, I assume I > have to do the same trick? > > Yes, exactly. The following configuration [1

Re: Flink 1.6 job cluster mode, job_id is always 00000000000000000000000000000000

2018-11-04 Thread Hao Sun
Thanks that also works. To avoid same issue with zookeeper, I assume I have to do the same trick? On Sun, Nov 4, 2018, 03:34 Ufuk Celebi wrote: > Hey Hao Sun, > > this has been changed recently [1] in order to properly support > failover in job cluster mode. > > A workar

Flink 1.6 job cluster mode, job_id is always 00000000000000000000000000000000

2018-11-03 Thread Hao Sun
I am wondering if I can customize job_id for job cluster mode. Currently it is always . I am running multiple job clusters and sharing s3, it means checkpoints will be shared by different jobs as well e.g. /chk-64, how can I avoid

Re: java.io.IOException: NSS is already initialized

2018-11-02 Thread Hao Sun
untime.executiongraph.ExecutionGraph- Try to restart or fail the job ConnectedStreams maxwell.accounts () if no longer possible. ===== On Thu, Nov 1, 2018 at 9:22 PM Hao Sun wrote: > I am on Flink 1.6.2 (no Hadoop, in docker + K8S),

java.io.IOException: NSS is already initialized

2018-11-01 Thread Hao Sun
I am on Flink 1.6.2 (no Hadoop, in docker + K8S), using rocksdb and S3 (presto) I got this error when flink creating a checking point === 2018-11-02 04:00:55,011 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job ConnectedStreams maxwell.accounts

Re: anybody can start flink with job mode?

2018-08-28 Thread Hao Sun
; Till > > On Sat, Aug 25, 2018 at 5:11 AM Hao Sun wrote: > >> Thanks, I'll look into it. >> >> On Fri, Aug 24, 2018, 19:44 vino yang wrote: >> >>> Hi Hao Sun, >>> >>> From the error log, it seems that the jar package for the job was not &

Re: anybody can start flink with job mode?

2018-08-24 Thread Hao Sun
Thanks, I'll look into it. On Fri, Aug 24, 2018, 19:44 vino yang wrote: > Hi Hao Sun, > > From the error log, it seems that the jar package for the job was not > found. > You must make sure your Jar is in the classpath. > Related documentation may not be up-to-date, and the

anybody can start flink with job mode?

2018-08-24 Thread Hao Sun
I got an error like this. $ docker run -it flink-job:latest job-cluster Starting the job-cluster config file: jobmanager.rpc.address: localhost jobmanager.rpc.port: 6123 jobmanager.heap.size: 1024m taskmanager.heap.size: 1024m taskmanager.numberOfTaskSlots: 1 parallelism.default: 1 rest.port:

Re: [DISCUSS] Flink 1.6 features

2018-06-05 Thread Hao Sun
adding my vote to K8S Job mode, maybe it is this? > Smoothen the integration in Container environment, like "Flink as a Library", and easier integration with Kubernetes services and other proxies. On Mon, Jun 4, 2018 at 11:01 PM Ben Yan wrote: > Hi Stephan, > > Will [

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
After I added these to my flink-conf.yml, everything works now. s3.sse.enabled: true s3.sse.type: S3 Thanks for the help! In general I also want to know what config keys for presto-s3 I can use. On Tue, Jun 5, 2018 at 11:43 AM Hao Sun wrote: > also a follow up question. Can I use

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
also a follow up question. Can I use all properties here? Should I remove `hive.` for all the keys? https://prestodb.io/docs/current/connector/hive.html#hive-configuration-properties More specifically how I configure sse for s3? On Tue, Jun 5, 2018 at 11:33 AM Hao Sun wrote: > I do not h

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
add config values to the flink config as > s3.xxx. > > Best, > Aljoscha > > > On 5. Jun 2018, at 18:23, Hao Sun wrote: > > Thanks for pick up my question. I had s3a in the config now I removed it. > I will post a full trace soon, but want to get some questions a

Re: Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-05 Thread Hao Sun
Also, could you also post the full stack trace, please? > > Best, > Aljoscha > > > On 2. Jun 2018, at 07:34, Hao Sun wrote: > > I am trying to figure out how to use S3 as state storage. > The recommended way is > https://ci.apache.org/projects/flink/flink-docs-release-1.

Re: Flink 1.5, failed to instantiate S3 FS

2018-06-02 Thread Hao Sun
Thanks Amit for checking. I do not use hadoop, but I am using Flink with bundled HDP 2.8 binary. I think this article is right, I mixed 2.7 lib and 2.8 binary somehow. On Sat, Jun 2, 2018 at 1:05 AM Amit Jain wrote: > Hi Hao, > > Have look over >

Do I still need hadoop-aws libs when using Flink 1.5 and Presto?

2018-06-01 Thread Hao Sun
I am trying to figure out how to use S3 as state storage. The recommended way is https://ci.apache.org/projects/flink/flink-docs-release-1.5/ops/deployment/aws.html#shaded-hadooppresto-s3-file-systems-recommended Seems like I only have to do two things: *1. Put flink-s3-fs-presto to the lib* *2.

Flink 1.5, failed to instantiate S3 FS

2018-06-01 Thread Hao Sun
I can not find anywhere I have 100M. Not sure why I get this failure. This is in my dev docker env. Same configure file worked well for 1.3.2 = Log Caused by: org.apache.flink.util.FlinkException: Failed to submit job aa75905062dd0487034bb9d8b6617dc2. at

Re: [ANNOUNCE] Apache Flink 1.5.0 release

2018-05-25 Thread Hao Sun
This is great. Thanks for the effort to get this out! On Fri, May 25, 2018 at 9:47 AM Till Rohrmann wrote: > The Apache Flink community is very happy to announce the release of Apache > Flink 1.5.0. > > Apache Flink® is an open-source stream processing framework for >

Re: keyBy and parallelism

2018-04-11 Thread Hao Sun
>From what I learnt, you have to control parallelism your self. You can set parallelism on operator or set default one through flink-config.yaml. I might be wrong. On Wed, Apr 11, 2018 at 2:16 PM Christophe Jolif wrote: > Hi all, > > Imagine I have a default parallelism of 16

Re: java.lang.Exception: TaskManager was lost/killed

2018-04-09 Thread Hao Sun
Same story here, 1.3.2 on K8s. Very hard to find reasons on why a TM is killed. Not likely caused by memory leak. If there is a logger I have turn on please let me know. On Mon, Apr 9, 2018, 13:41 Lasse Nedergaard wrote: > We see the same running 1.4.2 on Yarn hosted

Re: Temporary failure in name resolution

2018-04-03 Thread Hao Sun
Hi Timo, we do have similar issue, TM got killed by a job. Is there a way to monitor JVM status? If through the monitor metrics, what metric I should look after? We are running Flink on K8S. Is there a possibility that a job consumes too much network bandwidth, so JM and TM can not connect? On

Re: Flink and Docker ?

2018-04-03 Thread Hao Sun
Hi, we are using this docker on K8S + S3. https://github.com/docker-flink/docker-flink It works fine for us. On Tue, Apr 3, 2018 at 1:00 AM Christophe Salperwyck < christophe.salperw...@gmail.com> wrote: > Hi, > > I didn't try docker with Flink but I know that those guys did: >

How can I confirm a savepoint is used for a new job?

2018-03-21 Thread Hao Sun
Do we have any logs in JM/TM indicate the job is using a savepoint I passed in when I submit the job? Thanks

Re: Can not cancel with savepoint with Flink 1.3.2

2018-03-15 Thread Hao Sun
with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. On Thu, Mar 15, 2018 at 8:38 PM Hao Sun <ha...@zendesk.com> wrote: > Hi, I am running flink on K8S and store states in s3 with rocksdb backend. > > I used to be able to cancel and savepointing t

Can not cancel with savepoint with Flink 1.3.2

2018-03-15 Thread Hao Sun
Hi, I am running flink on K8S and store states in s3 with rocksdb backend. I used to be able to cancel and savepointing through the rest api. But sometimes the process never finish. No matter how many time I try. Is there a way to figure out what is going wrong? Why "isStoppable"=>false? Thanks

Re: state.checkpoints.dir

2018-01-22 Thread Hao Sun
We generate flink.conf on the fly, so we can use different values based on environment. On Mon, Jan 22, 2018 at 12:53 PM Biswajit Das wrote: > Hello , > > Is there any hack to supply *state.checkpoints.*dir as argument or JVM > parameter when running locally . I can

Re: scala 2.12 support/cross-compile

2018-01-03 Thread Hao Sun
s > also a problem for Spark, which track their respective progress here: > https://issues.apache.org/jira/browse/SPARK-14540 > <https://issues.apache.org/jira/browse/SPARK-14540>. > > Best, > Aljoscha > > > On 3. Jan 2018, at 10:39, Stephan Ewen <se...@apache.or

Re: org.apache.zookeeper.ClientCnxn, Client session timed out

2017-12-28 Thread Hao Sun
Ok, thanks for the clarification. On Thu, Dec 28, 2017 at 1:05 AM Ufuk Celebi <u...@apache.org> wrote: > On Thu, Dec 28, 2017 at 12:11 AM, Hao Sun <ha...@zendesk.com> wrote: > > Thanks! Great to know I do not have to worry duplicates inside Flink. > > > > On

Re: org.apache.zookeeper.ClientCnxn, Client session timed out

2017-12-27 Thread Hao Sun
gt; wrote: > On Wed, Dec 27, 2017 at 4:41 PM, Hao Sun <ha...@zendesk.com> wrote: > >> Somehow TM detected JM leadership loss from ZK and self disconnected? >> And couple of seconds later, JM failed to connect to ZK? >> > > Yes, exactly as you describe. The TM notic

org.apache.zookeeper.ClientCnxn, Client session timed out

2017-12-27 Thread Hao Sun
Hi I need some help to figure out the root cause of this error. I am running flink 1.3.2 on K8S. My cluster has been up and running for almost two weeks and all of a sudden I see this familiar error again, my task manager is killed/lost. There are many ways cause this error, I need help to figure

Could not flush and close the file system output stream to s3a, is this fixed?

2017-12-12 Thread Hao Sun
https://issues.apache.org/jira/browse/FLINK-7590 I have a similar situation with Flink 1.3.2 on K8S = 2017-12-13 00:57:12,403 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: KafkaSource(maxwell.tickets) -> MaxwellFilter->Maxwell(maxwell.tickets) ->

Re: [ANNOUNCE] Apache Flink 1.4.0 released

2017-12-12 Thread Hao Sun
Congratulations! Awesome work. Two quick questions about the HDFS free feature. I am using S3 to store checkpoints, savepoints, and I know it is being done through hadoop-aws. - Do I have to include a hadoop-aws jar in my flatjar AND flink's lib directory to make it work for 1.4? Both or just the

Re: Trace jar file name from jobId, is that possible?

2017-12-07 Thread Hao Sun
Let me check details, on top of my mind I remember the job id changes, I might be wrong. On Thu, Dec 7, 2017, 08:48 Fabian Hueske <fhue...@gmail.com> wrote: > AFAIK, a job keeps its ID in case of a recovery. > Did you observe something else? > > 2017-12-07 17:32 GMT

Re: Trace jar file name from jobId, is that possible?

2017-12-07 Thread Hao Sun
I mean restarted during failure recovery On Thu, Dec 7, 2017 at 8:29 AM Fabian Hueske <fhue...@gmail.com> wrote: > What do you mean by rescheduled? > Started from a savepoint or restarted during failure recovery? > > > 2017-12-07 16:59 GMT+01:00 Hao Sun <ha...@zendesk.c

Re: Trace jar file name from jobId, is that possible?

2017-12-07 Thread Hao Sun
Anything I can do for the job reschedule case? Thanks. Or is there a way to add job lifecycle hooks to trace it? On Mon, Dec 4, 2017 at 12:01 PM Hao Sun <ha...@zendesk.com> wrote: > Thanks Fabian, there is one case can not be covered by the REST API. When > a job rescheduled to ru

Re: non-shared TaskManager-specific config file

2017-12-04 Thread Hao Sun
https://issues.apache.org/jira/browse/FLINK-8197, here is the JIRA link for xref. On Mon, Dec 4, 2017 at 7:35 AM Hao Sun <ha...@zendesk.com> wrote: > Sure, I will do that. > > On Mon, Dec 4, 2017, 07:26 Fabian Hueske <fhue...@gmail.com> wrote: > >> Can you

Re: Trace jar file name from jobId, is that possible?

2017-12-04 Thread Hao Sun
nk/flink-docs-release-1.3/monitoring/rest_api.html#submitting-programs > <https://ci.apache.org/projects/flink/flink-docs-release-1.3/monitoring/rest_api.html#submitting-programs> > > 2017-12-02 0:28 GMT+01:00 Hao Sun <ha...@zendesk.com>: > >> Hi I am using Flink 1.3.2

Re: non-shared TaskManager-specific config file

2017-12-04 Thread Hao Sun
Sure, I will do that. On Mon, Dec 4, 2017, 07:26 Fabian Hueske <fhue...@gmail.com> wrote: > Can you create a JIRA issue to propose the feature? > > Thank you, > Fabian > > 2017-12-04 16:15 GMT+01:00 Hao Sun <ha...@zendesk.com>: > >> Thanks. If w

Re: non-shared TaskManager-specific config file

2017-12-04 Thread Hao Sun
Thanks. If we can support include configuration dir that will be very helpful. On Mon, Dec 4, 2017, 00:50 Chesnay Schepler <ches...@apache.org> wrote: > You will have to create a separate config for each TaskManager. > > > On 01.12.2017 23:14, Hao Sun wrote: > > Hi team,

Trace jar file name from jobId, is that possible?

2017-12-01 Thread Hao Sun
Hi I am using Flink 1.3.2 on K8S, and need a deployment strategy for my app. I want to use savepoints to resume a job after each deployment. As you know I need jar file name and path to savepoints to resume a task. Currently `flink list` command only gives me job ids, not jar file names. And

non-shared TaskManager-specific config file

2017-12-01 Thread Hao Sun
Hi team, I am wondering how can I create a non-shared config file and let Flink read it. Can I use include in the config? Or I have to prepare a different config for each TM? https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html - taskmanager.hostname: The

Re: Questions about checkpoints/savepoints

2017-11-30 Thread Hao Sun
Hi team, I am a similar use case do we have any answers on this? When we trigger savepoint can we store that information to ZK as well? So I can avoid S3 file listing and do not have to use other external services? On Wed, Oct 25, 2017 at 11:19 PM vipul singh wrote: > As a

Re: [EXTERNAL] difference between checkpoints & savepoints

2017-11-30 Thread Hao Sun
Hi team, I have one follow up question on this. There is a discussion on resuming jobs from *a saved external checkpoint*, I feel there are two aspects of that topic. *1. I do not have changes to the job, just want to resume the job from a failure.* I can see this automatically happen with ZK

Task manager suddenly lost connection to JM

2017-11-16 Thread Hao Sun
Hi team, I see an wired issue that one of my TM suddenly lost connection to JM. Once the job running on the TM relocated to a new TM, it can reconnect to JM again. And after a while, the new TM running the same job will repeat the same process. It is not guaranteed the troubled TMs can reconnect

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Hao Sun
Sorry, the "killed" I mean here is JM lost the TM. The TM instance is still running inside kubernetes, but it is not responding to any requests, probably due to high load. And from JM side, JM lost heartbeat tracking of the TM, so it marked the TM as died. The „volume“ of Kafka topics, I mean,

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

2017-11-16 Thread Hao Sun
gt; Last, I think the number of G1_Young_Generation is a counter of how many > gc cycles have been performed and the time is a sum. So naturally, those > values would always increase. > > Best, > Stefan > > > Am 15.11.2017 um 18:35 schrieb Hao Sun <ha...@zendesk.com>

Re: R/W traffic estimation between Flink and Zookeeper

2017-11-16 Thread Hao Sun
etween a few > bytes or kilobytes for most messages and somewhere in the low two-digit > megabytes as a typical max size. > > Best, > Stefan > > Am 15.11.2017 um 18:41 schrieb Hao Sun <ha...@zendesk.com>: > > Thanks Piotr, does Flink read/write to zookeeper ever

Re: R/W traffic estimation between Flink and Zookeeper

2017-11-15 Thread Hao Sun
would be best to try it out in your particular use case on some > small scale. > > Piotrek > > > On 11 Oct 2017, at 19:58, Hao Sun <ha...@zendesk.com> wrote: > > > > Hi Is there a way to estimate read/write traffic between flink and zk? > > I a

R/W traffic estimation between Flink and Zookeeper

2017-10-11 Thread Hao Sun
Hi Is there a way to estimate read/write traffic between flink and zk? I am looking for something like 1000 reads/sec or 1000 writes/sec. And the size of the message. Thanks

Re: How to make my execution graph prettier?

2017-10-10 Thread Hao Sun
n can be applied when there is no shuffle between operations > and when the parallelism is the same (roughly speaking). > > If you wan't the graph to have separate tasks, you can disable chaining on > the Flink ExecutionConfig. This can lead to worse performance, though. > > Best, > A

How to make my execution graph prettier?

2017-10-09 Thread Hao Sun
Hi my execution graph looks like following, all things stuffed into on tile.[image: image.png] How can I get something like this?

TM get killed/disconnected after a while

2017-10-06 Thread Hao Sun
Hi, I am running Flink 1.3.2 on kubernetes, I am not sure why sometime one of my TM is killed, is there a way to debug this? Thanks = Logs *2017-10-05 22:36:42,631 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at

Re: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated for S3 access

2017-10-04 Thread Hao Sun
Here is what my docker file says: ENV FLINK_VERSION=1.3.2 \ HADOOP_VERSION=27 \ SCALA_VERSION=2.11 \ On Wed, Oct 4, 2017 at 8:23 AM Hao Sun <ha...@zendesk.com> wrote: > I am running Flink 1.3.2 with docker on kubernetes. My docker is using > openjdk-8, I do not have hadoop,

Re: javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated for S3 access

2017-10-04 Thread Hao Sun
outdated jdk version on the > server/client may be the cause. > > Which Flink binary (specifically, for which hadoop version) are you using? > > > On 03.10.2017 20:48, Hao Sun wrote: > > com.amazonaws.http.AmazonHttpClient - Unable to > execute

javax.net.ssl.SSLPeerUnverifiedException: peer not authenticated for S3 access

2017-10-03 Thread Hao Sun
I am using S3 for checkpointing and external ckp as well. s3a://bucket/checkpoints/e58d369f5a181842768610b5ab6a500b I have this exception, and not sure what I can do with it. I guess to configure hadoop to use some SSLFactory? I am not using hadoop, I am on kubernetes (in AWS) with S3

Re: CheckpointConfig says to persist periodic checkpoints, but no checkpoint directory has been configured.

2017-09-26 Thread Hao Sun
Thanks, I will try that. On Tue, Sep 26, 2017 at 8:24 AM Aljoscha Krettek <aljos...@apache.org> wrote: > I'm not sure whether the JM is reading it or not. But you can manually set > the values on the Configuration using the setter methods. > > > On 26. Sep 2017,

CheckpointConfig says to persist periodic checkpoints, but no checkpoint directory has been configured.

2017-09-25 Thread Hao Sun
Hi I am running flink in dev mode through Intellij, I have flink-conf.yaml correctly configured and from the log you can see job manager is reading it. 2017-09-25 20:41:52.255 [main] INFO org.apache.flink.configuration.GlobalConfiguration - *Loading configuration property: state.backend,

Re: Flink HA with Kubernetes, without Zookeeper

2017-08-22 Thread Hao Sun
ate if condition. > > If you don’t want to actually rip way into the code for the Job Manager > the ETCD Operator <https://github.com/coreos/etcd-operator> would > be a good way to bring up an ETCD cluster that is separate from the core > Kubernetes ETCD database. Combined wi

Re: StandaloneResourceManager failed to associate with JobManager leader

2017-08-22 Thread Hao Sun
Thanks Till, the DEBUG log level is a good idea. I figured it out. I made a mistake with `-` and `_`. On Tue, Aug 22, 2017 at 1:39 AM Till Rohrmann <trohrm...@apache.org> wrote: > Hi Hao Sun, > > have you checked that one can resolve the hostname flink_jobmanager from > wi

Re: Flink HA with Kubernetes, without Zookeeper

2017-08-21 Thread Hao Sun
ecessary even in that case, because it is >> where the JobManager stores information which needs to be recovered after >> the JobManager fails. >> >> We're eyeing https://github.com/coreos/zetcd >> <https://github.com/coreos/zetcd> as a way to run >> Zookeeper

Flink HA with Kubernetes, without Zookeeper

2017-08-20 Thread Hao Sun
Hi, I am new to Flink and trying to bring up a Flink cluster on top of Kubernetes. For HA setup, with kubernetes, I think I just need one job manager and do not need Zookeeper? I will store all states to S3 buckets. So in case of failure, kubernetes can just bring up a new job manager without

StandaloneResourceManager failed to associate with JobManager leader

2017-08-15 Thread Hao Sun
Hi, I am trying to run a cluster of job-manager and task-manager in docker. One of each for now. I got a StandaloneResourceManager error, stating that it can not associate with job-manager. I do not know what was wrong. I am sure that job-manager can be connected. ===