Re: Question on Job Restart strategy

2020-05-26 Thread Gary Yao
Hi Bhaskar, > Why the reset counter is not zero after streaming job restart is successful? The short answer is that the fixed delay restart strategy is not implemented like that (see [1] if you are using Flink 1.10 or above). There are also other systems that behave similarly, e.g., Apache

Re: ProducerRecord with Kafka Sink for 1.8.0

2020-05-14 Thread Gary Yao
ctory in flink the artifact is flink-core-1.7.2.jar . I need to pack flink-core-1.9.0.jar from application and use child first class loading to use newer version of flink-core. If I have it as provided scope, sure it will work in IntelliJ but not outside of it . > > Best, > Nick > >

Re: ProducerRecord with Kafka Sink for 1.8.0

2020-05-12 Thread Gary Yao
to me so quickly. > > Best, > Nick > > On Tue, May 12, 2020 at 3:37 AM Gary Yao wrote: >> >> Hi Nick, >> >> Are you able to upgrade to Flink 1.9? Beginning with Flink 1.9 you can use >> KafkaSerializationSchema to produce a ProducerRecord [1][2]. >>

Re: Flink Metrics in kubernetes

2020-05-12 Thread Gary Yao
Hi Averell, If you are seeing the log message from [1] and Scheduled#report() is not called, the thread in the "Flink-MetricRegistry" thread pool might be blocked. You can use the jstack utility to see on which task the thread pool is blocked. Best, Gary [1]

Re: ProducerRecord with Kafka Sink for 1.8.0

2020-05-12 Thread Gary Yao
Hi Nick, Are you able to upgrade to Flink 1.9? Beginning with Flink 1.9 you can use KafkaSerializationSchema to produce a ProducerRecord [1][2]. Best, Gary [1] https://issues.apache.org/jira/browse/FLINK-11693 [2]

Re: Running 2 instances of LocalExecutionEnvironments with same consumer group.id

2020-04-22 Thread Gary Yao
Hi Suraj, This question has been asked before: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-kafka-consumers-don-t-honor-group-id-td21054.html Best, Gary On Wed, Apr 22, 2020 at 6:08 PM Suraj Puvvada wrote: > > Hello, > > I have two JVMs that run

Re: Two questions about Async

2020-04-22 Thread Gary Yao
> Bytes Sent but Records Sent is always 0 Sounds like a bug. However, I am unable to reproduce this using the AsyncIOExample [1]. Can you provide a minimal working example? > Is there an Async Sink? Or do I just rewrite my Sink as an AsyncFunction followed by a dummy sink? You will have to

Re: Run several jobs in parallel in same EMR cluster?

2020-03-30 Thread Gary Yao
Can you try to set config option taskmanager.numberOfTaskSlots to 2? By default the TMs only offer one slot [1] independent from the number of CPU cores. Best, Gary [1]

Re: Testing RichAsyncFunction with TestHarness

2020-03-30 Thread Gary Yao
> > Additionally even though I add all necessary dependencies defiend in [1] I > cannot see ProcessFunctionTestHarnesses class. > That class was added in Flink 1.10 [1]. [1]

Re: Automatically Clearing Temporary Directories

2020-03-12 Thread Gary Yao
Hi David, > Would it be safe to automatically clear the temporary storage every time when a TaskManager is started? > (Note: the temporary volumes in use are dedicated to the TaskManager and not shared :-) Yes, it is safe in your case. Best, Gary On Tue, Mar 10, 2020 at 6:39 PM David Maddison

Re: Failure detection and Heartbeats

2020-03-11 Thread Gary Yao
Hi Morgan, > I am interested in knowing more about the failure detection mechanism used by Flink, unfortunately information is a little thin on the ground and I was hoping someone could shed a little light on the topic. It is probably best to look into the implementation (see my answers below).

Re: java.util.concurrent.ExecutionException

2020-03-03 Thread Gary Yao
ly(CaseStatements.scala:21) >> >> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> >> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >> >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >> >

Re: java.util.concurrent.ExecutionException

2020-03-03 Thread Gary Yao
Hi, Can you post the complete stacktrace? Best, Gary On Tue, Mar 3, 2020 at 1:08 PM kant kodali wrote: > Hi All, > > I am just trying to read edges which has the following format in Kafka > > 1,2 > 1,3 > 1,5 > > using the Table API and then converting to DataStream of Edge Objects and >

Re: Alink and Flink ML

2020-03-03 Thread Gary Yao
Hi Flavio, I am looping in Becket (cc'ed) who might be able to answer your question. Best, Gary On Tue, Mar 3, 2020 at 12:19 PM Flavio Pompermaier wrote: > Hi to all, > since Alink has been open sourced, is there any good reason to keep both > Flink ML and Alink? > From what I understood

Re: Unable to recover from savepoint and checkpoint

2020-03-03 Thread Gary Yao
Hi Puneet, Can you describe how you validated that the state is not restored properly? Specifically, how did you introduce faults to the cluster? Best, Gary On Tue, Mar 3, 2020 at 11:08 AM Puneet Kinra < puneet.ki...@customercentria.com> wrote: > Sorry for the missed information > > On

Re: Operator latency metric not working in 1.9.1

2020-03-03 Thread Gary Yao
Hi, There is a release note for Flink 1.7 that could be relevant for you [1] Granularity of latency metrics The default granularity for latency metrics has been modified. To restore the previous behavior users have to explicitly set the granularity to subtask. Best, Gary [1]

Re: yarn session: one JVM per task

2020-02-25 Thread Gary Yao
Hi David, Before with the both n and -s it was not the case. > What do you mean by before? At least in 1.8 "-s" could be used to specify the number of slots per TM. how can I be sure that my Sink that uses this lib is in one JVM ? > Is it enough that no other parallel instance of your sink

Re: REST rescale with Flink on YARN

2020-01-28 Thread Gary Yao
Hi, You can use yarn application -status to find the host and port that the server is listening on (AM host & RPC Port). If you need to access that information programmatically, take a look at the YarnClient [1]. Best, Gary [1]

Re: Localenvironment jobcluster ha High availability

2019-12-11 Thread Gary Yao
Hi Eric, What you say should be possible because your job will be executed in a MiniCluster [1] which has HA support. I have not tried this out myself, and I am not aware that people are doing this in production. However, there are integration tests that use MiniCluster + ZooKeeper [2]. Best,

[ANNOUNCE] Progress of Apache Flink 1.10 #2

2019-11-01 Thread Gary Yao
Hi community, Because we have approximately one month of development time left until the targeted Flink 1.10 feature freeze, we thought now would be a good time to give another progress update. Below we have included a list of the ongoing efforts that have made progress since our last release

Re: Multiple Job Managers in Flink HA Setup

2019-09-25 Thread Gary Yao
Hi Steve, > I also tried attaching a shared NFS folder between the two machines and > tried to set their web.tmpdir property to the shared folder, however it > appears that each job manager creates a seperate job inside that directory. You can create a fixed upload directory via the config

Re: error in Elasticsearch

2019-09-11 Thread Gary Yao
Program arguments should be set to "--input /home/alaa/nycTaxiRides.gz" (without the quotes). On Wed, Sep 11, 2019 at 10:39 AM alaa wrote: > Hallo > > I put arguments but the same error appear .. what should i do ? > > > < >

Re: error in Elasticsearch

2019-09-11 Thread Gary Yao
Hi, You are not supposed to change that part of the exercise code. You have to pass the path to the input file as a program argument (e.g., --input /path/to/file). See [1] and [2] on how to configure program arguments in IntelliJ. Best, Gary [1]

Re: How do I start a Flink application on my Flink+Mesos cluster?

2019-09-11 Thread Gary Yao
Hi Felipe, I am glad that you were able to fix the problem yourself. > But I suppose that Mesos will allocate Slots and Task Managers dynamically. > Is that right? Yes, that is the case since Flink 1.5 [1]. > Then I put the number of "mesos.resourcemanager.tasks.cpus" to be equal or > less the

Re: [ANNOUNCE] Andrey Zagrebin becomes a Flink committer

2019-08-14 Thread Gary Yao
Congratulations Andrey, well deserved! Best, Gary On Thu, Aug 15, 2019 at 7:50 AM Bowen Li wrote: > Congratulations Andrey! > > On Wed, Aug 14, 2019 at 10:18 PM Rong Rong wrote: > >> Congratulations Andrey! >> >> On Wed, Aug 14, 2019 at 10:14 PM chaojianok wrote: >> >> > Congratulations

Re: Queryable State race condition or serialization errors?

2019-05-21 Thread Gary Yao
Hi Burgess Chen, If you are using MemoryStateBackend or FsStateBackend, you can observe race conditions on the state objects. However, the RocksDBStateBackend should be safe from these issues [1]. Best, Gary [1]

Re: [DISCUSS] Temporarily remove support for job rescaling via CLI action "modify"

2019-04-29 Thread Gary Yao
Since there were no objections so far, I will proceed with removing the code [1]. [1] https://issues.apache.org/jira/browse/FLINK-12312 On Wed, Apr 24, 2019 at 1:38 PM Gary Yao wrote: > The idea is to also remove the rescaling code in the JobMaster. This will > make > it easier

Re: Flink CLI

2019-04-26 Thread Gary Yao
Hi Steve, (1) The CLI action you are looking for is called "modify" [1]. However, we want to temporarily disable this feature beginning from Flink 1.9 due to some caveats with it [2]. If you have objections, it would be appreciated if you could comment on the respective thread on

Re: [DISCUSS] Temporarily remove support for job rescaling via CLI action "modify"

2019-04-24 Thread Gary Yao
or now. Actually some users are not aware of that > it’s > > > still experimental, and ask quite a lot about the problem it causes. > > > > > > Best, > > > Paul Lam > > > > > > 在 2019年4月24日,14:49,Stephan Ewen 写道: > > > > > >

[DISCUSS] Temporarily remove support for job rescaling via CLI action "modify"

2019-04-23 Thread Gary Yao
Hi all, As the subject states, I am proposing to temporarily remove support for changing the parallelism of a job via the following syntax [1]: ./bin/flink modify [job-id] -p [new-parallelism] This is an experimental feature that we introduced with the first rollout of FLIP-6 (Flink 1.5).

Re: inconsistent module descriptor for hadoop uber jar (Flink 1.8)

2019-04-16 Thread Gary Yao
Hi, Can you describe how to reproduce this? Best, Gary On Mon, Apr 15, 2019 at 9:26 PM Hao Sun wrote: > Hi, I can not find the root cause of this, I think hadoop version is mixed > up between libs somehow. > > --- ERROR --- > java.text.ParseException: inconsistent module descriptor file found

Re: Where does the logs in Flink GUI's Exception tab come from?

2019-03-15 Thread Gary Yao
Hi Averell, I think I have answered your question previously [1]. The bottom line is that the error is logged on INFO level in the ExecutionGraph [2]. However, your effective log level (of the root logger) is WARN. The log levels are ordered as follows [3]: TRACE < DEBUG < INFO < WARN <

Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

2019-03-15 Thread Gary Yao
I forgot to add line numbers to the first link in my previous email: https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh#L21-L25 On Fri, Mar 15, 2019 at 8:08 AM Gary Yao wrote: > Hi Harshith, > > In the jobm

Re: Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

2019-03-15 Thread Gary Yao
1.us-east-1.abc.com:28945/user/resourcemanager > <http://fl...@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager> > under registration id 170ee6a00f80ee02ead0e88710093d77.* > > > > > > Thanks, > > Harshith > > > > *From: *Harshith Kumar Bolar

Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

2019-03-14 Thread Gary Yao
down remote daemon. > 2019-03-14 11:47:35,952 INFO > akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote > daemon shut down; proceeding with flushing remote transports. > 2019-03-14 11:47:35,959 INFO > akka.remote.RemoteActorRefProvider$RemotingTerminator

Re: Flink 1.7.2: Task Manager not able to connect to Job Manager

2019-03-14 Thread Gary Yao
Hi Harshith, Can you share JM and TM logs? Best, Gary On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith wrote: > Hi all, > > > > I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2 > > > > When I bring up the cluster, the task managers refuse to connect to the > job managers with

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Gary Yao
could test it for you. > > Without 1.8and this exit code we are essentially held up. > > On Tue, Mar 12, 2019 at 10:56 AM Gary Yao wrote: > >> Nobody can tell with 100% certainty. We want to give the RC some exposure >> first, and there is also a release process th

Re: local disk cleanup after crash

2019-03-12 Thread Gary Yao
Hi, If no other TaskManager (TM) is running, you can delete everything. If multiple TMs share the same host, as far as I know, you will have to parse TM logs to know what directories you can delete [1]. As for local recovery, tasks that were running on a crashed TM are lost. From the

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Gary Yao
AM Vishal Santoshi < > vishal.santo...@gmail.com> wrote: > >> :) That makes so much more sense. Is k8s native flink a part of this >> release ? >> >> On Tue, Mar 12, 2019 at 10:27 AM Gary Yao wrote: >> >>> Hi Vishal, >>> >>>

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Gary Yao
art of this > release ? > > On Tue, Mar 12, 2019 at 10:27 AM Gary Yao wrote: > >> Hi Vishal, >> >> This issue was fixed recently [1], and the patch will be released with >> 1.8. If >> the Flink job gets cancelled, the JVM should exit with code 0. There is a

Re: K8s job cluster and cancel and resume from a save point ?

2019-03-12 Thread Gary Yao
Hi Vishal, This issue was fixed recently [1], and the patch will be released with 1.8. If the Flink job gets cancelled, the JVM should exit with code 0. There is a release candidate [2], which you can test. Best, Gary [1] https://issues.apache.org/jira/browse/FLINK-10743 [2]

Re: submit job failed on Yarn HA

2019-03-05 Thread Gary Yao
as possible. > Best! > Sen > > 在 2019年3月5日,下午3:15,Gary Yao 写道: > > Hi Sen, > > In that email I meant that you should disable the ZooKeeper configuration > in > the CLI because the CLI had troubles resolving the leader from ZooKeeper. > What > you should have done

Re: submit job failed on Yarn HA

2019-03-04 Thread Gary Yao
docs-release-1.7/ops/deployment/yarn_setup.html#recovery-behavior-of-flink-on-yarn On Tue, Mar 5, 2019 at 7:41 AM 孙森 wrote: > Hi Gary: > I used FsStateBackend . > > > The jm log is here: > > > After restart , the log is : > > > > > Best! > Sen > &g

Re: submit job failed on Yarn HA

2019-03-04 Thread Gary Yao
/state_backends.html#available-state-backends On Mon, Mar 4, 2019 at 10:01 AM 孙森 wrote: > Hi Gary: > > > Yes, I enable the checkpoints in my program . > > 在 2019年3月4日,上午3:03,Gary Yao 写道: > > Hi Sen, > > Did you set a restart strategy [1]? If you enabled checkpo

Re: okio and okhttp not shaded in the Flink Uber Jar on EMR

2019-02-28 Thread Gary Yao
wrote: > Thanks Gary, > > I will try to look into why the child-first strategy seems to have failed > for this dependency. > > Best, > Austin > > On Wed, Feb 27, 2019 at 12:25 PM Gary Yao wrote: > >> Hi, >> >> Actually Flink's inverted class loading featu

Re: submit job failed on Yarn HA

2019-02-28 Thread Gary Yao
e rest api : http://activeRm/proxy/appId/jars > > > The all client log is in the mail attachment. > > > > > 在 2019年2月27日,下午9:30,Gary Yao 写道: > > Hi, > > How did you determine "jmhost" and "port"? Actually you do not need to > specify >

Re: okio and okhttp not shaded in the Flink Uber Jar on EMR

2019-02-27 Thread Gary Yao
Hi, Actually Flink's inverted class loading feature was designed to mitigate problems with different versions of libraries that are not compatible with each other [1]. You may want to debug why it does not work for you. You can also try to use the Hadoop free Flink distribution, and export the

Re: flink list and flink run commands timeout

2019-02-27 Thread Gary Yao
Hi Sen Sun, The question is already resolved. You can find the entire email thread here: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/flink-list-and-flink-run-commands-timeout-td22826.html Best, Gary On Wed, Feb 27, 2019 at 7:55 AM sen wrote: > Hi Aneesha: > > I

Re: submit job failed on Yarn HA

2019-02-27 Thread Gary Yao
Hi, How did you determine "jmhost" and "port"? Actually you do not need to specify these manually. If the client is using the same configuration as your cluster, the client will look up the leading JM from ZooKeeper. If you have already tried omitting the "-m" parameter, you can check in the

Re: Submitting job to Flink on yarn timesout on flip-6 1.5.x

2019-02-21 Thread Gary Yao
Hi, Beginning with Flink 1.7, you cannot use the legacy mode anymore [1][2]. I am currently working on removing references to the legacy mode in the documentation [3]. Is there any reason, you cannot use the "new mode"? Best, Gary [1] https://flink.apache.org/news/2018/11/30/release-1.7.0.html

Re: Each yarn container only use 1 vcore even if taskmanager.numberOfTaskSlots is set

2019-02-17 Thread Gary Yao
Hi Henry, If I understand you correctly, you want YARN to allocate 4 vcores per TM container. You can achieve this by enabling the FairScheduler in YARN [1][2]. Best, Gary [1] https://ci.apache.org/projects/flink/flink-docs-release-1.7/ops/config.html#yarn-containers-vcores [2]

Re: Flink 1.6 Yarn Session behavior

2019-02-17 Thread Gary Yao
machine(8 core), I have 4 nodes, > that will end up 28 taskmanagers and 1 job manager. I was wondering if this > can bring additional burden on jobmanager? Is it recommended? > > Thanks, > > Jins George > On 2/14/19 8:49 AM, Gary Yao wrote: > > Hi Jins George, > >

Re: No resource available error while testing HA

2019-02-14 Thread Gary Yao
Hi Averell, The TM containers fetch the Flink binaries and config files form HDFS (or another DFS if configured) [1]. I think you should be able to change the log level by patching the logback configuration in HDFS, and kill all Flink containers on all hosts. If you are running an HA setup, your

Re: Flink 1.6 Yarn Session behavior

2019-02-14 Thread Gary Yao
Hi Jins George, This has been asked before [1]. The bottom line is that you currently cannot pre-allocate TMs and distribute your tasks evenly. You might be able to achieve a better distribution across hosts by configuring fewer slots in your TMs. Best, Gary [1]

Re: Flink Standalone cluster - logging problem

2019-02-11 Thread Gary Yao
Hi, Are you logging from your own operator implementations, and you expect these log messages to end up in a file prefixed with XYZ-? If that is the case, modifying log4j-cli.properties will not be expedient as I wrote earlier. You should modify the log4j.properties on all hosts that are running

Re: No resource available error while testing HA

2019-02-11 Thread Gary Yao
Hi Averell, Logback has this feature [1] but is not enabled out of the box. You will have to enable the JMX agent by setting the com.sun.management.jmxremote system property [2][3]. I have not tried this out, though. Best, Gary [1] https://logback.qos.ch/manual/jmxConfig.html [2]

Re: Flink Standalone cluster - logging problem

2019-02-11 Thread Gary Yao
Hi, Can you define what you mean by "job logs"? For code that is run on the cluster, i.e., JM or TM, you should add your config to log4j.properties. The log4j-cli.properties file is only used by the Flink CLI process. Best, Gary On Mon, Feb 11, 2019 at 7:39 AM simpleusr wrote: > Hi Chesnay, >

Re: No resource available error while testing HA

2019-02-06 Thread Gary Yao
Hi Averell, That log file does not look complete. I do not see any INFO level log messages such as [1]. Best, Gary [1] https://github.com/apache/flink/blob/46326ab9181acec53d1e9e7ec8f4a26c672fec31/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L544 On Fri, Feb 1, 2019

Re: Issue setting up Flink in Kubernetes

2019-01-29 Thread Gary Yao
Hi Tim, There is an end-to-end test in the Flink repository that starts a job cluster in Kubernetes (minikube) [1]. If that does not help you, can you answer the questions below? What docker images are you using? Can you share the kubernetes resource definitions? Can you share the complete logs

Re: No resource available error while testing HA

2019-01-29 Thread Gary Yao
Hi Averell, > Is there any way to avoid this? As if I run this as an AWS EMR job, the job > would be considered failed, while it is actually be restored automatically by > YARN after 10 minutes). You are writing that it takes YARN 10 minutes to restart the application master (AM). However, in my

Re: No resource available error while testing HA

2019-01-24 Thread Gary Yao
Hi Averell, > Then I have another question: when JM cannot start/connect to the JM on .88, > why didn't it try on .82 where resource are still available? When you are deploying on YARN, the TM container placement is decided by the YARN scheduler and not by Flink. Without seeing the complete

Re: No resource available error while testing HA

2019-01-23 Thread Gary Yao
Hi Averell, What Flink version are you using? Can you attach the full logs from JM and TMs? Since Flink 1.5, the -n parameter (number of taskmanagers) should be omitted unless you are in legacy mode [1]. > As per that screenshot, it looks like there are 2 tasks manager still > running (one on

Re: ElasticSearch RestClient throws NoSuchMethodError due to shade mechanism

2019-01-18 Thread Gary Yao
Hi Henry, Can you share your pom.xml and the full stacktrace with us? It is expected behavior that org.elasticsearch.client.RestClientBuilder is not shaded. That class comes from the elasticsearch Java client, and we only shade its transitive dependencies. Could it be that you have a dependency

Re: Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Gary Yao
u like to have log entries with DEBUG enabled or > INFO would be enough? > > Thanks, > Piotr > > pt., 18 sty 2019 o 15:14 Gary Yao napisał(a): > >> Hi Piotr, >> >> What was the version you were using before 1.7.1? >> How do you deploy your cluster, e.g., YARN

Re: Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Gary Yao
Hi Piotr, What was the version you were using before 1.7.1? How do you deploy your cluster, e.g., YARN, standalone? Can you attach full TM and JM logs? Best, Gary On Fri, Jan 18, 2019 at 3:05 PM Piotr Szczepanek wrote: > Hello, > we have scenario with running Data Processing jobs that

Re: Get the savepointPath of a particular savepoint

2019-01-14 Thread Gary Yao
Hi, The API still returns the location of a completed savepoint. See the example in the Javadoc [1]. Best, Gary [1]

Re: Building Flink from source according to vendor-specific versionbut causes protobuf conflict

2019-01-11 Thread Gary Yao
Hi Wei, Did you build Flink with maven 3.2.5 as recommended in the documentation [1]? Also, did you use the -Pvendor-repos flag to add the cloudera repository when building? Best, Gary [1] https://ci.apache.org/projects/flink/flink-docs-release-1.7/flinkDev/building.html#build-flink [2]

Re: [Flink 1.7.0] always got 10s ask timeout exception when submitting job with checkpoint via REST

2019-01-10 Thread Gary Yao
Hi all, I think increasing the default value of the config option web.timeout [1] is what you are looking for. Best, Gary [1] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/RestHandlerConfiguration.java#L76 [2]

Re: How to shut down Flink Web Dashboard in detached Yarn session?

2018-12-31 Thread Gary Yao
Hi, You can use the YARN client to list all applications on your YARN cluster: yarn application -list If this does not show any running applications, the Flink cluster must have somehow terminated. If you have YARN's log aggregation enabled, you should be able to view the Flink logs by

Re: Flink issue while setting up in Kubernetes

2018-12-10 Thread Gary Yao
Hi Abhi Thakur, We need more information to help you. What docker images are you using? Can you share the kubernetes resource definitions? Can you share the complete logs of the JM and TMs? Did you follow the steps outlined in the Flink documentation [1]? Best, Gary [1]

Re: Per job cluster doesn't shut down after the job is canceled

2018-11-20 Thread Gary Yao
commands. > > And also thank you for the pointer, I’ll keep tracking this problem. > > Best, > Paul Lam > > > 在 2018年11月10日,02:10,Gary Yao 写道: > > Hi Paul, > > Can you share the complete logs, or at least the logs after invoking the > cancel command?

Re: Any examples on invoke the Flink REST API post method ?

2018-11-12 Thread Gary Yao
Hi Henry, What you see in the API documentation is a schema definition and not a sample request. The request body should be: { "target-directory": "hdfs:///flinkDsl", "cancel-job": false } Let me know if that helps. Best, Gary On Mon, Nov 12, 2018 at 7:15 AM vino yang

Re: Flink Web UI does not show specific exception messages when job submission fails

2018-11-09 Thread Gary Yao
Hi, We only propagate the exception message but not the complete stacktrace [1]. Can you create a ticket for that? Best, Gary [1]

Re: Per job cluster doesn't shut down after the job is canceled

2018-11-09 Thread Gary Yao
Hi Paul, Can you share the complete logs, or at least the logs after invoking the cancel command? If you want to debug it yourself, check if MiniDispatcher#jobReachedGloballyTerminalState [1] is invoked, and see how the jobTerminationFuture is used. Best, Gary [1]

Re: Never gets into ProcessWindowFunction.process()

2018-11-09 Thread Gary Yao
Hi, You are using event time but are you assigning watermarks [1]? I do not see it in the code. Best, Gary [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kinesis.html#event-time-for-consumed-records On Fri, Nov 9, 2018 at 6:58 PM Vijay Balakrishnan wrote: > Hi, >

Re: Never gets into ProcessWindowFunction.process()

2018-11-09 Thread Gary Yao
Hi, If the job is actually running and consuming from Kinesis, the log you posted is unrelated to your problem. To understand why the process function is not invoked, we would need to see more of your code, or you would need to provide an executable example. The log only shows that all offered

Re: User jar is present in the flink job manager's class path

2018-10-11 Thread Gary Yao
Hi, Could it be that you are submitting the job in attached mode, i.e., without -d parameter? In the "job cluster attached mode", we actually start a Flink session cluster (and stop it again from the CLI) [1]. Therefore, in attached mode, the config option "yarn.per-job-cluster.include-user-jar"

Re: Flink 1.5.2 - excessive ammount of container requests, Received new/Returning excess container "flood"

2018-10-10 Thread Gary Yao
Hi Borys, I remember that another user reported a similar issue recently [1] – attached to the ticket you can find his log file. If I recall correctly, we concluded that YARN returned the containers very quickly. At the time, Flink's debug level logs were inconclusive because we did not log the

Re: Job manager logs for previous YARN attempts

2018-10-10 Thread Gary Yao
Hi Pawel, As far as I know, the application attempt is incremented if the application master fails and a new one is brought up. Therefore, what you are seeing should not happen. I have just deployed on AWS EMR 5.17.0 (Hadoop 2.8.4) and killed the container running the application master – the

Re: Utilising EMR's master node

2018-10-06 Thread Gary Yao
Hi Averell, It is up to the YARN scheduler on which hosts the containers are started. What Flink version are you using? I assume you are using 1.4 or earlier because you are specifying a fixed number of TMs. If you launch Flink with -yn 2, you should be only seeing 2 TMs in total (not 4). Are

Re: Flink 1.5.2 - excessive ammount of container requests, Received new/Returning excess container "flood"

2018-10-06 Thread Gary Yao
Hi Borys, To debug how many containers Flink is requesting, you can look out for the log statement below [1]: Requesting new TaskExecutor container with resources [...] If you need help debugging, can you attach the full JM logs (preferably on DEBUG level)? Would it be possible for you to

Re: Utilising EMR's master node

2018-09-26 Thread Gary Yao
Hi Averell, There is no general answer to your question. If you are running more TMs, you get better isolation between different Flink jobs because one TM is backed by one JVM [1]. However, every TMs brings additional overhead (heartbeating, running more threads, etc.) [1]. It also depends on the

Re: "405 HTTP method POST is not supported by this URL" is returned when POST a REST url on Flink on yarn

2018-09-26 Thread Gary Yao
Hi Henry, The URL below looks like the one from the YARN proxy (note that "proxy" appears in the URL): http://storage4.test.lan:8089/proxy/application_1532081306755_0124/jobs/269608367e0f548f30d98aa4efa2211e/savepoints You can use yarn application -status to find the host and port of

Re: Utilising EMR's master node

2018-09-18 Thread Gary Yao
Hi Averell, Flink compares the number of user selected vcores to the vcores configured in the yarn-site.xml of the submitting node, i.e., in your case the master node. If there are not enough configured vcores, the client throws an exception. This behavior is not ideal and I found an old JIRA

Re: Utilising EMR's master node

2018-09-17 Thread Gary Yao
Hi Averell, According to the AWS documentation [1], the master node only runs the YARN ResourceManager and the HDFS NameNode. Containers can only by launched on nodes that are running the YARN NodeManager [2]. Therefore, if you want TMs or JMs to be launched on your EMR master node, you have to

Re: Create a file in parquet format

2018-09-11 Thread Gary Yao
Hi Jose, You can find an example here: https://github.com/apache/flink/blob/1a94c2094b8045a717a92e232f9891b23120e0f2/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/avro/ParquetStreamingFileSinkITCase.java#L58 Best, Gary On Tue, Sep 11, 2018 at 11:59 AM, jose farfan

Re: How to get Cluster metrics in FLIP-6 mode

2018-09-11 Thread Gary Yao
Hi Tony, You are right that these metrics are missing. There is already a ticket for that [1]. At the moment you can obtain these information from the REST API (/overview) [2]. Since FLIP-6, the JM is no longer responsible for these metrics but for backwards compatibility we can leave them in

Re: JAXB Classloading errors when using PMML Library (AWS EMR Flink 1.4.2)

2018-09-11 Thread Gary Yao
Hi, Do you also have pmml-model-moxy as a dependency in your job? Using mvn dependency:tree, I do not see that pmml-evaluator has a compile time dependency on jaxb-api. The jaxb-api dependency actually comes from pmml- model-moxy. The exclusion should be therefore defined on pmml-model-moxy. You

Re: Question about akka configuration for FLIP-6

2018-09-10 Thread Gary Yao
mode. > > 1. akka.transport.heartbeat.interval > 2. akka.transport.heartbeat.pause > > It seems they are different from HeartbeatServices and possibly still > valid. > > Best, > tison. > > > Gary Yao 于2018年9月10日周一 下午1:50写道: > >> I should add th

Re: Question about akka configuration for FLIP-6

2018-09-09 Thread Gary Yao
-akka On Mon, Sep 10, 2018 at 7:38 AM, Gary Yao wrote: > Hi Tony, > > You are right that with FLIP-6 Akka is abstracted away. If you want custom > heartbeat settings, you can configure the options below [1]: > > heatbeat.interval > heartbeat.timeout > > The con

Re: Question about akka configuration for FLIP-6

2018-09-09 Thread Gary Yao
Hi Tony, You are right that with FLIP-6 Akka is abstracted away. If you want custom heartbeat settings, you can configure the options below [1]: heatbeat.interval heartbeat.timeout The config option taskmanager.exit-on-fatal-akka-error is also not relevant anymore. I closest I can think

Re: Cancel flink job occur exception

2018-09-08 Thread Gary Yao
Hi all, The question is being handled on the dev mailing list [1]. Best, Gary [1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Cancel-flink-job-occur-exception-td24056.html On Tue, Sep 4, 2018 at 2:21 PM, rileyli(李瑞亮) wrote: > Hi all, > I submit a flink job through

Re: Setting Flink Monitoring API Port on YARN Cluster

2018-09-07 Thread Gary Yao
Hi Austin, The config options rest.port, jobmanager.web.port, etc. are intentionally ignored on YARN. The port should be chosen randomly to avoid conflicts with other containers [1]. I do not see a way how you can set a fixed port at the moment but there is a related ticket for that [2]. The

Re: flink list and flink run commands timeout

2018-09-05 Thread Gary Yao
Hi Jason, >From the stacktrace it seems that you are using the 1.4.0 client to list jobs on a 1.5.x cluster. This will not work. You have to use the 1.5.x client. Best, Gary On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania wrote: > Hello, > > Thanks for the response. I had already tried setting

Re: AskTimeoutException when canceling job with savepoint on flink 1.6.0

2018-09-05 Thread Gary Yao
Hi Jelmer, I saw that you have already found the JIRA issue tracking this problem [1] but I will still answer on the mailing list for transparency. The timeout for "cancel with savepoint" should be RpcUtils.INF_TIMEOUT. Unfortunately Flink is currently not respecting this timeout. A pull request

Re: Flink on Yarn, restart job will not destroy original task manager

2018-09-03 Thread Gary Yao
In 1.5.0, I’ve not enable local > recovery. > > > > So whether I need manual disable local recovery via flink.conf? > > > > Regards > > > > James > > > > *From: *"James (Jian Wu) [FDS Data Platform]" > *Date: *Monday, September 3, 20

Re: Flink on Yarn, restart job will not destroy original task manager

2018-09-03 Thread Gary Yao
Hi James, What version of Flink are you running? In 1.5.0, tasks can spread out due to changes that were introduced to support "local recovery". There is a mitigation in 1.5.1 that prevents task spread out but local recovery must be disabled [2]. Best, Gary [1]

Re: akka.ask.timeout setting not honored

2018-08-31 Thread Gary Yao
t; and reply again if the problem comes up again. Thanks again for your quick >> response! >> >> On Fri, Aug 31, 2018 at 1:38 PM Gary Yao wrote: >> >>> Hi Greg, >>> >>> Can you describe the steps to reproduce the problem, or can you attach >&g

Re: akka.ask.timeout setting not honored

2018-08-31 Thread Gary Yao
Hi Greg, Can you describe the steps to reproduce the problem, or can you attach the full jobmanager logs? Because JobExecutionResultHandler appears in your log, I assume that you are starting a job cluster on YARN. Without seeing the complete logs, I cannot be sure what exactly happens. For now,

Re: getting error while running flink job on yarn cluster

2018-08-21 Thread Gary Yao
Hi Yubraj Singh, Can you try submitting with HADOOP_CLASSPATH=`hadoop classpath` set? [1] For example: HADOOP_CLASSPATH=`hadoop classpath` bin/flink run [...] Best, Gary [1]

  1   2   >