Re: Are there pipeline APIs for ETL?

2020-01-10 Thread kant kodali
Hi Vino, Another use case would be: I want to build a DAG of batch sources, sinks, and transforms, and schedule the jobs periodically. One could say it is similar to Airflow, but a Flink API would be a lot better!

Re: Are there pipeline APIs for ETL?

2020-01-10 Thread kant kodali
Hi Vino, I am new to Flink. I was thinking more of a DAG-builder API where I can build a DAG of sources, sinks, and transforms, and hopefully Flink takes care of the entire life cycle of the DAG. An example would be the CDAP pipeline API.

Re: Are there pipeline APIs for ETL?

2020-01-10 Thread vino yang
Hi kant, Can you provide more context about your question? What do you mean by "pipeline API"? IMO, you can build an ETL pipeline by composing several Flink transform APIs. Which transform APIs to choose depends on your business logic. Here is the list of the generic APIs. [1] Best, Vino
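For illustration, the "compose transforms" idea (and the DAG-builder style API kant asks about) can be sketched with a toy pipeline helper. This is a hypothetical sketch, not a Flink or CDAP API: the class `TinyEtl` and its `run` method are invented here purely to show the source -> transform -> sink shape.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Toy ETL sketch (hypothetical API): source -> transform -> sink. */
public class TinyEtl {

    // Pull records from a "source" list, apply one transform, hand the
    // resulting batch to a "sink" callback, and return it for inspection.
    public static <I, O> List<O> run(List<I> source,
                                     Function<I, O> transform,
                                     Consumer<List<O>> sink) {
        List<O> out = source.stream().map(transform).collect(Collectors.toList());
        sink.accept(out);
        return out;
    }

    public static void main(String[] args) {
        List<Integer> result = run(
                List.of("1", "2", "3"),              // source: raw strings
                Integer::parseInt,                    // transform: parse to int
                batch -> System.out.println(batch));  // sink: print the batch
        System.out.println(result); // [1, 2, 3]
    }
}
```

A real Flink ETL job follows the same shape, with `env.addSource(...)`, chained transformations, and `addSink(...)`, while Flink manages the life cycle of the resulting dataflow graph.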

Are there pipeline APIs for ETL?

2020-01-10 Thread kant kodali
Hi All, I am wondering if there are pipeline APIs for ETL? Thanks!

Re: Re: flink savepoint checkpoint

2020-01-10 Thread Px New
Hello, regarding your question I noticed something interesting. After I start a job in YARN per-job mode, I can see my task in YARN's resource management UI - it has its own application id. Then, when I go through YARN's Tracking UI, click Application Master to enter the job's web UI (Flink's web UI), and kill the program via the cancel button on that page, YARN's management UI still shows an empty shell for the application.

Re: How can I find out which key group belongs to which subtask

2020-01-10 Thread 杨东晓
Thanks Till, I will run some tests on this. Will this become a public feature in the next release or later? Till Rohrmann wrote on Friday, Jan 10, 2020 at 6:15 AM: > Hi, > > you would need to set the co-location constraint in order to ensure that > the sub-tasks of operators are deployed to the same

Re: How can I find out which key group belongs to which subtask

2020-01-10 Thread 杨东晓
Thanks Zhijiang, looks like serialization will always be there in a keyed stream. Zhijiang wrote on Friday, Jan 10, 2020 at 12:08 AM: > Only chained operators can avoid record serialization cost, but the > chaining mode can not support keyed stream. > If you want to deploy downstream with upstream in the same task

Re: Please suggest helpful tools

2020-01-10 Thread Eva Eva
Thank you both for the suggestions. I did a bit more analysis using the UI and identified at least one problem that's occurring with the job right now. Going to fix it first and then take it from there. *Problem that I identified:* I'm running with parallelism 26. For the checkpoints that are expiring, one

Re: Yarn Kerberos issue

2020-01-10 Thread Juan Gentile
The error we get is the following: org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:423) at

Re: StreamingFileSink doesn't close multipart uploads to s3?

2020-01-10 Thread Ken Krugler
Hi Kostas, I didn't see a follow-up to this, and have also run into this same issue of winding up with a bunch of .inprogress files when a bounded input stream ends and the job terminates. When StreamingFileSink.close() is called, shouldn't all buckets get auto-rolled, so that the

Apache Flink - Sharing state in processors

2020-01-10 Thread M Singh
Hi: I have a few questions about how state is shared between processors in Flink. 1. If I have a processor instantiated in the Flink app and use it multiple times in the Flink -     (a) if the tasks are in the same slot - do they share the same processor on the taskmanager?     (b) if the

Re: Running Flink on java 11

2020-01-10 Thread Chesnay Schepler
The error you got is due to an older asm version which is fixed for 1.10 in https://issues.apache.org/jira/browse/FLINK-13467 . On 10/01/2020 15:58, KristoffSC wrote: Hi, Yangze Guo, Chesnay Schepler thank you very much for your answers. I have actually a funny setup. So I have a Flink Job

Re: Running Flink on java 11

2020-01-10 Thread KristoffSC
Hi, Yangze Guo, Chesnay Schepler, thank you very much for your answers. I actually have a funny setup. So I have a Flink job module, generated from Flink's Maven archetype. This module has all operators and the Flink environment config and execution. This module is compiled by Maven with

Re: [Question] Failed to submit flink job to secure yarn cluster

2020-01-10 Thread Ethan Li
Hi Yangze, Thanks for your reply. Those are the docs I have read and followed. (I was also able to set up a standalone Flink cluster with secure HDFS, Zookeeper and Kafka.) Could you please let me know what I am missing? Thanks Best, Ethan

Re: Yarn Kerberos issue

2020-01-10 Thread Aljoscha Krettek
Hi, Interesting! What problem are you seeing when you don't unset that environment variable? From reading UserGroupInformation.java our code should almost work when that environment variable is set. Best, Aljoscha On 10.01.20 15:23, Juan Gentile wrote: Hello Aljoscha! The way we send

Re: Yarn Kerberos issue

2020-01-10 Thread Juan Gentile
Hello Aljoscha! The way we send the DTs to Spark is by setting an env variable (HADOOP_TOKEN_FILE_LOCATION), which, interestingly, we have to unset when we run Flink, because even if we do kinit, that variable somehow affects Flink and it doesn't work. I'm not an expert, but what you describe (We

Re: How can I find out which key group belongs to which subtask

2020-01-10 Thread Till Rohrmann
Hi, you would need to set the co-location constraint in order to ensure that the sub-tasks of operators are deployed to the same machine. It effectively means that subtasks a_i, b_i of operator a and b will be deployed to the same slot. This feature is not super well exposed but you can take a
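On the thread's original question - which key group belongs to which subtask - the mapping itself is plain integer arithmetic. The sketch below replicates the formulas from Flink's `KeyGroupRangeAssignment` (the step that maps a key to a key group via a murmur hash of `key.hashCode()` is omitted here), as a best-effort reading of the source, not an official API:

```java
/** Sketch of Flink's key-group-to-subtask arithmetic (KeyGroupRangeAssignment). */
public class KeyGroupMath {

    // Which operator subtask a given key group is assigned to.
    static int operatorIndexForKeyGroup(int maxParallelism, int parallelism, int keyGroupId) {
        return keyGroupId * parallelism / maxParallelism;
    }

    // The inclusive [start, end] key-group range owned by one subtask.
    static int[] keyGroupRangeForOperator(int maxParallelism, int parallelism, int operatorIndex) {
        int start = (operatorIndex * maxParallelism + parallelism - 1) / parallelism;
        int end = ((operatorIndex + 1) * maxParallelism - 1) / parallelism;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        int maxP = 128, p = 4; // default maxParallelism 128, job parallelism 4
        System.out.println(operatorIndexForKeyGroup(maxP, p, 100)); // 3
        int[] r = keyGroupRangeForOperator(maxP, p, 3);
        System.out.println(r[0] + "-" + r[1]); // 96-127
    }
}
```

So for a fixed maxParallelism you can predict, per subtask, exactly which key groups it owns - which is what makes co-locating the subtasks of two keyed operators meaningful.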

Re: Yarn Kerberos issue

2020-01-10 Thread Aljoscha Krettek
Hi, it seems I hit send too early; my mail was missing a small part. This is the full mail again: to summarize and clarify various emails: currently, you can only use Kerberos authentication via tickets (i.e. kinit) or keytab. The relevant bit of code is in the Hadoop security module [1].
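For reference, the keytab path Aljoscha mentions is configured in flink-conf.yaml via the `security.kerberos.login.*` options from the Flink security docs; the paths and principal below are placeholders:

```yaml
# flink-conf.yaml: Kerberos login via ticket cache and/or keytab
security.kerberos.login.use-ticket-cache: true
security.kerberos.login.keytab: /path/to/user.keytab
security.kerberos.login.principal: user@EXAMPLE.COM
```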

Re: Yarn Kerberos issue

2020-01-10 Thread Aljoscha Krettek
Hi Juan, to summarize and clarify various emails: currently, you can only use Kerberos authentication via tickets (i.e. kinit) or keytab. The relevant bit of code is in the Hadoop security module: [1]. Here you can see that we either use keytab. I think we should be able to extend this to

Re: Please suggest helpful tools

2020-01-10 Thread Congxian Qiu
Hi, For an expired checkpoint, you can find something like "Checkpoint xxx of job xx expired before completing" in jobmanager.log; then you can go to the checkpoint UI to find which tasks did not ack, and look at those tasks to see what happened. If the checkpoint was declined, you can find
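Congxian's log search can be done with a simple grep. The sample log line below is fabricated for illustration (it mirrors the message quoted above); the real line will carry your actual checkpoint and job ids:

```shell
# Write a sample jobmanager.log line so the command is self-contained,
# then search for the expired-checkpoint message.
printf 'Checkpoint 42 of job ab12cd expired before completing\n' > jobmanager.log
grep 'expired before completing' jobmanager.log
```

From the matching checkpoint id, the checkpoint tab of the web UI shows which subtasks never acknowledged it.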

Re: [Question] Failed to submit flink job to secure yarn cluster

2020-01-10 Thread Yangze Guo
Hi, Ethan You could first check your cluster following this guide [1] and check whether all the related config [2] is set correctly. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/ops/security-kerberos.html [2]

Re: Flink Job cluster scalability

2020-01-10 Thread Yangze Guo
Hi KristoffSC As Zhu said, Flink enables slot sharing [1] by default. This feature has nothing to do with the resources of your cluster; the benefit of this feature is described in [1] as well. I mean, it will not detect how many slots are in your cluster and adjust its behavior toward this number. If you
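The practical effect of slot sharing on slot demand is simple arithmetic: with sharing (the default, all operators in one sharing group), the job needs as many slots as its highest operator parallelism; with every operator in its own group, it needs the sum. A small sketch (the operator parallelisms are made-up example numbers):

```java
/** Sketch of slot demand with and without slot sharing. */
public class SlotMath {

    // Default slot sharing: one slot can hold one subtask of each operator,
    // so the job needs max(parallelism) slots.
    static int slotsWithSharing(int[] operatorParallelisms) {
        int max = 0;
        for (int p : operatorParallelisms) max = Math.max(max, p);
        return max;
    }

    // Each operator in its own sharing group: subtasks cannot share slots,
    // so the job needs sum(parallelism) slots.
    static int slotsWithoutSharing(int[] operatorParallelisms) {
        int sum = 0;
        for (int p : operatorParallelisms) sum += p;
        return sum;
    }

    public static void main(String[] args) {
        int[] ops = {6, 6, 2}; // e.g. source=6, map=6, sink=2
        System.out.println(slotsWithSharing(ops));    // 6
        System.out.println(slotsWithoutSharing(ops)); // 14
    }
}
```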

Re: Running Flink on java 11

2020-01-10 Thread Chesnay Schepler
In regards to what we test: We run our tests against Java 8 *and *Java 11, with the compilation and testing being done with the same JDK. In other words, we don't check whether Flink compiled with JDK 8 runs on JDK 11, but we currently have no reason to believe that there is a problem (and

Re: Running Flink on java 11

2020-01-10 Thread Yangze Guo
Hi Krzysztof All the tests run with Java 11 after FLINK-13457 [1]. Its fix version is set to 1.10, so I fear 1.9.1 is not guaranteed to run on Java 11. I suggest you wait for release-1.10. [1] https://issues.apache.org/jira/browse/FLINK-13457 Best, Yangze Guo On Fri, Jan 10, 2020

Re: Please suggest helpful tools

2020-01-10 Thread Yun Tang
Hi Eva If checkpoint failed, please view the web UI or jobmanager log to see why checkpoint failed, might be declined by some specific task. If checkpoint expired, you can also access the web UI to see which tasks did not respond in time, some hot task might not be able to respond in time.

Re: Flink Job cluster scalability

2020-01-10 Thread KristoffSC
Hi Zhu Zhu, well, in my last test I did not change the job config, so I did not change the parallelism level of any operator, and I did not change the policy regarding slot sharing (it stays as the default). Operator chaining is set to true without any extra actions like "start new chain, disable chain

Re: Re: flink savepoint checkpoint

2020-01-10 Thread muyexm329
Actually this mainly depends on your checkpoint interval. Like rewinding a video, they are two different rewind points: a savepoint produces checkpoint data at the current moment, while the automatic checkpoint may have produced its data at an earlier point (because when you cancel the job, the next automatic checkpoint time may not have been reached yet). Both work; one is just earlier than the other, which determines how fast your job recovers. For jobs that need frequent changes in production, savepoints are very useful. Personally I think that when a job fails, no matter how it fails (unless via savepoint), it will definitely go back to the last automatic checkpoint, not the savepoint.

Re: Re: flink savepoint checkpoint

2020-01-10 Thread amen...@163.com
Hi, I understand that using stop to stop a job and trigger a savepoint generates a max_watermark before stopping and registers event-time timers. I'd like to ask: if I stop the job directly with yarn kill, does that count as cancel, stop, or something else? amen...@163.com From: Congxian Qiu Date: 2020-01-10 17:16 To: user-zh Subject: Re: flink savepoint checkpoint Hi, from Flink's perspective, checkpoints are used to recover from a failover during job execution, and savepoints

Re: flink savepoint checkpoint

2020-01-10 Thread Congxian Qiu
Hi, From Flink's perspective, checkpoints are used to recover from a failover during job execution, while savepoints are used to reuse state across jobs. Also, starting from 1.9, you can try StopWithSavepoint [1]; the community also has another issue attempting StopWithCheckpoint [2]. [1] https://issues.apache.org/jira/browse/FLINK-11458 [2] https://issues.apache.org/jira/browse/FLINK-12619 Best, Congxian zhisheng
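The stop-with-savepoint flow (FLINK-11458) and plain savepoint triggering are exposed through the Flink CLI; in the sketch below the job id and target directory are placeholders:

```
# Stop the job gracefully and write a savepoint (Flink >= 1.9)
bin/flink stop -p hdfs:///flink/savepoints <jobId>

# Or trigger a savepoint without stopping the job
bin/flink savepoint <jobId> hdfs:///flink/savepoints
```

A job killed via `yarn application -kill`, by contrast, goes through neither path, so on restart it can only restore from the last retained automatic checkpoint.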

Re: Running Flink on java 11

2020-01-10 Thread Krzysztof Chmielewski
Hi, Thank you for your answer. Btw, it seems that you sent the reply only to my address and not to the mailing list :) I'm looking forward to trying out 1.10-rc then. Regarding the second thing you wrote, that *"on Java 11, all the tests (including end to end tests) would be run with the Java 11 profile

Re: Null result cannot be used for atomic types

2020-01-10 Thread Jingsong Li
Hi sunfulin, Looks like the error happened in the sink instead of the source. Caused by: java.lang.NullPointerException: Null result cannot be used for atomic types. at DataStreamSinkConversion$5.map(Unknown Source) So the point is how you wrote to the sink. Can you share that code? Best,

Re: Elasticsink sometimes gives NoClassDefFoundError

2020-01-10 Thread Arvid Heise
I don't see a particular reason why you see this behavior. Chesnay's explanation is the only plausible way that this behavior can happen. I fear that without a specific log, we cannot help further. On Fri, Jan 10, 2020 at 5:06 AM Jayant Ameta wrote: > Also, the ES version I'm using is 5.6.7 >

Re: How can I find out which key group belongs to which subtask

2020-01-10 Thread Zhijiang
Only chained operators can avoid record serialization cost, but chaining does not support keyed streams. If you want to deploy the downstream operator on the same task manager as the upstream one, you can avoid the network shuffle cost, which still gives performance benefits. As I know, @Till Rohrmann has