[jira] [Created] (FLINK-3964) Job submission times out with recursive.file.enumeration
Juho Autio created FLINK-3964:
---------------------------------

             Summary: Job submission times out with recursive.file.enumeration
                 Key: FLINK-3964
                 URL: https://issues.apache.org/jira/browse/FLINK-3964
             Project: Flink
          Issue Type: Bug
            Reporter: Juho Autio


When using {{recursive.file.enumeration}} with a big enough folder structure to list, the Flink batch job fails right at the beginning because of a timeout.

h2. Problem details

We get this error: {{Communication with JobManager failed: Job submission to the JobManager timed out}}.

The code we have is basically this:

{code}
val env = ExecutionEnvironment.getExecutionEnvironment

val parameters = new Configuration
// set the recursive enumeration parameter
parameters.setBoolean("recursive.file.enumeration", true)

val parameter = ParameterTool.fromArgs(args)
val input_data_path: String = parameter.get("input_data_path", null)

val data: DataSet[(Text, Text)] =
  env.readSequenceFile(classOf[Text], classOf[Text], input_data_path)
    .withParameters(parameters)

data.first(10).print
{code}

If we set the {{input_data_path}} parameter to {{s3n://bucket/path/date=*/}}, it times out. If we use a more restrictive pattern like {{s3n://bucket/path/date=20160523/}}, it doesn't time out. To me it seems that the time taken to list files shouldn't cause any timeouts at the job submission level.

For us this was "fixed" by adding {{akka.client.timeout: 600 s}} to {{flink-conf.yaml}}, but I wonder if the timeout would still occur if we had even more files to list?

P.S. Is there any way to set {{akka.client.timeout}} when calling {{bin/flink run}} instead of editing {{flink-conf.yaml}}? I tried to add it as a {{-yD}} flag but couldn't get it working.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
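For the record, the workaround described in FLINK-3964 is a one-line change in {{flink-conf.yaml}}; the 600 s value is just what happened to be enough for that folder structure, not a recommended default:

```yaml
# Raise the client-side job submission timeout so that slow
# recursive file enumeration does not abort the submission.
akka.client.timeout: 600 s
```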
[jira] [Created] (FLINK-9335) Expose client logs via Flink UI
Juho Autio created FLINK-9335:
---------------------------------

             Summary: Expose client logs via Flink UI
                 Key: FLINK-9335
                 URL: https://issues.apache.org/jira/browse/FLINK-9335
             Project: Flink
          Issue Type: Improvement
            Reporter: Juho Autio


The logs logged by my Flink job jar _before_ {{env.execute}} can't be found in the jobmanager log in the Flink UI. In my case they seem to be going to {{/home/hadoop/flink-1.5-SNAPSHOT/log/flink-hadoop-client-ip-10-0-10-29.log}}, for example.

[~fhueske] said:
{quote}It seems like good feature request to include the client logs.{quote}

Implementing this may not be as trivial as just reading another log file, though. As [~fhueske] commented:
{quote}I assume that these logs are generated from a different process, i.e., the client process and not the JM or TM process. Hence, they end up in a different log file and are not covered by the log collection of the UI. The reason is that *this process might also be run on a machine outside of the cluster. So the client log file might not be accessible from the UI process*.{quote}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (FLINK-9334) Docs to have a code snippet of Kafka partition discovery
Juho Autio created FLINK-9334:
---------------------------------

             Summary: Docs to have a code snippet of Kafka partition discovery
                 Key: FLINK-9334
                 URL: https://issues.apache.org/jira/browse/FLINK-9334
             Project: Flink
          Issue Type: Improvement
            Reporter: Juho Autio


Tzu-Li (Gordon) said:
{quote}Yes, it might be helpful to have a code snippet to demonstrate the configuration for partition discovery.{quote}

The docs correctly say:
{quote}To enable it, set a non-negative value for {{flink.partition-discovery.interval-millis}} in the _provided properties config_{quote}

So it should be set in the {{Properties}} that are passed to the constructor of {{FlinkKafkaConsumer}}. I had somehow assumed that this should go into {{flink-conf.yaml}} (maybe because it starts with "flink."?), and obviously the {{FlinkKafkaConsumer}} doesn't read that. A piece of example code might've helped me avoid this mistake.

This was discussed on the user mailing list: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kafka-Consumers-Partition-Discovery-doesn-t-work-tp19129p19484.html

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
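A sketch of what the requested snippet could look like. The bootstrap servers, group id, and 10-second interval below are placeholder values; the property key is the one quoted from the docs, and the consumer construction is shown as a comment because the exact constructor depends on the connector version:

```java
import java.util.Properties;

public class PartitionDiscoveryProps {

    public static Properties consumerProperties() {
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        properties.setProperty("group.id", "my-consumer-group");       // placeholder
        // Partition discovery is enabled here, in the consumer Properties
        // (not in flink-conf.yaml):
        properties.setProperty("flink.partition-discovery.interval-millis", "10000");
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = consumerProperties();
        // The Properties would then be passed to the consumer constructor, e.g.:
        //   new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), properties)
        System.out.println(properties.getProperty("flink.partition-discovery.interval-millis"));
    }
}
```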
[jira] [Created] (FLINK-10216) Add REGEXP_MATCH in TableAPI and SQL
Juho Autio created FLINK-10216:
----------------------------------

             Summary: Add REGEXP_MATCH in TableAPI and SQL
                 Key: FLINK-10216
                 URL: https://issues.apache.org/jira/browse/FLINK-10216
             Project: Flink
          Issue Type: Sub-task
            Reporter: Juho Autio


Here's a naive implementation:

{code:java}
public class RegexpMatchFunction extends ScalarFunction {

    // NOTE! Flink calls eval() by reflection
    public boolean eval(String value, String pattern) {
        return value != null && pattern != null && value.matches(pattern);
    }
}
{code}

I wonder if there would be a way to optimize this to use {{Pattern.compile(pattern)}} and reuse the compiled {{Pattern}} for multiple calls (possibly different values, but the same pattern).

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
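One possible sketch of that optimization, assuming the pattern string rarely changes between calls. The caching logic is shown standalone here; in the actual UDF it would live in the class extending {{ScalarFunction}}, and the class name is made up for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical sketch: cache the most recently compiled Pattern so that
// repeated calls with the same pattern string skip recompilation.
public class CachingRegexpMatch {

    private String lastPattern;
    private Pattern compiled;

    public boolean eval(String value, String pattern) {
        if (value == null || pattern == null) {
            return false;
        }
        if (compiled == null || !pattern.equals(lastPattern)) {
            compiled = Pattern.compile(pattern);
            lastPattern = pattern;
        }
        return compiled.matcher(value).matches();
    }
}
```

Note that with per-record varying patterns this caches nothing; a bounded map keyed by pattern string would be the next step in that case.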
[jira] [Created] (FLINK-9184) Remove warning from kafka consumer docs
Juho Autio created FLINK-9184:
---------------------------------

             Summary: Remove warning from kafka consumer docs
                 Key: FLINK-9184
                 URL: https://issues.apache.org/jira/browse/FLINK-9184
             Project: Flink
          Issue Type: Sub-task
            Reporter: Juho Autio


Once the main problem of FLINK-5479 has been fixed, remove the warning about idle partitions from the docs.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (FLINK-9183) Kafka consumer docs to warn about idle partitions
Juho Autio created FLINK-9183:
---------------------------------

             Summary: Kafka consumer docs to warn about idle partitions
                 Key: FLINK-9183
                 URL: https://issues.apache.org/jira/browse/FLINK-9183
             Project: Flink
          Issue Type: Sub-task
            Reporter: Juho Autio


It looks like the bug FLINK-5479 entirely prevents {{FlinkKafkaConsumerBase#assignTimestampsAndWatermarks}} from being used if there are any idle partitions. It would be nice to mention in the documentation that this currently requires all subscribed partitions to have a constant stream of data with growing timestamps. When the watermark gets stalled on an idle partition, it blocks everything.

Link to current documentation: [https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/kafka.html#kafka-consumers-and-timestamp-extractionwatermark-emission]

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)