[jira] [Created] (FLINK-3964) Job submission times out with recursive.file.enumeration

2016-05-24 Thread Juho Autio (JIRA)
Juho Autio created FLINK-3964:
-

 Summary: Job submission times out with recursive.file.enumeration
 Key: FLINK-3964
 URL: https://issues.apache.org/jira/browse/FLINK-3964
 Project: Flink
  Issue Type: Bug
Reporter: Juho Autio


When using "recursive.file.enumeration" with a big enough folder structure to 
list, flink batch job fails right at the beginning because of a timeout.

h2. Problem details

We get this error: {{Communication with JobManager failed: Job submission to 
the JobManager timed out}}.

The code we have is basically this:

{code}
import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import org.apache.hadoop.io.Text

val env = ExecutionEnvironment.getExecutionEnvironment

val parameters = new Configuration
// set the recursive enumeration parameter
parameters.setBoolean("recursive.file.enumeration", true)

val parameter = ParameterTool.fromArgs(args)
val input_data_path: String = parameter.get("input_data_path", null)

val data: DataSet[(Text, Text)] =
  env.readSequenceFile(classOf[Text], classOf[Text], input_data_path)
    .withParameters(parameters)

data.first(10).print
{code}

If we set the {{input_data_path}} parameter to {{s3n://bucket/path/date=*/}}, it 
times out. If we use a more restrictive pattern like 
{{s3n://bucket/path/date=20160523/}}, it doesn't time out.

To me it seems that the time taken to list files shouldn't cause any timeouts at 
the job submission level.

For us this was "fixed" by adding {{akka.client.timeout: 600 s}} in 
{{flink-conf.yaml}}, but I wonder if the timeout would still occur if we have 
even more files to list?
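
For reference, the workaround in {{flink-conf.yaml}}:

{code}
# raise the client-side job submission timeout
akka.client.timeout: 600 s
{code}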



P.S. Is there any way to set {{akka.client.timeout}} when calling {{bin/flink 
run}} instead of editing {{flink-conf.yaml}}? I tried to add it as a {{-yD}} 
flag but couldn't get it working.





[jira] [Created] (FLINK-9335) Expose client logs via Flink UI

2018-05-11 Thread Juho Autio (JIRA)
Juho Autio created FLINK-9335:
-

 Summary: Expose client logs via Flink UI
 Key: FLINK-9335
 URL: https://issues.apache.org/jira/browse/FLINK-9335
 Project: Flink
  Issue Type: Improvement
Reporter: Juho Autio


The logs written by my Flink job jar _before_ *env.execute* can't be found in 
the jobmanager log in the Flink UI.

In my case they seem to be going to 
{{/home/hadoop/flink-1.5-SNAPSHOT/log/flink-hadoop-client-ip-10-0-10-29.log}}, 
for example.
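
For illustration, a minimal sketch (the class, logger usage and example job are made up) of which process logs what:

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ClientLogExample {
    private static final Logger LOG = LoggerFactory.getLogger(ClientLogExample.class);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Runs in the client JVM: ends up in the flink-*-client-*.log file,
        // not in the jobmanager log that the UI collects
        LOG.info("Preparing the job...");

        env.fromElements(1, 2, 3).print();

        // Only from here on does the job run on the cluster (JM/TM logs)
        env.execute("Client log example");
    }
}
{code}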

[~fhueske] said:
{quote}It seems like good feature request to include the client logs.{quote}
Implementing this may not be as trivial as just reading another log file 
though. As [~fhueske] commented:
{quote}I assume that these logs are generated from a different process, i.e., 
the client process and not the JM or TM process.
Hence, they end up in a different log file and are not covered by the log 
collection of the UI.
The reason is that *this process might also be run on a machine outside of the 
cluster. So the client log file might not be accessible from the UI process*. 
{quote}





[jira] [Created] (FLINK-9334) Docs to have a code snippet of Kafka partition discovery

2018-05-11 Thread Juho Autio (JIRA)
Juho Autio created FLINK-9334:
-

 Summary: Docs to have a code snippet of Kafka partition discovery
 Key: FLINK-9334
 URL: https://issues.apache.org/jira/browse/FLINK-9334
 Project: Flink
  Issue Type: Improvement
Reporter: Juho Autio


Tzu-Li (Gordon) said:
{quote}
Yes, it might be helpful to have a code snippet to demonstrate the 
configuration for partition discovery.
{quote}
 
 
The docs correctly say:
 
{quote}
To enable it, set a non-negative value for 
+flink.partition-discovery.interval-millis+ in the _provided properties config_
{quote}
 
So it should be set in the {{Properties}} that are passed to the constructor of 
{{FlinkKafkaConsumer}}.
 
I had somehow assumed that this should go in {{flink-conf.yaml}} (maybe because it 
starts with "flink."?), and obviously the {{FlinkKafkaConsumer}} doesn't read that.
 
A piece of example code might've helped me avoid this mistake.
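
Something like this, maybe (broker address, group id and topic name are placeholders; just a sketch, not copied from the docs):

{code:java}
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class PartitionDiscoveryExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "my-group");                // placeholder

        // Partition discovery is enabled in the consumer Properties, NOT in
        // flink-conf.yaml: check for new partitions every 30 seconds
        props.setProperty("flink.partition-discovery.interval-millis", "30000");

        env.addSource(new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props))
                .print();

        env.execute("Partition discovery example");
    }
}
{code}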
 
 
This was discussed on the user mailing list:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Kafka-Consumers-Partition-Discovery-doesn-t-work-tp19129p19484.html





[jira] [Created] (FLINK-10216) Add REGEXP_MATCH in TableAPI and SQL

2018-08-25 Thread Juho Autio (JIRA)
Juho Autio created FLINK-10216:
--

 Summary: Add REGEXP_MATCH in TableAPI and SQL
 Key: FLINK-10216
 URL: https://issues.apache.org/jira/browse/FLINK-10216
 Project: Flink
  Issue Type: Sub-task
Reporter: Juho Autio


Here's a naive implementation:

{code:java}
import org.apache.flink.table.functions.ScalarFunction;

public class RegexpMatchFunction extends ScalarFunction {
    // NOTE! Flink calls eval() by reflection
    public boolean eval(String value, String pattern) {
        return value != null && pattern != null && value.matches(pattern);
    }
}
{code}

I wonder if there would be a way to optimize this to use 
{{Pattern.compile(pattern)}} and reuse the compiled {{Pattern}} for multiple calls 
(possibly different values, but the same pattern).
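
For example, a naive single-entry cache inside the function might look roughly like this (just a sketch of the idea, not a settled design):

{code:java}
import java.util.regex.Pattern;

import org.apache.flink.table.functions.ScalarFunction;

public class RegexpMatchFunction extends ScalarFunction {

    // Cache the last compiled pattern; in practice the pattern is usually a
    // constant in the query, so this avoids recompiling it for every row.
    private transient String lastPatternString;
    private transient Pattern lastPattern;

    // NOTE! Flink calls eval() by reflection
    public boolean eval(String value, String pattern) {
        if (value == null || pattern == null) {
            return false;
        }
        if (lastPattern == null || !pattern.equals(lastPatternString)) {
            lastPatternString = pattern;
            lastPattern = Pattern.compile(pattern);
        }
        return lastPattern.matcher(value).matches();
    }
}
{code}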





[jira] [Created] (FLINK-9184) Remove warning from kafka consumer docs

2018-04-16 Thread Juho Autio (JIRA)
Juho Autio created FLINK-9184:
-

 Summary: Remove warning from kafka consumer docs
 Key: FLINK-9184
 URL: https://issues.apache.org/jira/browse/FLINK-9184
 Project: Flink
  Issue Type: Sub-task
Reporter: Juho Autio


Once the main problem of FLINK-5479 has been fixed, remove the warning about 
idle partitions from the docs.





[jira] [Created] (FLINK-9183) Kafka consumer docs to warn about idle partitions

2018-04-16 Thread Juho Autio (JIRA)
Juho Autio created FLINK-9183:
-

 Summary: Kafka consumer docs to warn about idle partitions
 Key: FLINK-9183
 URL: https://issues.apache.org/jira/browse/FLINK-9183
 Project: Flink
  Issue Type: Sub-task
Reporter: Juho Autio


Looks like the bug FLINK-5479 entirely prevents 
FlinkKafkaConsumerBase#assignTimestampsAndWatermarks from being used if there are 
any idle partitions. It would be nice to mention in the documentation that 
currently this requires all subscribed partitions to have a constant stream of 
data with growing timestamps. When the watermark stalls on an idle partition, 
it blocks everything downstream.
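
For illustration, a rough sketch of the affected setup (broker address, topic name and timestamp extraction are placeholders), i.e. per-partition watermarks assigned directly on the Kafka consumer:

{code:java}
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class IdlePartitionExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "my-group");                // placeholder

        FlinkKafkaConsumer011<String> consumer =
                new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props);

        // Per-partition watermarks: each Kafka partition tracks its own watermark
        // and the consumer emits the minimum of them. If one subscribed partition
        // never receives records, its watermark never advances and the overall
        // watermark stalls (FLINK-5479).
        consumer.assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<String>(Time.seconds(10)) {
                    @Override
                    public long extractTimestamp(String element) {
                        return Long.parseLong(element); // placeholder: records carry a timestamp
                    }
                });

        env.addSource(consumer).print();
        env.execute("Idle partition example");
    }
}
{code}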
 
Link to current documentation:
[https://ci.apache.org/projects/flink/flink-docs-master/dev/connectors/kafka.html#kafka-consumers-and-timestamp-extractionwatermark-emission]


