Re: CEP issue

2016-12-05 Thread Robert Metzger
Hi Kieran, which statebackend are you using for your CEP job? Using RocksDB as a state backend could potentially fix the issue. What's the number of keys in your stream? On Tue, Nov 29, 2016 at 3:18 PM, kieran . wrote: > Hello, > > I am currently building a

Re: Dealing with Multiple sinks in Flink

2016-12-05 Thread Robert Metzger
For enabling JMX when starting Flink from your IDE, you need to do the following: Configuration configuration = new Configuration(); configuration.setString("metrics.reporters", "my_jmx_reporter"); configuration.setString("metrics.reporter.my_jmx_reporter.class",

Re: Dealing with Multiple sinks in Flink

2016-12-05 Thread Robert Metzger
Hi Vinay, the JMX port depends on the port you've configured for the JMX metrics reporter. Did you configure it? Regards, Robert On Fri, Dec 2, 2016 at 11:14 AM, vinay patil wrote: > Hi Robert, > > I had resolved this issue earlier as I had not set the Kafka source >

Re: Flink 1.1.3 OOME Permgen

2016-12-05 Thread Robert Metzger
I've submitted Wordcount 410 times to a testing cluster and a streaming job 290 times and I could not reproduce the issue with 1.1.3. Also, the heapdump of one of the TaskManagers looked pretty normal. Do you have any ideas how to reproduce the issue? On Fri, Dec 2, 2016 at 3:21 PM, Robert

Re: In 1.2-SNAPSHOT, EventTimeSessionWindows are not firing untill the whole stream is processed

2016-12-05 Thread Robert Metzger
ic long extractAscendingTimestamp(Tuple3<Long,String,String> >> tuple3) { >> return tuple3.f0; >> } >> }) >> >> Best, >> Yassine >> >> 2016-12-05 11:24 GMT+01:00 Robert Metzger <rmetz...@apache.org>: >> >>

Re: Resource under-utilization when using RocksDb state backend

2016-12-05 Thread Robert Metzger
Hi Cliff, which Flink version are you using? Are you using Eventtime or processing time windows? I suspect that your disks are "burning" (= your job is IO bound). Can you check with a tool like "iotop" how much disk IO Flink is producing? Then, I would set this number in relation with the

Re: In 1.2-SNAPSHOT, EventTimeSessionWindows are not firing untill the whole stream is processed

2016-12-05 Thread Robert Metzger
Hi Yassine, are you sure your watermark extractor is the same between the two versions. It sounds a bit like the watermarks for the 1.2 code are not generated correctly. Regards, Robert On Sat, Dec 3, 2016 at 9:01 AM, Yassine MARZOUGUI wrote: > Hi all, > > With

Re: Flink 1.1.3 OOME Permgen

2016-12-02 Thread Robert Metzger
Thank you for reporting the issue Konstantin. I've filed a JIRA for the jackson issue: https://issues.apache.org/jira/browse/FLINK-5233. As I said in the JIRA, I propose to upgrade to Jackson 2.7.8, as this version contains the fix for the issue, but its not a major jackson upgrade. Any chance

Re: Container JMX port setting / discovery for Flink on YARN

2016-11-25 Thread Robert Metzger
Hi Yury, Flink is using its own JMX server instance (not the JVM's one). Therefore, you can configure the server yourself. Check out this documentation page: https://ci.apache.org/projects/flink/flink-docs-release-1.2/monitoring/metrics.html#reporter metrics.reporter.my_jmx_reporter.class:

Re: State Serializer/Deserializer between savepoints

2016-11-25 Thread Robert Metzger
Hi Daniel, This is currently a limitation in Flink's savepoints. You can not change the serialization schema of the state between savepoints. In Flink 1.2 there might be the first building blocks available for using serializers aware of savepoints. Exposing this feature to the API will probably

Re: Error while running Yahoo Streaming Benchmarks on a single machine

2016-11-23 Thread Robert Metzger
Hi, this is not really a failure. It just means that the job has been cancelled by somebody (using the web interface or the ./bin/flink tool). On Sat, Nov 19, 2016 at 3:59 AM, Muhammad Haseeb Javed < 11besemja...@seecs.edu.pk> wrote: > I am trying to run the Yahoo Streaming Benchmarks on a

Re: Flink Streaming Data Source Node

2016-11-23 Thread Robert Metzger
Hi, I'm not sure if I fully understood your question. The number of input sources is always less or equal to the number of slots in one node. Usually source instances are equally distributed among the parallel workers (TaskManagers). Maybe this document describing the deployment model is also

Re: Reading files from an S3 folder

2016-11-23 Thread Robert Metzger
Hi, This is not the expected behavior. Each parallel instance should read only one file. The files should not be read multiple times by the different parallel instances. How did you check / find out that each node is reading all the data? Regards, Robert On Tue, Nov 22, 2016 at 7:42 PM, Alex

Re: Sink not switched to "RUNUNG" even though a task slot is available

2016-11-23 Thread Robert Metzger
Hi Yassine, you don't necessarily need to set the parallelism of the last two operators of 31, the sink with parallelism 1 will fit still into the slots. A task slot can, by default, hold an entire "slice" or parallel instance of a job. The reason why the sink stays in state CREATE in the

Re: S3 checkpointing in AWS in Frankfurt

2016-11-23 Thread Robert Metzger
Hi Jonathan, have you tried using Amazon's latest EMR Hadoop distribution? Maybe they've fixed the issue in their for older Hadoop releases? On Wed, Nov 23, 2016 at 4:38 PM, Scott Kidder wrote: > Hi Jonathan, > > You might be better off creating a small Hadoop HDFS

Re: Running the JobManager and TaskManager on the same node in a cluster

2016-11-19 Thread Robert Metzger
Hi Dominik, Your observation is right, running the JobManager and TaskManager on the same node is no problem. If that machine fails, both services will be affected, but as long as you have infrastructure in place (YARN for example) to start them somewhere else, nothing bad will happen. Regarding

Re: RDF/SPARQL and Flink

2016-11-19 Thread Robert Metzger
Hi Tomas, I'm really not an RDF processing expert, but since nobody responded for 4 days, I'll try to give you some pointers: I know that there've been discussions regarding RDF processing on this mailing list before. Check out this one for example:

Re: Flink streaming with 1+ TB of managed state

2016-11-19 Thread Robert Metzger
Hi Steven, According to this presentation, King.com is using Flink with terabytes of state: http://flink-forward.org/wp-content/uploads/2016/07/Gyulo-Fo%CC%81ra-RBEA-Scalable-Real-Time-Analytics-at-King.compressed.pdf (see Page 4 specifically) For the 90GB experiment, what is the expected time

Re: flink-dist shading

2016-11-19 Thread Robert Metzger
Hi Craig, I also received only this email (and I'm a moderator of the dev@ list, so the message never made it into Apache's infra) When this issue was first reported [1][2] I asked on the Maven mailing list what's going on [3]. I think this JIRA contains the most information on the issue:

Re: Flink Avro Kafka Reading/Writing

2016-11-12 Thread Robert Metzger
Hi, yes, Flink can read and write Avro schema to Kafka, using a custom serialization / deser schema. On Fri, Nov 11, 2016 at 6:05 AM, daviD wrote: > Hi All, > > Does anyone know if Flink can read and write Avro schema to Kafka? > > Thanks > > daviD >

Re: Flink - Nifi Connectors - Class not found

2016-11-12 Thread Robert Metzger
Hi, the problem is that Flink's YARN code is not available in the Hadoop 1.2.1 build. How do you try to execute the Flink job to trigger this error message? On Fri, Nov 11, 2016 at 12:23 PM, PACE, JAMES wrote: > I am running Apache Flink 1.1.3 – Hadoop version 1.2.1 with the

Re: Flink Kafka Connector behaviour on leader change and broker upgrade

2016-11-06 Thread Robert Metzger
Hi, yes, the Flink Kafka connector for Kafka 0.8 handles broker leader changes without failing. The SimpleConsumer provided by Kafka 0.8 doesn't handle that. The 0.9 Flink Kafka consumer also supports broker leader changes transparently. If you keep using the Flink Kafka 0.8 connector with a 0.9

Re: Kinesis Connector Dependency Problems

2016-11-04 Thread Robert Metzger
t > seems to be working properly with EMR 4.8. It seems so obvious in > retrospect... thanks again for the assistance! > > Cheers, > > Justin > > On Tue, Nov 1, 2016 at 11:44 AM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi Justin, >> >> tha

Re: Flink Kafka 0.10.0.0 connector

2016-11-03 Thread Robert Metzger
ng on? > > I’ve build the jar distribution as a clean maven package (without running > the tests). > > Thanks, > Dominik > > On 3 Nov 2016, at 13:29, Robert Metzger <rmetz...@apache.org> wrote: > > Hi Dominik, > > Some of Kafka's APIs changed between Kafka

Re: Flink Kafka 0.10.0.0 connector

2016-11-03 Thread Robert Metzger
Hi Dominik, Some of Kafka's APIs changed between Kafka 0.9 and 0.10, so you can not compile the Kafka 0.9 against Kafka 0.10 dependencies. I think the easiest way to get Kafka 0.10 running with Flink is to use the Kafka 0.10 connector in the current Flink master. You can probably copy the

Re: Kinesis Connector Dependency Problems

2016-11-01 Thread Robert Metzger
Hi Justin, thank you for sharing the classpath of the Flink container with us. It contains what Till was already expecting: An older version of the AWS SDK. If you have some spare time, could you quickly try to run your program with a newer EMR version, just to validate our suspicion? If the

Re: Kafka + Flink, Basic Questions

2016-10-31 Thread Robert Metzger
Hi Matt, This is a fairly extensive question. I'll try to answer all of them, but I don't have the time right now to extensively discuss the architecture of your application. Maybe there's some other person on the ML who can extend my answers. (Answers in-line below) On Mon, Oct 31, 2016 at

Re: FlinkKafkaConsumerBase - Received confirmation for unknown checkpoint

2016-10-27 Thread Robert Metzger
Hi, it would be nice if you could check with a stable version as well. Thank you! On Thu, Oct 27, 2016 at 9:58 AM, PedroMrChaves wrote: > Hello, > > I Am using the version 1.2-SNAPSHOT. > I will try with a stable version to see if the problem persists. > > Regards, >

Re: "Slow ReadProcessor" warnings when using BucketSink

2016-10-26 Thread Robert Metzger
reading from Kafka and directly > writing to HDFS. > > I can also run a terasort or teragen in parallel without any problems. > > Best, > Max > > 2016-10-12 11:32 GMT+02:00 Robert Metzger <rmetz...@apache.org>: > >> Hi, >> I haven't seen this error befor

Re: Distributing Tasks over Task manager

2016-10-26 Thread Robert Metzger
, the >> following tasks with a parallelism of 2 are distributed over the two task >> manager. >> >> Interesting is also that the task manager have 6 task slots configured >> and the expensive part has 6 sub tasks on one task manager but still >> everything later

Re: Watermarks and window firing

2016-10-26 Thread Robert Metzger
Just for others who are wondering what this email is about: I suspect that this email was send accidentally and that this is the correct one: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Event-time-watermarks-and-windows-td9687.html On Mon, Oct 24, 2016 at 4:56 PM, Paul

Re: Unit testing a Kafka stream based application?

2016-10-26 Thread Robert Metzger
Hi Niels, Sorry for the late response. you can launch a Kafka Broker within a JVM and use it for testing purposes. Flink's Kafka connector is using that a lot for integration tests. Here is the code starting the Kafka server:

Re: Side effects or multiple sinks on streaming jobs?

2016-10-26 Thread Robert Metzger
than one tuple in my operations then flatmap > and use a KeyedSerializationSchema? > > or is there a way to emit a tuple to another sink from within operations > directly? > > On Wed, Oct 26, 2016 at 9:20 AM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi L

Re: FlinkKafkaConsumerBase - Received confirmation for unknown checkpoint

2016-10-26 Thread Robert Metzger
Hi Pedro, The message is a bit unexpected for me as well, but it does not make the checkpointing inconsistent. The only thing that's not happening in case of this warning is that the offsets are not written to Zookeeper. Which Flink version are you using? On Mon, Oct 24, 2016 at 7:25 PM,

Re: Side effects or multiple sinks on streaming jobs?

2016-10-26 Thread Robert Metzger
Hi Luis, You can define as many data sinks as you want in a Flink job topology. So its not a problem for your use case to define two Kafka sinks, sending data to different topics. Regards, Robert On Tue, Oct 25, 2016 at 3:30 PM, Luis Mariano Guerra < mari...@event-fabric.com> wrote: > hi, > >

Re: add FLINK_LIB_DIR to classpath on yarn -> add properties file to class path on yarn

2016-10-26 Thread Robert Metzger
Hi Vinay, the JobManager and TaskManager logs contain the classpath used when starting a container on YARN. Can you check if the yaml file is in the classpath? On Tue, Oct 25, 2016 at 8:28 AM, vinay patil wrote: > Hi Max, > > As discussed here , I have put my yaml file

Re: Flink Kafka Consumer Behaviour

2016-10-13 Thread Robert Metzger
Thank you for investigating the issue. I've filed a JIRA: https://issues.apache.org/jira/browse/FLINK-4822 On Wed, Oct 12, 2016 at 8:12 PM, Anchit Jatana wrote: > Hi Janardhan/Stephan, > > I just figured out what the issue is (Talking about Flink KafkaConnector08,

Re: Flink Kafka connector08 not updating the offsets with the zookeeper

2016-10-13 Thread Robert Metzger
Okay, I see. According to this document, we need to set a consumer id for each groupid and topic: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+data+structures+in+Zookeeper I created a JIRA for fixing this issue: https://issues.apache.org/jira/browse/FLINK-4822 On Wed, Oct 12, 2016

Re: Data Transfer between TM should be encrypted

2016-10-13 Thread Robert Metzger
Hi, the release dates depend on the community, when features are ready and so on. There was no discussion yet when we plan to do the release, because most of the features we want to have in are not yet done yet. I think its likely that we'll have a 1.2 release by end of this year. Regards, Robert

Re: bucketing in RollingSink

2016-10-13 Thread Robert Metzger
ased or perhaps a > location where I could follow its progress? > > > > Thanks again! > > > > *From: *Robert Metzger <rmetz...@apache.org> > *Reply-To: *"user@flink.apache.org" <user@flink.apache.org> > *Date: *Wednesday, October 12, 2016 at

Re: bucketing in RollingSink

2016-10-12 Thread Robert Metzger
Hi Robert, I see two possible workarounds: 1) You use the unreleased Flink 1.2-SNAPSHOT version. From time to time, there are some unstable commits in that version, but most of the time, its quite stable. We provide nightly binaries and maven artifacts for snapshot versions here:

Re: Distributing Tasks over Task manager

2016-10-12 Thread Robert Metzger
Hi Jürgen, Are you using the DataStream or the DataSet API? Maybe the operator chaining is causing too many operations to be "packed" into one task. Check out this documentation page: https://ci.apache.org/projects/flink/flink-docs-master/dev/datastream_api.html#task-chaining-and-resource-groups

Re: Tumbling window rich functionality

2016-10-12 Thread Robert Metzger
Hi, apply() will be called for each key. On Wed, Oct 12, 2016 at 2:25 PM, Swapnil Chougule wrote: > Thanks Aljoscha. > > Whenever I am using WindowFunction.apply() on keyed stream, apply() will > be called once or multiple times (equal to number of keys in that windowed

Re: Flink Kafka connector08 not updating the offsets with the zookeeper

2016-10-12 Thread Robert Metzger
Hi Anchit, Can you re-run your job with the debug level for Flink set to DEBUG? Then, you should see the following log message every time the offset is committed of Zookeeper: "Committing offsets to Kafka/ZooKeeper for checkpoint" Alternatively, can you check whether the offsets are available

Re: "Slow ReadProcessor" warnings when using BucketSink

2016-10-12 Thread Robert Metzger
Hi, I haven't seen this error before. Also, I didn't find anything helpful searching for the error on Google. Did you check the GC times also for Flink? Is your Flink job doing any heavy tasks (like maintaining large windows, or other operations involving a lot of heap space?) Regards, Robert

Re: Data Transfer between TM should be encrypted

2016-10-12 Thread Robert Metzger
Hi, I think that pull request will be merged for 1.2. On Fri, Oct 7, 2016 at 6:26 PM, vinay patil wrote: > Hi Stephan, > > https://github.com/apache/flink/pull/2518 > Is this pull request going to be part of 1.2 release ? Just wanted to get > an idea on timelines so

Re: Simple batch job hangs if run twice

2016-09-22 Thread Robert Metzger
Can you try running with DEBUG logging level? Then you should see if input splits are assigned. Also, you could try to use a debugger to see what's going on. On Mon, Sep 19, 2016 at 2:04 PM, Yassine MARZOUGUI < y.marzou...@mindlytix.com> wrote: > Hi Chensey, > > I am running Flink 1.1.2, and

Re: how to unit test streaming window jobs?

2016-09-22 Thread Robert Metzger
Hi Luis, using Event Time windows, you should be able to generate some test data and get predictable results. Flink is internally using similar tests to ensure correctness of the windowing implementation (for example the EventTimeWindowCheckpointingITCase). Regards, Robert On Mon, Sep 12, 2016

Re: Distributed Cache support for StreamExecutionEnvironment

2016-09-09 Thread Robert Metzger
Hi Swapnil, there's no support for something like DistributedCache in the DataStream API. However, as a workaround, you can rely on the RichFunction's open() method's to load such data directly from a distributed file system. Regards, Robert On Fri, Sep 9, 2016 at 8:13 AM, Swapnil Chougule

Re: Administration of running jobs

2016-09-09 Thread Robert Metzger
Hi Marek, You can use the RemoteExecutionEnvironment to submit a job programatically to a Flink cluster. there is also some ongoing work to programatically control submitted jobs: https://issues.apache.org/jira/browse/FLINK-4272. But for now you would probably need to hack something using the

Re: NoClassDefFoundError with ElasticsearchSink on Yarn

2016-09-09 Thread Robert Metzger
Hi Steffen, I think it would be good to add it to the documentation. Would you like to open a pull request? Regards, Robert On Mon, Sep 5, 2016 at 10:26 PM, Steffen Hausmann < stef...@hausmann-family.de> wrote: > Thanks Aris for your explanation! > > A guava version mismatch was indeed the

Re: fsbackend with nfs

2016-09-07 Thread Robert Metzger
Hi CPC, It should be possible to use the FsBackend with NFS. However, I'm not sure how well it will perform. Regards, Robert On Mon, Sep 5, 2016 at 2:11 PM, CPC wrote: > Hi, > > Is it possible to use flinkstatebackend with nfs? We dont want to deploy > hadoop in our

Re: How to get latency info from benchmark

2016-09-03 Thread Robert Metzger
eckout 547e7490fb99562ca15a2127f0ce1e784db97f3e > fatal: reference is not a tree: 547e7490fb99562ca15a2127f0ce1e784db97f3e > -- > > Regards, > Eric > > On Fri, Sep 2, 2016 at 12:01 PM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi Eric,

Re: How to get latency info from benchmark

2016-09-02 Thread Robert Metzger
and 0.9.1. Could you tell > me the SHA you were using? > > Regards, > Eric > > > On Wed, Aug 24, 2016 at 4:57 PM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi, >> >> Version 0.10-SNAPSHOT is pretty old. The snapshot repository of Apache >

Re: Resource isolation in flink among multiple jobs

2016-08-29 Thread Robert Metzger
Hi, for isolation, we recommend using YARN, or soon Mesos. For standalone installations, you'll need to manually set up multiple independent Flink clusters within one physical cluster if you want them to be isolated. Regards, Robert On Mon, Aug 29, 2016 at 1:41 PM, Abhishek Agarwal

Re: Dynamic scaling in flink

2016-08-29 Thread Robert Metzger
Hi, this JIRA is a good starting point: https://issues.apache.org/jira/browse/FLINK-3755 If you don't care about processing guarantees and you are using a stateless streaming job, you can implement a simple Kafka consumer that uses Kafka's consumer group mechanism. I recently implemented such a

Re: different Kafka serialization for keyed and non keyed messages

2016-08-29 Thread Robert Metzger
Hi Rss, > why Flink implements different serialization schemes for keyed and non keyed messages for Kafka? The non-keyed serialization schema is a basic schema, which works for most use cases. For advanced users which need access to the key, offsets, the partition or topic, there's the keyed ser

Re: Flink long-running YARN configuration

2016-08-29 Thread Robert Metzger
The JobManager UI starts when running Flink on YARN. The address of the UI is registered at YARN, so you can also access it through YARNs command line tools or its web interface. On Fri, Aug 26, 2016 at 7:28 PM, Trevor Grant wrote: > Stephan, > > Will the jobmanager-UI

Re: How to set a custom JAVA_HOME when run flink on YARN?

2016-08-29 Thread Robert Metzger
The "env.java.home" variable is only evaluated by the start scripts, not the YARN code. The solution you've mentioned earlier is a good work around in my opinion. On Fri, Aug 26, 2016 at 3:48 AM, Renkai wrote: > It seems that this config variant only effect local cluster

Re: Setting number of TaskManagers

2016-08-25 Thread Robert Metzger
Hi Craig, For the YARN session, you have to pass the the number of taskManagers using the -n argument. if you need to use a n environment variable, you can create a custom script calling the yarn-session.sh script and passing the value of the env variable to the script. Regards, Robert On

Re: How to get latency info from benchmark

2016-08-24 Thread Robert Metzger
ll not be reattempted until the > update interval of apache.snapshots has elapsed or updates are forced -> > [Help 1] > > > > > On Wed, Aug 24, 2016 at 8:12 AM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi Eric, >> >> Max is right, the t

Re: How to get latency info from benchmark

2016-08-24 Thread Robert Metzger
Hi Eric, Max is right, the tool has been used for a different benchmark [1]. The throughput logger that should produce the right output is this one [2]. Very recently, I've opened a pull request for adding metric-measuring support into the engine [3]. Maybe that's helpful for your experiments.

Re: WordCount w/ YARN and EMR local filesystem and/or HDFS

2016-08-23 Thread Robert Metzger
e EMR yum repo on the > cluster. > > > > > > > > *From: *Stephan Ewen <se...@apache.org> > *Reply-To: *"user@flink.apache.org" <user@flink.apache.org> > *Date: *Tuesday, August 23, 2016 at 11:47 AM > *To: *"user@flink.apache.org

Re: Contain topic name in the stream

2016-08-22 Thread Robert Metzger
Hi Sendoh, the KeyedDeserializationSchema allows you to access the topic name, partition id and offset of the message. So you need to implement a custom deserialization schema, extending the KeyedDeserializationSchema to get this information. Regards, Robert On Mon, Aug 22, 2016 at 10:58 AM,

Re: set yarn jvm options

2016-08-19 Thread Robert Metzger
Hi, yes, using "env.java.opts": https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html On Thu, Aug 4, 2016 at 4:03 AM, Prabhu V wrote: > Hi, > > Is there a way to set jvm options on the yarn application-manager and > task-manager with flink ? > > Thanks, >

Re: 1.1.1: JobManager config endpoint no longer supplies port

2016-08-19 Thread Robert Metzger
Hi Shannon, the problem is that YARNs proxy only allows GET HTTP requests, but for uploading files, a different request type is needed. I've filed a JIRA for the problem you've reported: https://issues.apache.org/jira/browse/FLINK-4432 Regards, Robert On Mon, Aug 15, 2016 at 6:03 PM, Shannon

Re: Flink Cluster setup

2016-08-19 Thread Robert Metzger
Hi, Flink allocates the blob server at an ephemeral port, so it'll change every time you start Flink. However, the "blob.server.port" configuration [1] allows you to use a predefined port, or even a port range. If your Kafka topic has only one partition, only one of the 8 tasks will read the

Re: Checking for existance of output directory/files before running a batch job

2016-08-19 Thread Robert Metzger
Ooops. Looks like Google Mail / Apache / the internet needs 13 minutes to deliver an email. Sorry for double answering. On Fri, Aug 19, 2016 at 3:07 PM, Maximilian Michels wrote: > HI Niels, > > Have you tried specifying the fully-qualified path? The default is the > local file

Re: Checking for existance of output directory/files before running a batch job

2016-08-19 Thread Robert Metzger
Hi Niels, I assume the directoryName you are passing doesn't have the file system prefix (hdfs:// or s3://, ...) specified. In those cases, Path.getFileSystem() is looking up the default file system prefix from the configuration. Probably the environment where you are submitting the job from

Re: Batch jobs with a very large number of input splits

2016-08-19 Thread Robert Metzger
Hi Niels, In Flink, you don't need one task per file, since splits are assigned lazily to reading tasks. What exactly is the error you are getting when trying to read that many input splits? (Is it on the JobManager?) Regards, Robert On Thu, Aug 18, 2016 at 1:56 PM, Niels Basjes

Re: Compress DataSink Output

2016-08-19 Thread Robert Metzger
Hi Wes, Flink's own OutputFormats don't support compression, but we have some tools to use Hadoop's OutputFormats with Flink [1], and those support compression: https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html Let me know if you need more

Re: problem running flink using remote environment

2016-08-19 Thread Robert Metzger
Hi Baswaraj, when you are using the ./bin/flink run client for submitting jobs, the StreamExecutionEnvironment.getExecutionEnvironment(); call is the correct one to retrieve the Execution Environment. So you can not use the RemoteEnvironment with the ./bin/flink tool.. The purpose of the remote

Re: Flink redshift table lookup and updates

2016-08-19 Thread Robert Metzger
Hi Harshith, Welcome to the Flink community ;) I would recommend using approach 2. Keeping the state in Flink and just sending updates to the dashboard store should give you better performance and consistency. I don't know whether its better to download the full state snapshot from redshift in

Re: Kafka topic with an empty parition : parallelism issue and Timestamp monotony violated

2016-08-16 Thread Robert Metzger
Hi Yassine, In Flink 1.2 we've added a new feature to the Kafka consumer, allowing you to extract timestamps and emitting watermarks per partition. The consumers now have the following method: public FlinkKafkaConsumerBase assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks assigner)

Re: flink no class found error

2016-08-10 Thread Robert Metzger
picked up instead of ours > > On Thu, Aug 11, 2016 at 1:24 AM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi Janardhan, >> >> #1 Is the exception thrown from your user code, or from Flink? >> >> #2 is most likely caused due to a compiler

Re: flink no class found error

2016-08-10 Thread Robert Metzger
Hi Janardhan, #1 Is the exception thrown from your user code, or from Flink? #2 is most likely caused due to a compiler / runtime version mismatch: http://stackoverflow.com/questions/10382929/how-to-fix-java-lang-unsupportedclassversionerror-unsupported-major-minor-versi You compiled the code

Re: Release notes 1.1.0?

2016-08-09 Thread Robert Metzger
ties and flink-conf.yaml directly from 1.0.3 > I have parallelization 1 on my sources, I can increase that to achieve the > same speed, but I’m interested to know why is that. > > > Thanks! > > > Andrew > > On 09 Aug 2016, at 11:47, Robert Metzger <rmetz...@

Re: Release notes 1.1.0?

2016-08-09 Thread Robert Metzger
Hi Andrew, here is the release announcement, with a list of all changes: http://flink.apache.org/news/2016/08/08/release-1.1.0.html, http://flink.apache.org/blog/release_1.1.0-changelog.html What does the chart say? Are the results different? is Flink faster or slower now? Regards, Robert On

Re: Kafka producer connector

2016-08-08 Thread Robert Metzger
Hi Janardhan, the fixed partitioner is the only one shipped with Flink. However, it should be fairly simple to implement one that uses the key to determine the partition. On Mon, Aug 8, 2016 at 7:16 PM, Janardhan Reddy wrote: > Hi, > The Flink kafka producer uses

Re: Flink Kafka Consumer Behaviour

2016-08-08 Thread Robert Metzger
Hi Prabhu, I'm pretty sure that the Kafka 09 consumer commits offsets to Kafka when checkpointing is turned on. In the FlinkKafkaConsumerBase.notifyCheckpointComplete(), we call fetcher.commitSpecificOffsetsToKafka(checkpointOffsets);, which calls this.consumer.commitSync(offsetsToCommit); in

Re: Flink kafka group question

2016-08-08 Thread Robert Metzger
Hi, you can get the offsets (current and committed offsets) in Flink 1.1 using the Flink metrics. In Flink 1.0, we expose the Kafka internal metrics via the accumulator system (so you can access them from the web interface as well). IIRC, Kafka exposes a metric for the lag as well. On Mon, Aug 8,

Re: Using RabbitMQ Sinks

2016-08-08 Thread Robert Metzger
Hi Paul, the example in the code is outdated, StringToByteSerializer has probably been removed quite a while ago. I'll update the documentation once we figured out the other problem you reported. What's the exception you are getting? Regards, Robert On Mon, Aug 8, 2016 at 4:33 PM, Paul Joireman

Re: Having a single copy of an object read in a RichMapFunction

2016-08-08 Thread Robert Metzger
Hi Theo, I think there are some variants you can try out for the problem. I think it depends a bit on the performance characteristics you expect: - The simplest variant is to run one TM per machine with one slot only. This is probably not feasible because you can't use all the CPU cores - ... to

Re: max aggregator dosen't work as expected

2016-08-08 Thread Robert Metzger
I have to admit that the difference between the two methods is subtle, and in my opinion it doesn't make much sense to have the two variants. - max() returns a tuple with the max value at the specified position, the other fields of the tuple/pojo are undefined - maxBy() returns a tuple with the

Re: Random access to small global state

2016-07-14 Thread Robert Metzger
Hi, For Ignite, Flink has a Sink, which is a one-directional thing. I think that Sebastian needs a bi-directional connection. An in-memory KV store like redis or memcache is probably the best option for such a use case (it reminds me a bit of the Yahoo streaming benchmark [1]). [1]

Re: dynamic streams and patterns

2016-07-14 Thread Robert Metzger
Hi Claudia, 1) What do you mean by dynamically adding? In standalone mode (which you would probably use with Docker images), you can just start additional TaskManagers, which will connect to a JobManager. So you could implement some monitoring to start new TaskManagers as soon as they are needed.

Re: [Discuss] Ordering of Records

2016-07-14 Thread Robert Metzger
There is a parallel thread answering the questions going on here already: http://stackoverflow.com/questions/38354713/ordering-of-records-in-stream On Tue, Jul 12, 2016 at 7:12 PM, vinay patil wrote: > Hi, > > Here are some of the queries I have : > > I have two

Re: how to get rid of null pointer exception in collection in DataStream

2016-07-14 Thread Robert Metzger
Hi Subash, The problem you are facing is not related to Flink. The problem is that the "centroids" field is not initialized, which is general Java issue. Please keep in mind that this list is not the best forum for Java questions. Regards, Robert On Wed, Jul 13, 2016 at 6:45 PM, subash basnet

Re: Writing in flink clusters

2016-07-14 Thread Robert Metzger
I agree with Chesnay that we should report the file name. Can you create a [hotfix] PR for that? On Wed, Jul 13, 2016 at 3:46 PM, Chesnay Schepler wrote: > Hello, > > Is that the complete error message? I'm a bit surprised it does not > explicitly name any file name. If it

Re: How large a Flink cluster can have?

2016-07-14 Thread Robert Metzger
Hi, I think the reason why this information is not written anywhere is because we don't know it either. Alibaba seems to run a fork of Flink on "thousands of nodes" [1]. Maybe some of the production users on this mailing list can share some information regarding this. [1]

Re: Ability to partition logs per pipeline

2016-07-14 Thread Robert Metzger
Hi Sumit, What exactly do you mean by pipeline? Are you talking about cases were multiple jobs are running concurrently on the same TaskManager, or are you referring to parallel instances of a Flink job? On Wed, Jul 13, 2016 at 9:49 PM, Chawla,Sumit wrote: > Hi All > >

Re: Questions about slot sharing

2016-07-14 Thread Robert Metzger
Hi Huafeng, yes, the mapper and reducer are running in different threads on the TaskManager. Slot sharing is an abstract concept of the scheduler. Flink also supports thread-sharing, for example when you have a series of mappers (or filters) running with the same parallelism. We call this

Re: HDFS to Kafka

2016-07-13 Thread Robert Metzger
Hi Dominique, In Flink 1.1 we've reworked the reading of static files in the DataStream API. There is now a method for passing any FileInputFormat: readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo). I guess you can pass a FileInputFormat with the recursive enumeration

Re: Parameters to Control Intra-node Parallelism

2016-07-08 Thread Robert Metzger
Hi, from the TaskManager logs, I can not see anything suspicious. Its a bit weird that the TaskManager logs just end, without any shutdown messages. Usually the TMs log some shut down stuff when they are stopping. Also, if they would be still running, I would expect some error messages from akka

Re: Extract type information from SortedMap

2016-07-08 Thread Robert Metzger
Hi Yukun, can you also post the code how you are invoking the GenericFlatMapper on the mailing list? The Java compiler is usually dropping the generic types during compilation ("type erasure"), that's why we can not infer the types. On Fri, Jul 8, 2016 at 12:27 PM, Yukun Guo

Re: Kafka Producer Partitioning issue

2016-07-08 Thread Robert Metzger
Hi, Guyla and I had some offline discussion about this issue. We'll report here once we've found the cause. On Wed, Jul 6, 2016 at 12:01 AM, Gyula Fóra wrote: > Hi, > > I have ran into a strange issue when using the kafka producer. > > I got the following exception: > >

Re: Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?

2016-07-08 Thread Robert Metzger
One thing I would like to add is that your timestamp extractors are not really extracting the event time from your events. They are just returning the current system time, which effectively means you are falling back to processing time. On Fri, Jul 8, 2016 at 10:32 AM, Kostas Kloudas

Re: Start cluster in different modes

2016-06-21 Thread Robert Metzger
Hi Ravinder, the streaming mode has been removed, because Flink now starts in the streaming mode by default. This means that the system is lazily allocating managed memory when user's are executing batch jobs. If you want to preallocate the managed memory, there is a new configuration option for

Re: Caused by: java.lang.Exception: Serialized representation of java.lang.StackOverflowError: null

2016-06-21 Thread Robert Metzger
com> wrote: > Hi, > > I am using flink-1.0.3. > > Warm Regards, > Debaditya > > On Tue, Jun 21, 2016 at 5:29 PM, Robert Metzger <rmetz...@apache.org> > wrote: > >> Hi, >> which version of Flink are you using? There has been a recent fix for the

Re: Caused by: java.lang.Exception: Serialized representation of java.lang.StackOverflowError: null

2016-06-21 Thread Robert Metzger
Hi, which version of Flink are you using? There has been a recent fix for the issue: https://issues.apache.org/jira/browse/FLINK-3762 Regards, Robert On Tue, Jun 21, 2016 at 5:22 PM, Debaditya Roy wrote: > Hello users, > > I am getting an error from the flat map function

<    1   2   3   4   5   6   7   8   9   10   >