Re: Streaming and batch jobs together

2018-05-09 Thread Kostas Kloudas
Hi Flavio, Flink has no inherent limitations as far as state size is concerned, apart from the fact that the state associated to a single key (not the total state) should fit in memory. For production use, it is also advised to use the RocksDB state backend, as this will allow you to spill on
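
For reference, a minimal Scala sketch of switching to the RocksDB state backend; the checkpoint URI and the toy pipeline are assumptions, and the flink-statebackend-rocksdb dependency has to be on the classpath:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
import org.apache.flink.streaming.api.scala._

object RocksDbStateExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Keyed state lives in RocksDB on the TaskManager's local disk instead of the JVM heap,
    // so only the state of the key currently being processed has to fit in memory.
    // "hdfs:///flink/checkpoints" is a placeholder; the second argument enables incremental checkpoints.
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true))
    env.enableCheckpointing(60000)

    env.socketTextStream("localhost", 9999)
      .keyBy(w => w)
      .map(w => (w, 1))
      .print()

    env.execute("rocksdb-state-example")
  }
}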

Using two different configurations for StreamExecutionEnvironment

2018-05-09 Thread Georgi Stoyanov
Hi folks, we have a Util that creates a StreamExecutionEnvironment with some checkpoint configurations for us. The configuration for externalized checkpoints and the state backend makes running the job locally from IntelliJ fail. As a workaround we currently comment out those two configurations, but I
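
A common workaround, sketched below, is to make those two settings conditional rather than commenting them out; the EnvFactory name and the localRun flag are hypothetical, not taken from the original mail:

import org.apache.flink.streaming.api.environment.CheckpointConfig
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

object EnvFactory {
  def create(localRun: Boolean): StreamExecutionEnvironment = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(60 * 1000)
    if (!localRun) {
      // Only apply the settings that break local IDE runs when running on a real cluster.
      env.getCheckpointConfig.enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)
    }
    env
  }
}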

Re: Error handling

2018-05-09 Thread Chesnay Schepler
I'm not aware of any changes made in this direction. On 08.05.2018 23:30, Vishnu Viswanath wrote: Was referring to the original email thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Error-handling-td3448.html On Tue, May 8, 2018 at 5:29 PM, vishnuviswanath

Re: Using two different configurations for StreamExecutionEnvironment

2018-05-09 Thread Chesnay Schepler
Ah my bad, the StreamingMultipleProgramsTestBase doesn't allow setting the configuration... I went ahead and wrote you a utility class that your test class should extend. The configuration for the cluster is passed through the constructor. public class MyMultipleProgramsTestBase extends

MapWithState for two keyed stream

2018-05-09 Thread Peter Zende
Hi all, Is it possible to define two DataStream sources - one which reads from Kafka, the other from HDFS - and apply mapWithState with a CoFlatMapFunction? The idea would be to read historical data from HDFS along with the live stream from Kafka and, based on some business logic, write the output
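
A rough Scala sketch of that shape, with placeholder sources, types and business logic: connect the two keyed streams and use a RichCoFlatMapFunction, whose flatMap1/flatMap2 methods also tell you which side an element came from.

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Event(key: String, payload: String)

class MergeHistoricalAndLive extends RichCoFlatMapFunction[Event, Event, String] {
  // Keyed state shared by both inputs (per key).
  private var lastHistorical: ValueState[String] = _

  override def open(parameters: Configuration): Unit = {
    lastHistorical = getRuntimeContext.getState(
      new ValueStateDescriptor[String]("lastHistorical", classOf[String]))
  }

  // flatMap1 is called for elements of the first connected stream (historical / HDFS).
  override def flatMap1(historical: Event, out: Collector[String]): Unit =
    lastHistorical.update(historical.payload)

  // flatMap2 is called for elements of the second connected stream (live / Kafka).
  override def flatMap2(live: Event, out: Collector[String]): Unit =
    out.collect(s"${live.key}: ${live.payload} (historical: ${lastHistorical.value()})")
}

object MergeJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Placeholder sources; in the real job these would be an HDFS reader and a Kafka consumer.
    val historical: DataStream[Event] = env.fromElements(Event("a", "old-value"))
    val live: DataStream[Event] = env.socketTextStream("localhost", 9999).map(s => Event(s, s))

    historical.keyBy(_.key)
      .connect(live.keyBy(_.key))
      .flatMap(new MergeHistoricalAndLive)
      .print()

    env.execute("merge-historical-and-live")
  }
}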

Re: Using two different configurations for StreamExecutionEnvironment

2018-05-09 Thread Chesnay Schepler
I suggest running jobs in the IDE only as tests. This allows you to use various Flink utilities to set up a cluster with your desired configuration, all while keeping these details out of the actual job code. If you are using 1.5-SNAPSHOT, have a look at the MiniClusterResource class. The class

RE: Recommended books

2018-05-09 Thread Georgi Stoyanov
Hi Esa, Afaik, there’s no other resource about using Scala in Flink except the documentation & the blog - https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/scala_api_extensions.html https://flink.apache.org/blog/ If you want something only for scala -

Re: Reading csv-files in parallel

2018-05-09 Thread Fabian Hueske
Hi, this looks roughly as below val env = ExecutionEnvironment.getExecutionEnvironment() val ds: DataSet[…] = env .readTextFile(path) .map(yourCsvLineParser) val tableEnv = TableEnvironment.getTableEnvironment(env) tableEnv.registerDataSet("myTable", ds) val result =
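
Filled out a bit, with a made-up CSV layout, case class and query, the snippet becomes roughly:

import org.apache.flink.api.scala._
import org.apache.flink.table.api.TableEnvironment
import org.apache.flink.table.api.scala._

case class SensorReading(id: String, ts: Long, temperature: Double)

object CsvSqlJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Read the file as plain text lines; the map function does the CSV parsing.
    val ds: DataSet[SensorReading] = env
      .readTextFile("file:///tmp/readings.csv")
      .map { line =>
        val fields = line.split(",")
        SensorReading(fields(0), fields(1).toLong, fields(2).toDouble)
      }

    val tableEnv = TableEnvironment.getTableEnvironment(env)
    tableEnv.registerDataSet("myTable", ds)

    // Run SQL over the registered DataSet and convert the result back to a DataSet.
    val result = tableEnv.sqlQuery("SELECT id, AVG(temperature) FROM myTable GROUP BY id")
    result.toDataSet[(String, Double)].print()
  }
}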

Recommended books

2018-05-09 Thread Esa Heikkinen
Hi, could you recommend some Flink books for learning Scala programming and the basics of Flink? Best, Esa

RE: Using two different configurations for StreamExecutionEnvironment

2018-05-09 Thread Georgi Stoyanov
Hi Chesnay, thanks for the suggestion, but this doesn’t sound like a good option, since I prefer local to remote debugging. My question sounds like a really common thing and the guys behind Flink should’ve thought about it. Of course I’m really new to the field of stream processing and maybe I don’t

RE: Reading csv-files in parallel

2018-05-09 Thread Esa Heikkinen
Hi, sorry for the stupid question, but how do I connect readTextFile (or readCsvFile), a MapFunction, and SQL together in Scala code? Best, Esa From: Fabian Hueske Sent: Tuesday, May 8, 2018 10:26 PM To: Esa Heikkinen Cc: user@flink.apache.org Subject:

Re: Streaming and batch jobs together

2018-05-09 Thread Flavio Pompermaier
Ok, thanks for the clarification Kostas. What about multiple jobs running at the same time? On Wed, 9 May 2018, 14:39 Kostas Kloudas, wrote: > Hi Flavio, > > Flink has no inherent limitations as far as state size is concerned, apart > from the fact that the state

RE: Using two different configurations for StreamExecutionEnvironment

2018-05-09 Thread Georgi Stoyanov
Sorry for the lame question – but how can I achieve that? I’m on 1.4; I created a new class that extends StreamingMultipleProgramsTestBase, called super() and then called the main method. The thing is that in my Util class I create everything related to checkpoints (including the setting

Re: MapWithState for two keyed stream

2018-05-09 Thread TechnoMage
CoRichFlatMap or union will work. If you need to know which is historical, the flatMap will be better as you can tell which stream it came from. But be careful about reading historical data and trying to process it all before processing the new data. That can lead to buffering a lot of

Re: Recommended books

2018-05-09 Thread Bowen Li
I'd recommend this book, *Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications.* It's probably the most authentic book about Flink on the market. You can buy and read the early release on O'Reilly,

Do Flink metrics survive a shutdown?

2018-05-09 Thread Sameer W
I want to use Flink metrics API to store user defined metrics (counters). I instantiate the MetricsGroup in the open() function of the RichMapFunction and increment the counters which are created within the metrics group. If the job restarts on failure, will the counters get restored from state?

Flink 1.5 release timing

2018-05-09 Thread Ken Krugler
Hi all, In the flink-crawler project we switched to 1.5-SNAPSHOT to take advantage of some new features, but that means we have to build a 1.5 distribution if we want to run jobs on EMR. So I’m wondering if there’s any feeling for when the next

Re: Do Flink metrics survive a shutdown?

2018-05-09 Thread Chesnay Schepler
No. You have to checkpoint and restore the value yourself. On 09.05.2018 19:52, Sameer W wrote: I want to use Flink metrics API to store user defined metrics (counters). I instantiate the MetricsGroup in the open() function of the RichMapFunction and increment the counters which are created
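
One possible shape for that, sketched below: keep the count in a plain field, expose it as a gauge, and snapshot/restore the field yourself via ListCheckpointed. The class and metric names are illustrative only.

import java.lang.{Long => JLong}
import java.util.Collections

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Gauge
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed

class CountingMapper extends RichMapFunction[String, String] with ListCheckpointed[JLong] {

  // The metric value lives in an ordinary field...
  private var processed: Long = 0L

  override def open(parameters: Configuration): Unit = {
    // ...and is only exposed through the metric system, so whatever we restore is what gets reported.
    getRuntimeContext.getMetricGroup.gauge[Long, Gauge[Long]](
      "processedRecords", new Gauge[Long] { override def getValue: Long = processed })
  }

  override def map(value: String): String = {
    processed += 1
    value
  }

  // Snapshot the field as operator state on every checkpoint...
  override def snapshotState(checkpointId: Long, timestamp: Long): java.util.List[JLong] =
    Collections.singletonList(JLong.valueOf(processed))

  // ...and put it back after a restart, so the metric does not start again from zero.
  override def restoreState(state: java.util.List[JLong]): Unit = {
    processed = 0L
    val it = state.iterator()
    while (it.hasNext) {
      processed += it.next().longValue()
    }
  }
}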

Re: Checkpoint is not triggering as per configuration

2018-05-09 Thread xiatao123
I ran into a similar issue. Since it is a "Custom File Source", the first source just lists folder/file paths for all existing files. The next operator, "Split Reader", will read the content of the files. "Custom File Source" went to the "finished" state after the first couple of seconds. That's why we got this
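
If the job really needs checkpoints while reading files, one option, assuming the continuous-monitoring mode fits the use case, is to keep the file source alive with FileProcessingMode.PROCESS_CONTINUOUSLY so it never reaches the finished state; a sketch with an assumed input path:

import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._

object ContinuousFileRead {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.enableCheckpointing(10000)

    val inputPath = "hdfs:///data/input" // placeholder directory
    val format = new TextInputFormat(new Path(inputPath))

    // PROCESS_CONTINUOUSLY keeps the monitoring source running (re-scanning every 60s),
    // so no task ever reaches FINISHED and checkpoints keep being triggered.
    env.readFile(format, inputPath, FileProcessingMode.PROCESS_CONTINUOUSLY, 60000L)
      .print()

    env.execute("continuous-file-read")
  }
}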

Re: Flink Memory Usage

2018-05-09 Thread 周思华
Hi Pedro, since you are using the RocksDB backend, RocksDB will consume some extra native memory; sometimes that amount can be very large, because by default RocksDB keeps a `BloomFilter` in memory for every opened sst file, and the number of opened sst files is not limited by
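
One way to bound that, sketched below with an arbitrary limit of 5000 and a placeholder checkpoint URI, is to cap max_open_files through an OptionsFactory, since bloom filters and index blocks are only pinned in memory for sst files that are currently open:

import org.apache.flink.contrib.streaming.state.{OptionsFactory, RocksDBStateBackend}
import org.rocksdb.{ColumnFamilyOptions, DBOptions}

object RocksDbMemoryLimit {
  def createBackend(): RocksDBStateBackend = {
    val backend = new RocksDBStateBackend("hdfs:///flink/checkpoints")
    backend.setOptions(new OptionsFactory {
      override def createDBOptions(currentOptions: DBOptions): DBOptions =
        // Limit the number of sst files RocksDB keeps open at the same time,
        // which also bounds the native memory used for their filters and indexes.
        currentOptions.setMaxOpenFiles(5000)

      override def createColumnOptions(currentOptions: ColumnFamilyOptions): ColumnFamilyOptions =
        currentOptions
    })
    backend
  }
}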