Re: Flink first project

2017-04-24 Thread Jörn Franke
I would use Flume to import these sources to HDFS and then use Flink or Hadoop or whatever to process them. While it is possible to do it in Flink, you do not want your processing to fail because the web service is not available etc. Via Flume, which is suitable for this kind of task, it is

Re: How to send local files to a flink job on YARN

2017-07-13 Thread Jörn Franke
Putting a configuration file on every node does not sound like a good idea. What about ZooKeeper? > On 13. Jul 2017, at 17:10, Guy Harmach wrote: > > Hi, > > I’m running a flink job on YARN. I’d like to pass yaml configuration files to > the job. > I tried to use the

Re: How can i merge more than one flink stream

2017-07-25 Thread Jörn Franke
What do you mean by "consistent"? Of course you can do this only at the time the timestamp is defined (e.g. using NTP). However, this is never perfect. Then it is unrealistic that they always end up in the same window because of network delays etc.; you will need here a global state that is

Re: FileNotFound Exception in Cluster Standalone

2017-08-06 Thread Jörn Franke
is then loaded by flink into the cluster, so there is no data locality as with HDFS). > On 6. Aug 2017, at 11:20, Kaepke, Marc <marc.kae...@haw-hamburg.de> wrote: > > Thanks Jörn! > > I expected Flink will schedule the input file to all workers. > >> Am 05.08.2017 um 16:2

Re: FileNotFound Exception in Cluster Standalone

2017-08-05 Thread Jörn Franke
Probably you need to refer to the file on HDFS or manually make it available on each node as a local file. HDFS is recommended. If it is already on HDFS then you need to provide an HDFS URL to the file. > On 5. Aug 2017, at 14:27, Kaepke, Marc wrote: > > Hi there,

Re: Memory Issue

2017-08-21 Thread Jörn Franke
One would need to look at your code and possibly at some heap statistics. Maybe something wrong happens when you cache them (do you use a 3rd party library or your own implementation?). Do you use a stable version of your protobuf library (not necessarily the most recent)? You also may want to

Re: MapR libraries shading issue

2017-06-26 Thread Jörn Franke
The error that you mentioned seems to indicate that some certificates of certification authorities could not be found. You may want to add them to the trust store of the application. > On 26. Jun 2017, at 16:55, ani.desh1512 wrote: > > As Stephan pointed out, this seems

Re: Flink on EMR

2017-09-26 Thread Jörn Franke
Amazon EMR already has a Flink package. You just need to check the checkbox. I would not install it on your own. I think you can find it in the advanced options. > On 26. Sep 2017, at 07:14, Navneeth Krishnan wrote: > > Hello All, > > I'm trying to deploy flink on

Re: Very low-latency - is it possible?

2017-08-31 Thread Jörn Franke
If you really need latencies that low, something else might be more suitable. Given those times, a custom solution might be necessary. Flink is a generic, powerful framework - hence it does not address these latencies. > On 31. Aug 2017, at 14:50, Marchant, Hayden wrote: >

Re: getting started with link / scala

2017-11-29 Thread Jörn Franke
If you want to really learn, then I recommend starting with a Flink project that contains unit tests and integration tests (maybe augmented with https://wiki.apache.org/hadoop/HowToDevelopUnitTests to simulate an HDFS cluster during unit tests). It should also include coverage reporting.

Re: Can not resolve org.apache.hadoop.fs.Path in 1.4.0

2017-12-19 Thread Jörn Franke
You need to put flink-hadoop-compatibility*.jar in the lib folder of your Flink distribution or in the class path of your cluster nodes. > On 19. Dec 2017, at 12:38, shashank agarwal wrote: > > yes, it's working fine. now not getting compile time error. > > But when i

Re: Apache Flink - Question about thread safety for stateful collections (MapState)

2017-11-12 Thread Jörn Franke
Be careful though with race conditions. > On 12. Nov 2017, at 02:47, Kien Truong wrote: > > Hi Mans, > > They're not executed in the same thread, but the methods that called them are > synchronized[1] and therefore thread-safe. > > Best regards, > > Kien > > [1]
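A minimal sketch (not from the original thread) of the point being made: MapState is only safe because Flink invokes processElement and timer callbacks under its own lock, one thread per operator instance. Class, state, and field names are invented for illustration:

```java
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts occurrences per word. All state access happens in the single
// operator thread that calls processElement, so no extra locking is needed.
public class WordCounter extends KeyedProcessFunction<String, String, Long> {

    private transient MapState<String, Long> counts;

    @Override
    public void open(Configuration parameters) {
        counts = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("counts", String.class, Long.class));
    }

    @Override
    public void processElement(String word, Context ctx, Collector<Long> out) throws Exception {
        Long current = counts.get(word);
        long updated = (current == null ? 0L : current) + 1;
        counts.put(word, updated);
        // The race conditions mentioned above appear as soon as you hand
        // `counts` to a thread you spawn yourself: Flink only synchronizes
        // its own callbacks, not arbitrary user threads.
        out.collect(updated);
    }
}
```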

Re: Capacity Planning For Large State in YARN Cluster

2017-10-29 Thread Jörn Franke
Well you can only performance test it beforehand in different scenarios with different configurations. I am not sure what exactly your state holds (eg how many objects etc), but if it is Java objects then 3 times might be a little bit low (depends also how you initially tested state size) -

Re: Batch writing from Flink streaming job

2018-05-13 Thread Jörn Franke
If you want to write in batches from a streaming source you always will need some state, i.e. a state database (here a NoSQL database such as a key-value store makes sense). Then you can grab the data at certain points in time and convert it to Avro. You need to make sure that the state is
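A minimal sketch of one possible shape of this (an assumption, not the thread's actual solution): buffer in keyed state and flush on a processing-time timer. The names and the five-minute interval are invented:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

// Buffers events per key and emits them as one batch every five minutes.
public class BatchBuffer extends KeyedProcessFunction<String, String, List<String>> {

    private static final long FLUSH_INTERVAL_MS = 5 * 60 * 1000;
    private transient ListState<String> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", String.class));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<List<String>> out) throws Exception {
        buffer.add(event);
        // Align the timer to the interval boundary: all elements of the same
        // interval register the identical timestamp, which Flink de-duplicates.
        long now = ctx.timerService().currentProcessingTime();
        long nextFlush = (now / FLUSH_INTERVAL_MS + 1) * FLUSH_INTERVAL_MS;
        ctx.timerService().registerProcessingTimeTimer(nextFlush);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<String>> out) throws Exception {
        List<String> batch = new ArrayList<>();
        Iterable<String> buffered = buffer.get();
        if (buffered != null) {
            for (String e : buffered) {
                batch.add(e);
            }
        }
        buffer.clear();
        if (!batch.isEmpty()) {
            out.collect(batch); // a downstream sink would convert the batch to Avro
        }
    }
}
```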

Re: How to sleep for 1 sec and then call keyBy for partitioning

2018-05-16 Thread Jörn Franke
Just some advice - do not use sleep to simulate a heavy task. Use real data or generated data to simulate the load. This sleep is garbage from a software quality point of view. Furthermore, it is often forgotten etc. > On 16. May 2018, at 22:32, Vijay Balakrishnan wrote: > > Hi, >
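A minimal sketch of generating load with data instead of sleeps, using the (legacy) SourceFunction API that matches the Flink versions of this thread; the event shape is invented:

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Random;

// Produces random events at full speed so that back-pressure and keyBy
// behavior come from real(istic) data volume, not from artificial sleeps.
public class RandomEventSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) {
        Random rnd = new Random();
        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect("event-" + rnd.nextInt(1_000_000));
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```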

Re: Flink application does not scale as expected, please help!

2018-06-16 Thread Jörn Franke
How large is the input data? If the input data is very small then it does not make sense to scale it even more. The larger the data is, the more parallelism you will have. You can modify this behavior of course by changing the partitioning of the DataSet. > On 16. Jun 2018, at 10:41, Siew Wai Yow

Re: Flink application does not scale as expected, please help!

2018-06-16 Thread Jörn Franke
is distributed to different TM and the performance worse than the low > parallelism case. Is this something expected? The more I scale the less I get? > > From: Siew Wai Yow > Sent: Saturday, June 16, 2018 5:09 PM > To: Jörn Franke > Cc: user@flink.apache.org > Subject:

Re: Kafka to Flink to Hive - Writes failing

2018-06-10 Thread Jörn Franke
Don’t use the JDBC driver to write to Hive. The performance of JDBC in general for large volumes is suboptimal. Write it to a file in HDFS in a format supported by Hive and point the table definition in Hive to it. > On 11. Jun 2018, at 04:47, sagar loke wrote: > > I am trying to Sink data to
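A minimal sketch of the file-based route, using the BucketingSink that shipped in flink-connector-filesystem at the time (later superseded by StreamingFileSink); the paths and the placeholder source are assumptions:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class HiveFileSinkJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; in the real job this would be the Kafka consumer.
        DataStream<String> rows = env.fromElements("1,foo", "2,bar");

        // Roll files under a directory that a Hive external table points at,
        // e.g. CREATE EXTERNAL TABLE ... LOCATION 'hdfs:///data/events'.
        BucketingSink<String> sink = new BucketingSink<>("hdfs:///data/events");
        sink.setBatchSize(128 * 1024 * 1024); // start a new part file every ~128 MB

        rows.addSink(sink);
        env.execute("write-hive-readable-files");
    }
}
```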

Re: java.lang.NoSuchMethodError: org.apache.kafka.clients.consumer.KafkaConsumer.assign

2018-07-02 Thread Jörn Franke
Looks like a version issue; have you made sure that the Kafka version is compatible? > On 2. Jul 2018, at 18:35, Mich Talebzadeh wrote: > > Have you seen this error by any chance in flink streaming with Kafka please? > > org.apache.flink.client.program.ProgramInvocationException: >

Re: Existing files and directories are not overwritten in NO_OVERWRITE mode. Use OVERWRITE mode to overwrite existing files and directories.

2018-07-03 Thread Jörn Franke
import org.apache.flink.core.fs.FileSystem > On 3. Jul 2018, at 08:13, Mich Talebzadeh wrote: > > thanks Hequn. > > When I use as suggested, I am getting this error > > error] > /home/hduser/dba/bin/flink/md_streaming/src/main/scala/myPackage/md_streaming.scala:30: > not found: value

Re: compiling flink job with maven is failing with error: object flink is not a member of package org.apache

2018-07-01 Thread Jörn Franke
Shouldn’t it be 1.5.0 instead of 1.5? > On 1. Jul 2018, at 18:10, Mich Talebzadeh wrote: > > Ok some clumsy work by me not creating the pom.xml file in flink > sub-directory (it was putting it in spark)! > > Anyhow this is the current issue I am facing > > [INFO] >

Re: Different result on running Flink in local mode and Yarn cluster

2018-04-25 Thread Jörn Franke
The problem may be that it is still static. How will the parser use this HashMap? > On 26. Apr 2018, at 06:42, Soheil Pourbafrani wrote: > > I run a code using Flink Java API that gets some bytes from Kafka and parses > it, followed by inserting into the Cassandra database

Re: Converting a DataStream into a Table throws error

2018-08-01 Thread Jörn Franke
How does your build.sbt look, especially the dependencies? > On 2. Aug 2018, at 00:44, Mich Talebzadeh wrote: > > Changed as suggested > >val streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment > val dataStream = streamExecEnv >.addSource(new

Re: Working out through individual messages in Flink

2018-08-07 Thread Jörn Franke
(At the end of your code) > On 8. Aug 2018, at 00:29, Jörn Franke wrote: > > Hi Mich, > > Would it be possible to share the full source code ? > I am missing a call to streamExecEnvironment.execute > > Best regards > >> On 8. Aug 2018, at 00:02, Mich Ta

Re: Working out through individual messages in Flink

2018-08-07 Thread Jörn Franke
Hi Mich, Would it be possible to share the full source code? I am missing a call to streamExecEnvironment.execute. Best regards > On 8. Aug 2018, at 00:02, Mich Talebzadeh wrote: > > Hi Fabian, > > Reading your notes above I have converted the table back to DataStream. > > val tableEnv
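A minimal sketch of the point: without the final execute call the job graph is only built, never submitted, so no data flows and no output appears. The job name is arbitrary:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Skeleton {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3).print();

        // Without this call nothing runs: the pipeline above only *declares*
        // the dataflow, and execute() is what actually submits it.
        env.execute("my-job");
    }
}
```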

Re: Limit on number of files to read for Dataset

2018-08-14 Thread Jörn Franke
It causes more overhead (processes etc) which might make it slower. Furthermore if you have them stored on HDFS then the bottleneck is the namenode which will have to answer millions of requests. The latter point will change in future Hadoop versions with http://ozone.hadoop.apache.org/ > On

Re: Small-files source - partitioning based on prefix of file

2018-08-10 Thread Jörn Franke
Or you write a custom file system for Flink... (for the tar part). Unfortunately gz files can only be processed single-threaded (there are some multi-threaded implementations but they don’t bring much gain). > On 10. Aug 2018, at 07:07, vino yang wrote: > > Hi Averell, > > In this case,

Re: Low Performance in High Cardinality Big Window Application

2018-08-26 Thread Jörn Franke
If you have a window larger than hours then you need to rethink your architecture - this is not streaming anymore. Just because you receive events in a streamed fashion does not mean you need to do all the processing in a streamed fashion. Can you store the events in a file or a database and then do

Re: understanding purpose of TextInputFormat

2018-07-14 Thread Jörn Franke
TextInputFormat defines the format of the data; it could also be something other than text, e.g. ORC, Parquet etc. > On 14. Jul 2018, at 19:15, chrisr123 wrote: > > I'm building a streaming app that continuously monitors a directory for new > files and I'm confused about why I have to specify a

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
It could be related to S3, which seems to be configured for eventual consistency. Maybe it helps to configure strong consistency. However, I recommend replacing S3 with a NoSQL database (since you are on Amazon, DynamoDB would help + DynamoDB Streams; alternatively SNS or SQS). The small size and

Re: Implement Joins with Lookup Data

2018-07-23 Thread Jörn Franke
For the first one (lookup of single entries) you could use a NoSQL db (eg key value store) - a relational database will not scale. Depending on when you need to do the enrichment you could also first store the data and enrich it later as part of a batch process. > On 24. Jul 2018, at 05:25,

Re: Flink mini IDEA runtime

2018-07-22 Thread Jörn Franke
You can run them in a local environment. I do it for my integration tests every time: flinkEnvironment = ExecutionEnvironment.createLocalEnvironment(1), e.g. even together with a local HDFS cluster.
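A minimal self-contained sketch of that pattern; the data and class name are invented:

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

// Runs the whole pipeline inside the test JVM -- no cluster needed.
public class LocalRunExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment(1);

        DataSet<String> words = env.fromElements("a", "b", "a");
        long distinct = words.distinct().count(); // count() triggers execution itself

        System.out.println("distinct words: " + distinct);
    }
}
```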

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
Sure, Kinesis is another way. Can you try read-after-write consistency (assuming the files are not modified)? In any case it looks like you would be better served with a NoSQL store or Kinesis (I don’t know your exact use case, so I cannot provide more details). > On 24. Jul 2018, at 09:51, Averell

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html You will find there a passage on the consistency model. Probably the system is putting them into the folder and Flink is triggered before they are consistent. What happens after Flink puts them on S3? Are they reused by another

Re: A use-case for Flink and reactive systems

2018-07-04 Thread Jörn Franke
I think it is a little bit overkill to use Flink for such a simple system. > On 4. Jul 2018, at 18:55, Yersinia Ruckeri wrote: > > Hi all, > > I am working on a prototype which should include Flink in a reactive systems > software. The easiest use-case with a traditional bank system where I

Re: Real time streaming as a microservice

2018-07-08 Thread Jörn Franke
That they are loosely coupled does not mean they are independent. For instance, you would not be able to replace Kafka with zeromq in your scenario. Unfortunately also Kafka sometimes needs to introduce breaking changes and the dependent application needs to upgrade. You will not be able to

Re: Move files read by flink

2018-03-10 Thread Jörn Franke
Path has a method getFileSystem > On 10. Mar 2018, at 10:36, flinkuser101 wrote: > > Where does Flink expose filesystem? Is it from env? or inputstream? > > TextInputFormat format = new TextInputFormat(new > org.apache.flink.core.fs.Path(url.toString())); >

Re: Move files read by flink

2018-03-08 Thread Jörn Franke
Why don’t you let your flink job move them once it’s done? > On 9. Mar 2018, at 03:12, flinkuser101 wrote: > > I am reading files from a folder suppose > > /files/* > > Files are pushed into that folder. > > /files/file1_2018_03_09.csv >

Re: Move files read by flink

2018-03-10 Thread Jörn Franke
The FileSystem class provided by Flink: https://ci.apache.org/projects/flink/flink-docs-master/api/java/ > On 10. Mar 2018, at 00:44, flinkuser101 wrote: > > Is there any way to do that? I have been searching for a way to do that but in > vain. > > > > -- > Sent from:

Re: Move files read by flink

2018-03-10 Thread Jörn Franke
Alternatively, the static method FileSystem.get. > On 10. Mar 2018, at 10:36, flinkuser101 wrote: > > Where does Flink expose filesystem? Is it from env? or inputstream? > > TextInputFormat format = new TextInputFormat(new > org.apache.flink.core.fs.Path(url.toString()));
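Pulling the replies in this thread together, a minimal sketch of moving a file with Flink's own FileSystem abstraction; the source path follows the asker's example and the archive directory is an assumption:

```java
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;

// Moves a processed file into an archive directory. Works for local files,
// HDFS, etc., because Flink's FileSystem abstracts over the URI scheme.
public class MoveProcessedFile {
    public static void main(String[] args) throws Exception {
        Path source = new Path("hdfs:///files/file1_2018_03_09.csv");
        Path target = new Path("hdfs:///files/processed/file1_2018_03_09.csv");

        FileSystem fs = source.getFileSystem(); // or: FileSystem.get(source.toUri())
        boolean moved = fs.rename(source, target);
        System.out.println("moved: " + moved);
    }
}
```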

Re: Flink Uploaded JAR Filename

2018-04-09 Thread Jörn Franke
Why don’t you parse the response from curl and use it to trigger the second request? That is easily automatable using Bash commands - or am I overlooking something here? > On 9. Apr 2018, at 18:49, Pavel Ciorba wrote: > > Hi everyone > > I make 2 cURL POST requests to upload

Re: Complexity of Flink

2018-04-14 Thread Jörn Franke
I think this always depends. I found Flink cleaner compared to other Big Data platforms, and with some experience it is rather easy to deploy. However, how do you measure complexity? How do you plan to cater for other components (eg deploy in the cloud, deploy locally in a Hadoop cluster

Re: Graph Analytics on HBase With HGraphDB and Apache Flink Gelly

2018-04-04 Thread Jörn Franke
Have you checked the JanusGraph source code? It also uses HBase as a storage backend: http://janusgraph.org/ It combines it with Elasticsearch for indexing. Maybe you can take inspiration from the architecture there. Generally, with HBase it depends a lot on how the data is written to regions, the order of

Re: Need Help/Code Examples with reading/writing Parquet File with Flink ?

2018-04-18 Thread Jörn Franke
You can use the corresponding HadoopInputFormat within Flink. > On 18. Apr 2018, at 07:23, sohimankotia wrote: > > Hi .. > > I have file in hdfs in format file.snappy.parquet . Can someone please > point/help with code example of reading parquet files . > > > -Sohi >
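A minimal sketch of the HadoopInputFormat route for Parquet, assuming flink-hadoop-compatibility and parquet-avro are on the classpath; the HDFS path reuses the asker's file name:

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ReadParquet {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        FileInputFormat.addInputPath(job,
                new org.apache.hadoop.fs.Path("hdfs:///data/file.snappy.parquet"));

        // Wrap the Hadoop mapreduce input format; Parquet+Avro yields
        // (Void, GenericRecord) pairs, the record being the payload.
        HadoopInputFormat<Void, GenericRecord> input = new HadoopInputFormat<>(
                new AvroParquetInputFormat<GenericRecord>(), Void.class, GenericRecord.class, job);

        DataSet<Tuple2<Void, GenericRecord>> records = env.createInput(input);
        records.first(10).print();
    }
}
```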

Re: Beam quickstart

2018-04-25 Thread Jörn Franke
Have you tried with a fat jar to see if it works in general? > On 25. Apr 2018, at 15:32, Gyula Fóra wrote: > > Hey, > Is there somewhere an end to end guide how to run a simple beam-on-flink > application (preferably using Gradle)? I want to run it using the standard >

Re: Multi threaded operators?

2018-04-23 Thread Jörn Franke
I would disable it if possible and use the Flink parallelism. The threading might work but can create operational issues depending on how you configure your resource manager. > On 23. Apr 2018, at 11:54, Alexander Smirnov > wrote: > > Hi, > > I have a co-flatmap

Re: How does setMaxParallelism work

2018-03-28 Thread Jörn Franke
What was the input format, the size, and the program that you tried to execute? > On 28. Mar 2018, at 08:18, Data Engineer wrote: > > I went through the explanation on MaxParallelism in the official docs here: >

Re: Custom Processing per window

2018-03-19 Thread Jörn Franke
How would you start implementing it? Where are you stuck? Did you already try to implement this? > On 18. Mar 2018, at 04:10, Dhruv Kumar wrote: > > Hi > > I am a CS PhD student at UMN Twin Cities. I am trying to use Flink for > implementing some very specific

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-12 Thread Jörn Franke
Thank you, very nice, I fully agree with that. > On 11.10.2018, at 19:31, Zhang, Xuefu wrote: > > Hi Jörn, > > Thanks for your feedback. Yes, I think Hive on Flink makes sense and in fact > it is one of the two approaches that I named in the beginning of the thread. > As also pointed out

Re: Flink 1.4: Queryable State Client

2018-10-14 Thread Jörn Franke
You have to file an issue. One workaround to see if this really fixes your problem could be to use reflection to mark this method as public and then call it (it is of course nothing for production code). You can also try a newer Flink version. > On 13.10.2018, at 18:02, Seye Jin wrote: > > I

Re: Could not connect to flink on Amazon EC2

2018-10-23 Thread Jörn Franke
Did you configure the IAM access roles correctly? Are those two machines allowed to communicate? > On 23.10.2018, at 12:55, madan wrote: > > Hi, > > I am trying to set up a cluster with 2 EC2 instances. Able to bring up the > cluster with 1 master and 2 slaves. But I am not able to connect

Re: How to use a thin jar instead of a fat jar?

2018-10-03 Thread Jörn Franke
You don’t need to include the Flink libraries themselves in the fat jar! You can put them as provided and this reduces the jar size! They are already provided by your Flink installation. One exception is the Table API, but I simply recommend putting it in your Flink distribution folder (if your

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-10 Thread Jörn Franke
Would it maybe make sense to provide Flink as an engine on Hive ("Flink-on-Hive")? E.g. to address 4, 5, 6, 8, 9, 10. This could be more loosely coupled than integrating Hive in all possible Flink core modules and thus introducing a very tight dependency to Hive in the core. 1, 2, 3 could be achieved

Re: Checkpointing not working

2018-09-19 Thread Jörn Franke
What do the logfiles say? What does the source code look like? Is it really needed to do checkpointing every 30 seconds? > On 19. Sep 2018, at 08:25, yuvraj singh <19yuvrajsing...@gmail.com> wrote: > > Hi , > > I am doing checkpointing using s3 and rocksdb , > i am doing checkpointing per

Re: Flink Yarn Deployment Issue - 1.7.0

2018-12-09 Thread Jörn Franke
Can you check the Flink log files? You should get a better description of the error there. > On 08.12.2018, at 18:15, sohimankotia wrote: > > Hi , > > I have installed flink-1.7.0 Hadoop 2.7 scala 2.11 . We are using > hortonworks hadoop distribution.(hdp/2.6.1.0-129/) > > *Flink lib folder

Re: Using Flink in an university course

2019-03-04 Thread Jörn Franke
It would help to understand the current issues that you have with this approach. I used a similar approach (not with Flink, but with a similar big data technology) some years ago. > On 04.03.2019, at 11:32, Wouter Zorgdrager wrote: > > Hi all, > > I'm working on a setup to use Apache Flink in an

Re: How can i just implement a crontab function using flink?

2019-05-24 Thread Jörn Franke
Just send a dummy event from the source system every minute. > On 24.05.2019, at 10:20, "wangl...@geekplus.com.cn" wrote: > > > I want to do something every one minute. > > Using TumblingWindow, the function will not be triggered if there's no message > received during this minute. But i
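A minimal sketch of such a dummy-tick source using the legacy SourceFunction API; emitting the current time as the payload is an arbitrary choice:

```java
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Emits one dummy event per minute so that downstream windows and timers
// always fire, even when the real sources are silent.
public class MinuteTickSource implements SourceFunction<Long> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (running) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(System.currentTimeMillis());
            }
            Thread.sleep(60_000); // one tick per minute
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```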

Re: How to handle Flink Job with 400MB+ Uberjar with 800+ containers ?

2019-08-30 Thread Jörn Franke
Increase the replication factor and/or use the HDFS cache https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html Try to reduce the size of the jar, e.g. the Flink libraries do not need to be included. > On 30.08.2019, at 01:09, Elkhan Dadashov wrote: > >

Apache Flink and additional fileformats (Excel, blockchains)

2019-07-29 Thread Jörn Franke
Hi, some time ago I wrote several connectors for Apache Flink that are open sourced under the Apache License: * HadoopOffice: Process Excel files (reading/writing) (.xls, .xlsx) in Flink using the data source or Table API:

Re: Apache Flink and additional fileformats (Excel, blockchains)

2019-08-02 Thread Jörn Franke
still figuring out some details, but hope that it can go live soon. > > Would be great to have your repositories listed there as well. > > Cheers, > Fabian > >> On Mon, 29 Jul 2019 at 23:39, Jörn Franke wrote: >> Hi, >> >> I wrote som

Re: flink run jar throw java.lang.ClassNotFoundException: com.mysql.jdbc.Driver

2019-10-30 Thread Jörn Franke
You can create a fat jar (also called uber jar) that includes all dependencies in your application jar. I would avoid putting things in the Flink lib directory as it can make maintenance difficult, e.g. deployment is challenging, upgrading Flink, providing it on new nodes etc. > On

Re: Complex SQL Queries on Java Streams

2019-10-28 Thread Jörn Franke
Flink is primarily stream processing. I would not use it in a synchronous web call. However, I would not make any complex analytic function available as a synchronous web service anyway. I would deploy a messaging bus (RabbitMQ, ZeroMQ etc) and send the request there (if the source is a web app

Re: Finding the Maximum Value Received so far in a Stream

2019-10-03 Thread Jörn Franke
You cannot compare doubles for equality in Java (and other languages). The reason is that double has limited precision and is rounded. See here for some examples and discussion: https://howtodoinjava.com/java/basics/correctly-compare-float-double/ > On 03.10.2019, at 08:26, Komal Mariam wrote: > > Hello
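A minimal illustration of the rounding problem and the usual tolerance-based comparison; the epsilon value is an arbitrary choice that depends on the data:

```java
// Floating-point arithmetic rounds, so exact comparison of computed
// doubles is unreliable; compare against a tolerance instead.
public class DoubleCompare {
    private static final double EPSILON = 1e-9;

    static boolean nearlyEqual(double a, double b) {
        return Math.abs(a - b) < EPSILON;
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);            // false
        System.out.println(nearlyEqual(0.1 + 0.2, 0.3)); // true
    }
}
```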

Re: Finding the Maximum Value Received so far in a Stream

2019-10-03 Thread Jörn Franke
Btw. Why don’t you use the max method? https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/KeyedStream.html#max-java.lang.String- See here about the state solution:
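A minimal sketch of the linked KeyedStream.max; the tuples and key are invented:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RunningMax {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Double>> readings = env.fromElements(
                Tuple2.of("sensor-1", 3.2),
                Tuple2.of("sensor-1", 7.5),
                Tuple2.of("sensor-1", 5.1));

        // Emits, per key, the running maximum of field 1 seen so far.
        readings.keyBy(0).max(1).print();

        env.execute("running-max");
    }
}
```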

Re: Using S3 as a streaming File source

2020-09-01 Thread Jörn Franke
Why don’t you get an S3 notification on SQS and do the actions from there? You will probably need to write the content of the files to a NoSQL database. Alternatively, send the S3 notification to Kafka and read it into Flink from there.

Re: Flink s3 streaming performance

2020-06-01 Thread Jörn Franke
I think S3 is the wrong storage backend for these volumes of small messages. Try to use a NoSQL database, or write multiple messages into one file in S3 (1 or 10). If you still want to go with your scenario then try a network-optimized instance and use s3a in Flink and configure s3 entropy.

Re: Flink 1.11 Table API cannot process Avro

2020-07-11 Thread Jörn Franke
You are missing additional dependencies: https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/avro.html > On 11.07.2020, at 04:16, Lian Jiang wrote: > > Hi, > > According to >

Re: DROOLS rule engine with flink

2020-06-24 Thread Jörn Franke
I would implement them directly in Flink / the Flink Table API. I don’t think Drools plays well in this distributed scenario. It expects a centralized rule store and evaluation. > On 23.06.2020, at 21:03, Jaswin Shah wrote: > > Hi, I am thinking of using some rule engine like DROOLS with flink

Re: Cep application with Flink

2021-02-20 Thread Jörn Franke
You are working in a distributed system, so event ordering by time may not be sufficient (or most likely not). Due to network delays, devices offline etc. it can happen that an event arrives much later although it happened before. Check watermarks in Flink and read up on at-least-once / at-most-once
