I would use Flume to import these sources into HDFS and then use Flink or Hadoop
or whatever to process them. While it is possible to do it in Flink, you do not
want your processing to fail because the web service is unavailable, etc.
Via Flume, which is well suited to this kind of task, it is
Putting a configuration file on every node does not sound like a good idea.
What about ZooKeeper?
> On 13. Jul 2017, at 17:10, Guy Harmach wrote:
>
> Hi,
>
> I’m running a flink job on YARN. I’d like to pass yaml configuration files to
> the job.
> I tried to use the
What do you mean by "consistent"?
Of course you can do this only at the time the timestamp is defined (e.g. using
NTP). However, this is never perfect.
Given network delays etc., it is unrealistic that they always end up in the same
window; you will need here a global state that is
is then loaded by Flink
into the cluster, so there is no data locality as with HDFS).
> On 6. Aug 2017, at 11:20, Kaepke, Marc <marc.kae...@haw-hamburg.de> wrote:
>
> Thanks Jörn!
>
> I expected Flink will schedule the input file to all workers.
>
>> Am 05.08.2017 um 16:2
You probably need to refer to the file on HDFS, or manually make it available on
each node as a local file. HDFS is recommended.
If it is already on HDFS then you need to provide an HDFS URL to the file.
> On 5. Aug 2017, at 14:27, Kaepke, Marc wrote:
>
> Hi there,
One would need to look at your code and possibly at some heap statistics. Maybe
something goes wrong when you cache them (do you use a third-party library or
your own implementation?). Do you use a stable version of your protobuf library
(not necessarily the most recent)? You also may want to
The error that you mentioned seems to indicate that some certificates of
certification authorities could not be found. You may want to add them to the
trust store of the application.
> On 26. Jun 2017, at 16:55, ani.desh1512 wrote:
>
> As Stephan pointed out, this seems
Amazon EMR already has a Flink package; you just need to check the checkbox. I
would not install it on your own.
I think you can find it in the advanced options.
> On 26. Sep 2017, at 07:14, Navneeth Krishnan wrote:
>
> Hello All,
>
> I'm trying to deploy flink on
If you really need latencies that low, something else might be more suitable;
given those times, a custom solution might be necessary. Flink is a generic,
powerful framework and hence does not target such low latencies.
> On 31. Aug 2017, at 14:50, Marchant, Hayden wrote:
>
If you really want to learn, then I recommend starting with a Flink project
that contains unit tests and integration tests (maybe augmented with
https://wiki.apache.org/hadoop/HowToDevelopUnitTests to simulate an HDFS cluster
during unit tests). It should also include coverage reporting.
You need to put flink-hadoop-compatibility*.jar in the lib folder of your Flink
distribution or on the classpath of your cluster nodes.
> On 19. Dec 2017, at 12:38, shashank agarwal wrote:
>
> yes, it's working fine. now not getting compile time error.
>
> But when i
Be careful though with race conditions.
> On 12. Nov 2017, at 02:47, Kien Truong wrote:
>
> Hi Mans,
>
> They're not executed in the same thread, but the methods that called them are
> synchronized[1] and therefore thread-safe.
>
> Best regards,
>
> Kien
>
> [1]
Well, you can only performance-test it beforehand in different scenarios with
different configurations.
I am not sure what exactly your state holds (e.g. how many objects), but if
it is Java objects then 3 times might be a little low (it also depends on how
you initially measured the state size) -
If you want to write in batches from a streaming source, you will always need
some state, i.e. a state database (here a NoSQL database such as a key-value
store makes sense). Then you can grab the data at certain points in time and
convert it to Avro. You need to make sure that the state is
Just some advice: do not use sleep to simulate a heavy task. Use real or
generated data for the simulation instead. Such a sleep is garbage from a
software-quality point of view. Furthermore, it is often forgotten, etc.
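A minimal sketch of what "use generated data instead of sleep" could look like (class and method names are my own, not from the thread): instead of Thread.sleep(...), process deterministic generated records so the operator does realistic CPU and allocation work.

```java
import java.util.Random;

public class GeneratedLoad {
    // Simulate a heavy task by hashing generated records instead of sleeping.
    static long processGenerated(int records) {
        Random rnd = new Random(42);          // fixed seed: reproducible load
        long checksum = 0;
        for (int i = 0; i < records; i++) {
            byte[] payload = new byte[256];
            rnd.nextBytes(payload);
            for (byte b : payload) {
                checksum = 31 * checksum + b; // cheap per-byte work
            }
        }
        return checksum;
    }

    public static void main(String[] args) {
        System.out.println(processGenerated(10_000));
    }
}
```

Unlike a sleep, this actually exercises CPU, allocation, and GC, and it cannot silently be left in by accident without showing up in profiling.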
> On 16. May 2018, at 22:32, Vijay Balakrishnan wrote:
>
> Hi,
>
How large is the input data? If the input data is very small, then it does not
make sense to scale it even more. The larger the data, the more parallelism
you will have. You can of course modify this behavior by changing the
partitioning of the DataSet.
> On 16. Jun 2018, at 10:41, Siew Wai Yow
is distributed to different TM and the performance worse than the low
> parallelism case. Is this something expected? The more I scale the less I get?
>
> From: Siew Wai Yow
> Sent: Saturday, June 16, 2018 5:09 PM
> To: Jörn Franke
> Cc: user@flink.apache.org
> Subject:
Don’t use the JDBC driver to write to Hive; the performance of JDBC for large
volumes is in general suboptimal.
Write the data to a file in HDFS in a format supported by Hive and point the
table definition in Hive to it.
> On 11. Jun 2018, at 04:47, sagar loke wrote:
>
> I am trying to Sink data to
Looks like a version issue; have you made sure that the Kafka version is
compatible?
> On 2. Jul 2018, at 18:35, Mich Talebzadeh wrote:
>
> Have you seen this error by any chance in flink streaming with Kafka please?
>
> org.apache.flink.client.program.ProgramInvocationException:
>
import org.apache.flink.core.fs.FileSystem
> On 3. Jul 2018, at 08:13, Mich Talebzadeh wrote:
>
> thanks Hequn.
>
> When I use as suggested, I am getting this error
>
> error]
> /home/hduser/dba/bin/flink/md_streaming/src/main/scala/myPackage/md_streaming.scala:30:
> not found: value
Shouldn’t it be 1.5.0 instead of 1.5?
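In the pom.xml the dependency entry would then look roughly like this (the artifact id is an example, assuming the Scala 2.11 build; use whatever module your job actually depends on):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_2.11</artifactId>
  <!-- Flink releases are versioned as major.minor.patch, so "1.5.0", not "1.5" -->
  <version>1.5.0</version>
</dependency>
```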
> On 1. Jul 2018, at 18:10, Mich Talebzadeh wrote:
>
> Ok some clumsy work by me not creating the pom.xml file in flink
> sub-directory (it was putting it in spark)!
>
> Anyhow this is the current issue I am facing
>
> [INFO]
>
The problem may be that it is still static. How will the parser use this HashMap?
> On 26. Apr 2018, at 06:42, Soheil Pourbafrani wrote:
>
> I run a code using Flink Java API that gets some bytes from Kafka and parses
> it following by inserting into Cassandra database
How does your build.sbt look, especially the dependencies?
> On 2. Aug 2018, at 00:44, Mich Talebzadeh wrote:
>
> Changed as suggested
>
>val streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment
> val dataStream = streamExecEnv
>.addSource(new
(At the end of your code)
> On 8. Aug 2018, at 00:29, Jörn Franke wrote:
>
> Hi Mich,
>
> Would it be possible to share the full source code ?
> I am missing a call to streamExecEnvironment.execute
>
> Best regards
>
>> On 8. Aug 2018, at 00:02, Mich Ta
Hi Mich,
Would it be possible to share the full source code ?
I am missing a call to streamExecEnvironment.execute
Best regards
> On 8. Aug 2018, at 00:02, Mich Talebzadeh wrote:
>
> Hi Fabian,
>
> Reading your notes above I have converted the table back to DataStream.
>
> val tableEnv
It causes more overhead (processes, etc.), which might make it slower.
Furthermore, if you have them stored on HDFS, then the bottleneck is the
NameNode, which will have to answer millions of requests.
The latter point will change in future Hadoop versions with
http://ozone.hadoop.apache.org/
> On
Or you write a custom file system for Flink... (for the tar part).
Unfortunately, gz files can only be processed single-threaded (there are some
multi-threaded implementations, but they don’t bring a big gain).
> On 10. Aug 2018, at 07:07, vino yang wrote:
>
> Hi Averell,
>
> In this case,
If you have a window larger than hours, then you need to rethink your
architecture; this is not streaming anymore. Just because you receive events
in a streamed fashion does not mean you need to do all the processing in a
streamed fashion.
Can you store the events in a file or a database and then do
TextInputFormat defines the format of the data; it could also be different from
text, e.g. ORC, Parquet, etc.
> On 14. Jul 2018, at 19:15, chrisr123 wrote:
>
> I'm building a streaming app that continuously monitors a directory for new
> files and I'm confused about why I have to specify a
It could be related to S3, which seems to be configured for eventual
consistency. Maybe it helps to configure strong consistency.
However, I recommend replacing S3 with a NoSQL database (since you are on
Amazon, DynamoDB would help, plus DynamoDB Streams; alternatively SNS or SQS).
The small size and
For the first one (lookup of single entries) you could use a NoSQL db (e.g. a
key-value store); a relational database will not scale.
Depending on when you need to do the enrichment, you could also first store the
data and enrich it later as part of a batch process.
> On 24. Jul 2018, at 05:25,
You can run them in a LocalEnvironment. I do it for my integration tests
every time:
flinkEnvironment = ExecutionEnvironment.createLocalEnvironment(1)
E.g. even together with a local HDFS cluster.
Sure, Kinesis is another way.
Can you try read-after-write consistency (assuming the files are not modified)?
In any case, it looks like you would be better served by a NoSQL store or
Kinesis (I don’t know your exact use case, so I cannot provide more details).
> On 24. Jul 2018, at 09:51, Averell
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html
You will find there a passage on the consistency model.
Probably the system is putting the files into the folder and Flink is triggered
before they are consistent.
What happens after Flink puts them on S3? Are they reused by another
I think it is a little bit overkill to use Flink for such a simple system.
> On 4. Jul 2018, at 18:55, Yersinia Ruckeri wrote:
>
> Hi all,
>
> I am working on a prototype which should include Flink in a reactive systems
> software. The easiest use-case with a traditional bank system where I
>> On Sun, 8 Jul 2018 at 10:25
That they are loosely coupled does not mean they are independent. For instance,
you would not be able to replace Kafka with ZeroMQ in your scenario.
Unfortunately, Kafka also sometimes needs to introduce breaking changes, and the
dependent applications need to upgrade.
You will not be able to
Path has a method getFileSystem
> On 10. Mar 2018, at 10:36, flinkuser101 wrote:
>
> Where does Flink expose filesystem? Is it from env? or inputstream?
>
> TextInputFormat format = new TextInputFormat(new
> org.apache.flink.core.fs.Path(url.toString()));
>
Why don’t you let your Flink job move them once it’s done?
> On 9. Mar 2018, at 03:12, flinkuser101 wrote:
>
> I am reading files from a folder suppose
>
> /files/*
>
> Files are pushed into that folder.
>
> /files/file1_2018_03_09.csv
>
The FileSystem class provided by Flink:
https://ci.apache.org/projects/flink/flink-docs-master/api/java/
> On 10. Mar 2018, at 00:44, flinkuser101 wrote:
>
> Is there any way to do that? I have been searching for way to do that but in
> vain.
>
>
>
> --
> Sent from:
Alternatively, the static method FileSystem.get.
> On 10. Mar 2018, at 10:36, flinkuser101 wrote:
>
> Where does Flink expose filesystem? Is it from env? or inputstream?
>
> TextInputFormat format = new TextInputFormat(new
> org.apache.flink.core.fs.Path(url.toString()));
Why don’t you parse the response from curl and use it to trigger the second
request?
That is easily automatable with Bash commands; or am I overlooking something here?
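A sketch of the Bash approach, under the assumption that the first POST is Flink's REST jar upload and its JSON response contains a "filename" field (the field name, host, and paths here are assumptions; adjust to the actual response):

```shell
#!/bin/sh
# Canned response standing in for:
#   response=$(curl -s -X POST -F "jarfile=@job.jar" http://host:8081/jars/upload)
response='{"filename":"/tmp/flink-web-upload/abc123_job.jar","status":"success"}'

# Extract the filename field with sed, then reduce it to the jar id
# (last path component).
path=$(printf '%s' "$response" | sed -n 's/.*"filename":"\([^"]*\)".*/\1/p')
jar_id=$(basename "$path")
echo "$jar_id"   # prints abc123_job.jar

# The second request can then be triggered with the extracted id, e.g.:
# curl -s -X POST "http://host:8081/jars/${jar_id}/run"
```

If the response is JSON, a tool like jq would be more robust than sed, but the plain-POSIX version above works without extra dependencies.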
> On 9. Apr 2018, at 18:49, Pavel Ciorba wrote:
>
> Hi everyone
>
> I make 2 cURL POST requests to upload
I think this always depends. I found Flink cleaner compared to other big data
platforms, and with some experience it is rather easy to deploy.
However, how do you measure complexity? How do you plan to cater for other
components (e.g. deploying in the cloud, deploying locally in a Hadoop cluster
f source
> files and so on ?
> How easily it is to get this information ?
>
> Best, Esa
>
> -Original Message-
> From: Jörn Franke <jornfra...@gmail.com>
> Sent: Saturday, April 14, 2018 1:43 PM
> To: Esa Heikkinen <esa.heikki...@student.tut.fi>
> Cc: use
Have you checked the JanusGraph source code? It also uses HBase as a storage
backend:
http://janusgraph.org/
It combines it with Elasticsearch for indexing. Maybe you can take inspiration
from the architecture there.
Generally, with HBase a lot depends on how the data is written to regions, the
order of
You can use the corresponding HadoopInputFormat within Flink.
> On 18. Apr 2018, at 07:23, sohimankotia wrote:
>
> Hi ..
>
> I have file in hdfs in format file.snappy.parquet . Can someone please
> point/help with code example of reading parquet files .
>
>
> -Sohi
>
Have you tried with a fat jar to see if it works in general?
> On 25. Apr 2018, at 15:32, Gyula Fóra wrote:
>
> Hey,
> Is there somewhere an end to end guide how to run a simple beam-on-flink
> application (preferrably using Gradle)? I want to run it using the standard
>
I would disable it if possible and use the Flink parallelism. The threading
might work, but it can create operational issues depending on how you configure
your resource manager.
> On 23. Apr 2018, at 11:54, Alexander Smirnov
> wrote:
>
> Hi,
>
> I have a co-flatmap
What was the input format, the size, and the program that you tried to execute?
> On 28. Mar 2018, at 08:18, Data Engineer wrote:
>
> I went through the explanation on MaxParallelism in the official docs here:
>
How would you start implementing it? Where are you stuck?
Did you already try to implement this?
> On 18. Mar 2018, at 04:10, Dhruv Kumar wrote:
>
> Hi
>
> I am a CS PhD student at UMN Twin Cities. I am trying to use Flink for
> implementing some very specific
Thank you, very nice; I fully agree with that.
> Am 11.10.2018 um 19:31 schrieb Zhang, Xuefu :
>
> Hi Jörn,
>
> Thanks for your feedback. Yes, I think Hive on Flink makes sense and in fact
> it is one of the two approaches that I named in the beginning of the thread.
> As also pointed out
You have to file an issue. One workaround to see if this really fixes your
problem could be to use reflection to mark this method as public and then call
it (of course, that is nothing for production code). You can also try a newer
Flink version.
> Am 13.10.2018 um 18:02 schrieb Seye Jin :
>
> I
Did you configure the IAM access roles correctly? Are those two machines
allowed to communicate?
> Am 23.10.2018 um 12:55 schrieb madan :
>
> Hi,
>
> I am trying to setup cluster with 2 EC2 instances. Able to bring up the
> cluser with 1 master and 2 slaves. But I am not able to connect
You don’t need to include the Flink libraries themselves in the fat jar! You
can mark them as provided, and this reduces the jar size; they are already
provided by your Flink installation. One exception is the Table API, but I
simply recommend putting it in your Flink distribution folder (if your
Would it maybe make sense to provide Flink as an engine on Hive
("flink-on-Hive")? E.g. to address 4, 5, 6, 8, 9, 10. This could be more loosely
coupled than integrating Hive into all possible Flink core modules and thus
introducing a very tight dependency on Hive in the core.
1, 2, 3 could be achieved
What do the log files say?
What does the source code look like?
Is it really necessary to checkpoint every 30 seconds?
> On 19. Sep 2018, at 08:25, yuvraj singh <19yuvrajsing...@gmail.com> wrote:
>
> Hi ,
>
> I am doing checkpointing using s3 and rocksdb ,
> i am doing checkpointing per
Can you check the Flink log files? You should find a better description of the
error there.
> Am 08.12.2018 um 18:15 schrieb sohimankotia :
>
> Hi ,
>
> I have installed flink-1.7.0 Hadoop 2.7 scala 2.11 . We are using
> hortonworks hadoop distribution.(hdp/2.6.1.0-129/)
>
> *Flink lib folder
It would help to understand the current issues that you have with this
approach. I used a similar approach (not with Flink, but with a similar big
data technology) some years ago
> Am 04.03.2019 um 11:32 schrieb Wouter Zorgdrager :
>
> Hi all,
>
> I'm working on a setup to use Apache Flink in an
Just send a dummy event from the source system every minute.
> Am 24.05.2019 um 10:20 schrieb "wangl...@geekplus.com.cn"
> :
>
>
> I want to do something every one minute.
>
> Using TumblingWindow, the function will not be triigged if there's no message
> received during this minute. But i
Increase the replication factor and/or use the HDFS cache:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html
Try to reduce the size of the jar; e.g., the Flink libraries do not need to be
included.
> Am 30.08.2019 um 01:09 schrieb Elkhan Dadashov :
>
>
Hi,
Some time ago I wrote several connectors for Apache Flink that are open-sourced
under the Apache License:
* HadoopOffice: process Excel files (reading/writing) (.xls, .xlsx) in Flink
using the data source or Table API:
still figuring out some details, but hope that it can go live soon.
>
> Would be great to have your repositories listed there as well.
>
> Cheers,
> Fabian
>
>> Am Mo., 29. Juli 2019 um 23:39 Uhr schrieb Jörn Franke
>> :
>> Hi,
>>
>> I wrote som
You can create a fat jar (also called an uber jar) that includes all
dependencies in your application jar.
I would avoid putting things in the Flink lib directory, as it can make
maintenance difficult: e.g., deployment is challenging, as are upgrading Flink,
providing it on new nodes, etc.
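To keep the Flink libraries themselves out of the fat jar, they can be marked as provided in Maven; a sketch (the artifact id and version are examples, not from the thread):

```xml
<!-- Provided scope: the dependency is available at compile time, but the
     fat-jar plugin will not bundle it, since the Flink installation already
     ships it. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.11</artifactId>
  <version>1.6.0</version>
  <scope>provided</scope>
</dependency>
```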
> Am
Flink is mainly stream processing; I would not use it in a synchronous web
call. In fact, I would not make any complex analytic function available as a
synchronous web service. I would deploy a messaging bus (RabbitMQ, ZeroMQ, etc.)
and send the request there (if the source is a web app
You cannot compare doubles for exact equality in Java (and other languages).
The reason is that double has limited precision and is rounded. See here for
some examples and discussion:
https://howtodoinjava.com/java/basics/correctly-compare-float-double/
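A small self-contained illustration (class and method names are my own): exact `==` comparison fails for values that look equal, while an epsilon-based comparison behaves as expected.

```java
public class DoubleCompare {
    // Compare two doubles within a tolerance instead of using ==.
    static boolean nearlyEqual(double a, double b, double eps) {
        return Math.abs(a - b) <= eps;
    }

    public static void main(String[] args) {
        double sum = 0.1 + 0.2;          // rounding: sum is 0.30000000000000004
        System.out.println(sum == 0.3);  // prints false
        System.out.println(nearlyEqual(sum, 0.3, 1e-9)); // prints true
    }
}
```

The right tolerance depends on the magnitude of the values; for money-like values, BigDecimal avoids the problem entirely.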
> Am 03.10.2019 um 08:26 schrieb Komal Mariam :
>
>
> Hello
By the way, why don’t you use the max method?
https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/streaming/api/datastream/KeyedStream.html#max-java.lang.String-
See here about the state solution:
Why don’t you get an S3 notification via SQS and do the actions from there?
You will probably need to write the content of the files to a NoSQL database.
Alternatively, send the S3 notification to Kafka and read it from there with
Flink.
I think S3 is the wrong storage backend for these volumes of small messages.
Try to use a NoSQL database, or write multiple messages into one file in S3
(1 or 10).
If you still want to go with your scenario, then try a network-optimized
instance, use s3a in Flink, and configure S3 entropy.
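For the entropy part, a sketch of the flink-conf.yaml entries (the bucket path is an example; the entropy marker in the checkpoint path is replaced by random characters so writes spread across S3 key prefixes):

```yaml
# S3 entropy injection for checkpoint paths
s3.entropy.key: _entropy_
s3.entropy.length: 4
state.checkpoints.dir: s3://my-bucket/checkpoints/_entropy_/
```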
You are missing additional dependencies
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/formats/avro.html
> Am 11.07.2020 um 04:16 schrieb Lian Jiang :
>
>
> Hi,
>
> According to
>
I would implement them directly in Flink / the Flink Table API.
I don’t think Drools plays well in this distributed scenario; it expects a
centralized rule store and evaluation.
> Am 23.06.2020 um 21:03 schrieb Jaswin Shah :
>
>
> Hi I am thinking of using some rule engine like DROOLS with flink
You are working in a distributed system, so event ordering by time may not be
sufficient (most likely it is not). Due to network delays, devices being
offline, etc., it can happen that an event arrives much later although it
happened earlier. Check watermarks in Flink and read up on at-least-once and
at-most-once semantics.