RE: YARN : Different cutoff for job and task managers

2019-11-20 Thread Gwenhael Pasquiers
This works; I had some issues modifying my scripts, but it’s OK and I could confirm by JMX that env.java.opts.jobmanager had priority over the “normal” heap size (calculated from cutoff). Thanks !  From: Yang Wang Sent: mercredi 20 novembre 2019 03:52 To: Gwenhael Pasquiers Cc: user
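
For reference, a minimal flink-conf.yaml excerpt matching what is described in this thread. The values are illustrative only and the key names are as of the Flink 1.9-era configuration docs; the point is that a heap forced through env.java.opts.jobmanager takes precedence over the heap derived from the containerized cutoff.

```
# flink-conf.yaml -- illustrative values only
containerized.heap-cutoff-ratio: 0.8            # large cutoff, sized for RocksDB on the task managers
env.java.opts.jobmanager: -Xms1024m -Xmx1024m   # forces a usable heap for the job manager
```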

RE: YARN : Different cutoff for job and task managers

2019-11-20 Thread Gwenhael Pasquiers
I see, good idea, I’ll try that and tell you the result. Thanks, From: Yang Wang Sent: mercredi 20 novembre 2019 03:52 To: Gwenhael Pasquiers Cc: user@flink.apache.org Subject: Re: YARN : Different cutoff for job and task managers Hi Gwenhael, I'm afraid that we could not set different cut

YARN : Different cutoff for job and task managers

2019-11-19 Thread Gwenhael Pasquiers
Hello, In a setup where we allocate most of the memory to rocksdb (off-heap) we have an important cutoff. Our issue is that the same cutoff applies to both task and job managers: the heap size of the job manager then becomes too low. Is there a way to apply different cutoffs to job

RE: Streaming : a way to "key by partition id" without redispatching data

2017-11-13 Thread Gwenhael Pasquiers
From what I understood, in your case you might solve your issue by using specific key classes instead of Strings. Maybe you could create key classes that have a user-specified hashcode that could take the previous key's hashcode as a value. That way your data shouldn't be sent over the wire
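
A minimal sketch of that idea (class name and fields are assumptions): a key type whose hashCode is supplied by the caller, so a later keyBy can reuse the hash of the key the data is already partitioned by.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical key wrapper: carries a caller-supplied hash so that keyBy()
// can keep records on the partition they already live on.
public class CarriedHashKey implements Serializable {

    private final String value;     // the "real" key
    private final int carriedHash;  // hash inherited from the previous key

    public CarriedHashKey(String value, int carriedHash) {
        this.value = value;
        this.carriedHash = carriedHash;
    }

    @Override
    public int hashCode() {
        // Assumes equal values always carry the same hash,
        // otherwise the equals/hashCode contract is broken.
        return carriedHash;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CarriedHashKey
                && Objects.equals(((CarriedHashKey) o).value, value);
    }
}
```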

RE: Streaming : a way to "key by partition id" without redispatching data

2017-11-10 Thread Gwenhael Pasquiers
I think I finally found a way to "simulate" a Timer thanks to the the processWatermark function of the AbstractStreamOperator. Sorry for the monologue. From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: vendredi 10 novembre 2017 16:02 To: 'user@flink.apache.
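
A rough sketch of that approach, assuming the Flink 1.2/1.3-era operator API (element type and flush logic are placeholders): the operator reacts to watermarks instead of registering keyed timers.

```java
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

// Hypothetical operator: uses processWatermark() as a substitute for keyed timers.
public class WatermarkTriggeredOperator extends AbstractStreamOperator<String>
        implements OneInputStreamOperator<String, String> {

    @Override
    public void processElement(StreamRecord<String> element) throws Exception {
        // buffer / fold the element here; it is only emitted when flushed below
    }

    @Override
    public void processWatermark(Watermark mark) throws Exception {
        // "timer" firing: flush everything buffered up to mark.getTimestamp()
        output.collect(new StreamRecord<>("flushed up to " + mark.getTimestamp(), mark.getTimestamp()));
        super.processWatermark(mark);
    }
}
```

Such an operator would be attached with DataStream.transform(...), which does not require a keyed stream.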

RE: Streaming : a way to "key by partition id" without redispatching data

2017-11-10 Thread Gwenhael Pasquiers
Hello, Finally, even after creating my operator, I still get the error : "Timers can only be used on keyed operators". Isn't there any way around this ? A way to "key" my stream without shuffling the data ? From: Gwenhael Pasquiers Sent: vendredi 10 novembre 2017 11:42 To

RE: Streaming : a way to "key by partition id" without redispatching data

2017-11-10 Thread Gwenhael Pasquiers
Maybe you don't need to bother with that question. I'm currently discovering AbstractStreamOperator, OneInputStreamOperator and Triggerable. That should do it :-) From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: jeudi 9 novembre 2017 18:00 To: 'user@flink.apache.org

Streaming : a way to "key by partition id" without redispatching data

2017-11-09 Thread Gwenhael Pasquiers
Hello, (Flink 1.2.1) For performance reasons I'm trying to reduce the volume of data of my stream as soon as possible by windowing/folding it for 15 minutes before continuing to the rest of the chain that contains keyBys and windows that will transfer data everywhere. Because of the huge
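
As a rough illustration of the pre-aggregation described here (Flink 1.2-era DataStream API; the event type, key accessor and merge logic are assumptions):

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<Event> events = env.addSource(new MyKafkaSource());   // placeholders

// 15-minute pre-aggregation: one merged record per key and window is forwarded
// to the heavier keyBy/window stages further down the chain.
DataStream<Event> preAggregated = events
        .keyBy(new KeySelector<Event, String>() {
            @Override
            public String getKey(Event e) {
                return e.getKey();                // hypothetical accessor
            }
        })
        .timeWindow(Time.minutes(15))
        .reduce((a, b) -> a.mergeWith(b));        // hypothetical merge of two events
```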

RE: Great number of jobs and numberOfBuffers

2017-08-31 Thread Gwenhael Pasquiers
Hi, Well yes, I could probably make it work with a constant number of operators (and consequently buffers) by developing specific input and output classes, and that way I'd have a workaround for that buffers issue. The size of my job is input-dependent mostly because my code creates one full

yarn and checkpointing

2017-08-29 Thread Gwenhael Pasquiers
Hi, Is it possible to use checkpointing to restore the state of an app after a restart on yarn? From what I've seen it looks like checkpointing only works within a flink cluster life-time. However the yarn mode has one cluster per app, and (unless the app crashes and is automatically

RE: Great number of jobs and numberOfBuffers

2017-08-29 Thread Gwenhael Pasquiers
Hello, Sorry to ask you again, but no idea on this ? -Original Message- From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: lundi 21 août 2017 12:04 To: Nico Kruber <n...@data-artisans.com> Cc: Ufuk Celebi <u...@apache.org>; user@flink.apache.org Subjec

RE: Great number of jobs and numberOfBuffers

2017-08-21 Thread Gwenhael Pasquiers
(32 and 4) in order to limit the numbers of files on HDFS. -Original Message- From: Nico Kruber [mailto:n...@data-artisans.com] Sent: vendredi 18 août 2017 14:58 To: Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> Cc: Ufuk Celebi <u...@apache.org>; user@flink.apache.org Sub

RE: Great number of jobs and numberOfBuffers

2017-08-17 Thread Gwenhael Pasquiers
4 To: Ufuk Celebi <u...@apache.org> Cc: Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com>; user@flink.apache.org; Nico Kruber <n...@data-artisans.com> Subject: Re: Great number of jobs and numberOfBuffers PS: Also pulling in Nico (CC'd) who is working on the network stack. On Thu, Au

Great number of jobs and numberOfBuffers

2017-08-17 Thread Gwenhael Pasquiers
Hello, We're hitting a limit with the numberOfBuffers. In a quite complex job we do a lot of operations, with a lot of operators, on a lot of folders (datehours). In order to split the job into smaller "batches" (to limit the necessary "numberOfBuffers") I've done a loop over the batches
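
The batching loop described here could look roughly like the sketch below (the helper methods and batch size are assumptions): each env.execute() submits a smaller plan, so each execution needs fewer network buffers.

```java
import java.util.List;
import org.apache.flink.api.java.ExecutionEnvironment;

// Hypothetical driver: process the date-hour folders in groups of 32 instead of
// building one huge plan over all of them.
for (List<String> batch : partitionIntoGroups(allDateHourFolders, 32)) {
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    buildPipeline(env, batch);   // hypothetical: adds sources, operators and sinks for this batch
    env.execute("datehours-" + batch.get(0) + "-to-" + batch.get(batch.size() - 1));
}
```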

RE: Event-time and first watermark

2017-08-04 Thread Gwenhael Pasquiers
possibly up-to-date watermark. -Original Message- From: Aljoscha Krettek [mailto:aljos...@apache.org] Sent: vendredi 4 août 2017 12:22 To: Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> Cc: Nico Kruber <n...@data-artisans.com>; user@flink.apache.org Subject: Re: Event-ti

RE: Event-time and first watermark

2017-08-03 Thread Gwenhael Pasquiers
the watermark ? B.R. -Original Message- From: Nico Kruber [mailto:n...@data-artisans.com] Sent: jeudi 3 août 2017 16:30 To: user@flink.apache.org Cc: Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> Subject: Re: Event-time and first watermark Hi Gwenhael, "A Watermark(t) decla

Event-time and first watermark

2017-08-03 Thread Gwenhael Pasquiers
Hi, From my tests it seems that the initial watermark value is Long.MIN_VALUE even though my first data passed through the timestamp extractor before arriving into my ProcessFunction. It looks like the watermark "lags" behind the data by one message. Is there a way to have a watermark
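
For context, a typical timestamp/watermark assignment of that era (the extractor class exists in Flink 1.x; the event accessor is assumed). Periodic watermarks are only emitted after elements have been processed, which is why the first element can still observe Long.MIN_VALUE downstream.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(Event event) {
                return event.getEventTime();   // hypothetical accessor
            }
        });
```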

RE: A way to purge an empty session

2017-06-26 Thread Gwenhael Pasquiers
Thanks ! I didn’t know of this function and indeed it seems to match my needs better than Windows. And I’ll be able to clear my state once it’s empty (and re-create it when necessary). B.R. From: Fabian Hueske [mailto:fhue...@gmail.com] Sent: lundi 26 juin 2017 12:13 To: Gwenhael Pasquiers

A way to purge an empty session

2017-06-23 Thread Gwenhael Pasquiers
Hello, This may be premature optimization for memory usage but here is my question: I have to build an app that will have to monitor sessions of (millions of) users. I don't know when a session starts or ends, nor a reasonable maximum duration. I want to have a maximum duration (timeout) of
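
The ProcessFunction suggested in the reply above maps naturally to this. A minimal sketch under assumed types (UserEvent, SessionSummary) with a rolling processing-time timeout that clears the state once the session expires:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// Must run on a keyed stream (keyBy(userId)) so that timers and state are per user.
public class SessionTimeoutFunction extends ProcessFunction<UserEvent, SessionSummary> {

    private static final long SESSION_TIMEOUT_MS = 30 * 60 * 1000;

    private transient ValueState<SessionSummary> session;
    private transient ValueState<Long> lastActivity;

    @Override
    public void open(Configuration parameters) {
        session = getRuntimeContext().getState(
                new ValueStateDescriptor<>("session", SessionSummary.class));
        lastActivity = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastActivity", Long.class));
    }

    @Override
    public void processElement(UserEvent event, Context ctx, Collector<SessionSummary> out) throws Exception {
        SessionSummary current = session.value();
        if (current == null) {
            current = new SessionSummary();
        }
        current.add(event);              // hypothetical accumulator
        session.update(current);

        long now = ctx.timerService().currentProcessingTime();
        lastActivity.update(now);
        ctx.timerService().registerProcessingTimeTimer(now + SESSION_TIMEOUT_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<SessionSummary> out) throws Exception {
        Long last = lastActivity.value();
        if (last != null && timestamp >= last + SESSION_TIMEOUT_MS) {
            out.collect(session.value());   // emit the finished session
            session.clear();                // and free the state
            lastActivity.clear();
        }
    }
}
```

Earlier timers that fire while the session is still active are ignored because the last-activity check fails, so only the timer registered after the final event closes the session.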

Kafka 0.10 jaas multiple clients

2017-04-26 Thread Gwenhael Pasquiers
Hello, Up to now we’ve been using kafka with jaas (plain login/password) the following way: - yarnship the jaas file - add the jaas file name into “flink-conf.yaml” using property “env.java.opts” How to support multiple secured kafka 0.10 consumers and producers (with

RE: Kafka offset commits

2017-04-21 Thread Gwenhael Pasquiers
ed, right? Best, Aljoscha On 18. Apr 2017, at 16:19, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: Thanks for your answer. Does that means that flink does not rely on the offset in written to zookeeper anymore, but relies on the sna

shaded version of legacy kafka connectors

2017-03-20 Thread Gwenhael Pasquiers
Hi, Before doing it myself I thought it would be better to ask. We need to consume from kafka 0.8 and produce to kafka 0.10 in a flink app. I guess there will be class and package name conflicts for a lot of dependencies of both connectors. The obvious solution is to make a “shaded” version

RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
variables is kept in-memory on each node, it should not become too large. For simpler things like scalar values you can simply make parameters part of the closure of a function, or use the withParameters(...) method to pass in a configuration. From: Gwenhael Pasquiers [mailto:gwenhael.pasqui
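
The broadcast-variable pattern mentioned here looks roughly like this in the DataSet API (Shape, Point, Match and the input data sets are placeholders):

```java
import java.util.List;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Ship the small data set once per node as a broadcast variable instead of
// streaming its elements one by one next to the large data set.
DataSet<Match> matches = largePoints
        .flatMap(new RichFlatMapFunction<Point, Match>() {

            private List<Shape> shapes;

            @Override
            public void open(Configuration parameters) {
                shapes = getRuntimeContext().getBroadcastVariable("shapes");
            }

            @Override
            public void flatMap(Point p, Collector<Match> out) {
                for (Shape s : shapes) {
                    if (s.contains(p)) {
                        out.collect(new Match(s, p));
                    }
                }
            }
        })
        .withBroadcastSet(smallShapes, "shapes");
```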

RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
broadcasting it rather than sending its elements 1 by 1. I’ll try to use it, I’ll take anything that will make my code cleaner ! From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: vendredi 3 mars 2017 17:55 To: user@flink.apache.org Subject: RE: Cross operation on two huge datasets

RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
d store the output as Flink State – that would be more fault tolerant but storing it as cache in JVM should work too. Is this a batch job or streaming? Between I am a newbee to Flink, still only learning – so take my suggestions with caution ☺ Thanks Ankit From: Gwenhael Pasquiers <gwenh

RE: Cross operation on two huge datasets

2017-03-03 Thread Gwenhael Pasquiers
for now even if the static thingie is a bit dirty. However I’m surprised that reading 20 MB of parquet become 21GB of “bytes sent” by the flink reader. From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: jeudi 2 mars 2017 16:28 To: user@flink.apache.org Subject: RE: Cross

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
with the partitioned point data set. Cheers, Till On Thu, Mar 2, 2017 at 3:29 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com> wrote: The best for me would be t

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
ontext().getBroadcastVariable("broadcast"); } @Override public void flatMap(Integer integer, Collector collector) throws Exception { } }).withBroadcastSet(broadcastSet, "broadcast"); Cheers, Till ​ On Thu, Mar 2, 2017 at 12:12 PM, Gwenhael Pasquiers <gw

RE: Cross operation on two huge datasets

2017-03-02 Thread Gwenhael Pasquiers
it is usually better to apply a coarse-grained spatial partitioning and do a key-based join on the partitions. Within each partition you'd perform a cross. Best, Fabian 2017-02-21 18:34 GMT+01:00 Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com&g
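
Sketched in the DataSet API, that suggestion could look like the following (the grid helper and data types are assumptions): both sides are tagged with a coarse cell id, joined on it, and the exact test runs only inside each cell.

```java
import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// A shape can overlap several cells, so it is emitted once per cell.
DataSet<Tuple2<Long, Shape>> cellShapes = shapes
        .flatMap(new FlatMapFunction<Shape, Tuple2<Long, Shape>>() {
            @Override
            public void flatMap(Shape s, Collector<Tuple2<Long, Shape>> out) {
                for (long cell : Grid.cellsFor(s)) {   // hypothetical grid helper
                    out.collect(new Tuple2<>(cell, s));
                }
            }
        });

DataSet<Tuple2<Long, Point>> cellPoints = points
        .map(new MapFunction<Point, Tuple2<Long, Point>>() {
            @Override
            public Tuple2<Long, Point> map(Point p) {
                return new Tuple2<>(Grid.cellFor(p), p);
            }
        });

// Key-based join on the cell id; the expensive containment test stays local.
DataSet<Match> matches = cellShapes
        .join(cellPoints)
        .where(0).equalTo(0)
        .with(new FlatJoinFunction<Tuple2<Long, Shape>, Tuple2<Long, Point>, Match>() {
            @Override
            public void join(Tuple2<Long, Shape> s, Tuple2<Long, Point> p, Collector<Match> out) {
                if (s.f1.contains(p.f1)) {
                    out.collect(new Match(s.f1, p.f1));
                }
            }
        });
```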

RE: Cross operation on two huge datasets

2017-02-23 Thread Gwenhael Pasquiers
Within each partition you'd perform a cross. Best, Fabian 2017-02-21 18:34 GMT+01:00 Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>>: Hi, I need (or at least I think I do) to do a cross operation between two huge datasets. One dataset is a l

RE: Making batches of small messages

2017-01-12 Thread Gwenhael Pasquiers
are on Flink 1.1.x, you will need to implement a custom operator which is a much more low-level interface. Best, Fabian 2017-01-11 17:16 GMT+01:00 Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>>: Hi, Sorry if this was already asked. For

Making batches of small messages

2017-01-11 Thread Gwenhael Pasquiers
Hi, Sorry if this was already asked. For performance reasons (streaming as well as batch) I'd like to "group" messages (let's say by batches of 1000) before sending them to my sink (kafka, but mainly ES) so that I have a smaller overhead. I've seen the "countWindow" operation but if I'm not
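
One way to express that batching with the DataStream API (Record and the bulk sink are placeholders) is a non-keyed count window that emits lists of 1000 elements:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.AllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

DataStream<Record> records = env.addSource(new MyKafkaSource());   // placeholders

records
    .countWindowAll(1000)     // note: runs with parallelism 1
    .apply(new AllWindowFunction<Record, List<Record>, GlobalWindow>() {
        @Override
        public void apply(GlobalWindow window, Iterable<Record> values, Collector<List<Record>> out) {
            List<Record> batch = new ArrayList<>();
            for (Record r : values) {
                batch.add(r);
            }
            out.collect(batch);
        }
    })
    .addSink(new BulkElasticsearchSink());   // hypothetical sink taking whole batches
```

One caveat from the thread: a pure count window only fires when it is full, so a combined time-or-count trigger (or a custom operator) is needed if incomplete batches must also be flushed.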

RE: Programmatically get live values of accumulators

2017-01-03 Thread Gwenhael Pasquiers
On Mon, Jan 2, 2017 at 1:34 AM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: Hi, and best wishes for the year to come ☺ I’d like to be able to programmatically get the (live) values of accumulators in order to send them us

Programmatically get live values of accumulators

2017-01-02 Thread Gwenhael Pasquiers
Hi, and best wishes for the year to come :) I'd like to be able to programmatically get the (live) values of accumulators in order to send them using a statsd (or another) client in the JobManager of a yarn-deployed application. I say live because I'd like to use that in streaming (24/7)

RE: Generate _SUCCESS (map-reduce style) when folder has been written

2016-12-20 Thread Gwenhael Pasquiers
back? Thanks a lot, Fabian 2016-12-20 16:37 GMT+01:00 Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>>: Thanks, it is working properly now. NB : Had to delete the folder by code because Hadoop’s OuputFormats will only overwrite file by file, n

Generate _SUCCESS (map-reduce style) when folder has been written

2016-12-20 Thread Gwenhael Pasquiers
Hi, Sorry if it's already been asked but is there an embedded way for flink to generate a _SUCCESS file in the folders it's been writing into (using the write method with OutputFormat) ? We are replacing a spark job that was generating those files (and further operations rely on it). Best
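
If no built-in option fits, one manual approach is a small post-job step using the Hadoop FileSystem API (a sketch; the output path is a placeholder):

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

env.execute("my-batch-job");

// Write an empty _SUCCESS marker, as Hadoop MapReduce and Spark do.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create("hdfs:///output/2016-12-20"), conf);
try (FSDataOutputStream ignored = fs.create(new Path("hdfs:///output/2016-12-20/_SUCCESS"), true)) {
    // the file's existence is the signal downstream jobs wait for
}
```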

RE: Read a given list of HDFS folder

2016-03-21 Thread Gwenhael Pasquiers
-path-directory Regarding the phases: the best way to exchange data between batch jobs is via files. You can then execute two programs one after the other, the first one produces the files, which the second jobs uses as input. – Ufuk On Mon, Mar 21, 2016 at 12:14 PM, Gwenhael Pasquiers

Read a given list of HDFS folder

2016-03-21 Thread Gwenhael Pasquiers
Hello, Sorry if this has been already asked or is already in the docs, I did not find the answer : Is there a way to read a given set of folders in Flink batch ? Let's say we have one folder per hour of data, written by flume, and we'd like to read only the N last hours (or any other pattern
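
A simple way to read an explicit list of folders in the batch API is to union several readTextFile sources (paths are placeholders):

```java
import java.util.Arrays;
import java.util.List;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

// The last N hourly folders, however they were selected.
List<String> folders = Arrays.asList(
        "hdfs:///data/2016-03-21-10",
        "hdfs:///data/2016-03-21-11",
        "hdfs:///data/2016-03-21-12");

DataSet<String> lines = null;
for (String folder : folders) {
    DataSet<String> part = env.readTextFile(folder);
    lines = (lines == null) ? part : lines.union(part);
}
```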

RE: Distribution of sinks among the nodes

2016-02-11 Thread Gwenhael Pasquiers
; > I added a new Ticket: https://issues.apache.org/jira/browse/FLINK-3336 > > This will implement the data shipping pattern that you mentioned in your > initial mail. I also have the Pull request almost ready. > >> On 04 Feb 2016, at 16:25, Gwenhael Pasquiers >>

RE: Internal buffers supervision and yarn vCPUs

2016-02-04 Thread Gwenhael Pasquiers
at 10:56 AM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: Hi, I’ve got two more questions on different topics… First one: Is there a way to monitor the buffers’ status? In order to find bottlenecks in our application we th

Internal buffers supervision and yarn vCPUs

2016-02-04 Thread Gwenhael Pasquiers
Hi, I’ve got two more questions on different topics… First one: Is there a way to monitor the buffers’ status? In order to find bottlenecks in our application we thought it could be useful to be able to have a look at the different exchange buffers’ status. To know if they are full (or as an

RE: Distribution of sinks among the nodes

2016-02-04 Thread Gwenhael Pasquiers
be at 24 if there was no chaining … From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: jeudi 4 février 2016 09:55 To: user@flink.apache.org Subject: RE: Distribution of sinks among the nodes Don’t we need to set the number of slots to 24 (4 sources + 16 mappers + 4 sinks

RE: Distribution of sinks among the nodes

2016-02-04 Thread Gwenhael Pasquiers
re. That is also the reason why the slots are per TaskManager, and not global, to associate them with a constant set of resources (mainly memory). Greetings, Stephan On Thu, Feb 4, 2016 at 9:54 AM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@eric

RE: Distribution of sinks among the nodes

2016-02-04 Thread Gwenhael Pasquiers
, Feb 3, 2016 at 4:31 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: It is one type of mapper with a parallelism of 16 It's the same for the sinks and sources (parallelism of 4) The settings are Env.setParallelism(4) Mapper.s

Distribution of sinks among the nodes

2016-02-03 Thread Gwenhael Pasquiers
Hi, We try to deploy an application with the following “architecture” : 4 kafka sources => 16 maps => 4 kafka sinks, on 4 nodes, with 24 slots (we disabled operator chaining). So we’d like on each node : 1x source => 4x map => 1x sink That way there are no exchanges between different
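
In code, the setup described here is roughly the following (source, mapper and sink classes are placeholders):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.disableOperatorChaining();

env.addSource(new MyKafkaSource()).setParallelism(4)
   .map(new MyMapper()).setParallelism(16)
   .addSink(new MyKafkaSink()).setParallelism(4);

env.execute("4-sources-16-maps-4-sinks");
```

Whether each node then ends up with exactly one source, four mappers and one sink still depends on how slots are assigned, which is what the rest of this thread discusses.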

RE: Distribution of sinks among the nodes

2016-02-03 Thread Gwenhael Pasquiers
you say 16 maps, are we talking about one mapper with parallelism 16 or 16 unique map operators? Regards, Aljoscha > On 03 Feb 2016, at 15:48, Gwenhael Pasquiers > <gwenhael.pasqui...@ericsson.com> wrote: > > Hi, > > We try to deploy an application with the following

RE: about blob.storage.dir and .buffer files

2016-01-29 Thread Gwenhael Pasquiers
then we have to check what the problem is. Could you check the size of your user jars? Cheers, Till ​ On Wed, Jan 27, 2016 at 4:38 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: Hello, We got a question about blob.storage.di

about blob.storage.dir and .buffer files

2016-01-27 Thread Gwenhael Pasquiers
Hello, We got a question about blob.storage.dir and its .buffer files: What are they? And are they cleaned, or is there a way to limit their size and to evaluate the necessary space? We got a node root volume disk filled by those files (~20GB) and it crashed. Well, the root was filled

RE: Reading Parquet/Hive

2015-12-18 Thread Gwenhael Pasquiers
later on, the String[] poses no issues with Tuple and it seems to be OK. Now ... Let's write those String[] in parquet too :) From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com] Sent: vendredi 18 décembre 2015 10:04 To: user@flink.apache.org Subject: Reading Parquet/Hive Hi, I'm tr

RE: YARN High Availability

2015-11-23 Thread Gwenhael Pasquiers
this step. The reasoning for the initial solution (not removing anything) was to make sure that no jobs are deleted by accident. But it looks like this is more confusing than helpful. – Ufuk > On 23 Nov 2015, at 11:45, Gwenhael Pasquiers > <gwenhael.pasqui...@ericsson.com> wrote: &

RE: YARN High Availability

2015-11-23 Thread Gwenhael Pasquiers
. The reason is that then all clusters will think that they belong together. Cheers, Till On Mon, Nov 23, 2015 at 2:15 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: OK, I understand. Maybe we are not really using flink as

RE: YARN High Availability

2015-11-18 Thread Gwenhael Pasquiers
On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: Hello, We’re trying to set up high availability using an existing zookeeper quorum already running in our Cloudera cluster. So, as per the doc we’ve cha

MaxPermSize on yarn

2015-11-16 Thread Gwenhael Pasquiers
Hi, We're having some OOM permgen exceptions when running on yarn. We're not yet sure if it is either a consequence or a cause of our crashes, but we've been trying to increase that value... And we did not find how to do it. I've seen that the yarn-daemon.sh sets a 256m value. It looks to me

RE: Building Flink with hadoop 2.6

2015-10-14 Thread Gwenhael Pasquiers
To: user@flink.apache.org Subject: Re: Building Flink with hadoop 2.6 Great. Which classes can it not find at runtime? I'll try to build and run Flink with exactly the command you've provided. On Wed, Oct 14, 2015 at 4:49 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.

Building Flink with hadoop 2.6

2015-10-14 Thread Gwenhael Pasquiers
Hi; We need to test some things with flink and hadoop 2.6 (the trunc method). We've set up a build task on our Jenkins and everything seems okay. However when we replace the original jar from your 0.10-SNAPSHOT distribution with ours there are some missing dependencies (log4j, curator, and maybe

RE: Building Flink with hadoop 2.6

2015-10-14 Thread Gwenhael Pasquiers
well its working for you and if you're missing something. Its a pretty new feature :) Also note that you can use the fs sinks with hadoop versions below 2.7.0, then we'll write some metadata containing the valid offsets. On Wed, Oct 14, 2015 at 5:18 PM, Gwenhael Pasquiers <gwenhael.pas

RE: Building Flink with hadoop 2.6

2015-10-14 Thread Gwenhael Pasquiers
u're using for building Flink? On Wed, Oct 14, 2015 at 4:37 PM, Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com<mailto:gwenhael.pasqui...@ericsson.com>> wrote: Hi ; We need to test some things with flink and hadoop 2.6 (the trunc method). We’ve set up a build task on our Jenki

Flink 0.9.1 Kafka 0.8.1

2015-09-10 Thread Gwenhael Pasquiers
Hi everyone, We're trying to consume from a 0.8.1 Kafka on Flink 0.9.1 and we've run into the following issue: My offset became OutOfRange; now when I start my job, it loops on the OutOfRangeException, no matter what the value of auto.offset.reset is... (earliest, latest, largest,

RE: Application-specific loggers configuration

2015-08-26 Thread Gwenhael Pasquiers
-cluster-per-job mode, then we'll have to try to find another solution. Best, Aljoscha On Tue, 25 Aug 2015 at 16:22 Gwenhael Pasquiers gwenhael.pasqui...@ericsson.commailto:gwenhael.pasqui...@ericsson.com wrote: Hi, We’re developing the first of (we hope) many flink streaming app. We’d like

Application-specific loggers configuration

2015-08-25 Thread Gwenhael Pasquiers
Hi, We're developing the first of (we hope) many flink streaming apps. We'd like to package the logging configuration (log4j) together with the jar. Meaning, different applications will probably have different logging configurations (e.g. to different logstash ports) ... Is there a way to