Re: Checking actual config values used by TaskManager

2016-04-28 Thread Timur Fayruzov
If you're talking about parameters that were set on JVM startup then `ps aux|grep flink` on an EMR slave node should do the trick, that'll give you the full command line. On Thu, Apr 28, 2016 at 9:00 PM, Ken Krugler wrote: > Hi all, > > I’m running jobs on EMR via

EMR vCores and slot allocation

2016-04-28 Thread Ken Krugler
Based on what Flink reports in the JobManager GUI, it looks like it thinks that the EC2 instances I’m using for my EMR jobs only have 4 physical cores. Which would make sense, as Amazon describes these servers as having 8 vCores. From

Anyone going to ApacheCon Big Data in Vancouver?

2016-04-28 Thread Ken Krugler
Hi all, Is anyone else from the community going? It would be fun to meet up with other Flink users during the event. I’ll be there from Sunday (May 8th) to early Wednesday afternoon (May 11th). — Ken PS - On Monday I’ll be giving a talk

Re: Multiple windows with large number of partitions

2016-04-28 Thread Christopher Santiago
Hi Aljoscha, Aljoscha Krettek wrote >>is there a reason for keying on both the "date only" field and the "userid"? I think you should be fine by just specifying that you want 1-day windows on your timestamps. My mistake, this was from earlier tests that I had performed. I removed it and went

Re: Flink Client use remote app jar

2016-04-28 Thread Theofilos Kakantousis
Thanks for the update. Hopefully when I get the time, I will try to contribute. Cheers, Theo On 2016-04-27 11:24, Till Rohrmann wrote: At the moment, there is no concrete plan to introduce such a feature, because it cannot be guaranteed that you always have a distributed file system

Problem in creating quickstart project using archetype (Scala)

2016-04-28 Thread nsengupta
Hello all, I don't know if anyone else has faced this; I haven't so far. When I try to create a new project template following the instructions here, it fails. This is what happens

Re: aggregation problem

2016-04-28 Thread Vasiliki Kalavri
Hi Riccardo, can you please be a bit more specific? What do you mean by "it didn't work"? Did it crash? Did it give you a wrong value? Something else? -Vasia. On 28 April 2016 at 16:52, Riccardo Diomedi wrote: > Hi everybody > > In a DeltaIteration I have a

Re: Getting java.lang.Exception when try to fetch data from Kafka

2016-04-28 Thread Robert Metzger
I would refer to the SimpleStringSchema as an example. On Wed, Apr 27, 2016 at 7:11 PM, prateekarora wrote: > Thanks for the response . > > can you please suggest some link or example to write own > DeserializationSchema ? > > Regards > Prateek > > On Tue, Apr 26,
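For readers who want to see the rough shape of such a schema, here is a minimal sketch. It does not use the Flink classes: `RecordDeserializer` is a hypothetical stand-in that only mirrors the two methods I understand Flink's DeserializationSchema to declare (`deserialize` and `isEndOfStream`), and it leaves out the type-information method. `Utf8StringDeserializer` is likewise an illustrative name and only approximates what SimpleStringSchema does.

```java
import java.io.Serializable;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in interface, NOT the Flink DeserializationSchema itself.
interface RecordDeserializer<T> extends Serializable {
    // Turns the raw bytes of one message into a record.
    T deserialize(byte[] message);

    // Lets the source stop when a special end marker arrives.
    boolean isEndOfStream(T nextElement);
}

// Decodes each message as a UTF-8 string (assumed encoding for this sketch).
class Utf8StringDeserializer implements RecordDeserializer<String> {
    private static final long serialVersionUID = 1L;

    @Override
    public String deserialize(byte[] message) {
        return new String(message, StandardCharsets.UTF_8);
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // An unbounded topic never signals its own end.
        return false;
    }
}

class DeserializerDemo {
    public static void main(String[] args) {
        RecordDeserializer<String> schema = new Utf8StringDeserializer();
        byte[] raw = "hello kafka".getBytes(StandardCharsets.UTF_8);
        System.out.println(schema.deserialize(raw));
    }
}
```

A schema for a custom type would follow the same pattern, replacing the string decoding with whatever parsing the message format needs.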

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler
Hi Ufuk, > On Apr 28, 2016, at 1:32am, Ufuk Celebi wrote: > > Hey Ken! > > That should not happen. Can you check the web interface for two things: > > - How many available slots are advertized on the landing page > (localhost:8081) when you submit your job? I’m running this

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ken Krugler
> On Apr 28, 2016, at 1:32am, Aljoscha Krettek wrote: > > Hi, > is this a streaming or batch job? Batch. > If it is a batch job, are you using either collect() or print() on a DataSet? Definitely not a print(). Don’t know about collect(), since the job is created via

aggregation problem

2016-04-28 Thread Riccardo Diomedi
Hi everybody In a DeltaIteration I have a DataSet> where, at a certain point of the iteration, I need to count the total number of tuples and the total number of elements in the HashSet of each tuple, and then send both values to the ConvergenceCriterion function.

Re: Flink log dir

2016-04-28 Thread Chesnay Schepler
According to https://issues.apache.org/jira/browse/FLINK-3678 it should be available in 1.0.3 On 28.04.2016 16:23, Flavio Pompermaier wrote: Hi to all, I'm using Flink 1.0.1 and I can't find how to change log directory. In the current master I see that there's the

Re: Regarding Broadcast of datasets in streaming context

2016-04-28 Thread Gyula Fóra
Hi Biplob, I have implemented a similar algorithm as Aljoscha mentioned. First things to clarify are the following: There is currently no abstraction for keeping objects (in your case centroids) in a centralized way that can be updated/read by all operators. This would probably be very costly and

Flink log dir

2016-04-28 Thread Flavio Pompermaier
Hi to all, I'm using Flink 1.0.1 and I can't find how to change log directory. In the current master I see that there's the *env.log.dir* parameter to configure. From which version is it (or will it be) available? Best, Flavio

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
Yes, assigning more than 0.5GB to a JM is a good idea. 3GB is maybe a bit too much, 2GB should be enough. Increasing the timeout should not hurt either. 2016-04-28 14:14 GMT+02:00 Flavio Pompermaier : > So what do you suggest to try for the next run? > I was going to

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
True, flatMap does not have access to watermarks. You can also go a bit more to the low levels and directly implement an AbstractStreamOperator with the OneInputStreamOperator interface. This is kind of the base class for the built-in stream operators and it has access to Watermarks

Re: Regarding Broadcast of datasets in streaming context

2016-04-28 Thread Biplob Biswas
That would really be great, any example would help me proceed with my work. Thanks a lot. Aljoscha Krettek wrote > Hi Biplob, > one of our developers had a stream clustering example a while back. It was > using a broadcast feedback edge with a co-operator to update the > centroids. > I'll

Re: General Data questions - streams vs batch

2016-04-28 Thread Konstantin Kulagin
Thanks Fabian, works like a charm except the case when the stream is finite (or I have a dataset from the beginning). In this case I need to somehow identify that the stream is finished and emit the latest batch (which might have fewer elements) to output. What is the best way to do that? In

Re: Regarding Broadcast of datasets in streaming context

2016-04-28 Thread Aljoscha Krettek
Hi Biplob, one of our developers had a stream clustering example a while back. It was using a broadcast feedback edge with a co-operator to update the centroids. I'll directly include him in the email so that he will notice and can send you the example. Cheers, Aljoscha On Thu, 28 Apr 2016 at

Re: Requesting the next InputSplit failed

2016-04-28 Thread Flavio Pompermaier
So what do you suggest to try for the next run? I was going to increase the Job Manager heap to 3 GB and maybe change some GC settings. Do you think I should also increase the Akka timeout or other things? On Thu, Apr 28, 2016 at 2:06 PM, Fabian Hueske wrote: > Hmm, 113k

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
Hmm, 113k splits is quite a lot. However, the IF uses the DefaultInputSplitAssigner which is very lightweight and should handle a large number of splits well. 2016-04-28 13:50 GMT+02:00 Flavio Pompermaier : > We generate 113k splits because we can't query more than 100k

Re: Requesting the next InputSplit failed

2016-04-28 Thread Flavio Pompermaier
We generate 113k splits because we can't query more than 100k records per split (and we have to manage 11 billion records). We tried to run the job only once; before running it a second time we would like to understand which parameter to tune in order to (at least try to) avoid such an

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
Is the problem reproducible? Maybe the SplitAssigner gets stuck somehow, but I've never observed something like that. How many splits do you generate? I guess it is not related, but 512MB for a TM is not a lot on machines with 16GB RAM. 2016-04-28 12:12 GMT+02:00 Flavio Pompermaier

Re: Failed to stream on Yarn cluster

2016-04-28 Thread patcharee
Hi again, Actually it works well! I just realized from looking at the Yarn application log that the Flink streaming result is printed in taskmanager.out. When I sent a question to the mailing list I looked at the screen where I issued the command, and there was no streaming result there. Where

Re: General Data questions - streams vs batch

2016-04-28 Thread Fabian Hueske
Hi Konstantin, if you do not need a deterministic grouping of elements you should not use a keyed stream or window. Instead you can do the lookups in a parallel flatMap function. The function would collect arriving elements and perform a lookup query after a certain number of elements arrived
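The buffering idea described here can be sketched in plain Java. This is only an illustration under stated assumptions: `BatchingLookup` is a hypothetical helper and not a Flink class, the `lookup` function stands in for whatever query is issued per batch, and a `Consumer` plays the role of the output collector. Where `flush` would be called when a finite input ends is left open, since that depends on the operator API in use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical helper, NOT a Flink class: buffers elements and runs one lookup per full batch.
class BatchingLookup<IN, OUT> {
    private final int batchSize;
    private final Function<List<IN>, List<OUT>> lookup;
    private final List<IN> buffer = new ArrayList<>();

    BatchingLookup(int batchSize, Function<List<IN>, List<OUT>> lookup) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.lookup = lookup;
    }

    // Called once per arriving element; triggers a lookup when the buffer is full.
    void flatMap(IN value, Consumer<OUT> out) {
        buffer.add(value);
        if (buffer.size() >= batchSize) {
            flush(out);
        }
    }

    // Emits results for whatever is buffered, including a final partial batch.
    void flush(Consumer<OUT> out) {
        if (buffer.isEmpty()) {
            return;
        }
        lookup.apply(new ArrayList<>(buffer)).forEach(out);
        buffer.clear();
    }
}

class BatchingLookupDemo {
    // Feeds the input through the helper; the fake lookup just reports each batch it saw.
    static List<String> run(List<Integer> input, int batchSize) {
        List<String> results = new ArrayList<>();
        BatchingLookup<Integer, String> fn = new BatchingLookup<>(
                batchSize, batch -> Collections.singletonList(batch.toString()));
        for (Integer value : input) {
            fn.flatMap(value, results::add);
        }
        fn.flush(results::add); // end of a finite input: emit the remainder
        return results;
    }

    public static void main(String[] args) {
        List<Integer> input = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            input.add(i);
        }
        System.out.println(run(input, 2));
    }
}
```

With five elements and a batch size of two, the helper issues two full lookups and one partial lookup at the final flush.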

Re: WindowedStream vs AllWindowedStream

2016-04-28 Thread Aljoscha Krettek
They are different classes because the signature of their apply method is different. If one were the subclass, it would be possible to call apply with the wrong signature. On Thu, 28 Apr 2016 at 12:25 Radu Prodan wrote: > Hi all, > > I have a question about the differences
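The signature argument can be made concrete with a small stand-alone sketch. All types below are hypothetical simplifications invented for this illustration (the real Flink window functions take more parameters, such as the window and a collector); the only point is that a keyed `apply` passes a key to the user function and a non-keyed one does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical simplified function types, NOT the Flink interfaces.
interface AllWindowFn<IN, OUT> {
    OUT apply(List<IN> window);
}

interface KeyedWindowFn<K, IN, OUT> {
    OUT apply(K key, List<IN> window);
}

// Non-keyed variant: one window over all elements.
class AllWindowedSketch<IN> {
    private final List<IN> elements;

    AllWindowedSketch(List<IN> elements) {
        this.elements = elements;
    }

    <OUT> OUT apply(AllWindowFn<IN, OUT> fn) {
        return fn.apply(elements);
    }
}

// Keyed variant. Deliberately not a subclass of AllWindowedSketch: inheriting
// apply(AllWindowFn) would let callers pass a function that never sees a key.
class KeyedWindowedSketch<K, IN> {
    private final Map<K, List<IN>> groups = new LinkedHashMap<>();

    KeyedWindowedSketch(List<IN> elements, Function<IN, K> keySelector) {
        for (IN element : elements) {
            groups.computeIfAbsent(keySelector.apply(element), k -> new ArrayList<>()).add(element);
        }
    }

    <OUT> Map<K, OUT> apply(KeyedWindowFn<K, IN, OUT> fn) {
        Map<K, OUT> result = new LinkedHashMap<>();
        for (Map.Entry<K, List<IN>> group : groups.entrySet()) {
            result.put(group.getKey(), fn.apply(group.getKey(), group.getValue()));
        }
        return result;
    }
}

class WindowSignatureDemo {
    static int countAll(List<String> words) {
        return new AllWindowedSketch<String>(words).<Integer>apply(w -> w.size());
    }

    static Map<Character, Integer> countByFirstLetter(List<String> words) {
        return new KeyedWindowedSketch<Character, String>(words, w -> w.charAt(0))
                .<Integer>apply((key, w) -> w.size());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("apple", "avocado", "banana");
        System.out.println(countAll(words));
        System.out.println(countByFirstLetter(words));
    }
}
```

Keeping the two classes unrelated means the compiler rejects a key-less function on the keyed variant, which is the safety property the reply describes.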

Re: Failed to stream on Yarn cluster

2016-04-28 Thread Maximilian Michels
Hi Patcharee, What do you mean by "nothing happened"? There is no output? Did you check the logs? Cheers, Max On Thu, Apr 28, 2016 at 12:10 PM, patcharee wrote: > Hi, > > I am testing the streaming wiki example - >

WindowedStream vs AllWindowedStream

2016-04-28 Thread Radu Prodan
Hi all, I have a question about the differences between WindowedStream and AllWindowedStream. From the definition I see that a WindowedStream is partitioned based on key but for AllWindowedStream this is not the case. So, what comes to my mind is, why WindowedStream is not the special case of

Failed to stream on Yarn cluster

2016-04-28 Thread patcharee
Hi, I am testing the streaming wiki example - https://ci.apache.org/projects/flink/flink-docs-master/quickstart/run_example_quickstart.html It works fine from local mode (mvn exec:java -Dexec.mainClass=wikiedits.WikipediaAnalysis). But when I run the jar on Yarn cluster mode, nothing

Re: Configuring a RichFunction on a DataStream

2016-04-28 Thread Fabian Hueske
Hi Robert, Function configuration via a Configuration object and the open method is an artifact from the past. The recommended way is to configure the function object via the constructor. Flink serializes the function object and ships it to the workers for execution. So the state of a function
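A minimal sketch of why constructor configuration survives shipping, using only the JDK: `PrefixMapper` is a hypothetical stand-in for a user function (not a Flink class), and `roundTrip` imitates the serialize-and-ship step with plain Java serialization. How Flink ships functions internally may differ in detail; the sketch only shows that fields set in the constructor travel with the object.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a user function; configured once through its constructor.
class PrefixMapper implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String prefix;

    PrefixMapper(String prefix) {
        this.prefix = prefix;
    }

    String map(String value) {
        return prefix + value;
    }
}

class ConstructorConfigDemo {
    // Serializes and deserializes an object, imitating shipping it to a worker.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        PrefixMapper shipped = roundTrip(new PrefixMapper("user-"));
        // The copy still carries the prefix given to the constructor.
        System.out.println(shipped.map("42"));
    }
}
```

Any field that is not serializable would need to be marked transient and rebuilt on the worker side, which is the usual reason to keep an open-style hook around.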

Re: Requesting the next InputSplit failed

2016-04-28 Thread Fabian Hueske
I checked the input format from your PR, but didn't see anything suspicious. It is definitely OK if the processing of an input split takes more than 10 seconds. That should not be the cause. It rather looks like the DataSourceTask fails to request a new split from the JobManager. 2016-04-28 9:37

Re: About flink stream table API

2016-04-28 Thread Fabian Hueske
Hi, Table API and SQL for streaming are work in progress. A first version which supports projection, filter, and union is merged to the master branch. Under the hood, Flink uses Calcite to optimize and translate Table API and SQL queries. Best, Fabian 2016-04-27 14:27 GMT+02:00 Zhangrucong

Re: Multiple windows with large number of partitions

2016-04-28 Thread Aljoscha Krettek
Hi, is there a reason for keying on both the "date only" field and the "userid"? I think you should be fine by just specifying that you want 1-day windows on your timestamps. Also, do you have a timestamp extractor in place that takes the timestamp from your data and sets it as the internal

RE: classpath issue on yarn

2016-04-28 Thread aris kol
So, I shaded guava. The whole thing works fine locally (stand alone local flink), but on yarn (forgot to mention it runs on EMR), I get the following: org.apache.flink.client.program.ProgramInvocationException: Unknown I/O error while extracting contained jar files. at

Create a cluster inside Flink

2016-04-28 Thread Simone Robutti
Hello everyone, I'm approaching a rather big and complex integration with an existing piece of software and I would like to hear the opinion of more experienced users on how to tackle a few issues. This software builds a cloud with its own logic. What I need is to keep these nodes as instances inside the

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Ufuk Celebi
Hey Ken! That should not happen. Can you check the web interface for two things: - How many available slots are advertized on the landing page (localhost:8081) when you submit your job? - Can you check the actual parallelism of the submitted job (it should appear as a FAILED job in the web

Configuring a RichFunction on a DataStream

2016-04-28 Thread Robert Schmidtke
Hi everyone, I noticed that in the DataSet API, there is the .withParameters function that allows passing values to a RichFunction's open method. I was wondering whether a similar approach can be used to do the same thing in a DataStream. Right now I'm getting the parameters via getRuntimeContext,

Re: Reducing parallelism leads to NoResourceAvailableException

2016-04-28 Thread Aljoscha Krettek
Hi, is this a streaming or batch job? If it is a batch job, are you using either collect() or print() on a DataSet? Cheers, Aljoscha On Thu, 28 Apr 2016 at 00:52 Ken Krugler wrote: > Hi all, > > In trying out different settings for performance, I run into a job

Re: Requesting the next InputSplit failed

2016-04-28 Thread Stefano Bortoli
Digging through the logs, we found this: WARN Remoting - Tried to associate with unreachable remote address [akka.tcp://flink@127.0.0.1:34984]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connessione rifiutata [Connection refused]: /127.0.0.1:34984 however, it

Re: Requesting the next InputSplit failed

2016-04-28 Thread Stefano Bortoli
I had this type of exception when trying to build and test Flink on a "small machine". I worked around it by increasing the timeout for Akka.