verbose console

2015-09-02 Thread Michele Bertoni
Hi everybody, I just found that in version 0.9.1 it is possible to disable 
that verbose console output. Can you please explain how to do it, both in the IDE 
and in a local environment?
Especially in the IDE, I am able to set the log4j properties for my own logger, but 
everything I try for the Flink-internal one does not work.
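
For reference, this is roughly what I am trying in the IDE (a sketch, assuming log4j 1.x 
is on the classpath and assuming "org.apache.flink" is the right name for the internal 
loggers; "com.example.myjob" is a placeholder for my own package):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietFlinkLogging {
    public static void main(String[] args) {
        // Raise the level of Flink's internal loggers so only warnings and errors are printed.
        Logger.getLogger("org.apache.flink").setLevel(Level.WARN);
        // Keep my own logger at INFO.
        Logger.getLogger("com.example.myjob").setLevel(Level.INFO);
    }
}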


thanks
michele

Travis updates on Github

2015-09-02 Thread Sachin Goel
Hi all
Is there some issue with the Travis integration? The last three pull requests
do not have their build status on the GitHub page. The builds are getting
triggered, though.

Regards
Sachin
-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685


Re: Travis updates on Github

2015-09-02 Thread Robert Metzger
Hi Sachin,

I also noticed that the GitHub integration is not working properly. I'll
ask the Apache Infra team.

On Wed, Sep 2, 2015 at 10:20 AM, Sachin Goel 
wrote:

> Hi all
> Is there some issue with travis integration? The last three pull requests
> do not have their build status on Github page. The builds are getting
> triggered though.
>
> Regards
> Sachin
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>


Bug broadcasting objects (serialization issue)

2015-09-02 Thread Andres R. Masegosa
Hi,

I get a bug when trying to broadcast a list of integers created with
"Arrays.asList(...)".

For example, if you try to run this "wordcount" example, you can
reproduce the bug.


import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class WordCountExample {

    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env =
                ExecutionEnvironment.getExecutionEnvironment();

        DataSet<String> text = env.fromElements(
                "Who's there?",
                "I think I hear them. Stand, ho! Who's there?");

        List<Integer> elements = Arrays.asList(0, 0, 0);

        DataSet<TestClass> set = env.fromElements(new TestClass(elements));

        DataSet<Tuple2<String, Integer>> wordCounts = text
                .flatMap(new LineSplitter())
                .withBroadcastSet(set, "set")
                .groupBy(0)
                .sum(1);

        wordCounts.print();
    }

    public static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }

    public static class TestClass implements Serializable {
        private static final long serialVersionUID = -2932037991574118651L;

        List<Integer> integerList;

        public TestClass(List<Integer> integerList) {
            this.integerList = integerList;
        }
    }
}


However, if instead of using "Arrays.asList(...)" we use the ArrayList<>
constructor, there is no problem.
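
A minimal sketch of that workaround (reusing env and TestClass from the example above;
the elements are copied into a regular java.util.ArrayList before broadcasting):

// Works: wrap the fixed-size list in a plain java.util.ArrayList first.
List<Integer> elements = new ArrayList<Integer>(Arrays.asList(0, 0, 0));
DataSet<TestClass> set = env.fromElements(new TestClass(elements));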


Regards,
Andres


Re: question on flink-storm-examples

2015-09-02 Thread Matthias J. Sax
Hi,
StormFileSpout uses a simple FileReader internally and cannot deal with
HDFS. It would be a nice extension to have. I just opened a JIRA for it:
https://issues.apache.org/jira/browse/FLINK-2606

Jerry, feel free to work on this feature and contribute code to Flink ;)
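
A rough sketch of the direction such an extension could take (my assumption, not the
actual FLINK-2606 implementation): open the file through Hadoop's FileSystem API
instead of a plain FileReader, so that hdfs:// paths work as well:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: opens a line reader for a local or an HDFS path.
public class HadoopLineReader {
    public static BufferedReader open(String uri) throws Exception {
        Path path = new Path(uri);
        // Selects the file system from the URI scheme (file://, hdfs://, ...).
        FileSystem fs = path.getFileSystem(new Configuration());
        return new BufferedReader(new InputStreamReader(fs.open(path)));
    }
}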

-Matthias

On 09/02/2015 07:52 AM, Aljoscha Krettek wrote:
> Hi Jerry,
> unfortunately, it seems that the StormFileSpout can only read files from
> a local filesystem, not from HDFS. Maybe Matthias has something in the
> works for that.
> 
> Regards,
> Aljoscha
> 
> On Tue, 1 Sep 2015 at 23:33 Jerry Peng  > wrote:
> 
> Ya, that's what I did and everything seems to execute fine, but when I try
> to run the WordCount-StormTopology with a file on HDFS I get
> a java.io.FileNotFoundException:
> 
> java.lang.RuntimeException: java.io.FileNotFoundException:
> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt (No such file or
> directory)
> 
> at
> 
> org.apache.flink.stormcompatibility.util.StormFileSpout.open(StormFileSpout.java:50)
> 
> at
> 
> org.apache.flink.stormcompatibility.wrappers.AbstractStormSpoutWrapper.run(AbstractStormSpoutWrapper.java:102)
> 
> at
> 
> org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
> 
> at
> 
> org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:58)
> 
> at
> 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:172)
> 
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)
> 
> at java.lang.Thread.run(Thread.java:745)
> 
> Caused by: java.io.FileNotFoundException:
> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt (No such file or
> directory)
> 
> at java.io.FileInputStream.open(Native Method)
> 
> at java.io.FileInputStream.(FileInputStream.java:138)
> 
> at java.io.FileInputStream.(FileInputStream.java:93)
> 
> at java.io.FileReader.(FileReader.java:58)
> 
> at
> 
> org.apache.flink.stormcompatibility.util.StormFileSpout.open(StormFileSpout.java:48)
> 
> 
> 
> However I have that file on my hdfs namespace:
> 
> 
> $ hadoop fs -ls -R /
> 
> 15/09/01 21:25:11 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java
> classes where applicable
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40 /home
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40
> /home/jerrypeng
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:41
> /home/jerrypeng/hadoop
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:40
> /home/jerrypeng/hadoop/dir
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-24 16:06
> /home/jerrypeng/hadoop/hadoop_dir
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-09-01 20:48
> /home/jerrypeng/hadoop/hadoop_dir/data
> 
> -rw-r--r--   3 jerrypeng supergroup  18552 2015-09-01 19:18
> /home/jerrypeng/hadoop/hadoop_dir/data/data.txt
> 
> -rw-r--r--   3 jerrypeng supergroup  0 2015-09-01 20:48
> /home/jerrypeng/hadoop/hadoop_dir/data/result.txt
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-21 14:41
> /home/jerrypeng/hadoop/hadoop_dir/dir1
> 
> drwxr-xr-x   - jerrypeng supergroup  0 2015-08-24 15:59
> /home/jerrypeng/hadoop/hadoop_dir/test
> 
> -rw-r--r--   3 jerrypeng supergroup 32 2015-08-24 15:59
> /home/jerrypeng/hadoop/hadoop_dir/test/filename.txt
> 
> 
> Any idea what's going on?
> 
> 
> On Tue, Sep 1, 2015 at 4:20 PM, Matthias J. Sax
>  > wrote:
> 
> You can use "bin/flink cancel JOBID" or JobManager WebUI to
> cancel the
> running job.
> 
> The exception you see, occurs in
> FlinkSubmitter.killTopology(...) which
> is not used by "bin/flink cancel" or the JobManager WebUI.
> 
> If you compile the example yourself, just remove the call to
> killTopology().
> 
> -Matthias
> 
> On 09/01/2015 11:16 PM, Matthias J. Sax wrote:
> > Oh yes. I forgot about this. I have already a fix for it in a
> pending
> > pull request... I hope that this PR is merged soon...
> >
> > If you want to observe the progress, look here:
> > https://issues.apache.org/jira/browse/FLINK-2111
> > and
> > https://issues.apache.org/jira/browse/FLINK-2338
> >
> > This PR, resolves both and fixed the problem you observed:
> > https://github.com/apache/flink/pull/750
> >
> > -Matthias
> >
> >
> > On 09/01/2015 11:09 PM, Jerry Peng wrote:
> >> Hello,
> >>
> >> I corrected the number of slots for each task 

Re: Multiple restarts of Local Cluster

2015-09-02 Thread Sachin Goel
If I'm right, all tests use either the MultipleProgramTestBase or
JavaProgramTestBase. Those shut down the cluster explicitly anyway.
I will verify whether this is the case.

Regards
Sachin

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Sep 2, 2015 at 9:40 PM, Till Rohrmann  wrote:

> Maybe we can create a single PlanExecutor for the LocalEnvironment which
> is used when calling execute. This of course entails that we don’t call
> stop on the LocalCluster. For cases where the program exits after calling
> execute, this should be fine because all resources will then be released
> anyway. It might matter for the test execution where maven reuses the JVMs
> and where the LocalFlinkMiniCluster won’t be garbage collected right
> away. You could try it out and see what happens.
>
> Cheers,
> Till
> ​
>
> On Wed, Sep 2, 2015 at 6:03 PM, Till Rohrmann 
> wrote:
>
>> Oh sorry, then I got the wrong context. I somehow thought it was about
>> test cases because I read `MultipleProgramTestBase` etc. Sorry my bad.
>>
>> On Wed, Sep 2, 2015 at 6:00 PM, Sachin Goel 
>> wrote:
>>
>>> I was under the impression that the @AfterClass annotation can only be
>>> used in test classes.
>>> Even so, the idea is that a user program running in the IDE should not
>>> be starting up the cluster several times [my primary concern is the
>>> addition of the persist operator], and we certainly cannot ask the user to
>>> terminate the cluster after execution, while in local mode.
>>>
>>> -- Sachin Goel
>>> Computer Science, IIT Delhi
>>> m. +91-9871457685
>>>
>>> On Wed, Sep 2, 2015 at 9:19 PM, Till Rohrmann 
>>> wrote:
>>>
 Why is it not possible to shut down the local cluster? Can’t you shut
 it down in the @AfterClass method?
 ​

 On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel 
 wrote:

> Yes. That will work too. However, then it isn't possible to shut down
> the local cluster. [Is it necessary to do so or does it shut down
> automatically when the program exists? I'm not entirely sure.]
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>
> On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen  wrote:
>
>> Have a look at some other tests, like the checkpointing tests. They
>> start one cluster manually and keep it running. They connect against it
>> using the remote environment ("localhost",
>> miniCluster.getJobManagerRpcPort()).
>>
>> That works nicely...
>>
>> On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel > > wrote:
>>
>>> Hi all
>>> While using LocalEnvironment, in case the program triggers execution
>>> several times, the {{LocalFlinkMiniCluster}} is started as many times. 
>>> This
>>> can consume a lot of time in setting up and tearing down the cluster.
>>> Further, this hinders with a new functionality I'm working on based on
>>> persisted results.
>>> One potential solution could be to follow the methodology in
>>> `MultipleProgramsTestBase`. The user code then would have to reside in a
>>> fixed function name, instead of the main method. Or is that too 
>>> cumbersome?
>>>
>>> Regards
>>> Sachin
>>> -- Sachin Goel
>>> Computer Science, IIT Delhi
>>> m. +91-9871457685
>>>
>>
>>
>

>>>
>>
>


Re: Multiple restarts of Local Cluster

2015-09-02 Thread Till Rohrmann
Oh sorry, then I got the wrong context. I somehow thought it was about test
cases because I read `MultipleProgramTestBase` etc. Sorry my bad.

On Wed, Sep 2, 2015 at 6:00 PM, Sachin Goel 
wrote:

> I was under the impression that the @AfterClass annotation can only be
> used in test classes.
> Even so, the idea is that a user program running in the IDE should not be
> starting up the cluster several times [my primary concern is the addition
> of the persist operator], and we certainly cannot ask the user to terminate
> the cluster after execution, while in local mode.
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>
> On Wed, Sep 2, 2015 at 9:19 PM, Till Rohrmann 
> wrote:
>
>> Why is it not possible to shut down the local cluster? Can’t you shut it
>> down in the @AfterClass method?
>> ​
>>
>> On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel 
>> wrote:
>>
>>> Yes. That will work too. However, then it isn't possible to shut down
>>> the local cluster. [Is it necessary to do so or does it shut down
>>> automatically when the program exists? I'm not entirely sure.]
>>>
>>> -- Sachin Goel
>>> Computer Science, IIT Delhi
>>> m. +91-9871457685
>>>
>>> On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen  wrote:
>>>
 Have a look at some other tests, like the checkpointing tests. They
 start one cluster manually and keep it running. They connect against it
 using the remote environment ("localhost",
 miniCluster.getJobManagerRpcPort()).

 That works nicely...

 On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel 
 wrote:

> Hi all
> While using LocalEnvironment, in case the program triggers execution
> several times, the {{LocalFlinkMiniCluster}} is started as many times. 
> This
> can consume a lot of time in setting up and tearing down the cluster.
> Further, this hinders with a new functionality I'm working on based on
> persisted results.
> One potential solution could be to follow the methodology in
> `MultipleProgramsTestBase`. The user code then would have to reside in a
> fixed function name, instead of the main method. Or is that too 
> cumbersome?
>
> Regards
> Sachin
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>


>>>
>>
>


Re: Multiple restarts of Local Cluster

2015-09-02 Thread Till Rohrmann
Why is it not possible to shut down the local cluster? Can’t you shut it
down in the @AfterClass method?
​

On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel 
wrote:

> Yes. That will work too. However, then it isn't possible to shut down the
> local cluster. [Is it necessary to do so or does it shut down automatically
> when the program exists? I'm not entirely sure.]
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>
> On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen  wrote:
>
>> Have a look at some other tests, like the checkpointing tests. They start
>> one cluster manually and keep it running. They connect against it using the
>> remote environment ("localhost", miniCluster.getJobManagerRpcPort()).
>>
>> That works nicely...
>>
>> On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel 
>> wrote:
>>
>>> Hi all
>>> While using LocalEnvironment, in case the program triggers execution
>>> several times, the {{LocalFlinkMiniCluster}} is started as many times. This
>>> can consume a lot of time in setting up and tearing down the cluster.
>>> Further, this hinders with a new functionality I'm working on based on
>>> persisted results.
>>> One potential solution could be to follow the methodology in
>>> `MultipleProgramsTestBase`. The user code then would have to reside in a
>>> fixed function name, instead of the main method. Or is that too cumbersome?
>>>
>>> Regards
>>> Sachin
>>> -- Sachin Goel
>>> Computer Science, IIT Delhi
>>> m. +91-9871457685
>>>
>>
>>
>


Re: Multiple restarts of Local Cluster

2015-09-02 Thread Sachin Goel
I was under the impression that the @AfterClass annotation can only be used
in test classes.
Even so, the idea is that a user program running in the IDE should not be
starting up the cluster several times [my primary concern is the addition
of the persist operator], and we certainly cannot ask the user to terminate
the cluster after execution, while in local mode.

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Sep 2, 2015 at 9:19 PM, Till Rohrmann  wrote:

> Why is it not possible to shut down the local cluster? Can’t you shut it
> down in the @AfterClass method?
> ​
>
> On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel 
> wrote:
>
>> Yes. That will work too. However, then it isn't possible to shut down the
>> local cluster. [Is it necessary to do so or does it shut down automatically
>> when the program exists? I'm not entirely sure.]
>>
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>> On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen  wrote:
>>
>>> Have a look at some other tests, like the checkpointing tests. They
>>> start one cluster manually and keep it running. They connect against it
>>> using the remote environment ("localhost",
>>> miniCluster.getJobManagerRpcPort()).
>>>
>>> That works nicely...
>>>
>>> On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel 
>>> wrote:
>>>
 Hi all
 While using LocalEnvironment, in case the program triggers execution
 several times, the {{LocalFlinkMiniCluster}} is started as many times. This
 can consume a lot of time in setting up and tearing down the cluster.
 Further, this hinders with a new functionality I'm working on based on
 persisted results.
 One potential solution could be to follow the methodology in
 `MultipleProgramsTestBase`. The user code then would have to reside in a
 fixed function name, instead of the main method. Or is that too cumbersome?

 Regards
 Sachin
 -- Sachin Goel
 Computer Science, IIT Delhi
 m. +91-9871457685

>>>
>>>
>>
>


NPE thrown when using Storm Kafka Spout with Flink

2015-09-02 Thread Jerry Peng
Hello,

When I try to run a storm topology with a Kafka Spout on top of Flink, I
get an NPE at:

15:00:32,853 ERROR org.apache.flink.streaming.runtime.tasks.StreamTask
  - Error closing stream operators after an exception.

java.lang.NullPointerException

at storm.kafka.KafkaSpout.close(KafkaSpout.java:130)

at
org.apache.flink.stormcompatibility.wrappers.AbstractStormSpoutWrapper.close(AbstractStormSpoutWrapper.java:128)

at
org.apache.flink.api.common.functions.util.FunctionUtils.closeFunction(FunctionUtils.java:40)

at
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.close(AbstractUdfStreamOperator.java:75)

at
org.apache.flink.streaming.runtime.tasks.StreamTask.closeAllOperators(StreamTask.java:243)

at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:197)

at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)

at java.lang.Thread.run(Thread.java:745)

15:00:32,855 INFO  org.apache.flink.streaming.runtime.tasks.StreamTask
  - State backend for state checkpoints is set to jobmanager.

15:00:32,855 INFO  org.apache.flink.runtime.taskmanager.Task
  - event_deserializer (5/5) switched to RUNNING

15:00:32,859 INFO  org.apache.flink.runtime.taskmanager.Task
  - Source: ads (1/1) switched to FAILED with exception.

java.lang.NullPointerException

at java.util.HashMap.putMapEntries(HashMap.java:500)

at java.util.HashMap.(HashMap.java:489)

at storm.kafka.KafkaSpout.open(KafkaSpout.java:73)

at
org.apache.flink.stormcompatibility.wrappers.AbstractStormSpoutWrapper.run(AbstractStormSpoutWrapper.java:102)

at
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)

at
org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:58)

at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:172)

at org.apache.flink.runtime.taskmanager.Task.run(Task.java:581)

at java.lang.Thread.run(Thread.java:745)


Has someone seen this before, or have a fix? I am using 0.10beta1 for all
Storm packages and a 0.10-SNAPSHOT (latest compiled) for all Flink
packages. A sample of the Kafka code I am using:

Broker broker = new Broker("localhost", 9092);
GlobalPartitionInformation partitionInfo = new GlobalPartitionInformation();
partitionInfo.addPartition(0, broker);
StaticHosts hosts = new StaticHosts(partitionInfo);

SpoutConfig spoutConfig = new SpoutConfig(hosts, "stuff",
UUID.randomUUID().toString());
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);

builder.setSpout("kafkaSpout", kafkaSpout, 1);


Re: How to force the parallelism on small streams?

2015-09-02 Thread Matthias J. Sax
Hi,

If I understand you correctly, you want to have 100 mappers. Thus you
need to apply the .setParallelism() after .map()

> addSource(myFileSource).rebalance().map(myFileMapper).setParallelism(100)

The order of commands you used sets the dop (degree of parallelism) for the
source to 100 (which might be ignored if the provided source function
"myFileSource" does not implement the "ParallelSourceFunction" interface).
The dop for the mapper then stays at the default value.

Using .rebalance() is absolutely correct. It distributes the emitted
tuples in a round robin fashion to all consumer tasks.
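
Written out over several lines, the variant I mean (using the placeholder names
from your mail):

// setParallelism() placed after map() sets the dop of the mapper to 100;
// the source keeps its own (default) parallelism.
env.addSource(myFileSource)
   .rebalance()
   .map(myFileMapper)
   .setParallelism(100);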

-Matthias

On 09/02/2015 05:41 PM, LINZ, Arnaud wrote:
> Hi,
> 
>  
> 
> I have a source that provides only a few items, since it emits file names to the
> mappers. Each mapper opens its file and processes the records. As the files are
> huge, one input line (a filename) represents substantial work for the next stage.
> 
> My topology looks like :
> 
> addSource(myFileSource).rebalance().setParallelism(100).map(myFileMapper)
> 
> If 100 mappers are created, about 85 end immediately and only a few
> process the files (for hours). I suspect an optimization that requires a minimum
> number of lines before they are passed to the next node, or the node is
> “shut down”; but in my case I do want the lines to be evenly distributed
> to each mapper.
> 
> How can I enforce that?
> 
>  
> 
> Greetings,
> 
> Arnaud
> 
> 
> 
> 
> 
> The integrity of this message cannot be guaranteed on the Internet. The
> company that sent this message cannot therefore be held liable for its
> content nor attachments. Any unauthorized use or dissemination is
> prohibited. If you are not the intended recipient of this message, then
> please delete it and notify the sender.





Re: Multiple restarts of Local Cluster

2015-09-02 Thread Stephan Ewen
You can always shut down a cluster manually (via shutdown()), and if the JVM
simply exits, all is well as well. Crucial cleanup is done in shutdown hooks.
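
For illustration, the general JVM mechanism meant here (a generic sketch, not the
actual Flink cleanup code):

// A shutdown hook runs when the JVM terminates normally (e.g. when main() returns)
// or on SIGTERM, so cleanup still happens even without an explicit shutdown() call.
Runtime.getRuntime().addShutdownHook(new Thread(new Runnable() {
    @Override
    public void run() {
        // release temporary files, network resources, etc.
    }
}));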

On Wed, Sep 2, 2015 at 6:22 PM, Till Rohrmann 
wrote:

> If I'm not mistaken, then the cluster should be properly terminated when
> it gets garbage collected. Thus, also when the main method exits.
>
> On Wed, Sep 2, 2015 at 6:14 PM, Sachin Goel 
> wrote:
>
>> If I'm right, all Tests use either the MultipleProgramTestBase or
>> JavaProgramTestBase​. Those shut down the cluster explicitly anyway.
>> I will make sure if this is the case.
>>
>> Regards
>> Sachin
>>
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>> On Wed, Sep 2, 2015 at 9:40 PM, Till Rohrmann 
>> wrote:
>>
>>> Maybe we can create a single PlanExecutor for the LocalEnvironment
>>> which is used when calling execute. This of course entails that we
>>> don’t call stop on the LocalCluster. For cases where the program exits
>>> after calling execute, this should be fine because all resources will then
>>> be released anyway. It might matter for the test execution where maven
>>> reuses the JVMs and where the LocalFlinkMiniCluster won’t be garbage
>>> collected right away. You could try it out and see what happens.
>>>
>>> Cheers,
>>> Till
>>> ​
>>>
>>> On Wed, Sep 2, 2015 at 6:03 PM, Till Rohrmann 
>>> wrote:
>>>
 Oh sorry, then I got the wrong context. I somehow thought it was about
 test cases because I read `MultipleProgramTestBase` etc. Sorry my bad.

 On Wed, Sep 2, 2015 at 6:00 PM, Sachin Goel 
 wrote:

> I was under the impression that the @AfterClass annotation can only be
> used in test classes.
> Even so, the idea is that a user program running in the IDE should not
> be starting up the cluster several times [my primary concern is the
> addition of the persist operator], and we certainly cannot ask the user to
> terminate the cluster after execution, while in local mode.
>
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>
> On Wed, Sep 2, 2015 at 9:19 PM, Till Rohrmann 
> wrote:
>
>> Why is it not possible to shut down the local cluster? Can’t you shut
>> it down in the @AfterClass method?
>> ​
>>
>> On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel > > wrote:
>>
>>> Yes. That will work too. However, then it isn't possible to shut
>>> down the local cluster. [Is it necessary to do so or does it shut down
>>> automatically when the program exists? I'm not entirely sure.]
>>>
>>> -- Sachin Goel
>>> Computer Science, IIT Delhi
>>> m. +91-9871457685
>>>
>>> On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen 
>>> wrote:
>>>
 Have a look at some other tests, like the checkpointing tests. They
 start one cluster manually and keep it running. They connect against it
 using the remote environment ("localhost",
 miniCluster.getJobManagerRpcPort()).

 That works nicely...

 On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel <
 sachingoel0...@gmail.com> wrote:

> Hi all
> While using LocalEnvironment, in case the program triggers
> execution several times, the {{LocalFlinkMiniCluster}} is started as 
> many
> times. This can consume a lot of time in setting up and tearing down 
> the
> cluster. Further, this hinders with a new functionality I'm working on
> based on persisted results.
> One potential solution could be to follow the methodology in
> `MultipleProgramsTestBase`. The user code then would have to reside 
> in a
> fixed function name, instead of the main method. Or is that too 
> cumbersome?
>
> Regards
> Sachin
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>


>>>
>>
>

>>>
>>
>


Re: Multiple restarts of Local Cluster

2015-09-02 Thread Till Rohrmann
Maybe we can create a single PlanExecutor for the LocalEnvironment which is
used when calling execute. This of course entails that we don’t call stop
on the LocalCluster. For cases where the program exits after calling
execute, this should be fine because all resources will then be released
anyway. It might matter for the test execution where maven reuses the JVMs
and where the LocalFlinkMiniCluster won’t be garbage collected right away.
You could try it out and see what happens.
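
As a generic sketch of the idea (not Flink's actual classes; all names are
placeholders): start the local cluster lazily on the first execute() call and
reuse it afterwards, without ever calling stop():

// Placeholder sketch: one lazily started local cluster, reused for every execute().
public final class LocalClusterHolder {
    private static Object cluster;  // stands in for the LocalFlinkMiniCluster instance

    public static synchronized Object getOrStart() {
        if (cluster == null) {
            cluster = startMiniCluster();  // hypothetical start-up call, done only once
        }
        return cluster;
    }

    private static Object startMiniCluster() {
        return new Object();  // placeholder for the real cluster start-up
    }
}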

Cheers,
Till
​

On Wed, Sep 2, 2015 at 6:03 PM, Till Rohrmann  wrote:

> Oh sorry, then I got the wrong context. I somehow thought it was about
> test cases because I read `MultipleProgramTestBase` etc. Sorry my bad.
>
> On Wed, Sep 2, 2015 at 6:00 PM, Sachin Goel 
> wrote:
>
>> I was under the impression that the @AfterClass annotation can only be
>> used in test classes.
>> Even so, the idea is that a user program running in the IDE should not be
>> starting up the cluster several times [my primary concern is the addition
>> of the persist operator], and we certainly cannot ask the user to terminate
>> the cluster after execution, while in local mode.
>>
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>> On Wed, Sep 2, 2015 at 9:19 PM, Till Rohrmann 
>> wrote:
>>
>>> Why is it not possible to shut down the local cluster? Can’t you shut it
>>> down in the @AfterClass method?
>>> ​
>>>
>>> On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel 
>>> wrote:
>>>
 Yes. That will work too. However, then it isn't possible to shut down
 the local cluster. [Is it necessary to do so or does it shut down
 automatically when the program exists? I'm not entirely sure.]

 -- Sachin Goel
 Computer Science, IIT Delhi
 m. +91-9871457685

 On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen  wrote:

> Have a look at some other tests, like the checkpointing tests. They
> start one cluster manually and keep it running. They connect against it
> using the remote environment ("localhost",
> miniCluster.getJobManagerRpcPort()).
>
> That works nicely...
>
> On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel 
> wrote:
>
>> Hi all
>> While using LocalEnvironment, in case the program triggers execution
>> several times, the {{LocalFlinkMiniCluster}} is started as many times. 
>> This
>> can consume a lot of time in setting up and tearing down the cluster.
>> Further, this hinders with a new functionality I'm working on based on
>> persisted results.
>> One potential solution could be to follow the methodology in
>> `MultipleProgramsTestBase`. The user code then would have to reside in a
>> fixed function name, instead of the main method. Or is that too 
>> cumbersome?
>>
>> Regards
>> Sachin
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>
>

>>>
>>
>


Re: Multiple restarts of Local Cluster

2015-09-02 Thread Sachin Goel
Okay. No problem.
Any suggestions for the correct context though? :')

I don't think something like a {{FlinkProgram}} class is a good idea [the user
would need to override a {{program}} method, and we would make sure the
cluster is set up only once and torn down properly only after the user code
finishes completely].

However, if we don't do that, shutting down the cluster becomes impossible.
Can I assume, however, that the actor system will be shut down automatically
when the main method exits? After all, the JVM will terminate. If so, I can
make some changes in LocalExecutor to start up the cluster only once.

Regards
Sachin

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Sep 2, 2015 at 9:33 PM, Till Rohrmann  wrote:

> Oh sorry, then I got the wrong context. I somehow thought it was about
> test cases because I read `MultipleProgramTestBase` etc. Sorry my bad.
>
> On Wed, Sep 2, 2015 at 6:00 PM, Sachin Goel 
> wrote:
>
>> I was under the impression that the @AfterClass annotation can only be
>> used in test classes.
>> Even so, the idea is that a user program running in the IDE should not be
>> starting up the cluster several times [my primary concern is the addition
>> of the persist operator], and we certainly cannot ask the user to terminate
>> the cluster after execution, while in local mode.
>>
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>> On Wed, Sep 2, 2015 at 9:19 PM, Till Rohrmann 
>> wrote:
>>
>>> Why is it not possible to shut down the local cluster? Can’t you shut it
>>> down in the @AfterClass method?
>>> ​
>>>
>>> On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel 
>>> wrote:
>>>
 Yes. That will work too. However, then it isn't possible to shut down
 the local cluster. [Is it necessary to do so or does it shut down
 automatically when the program exists? I'm not entirely sure.]

 -- Sachin Goel
 Computer Science, IIT Delhi
 m. +91-9871457685

 On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen  wrote:

> Have a look at some other tests, like the checkpointing tests. They
> start one cluster manually and keep it running. They connect against it
> using the remote environment ("localhost",
> miniCluster.getJobManagerRpcPort()).
>
> That works nicely...
>
> On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel 
> wrote:
>
>> Hi all
>> While using LocalEnvironment, in case the program triggers execution
>> several times, the {{LocalFlinkMiniCluster}} is started as many times. 
>> This
>> can consume a lot of time in setting up and tearing down the cluster.
>> Further, this hinders with a new functionality I'm working on based on
>> persisted results.
>> One potential solution could be to follow the methodology in
>> `MultipleProgramsTestBase`. The user code then would have to reside in a
>> fixed function name, instead of the main method. Or is that too 
>> cumbersome?
>>
>> Regards
>> Sachin
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>
>

>>>
>>
>


RE: How to force the parallelism on small streams?

2015-09-02 Thread LINZ, Arnaud
Hi,

You are right, but in fact it does not solve my problem, since I have a parallelism 
of 100 everywhere. Each of my 100 sources emits only a few lines (say 14 at most), 
and only the first 14 downstream nodes receive any data.
Replacing rebalance() with shuffle() gives the same problem.

But I found a workaround: setting the parallelism of the source to 1 (I don't need 
100 directory scanners anyway) forces the records to be rebalanced evenly among the 
mappers.
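
Written out, the workaround looks roughly like this (same placeholder names as before):

// A single directory scanner; the few file names are then rebalanced
// evenly across the 100 mappers.
env.addSource(myFileSource).setParallelism(1)
   .rebalance()
   .map(myFileMapper).setParallelism(100);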

Greetings,
Arnaud


-----Original Message-----
From: Matthias J. Sax [mailto:mj...@apache.org]
Sent: Wednesday, September 2, 2015 17:56
To: user@flink.apache.org
Subject: Re: How to force the parallelism on small streams?

Hi,

If I understand you correctly, you want to have 100 mappers. Thus you need to 
apply the .setParallelism() after .map()

> addSource(myFileSource).rebalance().map(myFileMapper).setParallelism(100)

The order of commands you used sets the dop (degree of parallelism) for the source 
to 100 (which might be ignored if the provided source function "myFileSource" does 
not implement the "ParallelSourceFunction" interface). The dop for the mapper then 
stays at the default value.

Using .rebalance() is absolutely correct. It distributes the emitted tuples in 
a round robin fashion to all consumer tasks.

-Matthias

On 09/02/2015 05:41 PM, LINZ, Arnaud wrote:
> Hi,
> 
>  
> 
> I have a source that provides only a few items, since it emits file names to 
> the mappers. Each mapper opens its file and processes the records. As the 
> files are huge, one input line (a filename) represents substantial work for 
> the next stage.
> 
> My topology looks like :
> 
> addSource(myFileSource).rebalance().setParallelism(100).map(myFileMapper)
> 
> If 100 mappers are created, about 85 end immediately and only a few 
> process the files (for hours). I suspect an optimization that requires a 
> minimum number of lines before they are passed to the next node, or the node 
> is “shut down”; but in my case I do want the lines to be evenly 
> distributed to each mapper.
> 
> How can I enforce that?
> 
>  
> 
> Greetings,
> 
> Arnaud
> 
> 
> --
> --
> 
> 
> The integrity of this message cannot be guaranteed on the Internet. 
> The company that sent this message cannot therefore be held liable for 
> its content nor attachments. Any unauthorized use or dissemination is 
> prohibited. If you are not the intended recipient of this message, 
> then please delete it and notify the sender.



Re: Multiple restarts of Local Cluster

2015-09-02 Thread Sachin Goel
I'm not sure what you mean by "Crucial cleanup is in shutdown hooks". Could
you elaborate?

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Sep 2, 2015 at 10:25 PM, Stephan Ewen  wrote:

> You can always shut down a cluster manually (via shutdown()) and if the
> JVM simply exists, all is well as well. Crucial cleanup is in shutdown
> hooks.
>
> On Wed, Sep 2, 2015 at 6:22 PM, Till Rohrmann 
> wrote:
>
>> If I'm not mistaken, then the cluster should be properly terminated when
>> it gets garbage collected. Thus, also when the main method exits.
>>
>> On Wed, Sep 2, 2015 at 6:14 PM, Sachin Goel 
>> wrote:
>>
>>> If I'm right, all Tests use either the MultipleProgramTestBase or
>>> JavaProgramTestBase​. Those shut down the cluster explicitly anyway.
>>> I will make sure if this is the case.
>>>
>>> Regards
>>> Sachin
>>>
>>> -- Sachin Goel
>>> Computer Science, IIT Delhi
>>> m. +91-9871457685
>>>
>>> On Wed, Sep 2, 2015 at 9:40 PM, Till Rohrmann 
>>> wrote:
>>>
 Maybe we can create a single PlanExecutor for the LocalEnvironment
 which is used when calling execute. This of course entails that we
 don’t call stop on the LocalCluster. For cases where the program exits
 after calling execute, this should be fine because all resources will then
 be released anyway. It might matter for the test execution where maven
 reuses the JVMs and where the LocalFlinkMiniCluster won’t be garbage
 collected right away. You could try it out and see what happens.

 Cheers,
 Till
 ​

 On Wed, Sep 2, 2015 at 6:03 PM, Till Rohrmann 
 wrote:

> Oh sorry, then I got the wrong context. I somehow thought it was about
> test cases because I read `MultipleProgramTestBase` etc. Sorry my bad.
>
> On Wed, Sep 2, 2015 at 6:00 PM, Sachin Goel 
> wrote:
>
>> I was under the impression that the @AfterClass annotation can only
>> be used in test classes.
>> Even so, the idea is that a user program running in the IDE should
>> not be starting up the cluster several times [my primary concern is the
>> addition of the persist operator], and we certainly cannot ask the user 
>> to
>> terminate the cluster after execution, while in local mode.
>>
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>> On Wed, Sep 2, 2015 at 9:19 PM, Till Rohrmann 
>> wrote:
>>
>>> Why is it not possible to shut down the local cluster? Can’t you
>>> shut it down in the @AfterClass method?
>>> ​
>>>
>>> On Wed, Sep 2, 2015 at 4:56 PM, Sachin Goel <
>>> sachingoel0...@gmail.com> wrote:
>>>
 Yes. That will work too. However, then it isn't possible to shut
 down the local cluster. [Is it necessary to do so or does it shut down
 automatically when the program exists? I'm not entirely sure.]

 -- Sachin Goel
 Computer Science, IIT Delhi
 m. +91-9871457685

 On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen 
 wrote:

> Have a look at some other tests, like the checkpointing tests.
> They start one cluster manually and keep it running. They connect 
> against
> it using the remote environment ("localhost",
> miniCluster.getJobManagerRpcPort()).
>
> That works nicely...
>
> On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel <
> sachingoel0...@gmail.com> wrote:
>
>> Hi all
>> While using LocalEnvironment, in case the program triggers
>> execution several times, the {{LocalFlinkMiniCluster}} is started as 
>> many
>> times. This can consume a lot of time in setting up and tearing down 
>> the
>> cluster. Further, this hinders with a new functionality I'm working 
>> on
>> based on persisted results.
>> One potential solution could be to follow the methodology in
>> `MultipleProgramsTestBase`. The user code then would have to reside 
>> in a
>> fixed function name, instead of the main method. Or is that too 
>> cumbersome?
>>
>> Regards
>> Sachin
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>
>

>>>
>>
>

>>>
>>
>


Re: nosuchmethoderror

2015-09-02 Thread Ferenc Turi
OK. As far as I can see, only the method name was changed. It was an unnecessary
modification which caused the incompatibility.

F.

On Wed, Sep 2, 2015 at 8:53 PM, Márton Balassi 
wrote:

> Dear Ferenc,
>
> The Kafka consumer implementations was modified from 0.9.0 to 0.9.1,
> please use the new code. [1]
>
> I suspect that your com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink
> depends on the way the Flink code used to look in 0.9.0, if you take a
> closer look Robert changed the function that is missing in your error in
> [1].
>
> [1]
> https://github.com/apache/flink/commit/940a7c8a667875b8512b63e4a32453b1a2a58785
>
> Best,
>
> Márton
>
> On Wed, Sep 2, 2015 at 8:47 PM, Ferenc Turi  wrote:
>
>> Hi,
>>
>> I tried to use the latest 0.9.1 release but I got:
>>
>> java.lang.NoSuchMethodError:
>> org.apache.flink.util.NetUtils.ensureCorrectHostnamePort(Ljava/lang/String;)V
>> at
>> com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink.(FlinkKafkaTopicWriterSink.java:69)
>> at
>> com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink.(FlinkKafkaTopicWriterSink.java:48)
>> at
>> com.nventdata.kafkaflink.FlinkKafkaTopicWriterMain.main(FlinkKafkaTopicWriterMain.java:54)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
>> at
>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
>> at org.apache.flink.client.program.Client.run(Client.java:315)
>> at
>> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:582)
>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:288)
>> at
>> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:878)
>> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:920)
>>
>>
>> Thanks,
>>
>> Ferenc
>>
>
>


-- 
Kind Regards,

Ferenc


max-fan

2015-09-02 Thread Greg Hogan
When workers spill more than 128 files, I have seen these fully merged into
one or more much larger files. Does the following parameter allow more
files to be stored without requiring the intermediate merge-sort? I have
changed it to 1024 without effect. Also, it appears that the entire set of
small files is reprocessed rather than the minimum required to attain the
max fan-in (i.e., starting with 150 files, 23 would be merged leaving 128
to be processed concurrently).

taskmanager.runtime.max-fan: The maximal fan-in for external merge joins
and fan-out for spilling hash tables. Limits the number of file handles per
operator, but may cause intermediate merging/partitioning, if set too small
(DEFAULT: 128).

Greg Hogan


Efficiency for Filter then Transform ( filter().map() vs flatMap() )

2015-09-02 Thread Welly Tambunan
Hi All,

I would like to filter some items from the event stream. I think there are
two ways of doing this.

We can use the regular pipeline filter(...).map(...), or we can use flatMap
to do both in the same operator.

Is there any performance improvement if we use flatMap, as that will be done
in one operator instance?
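
To make the two variants concrete, a small sketch (Event, events, MyFilter, MyMapper,
isWanted and transform are placeholder names):

// Variant 1: two chained operators.
events.filter(new MyFilter()).map(new MyMapper());

// Variant 2: a single flatMap that filters and transforms in one operator.
events.flatMap(new FlatMapFunction<Event, String>() {
    @Override
    public void flatMap(Event value, Collector<String> out) {
        if (isWanted(value)) {              // same predicate as MyFilter
            out.collect(transform(value));  // same transformation as MyMapper
        }
    }
});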


Cheers

-- 
Welly Tambunan
Triplelands

http://weltam.wordpress.com
http://www.triplelands.com 


Re: Efficiency for Filter then Transform ( filter().map() vs flatMap() )

2015-09-02 Thread Gyula Fóra
Hey Welly,

If you call filter and map one after the other like you mentioned, these
operators will be chained and executed as if they were running in the same
operator.
The only small performance overhead comes from the fact that the output of
the filter will be copied before passing it as input to the map to keep
immutability guarantees (but no serialization/deserialization will happen).
Copying might be practically free depending on your data type, though.

If you are using operators that don't rely on the immutability of
inputs/outputs (i.e., you don't hold references to those values), then you can
disable copying altogether by calling env.getConfig().enableObjectReuse(),
in which case they will have exactly the same performance.
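
For completeness, enabling it is a one-liner on the execution config (a short sketch):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Disables the defensive copies between chained operators; only safe if your
// functions do not hold on to references of input/output records.
env.getConfig().enableObjectReuse();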

Cheers,
Gyula

Welly Tambunan  ezt írta (időpont: 2015. szept. 3., Cs,
4:33):

> Hi All,
>
> I would like to filter some item from the event stream. I think there are
> two ways doing this.
>
> Using the regular pipeline filter(...).map(...). We can also use flatMap
> for doing both in the same operator.
>
> Any performance improvement if we are using flatMap ? As that will be done
> in one operator instance.
>
>
> Cheers
>
>
> --
> Welly Tambunan
> Triplelands
>
> http://weltam.wordpress.com
> http://www.triplelands.com 
>


Re: Hardware requirements and learning resources

2015-09-02 Thread Juan Rodríguez Hortalá
Hi Robert and Jay,

Thanks for your answers. The petstore jobs could indeed be used as a Rosetta
code for Flink and Spark.

Regarding the memory requirements, that is very good news to me; just 2GB
of RAM is certainly a modest amount of memory, and you could even use some Single
Board Computers for that. Are there any reference load-test programs or
benchmarks that can be used to compare different deployments of Flink?
Maybe the petstore implementation mentioned by Jay could be used for that,
and also to compare the performance of Flink to other systems like Spark or
Hadoop MapReduce, which I understand is the current goal.

Greetings,

Juan


2015-09-02 14:56 GMT+02:00 jay vyas :

> We're also working on a bigpetstore implementation of flink which will
> help onboard spark/mapreduce folks.
>
> I have prototypical code here that runs a simple job in memory,
> contributions welcome,
>
> right now there is a serialization error
> https://github.com/bigpetstore/bigpetstore-flink .
>
> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
> wrote:
>
>> Hi Juan,
>>
>> I think the recommendations in the Spark guide are quite good, and are
>> similar to what I would recommend for Flink as well.
>> Depending on the workloads you are interested to run, you can certainly
>> use Flink with less than 8 GB per machine. I think you can start Flink
>> TaskManagers with 500 MB of heap space and they'll still be able to process
>> some GB of data.
>>
>> Everything above 2 GB is probably good enough for some initial
>> experimentation (again depending on your workloads, network, disk speed
>> etc.)
>>
>>
>>
>>
>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> Flink is quite nimble with hardware requirements; people have run it in
>>> old-ish laptops and also the largest instances available in cloud
>>> providers. I will let others chime in with more details.
>>>
>>> I am not aware of something along the lines of a cheatsheet that you
>>> mention. If you actually try to do this, I would love to see it, and it
>>> might be useful to others as well. Both use similar abstractions at the API
>>> level (i.e., parallel collections), so if you stay true to the functional
>>> paradigm and not try to "abuse" the system by exploiting knowledge of its
>>> internals things should be straightforward. These apply to the batch APIs;
>>> the streaming API in Flink follows a true streaming paradigm, where you get
>>> an unbounded stream of records and operators on these streams.
>>>
>>> Funny that you ask about a video for the DataStream slides. There is a
>>> Flink training happening as we speak, and a video is being recorded right
>>> now :-) Hopefully it will be made available soon.
>>>
>>> Best,
>>> Kostas
>>>
>>>
>>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>
 Answering to myself, I have found some nice training material at
 http://dataartisans.github.io/flink-training. There are even videos at
 youtube for some of the slides

   - http://dataartisans.github.io/flink-training/overview/intro.html
 https://www.youtube.com/watch?v=XgC6c4Wiqvs

   -
 http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
 https://www.youtube.com/watch?v=0EARqW15dDk

 The third lecture
 http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
 more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU
 but not exactly, and there are more lessons at
 http://dataartisans.github.io/flink-training, for stream processing
 and the table API for which I haven't found a video. Does anyone have
 pointers to the missing videos?

 Greetings,

 Juan

 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com>:

> Hi list,
>
> I'm new to Flink, and I find this project very interesting. I have
> experience with Apache Spark, and from what I've seen so far I find that Flink
> provides an API at a similar abstraction level but based on single record
> processing instead of batch processing. I've read in Quora that Flink
> extends stream processing to batch processing, while Spark extends batch
> processing to streaming. Therefore I find Flink specially attractive for
> low latency stream processing. Anyway, I would appreciate if someone could
> give some indication about where I could find a list of hardware
> requirements for the slave nodes in a Flink cluster. Something along the
> lines of
> https://spark.apache.org/docs/latest/hardware-provisioning.html.
> Spark is known for having quite high minimal memory requirements (8GB RAM
> and 8 cores minimum), and I was wondering if it is also the case for 
> Flink.
> Lower memory requirements would be very interesting for building 

nosuchmethoderror

2015-09-02 Thread Ferenc Turi
Hi,

I tried to use the latest 0.9.1 release but I got:

java.lang.NoSuchMethodError:
org.apache.flink.util.NetUtils.ensureCorrectHostnamePort(Ljava/lang/String;)V
at
com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink.(FlinkKafkaTopicWriterSink.java:69)
at
com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink.(FlinkKafkaTopicWriterSink.java:48)
at
com.nventdata.kafkaflink.FlinkKafkaTopicWriterMain.main(FlinkKafkaTopicWriterMain.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
at org.apache.flink.client.program.Client.run(Client.java:315)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:582)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:288)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:878)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:920)


Thanks,

Ferenc


Re: nosuchmethoderror

2015-09-02 Thread Márton Balassi
Dear Ferenc,

The Kafka consumer implementation was modified from 0.9.0 to 0.9.1; please
use the new code. [1]

I suspect that your com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink
depends on the way the Flink code used to look in 0.9.0. If you take a
closer look, Robert changed the function that is missing in your error in
[1].

[1]
https://github.com/apache/flink/commit/940a7c8a667875b8512b63e4a32453b1a2a58785

Best,

Márton

On Wed, Sep 2, 2015 at 8:47 PM, Ferenc Turi  wrote:

> Hi,
>
> I tried to use the latest 0.9.1 release but I got:
>
> java.lang.NoSuchMethodError:
> org.apache.flink.util.NetUtils.ensureCorrectHostnamePort(Ljava/lang/String;)V
> at
> com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink.(FlinkKafkaTopicWriterSink.java:69)
> at
> com.nventdata.kafkaflink.sink.FlinkKafkaTopicWriterSink.(FlinkKafkaTopicWriterSink.java:48)
> at
> com.nventdata.kafkaflink.FlinkKafkaTopicWriterMain.main(FlinkKafkaTopicWriterMain.java:54)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437)
> at
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353)
> at org.apache.flink.client.program.Client.run(Client.java:315)
> at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:582)
> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:288)
> at
> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:878)
> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:920)
>
>
> Thanks,
>
> Ferenc
>


Re: Hardware requirements and learning resources

2015-09-02 Thread Juan Rodríguez Hortalá
Answering to myself, I have found some nice training material at
http://dataartisans.github.io/flink-training. There are even videos at
youtube for some of the slides

  - http://dataartisans.github.io/flink-training/overview/intro.html
https://www.youtube.com/watch?v=XgC6c4Wiqvs

  - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
https://www.youtube.com/watch?v=0EARqW15dDk

The third lecture
http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU but
not exactly, and there are more lessons at
http://dataartisans.github.io/flink-training, for stream processing and the
table API for which I haven't found a video. Does anyone have pointers to
the missing videos?

Greetings,

Juan

2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
juan.rodriguez.hort...@gmail.com>:

> Hi list,
>
> I'm new to Flink, and I find this project very interesting. I have
> experience with Apache Spark, and from what I've seen so far I find that Flink
> provides an API at a similar abstraction level but based on single record
> processing instead of batch processing. I've read in Quora that Flink
> extends stream processing to batch processing, while Spark extends batch
> processing to streaming. Therefore I find Flink specially attractive for
> low latency stream processing. Anyway, I would appreciate if someone could
> give some indication about where I could find a list of hardware
> requirements for the slave nodes in a Flink cluster. Something along the
> lines of https://spark.apache.org/docs/latest/hardware-provisioning.html.
> Spark is known for having quite high minimal memory requirements (8GB RAM
> and 8 cores minimum), and I was wondering if it is also the case for Flink.
> Lower memory requirements would be very interesting for building small
> Flink clusters for educational purposes, or for small projects.
>
> Apart from that, I wonder if there is some blog post by the comunity about
> transitioning from Spark to Flink. I think it could be interesting, as
> there are some similarities in the APIs, but also deep differences in the
> underlying approaches. I was thinking in something like Breeze's cheatsheet
> comparing its matrix operatations with those available in Matlab and Numpy
> https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet, or
> like http://rosettacode.org/wiki/Factorial. Just an idea anyway. Also,
> any pointer to some online course, book or training for Flink besides the
> official programming guides would be much appreciated
>
> Thanks in advance for help
>
> Greetings,
>
> Juan
>
>


Re: Bug broadcasting objects (serialization issue)

2015-09-02 Thread Maximilian Michels
Hi Andreas,

Thank you for reporting the problem and including code to reproduce it. I
think there is a problem with the class serialization or deserialization:
Arrays.asList uses a private ArrayList class (java.util.Arrays$ArrayList),
which is not the one you would normally use (java.util.ArrayList).
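
A quick way to see the difference (plain Java, independent of Flink):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AsListDemo {
    public static void main(String[] args) {
        List<Integer> a = Arrays.asList(0, 0, 0);
        List<Integer> b = new ArrayList<Integer>(Arrays.asList(0, 0, 0));

        // Prints "java.util.Arrays$ArrayList" -- the private nested class.
        System.out.println(a.getClass().getName());
        // Prints "java.util.ArrayList" -- the regular ArrayList.
        System.out.println(b.getClass().getName());
    }
}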

I'll create a JIRA issue to keep track of the problem and to investigate
further.

Best regards,
Max

Here's the stack trace:

Exception in thread "main"
org.apache.flink.runtime.client.JobExecutionException: Cannot initialize
task 'DataSource (at main(Test.java:32)
(org.apache.flink.api.java.io.CollectionInputFormat))': Deserializing the
InputFormat ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:523)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:507)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at org.apache.flink.runtime.jobmanager.JobManager.org
$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:507)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:190)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at
scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at
org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:43)
at
org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at
org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at
org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Deserializing the InputFormat
([mytests.Test$TestClass@4d6025c5]) failed: unread block data
at
org.apache.flink.runtime.jobgraph.InputFormatVertex.initializeOnMaster(InputFormatVertex.java:60)
at
org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
... 25 more
Caused by: java.lang.IllegalStateException: unread block data
at
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
at
org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:302)
at
org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:264)
at
org.apache.flink.runtime.operators.util.TaskConfig.getStubWrapper(TaskConfig.java:282)
at
org.apache.flink.runtime.jobgraph.InputFormatVertex.initializeOnMaster(InputFormatVertex.java:57)
... 26 more

On Wed, Sep 2, 2015 at 11:17 AM, Andres R. Masegosa 
wrote:

> Hi,
>
> I get a bug when trying to broadcast a list of integers created with the
> primitive "Arrays.asList(...)".
>
> For example, if you try to run this "wordcount" example, you can
> reproduce the bug.
>
>
> public class WordCountExample {
> public static void main(String[] args) throws Exception {
> final ExecutionEnvironment env =
> ExecutionEnvironment.getExecutionEnvironment();
>
> DataSet text = env.fromElements(
> "Who's there?",
> "I think I hear them. 

Re: Bigpetstore - Flink integration

2015-09-02 Thread Robert Metzger
Okay, I see.

As I said before, I was not able to reproduce the serialization issue
you've reported.
Can you maybe post the exception you are seeing?

On Wed, Sep 2, 2015 at 3:32 PM, jay vyas 
wrote:

> Hey, thanks!
>
> Those are just seeds, the files aren't large.
>
> The scale out data is the transactions.
>
> The seed data needs to be the same, shipped to ALL nodes, and then
>
> the nodes generate transactions.
>
>
> On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger 
> wrote:
>
>> I'm starting a new discussion thread for the bigpetstore-flink
>> integration ...
>>
>>
>> I took a closer look into the code you've posted.
>> It seems to me that you are generating a lot of data locally on the
>> client, before you actually submit a job to Flink. (Both "customers" and
>> "stores" are generated locally)
>> Is that only some "seed" data?
>>
>> I would actually try to generate as much data as possible in the cluster,
>> making the generator very scalable.
>>
>> I don't think that you need to register a Kryo serializer for the Product
>> and Transaction type.
>> I was able to run the code without the serializer registration.
>>
>>
>> -- Forwarded message --
>> From: jay vyas 
>> Date: Wed, Sep 2, 2015 at 2:56 PM
>> Subject: Re: Hardware requirements and learning resources
>> To: user@flink.apache.org
>>
>>
>> We're also working on a bigpetstore implementation of flink which will
>> help onboard spark/mapreduce folks.
>>
>> I have prototypical code here that runs a simple job in memory,
>> contributions welcome,
>>
>> right now there is a serialization error
>> https://github.com/bigpetstore/bigpetstore-flink .
>>
>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> I think the recommendations in the Spark guide are quite good, and are
>>> similar to what I would recommend for Flink as well.
>>> Depending on the workloads you are interested to run, you can certainly
>>> use Flink with less than 8 GB per machine. I think you can start Flink
>>> TaskManagers with 500 MB of heap space and they'll still be able to process
>>> some GB of data.
>>>
>>> Everything above 2 GB is probably good enough for some initial
>>> experimentation (again depending on your workloads, network, disk speed
>>> etc.)
>>>
>>>
>>>
>>>
>>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>>> wrote:
>>>
 Hi Juan,

 Flink is quite nimble with hardware requirements; people have run it in
 old-ish laptops and also the largest instances available in cloud
 providers. I will let others chime in with more details.

 I am not aware of something along the lines of a cheatsheet that you
 mention. If you actually try to do this, I would love to see it, and it
 might be useful to others as well. Both use similar abstractions at the API
 level (i.e., parallel collections), so if you stay true to the functional
 paradigm and not try to "abuse" the system by exploiting knowledge of its
 internals things should be straightforward. These apply to the batch APIs;
 the streaming API in Flink follows a true streaming paradigm, where you get
 an unbounded stream of records and operators on these streams.

 Funny that you ask about a video for the DataStream slides. There is a
 Flink training happening as we speak, and a video is being recorded right
 now :-) Hopefully it will be made available soon.

 Best,
 Kostas


 On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com> wrote:

> Answering to myself, I have found some nice training material at
> http://dataartisans.github.io/flink-training. There are even videos
> at youtube for some of the slides
>
>   - http://dataartisans.github.io/flink-training/overview/intro.html
> https://www.youtube.com/watch?v=XgC6c4Wiqvs
>
>   -
> http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
> https://www.youtube.com/watch?v=0EARqW15dDk
>
> The third lecture
> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
> more or less corresponds to
> https://www.youtube.com/watch?v=1yWKZ26NQeU but not exactly, and
> there are more lessons at http://dataartisans.github.io/flink-training,
> for stream processing and the table API for which I haven't found a
> video. Does anyone have pointers to the missing videos?
>
> Greetings,
>
> Juan
>
> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com>:
>
>> Hi list,
>>
>> I'm new to Flink, and I find this project very interesting. I have
>> experience with Apache Spark, and for I've seen so far I find that Flink
>> provides an API at a similar abstraction level but based on single record

Re: Bigpetstore - Flink integration

2015-09-02 Thread jay vyas
Hey, thanks!

Those are just seeds; the files aren't large.

The scale-out data is the transactions.

The seed data needs to be the same, shipped to ALL nodes, and then the nodes
generate the transactions.


On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger  wrote:

> I'm starting a new discussion thread for the bigpetstore-flink integration
> ...
>
>
> I took a closer look into the code you've posted.
> It seems to me that you are generating a lot of data locally on the
> client, before you actually submit a job to Flink. (Both "customers" and
> "stores" are generated locally)
> Is that only some "seed" data?
>
> I would actually try to generate as much data as possible in the cluster,
> making the generator very scalable.
>
> I don't think that you need to register a Kryo serializer for the Product
> and Transaction type.
> I was able to run the code without the serializer registration.
>
>
> -- Forwarded message --
> From: jay vyas 
> Date: Wed, Sep 2, 2015 at 2:56 PM
> Subject: Re: Hardware requirements and learning resources
> To: user@flink.apache.org
>
>
> We're also working on a bigpetstore implementation of flink which will
> help onboard spark/mapreduce folks.
>
> I have prototypical code here that runs a simple job in memory,
> contributions welcome,
>
> right now there is a serialization error
> https://github.com/bigpetstore/bigpetstore-flink .
>
> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
> wrote:
>
>> Hi Juan,
>>
>> I think the recommendations in the Spark guide are quite good, and are
>> similar to what I would recommend for Flink as well.
>> Depending on the workloads you are interested to run, you can certainly
>> use Flink with less than 8 GB per machine. I think you can start Flink
>> TaskManagers with 500 MB of heap space and they'll still be able to process
>> some GB of data.
>>
>> Everything above 2 GB is probably good enough for some initial
>> experimentation (again depending on your workloads, network, disk speed
>> etc.)
>>
>>
>>
>>
>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> Flink is quite nimble with hardware requirements; people have run it in
>>> old-ish laptops and also the largest instances available in cloud
>>> providers. I will let others chime in with more details.
>>>
>>> I am not aware of something along the lines of a cheatsheet that you
>>> mention. If you actually try to do this, I would love to see it, and it
>>> might be useful to others as well. Both use similar abstractions at the API
>>> level (i.e., parallel collections), so if you stay true to the functional
>>> paradigm and not try to "abuse" the system by exploiting knowledge of its
>>> internals things should be straightforward. These apply to the batch APIs;
>>> the streaming API in Flink follows a true streaming paradigm, where you get
>>> an unbounded stream of records and operators on these streams.
>>>
>>> Funny that you ask about a video for the DataStream slides. There is a
>>> Flink training happening as we speak, and a video is being recorded right
>>> now :-) Hopefully it will be made available soon.
>>>
>>> Best,
>>> Kostas
>>>
>>>
>>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>
 Answering to myself, I have found some nice training material at
 http://dataartisans.github.io/flink-training. There are even videos at
 youtube for some of the slides

   - http://dataartisans.github.io/flink-training/overview/intro.html
 https://www.youtube.com/watch?v=XgC6c4Wiqvs

   -
 http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
 https://www.youtube.com/watch?v=0EARqW15dDk

 The third lecture
 http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
 more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU
 but not exactly, and there are more lessons at
 http://dataartisans.github.io/flink-training, for stream processing
 and the table API for which I haven't found a video. Does anyone have
 pointers to the missing videos?

 Greetings,

 Juan

 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com>:

> Hi list,
>
> I'm new to Flink, and I find this project very interesting. I have
> experience with Apache Spark, and for I've seen so far I find that Flink
> provides an API at a similar abstraction level but based on single record
> processing instead of batch processing. I've read in Quora that Flink
> extends stream processing to batch processing, while Spark extends batch
> processing to streaming. Therefore I find Flink specially attractive for
> low latency stream processing. Anyway, I would appreciate if someone could
> give some indication about where I could find 

Re: Bigpetstore - Flink integration

2015-09-02 Thread Stephan Ewen
If a lot of the data is generated locally, this may run into the same issue
Greg hit with oversized payloads (messages dropped by Akka).
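
A rough sketch of the alternative Robert suggests (generating the transactions
on the workers so that only small seed data travels with the job; the
generator call below is just a placeholder):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class GenerateInClusterSketch {
    public static void main(String[] args) throws Exception {
        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // The id range is split across the TaskManagers; each worker generates its
        // own records, so the client ships almost nothing through Akka.
        DataSet<String> transactions = env
                .generateSequence(0, 1000000)
                .map(new MapFunction<Long, String>() {
                    @Override
                    public String map(Long id) {
                        // placeholder for the real transaction generator, seeded by the id
                        return "transaction-" + id;
                    }
                });

        transactions.writeAsText("/tmp/transactions");
        env.execute("generate data in the cluster");
    }
}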

On Wed, Sep 2, 2015 at 3:21 PM, Robert Metzger  wrote:

> I'm starting a new discussion thread for the bigpetstore-flink integration
> ...
>
>
> I took a closer look into the code you've posted.
> It seems to me that you are generating a lot of data locally on the
> client, before you actually submit a job to Flink. (Both "customers" and
> "stores" are generated locally)
> Is that only some "seed" data?
>
> I would actually try to generate as much data as possible in the cluster,
> making the generator very scalable.
>
> I don't think that you need to register a Kryo serializer for the Product
> and Transaction type.
> I was able to run the code without the serializer registration.
>
>
> -- Forwarded message --
> From: jay vyas 
> Date: Wed, Sep 2, 2015 at 2:56 PM
> Subject: Re: Hardware requirements and learning resources
> To: user@flink.apache.org
>
>
> We're also working on a bigpetstore implementation of flink which will
> help onboard spark/mapreduce folks.
>
> I have prototypical code here that runs a simple job in memory,
> contributions welcome,
>
> right now there is a serialization error
> https://github.com/bigpetstore/bigpetstore-flink .
>
> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
> wrote:
>
>> Hi Juan,
>>
>> I think the recommendations in the Spark guide are quite good, and are
>> similar to what I would recommend for Flink as well.
>> Depending on the workloads you are interested to run, you can certainly
>> use Flink with less than 8 GB per machine. I think you can start Flink
>> TaskManagers with 500 MB of heap space and they'll still be able to process
>> some GB of data.
>>
>> Everything above 2 GB is probably good enough for some initial
>> experimentation (again depending on your workloads, network, disk speed
>> etc.)
>>
>>
>>
>>
>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> Flink is quite nimble with hardware requirements; people have run it in
>>> old-ish laptops and also the largest instances available in cloud
>>> providers. I will let others chime in with more details.
>>>
>>> I am not aware of something along the lines of a cheatsheet that you
>>> mention. If you actually try to do this, I would love to see it, and it
>>> might be useful to others as well. Both use similar abstractions at the API
>>> level (i.e., parallel collections), so if you stay true to the functional
>>> paradigm and not try to "abuse" the system by exploiting knowledge of its
>>> internals things should be straightforward. These apply to the batch APIs;
>>> the streaming API in Flink follows a true streaming paradigm, where you get
>>> an unbounded stream of records and operators on these streams.
>>>
>>> Funny that you ask about a video for the DataStream slides. There is a
>>> Flink training happening as we speak, and a video is being recorded right
>>> now :-) Hopefully it will be made available soon.
>>>
>>> Best,
>>> Kostas
>>>
>>>
>>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>
 Answering to myself, I have found some nice training material at
 http://dataartisans.github.io/flink-training. There are even videos at
 youtube for some of the slides

   - http://dataartisans.github.io/flink-training/overview/intro.html
 https://www.youtube.com/watch?v=XgC6c4Wiqvs

   -
 http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
 https://www.youtube.com/watch?v=0EARqW15dDk

 The third lecture
 http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
 more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU
 but not exactly, and there are more lessons at
 http://dataartisans.github.io/flink-training, for stream processing
 and the table API for which I haven't found a video. Does anyone have
 pointers to the missing videos?

 Greetings,

 Juan

 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com>:

> Hi list,
>
> I'm new to Flink, and I find this project very interesting. I have
> experience with Apache Spark, and for I've seen so far I find that Flink
> provides an API at a similar abstraction level but based on single record
> processing instead of batch processing. I've read in Quora that Flink
> extends stream processing to batch processing, while Spark extends batch
> processing to streaming. Therefore I find Flink specially attractive for
> low latency stream processing. Anyway, I would appreciate if someone could
> give some indication about where I could find a list of hardware
> requirements for the slave nodes in a Flink cluster. 

Re: Bug broadcasting objects (serialization issue)

2015-09-02 Thread Stephan Ewen
Yes, we could even serialize in the constructor. Then the failure (if
serialization does not work) shows up immediately.
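
A minimal sketch of that idea (a hypothetical wrapper, not an existing Flink
class): round-trip the user object in the constructor so that a broken
serialization surfaces while the program is being constructed, not when the
job is submitted.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public final class SerializedValueCheck<T extends Serializable> implements Serializable {

    private final byte[] bytes;

    public SerializedValueCheck(T value) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(value);   // fails here if the value cannot be written
        oos.close();
        this.bytes = bos.toByteArray();
        get();                    // and here if it cannot be read back
    }

    @SuppressWarnings("unchecked")
    public T get() throws IOException {
        try {
            ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
            return (T) ois.readObject();
        } catch (ClassNotFoundException e) {
            throw new IOException("Class of the serialized value not found", e);
        }
    }
}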

On Wed, Sep 2, 2015 at 4:02 PM, Maximilian Michels  wrote:

> Nice suggestion. So you want to serialize and deserialize the InputFormats
> on the Client to check whether they can be transferred correctly? Merely
> serializing is not enough because the above Exception occurs during
> deserialization.
>
> On Wed, Sep 2, 2015 at 2:29 PM, Stephan Ewen  wrote:
>
>> We should try to improve the exception here. More people will run into
>> this issue and the exception should help them understand it well.
>>
>> How about we do eager serialization into a set of byte arrays? Then the
>> serializability issue comes immediately when the program is constructed,
>> rather than later, when it is shipped.
>>
>> On Wed, Sep 2, 2015 at 12:56 PM, Maximilian Michels 
>> wrote:
>>
>>> Here's the JIRA issue: https://issues.apache.org/jira/browse/FLINK-2608
>>>
>>> On Wed, Sep 2, 2015 at 12:49 PM, Maximilian Michels 
>>> wrote:
>>>
 Hi Andreas,

 Thank you for reporting the problem and including the code to reproduce
 the problem. I think there is a problem with the class serialization or
 deserialization. Arrays.asList uses a private ArrayList class
 (java.util.Arrays$ArrayList) which is not the one you would normally use
 (java.util.ArrayList).

 I'll create a JIRA issue to keep track of the problem and to
 investigate further.

 Best regards,
 Max

 Here's the stack trace:

 Exception in thread "main"
 org.apache.flink.runtime.client.JobExecutionException: Cannot initialize
 task 'DataSource (at main(Test.java:32)
 (org.apache.flink.api.java.io.CollectionInputFormat))': Deserializing the
 InputFormat ([mytests.Test$TestClass@4d6025c5]) failed: unread block
 data
 at
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:523)
 at
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:507)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at
 scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at org.apache.flink.runtime.jobmanager.JobManager.org
 $apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:507)
 at
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:190)
 at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
 at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
 at
 scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
 at
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:43)
 at
 org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
 at
 scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
 at
 org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
 at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
 at
 org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
 at akka.actor.ActorCell.invoke(ActorCell.scala:487)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
 at akka.dispatch.Mailbox.run(Mailbox.scala:221)
 at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
 at
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: java.lang.Exception: Deserializing the InputFormat
 ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
 at
 org.apache.flink.runtime.jobgraph.InputFormatVertex.initializeOnMaster(InputFormatVertex.java:60)
 at
 org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
 ... 25 more
 Caused by: java.lang.IllegalStateException: unread block data
 at
 

Re: Bigpetstore - Flink integration

2015-09-02 Thread jay vyas
Hmmm, interesting: it looks to be working magically now... :) I must have
written some code late at night that fixed it and forgot. The original errors
I was getting were Kryo-related.

The objects aren't yet being serialized on write to anything useful, but I'm
sure that's an easy fix.

Onward and upward !

On Wed, Sep 2, 2015 at 9:31 AM, Stephan Ewen  wrote:

> If a lot of the data is generated locally, this may face the same issue as
> Greg did with oversized payloads (dropped by Akka).
>
> On Wed, Sep 2, 2015 at 3:21 PM, Robert Metzger 
> wrote:
>
>> I'm starting a new discussion thread for the bigpetstore-flink
>> integration ...
>>
>>
>> I took a closer look into the code you've posted.
>> It seems to me that you are generating a lot of data locally on the
>> client, before you actually submit a job to Flink. (Both "customers" and
>> "stores" are generated locally)
>> Is that only some "seed" data?
>>
>> I would actually try to generate as much data as possible in the cluster,
>> making the generator very scalable.
>>
>> I don't think that you need to register a Kryo serializer for the Product
>> and Transaction type.
>> I was able to run the code without the serializer registration.
>>
>>
>> -- Forwarded message --
>> From: jay vyas 
>> Date: Wed, Sep 2, 2015 at 2:56 PM
>> Subject: Re: Hardware requirements and learning resources
>> To: user@flink.apache.org
>>
>>
>> We're also working on a bigpetstore implementation of flink which will
>> help onboard spark/mapreduce folks.
>>
>> I have prototypical code here that runs a simple job in memory,
>> contributions welcome,
>>
>> right now there is a serialization error
>> https://github.com/bigpetstore/bigpetstore-flink .
>>
>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> I think the recommendations in the Spark guide are quite good, and are
>>> similar to what I would recommend for Flink as well.
>>> Depending on the workloads you are interested to run, you can certainly
>>> use Flink with less than 8 GB per machine. I think you can start Flink
>>> TaskManagers with 500 MB of heap space and they'll still be able to process
>>> some GB of data.
>>>
>>> Everything above 2 GB is probably good enough for some initial
>>> experimentation (again depending on your workloads, network, disk speed
>>> etc.)
>>>
>>>
>>>
>>>
>>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>>> wrote:
>>>
 Hi Juan,

 Flink is quite nimble with hardware requirements; people have run it in
 old-ish laptops and also the largest instances available in cloud
 providers. I will let others chime in with more details.

 I am not aware of something along the lines of a cheatsheet that you
 mention. If you actually try to do this, I would love to see it, and it
 might be useful to others as well. Both use similar abstractions at the API
 level (i.e., parallel collections), so if you stay true to the functional
 paradigm and not try to "abuse" the system by exploiting knowledge of its
 internals things should be straightforward. These apply to the batch APIs;
 the streaming API in Flink follows a true streaming paradigm, where you get
 an unbounded stream of records and operators on these streams.

 Funny that you ask about a video for the DataStream slides. There is a
 Flink training happening as we speak, and a video is being recorded right
 now :-) Hopefully it will be made available soon.

 Best,
 Kostas


 On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com> wrote:

> Answering to myself, I have found some nice training material at
> http://dataartisans.github.io/flink-training. There are even videos
> at youtube for some of the slides
>
>   - http://dataartisans.github.io/flink-training/overview/intro.html
> https://www.youtube.com/watch?v=XgC6c4Wiqvs
>
>   -
> http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
> https://www.youtube.com/watch?v=0EARqW15dDk
>
> The third lecture
> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
> more or less corresponds to
> https://www.youtube.com/watch?v=1yWKZ26NQeU but not exactly, and
> there are more lessons at http://dataartisans.github.io/flink-training,
> for stream processing and the table API for which I haven't found a
> video. Does anyone have pointers to the missing videos?
>
> Greetings,
>
> Juan
>
> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com>:
>
>> Hi list,
>>
>> I'm new to Flink, and I find this project very interesting. I have
>> experience with Apache Spark, and for I've seen so far I find that Flink
>> provides an 

Re: Bug broadcasting objects (serialization issue)

2015-09-02 Thread Maximilian Michels
Nice suggestion. So you want to serialize and deserialize the InputFormats
on the Client to check whether they can be transferred correctly? Merely
serializing is not enough because the above Exception occurs during
deserialization.

On Wed, Sep 2, 2015 at 2:29 PM, Stephan Ewen  wrote:

> We should try to improve the exception here. More people will run into
> this issue and the exception should help them understand it well.
>
> How about we do eager serialization into a set of byte arrays? Then the
> serializability issue comes immediately when the program is constructed,
> rather than later, when it is shipped.
>
> On Wed, Sep 2, 2015 at 12:56 PM, Maximilian Michels 
> wrote:
>
>> Here's the JIRA issue: https://issues.apache.org/jira/browse/FLINK-2608
>>
>> On Wed, Sep 2, 2015 at 12:49 PM, Maximilian Michels 
>> wrote:
>>
>>> Hi Andreas,
>>>
>>> Thank you for reporting the problem and including the code to reproduce
>>> the problem. I think there is a problem with the class serialization or
>>> deserialization. Arrays.asList uses a private ArrayList class
>>> (java.util.Arrays$ArrayList) which is not the one you would normally use
>>> (java.util.ArrayList).
>>>
>>> I'll create a JIRA issue to keep track of the problem and to investigate
>>> further.
>>>
>>> Best regards,
>>> Max
>>>
>>> Here's the stack trace:
>>>
>>> Exception in thread "main"
>>> org.apache.flink.runtime.client.JobExecutionException: Cannot initialize
>>> task 'DataSource (at main(Test.java:32)
>>> (org.apache.flink.api.java.io.CollectionInputFormat))': Deserializing the
>>> InputFormat ([mytests.Test$TestClass@4d6025c5]) failed: unread block
>>> data
>>> at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:523)
>>> at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:507)
>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>>> at org.apache.flink.runtime.jobmanager.JobManager.org
>>> $apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:507)
>>> at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:190)
>>> at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>>> at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>>> at
>>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>>> at
>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:43)
>>> at
>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
>>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>>> at
>>> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>>> at
>>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>> Caused by: java.lang.Exception: Deserializing the InputFormat
>>> ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
>>> at
>>> org.apache.flink.runtime.jobgraph.InputFormatVertex.initializeOnMaster(InputFormatVertex.java:60)
>>> at
>>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
>>> ... 25 more
>>> Caused by: java.lang.IllegalStateException: unread block data
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>>> at
>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>>> at
>>> 

Re: verbose console

2015-09-02 Thread Maximilian Michels
Hi Michele,

Please supply a log4j.properties file as a Java VM property, like
so: -Dlog4j.configuration=file:/path/to/log4j.properties

Your IDE should have an option to adjust VM arguments.
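
For example, a minimal log4j.properties along these lines quiets Flink's
internal logging while keeping your own logger at INFO (the last package name
is only a placeholder for your own code):

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %c{1} - %m%n

# silence Flink internals, keep application logging
log4j.logger.org.apache.flink=WARN
log4j.logger.com.mycompany.myjob=INFO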

Cheers,
Max

On Wed, Sep 2, 2015 at 9:10 AM, Michele Bertoni
 wrote:
> Hi everybody, I just found that in version 0.9.1 it is possibile to disable 
> that verbose console, can you please explain how to do it both in IDE and 
> local environment?
> Especially in IDE I am able to set property of log4j for my logger, but 
> everything I try for flink internal one does not work
>
>
> thanks
> michele


Re: Bug broadcasting objects (serialization issue)

2015-09-02 Thread Maximilian Michels
Here's the JIRA issue: https://issues.apache.org/jira/browse/FLINK-2608
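
Until that is fixed, copying the list into a plain java.util.ArrayList before
handing it over should sidestep the problem. A sketch (env and TestClass as in
the reported example):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Arrays.asList returns the private java.util.Arrays$ArrayList, which is what
// trips up deserialization on the JobManager; a plain ArrayList works.
List<Integer> elements = new ArrayList<Integer>(Arrays.asList(0, 0, 0));
DataSet<TestClass> set = env.fromElements(new TestClass(elements));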

On Wed, Sep 2, 2015 at 12:49 PM, Maximilian Michels  wrote:

> Hi Andreas,
>
> Thank you for reporting the problem and including the code to reproduce
> the problem. I think there is a problem with the class serialization or
> deserialization. Arrays.asList uses a private ArrayList class
> (java.util.Arrays$ArrayList) which is not the one you would normally use
> (java.util.ArrayList).
>
> I'll create a JIRA issue to keep track of the problem and to investigate
> further.
>
> Best regards,
> Max
>
> Here's the stack trace:
>
> Exception in thread "main"
> org.apache.flink.runtime.client.JobExecutionException: Cannot initialize
> task 'DataSource (at main(Test.java:32)
> (org.apache.flink.api.java.io.CollectionInputFormat))': Deserializing the
> InputFormat ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
> at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:523)
> at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:507)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at org.apache.flink.runtime.jobmanager.JobManager.org
> $apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:507)
> at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:190)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> at
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> at
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:43)
> at
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> at
> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> at
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.lang.Exception: Deserializing the InputFormat
> ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
> at
> org.apache.flink.runtime.jobgraph.InputFormatVertex.initializeOnMaster(InputFormatVertex.java:60)
> at
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
> ... 25 more
> Caused by: java.lang.IllegalStateException: unread block data
> at
> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
> at
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
> at
> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:302)
> at
> org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:264)
> at
> org.apache.flink.runtime.operators.util.TaskConfig.getStubWrapper(TaskConfig.java:282)
> at
> org.apache.flink.runtime.jobgraph.InputFormatVertex.initializeOnMaster(InputFormatVertex.java:57)
> ... 26 more
>
> On Wed, Sep 2, 2015 at 11:17 AM, Andres R. Masegosa 
> wrote:
>
>> Hi,
>>
>> I get a bug when trying to broadcast a list of integers created with the
>> primitive "Arrays.asList(...)".
>>
>> For example, if you try to run this "wordcount" example, you can

Re: Hardware requirements and learning resources

2015-09-02 Thread Robert Metzger
Hi Juan,

I think the recommendations in the Spark guide are quite good, and are
similar to what I would recommend for Flink as well.
Depending on the workloads you are interested in running, you can certainly use
Flink with less than 8 GB per machine. I think you can start Flink
TaskManagers with 500 MB of heap space and they'll still be able to process a
few GB of data.

Everything above 2 GB is probably good enough for some initial
experimentation (again depending on your workloads, network, disk speed
etc.)
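
For a small setup, something along these lines in conf/flink-conf.yaml should
be enough (key names as in the 0.9.x configuration docs; double-check against
your version):

jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 2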




On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas  wrote:

> Hi Juan,
>
> Flink is quite nimble with hardware requirements; people have run it in
> old-ish laptops and also the largest instances available in cloud
> providers. I will let others chime in with more details.
>
> I am not aware of something along the lines of a cheatsheet that you
> mention. If you actually try to do this, I would love to see it, and it
> might be useful to others as well. Both use similar abstractions at the API
> level (i.e., parallel collections), so if you stay true to the functional
> paradigm and not try to "abuse" the system by exploiting knowledge of its
> internals things should be straightforward. These apply to the batch APIs;
> the streaming API in Flink follows a true streaming paradigm, where you get
> an unbounded stream of records and operators on these streams.
>
> Funny that you ask about a video for the DataStream slides. There is a
> Flink training happening as we speak, and a video is being recorded right
> now :-) Hopefully it will be made available soon.
>
> Best,
> Kostas
>
>
> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com> wrote:
>
>> Answering to myself, I have found some nice training material at
>> http://dataartisans.github.io/flink-training. There are even videos at
>> youtube for some of the slides
>>
>>   - http://dataartisans.github.io/flink-training/overview/intro.html
>> https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>
>>   - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>> https://www.youtube.com/watch?v=0EARqW15dDk
>>
>> The third lecture
>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
>> more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU
>> but not exactly, and there are more lessons at
>> http://dataartisans.github.io/flink-training, for stream processing and
>> the table API for which I haven't found a video. Does anyone have pointers
>> to the missing videos?
>>
>> Greetings,
>>
>> Juan
>>
>> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
>> juan.rodriguez.hort...@gmail.com>:
>>
>>> Hi list,
>>>
>>> I'm new to Flink, and I find this project very interesting. I have
>>> experience with Apache Spark, and for I've seen so far I find that Flink
>>> provides an API at a similar abstraction level but based on single record
>>> processing instead of batch processing. I've read in Quora that Flink
>>> extends stream processing to batch processing, while Spark extends batch
>>> processing to streaming. Therefore I find Flink specially attractive for
>>> low latency stream processing. Anyway, I would appreciate if someone could
>>> give some indication about where I could find a list of hardware
>>> requirements for the slave nodes in a Flink cluster. Something along the
>>> lines of https://spark.apache.org/docs/latest/hardware-provisioning.html.
>>> Spark is known for having quite high minimal memory requirements (8GB RAM
>>> and 8 cores minimum), and I was wondering if it is also the case for Flink.
>>> Lower memory requirements would be very interesting for building small
>>> Flink clusters for educational purposes, or for small projects.
>>>
>>> Apart from that, I wonder if there is some blog post by the comunity
>>> about transitioning from Spark to Flink. I think it could be interesting,
>>> as there are some similarities in the APIs, but also deep differences in
>>> the underlying approaches. I was thinking in something like Breeze's
>>> cheatsheet comparing its matrix operatations with those available in Matlab
>>> and Numpy
>>> https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet, or
>>> like http://rosettacode.org/wiki/Factorial. Just an idea anyway. Also,
>>> any pointer to some online course, book or training for Flink besides the
>>> official programming guides would be much appreciated
>>>
>>> Thanks in advance for help
>>>
>>> Greetings,
>>>
>>> Juan
>>>
>>>
>>
>


Re: Hardware requirements and learning resources

2015-09-02 Thread Jay Vyas
Just running the main class is sufficient

> On Sep 2, 2015, at 8:59 AM, Robert Metzger  wrote:
> 
> Hey jay,
> 
> How can I reproduce the error?
> 
>> On Wed, Sep 2, 2015 at 2:56 PM, jay vyas  wrote:
>> We're also working on a bigpetstore implementation of flink which will help 
>> onboard spark/mapreduce folks.
>> 
>> I have prototypical code here that runs a simple job in memory, 
>> contributions welcome,
>> 
>> right now there is a serialization error 
>> https://github.com/bigpetstore/bigpetstore-flink .
>> 
>>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger  wrote:
>>> Hi Juan,
>>> 
>>> I think the recommendations in the Spark guide are quite good, and are 
>>> similar to what I would recommend for Flink as well. 
>>> Depending on the workloads you are interested to run, you can certainly use 
>>> Flink with less than 8 GB per machine. I think you can start Flink 
>>> TaskManagers with 500 MB of heap space and they'll still be able to process 
>>> some GB of data.
>>> 
>>> Everything above 2 GB is probably good enough for some initial 
>>> experimentation (again depending on your workloads, network, disk speed 
>>> etc.)
>>> 
>>> 
>>> 
>>> 
 On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas  wrote:
 Hi Juan,
 
 Flink is quite nimble with hardware requirements; people have run it in 
 old-ish laptops and also the largest instances available in cloud 
 providers. I will let others chime in with more details.
 
 I am not aware of something along the lines of a cheatsheet that you 
 mention. If you actually try to do this, I would love to see it, and it 
 might be useful to others as well. Both use similar abstractions at the 
 API level (i.e., parallel collections), so if you stay true to the 
 functional paradigm and not try to "abuse" the system by exploiting 
 knowledge of its internals things should be straightforward. These apply 
 to the batch APIs; the streaming API in Flink follows a true streaming 
 paradigm, where you get an unbounded stream of records and operators on 
 these streams.
 
 Funny that you ask about a video for the DataStream slides. There is a 
 Flink training happening as we speak, and a video is being recorded right 
 now :-) Hopefully it will be made available soon.
 
 Best,
 Kostas
 
 
> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá 
>  wrote:
> Answering to myself, I have found some nice training material at 
> http://dataartisans.github.io/flink-training. There are even videos at 
> youtube for some of the slides
> 
>   - http://dataartisans.github.io/flink-training/overview/intro.html
> https://www.youtube.com/watch?v=XgC6c4Wiqvs
> 
>   - http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
> https://www.youtube.com/watch?v=0EARqW15dDk
> 
> The third lecture 
> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html 
> more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU 
> but not exactly, and there are more lessons at 
> http://dataartisans.github.io/flink-training, for stream processing and 
> the table API for which I haven't found a video. Does anyone have 
> pointers to the missing videos?
> 
> Greetings, 
> 
> Juan
> 
> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá 
> :
>> Hi list, 
>> 
>> I'm new to Flink, and I find this project very interesting. I have 
>> experience with Apache Spark, and for I've seen so far I find that Flink 
>> provides an API at a similar abstraction level but based on single 
>> record processing instead of batch processing. I've read in Quora that 
>> Flink extends stream processing to batch processing, while Spark extends 
>> batch processing to streaming. Therefore I find Flink specially 
>> attractive for low latency stream processing. Anyway, I would appreciate 
>> if someone could give some indication about where I could find a list of 
>> hardware requirements for the slave nodes in a Flink cluster. Something 
>> along the lines of 
>> https://spark.apache.org/docs/latest/hardware-provisioning.html. Spark 
>> is known for having quite high minimal memory requirements (8GB RAM and 
>> 8 cores minimum), and I was wondering if it is also the case for Flink. 
>> Lower memory requirements would be very interesting for building small 
>> Flink clusters for educational purposes, or for small projects. 
>> 
>> Apart from that, I wonder if there is some blog post by the comunity 
>> about transitioning from Spark to Flink. I think it could be 
>> interesting, as there are some similarities in the APIs, but also deep 
>> 

Re: Hardware requirements and learning resources

2015-09-02 Thread Robert Metzger
Hey jay,

How can I reproduce the error?

On Wed, Sep 2, 2015 at 2:56 PM, jay vyas 
wrote:

> We're also working on a bigpetstore implementation of flink which will
> help onboard spark/mapreduce folks.
>
> I have prototypical code here that runs a simple job in memory,
> contributions welcome,
>
> right now there is a serialization error
> https://github.com/bigpetstore/bigpetstore-flink .
>
> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
> wrote:
>
>> Hi Juan,
>>
>> I think the recommendations in the Spark guide are quite good, and are
>> similar to what I would recommend for Flink as well.
>> Depending on the workloads you are interested to run, you can certainly
>> use Flink with less than 8 GB per machine. I think you can start Flink
>> TaskManagers with 500 MB of heap space and they'll still be able to process
>> some GB of data.
>>
>> Everything above 2 GB is probably good enough for some initial
>> experimentation (again depending on your workloads, network, disk speed
>> etc.)
>>
>>
>>
>>
>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> Flink is quite nimble with hardware requirements; people have run it in
>>> old-ish laptops and also the largest instances available in cloud
>>> providers. I will let others chime in with more details.
>>>
>>> I am not aware of something along the lines of a cheatsheet that you
>>> mention. If you actually try to do this, I would love to see it, and it
>>> might be useful to others as well. Both use similar abstractions at the API
>>> level (i.e., parallel collections), so if you stay true to the functional
>>> paradigm and not try to "abuse" the system by exploiting knowledge of its
>>> internals things should be straightforward. These apply to the batch APIs;
>>> the streaming API in Flink follows a true streaming paradigm, where you get
>>> an unbounded stream of records and operators on these streams.
>>>
>>> Funny that you ask about a video for the DataStream slides. There is a
>>> Flink training happening as we speak, and a video is being recorded right
>>> now :-) Hopefully it will be made available soon.
>>>
>>> Best,
>>> Kostas
>>>
>>>
>>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com> wrote:
>>>
 Answering to myself, I have found some nice training material at
 http://dataartisans.github.io/flink-training. There are even videos at
 youtube for some of the slides

   - http://dataartisans.github.io/flink-training/overview/intro.html
 https://www.youtube.com/watch?v=XgC6c4Wiqvs

   -
 http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
 https://www.youtube.com/watch?v=0EARqW15dDk

 The third lecture
 http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
 more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU
 but not exactly, and there are more lessons at
 http://dataartisans.github.io/flink-training, for stream processing
 and the table API for which I haven't found a video. Does anyone have
 pointers to the missing videos?

 Greetings,

 Juan

 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com>:

> Hi list,
>
> I'm new to Flink, and I find this project very interesting. I have
> experience with Apache Spark, and for I've seen so far I find that Flink
> provides an API at a similar abstraction level but based on single record
> processing instead of batch processing. I've read in Quora that Flink
> extends stream processing to batch processing, while Spark extends batch
> processing to streaming. Therefore I find Flink specially attractive for
> low latency stream processing. Anyway, I would appreciate if someone could
> give some indication about where I could find a list of hardware
> requirements for the slave nodes in a Flink cluster. Something along the
> lines of
> https://spark.apache.org/docs/latest/hardware-provisioning.html.
> Spark is known for having quite high minimal memory requirements (8GB RAM
> and 8 cores minimum), and I was wondering if it is also the case for 
> Flink.
> Lower memory requirements would be very interesting for building small
> Flink clusters for educational purposes, or for small projects.
>
> Apart from that, I wonder if there is some blog post by the comunity
> about transitioning from Spark to Flink. I think it could be interesting,
> as there are some similarities in the APIs, but also deep differences in
> the underlying approaches. I was thinking in something like Breeze's
> cheatsheet comparing its matrix operatations with those available in 
> Matlab
> and Numpy
> https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet,
> or like 

Re: Bug broadcasting objects (serialization issue)

2015-09-02 Thread Stephan Ewen
We should try to improve the exception here. More people will run into this
issue, and the exception should help them understand what went wrong.

How about we do eager serialization into a set of byte arrays? Then the
serializability issue comes immediately when the program is constructed,
rather than later, when it is shipped.
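
Roughly, the client could do something like this while the program is
constructed (an illustrative sketch, not the actual wrapper code):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Serialize the user-provided InputFormat eagerly; a failure now surfaces on
// the client, at program construction time.
byte[] serializedFormat;
try {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    ObjectOutputStream oos = new ObjectOutputStream(bos);
    oos.writeObject(inputFormat);
    oos.close();
    serializedFormat = bos.toByteArray();
} catch (IOException e) {
    throw new RuntimeException("The provided InputFormat is not serializable", e);
}
// later: ship serializedFormat with the job graph instead of the live object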

On Wed, Sep 2, 2015 at 12:56 PM, Maximilian Michels  wrote:

> Here's the JIRA issue: https://issues.apache.org/jira/browse/FLINK-2608
>
> On Wed, Sep 2, 2015 at 12:49 PM, Maximilian Michels 
> wrote:
>
>> Hi Andreas,
>>
>> Thank you for reporting the problem and including the code to reproduce
>> the problem. I think there is a problem with the class serialization or
>> deserialization. Arrays.asList uses a private ArrayList class
>> (java.util.Arrays$ArrayList) which is not the one you would normally use
>> (java.util.ArrayList).
>>
>> I'll create a JIRA issue to keep track of the problem and to investigate
>> further.
>>
>> Best regards,
>> Max
>>
>> Here's the stack trace:
>>
>> Exception in thread "main"
>> org.apache.flink.runtime.client.JobExecutionException: Cannot initialize
>> task 'DataSource (at main(Test.java:32)
>> (org.apache.flink.api.java.io.CollectionInputFormat))': Deserializing the
>> InputFormat ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
>> at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:523)
>> at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:507)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>> at org.apache.flink.runtime.jobmanager.JobManager.org
>> $apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:507)
>> at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:190)
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
>> at
>> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
>> at
>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:43)
>> at
>> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
>> at
>> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>> at
>> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:221)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
>> at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>> Caused by: java.lang.Exception: Deserializing the InputFormat
>> ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
>> at
>> org.apache.flink.runtime.jobgraph.InputFormatVertex.initializeOnMaster(InputFormatVertex.java:60)
>> at
>> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:520)
>> ... 25 more
>> Caused by: java.lang.IllegalStateException: unread block data
>> at
>> java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2424)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1383)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1993)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1918)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>> at
>> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:302)
>> at
>> org.apache.flink.util.InstantiationUtil.readObjectFromConfig(InstantiationUtil.java:264)
>> 

Re: Hardware requirements and learning resources

2015-09-02 Thread jay vyas
We're also working on a BigPetStore implementation for Flink which will help
onboard Spark/MapReduce folks.

I have prototypical code here that runs a simple job in memory; contributions
are welcome.

Right now there is a serialization error:
https://github.com/bigpetstore/bigpetstore-flink .

On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger  wrote:

> Hi Juan,
>
> I think the recommendations in the Spark guide are quite good, and are
> similar to what I would recommend for Flink as well.
> Depending on the workloads you are interested to run, you can certainly
> use Flink with less than 8 GB per machine. I think you can start Flink
> TaskManagers with 500 MB of heap space and they'll still be able to process
> some GB of data.
>
> Everything above 2 GB is probably good enough for some initial
> experimentation (again depending on your workloads, network, disk speed
> etc.)
>
>
>
>
> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
> wrote:
>
>> Hi Juan,
>>
>> Flink is quite nimble with hardware requirements; people have run it in
>> old-ish laptops and also the largest instances available in cloud
>> providers. I will let others chime in with more details.
>>
>> I am not aware of something along the lines of a cheatsheet that you
>> mention. If you actually try to do this, I would love to see it, and it
>> might be useful to others as well. Both use similar abstractions at the API
>> level (i.e., parallel collections), so if you stay true to the functional
>> paradigm and not try to "abuse" the system by exploiting knowledge of its
>> internals things should be straightforward. These apply to the batch APIs;
>> the streaming API in Flink follows a true streaming paradigm, where you get
>> an unbounded stream of records and operators on these streams.
>>
>> Funny that you ask about a video for the DataStream slides. There is a
>> Flink training happening as we speak, and a video is being recorded right
>> now :-) Hopefully it will be made available soon.
>>
>> Best,
>> Kostas
>>
>>
>> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
>> juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Answering to myself, I have found some nice training material at
>>> http://dataartisans.github.io/flink-training. There are even videos at
>>> youtube for some of the slides
>>>
>>>   - http://dataartisans.github.io/flink-training/overview/intro.html
>>> https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>>
>>>   -
>>> http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>>> https://www.youtube.com/watch?v=0EARqW15dDk
>>>
>>> The third lecture
>>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
>>> more or less corresponds to https://www.youtube.com/watch?v=1yWKZ26NQeU
>>> but not exactly, and there are more lessons at
>>> http://dataartisans.github.io/flink-training, for stream processing and
>>> the table API for which I haven't found a video. Does anyone have pointers
>>> to the missing videos?
>>>
>>> Greetings,
>>>
>>> Juan
>>>
>>> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
>>> juan.rodriguez.hort...@gmail.com>:
>>>
 Hi list,

 I'm new to Flink, and I find this project very interesting. I have
 experience with Apache Spark, and from what I've seen so far I find that Flink
 provides an API at a similar abstraction level but based on single-record
 processing instead of batch processing. I've read on Quora that Flink
 extends stream processing to batch processing, while Spark extends batch
 processing to streaming. Therefore I find Flink especially attractive for
 low-latency stream processing. Anyway, I would appreciate it if someone could
 give some indication about where I could find a list of hardware
 requirements for the slave nodes in a Flink cluster. Something along the
 lines of
 https://spark.apache.org/docs/latest/hardware-provisioning.html. Spark
 is known for having quite high minimal memory requirements (8GB RAM and 8
 cores minimum), and I was wondering if it is also the case for Flink. Lower
 memory requirements would be very interesting for building small Flink
 clusters for educational purposes, or for small projects.

 Apart from that, I wonder if there is some blog post by the community
 about transitioning from Spark to Flink. I think it could be interesting,
 as there are some similarities in the APIs, but also deep differences in
 the underlying approaches. I was thinking of something like Breeze's
 cheatsheet comparing its matrix operations with those available in Matlab
 and Numpy
 https://github.com/scalanlp/breeze/wiki/Linear-Algebra-Cheat-Sheet, or
 like http://rosettacode.org/wiki/Factorial. Just an idea anyway. Also,
 any pointer to some online course, book or training for Flink besides the
 official programming guides would be much appreciated

 Thanks in advance for help

 Greetings,

Re: Hardware requirements and learning resources

2015-09-02 Thread Robert Metzger
@Jay: I've looked into your code,  but I was not able to reproduce the
issue.
I'll start a new discussion thread on the user@flink list for the
Flink-BigPetStore discussion. I don't want to take over Juan's
hardware-requirements discussion ;)

On Wed, Sep 2, 2015 at 3:01 PM, Jay Vyas 
wrote:

> Just running the main class is sufficient
>
> On Sep 2, 2015, at 8:59 AM, Robert Metzger  wrote:
>
> Hey jay,
>
> How can I reproduce the error?
>
> On Wed, Sep 2, 2015 at 2:56 PM, jay vyas 
> wrote:
>
>> We're also working on a bigpetstore implementation of flink which will
>> help onboard spark/mapreduce folks.
>>
>> I have prototypical code here that runs a simple job in memory,
>> contributions welcome,
>>
>> right now there is a serialization error
>> https://github.com/bigpetstore/bigpetstore-flink .
>>
>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
>> wrote:
>>
>>> Hi Juan,
>>>
>>> I think the recommendations in the Spark guide are quite good, and are
>>> similar to what I would recommend for Flink as well.
>>> Depending on the workloads you are interested to run, you can certainly
>>> use Flink with less than 8 GB per machine. I think you can start Flink
>>> TaskManagers with 500 MB of heap space and they'll still be able to process
>>> some GB of data.
>>>
>>> Everything above 2 GB is probably good enough for some initial
>>> experimentation (again depending on your workloads, network, disk speed
>>> etc.)
>>>
>>>
>>>
>>>
>>> On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
>>> wrote:
>>>
 Hi Juan,

 Flink is quite nimble with hardware requirements; people have run it in
 old-ish laptops and also the largest instances available in cloud
 providers. I will let others chime in with more details.

 I am not aware of something along the lines of a cheatsheet that you
 mention. If you actually try to do this, I would love to see it, and it
 might be useful to others as well. Both use similar abstractions at the API
 level (i.e., parallel collections), so if you stay true to the functional
 paradigm and not try to "abuse" the system by exploiting knowledge of its
 internals things should be straightforward. These apply to the batch APIs;
 the streaming API in Flink follows a true streaming paradigm, where you get
 an unbounded stream of records and operators on these streams.

 Funny that you ask about a video for the DataStream slides. There is a
 Flink training happening as we speak, and a video is being recorded right
 now :-) Hopefully it will be made available soon.

 Best,
 Kostas


 On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
 juan.rodriguez.hort...@gmail.com> wrote:

> Answering to myself, I have found some nice training material at
> http://dataartisans.github.io/flink-training. There are even videos
> at youtube for some of the slides
>
>   - http://dataartisans.github.io/flink-training/overview/intro.html
> https://www.youtube.com/watch?v=XgC6c4Wiqvs
>
>   -
> http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
> https://www.youtube.com/watch?v=0EARqW15dDk
>
> The third lecture
> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
> more or less corresponds to
> https://www.youtube.com/watch?v=1yWKZ26NQeU but not exactly, and
> there are more lessons at http://dataartisans.github.io/flink-training,
> for stream processing and the table API for which I haven't found a
> video. Does anyone have pointers to the missing videos?
>
> Greetings,
>
> Juan
>
> 2015-09-02 12:50 GMT+02:00 Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com>:
>
>> Hi list,
>>
>> I'm new to Flink, and I find this project very interesting. I have
>> experience with Apache Spark, and from what I've seen so far I find that Flink
>> provides an API at a similar abstraction level but based on single-record
>> processing instead of batch processing. I've read on Quora that Flink
>> extends stream processing to batch processing, while Spark extends batch
>> processing to streaming. Therefore I find Flink especially attractive for
>> low-latency stream processing. Anyway, I would appreciate it if someone
>> could
>> give some indication about where I could find a list of hardware
>> requirements for the slave nodes in a Flink cluster. Something along the
>> lines of
>> https://spark.apache.org/docs/latest/hardware-provisioning.html.
>> Spark is known for having quite high minimal memory requirements (8GB RAM
>> and 8 cores minimum), and I was wondering if it is also the case for 
>> Flink.
>> Lower memory requirements would be very interesting for building small
>> Flink clusters 

Re: verbose console

2015-09-02 Thread Michele Bertoni
Totally, it is the first time I am using them and I thought they were the same.
I will try it ASAP.

thanks



On 2 Sep 2015, at 16:41, Till Rohrmann 
> wrote:


Can it be that you confuse the logback configuration file with the log4j 
configuration file? The log4j.properties file should look like

log4j.rootLogger=INFO, console

# Log all infos in the given file
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{HH:mm:ss,SSS} %-5p %-60c %x 
- %m%n

# suppress the irrelevant (wrong) warnings from the netty channel handler
log4j.logger.org.jboss.netty.channel.DefaultChannelPipeline=ERROR, console


Cheers,
Till


On Wed, Sep 2, 2015 at 4:33 PM, Michele Bertoni 
> wrote:
thanks for answering,
yes I did it, in fact it is working for my logger, but I don't know how to
limit the Flink-internal logger using the log4j properties

if I use this line



as suggested by the docs, I get an error in the log4j properties file saying the
level should not be in that position;
instead, doing something like this is accepted




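For what it's worth, limiting Flink's own logging is typically just one extra
logger line in the same log4j.properties that Till posted above; something along
these lines (standard log4j 1.x syntax, not taken from the original message):

# limit Flink's internal logging to warnings and errors
log4j.logger.org.apache.flink=WARN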

Re: Multiple restarts of Local Cluster

2015-09-02 Thread Sachin Goel
Yes. That will work too. However, then it isn't possible to shut down the
local cluster. [Is it necessary to do so or does it shut down automatically
when the program exits? I'm not entirely sure.]

-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685

On Wed, Sep 2, 2015 at 7:59 PM, Stephan Ewen  wrote:

> Have a look at some other tests, like the checkpointing tests. They start
> one cluster manually and keep it running. They connect against it using the
> remote environment ("localhost", miniCluster.getJobManagerRpcPort()).
>
> That works nicely...
>
> On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel 
> wrote:
>
>> Hi all
>> While using LocalEnvironment, in case the program triggers execution
>> several times, the {{LocalFlinkMiniCluster}} is started as many times. This
>> can consume a lot of time in setting up and tearing down the cluster.
>> Further, this interferes with a new functionality I'm working on based on
>> persisted results.
>> One potential solution could be to follow the methodology in
>> `MultipleProgramsTestBase`. The user code would then have to reside in a
>> function with a fixed name, instead of the main method. Or is that too cumbersome?
>>
>> Regards
>> Sachin
>> -- Sachin Goel
>> Computer Science, IIT Delhi
>> m. +91-9871457685
>>
>
>


Re: Bug broadcasting objects (serialization issue)

2015-09-02 Thread Stephan Ewen
I see.

Manual serialization implies also manual deserialization (on the workers
only), which would give a better exception.

BTW: There is an opportunity to fix two problems with one patch: The
framesize overflow for the input format, and the serialization.
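
A rough sketch of that "eager serialization" check, using Flink's
InstantiationUtil helper (method names and exact signatures may differ slightly
between versions; this is an outline, not the actual patch):

import java.io.IOException;

import org.apache.flink.util.InstantiationUtil;

public final class EagerSerializationCheck {

    /**
     * Round-trips the given object through Java serialization so that a
     * non-serializable user object fails right where the program is
     * constructed, instead of later when the job graph is shipped.
     */
    @SuppressWarnings("unchecked")
    public static <T> T roundTrip(T object) throws IOException, ClassNotFoundException {
        byte[] bytes = InstantiationUtil.serializeObject(object);
        return (T) InstantiationUtil.deserializeObject(bytes, object.getClass().getClassLoader());
    }
}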

On Wed, Sep 2, 2015 at 4:16 PM, Maximilian Michels  wrote:

> Ok but that would not prevent the above error, right? Serializing is
> not the issue here.
>
> Nevertheless, it would catch all errors during initial serialization.
> Deserializing has its own hazards due to possible Classloader issues.
>
> On Wed, Sep 2, 2015 at 4:05 PM, Stephan Ewen  wrote:
> > Yes, even serialize in the constructor. Then the failure (if serialization
> > does not work) comes immediately.
> >
> > On Wed, Sep 2, 2015 at 4:02 PM, Maximilian Michels 
> wrote:
> >>
> >> Nice suggestion. So you want to serialize and deserialize the InputFormats
> >> on the Client to check whether they can be transferred correctly? Merely
> >> serializing is not enough because the above Exception occurs during
> >> deserialization.
> >>
> >> On Wed, Sep 2, 2015 at 2:29 PM, Stephan Ewen  wrote:
> >>>
> >>> We should try to improve the exception here. More people will run into
> >>> this issue and the exception should help them understand it well.
> >>>
> >>> How about we do eager serialization into a set of byte arrays? Then the
> >>> serializability issue comes immediately when the program is constructed,
> >>> rather than later, when it is shipped.
> >>>
> >>> On Wed, Sep 2, 2015 at 12:56 PM, Maximilian Michels 
> >>> wrote:
> 
>  Here's the JIRA issue:
> https://issues.apache.org/jira/browse/FLINK-2608
> 
>  On Wed, Sep 2, 2015 at 12:49 PM, Maximilian Michels 
>  wrote:
> >
> > Hi Andres,
> >
> > Thank you for reporting the problem and including the code to reproduce
> > the problem. I think there is a problem with the class serialization or
> > deserialization. Arrays.asList uses a private ArrayList class
> > (java.util.Arrays$ArrayList) which is not the one you would normally use
> > (java.util.ArrayList).
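
A tiny standalone check of that difference (only to illustrate the two classes
involved; not part of the original report):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AsListClassCheck {
    public static void main(String[] args) {
        List<Integer> viaAsList = Arrays.asList(0, 0, 0);
        List<Integer> viaArrayList = new ArrayList<Integer>(Arrays.asList(0, 0, 0));

        // prints java.util.Arrays$ArrayList -- the private class mentioned above
        System.out.println(viaAsList.getClass().getName());
        // prints java.util.ArrayList -- the regular, public list implementation
        System.out.println(viaArrayList.getClass().getName());
    }
}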
> >
> > I'll create a JIRA issue to keep track of the problem and to
> > investigate further.
> >
> > Best regards,
> > Max
> >
> > Here's the stack trace:
> >
> > Exception in thread "main"
> > org.apache.flink.runtime.client.JobExecutionException: Cannot initialize
> > task 'DataSource (at main(Test.java:32)
> > (org.apache.flink.api.java.io.CollectionInputFormat))': Deserializing the
> > InputFormat ([mytests.Test$TestClass@4d6025c5]) failed: unread block data
> > at
> >
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:523)
> > at
> >
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$org$apache$flink$runtime$jobmanager$JobManager$$submitJob$4.apply(JobManager.scala:507)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> > at
> > scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> > at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> > at
> > org.apache.flink.runtime.jobmanager.JobManager.org
> $apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:507)
> > at
> >
> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:190)
> > at
> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
> > at
> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
> > at
> >
> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
> > at
> >
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:43)
> > at
> >
> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:29)
> > at
> > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
> > at
> >
> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:29)
> > at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
> > at
> >
> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:92)
> > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
> > at akka.actor.ActorCell.invoke(ActorCell.scala:487)
> > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
> > at akka.dispatch.Mailbox.run(Mailbox.scala:221)
> > at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
> > at
> > 

Re: verbose console

2015-09-02 Thread Michele Bertoni
thanks for answering,
yes I did it, in fact it is working for my logger, but I don't know how to
limit the Flink-internal logger using the log4j properties

if I use this line



as suggested by the docs, I get an error in the log4j properties file saying the
level should not be in that position;
instead, doing something like this is accepted





Re: Multiple restarts of Local Cluster

2015-09-02 Thread Stephan Ewen
Have a look at some other tests, like the checkpointing tests. They start
one cluster manually and keep it running. They connect against it using the
remote environment ("localhost", miniCluster.getJobManagerRpcPort()).

That works nicely...
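
For completeness, a rough Java sketch of that pattern (class names, constructor
arguments and the exact port getter vary between Flink versions, so treat this
as an outline rather than copy-paste code):

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.minicluster.LocalFlinkMiniCluster;

public class SharedMiniClusterSketch {

    public static void main(String[] args) throws Exception {
        // Start one local cluster for the whole run (in some versions the
        // cluster already starts inside its constructor).
        Configuration config = new Configuration();
        LocalFlinkMiniCluster cluster = new LocalFlinkMiniCluster(config, false);
        cluster.start();

        try {
            // Every execution connects to the same running cluster instead of
            // spinning up a new LocalFlinkMiniCluster per execute() call.
            ExecutionEnvironment env = ExecutionEnvironment.createRemoteEnvironment(
                    "localhost", cluster.getJobManagerRpcPort());

            env.fromElements(1, 2, 3).print();
            env.fromElements("a", "b", "c").print();   // second execution, same cluster
        } finally {
            cluster.stop();   // tear the cluster down once, at the very end
        }
    }
}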

On Wed, Sep 2, 2015 at 4:23 PM, Sachin Goel 
wrote:

> Hi all
> While using LocalEnvironment, in case the program triggers execution
> several times, the {{LocalFlinkMiniCluster}} is started as many times. This
> can consume a lot of time in setting up and tearing down the cluster.
> Further, this interferes with a new functionality I'm working on based on
> persisted results.
> One potential solution could be to follow the methodology in
> `MultipleProgramsTestBase`. The user code would then have to reside in a
> function with a fixed name, instead of the main method. Or is that too cumbersome?
>
> Regards
> Sachin
> -- Sachin Goel
> Computer Science, IIT Delhi
> m. +91-9871457685
>


Multiple restarts of Local Cluster

2015-09-02 Thread Sachin Goel
Hi all
While using LocalEnvironment, in case the program triggers execution
several times, the {{LocalFlinkMiniCluster}} is started as many times. This
can consume a lot of time in setting up and tearing down the cluster.
Further, this interferes with a new functionality I'm working on based on
persisted results.
One potential solution could be to follow the methodology in
`MultipleProgramsTestBase`. The user code would then have to reside in a
function with a fixed name, instead of the main method. Or is that too cumbersome?

Regards
Sachin
-- Sachin Goel
Computer Science, IIT Delhi
m. +91-9871457685


Re: Bigpetstore - Flink integration

2015-09-02 Thread jay vyas
Hmmm, interesting, it looks to be working magically now... :)  I must have written
some code late at night that magically fixed it and forgot.  The original
errors I was getting were Kryo-related.

The objects aren't being serialized on write to anything useful, but I'm sure
that's an easy fix.

Onward and upward !
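
As a reference for the "generate in the cluster, broadcast only the seeds"
pattern discussed in the quoted messages below, here is a rough sketch (all
class, field and variable names are made up for illustration, not taken from
the bigpetstore-flink code):

import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class SeededGeneratorSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Small seed data, created once on the client and broadcast to all nodes.
        List<String> stores = Arrays.asList("store-1", "store-2", "store-3");
        DataSet<String> storeSet = env.fromCollection(stores);

        // The heavy part runs in the cluster: a large id range is generated in
        // parallel, and every subtask turns its ids into transactions using
        // the broadcast seed data.
        DataSet<String> transactions = env.generateSequence(1, 1000000)
                .flatMap(new RichFlatMapFunction<Long, String>() {
                    private List<String> seedStores;

                    @Override
                    public void open(Configuration parameters) {
                        seedStores = getRuntimeContext().getBroadcastVariable("stores");
                    }

                    @Override
                    public void flatMap(Long id, Collector<String> out) {
                        String store = seedStores.get((int) (id % seedStores.size()));
                        out.collect("txn-" + id + "," + store);
                    }
                })
                .withBroadcastSet(storeSet, "stores");

        transactions.print();
    }
}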

On Wed, Sep 2, 2015 at 9:33 AM, Robert Metzger  wrote:

> Okay, I see.
>
> As I said before, I was not able to reproduce the serialization issue
> you've reported.
> Can you maybe post the exception you are seeing?
>
> On Wed, Sep 2, 2015 at 3:32 PM, jay vyas 
> wrote:
>
>> Hey, thanks!
>>
>> Those are just seeds, the files aren't large.
>>
>> The scale out data is the transactions.
>>
>> The seed data needs to be the same, shipped to ALL nodes, and then
>>
>> the nodes generate transactions.
>>
>>
>> On Wed, Sep 2, 2015 at 9:21 AM, Robert Metzger 
>> wrote:
>>
>>> I'm starting a new discussion thread for the bigpetstore-flink
>>> integration ...
>>>
>>>
>>> I took a closer look into the code you've posted.
>>> It seems to me that you are generating a lot of data locally on the
>>> client, before you actually submit a job to Flink. (Both "customers" and
>>> "stores" are generated locally)
>>> Is that only some "seed" data?
>>>
>>> I would actually try to generate as much data as possible in the
>>> cluster, making the generator very scalable.
>>>
>>> I don't think that you need to register a Kryo serializer for the
>>> Product and Transaction type.
>>> I was able to run the code without the serializer registration.
>>>
>>>
>>> -- Forwarded message --
>>> From: jay vyas 
>>> Date: Wed, Sep 2, 2015 at 2:56 PM
>>> Subject: Re: Hardware requirements and learning resources
>>> To: user@flink.apache.org
>>>
>>>
>>> We're also working on a bigpetstore implementation of flink which will
>>> help onboard spark/mapreduce folks.
>>>
>>> I have prototypical code here that runs a simple job in memory,
>>> contributions welcome,
>>>
>>> right now there is a serialization error
>>> https://github.com/bigpetstore/bigpetstore-flink .
>>>
>>> On Wed, Sep 2, 2015 at 8:50 AM, Robert Metzger 
>>> wrote:
>>>
 Hi Juan,

 I think the recommendations in the Spark guide are quite good, and are
 similar to what I would recommend for Flink as well.
 Depending on the workloads you are interested to run, you can certainly
 use Flink with less than 8 GB per machine. I think you can start Flink
 TaskManagers with 500 MB of heap space and they'll still be able to process
 some GB of data.

 Everything above 2 GB is probably good enough for some initial
 experimentation (again depending on your workloads, network, disk speed
 etc.)




 On Wed, Sep 2, 2015 at 2:30 PM, Kostas Tzoumas 
 wrote:

> Hi Juan,
>
> Flink is quite nimble with hardware requirements; people have run it
> in old-ish laptops and also the largest instances available in cloud
> providers. I will let others chime in with more details.
>
> I am not aware of something along the lines of a cheatsheet that you
> mention. If you actually try to do this, I would love to see it, and it
> might be useful to others as well. Both use similar abstractions at the 
> API
> level (i.e., parallel collections), so if you stay true to the functional
> paradigm and not try to "abuse" the system by exploiting knowledge of its
> internals things should be straightforward. These apply to the batch APIs;
> the streaming API in Flink follows a true streaming paradigm, where you 
> get
> an unbounded stream of records and operators on these streams.
>
> Funny that you ask about a video for the DataStream slides. There is a
> Flink training happening as we speak, and a video is being recorded right
> now :-) Hopefully it will be made available soon.
>
> Best,
> Kostas
>
>
> On Wed, Sep 2, 2015 at 1:13 PM, Juan Rodríguez Hortalá <
> juan.rodriguez.hort...@gmail.com> wrote:
>
>> Answering to myself, I have found some nice training material at
>> http://dataartisans.github.io/flink-training. There are even videos
>> at youtube for some of the slides
>>
>>   - http://dataartisans.github.io/flink-training/overview/intro.html
>> https://www.youtube.com/watch?v=XgC6c4Wiqvs
>>
>>   -
>> http://dataartisans.github.io/flink-training/dataSetBasics/intro.html
>> https://www.youtube.com/watch?v=0EARqW15dDk
>>
>> The third lecture
>> http://dataartisans.github.io/flink-training/dataSetAdvanced/intro.html
>> more or less corresponds to
>> https://www.youtube.com/watch?v=1yWKZ26NQeU but not exactly, and
>> there are more lessons at
>> http://dataartisans.github.io/flink-training, for stream