job manager timeout

2016-02-10 Thread Radu Tudoran
Hi,

I am running a program that works fine locally, but when I try to run it on the
cluster I get a timeout error from the client that tries to connect to the
jobmanager. There is no issue with contacting the jobmanager from the machine,
as it works just fine for other stream applications. I suspect that because the
stream topology is rather complex, there is an issue with deploying it. I am
not sure if this is normal behavior (IMHO, it should not fail just because the
topology is more complex). Hence, in case the error helps to identify the
underlying issue (if any), please see it below.
Meanwhile, can you please educate me on how I can configure the timeout so that
it won't fail anymore?
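
For reference, a minimal sketch of the flink-conf.yaml keys that usually govern
this client-side submission timeout; key names and defaults vary between Flink
versions (akka.client.timeout in particular may not exist in older releases),
so the values below are assumptions to check against the configuration docs:

    # conf/flink-conf.yaml (illustrative values only)
    akka.ask.timeout: 100 s       # futures and blocking calls inside the runtime
    akka.client.timeout: 600 s    # blocking calls made by the client, e.g. job submission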

Thanks



org.apache.flink.client.program.ProgramInvocationException: The program 
execution failed: Communication with JobManager failed: Job submission to the 
JobManager timed out.
at org.apache.flink.client.program.Client.runBlocking(Client.java:370)
at 
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:96)
at application.MainStreamApp.main(MainStreamApp.java:108)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:497)
at 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:395)
at org.apache.flink.client.program.Client.runBlocking(Client.java:252)
at 
org.apache.flink.client.CliFrontend.executeProgramBlocking(CliFrontend.java:676)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:326)
at 
org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:978)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1028)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Communication 
with JobManager failed: Job submission to the JobManager timed out.
at 
org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:140)
at org.apache.flink.client.program.Client.runBlocking(Client.java:368)
... 13 more
Caused by: 
org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job 
submission to the JobManager timed out.
at 
org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:255)
at 
org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:88)
at 
org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:68)
at 
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
at akka.dispatch.Mailbox.run(Mailbox.scala:221)
at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



Dr. Radu Tudoran
Research Engineer - Big Data Expert
IT R&D Division

HUAWEI TECHNOLOGIES Duesseldorf GmbH
European Research Center
Riesstrasse 25, 80992 München

E-mail: radu.tudo...@huawei.com
Mobile: +49 15209084330
Telephone: +49 891588344173


Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
I would try to do the outlier compuation with the DataSet API instead of
fetching the results to the client with collect().
If you do that, you can directly use writeAsCsv because the result is still
a DataSet.

What you have to do, is to translate your findOutliers method into DataSet
API code.

Best, Fabian
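
In case it helps, a rough, untested sketch of that translation, assuming the
Point and Centroid classes of the KMeans example (Centroid exposing a public
id field) and the clusteredPoints, centroids, outputPath and env variables from
the code quoted below; the join replaces the two nested loops and the
group-reduce computes the per-cluster mean and standard deviation:

// imports assumed at the top of the example class:
// org.apache.flink.api.common.functions.GroupReduceFunction
// org.apache.flink.api.common.functions.JoinFunction
// org.apache.flink.api.java.DataSet
// org.apache.flink.api.java.tuple.Tuple2 / Tuple3
// org.apache.flink.util.Collector
// java.util.ArrayList / java.util.List

// distance of every point to the centroid of its own cluster
DataSet<Tuple3<Integer, Point, Double>> distances = clusteredPoints
    .join(centroids).where(0).equalTo("id")
    .with(new JoinFunction<Tuple2<Integer, Point>, Centroid, Tuple3<Integer, Point, Double>>() {
        @Override
        public Tuple3<Integer, Point, Double> join(Tuple2<Integer, Point> p, Centroid c) {
            return new Tuple3<Integer, Point, Double>(p.f0, p.f1, p.f1.euclideanDistance(c));
        }
    });

// per cluster: mean and standard deviation of the distances,
// then flag every point outside mean +/- 2 * sd as an outlier
DataSet<Tuple3<Integer, Point, Boolean>> flagged = distances
    .groupBy(0)
    .reduceGroup(new GroupReduceFunction<Tuple3<Integer, Point, Double>, Tuple3<Integer, Point, Boolean>>() {
        @Override
        public void reduce(Iterable<Tuple3<Integer, Point, Double>> values,
                           Collector<Tuple3<Integer, Point, Boolean>> out) {
            List<Tuple3<Integer, Point, Double>> buffer = new ArrayList<Tuple3<Integer, Point, Double>>();
            double sum = 0;
            for (Tuple3<Integer, Point, Double> v : values) {
                buffer.add(v);
                sum += v.f2;
            }
            double mean = sum / buffer.size();
            double sq = 0;
            for (Tuple3<Integer, Point, Double> v : buffer) {
                sq += (v.f2 - mean) * (v.f2 - mean);
            }
            double sd = Math.sqrt(sq / buffer.size());
            for (Tuple3<Integer, Point, Double> v : buffer) {
                boolean outlier = v.f2 < mean - 2 * sd || v.f2 > mean + 2 * sd;
                out.collect(new Tuple3<Integer, Point, Boolean>(v.f0, v.f1, outlier));
            }
        }
    });

// the result is still a DataSet, so the lazy CSV sink plus execute() works as usual
flagged.writeAsCsv(outputPath, "\n", " ");
env.execute("KMeans outlier detection");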

2016-02-10 18:29 GMT+01:00 subash basnet :

> Hello Fabian,
>
> As written before code:
>
>
>
> *DataSet fElements =
> env.fromCollection(findOutliers(clusteredPoints,
> finalCentroids));fElements.writeAsCsv(outputPath, "\n", "
> ");env.execute("KMeans Example");*
> I am very new to flink, so I am not so clear about what you suggested: by
> option (1) you meant that I write my own FileWriter here rather than using
> the *writeAsCsv()* method, and with option (2) I couldn't understand where to
> compute the outliers. I would want to use the *writeAsCsv()* method, but
> currently it doesn't perform the write operation and I am unable to
> understand why.
>
> An interesting thing I found is, when I run the *outlierDetection* class
> from eclipse a single file *result* gets written within the kmeans
> folder, whereas in case of default *KMeans* class it writes a result
> folder within the kmeans folder and the files with points are written
> inside the result folder.
> I give the necessary path in the arguments while running.
> Eg: file:///home/softwares/flink-0.10.0/kmeans/points
> file:///home/softwares/flink-0.10.0/kmeans/centers
> file:///home/softwares/flink-0.10.0/kmeans/result 10
>
> Now, after I create the runnable jar files for the KMeans and outlierDetection
> classes, when I upload them to the *flink web submission client* it works fine
> for *KMeans.jar*: the folder and files get created. But in case of
> *outlierDetection.jar* no file or folder gets written inside kmeans.
>
> How is it that the outlier class is able to write the file via eclipse, but
> the outlier jar is not able to write via the flink web submission client?
>
>
> Best Regards,
> Subash Basnet
>
> On Wed, Feb 10, 2016 at 1:58 PM, Fabian Hueske  wrote:
>
>> Hi Subash,
>>
>> I would not fetch the data to the client, do the computation there, and
>> send it back, just for the purpose of writing it to a file.
>>
>> Either 1) pull the results to the client and write the file from there or
>> 2) compute the outliers in the cluster.
>> I did not study your code completely, but the two nested loops and the
>> condition are a join for example.
>>
>> I would go for option 2, if possible.
>>
>> Best, Fabian
>>
>>
>> 2016-02-10 13:07 GMT+01:00 subash basnet :
>>
>>> Hello Fabian,
>>>
>>> I use the collect() method to get the elements locally and perform
>>> operations on that and return the result as a collection. The collection
>>> result is converted to the DataSet in the calling method.
>>> Below is the code of *findOutliers *method:
>>>
>>> public static List<Tuple3<Integer, Point, Boolean>> findOutliers(DataSet<Tuple2<Integer, Point>> clusteredPoints,
>>> DataSet<Centroid> centroids) throws Exception {
>>> List<Tuple3<Integer, Point, Boolean>> finalElements = new ArrayList<Tuple3<Integer, Point, Boolean>>();
>>> List<Tuple2<Integer, Point>> elements = clusteredPoints.collect();
>>> List<Centroid> centroidList = centroids.collect();
>>> List<Tuple3<Centroid, Tuple2<Integer, Point>, Double>> elementsWithDistance = new ArrayList<Tuple3<Centroid, Tuple2<Integer, Point>, Double>>();
>>> for (Centroid centroid : centroidList) {
>>> elementsWithDistance = new ArrayList<Tuple3<Centroid, Tuple2<Integer, Point>, Double>>();
>>> double totalDistance = 0;
>>> int elementsCount = 0;
>>> for (Tuple2<Integer, Point> e : elements) {
>>> // compute distance
>>> if (e.f0 == centroid.id) {
>>> Tuple3<Centroid, Tuple2<Integer, Point>, Double> newElement = new Tuple3<Centroid, Tuple2<Integer, Point>, Double>();
>>> double distance = e.f1.euclideanDistance(centroid);
>>> totalDistance += distance;
>>> newElement.setFields(centroid, e, distance);
>>> elementsWithDistance.add(newElement);
>>> elementsCount++;
>>> }
>>> }
>>> // finding mean
>>> double mean = totalDistance / elementsCount;
>>> double sdTotalDistanceSquare = 0;
>>> for (Tuple3<Centroid, Tuple2<Integer, Point>, Double> elementWithDistance : elementsWithDistance) {
>>> double distanceSquare = Math.pow(mean - elementWithDistance.f2, 2);
>>> sdTotalDistanceSquare += distanceSquare;
>>> }
>>> double sd = Math.sqrt(sdTotalDistanceSquare / elementsCount);
>>> double upperlimit = mean + 2 * sd;
>>> double lowerlimit = mean - 2 * sd;
>>> Tuple3<Integer, Point, Boolean> newElement = new Tuple3<Integer, Point, Boolean>(); // true = outlier
>>> for (Tuple3<Centroid, Tuple2<Integer, Point>, Double> elementWithDistance : elementsWithDistance) {
>>> newElement = new Tuple3<Integer, Point, Boolean>();
>>> if (elementWithDistance.f2 < lowerlimit || elementWithDistance.f2 > upperlimit) {
>>> // set as outlier
>>> newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1, true);
>>> } else {
>>> newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1, false);
>>> }
>>> finalElements.add(newElement);
>>> }
>>> }
>>> return finalElements;
>>> }
>>>
>>> I have attached herewith the screenshot of my project structure and
>>> KMeansOutlierDetection.java file for more clarity.
>>>
>>>
>>> Best Regards,
>>> Subash Basnet
>>>
>>> On Wed, Feb 10, 2016 at 12:26 PM, Fabian Hueske 
>>> wrote:
>>>

Re: How to convert List to flink DataSet

2016-02-10 Thread subash basnet
Hello Fabian,

As written before code:



*DataSet fElements =
env.fromCollection(findOutliers(clusteredPoints,
finalCentroids));fElements.writeAsCsv(outputPath, "\n", "
");env.execute("KMeans Example");*
I am very new to flink, so I am not so clear about what you suggested: by
option (1) you meant that I write my own FileWriter here rather than using the
*writeAsCsv()* method, and with option (2) I couldn't understand where to compute
the outliers. I would want to use the *writeAsCsv()* method, but currently it
doesn't perform the write operation and I am unable to understand why.

An interesting thing I found is, when I run the *outlierDetection* class
from eclipse a single file *result* gets written within the kmeans folder,
whereas in case of default *KMeans* class it writes a result folder within
the kmeans folder and the files with points are written inside the result
folder.
I give the necessary path in the arguments while running.
Eg: file:///home/softwares/flink-0.10.0/kmeans/points
file:///home/softwares/flink-0.10.0/kmeans/centers
file:///home/softwares/flink-0.10.0/kmeans/result 10

Now, after I create the runnable jar files for the KMeans and outlierDetection
classes, when I upload them to the *flink web submission client* it works fine
for *KMeans.jar*: the folder and files get created. But in case of
*outlierDetection.jar* no file or folder gets written inside kmeans.

How is it that the outlier class is able to write the file via eclipse, but the
outlier jar is not able to write via the flink web submission client?


Best Regards,
Subash Basnet

On Wed, Feb 10, 2016 at 1:58 PM, Fabian Hueske  wrote:

> Hi Subash,
>
> I would not fetch the data to the client, do the computation there, and
> send it back, just for the purpose of writing it to a file.
>
> Either 1) pull the results to the client and write the file from there or
> 2) compute the outliers in the cluster.
> I did not study your code completely, but the two nested loops and the
> condition are a join for example.
>
> I would go for option 2, if possible.
>
> Best, Fabian
>
>
> 2016-02-10 13:07 GMT+01:00 subash basnet :
>
>> Hello Fabian,
>>
>> I use the collect() method to get the elements locally and perform
>> operations on that and return the result as a collection. The collection
>> result is converted to the DataSet in the calling method.
>> Below is the code of *findOutliers *method:
>>
>> public static List<Tuple3<Integer, Point, Boolean>> findOutliers(DataSet<Tuple2<Integer, Point>> clusteredPoints,
>> DataSet<Centroid> centroids) throws Exception {
>> List<Tuple3<Integer, Point, Boolean>> finalElements = new ArrayList<Tuple3<Integer, Point, Boolean>>();
>> List<Tuple2<Integer, Point>> elements = clusteredPoints.collect();
>> List<Centroid> centroidList = centroids.collect();
>> List<Tuple3<Centroid, Tuple2<Integer, Point>, Double>> elementsWithDistance = new ArrayList<Tuple3<Centroid, Tuple2<Integer, Point>, Double>>();
>> for (Centroid centroid : centroidList) {
>> elementsWithDistance = new ArrayList<Tuple3<Centroid, Tuple2<Integer, Point>, Double>>();
>> double totalDistance = 0;
>> int elementsCount = 0;
>> for (Tuple2<Integer, Point> e : elements) {
>> // compute distance
>> if (e.f0 == centroid.id) {
>> Tuple3<Centroid, Tuple2<Integer, Point>, Double> newElement = new Tuple3<Centroid, Tuple2<Integer, Point>, Double>();
>> double distance = e.f1.euclideanDistance(centroid);
>> totalDistance += distance;
>> newElement.setFields(centroid, e, distance);
>> elementsWithDistance.add(newElement);
>> elementsCount++;
>> }
>> }
>> // finding mean
>> double mean = totalDistance / elementsCount;
>> double sdTotalDistanceSquare = 0;
>> for (Tuple3<Centroid, Tuple2<Integer, Point>, Double> elementWithDistance : elementsWithDistance) {
>> double distanceSquare = Math.pow(mean - elementWithDistance.f2, 2);
>> sdTotalDistanceSquare += distanceSquare;
>> }
>> double sd = Math.sqrt(sdTotalDistanceSquare / elementsCount);
>> double upperlimit = mean + 2 * sd;
>> double lowerlimit = mean - 2 * sd;
>> Tuple3<Integer, Point, Boolean> newElement = new Tuple3<Integer, Point, Boolean>(); // true = outlier
>> for (Tuple3<Centroid, Tuple2<Integer, Point>, Double> elementWithDistance : elementsWithDistance) {
>> newElement = new Tuple3<Integer, Point, Boolean>();
>> if (elementWithDistance.f2 < lowerlimit || elementWithDistance.f2 > upperlimit) {
>> // set as outlier
>> newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1, true);
>> } else {
>> newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1, false);
>> }
>> finalElements.add(newElement);
>> }
>> }
>> return finalElements;
>> }
>>
>> I have attached herewith the screenshot of my project structure and
>> KMeansOutlierDetection.java file for more clarity.
>>
>>
>> Best Regards,
>> Subash Basnet
>>
>> On Wed, Feb 10, 2016 at 12:26 PM, Fabian Hueske 
>> wrote:
>>

Re: Flink 1.0-SNAPSHOT scala 2.11 in S3 has scala 2.10

2016-02-10 Thread Stephan Ewen
We discovered yesterday that the snapshot builds had not been updated in a
while (because the build server experienced timeouts).
Hence the SNAPSHOT build may have been quite stale.

It is updating frequently again now; that's probably why you find a correct
build today...



On Wed, Feb 10, 2016 at 5:31 PM, David Kim 
wrote:

> Hi Chiwan, Max,
>
> Thanks for checking. I also downloaded it now and verified the 2.10 jar is
> gone :)
>
> A new build must have overwritten yesterday's and corrected itself.
>
> flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz
> 2016-02-10T15:55:33.000Z
>
> Thanks!
> David
>
> On Wed, Feb 10, 2016 at 4:44 AM, Maximilian Michels 
> wrote:
>
>> Hi David,
>>
>> Just had a check as well. Can't find a 2.10 Jar in the lib folder.
>>
>> Cheers,
>> Max
>>
>> On Wed, Feb 10, 2016 at 6:17 AM, Chiwan Park 
>> wrote:
>> > Hi David,
>> >
>> > I just downloaded the "flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz” but
>> there is no jar compiled with Scala 2.10. Could you check again?
>> >
>> > Regards,
>> > Chiwan Park
>> >
>> >> On Feb 10, 2016, at 2:59 AM, David Kim <
>> david@braintreepayments.com> wrote:
>> >>
>> >> Hello,
>> >>
>> >> I noticed that the flink binary for scala 2.11 located at
>> http://stratosphere-bin.s3.amazonaws.com/flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz
>> contains the scala 2.10 flavor.
>> >>
>> >> If you open the lib folder the name of the jar in lib is
>> flink-dist_2.10-1.0-SNAPSHOT.jar.
>> >>
>> >> Could this be an error in the process that updates these files in S3?
>> >>
>> >> We're using that download link following the suggestions here:
>> https://flink.apache.org/contribute-code.html#snapshots-nightly-builds.
>> If there's a better place let us know as well!
>> >>
>> >> Thanks,
>> >> David
>> >
>>
>
>
>
>


Re: Flink 1.0-SNAPSHOT scala 2.11 in S3 has scala 2.10

2016-02-10 Thread David Kim
Hi Chiwan, Max,

Thanks for checking. I also downloaded it now and verified the 2.10 jar is
gone :)

A new build must have overwritten yesterday's and corrected itself.

flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz
2016-02-10T15:55:33.000Z

Thanks!
David

On Wed, Feb 10, 2016 at 4:44 AM, Maximilian Michels  wrote:

> Hi David,
>
> Just had a check as well. Can't find a 2.10 Jar in the lib folder.
>
> Cheers,
> Max
>
> On Wed, Feb 10, 2016 at 6:17 AM, Chiwan Park 
> wrote:
> > Hi David,
> >
> > I just downloaded the "flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz” but
> there is no jar compiled with Scala 2.10. Could you check again?
> >
> > Regards,
> > Chiwan Park
> >
> >> On Feb 10, 2016, at 2:59 AM, David Kim 
> wrote:
> >>
> >> Hello,
> >>
> >> I noticed that the flink binary for scala 2.11 located at
> http://stratosphere-bin.s3.amazonaws.com/flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz
> contains the scala 2.10 flavor.
> >>
> >> If you open the lib folder the name of the jar in lib is
> flink-dist_2.10-1.0-SNAPSHOT.jar.
> >>
> >> Could this be an error in the process that updates these files in S3?
> >>
> >> We're using that download link following the suggestions here:
> https://flink.apache.org/contribute-code.html#snapshots-nightly-builds.
> If there's a better place let us know as well!
> >>
> >> Thanks,
> >> David
> >
>





Compilation error while instancing FlinkKafkaConsumer082

2016-02-10 Thread Simone Robutti
Hello,

the compiler has been raising an error since I added this line to the code

val testData = streamEnv.addSource(
  new FlinkKafkaConsumer082[String]("data-input", new SimpleStringSchema(), kafkaProp))

Here is the error:

Error:scalac: Class
org.apache.flink.streaming.api.checkpoint.CheckpointNotifier not found -
continuing with a stub.
Error:scalac: Class
org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema
not found - continuing with a stub.
Warning:scalac: Class
org.apache.flink.streaming.util.serialization.KeyedDeserializationSchema
not found - continuing with a stub.
Warning:scalac: Class
org.apache.flink.streaming.api.checkpoint.CheckpointNotifier not found -
continuing with a stub.

I've been using the 1.0-SNAPSHOT version built with Scala 2.11. Any
suggestions?
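
Those 'not found - continuing with a stub' errors are typically what scalac
reports when the Kafka connector jar was built against a different Flink
version than the flink-streaming dependency on the classpath (the missing
classes were moved or renamed between releases), so the likely fix is to keep
every Flink artifact on the same version and the same Scala suffix. A hedged
sketch, mirroring the working pom from the 'Simple Flink - Kafka Test' thread
further down; for a Scala program the analogous flink-streaming-scala_2.11
artifact would take the place of flink-streaming-java_2.11, and with a
1.0-SNAPSHOT core the connector must be the matching 1.0-SNAPSHOT build as well:

<!-- illustrative only: one version, one Scala suffix for all Flink artifacts -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>0.10.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>0.10.1</version>
</dependency>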


Re: TaskManager unable to register with JobManager

2016-02-10 Thread Ravinder Kaur
Hello Fabian,

Thank you very much for the resource. I had already gone through this and
found port '6123' as the default for taskmanager registration. But I want to
know the specific range of ports the taskmanager accesses during job
execution.

The taskmanager always tries to access a random port during job execution,
for which I need to open the firewall using 'ufw allow port' during the
execution; otherwise the job hangs and finally fails. So I wanted to know a
particular range of ports which I can specify in the iptables to always
allow access.
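
For reference, these are the flink-conf.yaml keys that usually control those
ports; whether a key exists and whether it accepts a port range depends on the
Flink version (some of this only arrived around 1.0), so treat the extract
below as an assumption to verify against the configuration page linked in this
thread:

    # conf/flink-conf.yaml (illustrative values only)
    jobmanager.rpc.port: 6123         # JobManager actor system (fixed)
    taskmanager.rpc.port: 50100       # TaskManager actor system; 0 means a random free port
    taskmanager.data.port: 50101      # data exchange between TaskManagers; 0 means random
    blob.server.port: 50200-50300     # BLOB server; newer versions accept a range like this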


Kind Regards,
Ravinder Kaur

On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske  wrote:

> Hi Ravinder,
>
> please have a look at the configuration documentation:
>
> -->
> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager
>
> Best, Fabian
>
> 2016-02-10 13:55 GMT+01:00 Ravinder Kaur :
>
>> Hello All,
>>
>> I need to know the range of ports that are being used during the
>> master/slave communication in the Flink cluster. Also is there a way I can
>> specify a range of ports, at the slaves, to restrict them to connect to
>> master only in this range?
>>
>> Kind Regards,
>> Ravinder Kaur
>>
>>
>> On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen  wrote:
>>
>>> Can machines connect to port 6123? The firewall may block that port, but
>>> permit SSH.
>>>
>>> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur 
>>> wrote:
>>>
 Hello,

 Here is the log file of the Jobmanager. I did not see anything suspicious,
 and as it suggests, the ports are also listening.

 20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager on IP-of-master:6123 with execution mode
 CLUSTER and streaming mode BATCH_ONLY
 20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Security is not enabled. Starting non-authenticated JobManager.
 20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager
 20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager actor system at 10.155.208.138:6123
 20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger
  - Slf4jLogger started
 20:58:48,295 INFO  Remoting
  - Starting remoting
 20:58:48,541 INFO  Remoting
  - Remoting started; listening on addresses
 :[akka.tcp://flink@IP-of-master:6123]
 20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManger web frontend
 20:58:48,690 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using
 directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web
 interface files
 20:58:48,691 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving
 job manager log from
 /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
 20:58:48,691 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving
 job manager stdout from
 /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
 20:58:49,044 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web
 frontend listening at 0:0:0:0:0:0:0:0:8081
 20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager actor
 20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer
  - Created BLOB server storage directory
 /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
 20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer
  - Started BLOB server at 0.0.0.0:43683 - max concurrent
 requests: 50 - max backlog: 1000
 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
   - Started memory archivist akka://flink/user/archive
 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager at akka.tcp://flink@IP-of-master
 :6123/user/jobmanager.
 20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager
 was granted leadership with leader session ID None.
 20:58:49,082 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting
 with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on
 port 8081
 20:58:49,083 INFO
  org.apache.flink.runtime.webmonitor.JobManagerRetriever   - New leader
 reachable under akka.tcp://flink@IP-of-master
 :6123/user/jobmanager:null.
 20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job
 at Wed Feb 03 20:59:22 CET 2016).
 20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Scheduling job 72733d69588678ec224003ab

Re: Simple Flink - Kafka Test

2016-02-10 Thread Stephan Ewen
Yes, 0.10.x does not always have Scala version suffixes.

1.0 is doing this consistently, should cause less confusion...

On Wed, Feb 10, 2016 at 2:38 PM, shotte  wrote:

> Ok It is working now
>
> I had to change a few dependencies to use the _2.11 suffix
>
> Thanks
>
> Sylvain
>
>
> 
> 
> <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-java</artifactId>
>     <version>${flink.version}</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-clients_2.11</artifactId>
>     <version>${flink.version}</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-connector-kafka_2.11</artifactId>
>     <version>0.10.1</version>
> </dependency>
> <dependency>
>     <groupId>org.apache.flink</groupId>
>     <artifactId>flink-streaming-java_2.11</artifactId>
>     <version>0.10.1</version>
> </dependency>
> 
> 
>
>
>
>
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Simple-Flink-Kafka-Test-tp4828p4852.html
> Sent from the Apache Flink User Mailing List archive. mailing list archive
> at Nabble.com.
>


Re: TaskManager unable to register with JobManager

2016-02-10 Thread Stephan Ewen
Note that some of these config options are only available starting from
version 1.0-SNAPSHOT

On Wed, Feb 10, 2016 at 2:16 PM, Fabian Hueske  wrote:

> Hi Ravinder,
>
> please have a look at the configuration documentation:
>
> -->
> https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager
>
> Best, Fabian
>
> 2016-02-10 13:55 GMT+01:00 Ravinder Kaur :
>
>> Hello All,
>>
>> I need to know the range of ports that are being used during the
>> master/slave communication in the Flink cluster. Also is there a way I can
>> specify a range of ports, at the slaves, to restrict them to connect to
>> master only in this range?
>>
>> Kind Regards,
>> Ravinder Kaur
>>
>>
>> On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen  wrote:
>>
>>> Can machines connect to port 6123? The firewall may block that port, but
>>> permit SSH.
>>>
>>> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur 
>>> wrote:
>>>
 Hello,

 Here is the log file of the Jobmanager. I did not see anything suspicious,
 and as it suggests, the ports are also listening.

 20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager on IP-of-master:6123 with execution mode
 CLUSTER and streaming mode BATCH_ONLY
 20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Security is not enabled. Starting non-authenticated JobManager.
 20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager
 20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager actor system at 10.155.208.138:6123
 20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger
  - Slf4jLogger started
 20:58:48,295 INFO  Remoting
  - Starting remoting
 20:58:48,541 INFO  Remoting
  - Remoting started; listening on addresses
 :[akka.tcp://flink@IP-of-master:6123]
 20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManger web frontend
 20:58:48,690 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Using
 directory /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web
 interface files
 20:58:48,691 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving
 job manager log from
 /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
 20:58:48,691 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Serving
 job manager stdout from
 /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
 20:58:49,044 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Web
 frontend listening at 0:0:0:0:0:0:0:0:8081
 20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager actor
 20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer
  - Created BLOB server storage directory
 /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
 20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer
  - Started BLOB server at 0.0.0.0:43683 - max concurrent
 requests: 50 - max backlog: 1000
 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
   - Started memory archivist akka://flink/user/archive
 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Starting JobManager at akka.tcp://flink@IP-of-master
 :6123/user/jobmanager.
 20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager
 was granted leadership with leader session ID None.
 20:58:49,082 INFO
  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting
 with JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager on
 port 8081
 20:58:49,083 INFO
  org.apache.flink.runtime.webmonitor.JobManagerRetriever   - New leader
 reachable under akka.tcp://flink@IP-of-master
 :6123/user/jobmanager:null.
 20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job
 at Wed Feb 03 20:59:22 CET 2016).
 20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job
 at Wed Feb 03 20:59:22 CET 2016).
 20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager
  - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
 at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
 20:59:22,859 INFO
  org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN
 DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
 (org.apache.flink.api.java.io.CollectionInputF

Re: Simple Flink - Kafka Test

2016-02-10 Thread shotte
Ok It is working now 

I had to change a few dependencies to use the _2.11 suffix

Thanks

Sylvain




<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>0.10.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>0.10.1</version>
</dependency>






--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Simple-Flink-Kafka-Test-tp4828p4852.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: TaskManager unable to register with JobManager

2016-02-10 Thread Fabian Hueske
Hi Ravinder,

please have a look at the configuration documentation:

-->
https://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#jobmanager-amp-taskmanager

Best, Fabian

2016-02-10 13:55 GMT+01:00 Ravinder Kaur :

> Hello All,
>
> I need to know the range of ports that are being used during the
> master/slave communication in the Flink cluster. Also is there a way I can
> specify a range of ports, at the slaves, to restrict them to connect to
> master only in this range?
>
> Kind Regards,
> Ravinder Kaur
>
>
> On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen  wrote:
>
>> Can machines connect to port 6123? The firewall may block that port, but
>> permit SSH.
>>
>> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur 
>> wrote:
>>
>>> Hello,
>>>
>>> Here is the log file of the Jobmanager. I did not see anything suspicious,
>>> and as it suggests, the ports are also listening.
>>>
>>> 20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Starting JobManager on IP-of-master:6123 with execution mode
>>> CLUSTER and streaming mode BATCH_ONLY
>>> 20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Security is not enabled. Starting non-authenticated JobManager.
>>> 20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Starting JobManager
>>> 20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Starting JobManager actor system at 10.155.208.138:6123
>>> 20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger
>>>  - Slf4jLogger started
>>> 20:58:48,295 INFO  Remoting
>>>  - Starting remoting
>>> 20:58:48,541 INFO  Remoting
>>>  - Remoting started; listening on addresses
>>> :[akka.tcp://flink@IP-of-master:6123]
>>> 20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Starting JobManger web frontend
>>> 20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>> - Using directory
>>> /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface
>>> files
>>> 20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>> - Serving job manager log from
>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
>>> 20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>> - Serving job manager stdout from
>>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
>>> 20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>> - Web frontend listening at 0:0:0:0:0:0:0:0:8081
>>> 20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Starting JobManager actor
>>> 20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer
>>>  - Created BLOB server storage directory
>>> /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
>>> 20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer
>>>  - Started BLOB server at 0.0.0.0:43683 - max concurrent
>>> requests: 50 - max backlog: 1000
>>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
>>> - Started memory archivist akka://flink/user/archive
>>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Starting JobManager at akka.tcp://flink@IP-of-master
>>> :6123/user/jobmanager.
>>> 20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager
>>> was granted leadership with leader session ID None.
>>> 20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>>> - Starting with JobManager 
>>> akka.tcp://flink@IP-of-master:6123/user/jobmanager
>>> on port 8081
>>> 20:58:49,083 INFO
>>>  org.apache.flink.runtime.webmonitor.JobManagerRetriever   - New leader
>>> reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
>>> 20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>> at Wed Feb 03 20:59:22 CET 2016).
>>> 20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>> at Wed Feb 03 20:59:22 CET 2016).
>>> 20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>>  - Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job
>>> at Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
>>> 20:59:22,859 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph- CHAIN
>>> DataSource (at getDefaultTextLineDataSet(WordCountData.java:70)
>>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>>> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
>>> 20:59:22,881 INFO
>>>  org.apache.flink.runtime.executiongraph.ExecutionGraph- CH

Re: Simple Flink - Kafka Test

2016-02-10 Thread shotte
Hi,

Thanks for your reply, but I am still a bit confused.

I have downloaded  flink-0.10.1-bin-hadoop27-scala_2.11.tgz
and kafka_2.11-0.9.0.0.tgz

I did not install myself Scala

Now tell me if I understand correctly.
Depending on the version of Flink I have (in my case the Scala 2.11 build) I
must specify the dependencies accordingly in the pom file.

But I still have the same error.

Here's an extract of my file



<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>0.10.1</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>0.10.1</version>
</dependency>





--
View this message in context: 
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Simple-Flink-Kafka-Test-tp4828p4850.html
Sent from the Apache Flink User Mailing List archive. mailing list archive at 
Nabble.com.


Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
Hi Subash,

I would not fetch the data to the client, do the computation there, and
send it back, just for the purpose of writing it to a file.

Either 1) pull the results to the client and write the file from there or
2) compute the outliers in the cluster.
I did not study your code completely, but the two nested loops and the
condition are a join for example.

I would go for option 2, if possible.

Best, Fabian


2016-02-10 13:07 GMT+01:00 subash basnet :

> Hello Fabian,
>
> I use the collect() method to get the elements locally and perform
> operations on that and return the result as a collection. The collection
> result is converted to the DataSet in the calling method.
> Below is the code of *findOutliers *method:
>
> public static List<Tuple3<Integer, Point, Boolean>> findOutliers(DataSet<Tuple2<Integer, Point>> clusteredPoints,
> DataSet<Centroid> centroids) throws Exception {
> List<Tuple3<Integer, Point, Boolean>> finalElements = new ArrayList<Tuple3<Integer, Point, Boolean>>();
> List<Tuple2<Integer, Point>> elements = clusteredPoints.collect();
> List<Centroid> centroidList = centroids.collect();
> List<Tuple3<Centroid, Tuple2<Integer, Point>, Double>> elementsWithDistance = new ArrayList<Tuple3<Centroid, Tuple2<Integer, Point>, Double>>();
> for (Centroid centroid : centroidList) {
> elementsWithDistance = new ArrayList<Tuple3<Centroid, Tuple2<Integer, Point>, Double>>();
> double totalDistance = 0;
> int elementsCount = 0;
> for (Tuple2<Integer, Point> e : elements) {
> // compute distance
> if (e.f0 == centroid.id) {
> Tuple3<Centroid, Tuple2<Integer, Point>, Double> newElement = new Tuple3<Centroid, Tuple2<Integer, Point>, Double>();
> double distance = e.f1.euclideanDistance(centroid);
> totalDistance += distance;
> newElement.setFields(centroid, e, distance);
> elementsWithDistance.add(newElement);
> elementsCount++;
> }
> }
> // finding mean
> double mean = totalDistance / elementsCount;
> double sdTotalDistanceSquare = 0;
> for (Tuple3<Centroid, Tuple2<Integer, Point>, Double> elementWithDistance : elementsWithDistance) {
> double distanceSquare = Math.pow(mean - elementWithDistance.f2, 2);
> sdTotalDistanceSquare += distanceSquare;
> }
> double sd = Math.sqrt(sdTotalDistanceSquare / elementsCount);
> double upperlimit = mean + 2 * sd;
> double lowerlimit = mean - 2 * sd;
> Tuple3<Integer, Point, Boolean> newElement = new Tuple3<Integer, Point, Boolean>(); // true = outlier
> for (Tuple3<Centroid, Tuple2<Integer, Point>, Double> elementWithDistance : elementsWithDistance) {
> newElement = new Tuple3<Integer, Point, Boolean>();
> if (elementWithDistance.f2 < lowerlimit || elementWithDistance.f2 > upperlimit) {
> // set as outlier
> newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1, true);
> } else {
> newElement.setFields(elementWithDistance.f1.f0, elementWithDistance.f1.f1, false);
> }
> finalElements.add(newElement);
> }
> }
> return finalElements;
> }
>
> I have attached herewith the screenshot of my project structure and
> KMeansOutlierDetection.java file for more clarity.
>
>
> Best Regards,
> Subash Basnet
>
> On Wed, Feb 10, 2016 at 12:26 PM, Fabian Hueske  wrote:
>
>>
>> Hi Subash,
>>
>> how is findOutliers implemented?
>>
>> It might be that you mix-up local and cluster computation. All DataSets
>> are processed in the cluster. Please note the following:
>> - ExecutionEnvironment.fromCollection() transforms a client local
>> connection into a DataSet by serializing it and sending it to the cluster.
>> - DataSet.collect() transforms a DataSet into a collection and ships it
>> back to the client.
>>
>> So, does findOutliers operate on the cluster or on the local client,
>> i.e., does it work with DataSet and send the result back as a collection or
>> does it first collect the results as collection and operate on these?
>>
>> Best, Fabian
>>
>> 2016-02-10 12:13 GMT+01:00 subash basnet :
>>
>>> Hello Stefano,
>>>
>>> Yeah the type casting worked, thank you. But not able to print the
>>> Dataset to the file.
>>>
>>> The default below code which writes the KMeans points along with their
>>> centroid numbers to the file works fine:
>>> // feed new centroids back into next iteration
>>> DataSet finalCentroids = loop.closeWith(newCentroids);
>>> DataSet> clusteredPoints = points
>>> // assign points to final clusters
>>> .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids,
>>> "centroids");
>>>   if (fileOutput) {
>>> clusteredPoints.writeAsCsv(outputPath, "\n", " ");
>>> // since file sinks are lazy, we trigger the execution explicitly
>>> env.execute("KMeans Example");
>>> }
>>>
>>> But my modified code be

Re: TaskManager unable to register with JobManager

2016-02-10 Thread Ravinder Kaur
Hello All,

I need to know the range of ports that are being used during the
master/slave communication in the Flink cluster. Also is there a way I can
specify a range of ports, at the slaves, to restrict them to connect to
master only in this range?

Kind Regards,
Ravinder Kaur


On Wed, Feb 3, 2016 at 10:09 PM, Stephan Ewen  wrote:

> Can machines connect to port 6123? The firewall may block that port, but
> permit SSH.
>
> On Wed, Feb 3, 2016 at 9:52 PM, Ravinder Kaur  wrote:
>
>> Hello,
>>
>> Here is the log file of the Jobmanager. I did not see anything suspicious,
>> and as it suggests, the ports are also listening.
>>
>> 20:58:46,906 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Starting JobManager on IP-of-master:6123 with execution mode
>> CLUSTER and streaming mode BATCH_ONLY
>> 20:58:46,978 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Security is not enabled. Starting non-authenticated JobManager.
>> 20:58:46,979 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Starting JobManager
>> 20:58:46,980 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Starting JobManager actor system at 10.155.208.138:6123
>> 20:58:48,196 INFO  akka.event.slf4j.Slf4jLogger
>>- Slf4jLogger started
>> 20:58:48,295 INFO  Remoting
>>- Starting remoting
>> 20:58:48,541 INFO  Remoting
>>- Remoting started; listening on addresses
>> :[akka.tcp://flink@IP-of-master:6123]
>> 20:58:48,549 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Starting JobManger web frontend
>> 20:58:48,690 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> - Using directory
>> /tmp/flink-web-876a4755-4f38-4ff7-8202-f263afa9b986 for the web interface
>> files
>> 20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> - Serving job manager log from
>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.log
>> 20:58:48,691 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> - Serving job manager stdout from
>> /home/flink/flink-0.10.1/log/flink-flink-jobmanager-0-hostname.out
>> 20:58:49,044 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> - Web frontend listening at 0:0:0:0:0:0:0:0:8081
>> 20:58:49,045 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Starting JobManager actor
>> 20:58:49,052 INFO  org.apache.flink.runtime.blob.BlobServer
>>- Created BLOB server storage directory
>> /tmp/blobStore-e0c52bfb-2411-4a83-ac8d-5664a5894258
>> 20:58:49,054 INFO  org.apache.flink.runtime.blob.BlobServer
>>- Started BLOB server at 0.0.0.0:43683 - max concurrent requests:
>> 50 - max backlog: 1000
>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.MemoryArchivist
>> - Started memory archivist akka://flink/user/archive
>> 20:58:49,075 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Starting JobManager at akka.tcp://flink@IP-of-master
>> :6123/user/jobmanager.
>> 20:58:49,081 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- JobManager akka.tcp://flink@IP-of-master:6123/user/jobmanager
>> was granted leadership with leader session ID None.
>> 20:58:49,082 INFO  org.apache.flink.runtime.webmonitor.WebRuntimeMonitor
>> - Starting with JobManager 
>> akka.tcp://flink@IP-of-master:6123/user/jobmanager
>> on port 8081
>> 20:58:49,083 INFO
>>  org.apache.flink.runtime.webmonitor.JobManagerRetriever   - New leader
>> reachable under akka.tcp://flink@IP-of-master:6123/user/jobmanager:null.
>> 20:59:22,794 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Submitting job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
>> Wed Feb 03 20:59:22 CET 2016).
>> 20:59:22,853 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Scheduling job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
>> Wed Feb 03 20:59:22 CET 2016).
>> 20:59:22,857 INFO  org.apache.flink.runtime.jobmanager.JobManager
>>- Status of job 72733d69588678ec224003ab5577cab8 (Flink Java Job at
>> Wed Feb 03 20:59:22 CET 2016) changed to RUNNING.
>> 20:59:22,859 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>- CHAIN DataSource (at
>> getDefaultTextLineDataSet(WordCountData.java:70)
>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from CREATED to SCHEDULED
>> 20:59:22,881 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>- CHAIN DataSource (at
>> getDefaultTextLineDataSet(WordCountData.java:70)
>> (org.apache.flink.api.java.io.CollectionInputFormat)) -> FlatMap (FlatMap
>> at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)
>> (1/1) (23fb37019a504fd6c7bf95e46a8cd7a3) switched from SCHEDULED to CANCELED
>> 20:59:22,881 INFO  org.apache.flink.runtime.jobmanager.JobMan

Re: Dataset filter improvement

2016-02-10 Thread Flavio Pompermaier
What do you mean exactly? Probably I'm missing something here.. remember
that I can specify the right subclass only after the last flatMap; after
the first map neither I nor Flink can know the exact subclass of BaseClass.

On Wed, Feb 10, 2016 at 12:42 PM, Stephan Ewen  wrote:

> Class hierarchies should definitely work, even if the base class has no
> fields.
>
> They work more efficiently if you register the subclasses at the execution
> environment (Flink cannot infer them from the function signatures because
> the function signatures only contain the abstract base class).
>
> On Wed, Feb 10, 2016 at 12:23 PM, Flavio Pompermaier  > wrote:
>
>> Because The classes are not related to each other. Do you think it's a
>> good idea to have something like this?
>>
>> abstract class BaseClass(){
>>String someField;
>> }
>>
>> class ExtendedClass1 extends BaseClass (){
>>String someOtherField11;
>>String someOtherField12;
>>String someOtherField13;
>>  ...
>> }
>>
>> class ExtendedClass2 extends BaseClass (){
>>Integer someOtherField21;
>>Double someOtherField22;
>>Integer someOtherField23;
>>  ...
>> }
>>
>> and then declare my map as Map. and then apply a flatMap
>> that can be used to generate the specific datasets?
>> Doesn't this cause problems for Flink? Classes can be very different from
>> each other.. maybe this can cause problems with the plan generation.. isn't
>> it?
>>
>> Thanks Fabian and Stephan for the support!
>>
>>
>> On Wed, Feb 10, 2016 at 11:47 AM, Stephan Ewen  wrote:
>>
>>> Why not use an abstract base class and N subclasses?
>>>
>>> On Wed, Feb 10, 2016 at 10:05 AM, Fabian Hueske 
>>> wrote:
>>>
 Unfortunately, there is no Either<1,...,n>.
 You could implement something like a Tuple3,
 Option, Option>. However, Flink does not provide an Option
 type (comes with Java8). You would need to implement it yourself incl.
 TypeInfo and Serializer. You can get some inspiration from the Either type
 info /serializer, if you want to go this way.

 Using a byte array would also work but doesn't look much easier than
 the Option approach to me.

 2016-02-10 9:47 GMT+01:00 Flavio Pompermaier :

> Yes, the intermediate dataset I create then join again between
> themselves. What I'd need is a Either<1,...,n>. Is that possible to add?
> Otherwise I was thinking to generate a Tuple2 and in
> the subsequent filter+map/flatMap deserialize only those elements I want 
> to
> group togheter (e.g. t.f0=="someEventType") in order to generate the typed
> dataset based.
> Which one  do you think is the best solution?
>
> On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske 
> wrote:
>
>> Hi Flavio,
>>
>> I did not completely understand which objects should go where, but
>> here are some general guidelines:
>>
>> - early filtering is mostly a good idea (unless evaluating the filter
>> expression is very expensive)
>> - you can use a flatMap function to combine a map and a filter
>> - applying multiple functions on the same data set does not
>> necessarily materialize the data set (in memory or on disk). In most 
>> cases
>> it prevents chaining, hence there is serialization overhead. In some 
>> cases
>> where the forked data streams are joined again, the data set must be
>> materialized in order to avoid deadlocks.
>> - it is not possible to write a map that generates two different
>> types, but you could implement a mapper that returns an Either> Second> type.
>>
>> Hope this helps,
>> Fabian
>>
>> 2016-02-10 8:43 GMT+01:00 Flavio Pompermaier :
>>
>>> Any help on this?
>>> On 9 Feb 2016 18:03, "Flavio Pompermaier" 
>>> wrote:
>>>
 Hi to all,

 in my program I have a Dataset that generated different types of
 object wrt the incoming element.
 Thus it's like a Map.
 In order to type the different generated datasets I do something:

 Dataset start =...

 Dataset ds1 = start.filter().map(..);
 Dataset ds2 = start.filter().map(..);
 Dataset ds3 = start.filter().map(..);
 Dataset ds4 = start.filter().map(..);

 However this is very inefficient (I think because Flink needs to
 materialize the entire source dataset for every slot).

 It's much more efficient to group the generation of objects of the
 same type. E.g.:

 Dataset start =..

 Dataset tmp1 = start.map(..);
 Dataset tmp2 = start.map(..);
 Dataset ds1 = tmp1.filter();
 Dataset ds2 = tmp1.filter();
 Dataset ds3 = tmp2.filter();
 Dataset ds4 = tmp2.filter();

 Increasing the number of slots per task manager make things worse
 and worse :)
 Is there a way to improve this situation? Is it poss

Re: Dataset filter improvement

2016-02-10 Thread Stephan Ewen
Class hierarchies should definitely work, even if the base class has no
fields.

They work more efficiently if you register the subclasses at the execution
environment (Flink cannot infer them from the function signatures because
the function signatures only contain the abstract base class).
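
A minimal sketch of that pattern, using only the class names assumed in this
thread (BaseClass, ExtendedClass1, ExtendedClass2): register the subclasses,
build one DataSet of the base type, and carve each typed DataSet out with an
instanceof check in a flatMap, which combines the filter and the cast:

// needs org.apache.flink.api.common.functions.FlatMapFunction,
// org.apache.flink.api.java.DataSet and org.apache.flink.util.Collector
env.registerType(ExtendedClass1.class);
env.registerType(ExtendedClass2.class);

// 'mixed' is the single DataSet<BaseClass> produced by one map/flatMap over 'start'
DataSet<BaseClass> mixed = ...;

DataSet<ExtendedClass1> ds1 = mixed.flatMap(
    new FlatMapFunction<BaseClass, ExtendedClass1>() {
        @Override
        public void flatMap(BaseClass value, Collector<ExtendedClass1> out) {
            if (value instanceof ExtendedClass1) {
                out.collect((ExtendedClass1) value);  // filter and cast in one step
            }
        }
    });
// ds2 is built analogously with ExtendedClass2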

On Wed, Feb 10, 2016 at 12:23 PM, Flavio Pompermaier 
wrote:

> Because The classes are not related to each other. Do you think it's a
> good idea to have something like this?
>
> abstract class BaseClass(){
>String someField;
> }
>
> class ExtendedClass1 extends BaseClass (){
>String someOtherField11;
>String someOtherField12;
>String someOtherField13;
>  ...
> }
>
> class ExtendedClass2 extends BaseClass (){
>Integer someOtherField21;
>Double someOtherField22;
>Integer someOtherField23;
>  ...
> }
>
> and then declare my map as Map. and then apply a flatMap
> that can be used to generate the specific datasets?
> Doesn't this cause problems for Flink? Classes can be very different from each
> other.. maybe this can cause problems with the plan generation.. isn't it?
>
> Thanks Fabian and Stephan for the support!
>
>
> On Wed, Feb 10, 2016 at 11:47 AM, Stephan Ewen  wrote:
>
>> Why not use an abstract base class and N subclasses?
>>
>> On Wed, Feb 10, 2016 at 10:05 AM, Fabian Hueske 
>> wrote:
>>
>>> Unfortunately, there is no Either<1,...,n>.
>>> You could implement something like a Tuple3,
>>> Option, Option>. However, Flink does not provide an Option
>>> type (comes with Java8). You would need to implement it yourself incl.
>>> TypeInfo and Serializer. You can get some inspiration from the Either type
>>> info /serializer, if you want to go this way.
>>>
>>> Using a byte array would also work but doesn't look much easier than the
>>> Option approach to me.
>>>
>>> 2016-02-10 9:47 GMT+01:00 Flavio Pompermaier :
>>>
 Yes, the intermediate dataset I create then join again between
 themselves. What I'd need is a Either<1,...,n>. Is that possible to add?
 Otherwise I was thinking to generate a Tuple2 and in the
 subsequent filter+map/flatMap deserialize only those elements I want to
 group togheter (e.g. t.f0=="someEventType") in order to generate the typed
 dataset based.
 Which one  do you think is the best solution?

 On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske 
 wrote:

> Hi Flavio,
>
> I did not completely understand which objects should go where, but
> here are some general guidelines:
>
> - early filtering is mostly a good idea (unless evaluating the filter
> expression is very expensive)
> - you can use a flatMap function to combine a map and a filter
> - applying multiple functions on the same data set does not
> necessarily materialize the data set (in memory or on disk). In most cases
> it prevents chaining, hence there is serialization overhead. In some cases
> where the forked data streams are joined again, the data set must be
> materialized in order to avoid deadlocks.
> - it is not possible to write a map that generates two different
> types, but you could implement a mapper that returns an Either Second> type.
>
> Hope this helps,
> Fabian
>
> 2016-02-10 8:43 GMT+01:00 Flavio Pompermaier :
>
>> Any help on this?
>> On 9 Feb 2016 18:03, "Flavio Pompermaier" 
>> wrote:
>>
>>> Hi to all,
>>>
>>> in my program I have a Dataset that generated different types of
>>> object wrt the incoming element.
>>> Thus it's like a Map.
>>> In order to type the different generated datasets I do something:
>>>
>>> Dataset start =...
>>>
>>> Dataset ds1 = start.filter().map(..);
>>> Dataset ds2 = start.filter().map(..);
>>> Dataset ds3 = start.filter().map(..);
>>> Dataset ds4 = start.filter().map(..);
>>>
>>> However this is very inefficient (I think because Flink needs to
>>> materialize the entire source dataset for every slot).
>>>
>>> It's much more efficient to group the generation of objects of the
>>> same type. E.g.:
>>>
>>> Dataset start =..
>>>
>>> Dataset tmp1 = start.map(..);
>>> Dataset tmp2 = start.map(..);
>>> Dataset ds1 = tmp1.filter();
>>> Dataset ds2 = tmp1.filter();
>>> Dataset ds3 = tmp2.filter();
>>> Dataset ds4 = tmp2.filter();
>>>
>>> Increasing the number of slots per task manager make things worse
>>> and worse :)
>>> Is there a way to improve this situation? Is it possible to write a
>>> "map" generating different type of object and then filter them by 
>>> generated
>>> class type?
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>


>>>
>>
>


Re: How to convert List to flink DataSet

2016-02-10 Thread Fabian Hueske
Hi Subash,

how is findOutliers implemented?

It might be that you mix-up local and cluster computation. All DataSets are
processed in the cluster. Please note the following:
- ExecutionEnvironment.fromCollection() transforms a client local
connection into a DataSet by serializing it and sending it to the cluster.
- DataSet.collect() transforms a DataSet into a collection and ships it
back to the client.

So, does findOutliers operate on the cluster or on the local client, i.e.,
does it work with DataSet and send the result back as a collection or does
it first collect the results as collection and operate on these?

Best, Fabian
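
A tiny illustration of that boundary (variable names assumed):

// local Java collection -> DataSet: the data is serialized and shipped to the cluster
DataSet<Point> points = env.fromCollection(localPointList);

// ... any DataSet transformations here run in the cluster ...

// DataSet -> local Java collection: the results are shipped back to the client JVM,
// and everything done on this List afterwards runs in the client only
List<Point> localResults = points.collect();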

2016-02-10 12:13 GMT+01:00 subash basnet :

> Hello Stefano,
>
> Yeah the type casting worked, thank you. But not able to print the Dataset
> to the file.
>
> The default below code which writes the KMeans points along with their
> centroid numbers to the file works fine:
> // feed new centroids back into next iteration
> DataSet finalCentroids = loop.closeWith(newCentroids);
> DataSet> clusteredPoints = points
> // assign points to final clusters
> .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids,
> "centroids");
>   if (fileOutput) {
> clusteredPoints.writeAsCsv(outputPath, "\n", " ");
> // since file sinks are lazy, we trigger the execution explicitly
> env.execute("KMeans Example");
> }
>
> But my modified code below to find outlier:
> // feed new centroids back into next iteration
> DataSet finalCentroids = loop.closeWith(newCentroids);
> DataSet> clusteredPoints = points
> // assign points to final clusters
> .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids,
> "centroids");
>*DataSet fElements =
> env.fromCollection(findOutliers(clusteredPoints, finalCentroids));*
>if (fileOutput) {
> *fElements.writeAsCsv(outputPath, "\n", " ");*
> // since file sinks are lazy, we trigger the execution explicitly
> env.execute("KMeans Example");
> }
>
> It's not writing to the file, the *result *folder does not get created
> inside kmeans folder where my centers, points file are located. I am only
> able to print it to the console via *fElements.print();*
>
> Does it have something to do with *env.exectue("")*, which must be set
> somewhere in the previous case but not in my case.
>
>
>
> Best Regards,
> Subash Basnet
>
>
> On Tue, Feb 9, 2016 at 6:29 PM, Stefano Baghino <
> stefano.bagh...@radicalbit.io> wrote:
>
>>
>> Assuming your EnvironmentContext is named `env` Simply call:
>>
>> DataSet> fElements = env.*fromCollection*
>> (finalElements);
>>
>> Does this help?
>>
>> On Tue, Feb 9, 2016 at 6:06 PM, subash basnet  wrote:
>>
>>> Hello all,
>>>
>>> I have performed a modification in KMeans code to detect outliers. I
>>> have printed the output in the console but I am not able to write it to the
>>> file using the given 'writeAsCsv' method.
>>> The problem is I generate a list of tuples.
>>> My List is:
>>> List<Tuple3<Integer, Point, Boolean>> finalElements = new ArrayList<Tuple3<Integer, Point, Boolean>>();
>>> Following is the datatype of the elements added to the list:
>>> Tuple3<Integer, Point, Boolean> newElement = new Tuple3<Integer, Point, Boolean>();
>>> finalElements.add(newElement);
>>> Now I am stuck on how to convert this 'finalElements' to
>>> DataSet<Tuple3<Integer, Point, Boolean>> fElements,
>>> so that I could use
>>> fElements.writeAsCsv(outputPath, "\n"," ");
>>>
>>> Best Regards,
>>> Subash Basnet
>>>
>>
>>
>>
>> --
>> BR,
>> Stefano Baghino
>>
>> Software Engineer @ Radicalbit
>>
>>
>


Re: Dataset filter improvement

2016-02-10 Thread Flavio Pompermaier
Because the classes are not related to each other. Do you think it would be a good
idea to have something like this?

abstract class BaseClass {
   String someField;
}

class ExtendedClass1 extends BaseClass {
   String someOtherField11;
   String someOtherField12;
   String someOtherField13;
   ...
}

class ExtendedClass2 extends BaseClass {
   Integer someOtherField21;
   Double someOtherField22;
   Integer someOtherField23;
   ...
}

and then declare my map as Map, and then apply a flatMap
that can be used to generate the specific datasets (see the sketch after this
message)?
Doesn't this cause problems for Flink? The classes can be very different from
each other... maybe this can cause problems with the plan generation, can't it?

Thanks Fabian and Stephan for the support!
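
As a rough illustration of the idea discussed above (a hedged sketch, not the definitive
pattern: the field values, the string input and the dispatch rule are made up), a single
flatMap can emit the common base type once, and each typed dataset can then be derived
with an instanceof filter plus a cast:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SplitByClassSketch {

    public static abstract class BaseClass {
        public String someField;
    }

    public static class ExtendedClass1 extends BaseClass {
        public String someOtherField11;
    }

    public static class ExtendedClass2 extends BaseClass {
        public Integer someOtherField21;
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> start = env.fromElements("1,foo", "2,42");

        // one pass over the source: emit the matching subclass per record
        DataSet<BaseClass> mixed = start.flatMap(new FlatMapFunction<String, BaseClass>() {
            @Override
            public void flatMap(String value, Collector<BaseClass> out) {
                BaseClass result = value.startsWith("1") ? new ExtendedClass1() : new ExtendedClass2();
                result.someField = value;
                out.collect(result);
            }
        });

        // derive a typed dataset per subclass: filter by runtime class, then cast
        DataSet<ExtendedClass1> ds1 = mixed
            .filter(new FilterFunction<BaseClass>() {
                @Override
                public boolean filter(BaseClass r) {
                    return r instanceof ExtendedClass1;
                }
            })
            .map(new MapFunction<BaseClass, ExtendedClass1>() {
                @Override
                public ExtendedClass1 map(BaseClass r) {
                    return (ExtendedClass1) r;
                }
            });

        ds1.print();
    }
}

Since the declared element type is the abstract base class, Flink will most likely treat
it as a generic type and fall back to Kryo serialization, which works but is less
efficient than concrete tuple or POJO types.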


On Wed, Feb 10, 2016 at 11:47 AM, Stephan Ewen  wrote:

> Why not use an abstract base class and N subclasses?
>
> On Wed, Feb 10, 2016 at 10:05 AM, Fabian Hueske  wrote:
>
>> Unfortunately, there is no Either<1,...,n>.
>> You could implement something like a Tuple3, Option,
>> Option>. However, Flink does not provide an Option type (comes with
>> Java8). You would need to implement it yourself incl. TypeInfo and
>> Serializer. You can get some inspiration from the Either type info
>> /serializer, if you want to go this way.
>>
>> Using a byte array would also work but doesn't look much easier than the
>> Option approach to me.
>>
>> 2016-02-10 9:47 GMT+01:00 Flavio Pompermaier :
>>
>>> Yes, the intermediate datasets I create are then joined again between
>>> themselves. What I'd need is an Either<1,...,n>. Is that possible to add?
>>> Otherwise I was thinking to generate a Tuple2 and in the
>>> subsequent filter+map/flatMap deserialize only those elements I want to
>>> group together (e.g. t.f0=="someEventType") in order to generate the typed
>>> dataset.
>>> Which one do you think is the best solution?
>>>
>>> On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske 
>>> wrote:
>>>
 Hi Flavio,

 I did not completely understand which objects should go where, but here
 are some general guidelines:

 - early filtering is mostly a good idea (unless evaluating the filter
 expression is very expensive)
 - you can use a flatMap function to combine a map and a filter
 - applying multiple functions on the same data set does not necessarily
 materialize the data set (in memory or on disk). In most cases it prevents
 chaining, hence there is serialization overhead. In some cases where the
 forked data streams are joined again, the data set must be materialized in
 order to avoid deadlocks.
 - it is not possible to write a map that generates two different types,
 but you could implement a mapper that returns an Either 
 type.

 Hope this helps,
 Fabian

 2016-02-10 8:43 GMT+01:00 Flavio Pompermaier :

> Any help on this?
> On 9 Feb 2016 18:03, "Flavio Pompermaier" 
> wrote:
>
>> Hi to all,
>>
>> in my program I have a Dataset that generates different types of
>> objects depending on the incoming element.
>> Thus it's like a Map.
>> In order to type the different generated datasets I do something:
>>
>> Dataset start =...
>>
>> Dataset ds1 = start.filter().map(..);
>> Dataset ds2 = start.filter().map(..);
>> Dataset ds3 = start.filter().map(..);
>> Dataset ds4 = start.filter().map(..);
>>
>> However this is very inefficient (I think because Flink needs to
>> materialize the entire source dataset for every slot).
>>
>> It's much more efficient to group the generation of objects of the
>> same type. E.g.:
>>
>> Dataset start =..
>>
>> Dataset tmp1 = start.map(..);
>> Dataset tmp2 = start.map(..);
>> Dataset ds1 = tmp1.filter();
>> Dataset ds2 = tmp1.filter();
>> Dataset ds3 = tmp2.filter();
>> Dataset ds4 = tmp2.filter();
>>
>> Increasing the number of slots per task manager makes things worse and
>> worse :)
>> Is there a way to improve this situation? Is it possible to write a
>> "map" generating different types of objects and then filter them by the
>> generated class type?
>>
>> Best,
>> Flavio
>>
>>
>>
>>
>>
>>
>>

>>>
>>>
>>
>


Re: How to convert List to flink DataSet

2016-02-10 Thread subash basnet
Hello Stefano,

Yeah, the type casting worked, thank you. But I am not able to write the
DataSet to a file.

The default code below, which writes the KMeans points along with their
centroid numbers to the file, works fine:
// feed new centroids back into next iteration
DataSet finalCentroids = loop.closeWith(newCentroids);
DataSet> clusteredPoints = points
// assign points to final clusters
.map(new SelectNearestCenter()).withBroadcastSet(finalCentroids,
"centroids");
  if (fileOutput) {
clusteredPoints.writeAsCsv(outputPath, "\n", " ");
// since file sinks are lazy, we trigger the execution explicitly
env.execute("KMeans Example");
}

But my modified code below to find outlier:
// feed new centroids back into next iteration
DataSet finalCentroids = loop.closeWith(newCentroids);
DataSet> clusteredPoints = points
// assign points to final clusters
.map(new SelectNearestCenter()).withBroadcastSet(finalCentroids,
"centroids");
   DataSet fElements =
env.fromCollection(findOutliers(clusteredPoints, finalCentroids));
   if (fileOutput) {
fElements.writeAsCsv(outputPath, "\n", " ");
// since file sinks are lazy, we trigger the execution explicitly
env.execute("KMeans Example");
}

It's not writing to the file; the result folder does not get created
inside the kmeans folder where my centers and points files are located. I am
only able to print it to the console via fElements.print();

Does it have something to do with env.execute(""), which must be set
somewhere in the previous case but not in my case?



Best Regards,
Subash Basnet


On Tue, Feb 9, 2016 at 6:29 PM, Stefano Baghino <
stefano.bagh...@radicalbit.io> wrote:

>
> Assuming your ExecutionEnvironment is named `env`, simply call:
>
> DataSet> fElements = env.fromCollection(finalElements);
>
> Does this help?
>
> On Tue, Feb 9, 2016 at 6:06 PM, subash basnet  wrote:
>
>> Hello all,
>>
>> I have made a modification to the KMeans code to detect outliers. I have
>> printed the output to the console, but I am not able to write it to a file
>> using the given 'writeAsCsv' method.
>> The problem is I generate a list of tuples.
>> My List is:
>> List finalElements = new ArrayList();
>> Following is the datatype of the elements added to the list:
>> Tuple3 newElement = new Tuple3> Boolean>();
>> finalElements.add(newElement);
>> Now I am stuck on how to convert this 'finalElements' to
>> DataSet> fElements,
>> so that I could use
>> fElements.writeAsCsv(outputPath, "\n"," ");
>>
>> Best Regards,
>> Subash Basnet
>>
>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>
>


Re: Dataset filter improvement

2016-02-10 Thread Stephan Ewen
Why not use an abstract base class and N subclasses?

On Wed, Feb 10, 2016 at 10:05 AM, Fabian Hueske  wrote:

> Unfortunately, there is no Either<1,...,n>.
> You could implement something like a Tuple3, Option,
> Option>. However, Flink does not provide an Option type (comes with
> Java8). You would need to implement it yourself incl. TypeInfo and
> Serializer. You can get some inspiration from the Either type info
> /serializer, if you want to go this way.
>
> Using a byte array would also work but doesn't look much easier than the
> Option approach to me.
>
> 2016-02-10 9:47 GMT+01:00 Flavio Pompermaier :
>
>> Yes, the intermediate datasets I create are then joined again between
>> themselves. What I'd need is an Either<1,...,n>. Is that possible to add?
>> Otherwise I was thinking to generate a Tuple2 and in the
>> subsequent filter+map/flatMap deserialize only those elements I want to
>> group together (e.g. t.f0=="someEventType") in order to generate the typed
>> dataset.
>> Which one do you think is the best solution?
>>
>> On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske  wrote:
>>
>>> Hi Flavio,
>>>
>>> I did not completely understand which objects should go where, but here
>>> are some general guidelines:
>>>
>>> - early filtering is mostly a good idea (unless evaluating the filter
>>> expression is very expensive)
>>> - you can use a flatMap function to combine a map and a filter
>>> - applying multiple functions on the same data set does not necessarily
>>> materialize the data set (in memory or on disk). In most cases it prevents
>>> chaining, hence there is serialization overhead. In some cases where the
>>> forked data streams are joined again, the data set must be materialized in
>>> order to avoid deadlocks.
>>> - it is not possible to write a map that generates two different types,
>>> but you could implement a mapper that returns an Either type.
>>>
>>> Hope this helps,
>>> Fabian
>>>
>>> 2016-02-10 8:43 GMT+01:00 Flavio Pompermaier :
>>>
 Any help on this?
 On 9 Feb 2016 18:03, "Flavio Pompermaier"  wrote:

> Hi to all,
>
> in my program I have a Dataset that generates different types of
> objects depending on the incoming element.
> Thus it's like a Map.
> In order to type the different generated datasets I do something:
>
> Dataset start =...
>
> Dataset ds1 = start.filter().map(..);
> Dataset ds2 = start.filter().map(..);
> Dataset ds3 = start.filter().map(..);
> Dataset ds4 = start.filter().map(..);
>
> However this is very inefficient (I think because Flink needs to
> materialize the entire source dataset for every slot).
>
> It's much more efficient to group the generation of objects of the
> same type. E.g.:
>
> Dataset start =..
>
> Dataset tmp1 = start.map(..);
> Dataset tmp2 = start.map(..);
> Dataset ds1 = tmp1.filter();
> Dataset ds2 = tmp1.filter();
> Dataset ds3 = tmp2.filter();
> Dataset ds4 = tmp2.filter();
>
> Increasing the number of slots per task manager makes things worse and
> worse :)
> Is there a way to improve this situation? Is it possible to write a
> "map" generating different types of objects and then filter them by the
> generated class type?
>
> Best,
> Flavio
>
>
>
>
>
>
>
>>>
>>
>>
>


Re: Flink 1.0-SNAPSHOT scala 2.11 in S3 has scala 2.10

2016-02-10 Thread Maximilian Michels
Hi David,

Just had a check as well. Can't find a 2.10 Jar in the lib folder.

Cheers,
Max

On Wed, Feb 10, 2016 at 6:17 AM, Chiwan Park  wrote:
> Hi David,
>
> I just downloaded the "flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz" but there is
> no jar compiled with Scala 2.10. Could you check again?
>
> Regards,
> Chiwan Park
>
>> On Feb 10, 2016, at 2:59 AM, David Kim  
>> wrote:
>>
>> Hello,
>>
>> I noticed that the flink binary for scala 2.11 located at 
>> http://stratosphere-bin.s3.amazonaws.com/flink-1.0-SNAPSHOT-bin-hadoop2_2.11.tgz
>>  contains the scala 2.10 flavor.
>>
>> If you open the lib folder the name of the jar in lib is 
>> flink-dist_2.10-1.0-SNAPSHOT.jar.
>>
>> Could this be an error in the process that updates these files in S3?
>>
>> We're using that download link following the suggestions here: 
>> https://flink.apache.org/contribute-code.html#snapshots-nightly-builds. If 
>> there's a better place let us know as well!
>>
>> Thanks,
>> David
>


Re: Dataset filter improvement

2016-02-10 Thread Fabian Hueske
Unfortunately, there is no Either<1,...,n>.
You could implement something like a Tuple3, Option,
Option>. However, Flink does not provide an Option type (comes with
Java8). You would need to implement it yourself incl. TypeInfo and
Serializer. You can get some inspiration from the Either type info
/serializer, if you want to go this way.

Using a byte array would also work but doesn't look much easier than the
Option approach to me.

2016-02-10 9:47 GMT+01:00 Flavio Pompermaier :

> Yes, the intermediate datasets I create are then joined again between themselves.
> What I'd need is an Either<1,...,n>. Is that possible to add?
> Otherwise I was thinking to generate a Tuple2 and in the
> subsequent filter+map/flatMap deserialize only those elements I want to
> group together (e.g. t.f0=="someEventType") in order to generate the typed
> dataset.
> Which one do you think is the best solution?
>
> On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske  wrote:
>
>> Hi Flavio,
>>
>> I did not completely understand which objects should go where, but here
>> are some general guidelines:
>>
>> - early filtering is mostly a good idea (unless evaluating the filter
>> expression is very expensive)
>> - you can use a flatMap function to combine a map and a filter
>> - applying multiple functions on the same data set does not necessarily
>> materialize the data set (in memory or on disk). In most cases it prevents
>> chaining, hence there is serialization overhead. In some cases where the
>> forked data streams are joined again, the data set must be materialized in
>> order to avoid deadlocks.
>> - it is not possible to write a map that generates two different types,
>> but you could implement a mapper that returns an Either type.
>>
>> Hope this helps,
>> Fabian
>>
>> 2016-02-10 8:43 GMT+01:00 Flavio Pompermaier :
>>
>>> Any help on this?
>>> On 9 Feb 2016 18:03, "Flavio Pompermaier"  wrote:
>>>
 Hi to all,

 in my program I have a Dataset that generates different types of objects
 depending on the incoming element.
 Thus it's like a Map.
 In order to type the different generated datasets I do something:

 Dataset start =...

 Dataset ds1 = start.filter().map(..);
 Dataset ds2 = start.filter().map(..);
 Dataset ds3 = start.filter().map(..);
 Dataset ds4 = start.filter().map(..);

 However this is very inefficient (I think because Flink needs to
 materialize the entire source dataset for every slot).

 It's much more efficient to group the generation of objects of the same
 type. E.g.:

 Dataset start =..

 Dataset tmp1 = start.map(..);
 Dataset tmp2 = start.map(..);
 Dataset ds1 = tmp1.filter();
 Dataset ds2 = tmp1.filter();
 Dataset ds3 = tmp2.filter();
 Dataset ds4 = tmp2.filter();

 Increasing the number of slots per task manager makes things worse and
 worse :)
 Is there a way to improve this situation? Is it possible to write a
 "map" generating different types of objects and then filter them by the
 generated class type?

 Best,
 Flavio







>>
>
>


Re: Dataset filter improvement

2016-02-10 Thread Flavio Pompermaier
Yes, the intermediate datasets I create are then joined again between themselves.
What I'd need is an Either<1,...,n>. Is that possible to add?
Otherwise I was thinking to generate a Tuple2 and in the
subsequent filter+map/flatMap deserialize only those elements I want to
group together (e.g. t.f0=="someEventType") in order to generate the typed
dataset (a rough sketch of this byte[] idea follows below).
Which one do you think is the best solution?
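
Purely to illustrate that byte[] idea (a hedged sketch, not from the thread: the event
tag, the payload class and the use of plain Java serialization are all made up), a
tagged Tuple2<String, byte[]> could be produced in one pass and then selectively
deserialized downstream:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class TaggedBytesSketch {

    public static class EventA implements Serializable {
        public int value;
        public EventA() {}
        public EventA(int value) { this.value = value; }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> start = env.fromElements("someEventType:1", "otherEventType:x");

        // single pass: tag each record and keep the payload as opaque bytes
        DataSet<Tuple2<String, byte[]>> tagged = start
            .map(new MapFunction<String, Tuple2<String, byte[]>>() {
                @Override
                public Tuple2<String, byte[]> map(String s) throws Exception {
                    if (s.startsWith("someEventType")) {
                        return new Tuple2<String, byte[]>("someEventType", serialize(new EventA(1)));
                    } else {
                        return new Tuple2<String, byte[]>("otherEventType", s.getBytes());
                    }
                }
            });

        // downstream: deserialize only the records carrying the wanted tag
        DataSet<EventA> eventsA = tagged
            .flatMap(new FlatMapFunction<Tuple2<String, byte[]>, EventA>() {
                @Override
                public void flatMap(Tuple2<String, byte[]> t, Collector<EventA> out) throws Exception {
                    if ("someEventType".equals(t.f0)) {
                        out.collect((EventA) deserialize(t.f1));
                    }
                }
            });

        eventsA.print();
    }

    private static byte[] serialize(Object o) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(o);
        oos.close();
        return bos.toByteArray();
    }

    private static Object deserialize(byte[] bytes) throws Exception {
        ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes));
        return ois.readObject();
    }
}

The Either-based mapper that Fabian suggests below avoids the manual serialization and
keeps the payload strongly typed, so it is probably the cleaner of the two options.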

On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske  wrote:

> Hi Flavio,
>
> I did not completely understand which objects should go where, but here
> are some general guidelines:
>
> - early filtering is mostly a good idea (unless evaluating the filter
> expression is very expensive)
> - you can use a flatMap function to combine a map and a filter
> - applying multiple functions on the same data set does not necessarily
> materialize the data set (in memory or on disk). In most cases it prevents
> chaining, hence there is serialization overhead. In some cases where the
> forked data streams are joined again, the data set must be materialized in
> order to avoid deadlocks.
> - it is not possible to write a map that generates two different types,
> but you could implement a mapper that returns an Either type.
>
> Hope this helps,
> Fabian
>
> 2016-02-10 8:43 GMT+01:00 Flavio Pompermaier :
>
>> Any help on this?
>> On 9 Feb 2016 18:03, "Flavio Pompermaier"  wrote:
>>
>>> Hi to all,
>>>
>>> in my program I have a Dataset that generates different types of objects
>>> depending on the incoming element.
>>> Thus it's like a Map.
>>> In order to type the different generated datasets I do something:
>>>
>>> Dataset start =...
>>>
>>> Dataset ds1 = start.filter().map(..);
>>> Dataset ds2 = start.filter().map(..);
>>> Dataset ds3 = start.filter().map(..);
>>> Dataset ds4 = start.filter().map(..);
>>>
>>> However this is very inefficient (I think because Flink needs to
>>> materialize the entire source dataset for every slot).
>>>
>>> It's much more efficient to group the generation of objects of the same
>>> type. E.g.:
>>>
>>> Dataset start =..
>>>
>>> Dataset tmp1 = start.map(..);
>>> Dataset tmp2 = start.map(..);
>>> Dataset ds1 = tmp1.filter();
>>> Dataset ds2 = tmp1.filter();
>>> Dataset ds3 = tmp2.filter();
>>> Dataset ds4 = tmp2.filter();
>>>
>>> Increasing the number of slots per task manager makes things worse and
>>> worse :)
>>> Is there a way to improve this situation? Is it possible to write a
>>> "map" generating different types of objects and then filter them by the
>>> generated class type?
>>>
>>> Best,
>>> Flavio
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>


Re: Dataset filter improvement

2016-02-10 Thread Fabian Hueske
Hi Flavio,

I did not completely understand which objects should go where, but here are
some general guidelines:

- early filtering is mostly a good idea (unless evaluating the filter
expression is very expensive)
- you can use a flatMap function to combine a map and a filter
- applying multiple functions on the same data set does not necessarily
materialize the data set (in memory or on disk). In most cases it prevents
chaining, hence there is serialization overhead. In some cases where the
forked data streams are joined again, the data set must be materialized in
order to avoid deadlocks.
- it is not possible to write a map that generates two different types, but
you could implement a mapper that returns an Either type (sketched below).

Hope this helps,
Fabian
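
To make the last guideline a bit more concrete, here is a hedged sketch of such an
Either-returning mapper. It assumes that org.apache.flink.types.Either and its
Left/Right factories are available in the Flink version at hand; the event-type check
on f0 and the sample data are made up for illustration:

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.types.Either;
import org.apache.flink.util.Collector;

public class EitherMapperSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, String>> start = env.fromElements(
            new Tuple2<String, String>("someEventType", "1"),
            new Tuple2<String, String>("otherEventType", "foo"));

        // one pass over the source: wrap each record as Left or Right
        DataSet<Either<Integer, String>> tagged = start
            .flatMap(new FlatMapFunction<Tuple2<String, String>, Either<Integer, String>>() {
                @Override
                public void flatMap(Tuple2<String, String> t, Collector<Either<Integer, String>> out) {
                    if ("someEventType".equals(t.f0)) {
                        out.collect(Either.<Integer, String>Left(Integer.parseInt(t.f1)));
                    } else {
                        out.collect(Either.<Integer, String>Right(t.f1));
                    }
                }
            });

        // downstream: keep one side and unwrap it into a typed DataSet
        DataSet<Integer> lefts = tagged
            .filter(new FilterFunction<Either<Integer, String>>() {
                @Override
                public boolean filter(Either<Integer, String> e) {
                    return e.isLeft();
                }
            })
            .map(new MapFunction<Either<Integer, String>, Integer>() {
                @Override
                public Integer map(Either<Integer, String> e) {
                    return e.left();
                }
            });

        lefts.print();
    }
}

The right side can be unwrapped the same way via isRight()/right(); for more than two
output types, nesting Eithers or falling back to the tag-plus-byte[] approach discussed
earlier in the thread would be needed.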

2016-02-10 8:43 GMT+01:00 Flavio Pompermaier :

> Any help on this?
> On 9 Feb 2016 18:03, "Flavio Pompermaier"  wrote:
>
>> Hi to all,
>>
>> in my program I have a Dataset that generates different types of objects
>> depending on the incoming element.
>> Thus it's like a Map.
>> In order to type the different generated datasets I do something:
>>
>> Dataset start =...
>>
>> Dataset ds1 = start.filter().map(..);
>> Dataset ds2 = start.filter().map(..);
>> Dataset ds3 = start.filter().map(..);
>> Dataset ds4 = start.filter().map(..);
>>
>> However this is very inefficient (I think because Flink needs to
>> materialize the entire source dataset for every slot).
>>
>> It's much more efficient to group the generation of objects of the same
>> type. E.g.:
>>
>> Dataset start =..
>>
>> Dataset tmp1 = start.map(..);
>> Dataset tmp2 = start.map(..);
>> Dataset ds1 = tmp1.filter();
>> Dataset ds2 = tmp1.filter();
>> Dataset ds3 = tmp2.filter();
>> Dataset ds4 = tmp2.filter();
>>
>> Increasing the number of slots per task manager makes things worse and
>> worse :)
>> Is there a way to improve this situation? Is it possible to write a "map"
>> generating different types of objects and then filter them by the generated
>> class type?
>>
>> Best,
>> Flavio
>>
>>
>>
>>
>>
>>
>>


Re: Join two Datasets --> InvalidProgramException

2016-02-10 Thread Dominique Rondé

Hi Fabian,

Your hint was right! Maven fooled me with its dependency management. Now
everything works as expected!


Many many thanks to all of you!

Greets
Dominique
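
As a small aside on the version check Fabian suggests in the quoted message below, one
hedged way to see which Flink version the client classpath actually resolves to (the
class and method come from flink-runtime, which has to be on the classpath; the
surrounding program is made up) is:

import org.apache.flink.runtime.util.EnvironmentInformation;

public class ClientVersionCheck {

    public static void main(String[] args) {
        // should match the "Version: ..." line at the top of the JobManager
        // and TaskManager logs; a mismatch points at a mixed-up dependency set
        System.out.println("Client-side Flink version: " + EnvironmentInformation.getVersion());
    }
}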

On 10.02.2016 at 08:45, Fabian Hueske wrote:

Hi Dominique,

can you check if the versions of the remotely running job manager & 
task managers are the same as the Flink version that is used to submit 
the job? The version and commit hash are logged at the top of the JM 
and TM log files.


Right now, the local client optimizes the job, chooses the execution 
strategies, and sends the plan to the remote JobManager. Recently, we 
added and removed some strategies. So it might be that the strategy 
enum of client and jobmanager got out of sync.


Cheers, Fabian

2016-02-10 7:33 GMT+01:00 Dominique Rondé <dominique.ro...@codecentric.de>:


Hi,

Your guess is correct. I use Java all the time... Here is the
complete stack trace:

Exception in thread "main"
org.apache.flink.client.program.ProgramInvocationException: The
program execution failed: Job execution failed.
at
org.apache.flink.client.program.Client.runBlocking(Client.java:367)
at
org.apache.flink.client.program.Client.runBlocking(Client.java:345)
at
org.apache.flink.client.program.Client.runBlocking(Client.java:312)
at

org.apache.flink.client.RemoteExecutor.executePlanWithJars(RemoteExecutor.java:212)
at
org.apache.flink.client.RemoteExecutor.executePlan(RemoteExecutor.java:189)
at

org.apache.flink.api.java.RemoteEnvironment.execute(RemoteEnvironment.java:160)
at

org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:803)
at org.apache.flink.api.java.DataSet.collect(DataSet.java:410)
at org.apache.flink.api.java.DataSet.print(DataSet.java:1583)
at
x.y.z.eventstore.processing.pmc.PmcProcessor.main(PmcProcessor.java:103)
Caused by: org.apache.flink.runtime.client.JobExecutionException:
Job execution failed.
at

org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply$mcV$sp(JobManager.scala:563)
at

org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at

org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:509)
at

scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
at

akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at

scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
at

scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at

scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: The data preparation for task
'CHAIN Join(Join at main(PmcProcessor.java:103)) -> FlatMap
(collect())' , caused an error: Unsupported driver strategy for
join driver: CO_GROUP_RAW
at
org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:465)
at
org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:354)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:584)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Unsupported driver strategy for
join driver: CO_GROUP_RAW
at
org.apache.flink.runtime.operators.JoinDriver.prepare(JoinDriver.java:193)
at
org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:459)
... 3 more


On 09.02.2016 at 21:03, Fabian Hueske wrote:

Hi,
glad you could resolve the POJO issue, but the new error doesn't
look right.
The CO_GROUP_RAW strategy should only be used for programs that
are implemented against the Python DataSet API.
I guess that's not the case since all code snippets were Java so
far.

Can you post the full stacktrace of the exception?

2016-02-09 20:13 GMT+01:00 Dominique Rondé <dominique.ro...@codecentric.de>:

Hi all,

I finally figured out that there is a getter for a boolean
field which may be the source of the trouble. It seems that
getBooleanField (as we use it) is not the best choice. Now
the plan is executed with another error. :(

Caused by: java.lang.Exception: Unsupported driver strategy
for join driver: CO_GROUP_RAW