Re: How does the Flink optimizer work, and what process does it follow?

2015-05-26 Thread hagersaleh
Is this optimizer automatic, or do I have to determine (configure) it myself?





Re: How does the Flink optimizer work, and what process does it follow?

2015-05-26 Thread hagersaleh
Thanks very much.
What does the optimizer in Flink actually do?





Re: org.apache.flink.addons.hbase.TableInputFormat

2015-05-26 Thread Hilmi Yildirim

Thank you very much

On 26.05.2015 at 11:26, Robert Metzger wrote:

Hi,

you need to include dependencies that are not part of the main Flink jar in
your usercode jar file.

Therefore, you have to build a fat jar.
The easiest way to do that is like this: 
http://stackoverflow.com/questions/574594/how-can-i-create-an-executable-jar-with-dependencies-using-maven
But that approach will lead to a very big file, because it includes 
all the dependencies.


The best approach is doing it like this: 
https://github.com/apache/flink/blob/master/flink-quickstart/flink-quickstart-java/src/main/resources/archetype-resources/pom.xml
(Setting the regular flink dependencies to provided and using the 
maven-shade-plugin).


Best,
Robert

On Tue, May 26, 2015 at 11:15 AM, Hilmi Yildirim
hilmi.yildi...@neofonie.de wrote:


Hi,
I want to deploy my job on a cluster. Unfortunately, the job does
not run because I used
org.apache.flink.addons.hbase.TableInputFormat, which is not
included in the lib folder of the Flink distribution. Therefore, I wanted
to build my own Flink version with mvn package -DskipTests. Again,
the lib folder does not contain
org.apache.flink.addons.hbase.TableInputFormat; in fact, it does
not contain the addons at all. I looked into the pom files, but I
could not find where the addons are excluded.

Does anyone know how I can run my job on a Flink cluster that
supports org.apache.flink.addons.hbase.TableInputFormat?

Best Regards,
Hilmi





--
--
Hilmi Yildirim
Software Developer R&D

T: +49 30 24627-281
hilmi.yildi...@neofonie.de

http://www.neofonie.de

Visit the Neo Tech Blog for users:
http://blog.neofonie.de/

Follow us:
https://plus.google.com/+neofonie
http://www.linkedin.com/company/neofonie-gmbh
https://www.xing.com/companies/neofoniegmbh

Neofonie GmbH | Robert-Koch-Platz 4 | 10115 Berlin
Commercial register (Handelsregister) Berlin-Charlottenburg: HRB 67460
Managing Director: Thomas Kitlitschko



Re: k means - waiting for dataset

2015-05-26 Thread Pa Rö
thanks for your message,

could you maybe give me an example of the GroupReduceFunction?
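For illustration, here is a minimal sketch of such a GroupReduceFunction (option 1 in the reply quoted below). It is simplified to Tuple2<Integer, double[]> points keyed by cluster id instead of the GeoTimeDataTupel type used elsewhere in this thread, so the class and field layout are assumptions, not the poster's actual code. It sums all points of one cluster and divides only once at the end:

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Receives all points of one cluster at once (unlike ReduceFunction, which only
// ever sees two values), so the division happens exactly once per cluster.
public class CentroidGroupReducer
        implements GroupReduceFunction<Tuple2<Integer, double[]>, Tuple2<Integer, double[]>> {

    @Override
    public void reduce(Iterable<Tuple2<Integer, double[]>> points,
                       Collector<Tuple2<Integer, double[]>> out) {
        Integer clusterId = null;
        double[] sum = null;
        long count = 0;

        for (Tuple2<Integer, double[]> p : points) {
            clusterId = p.f0;
            if (sum == null) {
                sum = new double[p.f1.length];
            }
            for (int i = 0; i < p.f1.length; i++) {
                sum[i] += p.f1[i];
            }
            count++;
        }

        // single division, like (1+2+3+4)/4 in the example quoted below
        for (int i = 0; i < sum.length; i++) {
            sum[i] /= count;
        }
        out.collect(new Tuple2<Integer, double[]>(clusterId, sum));
    }
}

It would be wired in with points.groupBy(0).reduceGroup(new CentroidGroupReducer()) instead of groupBy(0).reduce(...).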

2015-05-22 23:29 GMT+02:00 Fabian Hueske fhue...@gmail.com:

 There are two ways to do that:

 1) You use a GroupReduceFunction, which gives you an iterator over all
 points similar to Hadoop's ReduceFunction.
 2) You use the ReduceFunction to compute the sum and the count at the same
 time (e.g., in two fields of a Tuple2) and use a MapFunction to do the
 final division.

 I'd go with the first choice. It's easier.

 Best, Fabian

 2015-05-22 23:09 GMT+02:00 Paul Röwer paul.roewer1...@googlemail.com:

good evening,

sorry, my English is not the best.

To compute the new centroid, I want to sum all points of the cluster and
form the new center from that.
In my other implementation I first sum all points and only divide by the
number of points at the end,
for example: (1+2+3+4)/4 = 2.5

In Flink I always reduce two points into one,
so for the example above: (1+2)/2 = 1.5 -> (1.5+3)/2 = 2.25 ->
(2.25+4)/2 = 3.125

How can I rewrite my function so that it works like my other
implementation?

best regards,
paul



On 22.05.2015 at 16:52, Stephan Ewen wrote:

Sorry, I don't understand the question.

Can you describe a bit better what you mean by "how I can sum all
points and divide by the counter"?

Thanks!

 On Fri, May 22, 2015 at 2:06 PM, Pa Rö paul.roewer1...@googlemail.com
 wrote:

I have fixed a bug in the input reading, but the results are still
different.

I think I have located the problem: in the other implementation I sum all
geo points/time points and divide by the counter,
but in Flink I sum two points, divide by two, then sum the next...

The method is the following:

// sums and counts point coordinates
private static final class CentroidAccumulator implements
        ReduceFunction<Tuple2<Integer, GeoTimeDataTupel>> {

    private static final long serialVersionUID = -4868797820391121771L;

    public Tuple2<Integer, GeoTimeDataTupel> reduce(Tuple2<Integer, GeoTimeDataTupel> val1,
            Tuple2<Integer, GeoTimeDataTupel> val2) {
        return new Tuple2<Integer, GeoTimeDataTupel>(val1.f0,
                addAndDiv(val1.f0, val1.f1, val2.f1));
    }
}

private static GeoTimeDataTupel addAndDiv(int clusterid,
        GeoTimeDataTupel input1, GeoTimeDataTupel input2) {
    long time = (input1.getTime() + input2.getTime()) / 2;
    List<LatLongSeriable> list = new ArrayList<LatLongSeriable>();
    list.add(input1.getGeo());
    list.add(input2.getGeo());
    LatLongSeriable geo = Geometry.getGeoCenterOf(list);

    return new GeoTimeDataTupel(geo, time, POINT);
}

How can I sum all points and divide by the counter?
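One way to get the same result while keeping a plain ReduceFunction (what the earlier reply in this thread calls option 2) is to carry a running sum and a count through the reduce and divide only once in a final MapFunction. A simplified sketch, again with plain double[] coordinates instead of GeoTimeDataTupel, so the names are illustrative only:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple3;

// Tuple3 = (clusterId, coordinate sum, number of points). Each input point is
// first mapped to (clusterId, coords, 1L).
public class SumThenDivide {

    public static final class SumAndCount
            implements ReduceFunction<Tuple3<Integer, double[], Long>> {
        @Override
        public Tuple3<Integer, double[], Long> reduce(Tuple3<Integer, double[], Long> a,
                                                      Tuple3<Integer, double[], Long> b) {
            double[] sum = new double[a.f1.length];
            for (int i = 0; i < sum.length; i++) {
                sum[i] = a.f1[i] + b.f1[i];   // only add, never divide here
            }
            return new Tuple3<Integer, double[], Long>(a.f0, sum, a.f2 + b.f2);
        }
    }

    public static final class DivideBySize
            implements MapFunction<Tuple3<Integer, double[], Long>, Tuple3<Integer, double[], Long>> {
        @Override
        public Tuple3<Integer, double[], Long> map(Tuple3<Integer, double[], Long> v) {
            for (int i = 0; i < v.f1.length; i++) {
                v.f1[i] /= v.f2;              // one division at the very end
            }
            return v;
        }
    }
}

The pipeline then becomes something like points.map(toIdCoordsOne).groupBy(0).reduce(new SumAndCount()).map(new DivideBySize()), where toIdCoordsOne is a hypothetical mapper that emits (clusterId, coords, 1L) per point.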


 2015-05-22 9:53 GMT+02:00 Pa Rö paul.roewer1...@googlemail.com:

hi,
if I print the centroids, all of them show up in the output. I have implemented
k-means with MapReduce and Spark; with the same input, both give the same output.
But in Flink I get a single-cluster output for this input set. (I use CSV
files from the GDELT project.)

Here is my class:

public class FlinkMain {

    public static void main(String[] args) {
        // load properties
        Properties pro = new Properties();
        try {
            pro.load(new FileInputStream("./resources/config.properties"));
        } catch (Exception e) {
            e.printStackTrace();
        }
        int maxIteration = 1; // Integer.parseInt(pro.getProperty("maxiterations"));
        String outputPath = pro.getProperty("flink.output");
        // set up execution environment
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        // get input points
        DataSet<GeoTimeDataTupel> points = getPointDataSet(env);
        DataSet<GeoTimeDataCenter> centroids = getCentroidDataSet(env);
        // set number of bulk iterations for KMeans algorithm
        IterativeDataSet<GeoTimeDataCenter> loop = centroids.iterate(maxIteration);
        DataSet<GeoTimeDataCenter> newCentroids = points
            // compute closest centroid for each point
            .map(new SelectNearestCenter()).withBroadcastSet(loop, "centroids")
            // count and sum point coordinates for each centroid
            .groupBy(0).reduce(new CentroidAccumulator())
            // compute new centroids from point counts and coordinate sums
            .map(new CentroidAverager());
        // feed new centroids back into next iteration
        DataSet<GeoTimeDataCenter> finalCentroids = loop.closeWith(newCentroids);
        DataSet<Tuple2<Integer, GeoTimeDataTupel>> clusteredPoints = points
            // assign points to final clusters
            .map(new SelectNearestCenter()).withBroadcastSet(finalCentroids, "centroids");
        // emit result
        clusteredPoints.writeAsCsv(outputPath + "/points", "\n", " ");
        finalCentroids.writeAsText(outputPath + "/centers"); // print();
        // execute program
        try {
            env.execute("KMeans Flink");
        } catch (Exception e) {

Re: HBase Connection in cluster

2015-05-26 Thread Flavio Pompermaier
I usually put those connection params inside the hbase-site.xml that will
be included in the generated jar..

On Tue, May 26, 2015 at 2:07 PM, Hilmi Yildirim hilmi.yildi...@neofonie.de
wrote:

 I want to add that it is strange that the client wants to establish a
 connection to localhost but I have defined another machine.




 On 26.05.2015 at 14:05, Hilmi Yildirim wrote:

 Hi,
 I implemented a job which reads data from HBase with the following code (I
 replaced the real address with m1.example.com):

 protected Scan getScanner() {
     Scan scan = new Scan();
     Configuration conf = HBaseConfiguration.create();
     conf.set("hbase.zookeeper.quorum", "m1.example.com");
     conf.set("hbase.zookeeper.property.clientPort", "2181");

     try {
         table = new HTable(conf, "table");
         table.getScanner(scan);
     } catch (IOException e) {
         // TODO Auto-generated catch block
         e.printStackTrace();
     }
     return scan;
 }


 My code works well when I execute it locally, but when I deploy it
 on the cluster I get the following exceptions:

 14:00:30,610 INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
 14:00:30,610 WARN  org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
 java.net.ConnectException: Verbindungsaufbau abgelehnt (connection refused)
     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
 14:00:31,089 INFO  org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
 14:00:31,089 WARN  org.apache.zookeeper.ClientCnxn - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
 java.net.ConnectException: Verbindungsaufbau abgelehnt (connection refused)
     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)


 Does anyone know what I am doing wrong?

 Best Regards,
 Hilmi






Re: HBase Connection in cluster

2015-05-26 Thread Hilmi Yildirim
Where do you put the hbase-site.xml? In the resource folder of the 
project or on the cluster?


On 26.05.2015 at 14:12, Flavio Pompermaier wrote:
I usually put those connection params inside the hbase-site.xml that 
will be included in the generated jar..










Re: Recursive directory reading error

2015-05-26 Thread Flavio Pompermaier
I have 10 files. I debugged the code and it seems that there's a loop in
the FileInputFormat when files are nested far away from the root directory
of the scan.

On Tue, May 26, 2015 at 2:14 PM, Robert Metzger rmetz...@apache.org wrote:

 Hi Flavio,

 how many files are in the directory?
 You can count with find /tmp/myDir | wc -l

 Flink running out of memory while creating input splits indicates to me
 that there are a lot of files in there.

 On Tue, May 26, 2015 at 2:10 PM, Flavio Pompermaier pomperma...@okkam.it
 wrote:

 Hi to all,

 I'm trying to recursively read a directory, but it seems that the
 totalLength value in FileInputFormat.createInputSplits() is not
 computed correctly.

 I have files organized as:

 /tmp/myDir/A/B/cunk-1.txt
 /tmp/myDir/A/B/cunk-2.txt
  ..

 If I try to do the following:

 Configuration parameters = new Configuration();
 parameters.setBoolean("recursive.file.enumeration", true);
 env.readTextFile("file:tmp/myDir").withParameters(parameters).print();

 I get:

 Caused by: org.apache.flink.runtime.JobException: Creating the input splits caused an error: Java heap space
     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:162)
     at org.apache.flink.runtime.executiongraph.ExecutionGraph.attachJobGraph(ExecutionGraph.java:471)
     at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:515)
     ... 19 more
 Caused by: java.lang.OutOfMemoryError: Java heap space
     at java.util.Arrays.copyOf(Arrays.java:2219)
     at java.util.ArrayList.grow(ArrayList.java:242)
     at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:216)
     at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:208)
     at java.util.ArrayList.add(ArrayList.java:440)
     at org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:503)
     at org.apache.flink.api.common.io.FileInputFormat.createInputSplits(FileInputFormat.java:51)
     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.<init>(ExecutionJobVertex.java:146)

 Am I doing something wrong or is it a bug?

 Best,
 Flavio





Re: Recursive directory reading error

2015-05-26 Thread Maximilian Michels
Yes, there is a loop to recursively search for files in a directory, but that
should be ok. The code fails when adding a new InputSplit to an ArrayList.
This is a standard operation.

Oh, I think I found a bug in `addNestedFiles`. It does not pick up the
length of the recursively found files in line 546. That can result in a
returned size of 0 which causes infinite InputSplits to be created and
added to the aforementioned ArrayList. Can you change

addNestedFiles(dir.getPath(), files, length, logExcludedFiles);

to

length += addNestedFiles(dir.getPath(), files, length, logExcludedFiles);

?



On Tue, May 26, 2015 at 2:21 PM, Flavio Pompermaier pomperma...@okkam.it
wrote:

 I have 10 files. I debugged the code and it seems that there's a loop in
 the FileInputFormat when files are nested far away from the root directory
 of the scan.








Re: HBase Connection in cluster

2015-05-26 Thread Flavio Pompermaier
In the src/main/resources folder of the project

On Tue, May 26, 2015 at 2:42 PM, Hilmi Yildirim hilmi.yildi...@neofonie.de
wrote:

  Where do you put the hbase-site.xml? In the resource folder of the
 project or on the cluster?






Dataset output callback

2015-05-26 Thread Flavio Pompermaier
Hi to all,

In my program I'd like to infer from a MySQL table the list of directories I
have to output to HDFS (status=0).
Once my job finishes, I need to clean each directory and update the status value
in my SQL table.
How can I do that in Flink? Is there any callback for when dataset.output()
has finished?

Best,
Flavio
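Since ExecutionEnvironment.execute() blocks until the whole job (including the sinks) has finished, one workaround is to do the bookkeeping right after it returns. A minimal sketch with a hypothetical JDBC update; the table and column names ("output_dirs", "status", "path") are invented for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.api.java.ExecutionEnvironment;

public class RunAndMarkDone {

    // Runs the already-assembled job, then flips the status flag for the
    // directory that was written. Assumes the MySQL JDBC driver is on the classpath.
    public static void runThenUpdateStatus(ExecutionEnvironment env,
                                           String jdbcUrl, String dirPath) throws Exception {
        env.execute("write cleaned directories");   // blocks until all sinks are done

        try (Connection con = DriverManager.getConnection(jdbcUrl);
             PreparedStatement stmt = con.prepareStatement(
                     "UPDATE output_dirs SET status = 1 WHERE path = ?")) {
            stmt.setString(1, dirPath);
            stmt.executeUpdate();
        }
    }
}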


Re: Hash join failing

2015-05-26 Thread Stephan Ewen
If you have this case, giving more memory is fighting a symptom, rather
than a cause.

If you really have that many duplicates in the data set (and you have not
just a bad implementation of hashCode()), then try the following:

1) Reverse hash join sides. Duplicates hurt only on the build-side, not on
the probe side. This works if the other input has much fewer duplicate
keys. You can do this with a JoinHint.

2) Switch to a sort-merge join. This will be slow with very many duplicate
keys, but should not break.

Let me know how it works!
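For reference, a sketch of what both suggestions look like in the Java API, assuming a Flink version where the DataSet.join(DataSet, JoinHint) overload is available; the tuple types and the key field are placeholders for the user's actual data:

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class JoinHintExamples {

    public static void joinVariants(DataSet<Tuple2<Long, String>> manyDuplicateKeys,
                                    DataSet<Tuple2<Long, String>> fewDuplicateKeys) {

        // 1) Reverse the build side: build the hash table on the input with few
        //    duplicate keys and probe it with the duplicate-heavy input.
        DataSet<Tuple2<Tuple2<Long, String>, Tuple2<Long, String>>> hashJoin =
                manyDuplicateKeys
                        .join(fewDuplicateKeys, JoinHint.REPARTITION_HASH_SECOND)
                        .where(0).equalTo(0);

        // 2) Or switch to a sort-merge join, which is slower with many duplicates
        //    but does not hit the recursive-partitioning limit of the hash join.
        DataSet<Tuple2<Tuple2<Long, String>, Tuple2<Long, String>>> sortMergeJoin =
                manyDuplicateKeys
                        .join(fewDuplicateKeys, JoinHint.REPARTITION_SORT_MERGE)
                        .where(0).equalTo(0);
    }
}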

On Tue, May 26, 2015 at 10:22 PM, Sebastian s...@apache.org wrote:

 Hi,

 What can I do to give Flink more memory when running it from my IDE? I'm
 getting the following exception:

 Caused by: java.lang.RuntimeException: Hash join exceeded maximum number
 of recursions, without reducing partitions enough to be memory resident.
 Probably cause: Too many duplicate keys.
 at
 org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:720)
 at
 org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
 at
 org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:541)
 at
 org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
 at
 org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
 at
 org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:494)
 at
 org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:360)
 at
 org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:223)
 at java.lang.Thread.run(Thread.java:745)



Hash join failing

2015-05-26 Thread Sebastian

Hi,

What can I do to give Flink more memory when running it from my IDE? I'm 
getting the following exception:


Caused by: java.lang.RuntimeException: Hash join exceeded maximum number of recursions, without reducing partitions enough to be memory resident. Probably cause: Too many duplicate keys.
    at org.apache.flink.runtime.operators.hash.MutableHashTable.buildTableFromSpilledPartition(MutableHashTable.java:720)
    at org.apache.flink.runtime.operators.hash.MutableHashTable.prepareNextPartition(MutableHashTable.java:508)
    at org.apache.flink.runtime.operators.hash.MutableHashTable.nextRecord(MutableHashTable.java:541)
    at org.apache.flink.runtime.operators.hash.NonReusingBuildFirstHashMatchIterator.callWithNextKey(NonReusingBuildFirstHashMatchIterator.java:104)
    at org.apache.flink.runtime.operators.MatchDriver.run(MatchDriver.java:173)
    at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:494)
    at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:360)
    at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:223)
    at java.lang.Thread.run(Thread.java:745)


Re: Recursive directory reading error

2015-05-26 Thread Flavio Pompermaier
Yeap, that definitely solves the problem! Could you make a PR to fix that?

Thank you in advance,
Flavio

On Tue, May 26, 2015 at 3:20 PM, Maximilian Michels m...@apache.org wrote:

 Yes, there is a loop to recursively search for files in directory but that
 should be ok. The code fails when adding a new InputSplit to an ArrayList.
 This is a standard operation.

 Oh, I think I found a bug in `addNestedFiles`. It does not pick up the
 length of the recursively found files in line 546. That can result in a
 returned size of 0 which causes infinite InputSplits to be created and
 added to the aforementioned ArrayList. Can you change

 addNestedFiles(dir.getPath(), files, length, logExcludedFiles);

 to

 length += addNestedFiles(dir.getPath(), files, length, logExcludedFiles);

 ?










Re: Visibility of FileInputFormat constants

2015-05-26 Thread Fabian Hueske
Definitely! Much better than using the String value.

2015-05-26 16:55 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Hi to all,

 in my program I need to set recursive.file.enumeration to true and I
 discovered that there's a constant for that variable in FileInputFormat
 (ENUMERATE_NESTED_FILES_FLAG) but it's private. Do you think it could be a
 good idea to change its visibility to public?

 Best,
 Flavio
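For illustration, a minimal sketch of what the call could look like once the constant is public, instead of hard-coding the string; the surrounding method and class are invented for the example:

import org.apache.flink.api.common.io.FileInputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;

public class RecursiveRead {

    // Enables recursive enumeration via the (now public) constant rather than
    // the raw "recursive.file.enumeration" string.
    public static DataSet<String> readNested(ExecutionEnvironment env, String inputPath) {
        Configuration parameters = new Configuration();
        parameters.setBoolean(FileInputFormat.ENUMERATE_NESTED_FILES_FLAG, true);
        return env.readTextFile(inputPath).withParameters(parameters);
    }
}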



Re: Visibility of FileInputFormat constants

2015-05-26 Thread Fabian Hueske
Done

2015-05-26 16:59 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Could you do that so I can avoid making a PR just for that, please?

 On Tue, May 26, 2015 at 4:58 PM, Fabian Hueske fhue...@gmail.com wrote:

 Definitely! Much better than using the String value.

 2015-05-26 16:55 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Hi to all,

 in my program I need to set recursive.file.enumeration to true and I
 discovered that there's a constant for that variable in FileInputFormat
 (ENUMERATE_NESTED_FILES_FLAG) but it's private. Do you think it could be a
 good idea to change its visibility to public?

 Best,
 Flavio







count the k-means iteration

2015-05-26 Thread Pa Rö
hi community,

my k-means works fine now. Thanks a lot for your help.

Now I want to test something: what is the best way in Flink to count
the iterations?

best regards,
paul
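One way to see how many iterations actually run (a sketch, assuming the bulk iteration setup from the earlier k-means thread): any rich function used inside the iteration can ask its IterationRuntimeContext for the current superstep number, for example:

import org.apache.flink.api.common.functions.RichMapFunction;

// Identity map that logs the current superstep; place it on a DataSet inside
// the bulk iteration (e.g. on the new centroids) to see the iteration count.
public class IterationLogger<T> extends RichMapFunction<T, T> {

    @Override
    public T map(T value) {
        int superstep = getIterationRuntimeContext().getSuperstepNumber();
        System.out.println("k-means superstep: " + superstep);
        return value;
    }
}

Because of the generic parameter, Flink may need a type hint (e.g. via .returns(...)) or a subclass fixed to the concrete centroid type when this is used.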


Re: Visibility of FileInputFormat constants

2015-05-26 Thread Flavio Pompermaier
Thanks ;)

On Tue, May 26, 2015 at 5:08 PM, Fabian Hueske fhue...@gmail.com wrote:

 Done

 2015-05-26 16:59 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Could you do that so I can avoid making a PR just for that, please?

 On Tue, May 26, 2015 at 4:58 PM, Fabian Hueske fhue...@gmail.com wrote:

 Definitely! Much better than using the String value.

 2015-05-26 16:55 GMT+02:00 Flavio Pompermaier pomperma...@okkam.it:

 Hi to all,

 in my program I need to set recursive.file.enumeration to true and I
 discovered that there's a constant for that variable in FileInputFormat
 (ENUMERATE_NESTED_FILES_FLAG) but it's private. Do you think it could be a
 good idea to change its visibility to public?

 Best,
 Flavio








Re: Recursive directory reading error

2015-05-26 Thread Maximilian Michels
Pushed a fix to the master and will open a PR to programmatically fix this.

On Tue, May 26, 2015 at 4:22 PM, Flavio Pompermaier pomperma...@okkam.it
wrote:

 Yeap, that definitely solves the problem! Could you make a PR to fix that?

 Thank you in advance,
 Flavio
