Re: DelimitedInputFormat reads entire buffer when splitLength is 0
Hi Stephan, I figured as much, since 128KB is a split size that is not commonly used in large-scale data processing engines. I will go for increasing the split size to reduce coordination overhead for Flink. It just so happened that my small toy example brought up the issue. Thanks for clearing this up. Robert

On Sun, Jul 12, 2015 at 9:21 PM, Stephan Ewen se...@apache.org wrote: Hi Robert! I did some debugging and added some tests. It turns out this is actually expected behavior. It has to do with the splitting of the records. Because creating the splits happens without knowing the contents, a split boundary can fall either in the middle of a record or (by chance) exactly at the boundary of a record. To make each split handle this consistently without knowing what the others do, the contract is the following:
- Each split but the first initially skips over the data until the first delimiter.
- Each split reads up to the next delimiter beyond the split boundary.
The case when the split size is 0 is the point when the split has to read one more record (or complete the current record), so it gets one more chunk of data. The problem in your case is that the split size is so small that the read buffer used to complete the current record ends up reading the split's data again. Can you reduce the buffer size to something reasonable? You can also increase the split size. I think 128KB will result in high coordination overhead for Flink, because splits are distributed with RPC messages from the master (1 message per split). Greetings, Stephan

On Fri, Jul 10, 2015 at 6:55 PM, Stephan Ewen se...@apache.org wrote: Hi Robert! This clearly sounds like unintended behavior. Thanks for reporting this. Apparently, the 0 line length was supposed to have a double meaning, but it goes haywire in this case. Let me try to come up with a fix for this... Greetings, Stephan

On Fri, Jul 10, 2015 at 6:05 PM, Robert Schmidtke ro.schmid...@gmail.com wrote: Hey everyone, I just noticed that when processing input splits from a DelimitedInputFormat (specifically, I have a text file with words in it), if the splitLength is 0, the entire read buffer is filled (see https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/io/DelimitedInputFormat.java#L577). I'm using XtreemFS as the underlying file system, which stripes files in blocks of 128KB across storage servers. I have 8 physically separate nodes, and my input file is 1MB, such that each node stores 128KB of data. This is reported accurately to Flink (e.g. split sizes and hostnames). Now when the splitLength becomes 0 at some point during processing (which it eventually will), the entire file is read in again, which kind of defeats the point of processing a split of length 0. Is this intended behavior? I've tried multiple hot-fixes, but they ended up with the file not being read in its entirety. I would like to know the rationale behind this implementation, and maybe figure out a way around it. Thanks in advance, Robert -- My GPG Key ID: 336E2680
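To make the split contract above concrete, here is a small standalone sketch (not Flink's actual DelimitedInputFormat code, just an illustration of the contract Stephan describes, with made-up data). It shows which records a split [start, start + length) emits: every split but the first skips to the first delimiter after its start, and every split finishes the record that crosses its end boundary, so each record ends up in exactly one split.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitContractSketch {

    // Returns the records a split [start, start + length) would emit under the contract.
    static List<String> recordsForSplit(byte[] data, int start, int length, byte delimiter) {
        List<String> records = new ArrayList<>();
        int pos = start;
        // Every split but the first skips to the byte after the first delimiter it sees,
        // because that partial (or complete) record belongs to the previous split.
        if (start > 0) {
            while (pos < data.length && data[pos] != delimiter) {
                pos++;
            }
            pos++; // step over the delimiter itself
        }
        int end = start + length;
        // Keep emitting records as long as the record *starts* before the split end; the
        // last record may extend past the boundary, which is why the next split skips it.
        while (pos < data.length && pos < end) {
            int recordStart = pos;
            while (pos < data.length && data[pos] != delimiter) {
                pos++;
            }
            records.add(new String(data, recordStart, pos - recordStart, StandardCharsets.UTF_8));
            pos++; // step over the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // Two splits of the same data; the boundary at byte 14 falls inside "charlie",
        // so the first split completes that record and the second split skips it.
        System.out.println(recordsForSplit(data, 0, 14, (byte) '\n'));  // [alpha, bravo, charlie]
        System.out.println(recordsForSplit(data, 14, 12, (byte) '\n')); // [delta]
    }
}
```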
Re: UI for flink
If you are looking for a graphical interface for data analytics like Jupyter, you should look at Apache Zeppelin [1]. Apache Zeppelin is a web-based notebook. It supports Scala, Spark and Flink. Regards, Chiwan Park [1] https://zeppelin.incubator.apache.org

On Jul 13, 2015, at 9:23 PM, Till Rohrmann trohrm...@apache.org wrote: Hi Hermann, when you start a Flink cluster, the web interface is started as well. It is reachable under http://jobManagerURL:8081. The web interface tells you a lot about the current state of your cluster and the currently executed Flink jobs. Additionally, you can start the web client via ./start-webclient.sh, which you can find in the bin directory. The web client, which is reachable under port 8080, allows you to submit Flink jobs to your cluster via a browser. Cheers, Till

On Mon, Jul 13, 2015 at 2:07 PM, Hermann Azong hermann.az...@gmail.com wrote: Hello Flinkers, I'm wondering if a UI solution for Flink already exists when starting it. Sincerely, Hermann
Re: UI for flink
Hey Till, thank you for the answer. Cheers

On 13.07.2015 at 14:23, Till Rohrmann wrote: Hi Hermann, when you start a Flink cluster, the web interface is started as well. It is reachable under http://jobManagerURL:8081. The web interface tells you a lot about the current state of your cluster and the currently executed Flink jobs. Additionally, you can start the web client via ./start-webclient.sh, which you can find in the bin directory. The web client, which is reachable under port 8080, allows you to submit Flink jobs to your cluster via a browser. Cheers, Till

On Mon, Jul 13, 2015 at 2:07 PM, Hermann Azong hermann.az...@gmail.com wrote: Hello Flinkers, I'm wondering if a UI solution for Flink already exists when starting it. Sincerely, Hermann
UI for flink
Hello Flinkers, I'm wondering if a UI solution for Flink already exists when starting it. Sincerely, Hermann
Re: UI for flink
Hi Hermann, when you start a Flink cluster, the web interface is started as well. It is reachable under http://jobManagerURL:8081. The web interface tells you a lot about the current state of your cluster and the currently executed Flink jobs. Additionally, you can start the web client via ./start-webclient.sh, which you can find in the bin directory. The web client, which is reachable under port 8080, allows you to submit Flink jobs to your cluster via a browser. Cheers, Till

On Mon, Jul 13, 2015 at 2:07 PM, Hermann Azong hermann.az...@gmail.com wrote: Hello Flinkers, I'm wondering if a UI solution for Flink already exists when starting it. Sincerely, Hermann
Re: Building Big Data Benchmarking suite for Apache Flink
Hi Slim, I will follow this and keep you posted. Thanks. Best regards Hawin

On Mon, Jul 13, 2015 at 7:04 PM, Slim Baltagi sbalt...@gmail.com wrote: Hi, BigDataBench is an open-source Big Data benchmarking suite from both industry and academia. As a subset of BigDataBench, BigDataBench-DCA is China's first industry-standard big data benchmark suite: http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/ It comes with real-world data sets and many workloads: TeraSort, WordCount, PageRank, K-means, NaiveBayes, Aggregation and Read/Write/Scan, and also a tool that uses Hadoop, HBase and Mahout. This might be inspiring for building a Big Data benchmarking suite for Flink! I would like to share with you the news that Professor Jianfeng Zhan from the Institute of Computing Technology, Chinese Academy of Sciences is planning to support Flink in the BigDataBench project! Reference: https://www.linkedin.com/grp/home?gid=6777483 Thanks Slim Baltagi
DataSet Conversion
Hi guys, is it possible to convert a Java DataSet to a Scala DataSet? Right now I get the following error: Error:(102, 29) java: incompatible types: org.apache.flink.api.java.DataSet<java.lang.String> cannot be converted to org.apache.flink.api.scala.DataSet<java.lang.String> Thanks in advance, Lydia
Re: HBase Machine Learning
Hi Lydia, here are some examples of how to read/write data from/to HBase: https://github.com/apache/flink/tree/master/flink-staging/flink-hbase/src/test/java/org/apache/flink/addons/hbase/example Hope that helps you to develop your Flink job. If not, feel free to ask! Best, Max

On Sat, Jul 11, 2015 at 8:19 PM, Lydia Ickler ickle...@googlemail.com wrote: Dear Sir or Madam, I would like to use the Flink-HBase addon to read out data that then serves as an input for the machine learning algorithms, namely SVM and MLR. Right now I first write the extracted data to a temporary file and then read it in via the libSVM method... but I guess there should be a more sophisticated way. Do you have a code snippet or an idea how to do so? Many thanks in advance and best regards, Lydia
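For reference, here is a rough sketch of what reading HBase rows directly into a DataSet could look like, modeled on the example directory Max links above. The table name "data", column family "f" and qualifier "features" are made-up placeholders, and the exact TableInputFormat method names may differ between Flink versions, so treat this as a starting point rather than working code:

```java
import org.apache.flink.addons.hbase.TableInputFormat;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseToDataSet {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Each HBase row becomes a (rowKey, featureString) tuple.
        DataSet<Tuple2<String, String>> rows = env.createInput(
            new TableInputFormat<Tuple2<String, String>>() {

                @Override
                protected Scan getScanner() {
                    Scan scan = new Scan();
                    // Only fetch the column we actually need ("f:features" is hypothetical).
                    scan.addColumn(Bytes.toBytes("f"), Bytes.toBytes("features"));
                    return scan;
                }

                @Override
                protected String getTableName() {
                    return "data"; // hypothetical table name
                }

                @Override
                protected Tuple2<String, String> mapResultToTuple(Result r) {
                    String key = Bytes.toString(r.getRow());
                    String value = Bytes.toString(
                        r.getValue(Bytes.toBytes("f"), Bytes.toBytes("features")));
                    return new Tuple2<>(key, value);
                }
            });

        rows.first(10).print();
    }
}
```

From there the (key, value) tuples can be mapped into the vector types the ML algorithms expect, without the detour through a temporary libSVM file.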
Re: bigpetstore flink : parallelizing collections
OK. Now ** my thoughts ** on this are that it should be synergistic with Flink's needs, rather than an orthogonal task that you guys help us with, so please keep us updated on what your needs are so that the work is synergistic. https://issues.apache.org/jira/browse/BIGTOP-1927

On Mon, Jul 13, 2015 at 9:07 AM, Maximilian Michels m...@apache.org wrote: Hi Jay, great to hear there is effort to integrate Flink with BigTop. Please let us know if any questions come up in the course of the integration! Best, Max

On Sun, Jul 12, 2015 at 3:57 PM, jay vyas jayunit100.apa...@gmail.com wrote: Awesome, thanks! I'll try it out. This is part of a wave of JIRAs for BigTop/Flink integration. If your distro/packaging folks collaborate with us, it will save you time in the long run, because you can piggyback on the BigTop infra for rpm/deb packaging, smoke testing, and HDFS interop testing: https://issues.apache.org/jira/browse/BIGTOP-1927 Just FYI, great to connect, Stephan and others, will keep you posted!

On Sun, Jul 12, 2015 at 9:16 AM, Stephan Ewen se...@apache.org wrote: Hi Jay! You can use the fromCollection() or fromElements() method to create a DataSet or DataStream from a Java/Scala collection. That moves the data into the cluster and allows you to run parallel transformations on the elements. Make sure you set the parallelism of the operation that you want to be parallel. Here is a code sample:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<MyType> data = env.fromElements(myArray);
data.map(new TransactionMapper()).setParallelism(80); // makes sure you have 80 mappers

Stephan

On Sun, Jul 12, 2015 at 3:04 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Flink. I'm happy to announce that I've done a small bit of initial hacking on bigpetstore-flink, in order to represent what we do in Spark in Flink. TL;DR the main question is at the bottom! Currently, I want to generate transactions for a list of customers. The generation of transactions is a parallel process, and the customers are generated beforehand. In Hadoop, we can create an input format with custom splits if we want to split a data set up; otherwise, we can break it into files. In Spark, there is a convenient parallelize which we can run on a list, from which we can then capture the RDD and run a parallelized transform. In Flink, I have an array of customers and I want to parallelize our transaction generator for each customer. How would I do that? -- jay vyas
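Expanding Stephan's snippet into a self-contained sketch of the pattern (the Customer/Transaction classes and the generator loop below are placeholders, not the real BigPetStore types): the pre-generated customers go into a DataSet via fromCollection(), and the per-customer transaction generation runs as a parallel flatMap.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class TransactionGeneratorJob {

    // Placeholder POJOs standing in for the BigPetStore model classes.
    public static class Customer {
        public String name;
        public Customer() {}
        public Customer(String name) { this.name = name; }
    }

    public static class Transaction {
        public String customerName;
        public Transaction() {}
        public Transaction(String customerName) { this.customerName = customerName; }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        List<Customer> customers = Arrays.asList(new Customer("jay"), new Customer("lydia"));

        // Distribute the locally generated customers across the cluster...
        DataSet<Customer> customerSet = env.fromCollection(customers);

        // ...and generate transactions in parallel, one generator call per customer.
        DataSet<Transaction> transactions = customerSet
            .flatMap(new FlatMapFunction<Customer, Transaction>() {
                @Override
                public void flatMap(Customer customer, Collector<Transaction> out) {
                    for (int i = 0; i < 10; i++) { // stand-in for the real generator loop
                        out.collect(new Transaction(customer.name));
                    }
                }
            })
            .setParallelism(80); // as in Stephan's example: 80 parallel mapper instances

        transactions.print();
    }
}
```

Setting the parallelism on the flatMap operator, rather than on the whole job, keeps the fan-out limited to the generation step.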
Re: DataSet Conversion
Hi Lydia, it might work using new DataSet(javaSet), where DataSet is org.apache.flink.api.scala.DataSet. I'm not sure, however. What is your use case for this? Cheers, Aljoscha

On Mon, 13 Jul 2015 at 15:55 Lydia Ickler ickle...@googlemail.com wrote: Hi guys, is it possible to convert a Java DataSet to a Scala DataSet? Right now I get the following error: Error:(102, 29) java: incompatible types: org.apache.flink.api.java.DataSet<java.lang.String> cannot be converted to org.apache.flink.api.scala.DataSet<java.lang.String> Thanks in advance, Lydia
Re: local web-client error
Hi Michele, sorry to hear you are experiencing problems with the web client. Which version of Flink are you using? Could you paste the whole error message you see? Thank you. Best regards, Max

On Sun, Jul 12, 2015 at 11:21 AM, Michele Bertoni michele1.bert...@mail.polimi.it wrote: I think there is a problem with the web client. Quite often I can use it for a single run and then it crashes, especially if, after seeing the graph, I click back; on the second run I get a ClassNotFoundException. From the terminal I have to stop and restart it, and then it works again. Michele
Re: DataSet Conversion
Hi! I think you can simply do new org.apache.flink.api.scala.DataSet[T](javaSet). Greetings, Stephan

On Mon, Jul 13, 2015 at 3:55 PM, Lydia Ickler ickle...@googlemail.com wrote: Hi guys, is it possible to convert a Java DataSet to a Scala DataSet? Right now I get the following error: Error:(102, 29) java: incompatible types: org.apache.flink.api.java.DataSet<java.lang.String> cannot be converted to org.apache.flink.api.scala.DataSet<java.lang.String> Thanks in advance, Lydia
Building Big Data Benchmarking suite for Apache Flink
Hi, BigDataBench is an open-source Big Data benchmarking suite from both industry and academia. As a subset of BigDataBench, BigDataBench-DCA is China's first industry-standard big data benchmark suite: http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/ It comes with real-world data sets and many workloads: TeraSort, WordCount, PageRank, K-means, NaiveBayes, Aggregation and Read/Write/Scan, and also a tool that uses Hadoop, HBase and Mahout. This might be inspiring for building a Big Data benchmarking suite for Flink! I would like to share with you the news that Professor Jianfeng Zhan from the Institute of Computing Technology, Chinese Academy of Sciences is planning to support Flink in the BigDataBench project! Reference: https://www.linkedin.com/grp/home?gid=6777483 Thanks Slim Baltagi
Re: bigpetstore flink : parallelizing collections
Absolutely. I see it as a synergistic process too. I just learned about BigTop. As for the packaging, I think Flink doesn't have very different demands compared to the other frameworks already integrated. As for the rest, I'm not familiar enough with BigTop. Currently, Henry is the only Flink committer looking into it, but I'm sure we will find other Flink contributors to help out as well. I can even see myself getting into it. Thanks for your effort so far and I'm sure we'll have a good collaboration between the projects. Cheers, Max

On Mon, Jul 13, 2015 at 4:56 PM, jay vyas jayunit100.apa...@gmail.com wrote: OK. Now ** my thoughts ** on this are that it should be synergistic with Flink's needs, rather than an orthogonal task that you guys help us with, so please keep us updated on what your needs are so that the work is synergistic. https://issues.apache.org/jira/browse/BIGTOP-1927

On Mon, Jul 13, 2015 at 9:07 AM, Maximilian Michels m...@apache.org wrote: Hi Jay, great to hear there is effort to integrate Flink with BigTop. Please let us know if any questions come up in the course of the integration! Best, Max

On Sun, Jul 12, 2015 at 3:57 PM, jay vyas jayunit100.apa...@gmail.com wrote: Awesome, thanks! I'll try it out. This is part of a wave of JIRAs for BigTop/Flink integration. If your distro/packaging folks collaborate with us, it will save you time in the long run, because you can piggyback on the BigTop infra for rpm/deb packaging, smoke testing, and HDFS interop testing: https://issues.apache.org/jira/browse/BIGTOP-1927 Just FYI, great to connect, Stephan and others, will keep you posted!

On Sun, Jul 12, 2015 at 9:16 AM, Stephan Ewen se...@apache.org wrote: Hi Jay! You can use the fromCollection() or fromElements() method to create a DataSet or DataStream from a Java/Scala collection. That moves the data into the cluster and allows you to run parallel transformations on the elements. Make sure you set the parallelism of the operation that you want to be parallel. Here is a code sample:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<MyType> data = env.fromElements(myArray);
data.map(new TransactionMapper()).setParallelism(80); // makes sure you have 80 mappers

Stephan

On Sun, Jul 12, 2015 at 3:04 PM, jay vyas jayunit100.apa...@gmail.com wrote: Hi Flink. I'm happy to announce that I've done a small bit of initial hacking on bigpetstore-flink, in order to represent what we do in Spark in Flink. TL;DR the main question is at the bottom! Currently, I want to generate transactions for a list of customers. The generation of transactions is a parallel process, and the customers are generated beforehand. In Hadoop, we can create an input format with custom splits if we want to split a data set up; otherwise, we can break it into files. In Spark, there is a convenient parallelize which we can run on a list, from which we can then capture the RDD and run a parallelized transform. In Flink, I have an array of customers and I want to parallelize our transaction generator for each customer. How would I do that? -- jay vyas
Sort Benchmark infrastructure
Hi Michael and George, first of all, congratulations, you guys have won the sort benchmark again. We are coming from the Flink community. I am not sure if it is possible to use your test environment to test Flink for free. We saw that Apache Spark did a good job as well. We want to challenge those records, but we don't have that many servers for testing. Please let me know whether you can help us. Thank you very much. Best regards Hawin