Re: DelimitedInputFormat reads entire buffer when splitLength is 0

2015-07-13 Thread Robert Schmidtke
Hi Stephan,

I figured as much, since 128k is a split size that is not commonly used in
large scale data processing engines. I will go for increasing the split
size to reduce coordination overhead for Flink. It just so happened that my
small toy example brought up the issue. Thanks for clearing this up.

Robert

On Sun, Jul 12, 2015 at 9:21 PM, Stephan Ewen se...@apache.org wrote:

 Hi Robert!

 I did some debugging and added some tests. Turns out, this is actually
 expected behavior. It has to do with the splitting of the records.

 Because the splits are created without knowing the contents, a split
 boundary can either fall in the middle of a record, or (by chance) exactly
 at the boundary of a record.

 To make each split handle this consistently without knowing what the
 others do, the contract is the following:
  - Each split but the first initially skips over the bytes until the
 first delimiter.
  - Each split reads up to the next delimiter beyond the split boundary.

 The case where the remaining split length is 0 is exactly the point where
 the split has to read one more record (or complete the current record), so
 it fetches one more chunk of data.

 The problem in your case is that the split size is so small that the read
 buffer used to complete the current record ends up reading the split twice
 (see the sketch below).
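
 Here is a minimal, self-contained sketch of that contract (an illustration
 only, not Flink's actual DelimitedInputFormat code): two splits over the
 same byte array together produce every record exactly once, even when a
 record starts exactly at a split boundary.

     import java.util.ArrayList;
     import java.util.List;

     public class SplitContractSketch {

         /** Returns the records one split is responsible for. */
         static List<String> readSplit(byte[] content, int splitStart,
                                       int splitLength, char delim) {
             List<String> records = new ArrayList<>();
             int pos = splitStart;
             // Every split but the first skips the (possibly partial) first
             // record; the preceding split is responsible for it.
             if (splitStart > 0) {
                 while (pos < content.length && content[pos] != delim) pos++;
                 pos++; // step past the delimiter
             }
             int end = splitStart + splitLength;
             // Keep starting records while the record *starts* at or before
             // the split end, i.e. always finish the record that crosses
             // (or starts exactly at) the boundary.
             while (pos <= end && pos < content.length) {
                 int recordStart = pos;
                 while (pos < content.length && content[pos] != delim) pos++;
                 records.add(new String(content, recordStart, pos - recordStart));
                 pos++; // skip the delimiter
             }
             return records;
         }

         public static void main(String[] args) {
             byte[] data = "alpha,beta,gamma,delta".getBytes();
             // Two 11-byte splits over a 22-byte file.
             System.out.println(readSplit(data, 0, 11, ','));  // [alpha, beta, gamma]
             System.out.println(readSplit(data, 11, 11, ',')); // [delta]
         }
     }

 The first split finishes "gamma", the record that begins exactly at its
 boundary; the second split skips it and reads only "delta", so no record is
 lost or duplicated.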

 Can you reduce the buffer size to something reasonable? You can also
 increase the split size. I think 128 KB splits will result in high
 coordination overhead for Flink, because splits are distributed with RPC
 messages from the master (one message per split).
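
 For reference, here is a hedged sketch of both knobs (assuming the
 setBufferSize setter on DelimitedInputFormat and setMinSplitSize on
 FileInputFormat; the XtreemFS path is made up):

     import org.apache.flink.api.java.DataSet;
     import org.apache.flink.api.java.ExecutionEnvironment;
     import org.apache.flink.api.java.io.TextInputFormat;
     import org.apache.flink.core.fs.Path;

     public class SplitTuning {
         public static void main(String[] args) throws Exception {
             ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

             TextInputFormat format = new TextInputFormat(new Path("xtreemfs:///data/words.txt"));
             format.setBufferSize(64 * 1024);         // shrink the read buffer below the stripe size
             format.setMinSplitSize(4 * 1024 * 1024); // or: generate fewer, larger splits

             DataSet<String> lines = env.createInput(format);
             lines.print();
         }
     }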

 Greetings,
 Stephan


 On Fri, Jul 10, 2015 at 6:55 PM, Stephan Ewen se...@apache.org wrote:

 Hi Robert!

 This clearly sounds like unintended behavior. Thanks for reporting this.

 Apparently, the 0 line length was supposed to have a double meaning, but
 it goes haywire in this case.

 Let me try to come up with a fix for this...

 Greetings,
 Stephan


 On Fri, Jul 10, 2015 at 6:05 PM, Robert Schmidtke ro.schmid...@gmail.com
  wrote:

 Hey everyone,

 I just noticed that when processing input splits from a
 DelimitedInputFormat (specifically, a text file with words in it), if the
 splitLength is 0, the entire read buffer is filled (see
 https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/io/DelimitedInputFormat.java#L577).
 I'm using XtreemFS as the underlying file system, which stripes files in
 blocks of 128 KB across storage servers. I have 8 physically separate nodes,
 and my input file is 1 MB, such that each node stores 128 KB of data. This is
 reported accurately to Flink (e.g. split sizes and hostnames). Now when the
 splitLength becomes 0 at some point during processing (which it eventually
 will), the entire file is read in again, which kind of defeats the point of
 processing a split of length 0. Is this intended behavior? I've tried
 multiple hot-fixes, but they all ended up with the file not being read in
 its entirety. I would like to understand the rationale behind this
 implementation, and maybe figure out a way around it. Thanks in advance,

 Robert

 --
 My GPG Key ID: 336E2680






-- 
My GPG Key ID: 336E2680


Re: UI for flink

2015-07-13 Thread Chiwan Park
If you are looking for a graphical interface for data analytics like Jupyter, you
should look at Apache Zeppelin [1].
Apache Zeppelin is a web-based notebook that supports Scala, Spark, and Flink.

Regards,
Chiwan Park

[1] https://zeppelin.incubator.apache.org

 On Jul 13, 2015, at 9:23 PM, Till Rohrmann trohrm...@apache.org wrote:
 
 Hi Hermann,
 
 when you start a Flink cluster, the web interface is started as well. It is
 reachable at http://jobManagerURL:8081. The web interface tells you a
 lot about the current state of your cluster and the currently running Flink
 jobs.

 Additionally, you can start the web client via ./start-webclient.sh, which
 you can find in the bin directory. The web client, which is reachable on
 port 8080, allows you to submit Flink jobs to your cluster via a browser.
 
 Cheers,
 Till
 
 
 On Mon, Jul 13, 2015 at 2:07 PM, Hermann Azong hermann.az...@gmail.com 
 wrote:
 Hello Flinkers,
 
 I'm wondering if a UI solution for Flink already exists when starting the cluster.
 
 Sincerely,
 
 Hermann 
 







Re: UI for flink

2015-07-13 Thread Hermann Azong

Hey Till,

thank you for the answer.

Cheers

On 13.07.2015 at 14:23, Till Rohrmann wrote:


Hi Hermann,

when you start a Flink cluster, the web interface is 
started as well. It is reachable at http://jobManagerURL:8081. The web 
interface tells you a lot about the current state of your cluster and 
the currently running Flink jobs.

Additionally, you can start the web client via ./start-webclient.sh, 
which you can find in the bin directory. The web client, which is 
reachable on port 8080, allows you to submit Flink jobs to your 
cluster via a browser.


Cheers,
Till

On Mon, Jul 13, 2015 at 2:07 PM, Hermann Azong 
hermann.az...@gmail.com wrote:


Hello Flinkers,

I'm wondering if a UI solution for Flink already exists when starting the cluster.

Sincerely,

Hermann






UI for flink

2015-07-13 Thread Hermann Azong
Hello Flinkers,

I'm wondering if a UI solution for Flink already exists when starting the cluster.

Sincerely,

Hermann


Re: UI for flink

2015-07-13 Thread Till Rohrmann
Hi Hermann,

when you start a Flink cluster, the web interface is started as well. It
is reachable at http://jobManagerURL:8081. The web interface tells you
a lot about the current state of your cluster and the currently running
Flink jobs.

Additionally, you can start the web client via ./start-webclient.sh, which
you can find in the bin directory. The web client, which is reachable on
port 8080, allows you to submit Flink jobs to your cluster via a browser.

Cheers,
Till

On Mon, Jul 13, 2015 at 2:07 PM, Hermann Azong hermann.az...@gmail.com
wrote:

 Hello Flinkers,

 I'm wondering if a UI solution for Flink already exists when starting the cluster.

 Sincerely,

 Hermann



Re: Building Big Data Benchmarking suite for Apache Flink

2015-07-13 Thread Hawin Jiang
Hi Slim,

I will follow this and keep you posted.
Thanks.



Best regards
Hawin

On Mon, Jul 13, 2015 at 7:04 PM, Slim Baltagi sbalt...@gmail.com wrote:

 Hi

 BigDataBench is an open source Big Data benchmarking suite from both
 industry and academia. As a subset of BigDataBench, BigDataBench-DCA is
 China’s first industry-standard big data benchmark suite:
 http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/
 It comes with real-world data sets and many workloads (TeraSort, WordCount,
 PageRank, K-means, NaiveBayes, Aggregation, and Read/Write/Scan), as well as
 a tool that uses Hadoop, HBase, and Mahout.
 This might be an inspiration for building a Big Data benchmarking suite for Flink!

 I would like to share with you the news that Professor Jianfeng Zhan from
 the Institute of Computing Technology, Chinese Academy of Sciences is
 planning to support Flink in the BigDataBench project! Reference:
 https://www.linkedin.com/grp/home?gid=6777483

 Thanks

 Slim Baltagi







DataSet Conversion

2015-07-13 Thread Lydia Ickler
Hi guys,

is it possible to convert a Java DataSet to a Scala DataSet?
Right now I get the following error:
Error:(102, 29) java: incompatible types: 
org.apache.flink.api.java.DataSet<java.lang.String> cannot be converted to 
org.apache.flink.api.scala.DataSet<java.lang.String>

Thanks in advance,
Lydia

Re: HBase Machine Learning

2015-07-13 Thread Maximilian Michels
Hi Lydia,

Here are some examples of how to read/write data from/to HBase:
https://github.com/apache/flink/tree/master/flink-staging/flink-hbase/src/test/java/org/apache/flink/addons/hbase/example

Hope that helps you develop your Flink job. If not, feel free to ask!

Best,
Max
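
For illustration, a minimal sketch along the lines of those examples; this
assumes the TableInputFormat from the flink-hbase addon with its
getTableName / getScanner / mapResultToTuple hooks, and the table name
"training" and column family/qualifier "f:features" are made up:

    import org.apache.flink.addons.hbase.TableInputFormat;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseToDataSet {

        // Turns every row of the (hypothetical) "training" table into a
        // (rowKey, value) tuple.
        public static class TrainingInput extends TableInputFormat<Tuple2<String, String>> {

            @Override
            protected String getTableName() {
                return "training";
            }

            @Override
            protected Scan getScanner() {
                return new Scan().addColumn(Bytes.toBytes("f"), Bytes.toBytes("features"));
            }

            @Override
            protected Tuple2<String, String> mapResultToTuple(Result r) {
                String key = Bytes.toString(r.getRow());
                String value = Bytes.toString(
                        r.getValue(Bytes.toBytes("f"), Bytes.toBytes("features")));
                return new Tuple2<>(key, value);
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // The resulting DataSet can be mapped to FlinkML's input types
            // directly, without the temporary-file detour.
            DataSet<Tuple2<String, String>> rows = env.createInput(new TrainingInput());
            rows.print();
        }
    }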

On Sat, Jul 11, 2015 at 8:19 PM, Lydia Ickler ickle...@googlemail.com
wrote:

 Dear Sir or Madam,

 I would like to use the Flink-HBase addon to read out data that then
 serves as input for the machine learning algorithms, namely the
 SVM and MLR. Right now I first write the extracted data to a temporary file
 and then read it in via the libSVM method... but I guess there should be a
 more sophisticated way.

 Do you have a code snippet or an idea how to do so?

 Many thanks in advance and best regards,
 Lydia


Re: bigpetstore flink : parallelizing collections

2015-07-13 Thread jay vyas
OK. My thoughts on this are that it should be synergistic with
Flink's needs, rather than an orthogonal task that you help us with, so
please keep us updated on what your needs are so that the work stays synergistic:
https://issues.apache.org/jira/browse/BIGTOP-1927

On Mon, Jul 13, 2015 at 9:07 AM, Maximilian Michels m...@apache.org wrote:

 Hi Jay,

 Great to hear there is effort to integrate Flink with BigTop. Please let
 us know if any questions come up in the course of the integration!

 Best,
 Max


 On Sun, Jul 12, 2015 at 3:57 PM, jay vyas jayunit100.apa...@gmail.com
 wrote:

 Awesome, thanks! I'll try it out.

 This is part of a wave of JIRAs for Bigtop Flink integration. If your
 distro/packaging folks collaborate with us, it will save you time in the
 long run, because you can piggyback on the Bigtop infra for rpm/deb
 packaging, smoke testing, and HDFS interop testing.

 https://issues.apache.org/jira/browse/BIGTOP-1927

 Just FYI, great to connect, Stephan and others; will keep you posted!

 On Sun, Jul 12, 2015 at 9:16 AM, Stephan Ewen se...@apache.org wrote:

 Hi Jay!

 You can use the fromCollection() or fromElements() method to create
 a DataSet or DataStream from a Java/Scala collection. That moves the data
 into the cluster and allows you to run parallel transformations on the
 elements.

 Make sure you set the parallelism of the operation that you want to be
 parallel.


 Here is a code sample:

 ExecutionEnvironment env =
 ExecutionEnvironment.getExecutionEnvironment();

 DataSet<MyType> data = env.fromElements(myArray);

 data.map(new TransactionMapper()).setParallelism(80); // makes sure you
 have 80 mappers
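
 A slightly fuller, hedged sketch of the same approach applied to the
 customer/transaction case; the customer list and the generator body are
 made-up stand-ins, not BigPetStore code:

     import java.util.Arrays;
     import java.util.List;

     import org.apache.flink.api.common.functions.FlatMapFunction;
     import org.apache.flink.api.java.DataSet;
     import org.apache.flink.api.java.ExecutionEnvironment;
     import org.apache.flink.util.Collector;

     public class TransactionGeneration {
         public static void main(String[] args) throws Exception {
             ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

             // Customers are generated beforehand and shipped into the cluster.
             List<String> customers = Arrays.asList("alice", "bob", "carol");

             // One generator run per customer, executed by 8 parallel mappers.
             DataSet<String> transactions = env.fromCollection(customers)
                     .flatMap(new FlatMapFunction<String, String>() {
                         @Override
                         public void flatMap(String customer, Collector<String> out) {
                             for (int i = 0; i < 3; i++) {
                                 out.collect(customer + "-tx-" + i);
                             }
                         }
                     })
                     .setParallelism(8);

             transactions.print();
         }
     }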


 Stephan


 On Sun, Jul 12, 2015 at 3:04 PM, jay vyas jayunit100.apa...@gmail.com
 wrote:

 Hi Flink,

 I'm happy to announce that I've done a small bit of initial hacking on
 bigpetstore-flink, in order to reproduce what we do in Spark in Flink.

 TL;DR the main question is at the bottom!

 Currently, I want to generate transactions for a list of customers.
 The generation of transactions is a parallel process, and the customers are
 generated beforehand.

 In Hadoop, we can create an input format with custom splits if we want
 to split a data set up; otherwise, we can break it into files.

 In Spark, there is a convenient parallelize that we can run on a
 list; we can then capture the RDD from it and run a parallelized
 transform.

 In Flink, I have an array of customers and I want to parallelize our
 transaction generator for each customer. How would I do that?

 --
 jay vyas





 --
 jay vyas





-- 
jay vyas


Re: DataSet Conversion

2015-07-13 Thread Aljoscha Krettek
Hi Lydia,
it might work using new DataSet(javaSet) where DataSet
is org.apache.flink.api.scala.DataSet. I'm not sure, however. What is your
use case for this?

Cheers,
Aljoscha
On Mon, 13 Jul 2015 at 15:55 Lydia Ickler ickle...@googlemail.com wrote:

 Hi guys,

 is it possible to convert a Java DataSet to a Scala DataSet?
 Right now I get the following error:
 Error:(102, 29) java: incompatible types:
 org.apache.flink.api.java.DataSet<java.lang.String> cannot be converted to
 org.apache.flink.api.scala.DataSet<java.lang.String>

 Thanks in advance,
 Lydia



Re: local web-client error

2015-07-13 Thread Maximilian Michels
Hi Michele,

Sorry to hear you are experiencing problems with the web client. Which
version of Flink are you using?

Could you paste the whole error message you see? Thank you.

Best regards,
Max

On Sun, Jul 12, 2015 at 11:21 AM, Michele Bertoni 
michele1.bert...@mail.polimi.it wrote:

 I think there is a problem with the web client.

 Quite often I can use it for a single run and then it crashes,
 especially if, after seeing the graph, I click back; on the second run I get
 a class-not-found exception.

 From the terminal I have to stop and restart it, and then it works again.


 Michele


Re: DataSet Conversion

2015-07-13 Thread Stephan Ewen
Hi!

I think you can simply do
new org.apache.flink.api.scala.DataSet[T](javaSet)

Greetings,
Stephan

On Mon, Jul 13, 2015 at 3:55 PM, Lydia Ickler ickle...@googlemail.com
wrote:

 Hi guys,

 is it possible to convert a Java DataSet to a Scala DataSet?
 Right now I get the following error:
 Error:(102, 29) java: incompatible types:
 org.apache.flink.api.java.DataSet<java.lang.String> cannot be converted to
 org.apache.flink.api.scala.DataSet<java.lang.String>

 Thanks in advance,
 Lydia



Building Big Data Benchmarking suite for Apache Flink

2015-07-13 Thread Slim Baltagi
Hi

BigDataBench is an open source Big Data benchmarking suite from both
industry and academia. As a subset of BigDataBench, BigDataBench-DCA is
China’s first industry-standard big data benchmark suite:
http://prof.ict.ac.cn/BigDataBench/industry-standard-benchmarks/
It comes with real-world data sets and many workloads (TeraSort, WordCount,
PageRank, K-means, NaiveBayes, Aggregation, and Read/Write/Scan), as well as
a tool that uses Hadoop, HBase, and Mahout.
This might be an inspiration for building a Big Data benchmarking suite for Flink!

I would like to share with you the news that Professor Jianfeng Zhan from
the Institute of Computing Technology, Chinese Academy of Sciences is
planning to support Flink in the BigDataBench project! Reference:
https://www.linkedin.com/grp/home?gid=6777483

Thanks

Slim Baltagi






Re: bigpetstore flink : parallelizing collections

2015-07-13 Thread Maximilian Michels
Absolutely. I see it as a synergistic process too. I just learned about
BigTop. As for the packaging, I think Flink doesn't have very different
demands compared to the other frameworks already integrated. As for the
rest, I'm not familiar enough with BigTop. Currently, Henry is the only
Flink committer looking into it but I'm sure we will find other Flink
contributors to help out as well. I can even see myself getting into it.

Thanks for your effort so far and I'm sure we'll have a good collaboration
between the projects.

Cheers,
Max

On Mon, Jul 13, 2015 at 4:56 PM, jay vyas jayunit100.apa...@gmail.com
wrote:

 OK. My thoughts on this are that it should be synergistic with
 Flink's needs, rather than an orthogonal task that you help us with, so
 please keep us updated on what your needs are so that the work stays synergistic:
 https://issues.apache.org/jira/browse/BIGTOP-1927

 On Mon, Jul 13, 2015 at 9:07 AM, Maximilian Michels m...@apache.org
 wrote:

 Hi Jay,

 Great to hear there is effort to integrate Flink with BigTop. Please let
 us know if any questions come up in the course of the integration!

 Best,
 Max


 On Sun, Jul 12, 2015 at 3:57 PM, jay vyas jayunit100.apa...@gmail.com
 wrote:

 Awesome, thanks! I'll try it out.

 This is part of a wave of JIRAs for Bigtop Flink integration. If your
 distro/packaging folks collaborate with us, it will save you time in the
 long run, because you can piggyback on the Bigtop infra for rpm/deb
 packaging, smoke testing, and HDFS interop testing.

 https://issues.apache.org/jira/browse/BIGTOP-1927

 Just FYI, great to connect, Stephan and others; will keep you posted!

 On Sun, Jul 12, 2015 at 9:16 AM, Stephan Ewen se...@apache.org wrote:

 Hi Jay!

 You can use the fromCollection() or fromElements() method to create
 a DataSet or DataStream from a Java/Scala collection. That moves the data
 into the cluster and allows you to run parallel transformations on the
 elements.

 Make sure you set the parallelism of the operation that you want to be
 parallel.


 Here is a code sample:

 ExecutionEnvironment env =
 ExecutionEnvironment.getExecutionEnvironment();

 DataSet<MyType> data = env.fromElements(myArray);

 data.map(new TransactionMapper()).setParallelism(80); // makes sure you
 have 80 mappers


 Stephan


 On Sun, Jul 12, 2015 at 3:04 PM, jay vyas jayunit100.apa...@gmail.com
 wrote:

 Hi Flink,

 I'm happy to announce that I've done a small bit of initial hacking on
 bigpetstore-flink, in order to reproduce what we do in Spark in Flink.

 TL;DR the main question is at the bottom!

 Currently, I want to generate transactions for a list of customers.
 The generation of transactions is a parallel process, and the customers
 are generated beforehand.

 In Hadoop, we can create an input format with custom splits if we
 want to split a data set up; otherwise, we can break it into files.

 In Spark, there is a convenient parallelize that we can run on a
 list; we can then capture the RDD from it and run a parallelized
 transform.

 In Flink, I have an array of customers and I want to parallelize our
 transaction generator for each customer. How would I do that?

 --
 jay vyas





 --
 jay vyas





 --
 jay vyas



Sort Benchmark infrastructure

2015-07-13 Thread Hawin Jiang
Hi Michael and George,

First of all, congratulations on winning the sort competition again. We are
from the Flink community.

I am wondering whether it would be possible to use your test environment to
benchmark Flink for free. We saw that Apache Spark did a good job as well.

We want to challenge your records, but we don't have that many servers for
testing.

Please let me know whether you can help us.

Thank you very much.

Best regards

Hawin