Re: CSV header management

2015-04-17 Thread Robert Metzger
Yes, using a map() function that transforms the input lines into a String array.
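Since the Java API's tuples only go up to Tuple25, a 32-column file can be read as a DataSet<String[]> instead. A minimal sketch of that approach; the path "data.csv" and the comma delimiter are assumptions:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class WideCsvExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Read each line as a plain string and split it into its fields.
            // The file path and the comma delimiter are assumptions for this sketch.
            DataSet<String[]> rows = env.readTextFile("data.csv")
                    .map(new MapFunction<String, String[]>() {
                        @Override
                        public String[] map(String line) {
                            return line.split(",");
                        }
                    });

            rows.writeAsText("rows-out");
            env.execute("read wide csv");
        }
    }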

Re: CSV header management

2015-04-17 Thread Flavio Pompermaier
My CSV has 32 columns, is there a way to create a DataSet?

Re: CSV header management

2015-04-17 Thread Stephan Ewen
I don't think there is any built-in functionality for that. You can probably read the header with some custom code (in the driver program) and use the CSV reader (skipping the header) as a regular data source.
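A minimal sketch of that approach, assuming a locally readable file and a Flink version whose CsvReader offers ignoreFirstLine(); the path and the three column types are illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple3;

    public class CsvWithHeader {
        public static void main(String[] args) throws Exception {
            String path = "data.csv"; // assumed location

            // Read the header in the driver program, before the job is submitted.
            String header;
            try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
                header = reader.readLine();
            }
            System.out.println("Header: " + header);

            // Then read the data as a regular source, skipping the header line.
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Tuple3<String, String, Integer>> data = env
                    .readCsvFile(path)
                    .ignoreFirstLine()
                    .types(String.class, String.class, Integer.class);

            data.writeAsText("out");
            env.execute("csv with header");
        }
    }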

CSV header management

2015-04-17 Thread Flavio Pompermaier
Hi guys, how can I read a CSV while keeping the header in some variable, without throwing it away? Thanks in advance, Flavio

Re: EOFException when running Flink job

2015-04-17 Thread Stephan Ewen
Hi! After a quick look over the code, this seems like a bug. One corner case of the overflow handling code does not check for the "running out of memory" condition. I would like to wait and see if Robert Waury has some ideas about that; he is the one most familiar with the code. I would guess, though, th…

EOFException when running Flink job

2015-04-17 Thread Stefan Bunk
Hi Squirrels, I have some trouble with a delta-iteration transitive closure program [1]. When I run the program, I get the following error:

    java.io.EOFException
      at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333)
      at org.apache.flink.runti…

Re: Orphaned chunks

2015-04-17 Thread Flavio Pompermaier
I just set the -Xmx in the VM parameters to 10g and I have 8 virtual cores (4 physical).

Re: Hash join exceeded exception

2015-04-17 Thread Stephan Ewen
Hi Flavio! The cause is usually what the exception message says: too many duplicate keys. The side that builds the hash table has one key occurring so often that not all records with that key fit into memory together, even after multiple out-of-core recursions. Here is a list of things to check: …
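One knob worth knowing about here is steering which side becomes the hash table's build side, or avoiding the hash strategy altogether. A hedged sketch of both hints, under the assumption that one input is small and skew-free (the explicit JoinHint overload may not exist in every Flink version):

    import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class SkewedJoinHints {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Tuple2<String, Integer>> small = env.fromElements(
                    new Tuple2<>("a", 1), new Tuple2<>("b", 2));
            DataSet<Tuple2<String, Integer>> skewed = env.fromElements(
                    new Tuple2<>("a", 1), new Tuple2<>("a", 2), new Tuple2<>("a", 3));

            // Hint that the right input is huge, so the left (small, skew-free)
            // input becomes the broadcast/build side of the hash join.
            small.joinWithHuge(skewed).where(0).equalTo(0).print();

            // If your version supports explicit join hints, a sort-merge join
            // avoids the hash table entirely and tolerates heavy key skew.
            small.join(skewed, JoinHint.REPARTITION_SORT_MERGE)
                 .where(0).equalTo(0).print();
        }
    }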

Re: Adding log4j.properties file into src project

2015-04-17 Thread Robert Metzger
Hi, the log4j.properties file looks nice. The issue is that the resources folder is not marked as a source/resource folder, which is why Eclipse is not adding the file to the classpath. Have a look here: http://stackoverflow.com/questions/5081316/where-is-the-correct-location-to-put-log4j-propertie…

Re: Orphaned chunks

2015-04-17 Thread Robert Metzger
How much memory are you giving to each Flink TaskManager?

Re: Orphaned chunks

2015-04-17 Thread Flavio Pompermaier
Hi Robert, I forgot to give an update about this error. The root cause was an OOM caused by Jena RDF serialization, which was making the entire job fail. I also created a thread on StackOverflow: https://stackoverflow.com/questions/29660894/jena-thrift-serialization-oom-due-to-gc-overhead. Let's s…

Hash join exceeded exception

2015-04-17 Thread Flavio Pompermaier
Hi to all, I have this strange exception in my program, do you know what could be the cause of it?

    java.lang.RuntimeException: Hash join exceeded maximum number of recursions, without reducing partitions enough to be memory resident. Probably cause: Too many duplicate keys.
      at org.apache.flink.run…

Re: Left outer join

2015-04-17 Thread Fabian Hueske
There is no caching mechanism. To do the left outer join as in Till's implementation, you need to collect all elements of one iterator in memory. If you know that one of the two iterators contains at most one element, you should collect that one in memory and stream the elements of the other iterator.
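A minimal Java sketch of that pattern, assuming the right-hand input holds at most one element per key; the tuple types and sample data are made up for illustration:

    import org.apache.flink.api.common.functions.CoGroupFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.api.java.tuple.Tuple3;
    import org.apache.flink.util.Collector;

    public class LeftOuterJoinSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Tuple2<String, Integer>> left = env.fromElements(
                    new Tuple2<>("a", 1), new Tuple2<>("b", 2));
            DataSet<Tuple2<String, String>> right = env.fromElements(
                    new Tuple2<>("a", "x"));

            DataSet<Tuple3<String, Integer, String>> joined = left
                    .coGroup(right).where(0).equalTo(0)
                    .with(new CoGroupFunction<Tuple2<String, Integer>,
                            Tuple2<String, String>, Tuple3<String, Integer, String>>() {
                        @Override
                        public void coGroup(Iterable<Tuple2<String, Integer>> ls,
                                            Iterable<Tuple2<String, String>> rs,
                                            Collector<Tuple3<String, Integer, String>> out) {
                            // Collect the side known to hold at most one element...
                            String rightValue = null;
                            for (Tuple2<String, String> r : rs) {
                                rightValue = r.f1;
                            }
                            // ...and stream the potentially large side.
                            for (Tuple2<String, Integer> l : ls) {
                                out.collect(new Tuple3<>(l.f0, l.f1, rightValue));
                            }
                        }
                    });

            joined.print();
        }
    }

Left elements without a match come out with a null third field, which is what makes this a left outer join.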

Re: Left outer join

2015-04-17 Thread Flavio Pompermaier
Could you explain this caching mechanism in a little more detail, with a simple code snippet? Thanks, Flavio

Re: Collections within POJOs/tuples

2015-04-17 Thread Stephan Ewen
Here are some rough corner points for serialization efficiency in Flink:
- Tuples are a bit more efficient than POJOs, because they do not support (and encode) possible subclasses and they do not involve any reflection code at all.
- Arrays are more efficient than collections (collections go in…
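To illustrate those guidelines, a small sketch contrasting the two shapes; the record types are invented for the example:

    import java.util.Arrays;
    import java.util.List;

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class SerializationShapes {

        // Heavier shape: a POJO carrying a collection. Each record pays for the
        // subclass check, and the List field goes through generic serialization.
        public static class PojoRecord {
            public int id;
            public List<Long> values;

            public PojoRecord() {}
            public PojoRecord(int id, List<Long> values) {
                this.id = id;
                this.values = values;
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Leaner shape per the advice above: a tuple with a primitive array.
            DataSet<Tuple2<Integer, long[]>> asTuple = env.fromElements(
                    new Tuple2<>(1, new long[]{1L, 2L, 3L}));

            DataSet<PojoRecord> asPojo = env.fromElements(
                    new PojoRecord(1, Arrays.asList(1L, 2L, 3L)));

            asTuple.writeAsText("tuple-out");
            asPojo.writeAsText("pojo-out");
            env.execute("serialization shapes");
        }
    }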

Re: Left outer join

2015-04-17 Thread Fabian Hueske
If you know that the group cardinality of one input is always 1 (or 0), you can make that input the one to cache in memory and stream the other input, which potentially has more group elements.

Re: Nested Iterations supported in Flink?

2015-04-17 Thread Stephan Ewen
I think running the program multiple times is a reasonable way to start working on this. I would try to see whether this can be rewritten as a non-nested iterations case. Nested iteration algorithms may have much more overhead to start with. Stephan
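A minimal sketch of the "run the program multiple times" idea, with the outer loop in the driver and only the inner loop as a native Flink iteration; the per-round logic here is a toy stand-in:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.operators.IterativeDataSet;

    public class DriverSideOuterLoop {
        public static void main(String[] args) throws Exception {
            final int outerRounds = 3;
            final int innerRounds = 10;

            for (int round = 0; round < outerRounds; round++) {
                // Each outer round is a separate job submission.
                ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
                DataSet<Long> input = env.fromElements(0L); // toy input per round

                // Only the inner loop uses Flink's native bulk iteration.
                IterativeDataSet<Long> loop = input.iterate(innerRounds);
                DataSet<Long> stepped = loop.map(new MapFunction<Long, Long>() {
                    @Override
                    public Long map(Long x) {
                        return x + 1;
                    }
                });
                DataSet<Long> result = loop.closeWith(stepped);

                // In a real program, round i would write its result out and
                // round i + 1 would read it back as input.
                result.print();
            }
        }
    }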

Re: distinct() Java API

2015-04-17 Thread Fabian Hueske
Cool :-) We should add this tip to the documentation of the project operator. I'll open a JIRA for that.

Re: Parallelism question

2015-04-17 Thread Fabian Hueske
You could try to work around this using a custom Partitioner [1]:

    myData.partitionCustom(new MyPartitioner(), "myPartitionField")
          .sortPartition("myPartitionField", Order.ASCENDING)
          .writeAsCsv(...);

In that case, you need to implement the Partitioner yourself. To do that "right" you need to know the value dis…
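A hedged sketch of what such a partitioner could look like, assuming integer partition keys and a known upper bound on their values (the bound stands in for your knowledge of the value distribution):

    import org.apache.flink.api.common.functions.Partitioner;

    // Hypothetical range partitioner: assumes keys fall in [0, 1000). Partition i
    // then holds smaller keys than partition i + 1, so sorting each partition
    // yields output that is also ordered across partitions.
    public class MyPartitioner implements Partitioner<Integer> {
        private static final int MAX_KEY = 1000; // assumed upper bound on keys

        @Override
        public int partition(Integer key, int numPartitions) {
            int bucketSize = Math.max(1, MAX_KEY / numPartitions);
            return Math.min(key / bucketSize, numPartitions - 1);
        }
    }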

Re: distinct() Java API

2015-04-17 Thread Flavio Pompermaier
That worked like a charm! Thanks a lot Fabian!

Re: distinct() Java API

2015-04-17 Thread Fabian Hueske
The problem is caused by the project() operator. The Java compiler cannot infer its return type and defaults to Tuple. You can help the compiler like this: DataSet<Tuple1<String>> ds2 = ds.<Tuple1<String>>project(0).distinct(0);

Re: distinct() Java API

2015-04-17 Thread Flavio Pompermaier
I have errors in Eclipse doing something like: DataSet<Tuple1<String>> ds2 = ds.project(0).distinct(0); It says that I have to declare ds2 as a DataSet<Tuple>.

Re: Left outer join

2015-04-17 Thread Flavio Pompermaier
That would be very helpful... Thanks for the support, Flavio

Re: distinct() Java API

2015-04-17 Thread Maximilian Michels
Hi Flavio, do you have an example? The DistinctOperator should return a typed output just like all the other operators do. Best, Max

Re: Left outer join

2015-04-17 Thread Till Rohrmann
No it's not, but at the moment there is AFAIK no other way around it. There is an issue for proper outer join support [1]. [1] https://issues.apache.org/jira/browse/FLINK-687

distinct() Java API

2015-04-17 Thread Flavio Pompermaier
Hi guys, I'm trying to do (in Java) a project().distinct(), but then I cannot create the resulting dataset with a typed tuple because the distinct operator returns just an untyped Tuple. Is this an error in the APIs or am I doing something wrong? Best, Flavio

Re: Left outer join

2015-04-17 Thread Flavio Pompermaier
That could resolve the problem, but is it safe to accumulate stuff in a local variable if the datasets are huge?

Re: Left outer join

2015-04-17 Thread Till Rohrmann
If it's fine to have null string values in the cases where D1.f1!="a1" or D1.f2!="a2", then a possible solution could look like this (with the Scala API):

    val ds1: DataSet[(String, String, String)] = getDS1
    val ds2: DataSet[(String, String, String)] = getDS2
    ds1.coGroup(ds2).where(2).equalTo(0) { (…

Re: Parallelism question

2015-04-17 Thread Giacomo Licari
Hi Fabian, thanks for your reply; my question was exactly about that problem, range partitioning. Since I have to process a large dataset of values and apply a data-mining algorithm on each partition, an important point for me is that the final result is ordered, so as not to lose the meaning of the data.

Re: Left outer join

2015-04-17 Thread Flavio Pompermaier
Hi Till, thanks for the reply. What I'd like to do is to merge D1 and D2 if there's a ref from D1 to D2 (D1.f2==D2.f0). If this condition is true, I would like to produce a set of tuples with the matching elements in the first two places (D1.f2, D2.f0) and the other two values (if present) of th…

Collections within POJOs/tuples

2015-04-17 Thread Kruse, Sebastian
Hello everyone, I was just wondering which class would be most efficient, from a serialization point of view, for storing collections of primitive elements, and which one for storing objects, within POJOs and tuples. And would it make any difference if such a collection is not embedded within a POJO/t…

Re: Left outer join

2015-04-17 Thread Till Rohrmann
Hi Flavio, I don't really understand what you are trying to do. What does D1.f2(D1.f1==p1) mean? What happens if the condition in D1.f2(if D1.f1==p2) is false? Where do the values a1 and a2 in (A, X, a1, a2) come from when you join [(A, p3, X), (X, s, V)] and [(A, p3, X), (X, r, 2)]? Maybe you can…