Re: distinct() Java API

2015-04-17 Thread Maximilian Michels
Hi Flavio, Do you have an exapmple? The DistinctOperator should return a typed output just like all the other operators do. Best, Max On Fri, Apr 17, 2015 at 10:07 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi guys, I'm trying to make (in Java) a project().distinct() but then I

Re: Left outer join

2015-04-17 Thread Till Rohrmann
No its not, but at the moment there is afaik no other way around it. There is an issue for proper outer join support [1] [1] https://issues.apache.org/jira/browse/FLINK-687 On Fri, Apr 17, 2015 at 10:01 AM, Flavio Pompermaier pomperma...@okkam.it wrote: Could resolve the problem but the fact

distinct() Java API

2015-04-17 Thread Flavio Pompermaier
Hi guys, I'm trying to make (in Java) a project().distinct() but then I cannot create the generated dataset with a typed tuple because the distinct operator returns just an untyped Tuple. Is this an error in the APIs or am I doing something wrong? Best, Flavio

Re: Left outer join

2015-04-17 Thread Flavio Pompermaier
Could resolve the problem but the fact to accumulate stuff in a local variable is it safe if datasets are huge..? On Fri, Apr 17, 2015 at 9:54 AM, Till Rohrmann till.rohrm...@gmail.com wrote: If it's fine when you have null string values in the cases where D1.f1!=a1 or D1.f2!=a2 then a

Re: Left outer join

2015-04-17 Thread Flavio Pompermaier
Could you explain a little more in detail this caching mechanism with a simple code snippet...? Thanks, Flavio On Apr 17, 2015 1:12 PM, Fabian Hueske fhue...@gmail.com wrote: If you know that the group cardinality of one input is always 1 (or 0) you can make that input the one to cache in

Hash join exceeded exception

2015-04-17 Thread Flavio Pompermaier
Hi to all, I have this strange exception in my program, do you know what could be the cause of it? java.lang.RuntimeException: Hash join exceeded maximum number of recursions, without reducing partitions enough to be memory resident. Probably cause: Too many duplicate keys. at

Re: Adding log4j.properties file into src project

2015-04-17 Thread Robert Metzger
Hi, the log4j.properties file looks nice. The issue is that the resources folder is not marked as a source/resource folder, that's why Eclipse is not adding the file to the classpath. Have a look here:

EOFException when running Flink job

2015-04-17 Thread Stefan Bunk
Hi Squirrels, I have some trouble with a delta-iteration transitive closure program [1]. When I run the program, I get the following error: java.io.EOFException at org.apache.flink.runtime.operators.hash.InMemoryPartition$WriteView.nextSegment(InMemoryPartition.java:333) at

Re: Orphaned chunks

2015-04-17 Thread Flavio Pompermaier
I just set the -Xmx in the VM parameters to 10g and I have 8 virtual cores (4 physical). On Fri, Apr 17, 2015 at 2:48 PM, Robert Metzger rmetz...@apache.org wrote: How much memory are you giving to each Flink TaskManager? On Fri, Apr 17, 2015 at 2:05 PM, Flavio Pompermaier

Re: Left outer join

2015-04-17 Thread Fabian Hueske
If you know that the group cardinality of one input is always 1 (or 0) you can make that input the one to cache in memory and stream the other input with potentially more group elements. 2015-04-17 4:09 GMT-05:00 Flavio Pompermaier pomperma...@okkam.it: That would be very helpful... Thanks

Re: Collections within POJOs/tuples

2015-04-17 Thread Stephan Ewen
Here are some rough cornerpoints for serialization efficiency in Flink: - Tuples are a bit more efficient than POJOs, because they do not support (and encode) possible subclasses and they do not involve and reflection code at all. - Arrays are more efficient than collections (collections go in

Re: Orphaned chunks

2015-04-17 Thread Flavio Pompermaier
Hi Robertm I forgot to update about this error. The root cause was an OOM cause by Jena RDF serialization that was causing the failing of the entire job. I also https://stackoverflow.com/questions/29660894/jena-thrift-serialization-oom-due-to-gc-overhead I created a thread on StackOverflow, let's

Re: distinct() Java API

2015-04-17 Thread Flavio Pompermaier
That worked like a charm! Thanks a lot Fabian! On Fri, Apr 17, 2015 at 12:37 PM, Fabian Hueske fhue...@gmail.com wrote: The problem is cause by the project() operator. The Java compiler does infer its return type and defaults to Tuple. You can help the compiler like this:

Re: Nested Iterations supported in Flink?

2015-04-17 Thread Stephan Ewen
I think running the program multiple times is a reasonable way to start working on this. I would try and see whether this can be re-written to a non-nested iterations case. Nestes iterations algorithms may have much more overhead to start with. Stephan On Tue, Apr 14, 2015 at 3:53 PM, BenoƮt

Re: EOFException when running Flink job

2015-04-17 Thread Stephan Ewen
Hi! After a quick look over the code, this seems like a bug. One cornercase of the overflow handling code does not check for the running out of memory condition. I would like to wait if Robert Waury has some ideas about that, he is the one most familiar with the code. I would guess, though,

CSV header management

2015-04-17 Thread Flavio Pompermaier
Hi guys, how can I read a csv but keeping the header in some variable without throwing it away? Thanks in advance, Flavio

Re: CSV header management

2015-04-17 Thread Stephan Ewen
I don't think there is any built-in functionality for that. You can probably read the header with some custom code (in the driver program) and use the CSV reader (skipping the header) as a regular data source. On Fri, Apr 17, 2015 at 5:03 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi

Re: CSV header management

2015-04-17 Thread Robert Metzger
Yes, using a map() function that is transforming the input lines into the String array. On Fri, Apr 17, 2015 at 9:03 PM, Flavio Pompermaier pomperma...@okkam.it wrote: My csv has 32 columns,ia there a way to create a DatasetString[32]? On Apr 17, 2015 7:53 PM, Stephan Ewen se...@apache.org