[jira] [Updated] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-21 Thread Mike Dusenberry (JIRA)

 [ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Dusenberry updated SYSTEMML-909:
-------------------------------------
Assignee: Matthias Boehm

> `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
> 
>
> Key: SYSTEMML-909
> URL: https://issues.apache.org/jira/browse/SYSTEMML-909
> Project: SystemML
> Issue Type: Improvement
> Reporter: Mike Dusenberry
> Assignee: Matthias Boehm
> Fix For: SystemML 0.11
>
>
> The {{[determineDataFrameDimensionsIfNeeded(...) | 
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}}
> function in {{MLContext}} is a major bottleneck, particularly due to the
> `javaRDD` call.
>
> The issue I'm seeing is that the javaRDD.count() call forces execution of
> the lazy DataFrames I pass in. They are created from another DataFrame via
> df.randomSplit([0.8, 0.2]), so a shuffle occurs. I know this is going to
> happen anyway in the internal conversion, but doing it in this step as well
> wastes a lot of time. Assume that I have more data than I can efficiently
> cache (~7TB, with the potential for much more), so I need to incur the
> shuffle step only once on the way into the engine.
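
To make the cost described above concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark or SystemML code: `LazyDataset`, `convert`, and `expensive_rows` are hypothetical names, and the `num_rows` parameter is an assumption showing how supplying the dimensions up front could avoid the extra pass. It only illustrates that counting an uncached lazy dataset before converting it runs the upstream work twice.

```python
class LazyDataset:
    """Toy stand-in for a lazy, uncached DataFrame: every action recomputes the rows."""

    def __init__(self, compute_rows):
        self._compute_rows = compute_rows  # zero-argument callable producing the rows
        self.evaluations = 0               # how many times the upstream work has run

    def _materialize(self):
        self.evaluations += 1
        return list(self._compute_rows())

    def count(self):
        # Action: forces a full evaluation just to learn the number of rows.
        return len(self._materialize())

    def collect(self):
        # Action: forces a full evaluation to obtain the rows themselves.
        return self._materialize()


def convert(dataset, num_rows=None):
    """Hypothetical conversion step; it counts rows only when num_rows is not supplied."""
    if num_rows is None:
        num_rows = dataset.count()  # extra pass over the data
    rows = dataset.collect()        # the pass the conversion needs anyway
    return num_rows, rows


def expensive_rows():
    # Stands in for the upstream work (for example, the split and shuffle).
    return ([i, i * 2] for i in range(5))


unknown_dims = LazyDataset(expensive_rows)
convert(unknown_dims)
print(unknown_dims.evaluations)  # 2: one pass for count(), one for the conversion

known_dims = LazyDataset(expensive_rows)
convert(known_dims, num_rows=5)
print(known_dims.evaluations)    # 1: the count was skipped
```

In this toy model the second call does half the upstream work, which is the saving the report asks for at the ~7TB scale.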



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-13 Thread Mike Dusenberry (JIRA)

 [ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mike Dusenberry updated SYSTEMML-909:
-------------------------------------
Description: 
The {{[determineDataFrameDimensionsIfNeeded(...) | 
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}}
function in {{MLContext}} is a major bottleneck, particularly due to the
`javaRDD` call.

The issue I'm seeing is that the javaRDD.count() call forces execution of
the lazy DataFrames I pass in. They are created from another DataFrame via
df.randomSplit([0.8, 0.2]), so a shuffle occurs. I know this is going to
happen anyway in the internal conversion, but doing it in this step as well
wastes a lot of time. Assume that I have more data than I can efficiently
cache (~7TB, with the potential for much more), so I need to incur the
shuffle step only once on the way into the engine.

  was:The {{[determineDataFrameDimensionsIfNeeded(...) | 
https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}}
 function in {{MLContext}} is a major bottleneck, particularly due to the 
`javaRDD` call.


> `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
> 
>
> Key: SYSTEMML-909
> URL: https://issues.apache.org/jira/browse/SYSTEMML-909
> Project: SystemML
> Issue Type: Improvement
> Reporter: Mike Dusenberry
>
> The {{[determineDataFrameDimensionsIfNeeded(...) | 
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}}
> function in {{MLContext}} is a major bottleneck, particularly due to the
> `javaRDD` call.
>
> The issue I'm seeing is that the javaRDD.count() call forces execution of
> the lazy DataFrames I pass in. They are created from another DataFrame via
> df.randomSplit([0.8, 0.2]), so a shuffle occurs. I know this is going to
> happen anyway in the internal conversion, but doing it in this step as well
> wastes a lot of time. Assume that I have more data than I can efficiently
> cache (~7TB, with the potential for much more), so I need to incur the
> shuffle step only once on the way into the engine.


