[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-15 Thread Matthias Boehm (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15494240#comment-15494240
 ] 

Matthias Boehm commented on SYSTEMML-909:
-

Yes, this is understandable. Let me explain why this happens: the __INDEX 
column was always expected to be a 1-based row ID because this is how we create 
it ourselves. However, in earlier versions of the converters, we sorted the data 
frame by ID, dropped the ID, and re-appended it with zipWithIndex, so it did not 
matter whether it was 0-based or 1-based. Since this previous approach was 
horribly slow, we now simply use this ID directly to construct the binary block 
representation. 

I don't have a strong opinion on 1-based vs. 0-based indexing here, and you're 
right that 0-based indexing would certainly be a better fit with the existing 
RDD operations. The only thing we need to ensure is consistency across all 
places where we construct it.
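
To illustrate why the base has to be consistent once the ID is used directly, 
here is a minimal plain-Python sketch (not SystemML code) of how a row ID might 
map to a block position. The helper name `block_coords`, the `one_based` flag, 
the block size of 1000, and the 1-based block index are all assumptions made 
for illustration.

```python
def block_coords(row_id, block_size=1000, one_based=True):
    """Map a row ID to (1-based block row index, 0-based offset in block).

    Hypothetical helper: block_size=1000 and the 1-based block index are
    illustrative assumptions, not values taken from the converter code.
    """
    zero_based = row_id - 1 if one_based else row_id
    if zero_based < 0:
        raise ValueError("row ID is below the expected base")
    return zero_based // block_size + 1, zero_based % block_size


# Same ID, different assumed base: the row lands in a different block.
as_one_based = block_coords(1000, one_based=True)    # (1, 999)
as_zero_based = block_coords(1000, one_based=False)  # (2, 0)
```

If one code path writes 0-based IDs and another reads them as 1-based, every 
row is shifted by one and rows at block boundaries end up in the wrong block.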

> `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.
> 
>
> Key: SYSTEMML-909
> URL: https://issues.apache.org/jira/browse/SYSTEMML-909
> Project: SystemML
> Issue Type: Improvement
> Reporter: Mike Dusenberry
>
> The {{[determineDataFrameDimensionsIfNeeded(...) | 
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/api/mlcontext/MLContextConversionUtil.java#L585]}}
>  function in {{MLContext}} is a major bottleneck, particularly due to the 
> `javaRDD` call.
> The issue I'm seeing is that the javaRDD.count() function causes execution of 
> the lazy DataFrames I pass in, which are created from another DataFrame via 
> df.randomSplit([0.8, 0.2]), thus a shuffle occurs. I know that this is going 
> to happen anyways in the internal conversion, but it wastes a lot of time by 
> having to also do it in this step too. Assume that I have more data than I 
> can efficiently cache (~7TB with the potential for much more), so I need to 
> incur the shuffle step only once on the way into the engine.
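
The double execution described in the issue can be mimicked without Spark. The 
sketch below is plain Python: `LazyDataset`, `evaluations`, and `convert` are 
hypothetical names, and the class only simulates a lazy, uncached pipeline. It 
is not the MLContext implementation.

```python
class LazyDataset:
    """Toy stand-in for a lazy, uncached DataFrame (hypothetical)."""

    def __init__(self, compute):
        self._compute = compute   # zero-argument callable producing the rows
        self.evaluations = 0      # how many times the pipeline actually ran

    def _run(self):
        self.evaluations += 1
        return list(self._compute())

    def count(self):
        # An action: forces the whole upstream pipeline to execute.
        return len(self._run())

    def collect(self):
        return self._run()


def convert(dataset, num_rows=None):
    """Return the rows; only counts when num_rows is not supplied."""
    if num_rows is None:
        num_rows = dataset.count()      # extra pass over the data
    rows = dataset.collect()            # the pass the conversion needs anyway
    assert len(rows) == num_rows
    return rows


def rows_source():
    return ([float(i), float(i) * 2] for i in range(5))


unknown_dims = LazyDataset(rows_source)
convert(unknown_dims)                   # pipeline runs twice

known_dims = LazyDataset(rows_source)
convert(known_dims, num_rows=5)         # pipeline runs once
```

With data that cannot be cached, each extra evaluation repeats the full 
upstream work, which is the cost the issue asks to avoid.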



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-14 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15492355#comment-15492355
 ] 

Mike Dusenberry commented on SYSTEMML-909:
--

Also, I would split the scripts up and use one just to convert from DataFrames 
to SystemML matrices, but the issue in SYSTEMML-869 is still present.

Just to be clear, the API interfaces are great in terms of being able to create 
scripts and pass around inputs/outputs.  We just need to knock out a few of 
these internal performance issues and scratch space bugs and we'll be set. 






[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-13 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15489124#comment-15489124
 ] 

Mike Dusenberry commented on SYSTEMML-909:
--

cc [~acs_s]






[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-12 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485422#comment-15485422
 ] 

Mike Dusenberry commented on SYSTEMML-909:
--

[~deron] Yes, I heard [~mboehm7] was going to add that capability to the engine 
for 1.0. :D






[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-12 Thread Deron Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485415#comment-15485415
 ] 

Deron Eriksson commented on SYSTEMML-909:
-

[~mwdus...@us.ibm.com] Have you considered Quantum Computing? 
https://en.wikipedia.org/wiki/Quantum_computing







[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-12 Thread Matthias Boehm (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485273#comment-15485273
 ] 

Matthias Boehm commented on SYSTEMML-909:
-

Which data types do you have in your example DataFrame? If it's not double, 
then I would suspect the implicit double parsing to be the reason. In any case, 
this narrow transformation will not be a bottleneck compared to creating the 
binary block representation, which requires a shuffle.
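
The difference between the narrow parsing step and the shuffle can be sketched 
in plain Python, with partitions modeled as lists. The function names, the 
1-based row IDs, and the placement rule (block k lives on partition 
k % number of partitions) are assumptions for illustration, not Spark or 
SystemML behavior.

```python
from collections import defaultdict


def parse_partition(partition):
    """Narrow step: each record is parsed in place, within its partition."""
    return [(row_id, [float(v) for v in values]) for row_id, values in partition]


def shuffle_by_block(partitions, block_size):
    """Wide step: regroup rows by block index (row IDs assumed 1-based).

    Returns the blocks and the number of rows that had to leave the
    partition they started in, assuming block k is placed on
    partition k % len(partitions).
    """
    blocks = defaultdict(list)
    moved = 0
    for source, partition in enumerate(partitions):
        for row_id, values in partition:
            block = (row_id - 1) // block_size
            if block % len(partitions) != source:
                moved += 1
            blocks[block].append((row_id, values))
    return dict(blocks), moved


partitions = [
    [(1, ["1", "2"]), (4, ["7", "8"])],
    [(3, ["5", "6"]), (2, ["3", "4"])],
]
parsed = [parse_partition(p) for p in partitions]       # no data movement
blocks, moved = shuffle_by_block(parsed, block_size=2)  # rows cross partitions
```

Parsing touches every value but never moves a row, while building blocks has 
to bring rows with neighboring IDs together regardless of where they started.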






[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-12 Thread Deron Eriksson (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485267#comment-15485267
 ] 

Deron Eriksson commented on SYSTEMML-909:
-

I am fine with scrapping this check. If the user does not supply metadata, 
information such as this could be determined at a lower level to avoid an 
extraneous pass over the data.
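
One way to read "at a lower level" is to derive the dimensions during the 
conversion pass itself. The following plain-Python sketch shows the idea under 
that assumption. `convert_with_dimensions` is a hypothetical name, row IDs are 
assumed to be 1-based, and this is not how the SystemML converters are written.

```python
def convert_with_dimensions(rows):
    """Collect rows keyed by ID and derive (num_rows, num_cols) in one pass.

    `rows` is any iterable of (row_id, values) pairs with 1-based IDs;
    it is consumed exactly once, so no separate count() is required.
    """
    converted = {}
    num_rows = 0
    num_cols = 0
    for row_id, values in rows:
        converted[row_id] = [float(v) for v in values]
        num_rows = max(num_rows, row_id)
        num_cols = max(num_cols, len(values))
    return converted, (num_rows, num_cols)


# A generator can only be iterated once, which mirrors a single pass.
stream = ((i, [i, i * 10, i * 100]) for i in (2, 1, 3))
matrix, dims = convert_with_dimensions(stream)
```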







[jira] [Commented] (SYSTEMML-909) `determineDataFrameDimensionsIfNeeded(...)` is a bottleneck.

2016-09-12 Thread Mike Dusenberry (JIRA)

[ 
https://issues.apache.org/jira/browse/SYSTEMML-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15485245#comment-15485245
 ] 

Mike Dusenberry commented on SYSTEMML-909:
--

cc [~deron], [~mboehm7], [~niketanpansare]



