[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-08 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14736139#comment-14736139
 ] 

Xiangrui Meng commented on SPARK-10199:
---

Yes, please. Thanks for doing the benchmark! We will close the JIRAs as well. 
Next time, we should discuss on the JIRA page first and implement something 
minimal for more discussions before we implement everything.

> Avoid using reflections for parquet model save
> --
>
> Key: SPARK-10199
> URL: https://issues.apache.org/jira/browse/SPARK-10199
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Reporter: Feynman Liang
> Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Multiple model save/load implementations in MLlib use case classes to infer a schema for the 
> data frame saved to Parquet. However, inferring a schema from case classes or 
> tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to just specify the schema for the data frame directly 
> using {{sqlContext.createDataFrame(dataRDD, schema)}}
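
The two approaches described in the issue can be sketched roughly as follows. This is only an illustration, assuming a Spark 1.x build with spark-sql on the classpath; the `Weight` case class, the `ExplicitSchemaSketch` object, and the field names are hypothetical and not taken from any MLlib model.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

// Hypothetical model record; not taken from any MLlib model.
case class Weight(id: Int, value: Double)

object ExplicitSchemaSketch {
  // Schema written out by hand, since the field types are known up front.
  val schema: StructType = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("value", DoubleType, nullable = false)))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("explicit-schema-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    try {
      val data = Seq(Weight(0, 0.5), Weight(1, 1.5))

      // Reflection-based path: the schema is inferred from the case class.
      val inferred = sqlContext.createDataFrame(sc.parallelize(data))

      // Explicit path: convert to Rows and pass the schema directly.
      val rows = sc.parallelize(data.map(w => Row(w.id, w.value)))
      val explicit = sqlContext.createDataFrame(rows, schema)

      inferred.printSchema()
      explicit.printSchema()
    } finally {
      sc.stop()
    }
  }
}
```

The explicit path trades a few extra lines per model for skipping schema inference at `save` time.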



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-04 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731834#comment-14731834
 ] 

Joseph K. Bradley commented on SPARK-10199:
---

I agree that the work required for these changes is large compared to the small 
gains for most use cases.  I could imagine allocating time to get this merged 
at some point in the future, but I don't think it can be prioritized right now. 
 I'd recommend keeping your code branch for the future, but closing the PR and 
marking this JIRA to be addressed later.




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-04 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731753#comment-14731753
 ] 

Vinod KC commented on SPARK-10199:
--

[~mengxr]
Thanks for the suggestion. 
Shall I close the PR?




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14731529#comment-14731529
 ] 

Xiangrui Meng commented on SPARK-10199:
---

The improvement numbers also depend on the model size. In unit tests, the 
model sizes are usually very small, so the overhead of reflection becomes 
significant. With real models, either the model itself is too small, 
or the model is large and the overhead of reflection becomes 
insignificant. Keeping the code simple and easy to understand is also quite 
important. +[~josephkb]




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-02 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728427#comment-14728427
 ] 

Vinod KC commented on SPARK-10199:
--

[~mengxr] could you please check the above micro-benchmarks and give your 
suggestions?




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-09-01 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14725211#comment-14725211
 ] 

Vinod KC commented on SPARK-10199:
--

[~mengxr]
I've measured the overhead of reflection in the save/load operations; please refer 
to the results at this link:
https://github.com/vinodkc/xtique/blob/master/overhead_duetoReflection.csv

I've also measured the performance gain in the save/load methods without reflection, 
averaged over 5 test executions.
Please refer to the performance gain % at these two links:
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_save.csv
https://github.com/vinodkc/xtique/blob/master/performance_Benchmark_load.csv





[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-31 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724459#comment-14724459
 ] 

Vinod KC commented on SPARK-10199:
--

[~mengxr]
1) I measured only the schema inference part. 
I will now measure the entire save/load operation and the schema 
inference part separately.

2) I will also run the tests multiple times and share the results.




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-31 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724438#comment-14724438
 ] 

Xiangrui Meng commented on SPARK-10199:
---

[~vinodkc] Did you measure the entire save/load operation or only the schema 
inference part? If schema inference takes only 1% of the entire save/load 
operation, maybe we shouldn't over-optimize this part.

Also, for micro-benchmarks, you should run the test multiple times and compare 
the averages to reduce variance. Most of your tests run in less than 1 second, 
so you are very likely to observe huge variance.
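
A minimal sketch of the averaging advice above, in plain Scala with no Spark dependency. The `MicroBenchmark` object, the `averageNanos` helper, and the placeholder workload are hypothetical names for illustration only; a real measurement would wrap the save/load or schema-inference call instead.

```scala
object MicroBenchmark {
  /** Runs `block` `warmups` times untimed, then `runs` times timed,
    * and returns the mean elapsed time in nanoseconds. */
  def averageNanos[R](runs: Int, warmups: Int = 0)(block: => R): Double = {
    require(runs > 0, "runs must be positive")
    // Warm-up iterations give the JIT a chance to compile the hot path.
    (0 until warmups).foreach(_ => block)
    // Time each run separately so the samples can be averaged.
    val samples = (0 until runs).map { _ =>
      val t0 = System.nanoTime()
      block
      System.nanoTime() - t0
    }
    samples.sum.toDouble / runs
  }

  def main(args: Array[String]): Unit = {
    // Placeholder workload standing in for the call being measured.
    val mean = averageNanos(runs = 5, warmups = 2) {
      (1 to 100000).map(_.toLong).sum
    }
    println(f"Mean elapsed time over 5 runs: $mean%.0f ns")
  }
}
```

Averaging several runs after a warm-up does not remove variance, but it makes a before/after comparison less sensitive to a single slow run.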




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-31 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14724138#comment-14724138
 ] 

Feynman Liang commented on SPARK-10199:
---

CC [~mengxr] [~josephkb]




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-30 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721907#comment-14721907
 ] 

Feynman Liang commented on SPARK-10199:
---

[~vinodkc] Thanks! I think these results are convincing. Let's see what others 
think, but FWIW I'm all for these changes, particularly because they set a 
precedent for future model save/load to explicitly specify the schema.




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-30 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721491#comment-14721491
 ] 

Vinod KC commented on SPARK-10199:
--

[~fliang], as you suggested:

1) I've made micro-benchmarks by surrounding createDataFrame in the model save 
methods and Loader.checkSchema in the load methods with the timing code below:

def time[R](block: => R): R = {
  val t0 = System.nanoTime()
  val result = block  // evaluate the by-name block exactly once
  val t1 = System.nanoTime()
  println("Elapsed time: " + (t1 - t0) + "ns")
  result
}

2) Then I ran the mllib test suites on the code before and after the change.

Please see the measurements and performance gain % in this Google Docs spreadsheet:

https://docs.google.com/spreadsheets/d/1TPUctB62xAHx0IaJttyx98MjRo4zVmO4neTkdi7uVDs/edit?usp=sharing

There is a good performance improvement without reflection.





[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-28 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14720277#comment-14720277
 ] 

Feynman Liang commented on SPARK-10199:
---

[~vinodkc] would it be possible to get some microbenchmarks? You can surround 
the call to 
[createDataFrame|https://github.com/apache/spark/pull/8507/files#diff-13d1de98ab7ae677f9b345eb90a8b8e8R237]
 with some timing code before and after the change.




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-28 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718609#comment-14718609
 ] 

Xiangrui Meng commented on SPARK-10199:
---

What is the overhead? Do we have any measurements?




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14718417#comment-14718417
 ] 

Apache Spark commented on SPARK-10199:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/8507




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-26 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14715148#comment-14715148
 ] 

Feynman Liang commented on SPARK-10199:
---

Awesome, thanks! You can tag that PR with the parent JIRA (SPARK-10199) then.




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-25 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712496#comment-14712496
 ] 

Vinod KC commented on SPARK-10199:
--

Sure, I'll group all the changes into a single PR.




[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-25 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14712492#comment-14712492
 ] 

Feynman Liang commented on SPARK-10199:
---

Hi [~vinodkc], I saw that you took all of these issues. Thanks for your help! 
To make things easier for review, do you mind grouping all the changes into a 
single PR?
