[jira] [Assigned] (SPARK-11846) Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11846:


Assignee: Apache Spark  (was: Xusen Yin)

> Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
> --
>
> Key: SPARK-11846
> URL: https://issues.apache.org/jira/browse/SPARK-11846
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Add read/write support to AFTSurvivalRegression and IsotonicRegression using 
> LinearRegression read/write as reference.
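For orientation, a minimal sketch of the requested usage once read/write support lands, following the LinearRegression pattern; the path, the `training` DataFrame and the parameter values below are illustrative only:

{code}
import org.apache.spark.ml.regression.{AFTSurvivalRegression, AFTSurvivalRegressionModel}

// assumes `training` is a DataFrame with "features", "label" and "censor" columns
val aft = new AFTSurvivalRegression().setQuantileProbabilities(Array(0.3, 0.6))
val model = aft.fit(training)

// save/load round trip, mirroring LinearRegressionModel's read/write support
model.write.overwrite().save("/tmp/aft-survival-model")
val restored = AFTSurvivalRegressionModel.load("/tmp/aft-survival-model")
{code}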






[jira] [Resolved] (SPARK-11747) Can not specify input path in python logistic_regression example under ml

2015-11-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11747.
---
Resolution: Won't Fix

> Can not specify input path in python logistic_regression example under ml
> -
>
> Key: SPARK-11747
> URL: https://issues.apache.org/jira/browse/SPARK-11747
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Reporter: Jeff Zhang
>Priority: Minor
>
> Not sure why it is hard-coded; it would be nice to allow the user to specify 
> the input path.
> {code}
> # Load and parse the data file into a dataframe.
> df = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
> {code}






[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-4036:

Comment: was deleted

(was: ok)

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf
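For reference, the standard linear-chain CRF conditional distribution that such an implementation would model (generic textbook form, not taken from the attached design doc):

{noformat}
p(y | x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big)

Z(x)     = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)
{noformat}

where the f_k are feature functions over adjacent labels and the input sequence, and the lambda_k are the learned weights.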






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013095#comment-15013095
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013097#comment-15013097
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013098#comment-15013098
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013096#comment-15013096
 ] 

hujiayin commented on SPARK-4036:
-

ok

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS

2015-11-18 Thread Virgil Palanciuc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013088#comment-15013088
 ] 

Virgil Palanciuc commented on SPARK-10109:
--

Haven't tried it lately - I just ran it with parallel execution disabled.

Like I said, the root cause seems to be that I was writing 2 dataframes 
simultaneously to the same destination (albeit with a partitioning scheme that 
made the destinations non-overlapping). I partition by the columns "dpid" and 
"pid", and if I process 2 PIDs in parallel (on different threads) and they both 
have the same DPID (e.g. I simultaneously write to /dpid=1/pid=1/ 
and to /dpid=1/pid=2/), I get this problem. It seems to be 
caused by the fact that, on 2 different threads, I do 
"df.write.partitionBy().parquet('')" - i.e. I write 2 DataFrames 
simultaneously "to the same destination" (due to partitioning, the actual files 
should never overlap, but it still seems to be a problem).

Not sure if this was really a Spark bug or an application problem - if you 
think Spark should have worked in this scenario, let me know and I'll retry it. 
But it rather feels like it was an application bug (a bad assumption on my part 
about how writing works).
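For clarity, a rough sketch of the write pattern described above; df1, df2 and the output path are hypothetical stand-ins, and SaveMode handling is omitted:

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// two DataFrames sharing dpid=1 but carrying different pid values
val futures = Seq(df1, df2).map { df =>
  Future {
    // both threads write under the same base path; the partitioning keeps the
    // actual files non-overlapping, e.g. /dpid=1/pid=1/ vs /dpid=1/pid=2/
    df.write.partitionBy("dpid", "pid").parquet("hdfs:///output/base")
  }
}
futures.foreach(f => Await.result(f, Duration.Inf))
{code}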

> NPE when saving Parquet To HDFS
> ---
>
> Key: SPARK-10109
> URL: https://issues.apache.org/jira/browse/SPARK-10109
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Sparc-ec2, standalone cluster on amazon
>Reporter: Virgil Palanciuc
>
> Very simple code, trying to save a dataframe
> I get this in the driver
> {quote}
> 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID 
> 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) 
> and  (not for that task):
> 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID 
> 5607, 172.yy.yy.yy): java.lang.NullPointerException
> at 
> parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
> at 
> parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
> at 
> parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
> at 
> org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536)
> at 
> org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at 
> org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> I get this in the executor log:
> {quote}
> 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet
>  File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not 
> have any open files.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNames

[jira] [Updated] (SPARK-11339) Fix and document the list of functions in R base package that are masked by functions with same name in SparkR

2015-11-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-11339:
--
Assignee: Felix Cheung

> Fix and document the list of functions in R base package that are masked by 
> functions with same name in SparkR
> --
>
> Key: SPARK-11339
> URL: https://issues.apache.org/jira/browse/SPARK-11339
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Sun Rui
>Assignee: Felix Cheung
> Fix For: 1.6.0
>
>
> There may be name conflicts between API functions of SparkR and functions 
> exposed in the R base package (or other popular 3rd-party packages). If some 
> conflicting functions are very popular and frequently used in the R base 
> package, we may rename the functions in SparkR to avoid conflicts and 
> inconvenience to R users. Otherwise, we keep the names of the functions in 
> SparkR, so the functions of the same name in the R base package are masked.
> We should have a list of such conflicting functions to reduce user confusion.






[jira] [Resolved] (SPARK-11339) Fix and document the list of functions in R base package that are masked by functions with same name in SparkR

2015-11-18 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-11339.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9785
[https://github.com/apache/spark/pull/9785]

> Fix and document the list of functions in R base package that are masked by 
> functions with same name in SparkR
> --
>
> Key: SPARK-11339
> URL: https://issues.apache.org/jira/browse/SPARK-11339
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Reporter: Sun Rui
> Fix For: 1.6.0
>
>
> There may be name conflicts between API functions of SparkR and functions 
> exposed in the R base package (or other popular 3rd-party packages). If some 
> conflicting functions are very popular and frequently used in the R base 
> package, we may rename the functions in SparkR to avoid conflicts and 
> inconvenience to R users. Otherwise, we keep the names of the functions in 
> SparkR, so the functions of the same name in the R base package are masked.
> We should have a list of such conflicting functions to reduce user confusion.






[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013068#comment-15013068
 ] 

Kai Sasaki edited comment on SPARK-4036 at 11/19/15 7:32 AM:
-

[~hujiayin] 
I'm sorry for the late response. I haven't created any patch yet, so feel free 
to work on this JIRA instead of me.
Anyway, may I review and comment on your patch?


was (Author: lewuathe):
[~hujiayin] 
I'm sorry for being late for response. I haven't yet create any patch. So never 
mind to work in this JIRA instead of me.
Anyway, can I give a check and comment to your patch?

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013068#comment-15013068
 ] 

Kai Sasaki commented on SPARK-4036:
---

[~hujiayin] 
I'm sorry for the late response. I haven't created any patch yet, so feel free 
to work on this JIRA instead of me.
Anyway, may I review and comment on your patch?

> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Commented] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-18 Thread Virgil Palanciuc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013064#comment-15013064
 ] 

Virgil Palanciuc commented on SPARK-11657:
--

Yes, I think I was using Kryo. Thanks!

> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3, 1.6.0
>
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report






[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib

2015-11-18 Thread hujiayin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353
 ] 

hujiayin edited comment on SPARK-4036 at 11/19/15 7:09 AM:
---

Hi Sasaki,
I'm not sure whether you are working on it, as the JIRA is still open. If you 
have a PR, you could close my PR https://github.com/apache/spark/pull/9794.
Besides Sasaki's design, I also referenced this document for the implementation:
http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf




was (Author: hujiayin):
Hi Sasaki,
I'm not sure if you worked on it as the jira is still open. If you have a PR, 
you could close my PR https://github.com/apache/spark/pull/9794


> Add Conditional Random Fields (CRF) algorithm to Spark MLlib
> 
>
> Key: SPARK-4036
> URL: https://issues.apache.org/jira/browse/SPARK-4036
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Kai Sasaki
> Attachments: CRF_design.1.pdf
>
>
> Conditional random fields (CRFs) are a class of statistical modelling methods 
> often applied in pattern recognition and machine learning, where they are 
> used for structured prediction. 
> The paper: 
> http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf






[jira] [Assigned] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7499:
---

Assignee: Apache Spark

> Investigate how to specify columns in SparkR without $ or strings
> -
>
> Key: SPARK-7499
> URL: https://issues.apache.org/jira/browse/SPARK-7499
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Apache Spark
>
> Right now in SparkR we need to specify columns using `$` or strings. 
> For example to run select we would do
> {code}
> df1 <- select(df, df$age > 10)
> {code}
> It would be good to infer the set of columns in a dataframe automatically and 
> resolve symbols for column names. For example
> {code} 
> df1 <- select(df, age > 10)
> {code}
> One way to do this is to build an environment with all the column names to 
> column handles and then use `substitute(arg, env = columnNameEnv)`






[jira] [Assigned] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7499:
---

Assignee: (was: Apache Spark)

> Investigate how to specify columns in SparkR without $ or strings
> -
>
> Key: SPARK-7499
> URL: https://issues.apache.org/jira/browse/SPARK-7499
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> Right now in SparkR we need to specify columns using `$` or strings. 
> For example to run select we would do
> {code}
> df1 <- select(df, df$age > 10)
> {code}
> It would be good to infer the set of columns in a dataframe automatically and 
> resolve symbols for column names. For example
> {code} 
> df1 <- select(df, age > 10)
> {code}
> One way to do this is to build an environment with all the column names to 
> column handles and then use `substitute(arg, env = columnNameEnv)`






[jira] [Commented] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013045#comment-15013045
 ] 

Apache Spark commented on SPARK-7499:
-

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/9835

> Investigate how to specify columns in SparkR without $ or strings
> -
>
> Key: SPARK-7499
> URL: https://issues.apache.org/jira/browse/SPARK-7499
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> Right now in SparkR we need to specify columns using `$` or strings. 
> For example to run select we would do
> {code}
> df1 <- select(df, df$age > 10)
> {code}
> It would be good to infer the set of columns in a dataframe automatically and 
> resolve symbols for column names. For example
> {code} 
> df1 <- select(df, age > 10)
> {code}
> One way to do this is to build an environment with all the column names to 
> column handles and then use `substitute(arg, env = columnNameEnv)`






[jira] [Commented] (SPARK-11021) SparkSQL cli throws exception when using with Hive 0.12 metastore in spark-1.5.0 version

2015-11-18 Thread bit1129 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013046#comment-15013046
 ] 

bit1129 commented on SPARK-11021:
-

I encountered exactly the same issue with Spark 1.5.0 + Hive 0.14.0. The problem 
was gone after I added the configuration Jeff suggested. Thanks Jeff!

> SparkSQL cli throws exception when using with Hive 0.12 metastore in 
> spark-1.5.0 version
> 
>
> Key: SPARK-11021
> URL: https://issues.apache.org/jira/browse/SPARK-11021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: iward
>
> After upgrading Spark from 1.4.1 to 1.5.0, I get the following exception when I 
> set the following properties in spark-defaults.conf:
> {noformat}
> spark.sql.hive.metastore.version=0.12.0
> spark.sql.hive.metastore.jars=hive 0.12 jars and hadoop jars
> {noformat}
> When I run a task, I get the following exception:
> {noformat}
> java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_12.loadTable(HiveShim.scala:249)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>   at 
> org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927)
>   at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:144)
>   at org.apache.spark.sql.DataFrame.(DataFrame.scala:129)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:61)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:165)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>  Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
> results from 
> hdfs://ns

[jira] [Commented] (SPARK-11748) Result is null after alter column name of table stored as Parquet

2015-11-18 Thread pin_zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013030#comment-15013030
 ] 

pin_zhang commented on SPARK-11748:
---

Apache Hive 0.14 added support for Parquet column rename 
(https://issues.apache.org/jira/browse/HIVE-6938), but that doesn't work in 
Spark's Hive support.


> Result is null after alter column name of table stored as Parquet 
> --
>
> Key: SPARK-11748
> URL: https://issues.apache.org/jira/browse/SPARK-11748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: pin_zhang
>
> 1. Test with the following code
> hctx.sql(" create table " + table + " (id int, str string) STORED AS 
> PARQUET ")
> val df = hctx.jsonFile("g:/vip.json")
> df.write.format("parquet").mode(SaveMode.Append).saveAsTable(table)
> hctx.sql(" select * from " + table).show()
> // alter table
> val alter = "alter table " + table + " CHANGE id i_d int "
> hctx.sql(alter)
>  
> hctx.sql(" select * from " + table).show()
> 2. Result
> After changing the table column name, data is null for the changed column.
> Result before alter table
> +---+---+
> | id|str|
> +---+---+
> |  1| s1|
> |  2| s2|
> +---+---+
> Result after alter table
> ++---+
> | i_d|str|
> ++---+
> |null| s1|
> |null| s2|
> ++---+






[jira] [Commented] (SPARK-11659) Codegen sporadically fails with same input character

2015-11-18 Thread Catalin Alexandru Zamfir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012995#comment-15012995
 ] 

Catalin Alexandru Zamfir commented on SPARK-11659:
--

It is not explicit. We've only put spark-sql in the POM, which drags in all the 
default dependencies. For the moment we have deactivated codegen (although it 
was faster) and are waiting next week for a 1.5.2 update. I see version 2.6.1 
(managed from 2.7.8). Should I force the newer version?

> Codegen sporadically fails with same input character
> 
>
> Key: SPARK-11659
> URL: https://issues.apache.org/jira/browse/SPARK-11659
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1
> Environment: Default, Linux (Jessie)
>Reporter: Catalin Alexandru Zamfir
>
> We pretty much have a default installation of Spark 1.5.1. Some of our jobs 
> sporadically fail with the below exception for the same "input character" (we 
> don't have @ in our inputs as we check the types that we filter from the 
> data, but jobs still fail) and when we re-run the same job with the same 
> input, the same job passes without any failures. I believe it's a bug in 
> code-gen but I can't debug this on a production cluster. One thing to note is 
> that this has a higher chance of occurring when multiple jobs are run in 
> parallel to one another (e.g. 4 jobs at a time started in the same second 
> using a scheduler and sharing the same context). However, I have no reliable 
> reproduction rule. For example, from 32 jobs scheduled in batches of 4 jobs per 
> batch, 1 of the jobs in one of the batches may fail with the below error, with 
> a different job each time, randomly. I don't have a good idea of how to approach 
> this situation to produce better information, so maybe you can advise us.
> {noformat}
> Job aborted due to stage failure: Task 50 in stage 4.0 failed 4 times, most 
> recent failure: Lost task 50.3 in stage 4.0 (TID 894, 10.136.64.112): 
> java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
> compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: 
> Invalid character input "@" (character code 64)
> public SpecificOrdering 
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
>   return new SpecificOrdering(expr);
> }
> class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
>   
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   
>   
>   
>   public 
> SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) 
> {
> expressions = expr;
> 
>   }
>   
>   @Override
>   public int compare(InternalRow a, InternalRow b) {
> InternalRow i = null;  // Holds current row being evaluated.
> 
> i = a;
> boolean isNullA2;
> long primitiveA3;
> {
>   /* input[0, LongType] */
>   
>   boolean isNull0 = i.isNullAt(0);
>   long primitive1 = isNull0 ? -1L : (i.getLong(0));
>   
>   isNullA2 = isNull0;
>   primitiveA3 = primitive1;
> }
> i = b;
> boolean isNullB4;
> long primitiveB5;
> {
>   /* input[0, LongType] */
>   
>   boolean isNull0 = i.isNullAt(0);
>   long primitive1 = isNull0 ? -1L : (i.getLong(0));
>   
>   isNullB4 = isNull0;
>   primitiveB5 = primitive1;
> }
> if (isNullA2 && isNullB4) {
>   // Nothing
> } else if (isNullA2) {
>   return -1;
> } else if (isNullB4) {
>   return 1;
> } else {
>   int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
> -1 : 0);
>   if (comp != 0) {
> return comp;
>   }
> }
> 
> 
> i = a;
> boolean isNullA8;
> long primitiveA9;
> {
>   /* input[1, LongType] */
>   
>   boolean isNull6 = i.isNullAt(1);
>   long primitive7 = isNull6 ? -1L : (i.getLong(1));
>   
>   isNullA8 = isNull6;
>   primitiveA9 = primitive7;
> }
> i = b;
> boolean isNullB10;
> long primitiveB11;
> {
>   /* input[1, LongType] */
>   
>   boolean isNull6 = i.isNullAt(1);
>   long primitive7 = isNull6 ? -1L : (i.getLong(1));
>   
>   isNullB10 = isNull6;
>   primitiveB11 = primitive7;
> }
> if (isNullA8 && isNullB10) {
>   // Nothing
> } else if (isNullA8) {
>   return -1;
> } else if (isNullB10) {
>   return 1;
> } else {
>   int comp = (primitiveA9 > primitiveB11 ? 1 : primitiveA9 < primitiveB11 
> ? -1 : 0);
>   if (comp != 0) {
> return comp;
>   }
> }
> 
> 
> i = a;
> boolean isNullA14;
> long primitiveA15;
> {
>   /* input[2, LongType] */
>   
>   boolean isNull12 = i.isNullA

[jira] [Assigned] (SPARK-11817) insert of timestamp with fractional seconds inserts a NULL

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11817:


Assignee: Apache Spark

> insert of timestamp with fractional seconds inserts a NULL
> -
>
> Key: SPARK-11817
> URL: https://issues.apache.org/jira/browse/SPARK-11817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Chip Sands
>Assignee: Apache Spark
>
> Using the Thrift JDBC interface.
> Inserting the value "1970-01-01 00:00:00.123456789" into a timestamp 
> column inserts a NULL into the database. I am aware of the change in the 
> 1.5 release notes: Timestamp type's precision is reduced to 1 microsecond 
> (1us). However, to be compatible with previous versions, I would suggest 
> either rounding or truncating the fractional seconds rather than 
> inserting a NULL.
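To make the suggested behaviour concrete, a small sketch of truncating the fractional part to Spark 1.5's microsecond precision (a purely illustrative helper, not an existing API):

{code}
// keep at most 6 fractional digits:
// "1970-01-01 00:00:00.123456789" -> "1970-01-01 00:00:00.123456"
def truncateToMicros(ts: String): String = {
  val dot = ts.indexOf('.')
  if (dot < 0) ts else ts.take(math.min(ts.length, dot + 1 + 6))
}

truncateToMicros("1970-01-01 00:00:00.123456789")
{code}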






[jira] [Assigned] (SPARK-11817) insert of timestamp with fractional seconds inserts a NULL

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11817:


Assignee: (was: Apache Spark)

> insert of timestamp with fractional seconds inserts a NULL
> -
>
> Key: SPARK-11817
> URL: https://issues.apache.org/jira/browse/SPARK-11817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Chip Sands
>
> Using the Thrift JDBC interface.
> Inserting the value "1970-01-01 00:00:00.123456789" into a timestamp 
> column inserts a NULL into the database. I am aware of the change in the 
> 1.5 release notes: Timestamp type's precision is reduced to 1 microsecond 
> (1us). However, to be compatible with previous versions, I would suggest 
> either rounding or truncating the fractional seconds rather than 
> inserting a NULL.






[jira] [Commented] (SPARK-11817) insert of timestamp with fractional seconds inserts a NULL

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012981#comment-15012981
 ] 

Apache Spark commented on SPARK-11817:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9834

> insert of timestamp with fractional seconds inserts a NULL
> -
>
> Key: SPARK-11817
> URL: https://issues.apache.org/jira/browse/SPARK-11817
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Chip Sands
>
> Using the Thrift JDBC interface.
> Inserting the value "1970-01-01 00:00:00.123456789" into a timestamp 
> column inserts a NULL into the database. I am aware of the change in the 
> 1.5 release notes: Timestamp type's precision is reduced to 1 microsecond 
> (1us). However, to be compatible with previous versions, I would suggest 
> either rounding or truncating the fractional seconds rather than 
> inserting a NULL.






[jira] [Comment Edited] (SPARK-6725) Model export/import for Pipeline API

2015-11-18 Thread Earthson Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012815#comment-15012815
 ] 

Earthson Lu edited comment on SPARK-6725 at 11/19/15 6:34 AM:
--

-I'm glad to give some help:) Does it mean to do some unit tests?-

I'm sorry, I have to focus on my own work now and may not have time to help 
with the 1.6 release.


was (Author: earthsonlu):
I'm glad to give some help:) Does it mean to do some unit tests?

> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
> *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
> Other libraries and formats can support this, and it would be great if we 
> could too.  We could do either of the following:
> * save() optionally takes a dataset (or schema), and load will return a 
> (model, schema) pair.
> * Models themselves save the input schema.
> Both options would mean inheriting from new Saveable, Loadable types.
> *UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have 
> comments about the planned implementation, please comment in this JIRA.  
> Thanks!  
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]
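Purely as an illustration of the two options above (hypothetical trait names, not the API from the design doc):

{code}
import org.apache.spark.sql.types.StructType

trait Saveable {
  // Option 1: save() optionally takes the input schema (or dataset) along with the model
  def save(path: String, inputSchema: Option[StructType] = None): Unit
}

trait Loadable[M] {
  // ...and load() returns a (model, schema) pair
  def load(path: String): (M, Option[StructType])
}
{code}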






[jira] [Commented] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012972#comment-15012972
 ] 

Apache Spark commented on SPARK-11849:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9833

> Analyzer should replace current_date and current_timestamp with literals
> 
>
> Key: SPARK-11849
> URL: https://issues.apache.org/jira/browse/SPARK-11849
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently rely on the optimizer's constant folding to replace 
> current_timestamp and current_date. However, this can still result in 
> different values for different instances of current_timestamp/current_date if 
> the optimizer is not running fast enough.
> A better solution is to replace these functions in the analyzer in one shot.
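A minimal sketch of the kind of analyzer rule this describes, evaluating the current time once and substituting literals; the rule and value names are illustrative, not the actual patch:

{code}
import org.apache.spark.sql.catalyst.expressions.{CurrentDate, CurrentTimestamp, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types.{DateType, TimestampType}

object ReplaceCurrentDatetime extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = {
    // evaluate once per query so every occurrence sees the same value
    val nowMillis = System.currentTimeMillis()
    val currentTs = Literal.create(nowMillis * 1000L, TimestampType) // microseconds
    val currentDate = Literal.create(DateTimeUtils.millisToDays(nowMillis), DateType)
    plan transformAllExpressions {
      case CurrentTimestamp() => currentTs
      case CurrentDate()      => currentDate
    }
  }
}
{code}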






[jira] [Assigned] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11849:


Assignee: Reynold Xin  (was: Apache Spark)

> Analyzer should replace current_date and current_timestamp with literals
> 
>
> Key: SPARK-11849
> URL: https://issues.apache.org/jira/browse/SPARK-11849
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We currently rely on the optimizer's constant folding to replace 
> current_timestamp and current_date. However, this can still result in 
> different values for different instances of current_timestamp/current_date if 
> the optimizer is not running fast enough.
> A better solution is to replace these functions in the analyzer in one shot.






[jira] [Assigned] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11849:


Assignee: Apache Spark  (was: Reynold Xin)

> Analyzer should replace current_date and current_timestamp with literals
> 
>
> Key: SPARK-11849
> URL: https://issues.apache.org/jira/browse/SPARK-11849
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We currently rely on the optimizer's constant folding to replace 
> current_timestamp and current_date. However, this can still result in 
> different values for different instances of current_timestamp/current_date if 
> the optimizer is not running fast enough.
> A better solution is to replace these functions in the analyzer in one shot.






[jira] [Created] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals

2015-11-18 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11849:
---

 Summary: Analyzer should replace current_date and 
current_timestamp with literals
 Key: SPARK-11849
 URL: https://issues.apache.org/jira/browse/SPARK-11849
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


We currently rely on the optimizer's constant folding to replace 
current_timestamp and current_date. However, this can still result in different 
values for different instances of current_timestamp/current_date if the 
optimizer is not running fast enough.

A better solution is to replace these functions in the analyzer in one shot.







[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights

2015-11-18 Thread Kai Sasaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012964#comment-15012964
 ] 

Kai Sasaki commented on SPARK-11520:


[~josephkb] The metrics in {{RegressionMetrics}} seem to be based on 
{{MultivariateStatisticalSummary}}, and the current {{RegressionMetrics}} does 
not accept weighted samples as an argument. So we could pass the weighted 
samples to {{MultivariateStatisticalSummary}} ({{MultivariateOnlineSummarizer}}) 
and compute the regression metrics from that.
Is this assumption correct? Can I work on this JIRA, if possible?
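Independently of the summarizer internals, a tiny sketch of what weighted versions of the usual regression metrics look like (the data here is made up):

{code}
// (prediction, label, weight) triples
val scored = Seq((2.5, 3.0, 1.0), (0.0, -0.5, 2.0), (2.0, 2.0, 0.5))

val wSum = scored.map(_._3).sum
// weighted mean squared error
val mse = scored.map { case (p, l, w) => w * (p - l) * (p - l) }.sum / wSum
// weighted mean absolute error
val mae = scored.map { case (p, l, w) => w * math.abs(p - l) }.sum / wSum
{code}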

> RegressionMetrics should support instance weights
> -
>
> Key: SPARK-11520
> URL: https://issues.apache.org/jira/browse/SPARK-11520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This will be important to improve LinearRegressionSummary, which currently 
> has a mix of weighted and unweighted metrics.






[jira] [Updated] (SPARK-11622) Make LibSVMRelation extend HadoopFsRelation and add LibSVMOutputWriter

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11622:
--
Target Version/s: 1.7.0

> Make LibSVMRelation extend HadoopFsRelation and add LibSVMOutputWriter
> ---
>
> Key: SPARK-11622
> URL: https://issues.apache.org/jira/browse/SPARK-11622
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
>
> so that LibSVMRelation can leverage the features from HadoopFsRelation
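For example, going through HadoopFsRelation would let LibSVM data flow through the standard data source API; a sketch assuming the "libsvm" data source short name, where the write half is what the proposed LibSVMOutputWriter would enable:

{code}
// read a LibSVM file through the data source API instead of MLUtils
val df = sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// with a LibSVMOutputWriter in place, write it back out the same way
df.write.format("libsvm").save("/tmp/sample_libsvm_out")
{code}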






[jira] [Updated] (SPARK-11275) [SQL] Incorrect results when using rollup/cube

2015-11-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11275:
-
Priority: Critical  (was: Major)

> [SQL] Incorrect results when using rollup/cube 
> ---
>
> Key: SPARK-11275
> URL: https://issues.apache.org/jira/browse/SPARK-11275
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.1
>Reporter: Xiao Li
>Priority: Critical
>
> Spark SQL is unable to generate a correct result when the following query 
> uses rollup:
> "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, 
> b with rollup"
> Spark SQL generates a wrong result:
> [2,4,6,3]
> [2,null,null,1]
> [1,null,null,1]
> [null,null,null,0]
> [1,2,3,3]
> The table mytable is super simple, containing two rows and two columns:
> testData = Seq((1, 2), (2, 4)).toDF("a", "b")
> After turning off codegen, the query plan is like 
> == Parsed Logical Plan ==
> 'Rollup ['a,'b], 
> [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS 
> sumAB#20),unresolvedalias('GROUPING__ID)]
>  'UnresolvedRelation `mytable`, None
> == Analyzed Logical Plan ==
> a: int, b: int, sumAB: bigint, GROUPING__ID: int
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   Subquery mytable
>Project [_1#0 AS a#2,_2#1 AS b#3]
> LocalRelation [_1#0,_2#1], [[1,2],[2,4]]
> == Optimized Logical Plan ==
> Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as 
> bigint)) AS sumAB#20L,GROUPING__ID#23]
>  Expand [0,1,3], [a#2,b#3], grouping__id#23
>   LocalRelation [a#2,b#3], [[1,2],[2,4]]
> == Physical Plan ==
> Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS 
> sumAB#20L,grouping__id#23]
>  Exchange hashpartitioning(a#2,b#3,grouping__id#23,5)
>   Aggregate true, [a#2,b#3,grouping__id#23], 
> [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L]
>Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], 
> [a#2,b#3,grouping__id#23]
> LocalTableScan [a#2,b#3], [[1,2],[2,4]]
> Below are my observations:
> 1. Generation of GROUP__ID looks OK. 
> 2. The problem still exists no matter whether turning on/off CODEGEN
> 3. Rollup still works in a simple query when the group-by has only one 
> column. For example, "select b, sum(a), GROUPING__ID from mytable group by b 
> with rollup"
> 4. The tests in "HiveDataFrameAnalyticsSuite" are misleading. 
> Unfortunately, they hide the bugs. Although the tests passed, they just 
> compare the results of SQL and DataFrame. This approach cannot catch a 
> regression when both return the same wrong results.
> 5. The same problem also exists in cube. I have not started the investigation 
> in cube, but I believe the root causes should be the same. 
> 6. It looks like all the logical plans are correct.
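For comparison, the result one would expect from that rollup query over the rows (1,2) and (2,4): the subtotal and grand-total rows should carry sums 3, 6 and 9 rather than nulls (row order aside, with GROUPING__ID following the 0/1/3 encoding visible in the Expand node above):

{noformat}
[1,2,3,3]
[2,4,6,3]
[1,null,3,1]
[2,null,6,1]
[null,null,9,0]
{noformat}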






[jira] [Resolved] (SPARK-11842) Cleanups to existing Readers and Writers

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11842.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9829
[https://github.com/apache/spark/pull/9829]

> Cleanups to existing Readers and Writers
> 
>
> Key: SPARK-11842
> URL: https://issues.apache.org/jira/browse/SPARK-11842
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.6.0
>
>
> Small cleanups to existing Readers and Writers
> * Add {{repartition(1)}} to save() methods' saving of data for 
> LogisticRegressionModel, LinearRegressionModel.
> * Strengthen privacy to class and companion object for Writers and Readers
> * Change LogisticRegressionSuite read/write test to fit intercept
> * Add Since versions for read/write methods in Pipeline, LogisticRegression
> * Switch from hand-written class names in Readers to using getClass






[jira] [Commented] (SPARK-7457) Perf test for ALS.recommendAll

2015-11-18 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012843#comment-15012843
 ] 

Jeff Zhang commented on SPARK-7457:
---

[~mengxr] I can't find the API ALS.recommendAll - has it been removed? And is 
this ticket for some performance report?

> Perf test for ALS.recommendAll
> --
>
> Key: SPARK-7457
> URL: https://issues.apache.org/jira/browse/SPARK-7457
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, Tests
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>







[jira] [Comment Edited] (SPARK-6725) Model export/import for Pipeline API

2015-11-18 Thread Earthson Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012815#comment-15012815
 ] 

Earthson Lu edited comment on SPARK-6725 at 11/19/15 5:14 AM:
--

I'm glad to give some help:) Does it mean to do some unit tests?


was (Author: earthsonlu):
I'm glad to give help:)

> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
> *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
> Other libraries and formats can support this, and it would be great if we 
> could too.  We could do either of the following:
> * save() optionally takes a dataset (or schema), and load will return a 
> (model, schema) pair.
> * Models themselves save the input schema.
> Both options would mean inheriting from new Saveable, Loadable types.
> *UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have 
> comments about the planned implementation, please comment in this JIRA.  
> Thanks!  
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012817#comment-15012817
 ] 

Apache Spark commented on SPARK-11848:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/9832

> [SQL] Support EXPLAIN in DataSet APIs
> -
>
> Key: SPARK-11848
> URL: https://issues.apache.org/jira/browse/SPARK-11848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> Prints the plans (logical and physical) to the console for debugging purposes.
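For reference, a hedged sketch of the existing DataFrame behavior that the Dataset
method would presumably mirror; assumes a spark-shell style {{sqlContext}}.

{code}
// Existing DataFrame API, shown here as the reference behavior.
val df = sqlContext.range(10)

df.filter("id > 5").explain()      // physical plan only
df.filter("id > 5").explain(true)  // logical and physical plans
{code}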



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11848:


Assignee: Apache Spark

> [SQL] Support EXPLAIN in DataSet APIs
> -
>
> Key: SPARK-11848
> URL: https://issues.apache.org/jira/browse/SPARK-11848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Prints the plans (logical and physical) to the console for debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11848:


Assignee: (was: Apache Spark)

> [SQL] Support EXPLAIN in DataSet APIs
> -
>
> Key: SPARK-11848
> URL: https://issues.apache.org/jira/browse/SPARK-11848
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> Prints the plans (logical and physical) to the console for debugging purposes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6725) Model export/import for Pipeline API

2015-11-18 Thread Earthson Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012815#comment-15012815
 ] 

Earthson Lu commented on SPARK-6725:


I'm glad to give help:)

> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
> *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
> Other libraries and formats can support this, and it would be great if we 
> could too.  We could do either of the following:
> * save() optionally takes a dataset (or schema), and load will return a 
> (model, schema) pair.
> * Models themselves save the input schema.
> Both options would mean inheriting from new Saveable, Loadable types.
> *UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have 
> comments about the planned implementation, please comment in this JIRA.  
> Thanks!  
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs

2015-11-18 Thread Xiao Li (JIRA)
Xiao Li created SPARK-11848:
---

 Summary: [SQL] Support EXPLAIN in DataSet APIs
 Key: SPARK-11848
 URL: https://issues.apache.org/jira/browse/SPARK-11848
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Xiao Li


Prints the plans (logical and physical) to the console for debugging purposes.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11829) Model export/import for spark.ml: estimators under ml.feature (II)

2015-11-18 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012785#comment-15012785
 ] 

Yanbo Liang commented on SPARK-11829:
-

Sure, I can take this.

> Model export/import for spark.ml: estimators under ml.feature (II)
> --
>
> Key: SPARK-11829
> URL: https://issues.apache.org/jira/browse/SPARK-11829
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Add read/write support to the following estimators and models under spark.ml:
> * ChiSqSelector
> * PCA
> * QuantileDiscretizer
> * VectorIndexer
> * Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-18 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012776#comment-15012776
 ] 

yuhao yang commented on SPARK-11847:


Sure, I can take it.

> Model export/import for spark.ml: LDA
> -
>
> Key: SPARK-11847
> URL: https://issues.apache.org/jira/browse/SPARK-11847
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> Add read/write support to LDA, similar to ALS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11847:
--
Description: Add read/write support to LDA, similar to ALS.

> Model export/import for spark.ml: LDA
> -
>
> Key: SPARK-11847
> URL: https://issues.apache.org/jira/browse/SPARK-11847
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> Add read/write support to LDA, similar to ALS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012775#comment-15012775
 ] 

Xiangrui Meng commented on SPARK-11847:
---

[~yuhaoyan] We need some help on pipeline persistence. Do you have time to work 
on this JIRA?

> Model export/import for spark.ml: LDA
> -
>
> Key: SPARK-11847
> URL: https://issues.apache.org/jira/browse/SPARK-11847
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> Add read/write support to LDA, similar to ALS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-11847:
-

 Summary: Model export/import for spark.ml: LDA
 Key: SPARK-11847
 URL: https://issues.apache.org/jira/browse/SPARK-11847
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: yuhao yang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6725) Model export/import for Pipeline API

2015-11-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012770#comment-15012770
 ] 

Xiangrui Meng commented on SPARK-6725:
--

[~EarthsonLu] We are adding more import/export to existing algorithms. Could 
you help test them in both Scala and Java and let us know if you find any 
issues? Thanks!
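
A hedged sketch of the kind of round-trip check that helps here, using the read/write
API on LogisticRegression (which already supports it in 1.6); the path and parameter
values are illustrative.

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Save a configured estimator, load it back, and verify the params survived.
val lr = new LogisticRegression()
  .setMaxIter(5)
  .setRegParam(0.01)

val path = "/tmp/lr-roundtrip"
lr.write.overwrite().save(path)

val loaded = LogisticRegression.load(path)
assert(loaded.getMaxIter == lr.getMaxIter)
assert(loaded.getRegParam == lr.getRegParam)
{code}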

> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
> *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
> Other libraries and formats can support this, and it would be great if we 
> could too.  We could do either of the following:
> * save() optionally takes a dataset (or schema), and load will return a 
> (model, schema) pair.
> * Models themselves save the input schema.
> Both options would mean inheriting from new Saveable, Loadable types.
> *UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have 
> comments about the planned implementation, please comment in this JIRA.  
> Thanks!  
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11846:
--
Description: 
Add read/write support to AFTSurvivalRegression and IsotonicRegression using 
LinearRegression read/write as reference.


> CLONE - Model export/import for spark.ml: AFTSurvivalRegression and 
> IsotonicRegression
> --
>
> Key: SPARK-11846
> URL: https://issues.apache.org/jira/browse/SPARK-11846
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Add read/write support to AFTSurvivalRegression and IsotonicRegression using 
> LinearRegression read/write as reference.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11846) Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11846:
--
Summary: Model export/import for spark.ml: AFTSurvivalRegression and 
IsotonicRegression  (was: CLONE - Model export/import for spark.ml: 
AFTSurvivalRegression and IsotonicRegression)

> Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
> --
>
> Key: SPARK-11846
> URL: https://issues.apache.org/jira/browse/SPARK-11846
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> Add read/write support to AFTSurvivalRegression and IsotonicRegression using 
> LinearRegression read/write as reference.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11846:
--
Assignee: Xusen Yin  (was: Wenjian Huang)

> CLONE - Model export/import for spark.ml: AFTSurvivalRegression and 
> IsotonicRegression
> --
>
> Key: SPARK-11846
> URL: https://issues.apache.org/jira/browse/SPARK-11846
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression

2015-11-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-11846:
-

 Summary: CLONE - Model export/import for spark.ml: 
AFTSurvivalRegression and IsotonicRegression
 Key: SPARK-11846
 URL: https://issues.apache.org/jira/browse/SPARK-11846
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Wenjian Huang
 Fix For: 1.6.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11846:
--
Fix Version/s: (was: 1.6.0)

> CLONE - Model export/import for spark.ml: AFTSurvivalRegression and 
> IsotonicRegression
> --
>
> Key: SPARK-11846
> URL: https://issues.apache.org/jira/browse/SPARK-11846
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11845:


Assignee: Tathagata Das  (was: Apache Spark)

> Add unit tests to verify correct checkpointing of TrackStateRDD
> ---
>
> Key: SPARK-11845
> URL: https://issues.apache.org/jira/browse/SPARK-11845
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012738#comment-15012738
 ] 

Apache Spark commented on SPARK-11845:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/9831

> Add unit tests to verify correct checkpointing of TrackStateRDD
> ---
>
> Key: SPARK-11845
> URL: https://issues.apache.org/jira/browse/SPARK-11845
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11845:


Assignee: Apache Spark  (was: Tathagata Das)

> Add unit tests to verify correct checkpointing of TrackStateRDD
> ---
>
> Key: SPARK-11845
> URL: https://issues.apache.org/jira/browse/SPARK-11845
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD

2015-11-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-11845:
--
Target Version/s: 1.6.0

> Add unit tests to verify correct checkpointing of TrackStateRDD
> ---
>
> Key: SPARK-11845
> URL: https://issues.apache.org/jira/browse/SPARK-11845
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD

2015-11-18 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-11845:
-

 Summary: Add unit tests to verify correct checkpointing of 
TrackStateRDD
 Key: SPARK-11845
 URL: https://issues.apache.org/jira/browse/SPARK-11845
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD

2015-11-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-11845:
--
Issue Type: Test  (was: Bug)

> Add unit tests to verify correct checkpointing of TrackStateRDD
> ---
>
> Key: SPARK-11845
> URL: https://issues.apache.org/jira/browse/SPARK-11845
> Project: Spark
>  Issue Type: Test
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11838) Spark SQL query fragment RDD reuse

2015-11-18 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012724#comment-15012724
 ] 

Mark Hamstra commented on SPARK-11838:
--

One significant difference between this and CacheManager is that what is 
proposed here is caching and reuse of the RDD itself, not the blocks of data 
computed by that RDD.  As Mikhail noted, that can avoid significant amounts of 
duplicate computation even when nothing is explicitly persisted/cached.

> Spark SQL query fragment RDD reuse
> --
>
> Key: SPARK-11838
> URL: https://issues.apache.org/jira/browse/SPARK-11838
> Project: Spark
>  Issue Type: Improvement
>Reporter: Mikhail Bautin
>
> With many analytical Spark SQL workloads against slowly changing tables, 
> successive queries frequently share fragments that produce the same result. 
> Instead of re-computing those fragments for every query, it makes sense to 
> detect similar fragments and substitute RDDs previously created for matching 
> SparkPlan fragments into every new SparkPlan at the execution time whenever 
> possible. Even if no RDDs are persist()-ed to memory/disk/off-heap memory, 
> many stages can still be skipped due to map output files being present on 
> executor nodes.
> The implementation involves the following steps:
> (1) Logical plan "canonicalization". 
> Logical plans mapping to the same "canonical" logical plan should always 
> produce the same results (except for possible output column reordering), 
> although the inverse statement won't always be true. 
>   - Re-mapping expression ids to "canonical expression ids" (successively 
> increasing numbers always starting with 1).
>   - Eliminating alias names that are unimportant after analysis completion. 
> Only the names that are necessary to determine the Hive table columns to be 
> scanned are retained.
>   - Reordering columns in projections, grouping/aggregation expressions, etc. 
> This can be done e.g. by using the string representation as a sort key. Union 
> inputs always have to be reordered the same way.
>   - Tree traversal has to happen starting from leaves and progressing towards 
> the root, because we need to already have identified canonical expression ids 
> for children of a node before we can come up with sort keys that would allow 
> to reorder expressions in a node deterministically. This is a bit more 
> complicated for Union nodes.
>   - Special handling for MetastoreRelations. We replace MetastoreRelation 
> with a special class CanonicalMetastoreRelation that uses attributes and 
> partitionKeys as part of its equals() and hashCode() implementation, but the 
> visible attributes and partitionKeys are restricted to expression ids that 
> the rest of the query actually needs from that MetastoreRelation.
> An example of logical plans and corresponding canonical logical plans: 
> https://gist.githubusercontent.com/mbautin/ef1317b341211d9606cf/raw
> (2) Tracking LogicalPlan fragments corresponding to SparkPlan fragments. When 
> generating a SparkPlan, we keep an optional reference to a LogicalPlan 
> instance in every node. This allows us to populate the cache with mappings 
> from canonical logical plans of query fragments to the corresponding RDDs 
> generated as part of query execution. Note that there is no new work 
> necessary to generate the RDDs, we are merely utilizing the RDDs that would 
> have been produced as part of SparkPlan execution anyway.
> (3) SparkPlan fragment substitution. After generating a SparkPlan and before 
> calling prepare() or execute() on it, we check if any of its nodes have an 
> associated LogicalPlan that maps to a canonical logical plan matching a cache 
> entry. If so, we substitute a PhysicalRDD (or a new class UnsafePhysicalRDD 
> wrapping an RDD of UnsafeRow) scanning the previously created RDD instead of 
> the current query fragment. If the expected column order differs from what 
> the current SparkPlan fragment produces, we add a projection to reorder the 
> columns. We also add safe/unsafe row conversions as necessary to match the 
> row type that is expected by the parent of the current SparkPlan fragment.
> (4) The execute() method of SparkPlan also needs to perform the cache lookup 
> and substitution described above before producing a new RDD for the current 
> SparkPlan node. The "loading cache" pattern (e.g. as implemented in Guava) 
> allows to reuse query fragments between simultaneously submitted queries: 
> whichever query runs execute() for a particular fragment's canonical logical 
> plan starts producing an RDD first, and if another query has a fragment with 
> the same canonical logical plan, it waits for the RDD to be produced by the 
> first query and inserts it in its SparkPlan instead.
> This kind of query fragment caching will mostly be useful 

[jira] [Created] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2015-11-18 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11844:


 Summary: can not read class org.apache.parquet.format.PageHeader: 
don't know what type: 13
 Key: SPARK-11844
 URL: https://issues.apache.org/jira/browse/SPARK-11844
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Minor


I got the following error once when I was running a query
{code}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
don't know what type: 13
at org.apache.parquet.format.Util.read(Util.java:216)
at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
at 
org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
at 
org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
at 
org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
what type: 13
at 
parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
at 
parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
at 
org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
at 
parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
{code}
The next retry succeeded. Right now this does not seem critical, but let's 
still track it in case we see it again in the future.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11614) serde parameters should be set only when all params are ready

2015-11-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11614:
-
Assignee: Navis

> serde parameters should be set only when all params are ready
> -
>
> Key: SPARK-11614
> URL: https://issues.apache.org/jira/browse/SPARK-11614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Navis
>Assignee: Navis
>Priority: Minor
> Fix For: 1.6.0
>
>
> see HIVE-7975 and HIVE-12373
> With the changed semantics of setters on Thrift objects in Hive, a setter 
> should be called only after all parameters are ready. It is not a problem in 
> the current state, but it will become one eventually.
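A hedged illustration of the rule above, with a locally constructed Thrift SerDeInfo
standing in for the metastore object: build the complete parameter map first, then
hand it over in a single setter call.

{code}
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.SerDeInfo

val serdeInfo = new SerDeInfo()
serdeInfo.setSerializationLib("org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe")

// Collect every serde property first...
val serdeParams = Map(
  "serialization.format" -> "1",
  "field.delim" -> ","
)

// ...then set them in one call, rather than mutating the Thrift object's
// parameter map piecemeal while it may still be incomplete.
serdeInfo.setParameters(serdeParams.asJava)
{code}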



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11614) serde parameters should be set only when all params are ready

2015-11-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11614.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9580
[https://github.com/apache/spark/pull/9580]

> serde parameters should be set only when all params are ready
> -
>
> Key: SPARK-11614
> URL: https://issues.apache.org/jira/browse/SPARK-11614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Navis
>Priority: Minor
> Fix For: 1.6.0
>
>
> see HIVE-7975 and HIVE-12373
> With the changed semantics of setters on Thrift objects in Hive, a setter 
> should be called only after all parameters are ready. It is not a problem in 
> the current state, but it will become one eventually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11544) sqlContext doesn't use PathFilter

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012718#comment-15012718
 ] 

Apache Spark commented on SPARK-11544:
--

User 'dilipbiswal' has created a pull request for this issue:
https://github.com/apache/spark/pull/9830

> sqlContext doesn't use PathFilter
> -
>
> Key: SPARK-11544
> URL: https://issues.apache.org/jira/browse/SPARK-11544
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
> Environment: AWS EMR 4.1.0, Spark 1.5.0
>Reporter: Frank Dai
>Assignee: Dilip Biswal
> Fix For: 1.6.0
>
>
> When sqlContext reads JSON files, it doesn't use {{PathFilter}} in the 
> underlying SparkContext
> {code:java}
> val sc = new SparkContext(conf)
> sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", 
> classOf[TmpFileFilter], classOf[PathFilter])
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> {code}
> The definition of {{TmpFileFilter}} is:
> {code:title=TmpFileFilter.scala|borderStyle=solid}
> import org.apache.hadoop.fs.{Path, PathFilter}
> class TmpFileFilter  extends PathFilter {
>   override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp")
> }
> {code}
> When use {{sqlContext}} to read JSON files, e.g., 
> {{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark will throw out an 
> exception:
> {quote}
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
> s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp
> {quote}
> It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} cannot, which 
> causes the above exception.
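Until this is addressed, a hedged workaround sketch (not the approach taken in the
linked PR): list and filter the input files up front, then feed the filtered paths to
the JSON reader through an RDD. {{sc}}, {{sqlContext}}, {{mySchema}} and {{s3Path}} are
assumed to be defined as in the description above, and a nested directory layout would
need a glob or recursive listing instead of the flat listStatus shown here.

{code}
import org.apache.hadoop.fs.{FileSystem, Path}

// List the input directory ourselves and drop .tmp files explicitly.
val fs = FileSystem.get(new Path(s3Path).toUri, sc.hadoopConfiguration)
val inputs = fs.listStatus(new Path(s3Path))
  .map(_.getPath.toString)
  .filterNot(_.endsWith(".tmp"))

// textFile accepts a comma-separated list of paths; json(RDD[String]) then
// parses the filtered content with the provided schema.
val filtered = sc.textFile(inputs.mkString(","))
val df = sqlContext.read.schema(mySchema).json(filtered)
{code}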



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11669) Python interface to SparkR GLM module

2015-11-18 Thread Shubhanshu Mishra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shubhanshu Mishra reopened SPARK-11669:
---

What I meant when I said a Python API to GLM was that the GLM module is 
implemented in Spark and should be made part of MLlib rather than just being a 
SparkR feature. This would allow users who come from a Python statsmodels 
background to use the GLM module in their Python code as well.

I know the current GLM module is built on SparkR, but I feel it should be a 
core module with a common API across multiple languages. 

> Python interface to SparkR GLM module
> -
>
> Key: SPARK-11669
> URL: https://issues.apache.org/jira/browse/SPARK-11669
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SparkR
>Affects Versions: 1.5.0, 1.5.1
> Environment: LINUX
> MAC
> WINDOWS
>Reporter: Shubhanshu Mishra
>Priority: Minor
>  Labels: GLM, pyspark, sparkR, statistics
>
> There should be a python interface to the sparkR GLM module. Currently the 
> only python library which creates R style GLM module results in statsmodels. 
> Inspiration for the API can be taken from the following page. 
> http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012710#comment-15012710
 ] 

Apache Spark commented on SPARK-5682:
-

User 'winningsix' has created a pull request for this issue:
https://github.com/apache/spark/pull/8880

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in hadoop 2.6 which make the process of shuffle 
> data safer. This feature is necessary in spark. AES  is a specification for 
> the encryption of electronic data. There are 5 common modes in AES. CTR is 
> one of the modes. We use two codec JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec to enable spark encrypted shuffle which is also used 
> in hadoop encrypted shuffle. JceAesCtrypoCodec uses encrypted algorithms  jdk 
> provides while OpensslAesCtrCryptoCodec uses encrypted algorithms  openssl 
> provides. 
> Because ugi credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on spark-on-yarn framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11838) Spark SQL query fragment RDD reuse

2015-11-18 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012701#comment-15012701
 ] 

Michael Armbrust commented on SPARK-11838:
--

As I said to [~markhamstra] offline, my biggest questions are about how we 
expose this to users (what are the interfaces to opt in, what are the 
interfaces to invalidate, etc.).

That said, you should look at how we do in-memory caching, as it's very similar 
to what you are proposing.  Some relevant parts of the code to look at:
 - 
[sameResult|https://github.com/apache/spark/blob/67c75828ff4df2e305bdf5d6be5a11201d1da3f3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L122]
  a less ambitious version of the query subsumption calculation you describe.  
Ideally we would just improve this.
 - 
[CacheManager|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala]
 Where we do substitution into a logical plan (I think i like this better than 
mixing it into execution).
 - 
[QueryExecution|https://github.com/apache/spark/blob/67c75828ff4df2e305bdf5d6be5a11201d1da3f3/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L38]
  How it plugs into execution.
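
A hedged illustration of the sameResult check mentioned above: two analyzed plans for
the same query compare equal even though the analyzer assigned them different
expression IDs. Assumes a spark-shell style {{sqlContext}} and uses the
developer-facing {{queryExecution}} hook.

{code}
// Build the same query twice; each pass gets fresh expression IDs.
val plan1 = sqlContext.range(100).filter("id > 10").queryExecution.analyzed
val plan2 = sqlContext.range(100).filter("id > 10").queryExecution.analyzed

// sameResult canonicalizes expression IDs before comparing, so this holds
// even though the two plans are not equal node-for-node.
assert(plan1.sameResult(plan2))
{code}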

> Spark SQL query fragment RDD reuse
> --
>
> Key: SPARK-11838
> URL: https://issues.apache.org/jira/browse/SPARK-11838
> Project: Spark
>  Issue Type: Improvement
>Reporter: Mikhail Bautin
>
> With many analytical Spark SQL workloads against slowly changing tables, 
> successive queries frequently share fragments that produce the same result. 
> Instead of re-computing those fragments for every query, it makes sense to 
> detect similar fragments and substitute RDDs previously created for matching 
> SparkPlan fragments into every new SparkPlan at the execution time whenever 
> possible. Even if no RDDs are persist()-ed to memory/disk/off-heap memory, 
> many stages can still be skipped due to map output files being present on 
> executor nodes.
> The implementation involves the following steps:
> (1) Logical plan "canonicalization". 
> Logical plans mapping to the same "canonical" logical plan should always 
> produce the same results (except for possible output column reordering), 
> although the inverse statement won't always be true. 
>   - Re-mapping expression ids to "canonical expression ids" (successively 
> increasing numbers always starting with 1).
>   - Eliminating alias names that are unimportant after analysis completion. 
> Only the names that are necessary to determine the Hive table columns to be 
> scanned are retained.
>   - Reordering columns in projections, grouping/aggregation expressions, etc. 
> This can be done e.g. by using the string representation as a sort key. Union 
> inputs always have to be reordered the same way.
>   - Tree traversal has to happen starting from leaves and progressing towards 
> the root, because we need to already have identified canonical expression ids 
> for children of a node before we can come up with sort keys that would allow 
> to reorder expressions in a node deterministically. This is a bit more 
> complicated for Union nodes.
>   - Special handling for MetastoreRelations. We replace MetastoreRelation 
> with a special class CanonicalMetastoreRelation that uses attributes and 
> partitionKeys as part of its equals() and hashCode() implementation, but the 
> visible attributes and partitionKeys are restricted to expression ids that 
> the rest of the query actually needs from that MetastoreRelation.
> An example of logical plans and corresponding canonical logical plans: 
> https://gist.githubusercontent.com/mbautin/ef1317b341211d9606cf/raw
> (2) Tracking LogicalPlan fragments corresponding to SparkPlan fragments. When 
> generating a SparkPlan, we keep an optional reference to a LogicalPlan 
> instance in every node. This allows us to populate the cache with mappings 
> from canonical logical plans of query fragments to the corresponding RDDs 
> generated as part of query execution. Note that there is no new work 
> necessary to generate the RDDs, we are merely utilizing the RDDs that would 
> have been produced as part of SparkPlan execution anyway.
> (3) SparkPlan fragment substitution. After generating a SparkPlan and before 
> calling prepare() or execute() on it, we check if any of its nodes have an 
> associated LogicalPlan that maps to a canonical logical plan matching a cache 
> entry. If so, we substitute a PhysicalRDD (or a new class UnsafePhysicalRDD 
> wrapping an RDD of UnsafeRow) scanning the previously created RDD instead of 
> the current query fragment. If the expected column order differs from what 
> the current SparkPlan fragment produces, we add a projection to reorder the 
> columns. We also a

[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager

2015-11-18 Thread Nishkam Ravi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012691#comment-15012691
 ] 

Nishkam Ravi commented on SPARK-11278:
--

[~andrewor14] This was last tested on Nov 11th, which would include the commit 
you mentioned. Each node has 16 vcores and 48GB memory.

> PageRank fails with unified memory manager
> --
>
> Key: SPARK-11278
> URL: https://issues.apache.org/jira/browse/SPARK-11278
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.1
>Reporter: Nishkam Ravi
>Assignee: Andrew Or
>Priority: Critical
>
> PageRank (6-nodes, 32GB input) runs very slow and eventually fails with 
> ExecutorLostFailure. Traced it back to the 'unified memory manager' commit 
> from Oct 13th. Took a quick look at the code and couldn't see the problem 
> (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to 
> spot the problem quickly. Can be reproduced by running PageRank on a large 
> enough input dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11842) Cleanups to existing Readers and Writers

2015-11-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11842:
--
Description: 
Small cleanups to existing Readers and Writers
* Add {{repartition(1)}} to save() methods' saving of data for 
LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept
* Add Since versions for read/write methods in Pipeline, LogisticRegression
* Switch from hand-written class names in Readers to using getClass

  was:
Small cleanups to existing Readers and Writers
* Add {{repartition(1)}} to save() methods' saving of data for 
LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept
* Add Since versions for read/write methods in Pipeline, LogisticRegression


> Cleanups to existing Readers and Writers
> 
>
> Key: SPARK-11842
> URL: https://issues.apache.org/jira/browse/SPARK-11842
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Small cleanups to existing Readers and Writers
> * Add {{repartition(1)}} to save() methods' saving of data for 
> LogisticRegressionModel, LinearRegressionModel.
> * Strengthen privacy to class and companion object for Writers and Readers
> * Change LogisticRegressionSuite read/write test to fit intercept
> * Add Since versions for read/write methods in Pipeline, LogisticRegression
> * Switch from hand-written class names in Readers to using getClass



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: CrossValidator

2015-11-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6791:
-
Description: Updated to be for CrossValidator only  (was: Algorithms: 
Pipeline, CrossValidator (and associated models)

This task will block on all other subtasks for [SPARK-6725].  This task will 
also include adding export/import as a required part of the PipelineStage 
interface since meta-algorithms will depend on sub-algorithms supporting 
save/load.)

> Model export/import for spark.ml: CrossValidator
> 
>
> Key: SPARK-6791
> URL: https://issues.apache.org/jira/browse/SPARK-6791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Updated to be for CrossValidator only



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11829) Model export/import for spark.ml: estimators under ml.feature (II)

2015-11-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012676#comment-15012676
 ] 

Xiangrui Meng commented on SPARK-11829:
---

[~yanboliang] Could you help with this JIRA? I don't think we can make `RFormula` 
work in 1.6, but the rest should be very similar to SPARK-6787 and SPARK-11839.
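
A hedged sketch of the round trip these estimators would need to support once they
implement the read/write API, using PCA as the example; the path is illustrative and
the write/load methods are exactly what this sub-task would add.

{code}
import org.apache.spark.ml.feature.PCA

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)

// Save the configured estimator, read it back, and check the params survived.
val path = "/tmp/pca-estimator"
pca.write.overwrite().save(path)
val loaded = PCA.load(path)
assert(loaded.getK == pca.getK)
assert(loaded.getInputCol == pca.getInputCol)
{code}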

> Model export/import for spark.ml: estimators under ml.feature (II)
> --
>
> Key: SPARK-11829
> URL: https://issues.apache.org/jira/browse/SPARK-11829
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Add read/write support to the following estimators under spark.ml:
> * ChiSqSelector
> * PCA
> * QuantileDiscretizer
> * VectorIndexer
> * Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6791) Model export/import for spark.ml: CrossValidator

2015-11-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-6791:


Assignee: Joseph K. Bradley

> Model export/import for spark.ml: CrossValidator
> 
>
> Key: SPARK-6791
> URL: https://issues.apache.org/jira/browse/SPARK-6791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Updated to be for CrossValidator only



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11829) Model export/import for spark.ml: estimators under ml.feature (II)

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11829:
--
Description: 
Add read/write support to the following estimators and models under spark.ml:

* ChiSqSelector
* PCA
* QuantileDiscretizer
* VectorIndexer
* Word2Vec

  was:
Add read/write support to the following estimators under spark.ml:

* ChiSqSelector
* PCA
* QuantileDiscretizer
* VectorIndexer
* Word2Vec


> Model export/import for spark.ml: estimators under ml.feature (II)
> --
>
> Key: SPARK-11829
> URL: https://issues.apache.org/jira/browse/SPARK-11829
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Add read/write support to the following estimators and models under spark.ml:
> * ChiSqSelector
> * PCA
> * QuantileDiscretizer
> * VectorIndexer
> * Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: CrossValidator

2015-11-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6791:
-
Summary: Model export/import for spark.ml: CrossValidator  (was: Model 
export/import for spark.ml: meta-algorithms)

> Model export/import for spark.ml: CrossValidator
> 
>
> Key: SPARK-6791
> URL: https://issues.apache.org/jira/browse/SPARK-6791
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Algorithms: Pipeline, CrossValidator (and associated models)
> This task will block on all other subtasks for [SPARK-6725].  This task will 
> also include adding export/import as a required part of the PipelineStage 
> interface since meta-algorithms will depend on sub-algorithms supporting 
> save/load.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11816) fix some style issue in ML/MLlib examples

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11816.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9808
[https://github.com/apache/spark/pull/9808]

> fix some style issue in ML/MLlib examples
> -
>
> Key: SPARK-11816
> URL: https://issues.apache.org/jira/browse/SPARK-11816
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Trivial
> Fix For: 1.6.0
>
>
> Currently I have only fixed some obvious comment issues, such as 
> // scalastyle:off println 
> appearing at the bottom of a file.
> The style across the examples is still not quite consistent; for instance, only 
> half of the examples include a usage comment such as 
> // Example usage: ./bin/run-example mllib.FPGrowthExample \



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11816) fix some style issue in ML/MLlib examples

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11816:
--
Target Version/s: 1.6.0
 Component/s: Documentation

> fix some style issue in ML/MLlib examples
> -
>
> Key: SPARK-11816
> URL: https://issues.apache.org/jira/browse/SPARK-11816
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Trivial
>
> Currently I have only fixed some obvious comment issues, such as 
> // scalastyle:off println 
> appearing at the bottom of a file.
> The style across the examples is still not quite consistent; for instance, only 
> half of the examples include a usage comment such as 
> // Example usage: ./bin/run-example mllib.FPGrowthExample \



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11816) fix some style issue in ML/MLlib examples

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11816:
--
Assignee: yuhao yang

> fix some style issue in ML/MLlib examples
> -
>
> Key: SPARK-11816
> URL: https://issues.apache.org/jira/browse/SPARK-11816
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Trivial
>
> Currently I have only fixed some obvious comment issues, such as 
> // scalastyle:off println 
> appearing at the bottom of a file.
> The style across the examples is still not quite consistent; for instance, only 
> half of the examples include a usage comment such as 
> // Example usage: ./bin/run-example mllib.FPGrowthExample \



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11842) Cleanups to existing Readers and Writers

2015-11-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11842:
--
Target Version/s: 1.6.0

> Cleanups to existing Readers and Writers
> 
>
> Key: SPARK-11842
> URL: https://issues.apache.org/jira/browse/SPARK-11842
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Small cleanups to existing Readers and Writers
> * Add {{repartition(1)}} to save() methods' saving of data for 
> LogisticRegressionModel, LinearRegressionModel.
> * Strengthen privacy to class and companion object for Writers and Readers
> * Change LogisticRegressionSuite read/write test to fit intercept
> * Add Since versions for read/write methods in Pipeline, LogisticRegression
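As a rough illustration of the first bullet above, here is a minimal sketch (not the actual MLWriter code; the case class, method and column names are illustrative) of writing a model's parameter DataFrame as a single Parquet file by repartitioning to one partition first:

{code}
import org.apache.spark.sql.SQLContext

// Hypothetical model parameters; the real writers persist the fitted coefficients.
case class Data(intercept: Double, coefficient: Double)

def saveModelData(sqlContext: SQLContext, path: String): Unit = {
  val data = sqlContext.createDataFrame(Seq(Data(0.1, 2.0)))
  // repartition(1) makes the tiny parameter DataFrame land in a single
  // Parquet part file instead of one file per partition.
  data.repartition(1).write.parquet(path + "/data")
}
{code}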



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11842) Cleanups to existing Readers and Writers

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012667#comment-15012667
 ] 

Apache Spark commented on SPARK-11842:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/9829

> Cleanups to existing Readers and Writers
> 
>
> Key: SPARK-11842
> URL: https://issues.apache.org/jira/browse/SPARK-11842
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Small cleanups to existing Readers and Writers
> * Add {{repartition(1)}} to save() methods' saving of data for 
> LogisticRegressionModel, LinearRegressionModel.
> * Strengthen privacy to class and companion object for Writers and Readers
> * Change LogisticRegressionSuite read/write test to fit intercept
> * Add Since versions for read/write methods in Pipeline, LogisticRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11835) Add a menu to the documentation of MLlib

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11835:
--
Component/s: Documentation

> Add a menu to the documentation of MLlib
> 
>
> Key: SPARK-11835
> URL: https://issues.apache.org/jira/browse/SPARK-11835
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Affects Versions: 1.5.1
>Reporter: Tim Hunter
>Assignee: Tim Hunter
> Attachments: Screen Shot 2015-11-18 at 4.50.45 PM.png
>
>
> Right now, the table of contents gets generated on a page-by-page basis, 
> which makes it hard to navigate between different topics in a project. We 
> should make use of the empty space on the left of the documentation to put a 
> navigation menu.
> A picture is worth a thousand words:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11835) Add a menu to the documentation of MLlib

2015-11-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11835:
--
Assignee: Tim Hunter

> Add a menu to the documentation of MLlib
> 
>
> Key: SPARK-11835
> URL: https://issues.apache.org/jira/browse/SPARK-11835
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Affects Versions: 1.5.1
>Reporter: Tim Hunter
>Assignee: Tim Hunter
> Attachments: Screen Shot 2015-11-18 at 4.50.45 PM.png
>
>
> Right now, the table of contents gets generated on a page-by-page basis, 
> which makes it hard to navigate between different topics in a project. We 
> should make use of the empty space on the left of the documentation to put a 
> navigation menu.
> A picture is worth a thousand words:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11787) Speed up parquet reader for flat schemas

2015-11-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11787.
-
   Resolution: Fixed
 Assignee: Nong Li
Fix Version/s: 1.6.0

> Speed up parquet reader for flat schemas
> 
>
> Key: SPARK-11787
> URL: https://issues.apache.org/jira/browse/SPARK-11787
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 1.6.0
>
>
> Measuring the performance of running some of the TPC-DS queries, and anecdotally, 
> parquet scan and record reconstruction performance is a bottleneck.
> For simple schemas, we can do better using the lower level parquet-mr APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11842) Cleanups to existing Readers and Writers

2015-11-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11842:
--
Description: 
Small cleanups to existing Readers and Writers
* Add {{repartition(1)}} to save() methods' saving of data for 
LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept
* Add Since versions for read/write methods in Pipeline, LogisticRegression

  was:
Small cleanups to existing Readers and Writers
* Add {{repartition(1)}} to save() methods' saving of data for 
LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept


> Cleanups to existing Readers and Writers
> 
>
> Key: SPARK-11842
> URL: https://issues.apache.org/jira/browse/SPARK-11842
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> Small cleanups to existing Readers and Writers
> * Add {{repartition(1)}} to save() methods' saving of data for 
> LogisticRegressionModel, LinearRegressionModel.
> * Strengthen privacy to class and companion object for Writers and Readers
> * Change LogisticRegressionSuite read/write test to fit intercept
> * Add Since versions for read/write methods in Pipeline, LogisticRegression



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11839) Renames traits to avoid collision with java.util.* and add use default traits to simplify the impl

2015-11-18 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11839.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9827
[https://github.com/apache/spark/pull/9827]

> Renames traits to avoid collision with java.util.* and add use default traits 
> to simplify the impl
> --
>
> Key: SPARK-11839
> URL: https://issues.apache.org/jira/browse/SPARK-11839
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> This helps simplify the development.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11833) Add Java tests for Kryo/Java encoders

2015-11-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11833.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Add Java tests for Kryo/Java encoders
> -
>
> Key: SPARK-11833
> URL: https://issues.apache.org/jira/browse/SPARK-11833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE

2015-11-18 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012655#comment-15012655
 ] 

Xin Wu commented on SPARK-9761:
---

One thing I notice is that if I create the table explicitly before letting the 
dataframe write into it, describe table will show the column added by ALTER TABLE, 
even though I created the table stored as parquet and verified that the saved data 
file is in parquet format.
{code}
import org.apache.spark.sql.SaveMode

hiveContext.sql("drop table Orders")
val df = hiveContext.read.json("/home/xwu0226/spark-tables/Orders.json")
df.show()
hiveContext.sql("create table orders(customerID int, orderID int) stored as parquet")
df.write.mode(SaveMode.Append).saveAsTable("Orders")
hiveContext.sql("ALTER TABLE Orders add columns (z string)")
hiveContext.sql("describe extended Orders").show
{code}

output:
{code}
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|customerid|      int|       |
|   orderid|      int|       |
|         z|   string|       |
+----------+---------+-------+
{code}

So with the explicit creation of the table, describe seems to use schema merging, 
while the other case does not merge the schema.

The "spark.sql.sources.provider" property is defined for the explicitly created table, 
so the lookupRelation logic in HiveMetastoreCatalog.scala looks the table up in 
cachedDataSourceTables; when the relation is not found there, it is reloaded from the 
parquet files, so the column schema is built from the parquet content. It would be 
nice if the schema were merged when constructing this new relation before handing it 
back to the caller. Looking deeper into this.
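For contrast, here is a hedged sketch of the other ("implicit") path from the original report, where saveAsTable creates the table itself and the column added by ALTER TABLE reportedly does not show up in describe; the table name Orders2 is illustrative:

{code}
val df2 = hiveContext.read.json("/home/xwu0226/spark-tables/Orders.json")
// No explicit CREATE TABLE here: saveAsTable creates a data source table
// (with the spark.sql.sources.provider property set) on its own.
df2.write.saveAsTable("Orders2")
hiveContext.sql("ALTER TABLE Orders2 add columns (z string)")
// The added column 'z' is expected to be missing from this output.
hiveContext.sql("describe extended Orders2").show
{code}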




> Inconsistent metadata handling with ALTER TABLE
> ---
>
> Key: SPARK-9761
> URL: https://issues.apache.org/jira/browse/SPARK-9761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: hive, sql
>
> Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. 
> The table in question was created with {{HiveContext.read.json()}}.
> Steps:
> # {{alter table dimension_components add columns (z string);}} succeeds.
> # {{describe dimension_components;}} does not show the new column, even after 
> restarting spark-sql.
> # A second {{alter table dimension_components add columns (z string);}} fails 
> with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Duplicate column name: z
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE

2015-11-18 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012655#comment-15012655
 ] 

Xin Wu edited comment on SPARK-9761 at 11/19/15 2:31 AM:
-

One thing I notice is that if I create the table explicitly before letting the 
dataframe write into it, describe table will show the column added by ALTER TABLE, 
even though I created the table stored as parquet and verified that the saved data 
file is in parquet format.
{code}
import org.apache.spark.sql.SaveMode

hiveContext.sql("drop table Orders")
val df = hiveContext.read.json("/home/xwu0226/spark-tables/Orders.json")
df.show()
hiveContext.sql("create table orders(customerID int, orderID int) stored as parquet")
df.write.mode(SaveMode.Append).saveAsTable("Orders")
hiveContext.sql("ALTER TABLE Orders add columns (z string)")
hiveContext.sql("describe extended Orders").show
{code}

output:
{code}
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|customerid|      int|       |
|   orderid|      int|       |
|         z|   string|       |
+----------+---------+-------+
{code}

So with the explicit creation of the table, describe seems to use schema merging, 
while the other case does not merge the schema.

The "spark.sql.sources.provider" property is defined for the implicitly created table, 
so the lookupRelation logic in HiveMetastoreCatalog.scala looks the table up in 
cachedDataSourceTables; when the relation is not found there, it is reloaded from the 
parquet files, so the column schema is built from the parquet content. It would be 
nice if the schema were merged when constructing this new relation before handing it 
back to the caller. Looking deeper into this.





was (Author: xwu0226):
One thing I notice is that if I create the table explicitly before letting the 
dataframe write into it, describe table will show the column added by ALTER TABLE, 
even though I created the table stored as parquet and verified that the saved data 
file is in parquet format.
{code}
import org.apache.spark.sql.SaveMode

hiveContext.sql("drop table Orders")
val df = hiveContext.read.json("/home/xwu0226/spark-tables/Orders.json")
df.show()
hiveContext.sql("create table orders(customerID int, orderID int) stored as parquet")
df.write.mode(SaveMode.Append).saveAsTable("Orders")
hiveContext.sql("ALTER TABLE Orders add columns (z string)")
hiveContext.sql("describe extended Orders").show
{code}

output:
{code}
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|customerid|      int|       |
|   orderid|      int|       |
|         z|   string|       |
+----------+---------+-------+
{code}

So with the explicit creation of the table, describe seems to use schema merging, 
while the other case does not merge the schema.

The "spark.sql.sources.provider" property is defined for the explicitly created table, 
so the lookupRelation logic in HiveMetastoreCatalog.scala looks the table up in 
cachedDataSourceTables; when the relation is not found there, it is reloaded from the 
parquet files, so the column schema is built from the parquet content. It would be 
nice if the schema were merged when constructing this new relation before handing it 
back to the caller. Looking deeper into this.




> Inconsistent metadata handling with ALTER TABLE
> ---
>
> Key: SPARK-9761
> URL: https://issues.apache.org/jira/browse/SPARK-9761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
> Environment: Ubuntu on AWS
>Reporter: Simeon Simeonov
>  Labels: hive, sql
>
> Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. 
> The table in question was created with {{HiveContext.read.json()}}.
> Steps:
> # {{alter table dimension_components add columns (z string);}} succeeds.
> # {{describe dimension_components;}} does not show the new column, even after 
> restarting spark-sql.
> # A second {{alter table dimension_components add columns (z string);}} fails 
> with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: 
> Duplicate column name: z
> Full spark-sql output 
> [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11843) Isolate staging directory across applications on same YARN cluster

2015-11-18 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012653#comment-15012653
 ] 

Marcelo Vanzin commented on SPARK-11843:


Client.scala appends the application ID to ".sparkStaging" to form the actual 
name of the staging directory. Where have you seen two applications using the 
same staging dir?

> Isolate staging directory across applications on same YARN cluster
> --
>
> Key: SPARK-11843
> URL: https://issues.apache.org/jira/browse/SPARK-11843
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Andrew Or
>Priority: Minor
>
> If multiple clients share the same YARN cluster and file system they may end 
> up using the same `.sparkStaging` directory. This may be a problem if their 
> jars are called something similar, for instance. It would be easier to 
> enforce isolation for both security and user experience if the staging 
> directories are isolated. We can either:
> (1) allow users to configure the directory name
> (2) add an identifier to the directory name, which I prefer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11843) Isolate staging directory across applications on same YARN cluster

2015-11-18 Thread Andrew Or (JIRA)
Andrew Or created SPARK-11843:
-

 Summary: Isolate staging directory across applications on same 
YARN cluster
 Key: SPARK-11843
 URL: https://issues.apache.org/jira/browse/SPARK-11843
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Andrew Or
Priority: Minor


If multiple clients share the same YARN cluster and file system they may end up 
using the same `.sparkStaging` directory. This may be a problem if their jars 
are called something similar, for instance. It would be easier to enforce 
isolation for both security and user experience if the staging directories are 
isolated. We can either:

(1) allow users to configure the directory name
(2) add an identifier to the directory name, which I prefer
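
As a rough sketch of option (2), and not Spark's actual Client.scala logic (which, per the comment above, already appends the application ID), the staging path could be derived from a per-application identifier so that two applications never collide; the helper name below is hypothetical:

{code}
import org.apache.hadoop.fs.Path

// Hypothetical helper: build a staging directory that embeds the application ID.
def appStagingDir(stagingBase: Path, appId: String): Path =
  new Path(stagingBase, s".sparkStaging/$appId")
{code}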



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7286) Precedence of operator not behaving properly

2015-11-18 Thread Jakob Odersky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012643#comment-15012643
 ] 

Jakob Odersky commented on SPARK-7286:
--

Going through the code, I saw that catalyst also defines !== in its DSL, so it 
seems this operator has quite widespread usage.
Would deprecating it in favor of something else be a viable option?

> Precedence of operator not behaving properly
> 
>
> Key: SPARK-7286
> URL: https://issues.apache.org/jira/browse/SPARK-7286
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: Linux
>Reporter: DevilJetha
>Priority: Critical
>
> The precedence of the operators ( especially with !== and && ) in Dataframe 
> Columns seems to be messed up.
> Example Snippet
> .where( $"col1" === "val1" && ($"col2"  !== "val2")  ) works fine.
> whereas .where( $"col1" === "val1" && $"col2"  !== "val2"  )
> evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2"
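
For context, this is most likely a consequence of Scala's operator precedence rules rather than a DataFrame-specific bug: an operator that ends in '=' and is not one of ==, !=, <=, >= is parsed as an assignment-like operator with the lowest precedence, so !== binds more loosely than &&. A minimal sketch of the workaround, assuming the usual sqlContext.implicits._ import and a DataFrame df with columns col1 and col2:

{code}
// Parenthesizing the !== comparison restores the intended grouping.
val filtered = df.where($"col1" === "val1" && ($"col2" !== "val2"))

// Without the parentheses, the whole conjunction becomes the left operand of
// !==, i.e. ($"col1" === "val1" && $"col2") !== "val2", as reported above.
{code}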



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager

2015-11-18 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012638#comment-15012638
 ] 

Andrew Or commented on SPARK-11278:
---

Also, when you say 6 nodes, what kind of nodes are they? How much memory and how 
many cores per node?

> PageRank fails with unified memory manager
> --
>
> Key: SPARK-11278
> URL: https://issues.apache.org/jira/browse/SPARK-11278
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.1
>Reporter: Nishkam Ravi
>Priority: Critical
>
> PageRank (6-nodes, 32GB input) runs very slow and eventually fails with 
> ExecutorLostFailure. Traced it back to the 'unified memory manager' commit 
> from Oct 13th. Took a quick look at the code and couldn't see the problem 
> (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to 
> spot the problem quickly. Can be reproduced by running PageRank on a large 
> enough input dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11278) PageRank fails with unified memory manager

2015-11-18 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or reassigned SPARK-11278:
-

Assignee: Andrew Or

> PageRank fails with unified memory manager
> --
>
> Key: SPARK-11278
> URL: https://issues.apache.org/jira/browse/SPARK-11278
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.1
>Reporter: Nishkam Ravi
>Assignee: Andrew Or
>Priority: Critical
>
> PageRank (6-nodes, 32GB input) runs very slow and eventually fails with 
> ExecutorLostFailure. Traced it back to the 'unified memory manager' commit 
> from Oct 13th. Took a quick look at the code and couldn't see the problem 
> (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to 
> spot the problem quickly. Can be reproduced by running PageRank on a large 
> enough input dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager

2015-11-18 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012632#comment-15012632
 ] 

Andrew Or commented on SPARK-11278:
---

[~nravi] can you try again with the latest 1.6 branch to see if this is still 
an issue? I wonder how this is different with 
https://github.com/apache/spark/commit/56419cf11f769c80f391b45dc41b3c7101cc5ff4.

> PageRank fails with unified memory manager
> --
>
> Key: SPARK-11278
> URL: https://issues.apache.org/jira/browse/SPARK-11278
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.1
>Reporter: Nishkam Ravi
>Priority: Critical
>
> PageRank (6-nodes, 32GB input) runs very slow and eventually fails with 
> ExecutorLostFailure. Traced it back to the 'unified memory manager' commit 
> from Oct 13th. Took a quick look at the code and couldn't see the problem 
> (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to 
> spot the problem quickly. Can be reproduced by running PageRank on a large 
> enough input dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11842) Cleanups to existing Readers and Writers

2015-11-18 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11842:
-

 Summary: Cleanups to existing Readers and Writers
 Key: SPARK-11842
 URL: https://issues.apache.org/jira/browse/SPARK-11842
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


Small cleanups to existing Readers and Writers
* Add {{repartition(1)}} to save() methods' saving of data for 
LogisticRegressionModel, LinearRegressionModel.
* Strengthen privacy to class and companion object for Writers and Readers
* Change LogisticRegressionSuite read/write test to fit intercept



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow

2015-11-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012618#comment-15012618
 ] 

Josh Rosen commented on SPARK-11649:


[~vanzin], we actually _did_ see this fail in the master builds, too, and it's 
also really slow there, so this change is also relevant for master and 1.6.

> "SparkListenerSuite.onTaskGettingResult() called when result fetched 
> remotely" test is very slow
> 
>
> Key: SPARK-11649
> URL: https://issues.apache.org/jira/browse/SPARK-11649
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.5.3, 1.6.0
>
>
> The SparkListenerSuite "onTaskGettingResult() called when result fetched 
> remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which 
> seems excessively slow; we should see if there's an easy way to speed this up:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11841) None of start-all.sh, start-master.sh or start-slaves.sh takes -m, -c or -d configuration options as per the document

2015-11-18 Thread Xiangyu Li (JIRA)
Xiangyu Li created SPARK-11841:
--

 Summary: None of start-all.sh, start-master.sh or start-slaves.sh 
takes -m, -c or -d configuration options as per the document
 Key: SPARK-11841
 URL: https://issues.apache.org/jira/browse/SPARK-11841
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2
Reporter: Xiangyu Li


I was trying to set up Spark Standalone Mode following the tutorial at 
http://spark.apache.org/docs/latest/spark-standalone.html.

The tutorial says that we can pass "-c CORES" to the worker to set the total 
number of CPU cores allowed. But none of the start-all.sh, start-master.sh or 
start-slaves.sh would take those options as arguments.

start-all.sh and start-slaves.sh simply skip these options, while 
start-master.sh only accepts -h, -i, -p and --properties-file, according to 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala

So the only way I can limit the number of cores for an application at the moment 
is to set SPARK_WORKER_CORES in ${SPARK_HOME}/conf/spark-env.sh and then run 
start-all.sh.

So I think this is an error in either the documentation or the program.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11840:


Assignee: Apache Spark  (was: Yin Huai)

> Restore the 1.5's behavior of planning a single distinct aggregation.
> -
>
> Key: SPARK-11840
> URL: https://issues.apache.org/jira/browse/SPARK-11840
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> The impact of this change is for a query that has a single distinct column 
> and does not have any grouping expression like
> {{SELECT COUNT(DISTINCT a) FROM table}}
> The plan will be changed from
> {code}
> AGG-2 (count distinct)
>   Shuffle to a single reducer
> Partial-AGG-2 (count distinct)
>   AGG-1 (grouping on a)
> Shuffle by a
>   Partial-AGG-1 (grouping on 1)
> {code}
> to the following one (1.5 uses this)
> {code}
> AGG-2
>   AGG-1 (grouping on a)
> Shuffle to a single reducer
>   Partial-AGG-1(grouping on a)
> {code}
> The first plan is more robust. However, to better benchmark the impact of 
> this change, we should use 1.5's plan and use the conf of 
> {{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.

2015-11-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11840:


Assignee: Yin Huai  (was: Apache Spark)

> Restore the 1.5's behavior of planning a single distinct aggregation.
> -
>
> Key: SPARK-11840
> URL: https://issues.apache.org/jira/browse/SPARK-11840
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The impact of this change is for a query that has a single distinct column 
> and does not have any grouping expression like
> {{SELECT COUNT(DISTINCT a) FROM table}}
> The plan will be changed from
> {code}
> AGG-2 (count distinct)
>   Shuffle to a single reducer
> Partial-AGG-2 (count distinct)
>   AGG-1 (grouping on a)
> Shuffle by a
>   Partial-AGG-1 (grouping on 1)
> {code}
> to the following one (1.5 uses this)
> {code}
> AGG-2
>   AGG-1 (grouping on a)
> Shuffle to a single reducer
>   Partial-AGG-1(grouping on a)
> {code}
> The first plan is more robust. However, to better benchmark the impact of 
> this change, we should use 1.5's plan and use the conf of 
> {{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.

2015-11-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012607#comment-15012607
 ] 

Apache Spark commented on SPARK-11840:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/9828

> Restore the 1.5's behavior of planning a single distinct aggregation.
> -
>
> Key: SPARK-11840
> URL: https://issues.apache.org/jira/browse/SPARK-11840
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> The impact of this change is for a query that has a single distinct column 
> and does not have any grouping expression like
> {{SELECT COUNT(DISTINCT a) FROM table}}
> The plan will be changed from
> {code}
> AGG-2 (count distinct)
>   Shuffle to a single reducer
> Partial-AGG-2 (count distinct)
>   AGG-1 (grouping on a)
> Shuffle by a
>   Partial-AGG-1 (grouping on 1)
> {code}
> to the following one (1.5 uses this)
> {code}
> AGG-2
>   AGG-1 (grouping on a)
> Shuffle to a single reducer
>   Partial-AGG-1(grouping on a)
> {code}
> The first plan is more robust. However, to better benchmark the impact of 
> this change, we should use 1.5's plan and use the conf of 
> {{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.

2015-11-18 Thread Yin Huai (JIRA)
Yin Huai created SPARK-11840:


 Summary: Restore the 1.5's behavior of planning a single distinct 
aggregation.
 Key: SPARK-11840
 URL: https://issues.apache.org/jira/browse/SPARK-11840
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai


The impact of this change is for a query that has a single distinct column and 
does not have any grouping expression like
{{SELECT COUNT(DISTINCT a) FROM table}}
The plan will be changed from

{code}
AGG-2 (count distinct)
  Shuffle to a single reducer
Partial-AGG-2 (count distinct)
  AGG-1 (grouping on a)
Shuffle by a
  Partial-AGG-1 (grouping on 1)
{code}
to the following one (1.5 uses this)

{code}
AGG-2
  AGG-1 (grouping on a)
Shuffle to a single reducer
  Partial-AGG-1(grouping on a)
{code}
The first plan is more robust. However, to better benchmark the impact of this 
change, we should use 1.5's plan and use the conf of 
{{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan.
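
A minimal usage sketch, assuming a SQLContext named sqlContext is in scope: toggle the conf named above and compare the physical plans printed by explain().

{code}
// "table" below is just the placeholder relation name used in the description.
sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "true")
sqlContext.sql("SELECT COUNT(DISTINCT a) FROM table").explain()

sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", "false")
sqlContext.sql("SELECT COUNT(DISTINCT a) FROM table").explain()
{code}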



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow

2015-11-18 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012601#comment-15012601
 ] 

Andrew Or commented on SPARK-11649:
---

I back ported it into 1.5.

> "SparkListenerSuite.onTaskGettingResult() called when result fetched 
> remotely" test is very slow
> 
>
> Key: SPARK-11649
> URL: https://issues.apache.org/jira/browse/SPARK-11649
> Project: Spark
>  Issue Type: Sub-task
>  Components: Tests
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.5.3, 1.6.0
>
>
> The SparkListenerSuite "onTaskGettingResult() called when result fetched 
> remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which 
> seems excessively slow; we should see if there's an easy way to speed this up:
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


