[jira] [Assigned] (SPARK-11846) Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11846: Assignee: Apache Spark (was: Xusen Yin) > Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression > -- > > Key: SPARK-11846 > URL: https://issues.apache.org/jira/browse/SPARK-11846 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Add read/write support to AFTSurvivalRegression and IsotonicRegression using > LinearRegression read/write as reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
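For context, a minimal sketch of the usage this issue adds, assuming the new models follow the same MLWritable/MLReadable pattern that LinearRegression already uses (the sqlContext and output path below are placeholders, and the tiny dataset is only illustrative):

{code}
import org.apache.spark.ml.regression.{AFTSurvivalRegression, AFTSurvivalRegressionModel}
import org.apache.spark.mllib.linalg.Vectors

// Tiny survival dataset: (label, censor, features).
val training = sqlContext.createDataFrame(Seq(
  (1.218, 1.0, Vectors.dense(1.560, -0.605)),
  (2.949, 0.0, Vectors.dense(0.346, 2.158)),
  (3.627, 0.0, Vectors.dense(1.380, 0.231)),
  (0.273, 1.0, Vectors.dense(0.520, 1.151))
)).toDF("label", "censor", "features")

val model = new AFTSurvivalRegression().fit(training)

// Persist the fitted model (params + coefficients) and load it back.
model.write.overwrite().save("/tmp/aft-model")
val restored = AFTSurvivalRegressionModel.load("/tmp/aft-model")
{code}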
[jira] [Resolved] (SPARK-11747) Can not specify input path in python logistic_regression example under ml
[ https://issues.apache.org/jira/browse/SPARK-11747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11747. --- Resolution: Won't Fix > Can not specify input path in python logistic_regression example under ml > - > > Key: SPARK-11747 > URL: https://issues.apache.org/jira/browse/SPARK-11747 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Jeff Zhang >Priority: Minor > > Not sure why it is hard-coded; it would be nice to allow the user to specify the input path. > {code} > # Load and parse the data file into a dataframe. > df = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] hujiayin updated SPARK-4036: Comment: was deleted (was: ok) > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013095#comment-15013095 ] hujiayin commented on SPARK-4036: - ok > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013097#comment-15013097 ] hujiayin commented on SPARK-4036: - ok > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013098#comment-15013098 ] hujiayin commented on SPARK-4036: - ok > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013096#comment-15013096 ] hujiayin commented on SPARK-4036: - ok > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS
[ https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013088#comment-15013088 ] Virgil Palanciuc commented on SPARK-10109: -- Haven't tried it lately - just ran it with parallel execution disabled. Like I said - root cause seems to be the fact that I was writing 2 dataframes simultaneously in the same destination (albeit with a partitioning scheme that made the destination non-overlapping). I partition by the columns "dpid" and "pid" and if I process 2 PIDs in parallel (on different threads), and they both have the same DPID (e.g I simultaneously write to /dpid=1/pid=1/ and to /dpid=1/pid=2/ ), I get this problem. This seems to be caused by the fact that, on 2 different threads, I do "df.write.partitionBy().parquet('') " - i.e. I write 2 DataFrames simultaneously "to the same destination" (due to partitioning, the actual files should never overlap; but it still seems to be a problem). Not sure if this was really a Spark bug or it's an application problem - if you think Spark should've worked in this scenario, let me know and I'll retry it. But it kinda' feels it was really an application bug (bad assumption on my part about how writing works) > NPE when saving Parquet To HDFS > --- > > Key: SPARK-10109 > URL: https://issues.apache.org/jira/browse/SPARK-10109 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Sparc-ec2, standalone cluster on amazon >Reporter: Virgil Palanciuc > > Very simple code, trying to save a dataframe > I get this in the driver > {quote} > 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID > 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) > and (not for that task): > 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID > 5607, 172.yy.yy.yy): java.lang.NullPointerException > at > parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) > at > parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) > at > parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) > at > org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536) > at > scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107) > at > scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536) > at > org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229) > at > org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) > at org.apache.spark.scheduler.Task.run(Task.scala:70) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > {quote} > I get this in the executor log: > {quote} > 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): > No lease on > /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet > File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not > have any open files. > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396) > at > org.apache.hadoop.hdfs.server.namenode.FSNames
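To make the scenario described in the comment concrete, a rough sketch of the write pattern in question: two threads appending to the same base path, with a partitioning scheme that keeps the data files disjoint. The path and column values are made up for illustration; this mirrors the reporter's description rather than being a reproduction recipe:

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Two DataFrames that land in disjoint partition directories of the same
// base path: dpid=1/pid=1 and dpid=1/pid=2.
val df1 = sqlContext.createDataFrame(Seq((1, 1, "a"))).toDF("dpid", "pid", "value")
val df2 = sqlContext.createDataFrame(Seq((1, 2, "b"))).toDF("dpid", "pid", "value")

// Each thread writes "to the same destination"; the data files never overlap,
// but both jobs share the destination's _temporary directory, which is where
// the LeaseExpiredException in the quoted executor log points.
val writes = Seq(df1, df2).map { df =>
  Future {
    df.write.partitionBy("dpid", "pid").mode("append").parquet("/tmp/events")
  }
}
writes.foreach(w => Await.result(w, Duration.Inf))
{code}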
[jira] [Updated] (SPARK-11339) Fix and document the list of functions in R base package that are masked by functions with same name in SparkR
[ https://issues.apache.org/jira/browse/SPARK-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-11339: -- Assignee: Felix Cheung > Fix and document the list of functions in R base package that are masked by > functions with same name in SparkR > -- > > Key: SPARK-11339 > URL: https://issues.apache.org/jira/browse/SPARK-11339 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Sun Rui >Assignee: Felix Cheung > Fix For: 1.6.0 > > > There may be name conflicts between API functions of SparkR and functions > exposed in R base package (or other popular 3rd-party packages). If some > functions in name conflict are very popular and frequently used in R base > package, we may rename the functions in SparkR to avoid conflict and > in-convenience to R users. Otherwise, we keep the name of functions in > SparkR, so the functions of same name in R base package are masked. > We'd better have a list of such functions of name conflict to reduce > confusion of users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11339) Fix and document the list of functions in R base package that are masked by functions with same name in SparkR
[ https://issues.apache.org/jira/browse/SPARK-11339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-11339. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9785 [https://github.com/apache/spark/pull/9785] > Fix and document the list of functions in R base package that are masked by > functions with same name in SparkR > -- > > Key: SPARK-11339 > URL: https://issues.apache.org/jira/browse/SPARK-11339 > Project: Spark > Issue Type: Documentation > Components: SparkR >Reporter: Sun Rui > Fix For: 1.6.0 > > > There may be name conflicts between API functions of SparkR and functions > exposed in R base package (or other popular 3rd-party packages). If some > functions in name conflict are very popular and frequently used in R base > package, we may rename the functions in SparkR to avoid conflict and > in-convenience to R users. Otherwise, we keep the name of functions in > SparkR, so the functions of same name in R base package are masked. > We'd better have a list of such functions of name conflict to reduce > confusion of users. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013068#comment-15013068 ] Kai Sasaki edited comment on SPARK-4036 at 11/19/15 7:32 AM: - [~hujiayin] I'm sorry for being late for response. I haven't yet create any patch. So never mind to work on this JIRA instead of me. Anyway, can I give a check and comment to your patch? was (Author: lewuathe): [~hujiayin] I'm sorry for being late for response. I haven't yet create any patch. So never mind to work in this JIRA instead of me. Anyway, can I give a check and comment to your patch? > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013068#comment-15013068 ] Kai Sasaki commented on SPARK-4036: --- [~hujiayin] I'm sorry for the late response. I haven't created any patch yet, so please feel free to work on this JIRA instead of me. In any case, may I review and comment on your patch? > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11657) Bad Dataframe data read from parquet
[ https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013064#comment-15013064 ] Virgil Palanciuc commented on SPARK-11657: -- Yes, I think I was using Kryo. Thanks! > Bad Dataframe data read from parquet > > > Key: SPARK-11657 > URL: https://issues.apache.org/jira/browse/SPARK-11657 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.1, 1.5.2 > Environment: EMR (yarn) >Reporter: Virgil Palanciuc >Assignee: Davies Liu >Priority: Critical > Fix For: 1.5.3, 1.6.0 > > Attachments: sample.tgz > > > I get strange behaviour when reading parquet data: > {code} > scala> val data = sqlContext.read.parquet("hdfs:///sample") > data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: > string, clusterData: array, dpid: int] > scala> data.take(1)/// this returns garbage > res0: Array[org.apache.spark.sql.Row] = > Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813]) > > scala> data.collect()/// this works > res1: Array[org.apache.spark.sql.Row] = > Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813]) > {code} > I've attached the "hdfs:///sample" directory to this bug report -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15010353#comment-15010353 ] hujiayin edited comment on SPARK-4036 at 11/19/15 7:09 AM: --- Hi Sasaki, I'm not sure if you worked on it as the jira is still open. If you have a PR, you could close my PR https://github.com/apache/spark/pull/9794 I referenced the other document besides Sasaki's design for the implementation. http://www.cs.utah.edu/~piyush/teaching/structured_prediction.pdf was (Author: hujiayin): Hi Sasaki, I'm not sure if you worked on it as the jira is still open. If you have a PR, you could close my PR https://github.com/apache/spark/pull/9794 > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings
[ https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7499: --- Assignee: Apache Spark > Investigate how to specify columns in SparkR without $ or strings > - > > Key: SPARK-7499 > URL: https://issues.apache.org/jira/browse/SPARK-7499 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman >Assignee: Apache Spark > > Right now in SparkR we need to specify the columns used using `$` or strings. > For example to run select we would do > {code} > df1 <- select(df, df$age > 10) > {code} > It would be good to infer the set of columns in a dataframe automatically and > resolve symbols for column names. For example > {code} > df1 <- select(df, age > 10) > {code} > One way to do this is to build an environment with all the column names to > column handles and then use `substitute(arg, env = columnNameEnv)` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings
[ https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7499: --- Assignee: (was: Apache Spark) > Investigate how to specify columns in SparkR without $ or strings > - > > Key: SPARK-7499 > URL: https://issues.apache.org/jira/browse/SPARK-7499 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman > > Right now in SparkR we need to specify the columns used using `$` or strings. > For example to run select we would do > {code} > df1 <- select(df, df$age > 10) > {code} > It would be good to infer the set of columns in a dataframe automatically and > resolve symbols for column names. For example > {code} > df1 <- select(df, age > 10) > {code} > One way to do this is to build an environment with all the column names to > column handles and then use `substitute(arg, env = columnNameEnv)` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7499) Investigate how to specify columns in SparkR without $ or strings
[ https://issues.apache.org/jira/browse/SPARK-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013045#comment-15013045 ] Apache Spark commented on SPARK-7499: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/9835 > Investigate how to specify columns in SparkR without $ or strings > - > > Key: SPARK-7499 > URL: https://issues.apache.org/jira/browse/SPARK-7499 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman > > Right now in SparkR we need to specify the columns used using `$` or strings. > For example to run select we would do > {code} > df1 <- select(df, df$age > 10) > {code} > It would be good to infer the set of columns in a dataframe automatically and > resolve symbols for column names. For example > {code} > df1 <- select(df, age > 10) > {code} > One way to do this is to build an environment with all the column names to > column handles and then use `substitute(arg, env = columnNameEnv)` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11021) SparkSQL cli throws exception when using with Hive 0.12 metastore in spark-1.5.0 version
[ https://issues.apache.org/jira/browse/SPARK-11021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013046#comment-15013046 ] bit1129 commented on SPARK-11021: - I encounter exactly the same issue with Spark 1.5.0 + 0.14.0, The problem is gone after I added the configuration as Jeff suggested,Thanks Jeff! > SparkSQL cli throws exception when using with Hive 0.12 metastore in > spark-1.5.0 version > > > Key: SPARK-11021 > URL: https://issues.apache.org/jira/browse/SPARK-11021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: iward > > After upgrade spark from 1.4.1 to 1.5.0,I get the following exception when I > set set the following properties in spark-defaults.conf: > {noformat} > spark.sql.hive.metastore.version=0.12.0 > spark.sql.hive.metastore.jars=hive 0.12 jars and hadoop jars > {noformat} > when I run a task,it got following exception: > {noformat} > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.sql.hive.client.Shim_v0_12.loadTable(HiveShim.scala:249) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply$mcV$sp(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$loadTable$1.apply(ClientWrapper.scala:489) > at > org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256) > at > org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211) > at > org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248) > at > org.apache.spark.sql.hive.client.ClientWrapper.loadTable(ClientWrapper.scala:488) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:243) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:127) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:263) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:927) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:927) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:144) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:129) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:719) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:61) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:311) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376) > at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:311) > at > 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:165) > at > org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move > results from > hdfs://ns
[jira] [Commented] (SPARK-11748) Result is null after alter column name of table stored as Parquet
[ https://issues.apache.org/jira/browse/SPARK-11748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013030#comment-15013030 ] pin_zhang commented on SPARK-11748: --- Apache Hive 0.14 added support for Parquet column rename (https://issues.apache.org/jira/browse/HIVE-6938), but that doesn't work in Spark's Hive support. > Result is null after alter column name of table stored as Parquet > -- > > Key: SPARK-11748 > URL: https://issues.apache.org/jira/browse/SPARK-11748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: pin_zhang > > 1. Test with the following code > hctx.sql(" create table " + table + " (id int, str string) STORED AS > PARQUET ") > val df = hctx.jsonFile("g:/vip.json") > df.write.format("parquet").mode(SaveMode.Append).saveAsTable(table) > hctx.sql(" select * from " + table).show() > // alter table > val alter = "alter table " + table + " CHANGE id i_d int " > hctx.sql(alter) > > hctx.sql(" select * from " + table).show() > 2. Result > After changing the table column name, data is null for the changed column. > Result before alter table > +---+---+ > | id|str| > +---+---+ > | 1| s1| > | 2| s2| > +---+---+ > Result after alter table > +----+---+ > | i_d|str| > +----+---+ > |null| s1| > |null| s2| > +----+---+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11659) Codegen sporadically fails with same input character
[ https://issues.apache.org/jira/browse/SPARK-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012995#comment-15012995 ] Catalin Alexandru Zamfir commented on SPARK-11659: -- It is not explicit. We've only put spark_sql on the POM which drags in all the default dependencies. For the moment we have deactivated codegen (although it was faster) and are waiting next week for an 1.5.2 update. I see version 2.6.1 (managed from 2.7.8). Should I force the newer version? > Codegen sporadically fails with same input character > > > Key: SPARK-11659 > URL: https://issues.apache.org/jira/browse/SPARK-11659 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 1.5.1 > Environment: Default, Linux (Jessie) >Reporter: Catalin Alexandru Zamfir > > We pretty much have a default installation of Spark 1.5.1. Some of our jobs > sporadically fail with the below exception for the same "input character" (we > don't have @ in our inputs as we check the types that we filter from the > data, but jobs still fail) and when we re-run the same job with the same > input, the same job passes without any failures. I believe it's a bug in > code-gen but I can't debug this on a production cluster. One thing to note is > that this has a higher chance of occurring when multiple jobs are run in > parallel to one another (eg. 4 jobs at a time started on the same second > using a scheduler and sharing the same context). However, I have no reproduce > rule. For example, from 32 jobs scheduled in batches of 4 jobs per batch, 1 > of the jobs in one of the batches may fail with the below error and with a > different job, randomly. I don't know an idea on how to approach this > situation to produce better information so maybe you can advise us. > {noformat} > Job aborted due to stage failure: Task 50 in stage 4.0 failed 4 times, most > recent failure: Lost task 50.3 in stage 4.0 (TID 894, 10.136.64.112): > java.util.concurrent.ExecutionException: java.lang.Exception: failed to > compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: > Invalid character input "@" (character code 64) > public SpecificOrdering > generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) { > return new SpecificOrdering(expr); > } > class SpecificOrdering extends > org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering { > > private org.apache.spark.sql.catalyst.expressions.Expression[] expressions; > > > > public > SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) > { > expressions = expr; > > } > > @Override > public int compare(InternalRow a, InternalRow b) { > InternalRow i = null; // Holds current row being evaluated. > > i = a; > boolean isNullA2; > long primitiveA3; > { > /* input[0, LongType] */ > > boolean isNull0 = i.isNullAt(0); > long primitive1 = isNull0 ? -1L : (i.getLong(0)); > > isNullA2 = isNull0; > primitiveA3 = primitive1; > } > i = b; > boolean isNullB4; > long primitiveB5; > { > /* input[0, LongType] */ > > boolean isNull0 = i.isNullAt(0); > long primitive1 = isNull0 ? -1L : (i.getLong(0)); > > isNullB4 = isNull0; > primitiveB5 = primitive1; > } > if (isNullA2 && isNullB4) { > // Nothing > } else if (isNullA2) { > return -1; > } else if (isNullB4) { > return 1; > } else { > int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
> -1 : 0); > if (comp != 0) { > return comp; > } > } > > > i = a; > boolean isNullA8; > long primitiveA9; > { > /* input[1, LongType] */ > > boolean isNull6 = i.isNullAt(1); > long primitive7 = isNull6 ? -1L : (i.getLong(1)); > > isNullA8 = isNull6; > primitiveA9 = primitive7; > } > i = b; > boolean isNullB10; > long primitiveB11; > { > /* input[1, LongType] */ > > boolean isNull6 = i.isNullAt(1); > long primitive7 = isNull6 ? -1L : (i.getLong(1)); > > isNullB10 = isNull6; > primitiveB11 = primitive7; > } > if (isNullA8 && isNullB10) { > // Nothing > } else if (isNullA8) { > return -1; > } else if (isNullB10) { > return 1; > } else { > int comp = (primitiveA9 > primitiveB11 ? 1 : primitiveA9 < primitiveB11 > ? -1 : 0); > if (comp != 0) { > return comp; > } > } > > > i = a; > boolean isNullA14; > long primitiveA15; > { > /* input[2, LongType] */ > > boolean isNull12 = i.isNullA
[jira] [Assigned] (SPARK-11817) insert of timestamp with fractional seconds inserts a NULL
[ https://issues.apache.org/jira/browse/SPARK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11817: Assignee: Apache Spark > insert of timestamp with fractional seconds inserts a NULL > - > > Key: SPARK-11817 > URL: https://issues.apache.org/jira/browse/SPARK-11817 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Chip Sands >Assignee: Apache Spark > > Using the Thrift JDBC interface, inserting the value "1970-01-01 00:00:00.123456789" into a timestamp column inserts a NULL into the database. I am aware of the change in the 1.5 release notes: the Timestamp type's precision is reduced to 1 microsecond (1us). However, to be compatible with previous versions, I would suggest either rounding or truncating the fractional seconds rather than inserting a NULL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11817) insert of timestamp with fractional seconds inserts a NULL
[ https://issues.apache.org/jira/browse/SPARK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11817: Assignee: (was: Apache Spark) > insert of timestamp with fractional seconds inserts a NULL > - > > Key: SPARK-11817 > URL: https://issues.apache.org/jira/browse/SPARK-11817 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Chip Sands > > Using the Thrift JDBC interface, inserting the value "1970-01-01 00:00:00.123456789" into a timestamp column inserts a NULL into the database. I am aware of the change in the 1.5 release notes: the Timestamp type's precision is reduced to 1 microsecond (1us). However, to be compatible with previous versions, I would suggest either rounding or truncating the fractional seconds rather than inserting a NULL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11817) insert of timestamp with fractional seconds inserts a NULL
[ https://issues.apache.org/jira/browse/SPARK-11817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012981#comment-15012981 ] Apache Spark commented on SPARK-11817: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/9834 > insert of timestamp with fractional seconds inserts a NULL > - > > Key: SPARK-11817 > URL: https://issues.apache.org/jira/browse/SPARK-11817 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Chip Sands > > Using the Thrift JDBC interface, inserting the value "1970-01-01 00:00:00.123456789" into a timestamp column inserts a NULL into the database. I am aware of the change in the 1.5 release notes: the Timestamp type's precision is reduced to 1 microsecond (1us). However, to be compatible with previous versions, I would suggest either rounding or truncating the fractional seconds rather than inserting a NULL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
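As a concrete illustration of the truncation suggested above, a small sketch that reduces a nanosecond-precision java.sql.Timestamp to the microsecond precision Spark 1.5 stores internally. The helper name is made up; this only mirrors the kind of conversion involved and is not the actual fix:

{code}
import java.sql.Timestamp

// Microseconds since the epoch, truncating anything finer than 1us
// instead of turning the whole value into NULL.
def toMicros(ts: Timestamp): Long = {
  // getTime is millisecond precision; getNanos holds the full sub-second part.
  val wholeSecondsAsMillis = ts.getTime - ts.getNanos / 1000000
  wholeSecondsAsMillis * 1000L + ts.getNanos / 1000
}

val ts = new Timestamp(0L)   // the epoch
ts.setNanos(123456789)       // 0.123456789 seconds past the epoch
println(toMicros(ts))        // 123456 -- the trailing 789 nanoseconds are dropped
{code}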
[jira] [Comment Edited] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012815#comment-15012815 ] Earthson Lu edited comment on SPARK-6725 at 11/19/15 6:34 AM: -- -I'm glad to give some help:) Does it mean to do some unit tests?- I'm sorry, I have to focus on my own work now, may not have time to give help to 1.6's release~ was (Author: earthsonlu): I'm glad to give some help:) Does it mean to do some unit tests? > Model export/import for Pipeline API > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! > [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals
[ https://issues.apache.org/jira/browse/SPARK-11849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012972#comment-15012972 ] Apache Spark commented on SPARK-11849: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/9833 > Analyzer should replace current_date and current_timestamp with literals > > > Key: SPARK-11849 > URL: https://issues.apache.org/jira/browse/SPARK-11849 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently rely on the optimizer's constant folding to replace > current_timestamp and current_date. However, this can still result in > different values for different instances of current_timestamp/current_date if > the optimizer is not running fast enough. > A better solution is to replace these functions in the analyzer in one shot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals
[ https://issues.apache.org/jira/browse/SPARK-11849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11849: Assignee: Reynold Xin (was: Apache Spark) > Analyzer should replace current_date and current_timestamp with literals > > > Key: SPARK-11849 > URL: https://issues.apache.org/jira/browse/SPARK-11849 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > We currently rely on the optimizer's constant folding to replace > current_timestamp and current_date. However, this can still result in > different values for different instances of current_timestamp/current_date if > the optimizer is not running fast enough. > A better solution is to replace these functions in the analyzer in one shot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals
[ https://issues.apache.org/jira/browse/SPARK-11849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11849: Assignee: Apache Spark (was: Reynold Xin) > Analyzer should replace current_date and current_timestamp with literals > > > Key: SPARK-11849 > URL: https://issues.apache.org/jira/browse/SPARK-11849 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > We currently rely on the optimizer's constant folding to replace > current_timestamp and current_date. However, this can still result in > different values for different instances of current_timestamp/current_date if > the optimizer is not running fast enough. > A better solution is to replace these functions in the analyzer in one shot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11849) Analyzer should replace current_date and current_timestamp with literals
Reynold Xin created SPARK-11849: --- Summary: Analyzer should replace current_date and current_timestamp with literals Key: SPARK-11849 URL: https://issues.apache.org/jira/browse/SPARK-11849 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin Assignee: Reynold Xin We currently rely on the optimizer's constant folding to replace current_timestamp and current_date. However, this can still result in different values for different instances of current_timestamp/current_date if the optimizer is not running fast enough. A better solution is to replace these functions in the analyzer in one shot. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
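A rough sketch of what such an analyzer rule could look like: evaluate the clock once per analysis pass and substitute foldable literals, so every occurrence of the function in a query sees the same value. The rule name, and where it would sit in the analyzer's batches, are assumptions for illustration rather than the actual change:

{code}
import org.apache.spark.sql.catalyst.expressions.{CurrentDate, CurrentTimestamp, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.util.DateTimeUtils
import org.apache.spark.sql.types.{DateType, TimestampType}

// Hypothetical rule: replace current_date()/current_timestamp() with literals
// computed once, so all instances within one query agree.
object ComputeCurrentTimeOnce extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = {
    val nowMillis = System.currentTimeMillis()
    // Catalyst stores dates as days and timestamps as microseconds since the epoch.
    val dateLiteral = Literal.create(DateTimeUtils.millisToDays(nowMillis), DateType)
    val timestampLiteral = Literal.create(nowMillis * 1000L, TimestampType)
    plan transformAllExpressions {
      case CurrentDate() => dateLiteral
      case CurrentTimestamp() => timestampLiteral
    }
  }
}
{code}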
[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights
[ https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012964#comment-15012964 ] Kai Sasaki commented on SPARK-11520: [~josephkb] The metrics in {{RegressionMetrics}} seem to be based on {{MultivariateStatisticalSummary}}, and the current {{RegressionMetrics}} does not accept weighted samples as an argument. So we could pass weighted samples to {{MultivariateStatisticalSummary}} ({{MultivariateOnlineSummarizer}}) and compute the regression metrics from it. Is this assumption correct? May I work on this JIRA? > RegressionMetrics should support instance weights > - > > Key: SPARK-11520 > URL: https://issues.apache.org/jira/browse/SPARK-11520 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This will be important to improve LinearRegressionSummary, which currently > has a mix of weighted and unweighted metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
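To illustrate what instance weighting means for one of these metrics, a small self-contained sketch of a weighted MSE computed from (prediction, label, weight) triples. The actual change would presumably thread the weights through MultivariateOnlineSummarizer as the comment suggests; this helper is only an illustration and assumes a SparkContext named sc:

{code}
import org.apache.spark.rdd.RDD

// Weighted mean squared error over (prediction, label, weight) triples.
def weightedMSE(scored: RDD[(Double, Double, Double)]): Double = {
  val (weightedSqErr, totalWeight) = scored
    .map { case (prediction, label, weight) =>
      val err = prediction - label
      (weight * err * err, weight)
    }
    .reduce { case ((e1, w1), (e2, w2)) => (e1 + e2, w1 + w2) }
  weightedSqErr / totalWeight
}

// The heavily weighted, badly predicted point dominates the metric.
val scored = sc.parallelize(Seq((2.5, 3.0, 0.1), (0.0, 1.0, 0.9)))
println(weightedMSE(scored))  // (0.1 * 0.25 + 0.9 * 1.0) / 1.0 = 0.925
{code}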
[jira] [Updated] (SPARK-11622) Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter
[ https://issues.apache.org/jira/browse/SPARK-11622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11622: -- Target Version/s: 1.7.0 > Make LibSVMRelation extends HadoopFsRelation and Add LibSVMOutputWriter > --- > > Key: SPARK-11622 > URL: https://issues.apache.org/jira/browse/SPARK-11622 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > > so that LibSVMRealtion can leverage the features from HadoopFsRelation -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
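For context on how LibSVMRelation is consumed, it is exposed through the generic data source reader, and moving it onto HadoopFsRelation would keep that entry point while adding the HadoopFsRelation features. A usage sketch, assuming the source keeps its "libsvm" format name and numFeatures option and using the bundled sample file:

{code}
// Read LIBSVM-formatted data through the generic data source API
// (a spark-shell style sqlContext is assumed to be in scope).
val df = sqlContext.read
  .format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")

df.select("label", "features").show(5)
{code}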
[jira] [Updated] (SPARK-11275) [SQL] Incorrect results when using rollup/cube
[ https://issues.apache.org/jira/browse/SPARK-11275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11275: - Priority: Critical (was: Major) > [SQL] Incorrect results when using rollup/cube > --- > > Key: SPARK-11275 > URL: https://issues.apache.org/jira/browse/SPARK-11275 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0, 1.5.1 >Reporter: Xiao Li >Priority: Critical > > Spark SQL is unable to generate a correct result when the following query > using rollup. > "select a, b, sum(a + b) as sumAB, GROUPING__ID from mytable group by a, > b with rollup" > Spark SQL generates a wrong result: > [2,4,6,3] > [2,null,null,1] > [1,null,null,1] > [null,null,null,0] > [1,2,3,3] > The table mytable is super simple, containing two rows and two columns: > testData = Seq((1, 2), (2, 4)).toDF("a", "b") > After turning off codegen, the query plan is like > == Parsed Logical Plan == > 'Rollup ['a,'b], > [unresolvedalias('a),unresolvedalias('b),unresolvedalias('sum(('a + 'b)) AS > sumAB#20),unresolvedalias('GROUPING__ID)] > 'UnresolvedRelation `mytable`, None > == Analyzed Logical Plan == > a: int, b: int, sumAB: bigint, GROUPING__ID: int > Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as > bigint)) AS sumAB#20L,GROUPING__ID#23] > Expand [0,1,3], [a#2,b#3], grouping__id#23 > Subquery mytable >Project [_1#0 AS a#2,_2#1 AS b#3] > LocalRelation [_1#0,_2#1], [[1,2],[2,4]] > == Optimized Logical Plan == > Aggregate [a#2,b#3,grouping__id#23], [a#2,b#3,sum(cast((a#2 + b#3) as > bigint)) AS sumAB#20L,GROUPING__ID#23] > Expand [0,1,3], [a#2,b#3], grouping__id#23 > LocalRelation [a#2,b#3], [[1,2],[2,4]] > == Physical Plan == > Aggregate false, [a#2,b#3,grouping__id#23], [a#2,b#3,sum(PartialSum#24L) AS > sumAB#20L,grouping__id#23] > Exchange hashpartitioning(a#2,b#3,grouping__id#23,5) > Aggregate true, [a#2,b#3,grouping__id#23], > [a#2,b#3,grouping__id#23,sum(cast((a#2 + b#3) as bigint)) AS PartialSum#24L] >Expand [List(null, null, 0),List(a#2, null, 1),List(a#2, b#3, 3)], > [a#2,b#3,grouping__id#23] > LocalTableScan [a#2,b#3], [[1,2],[2,4]] > Below are my observations: > 1. Generation of GROUP__ID looks OK. > 2. The problem still exists no matter whether turning on/off CODEGEN > 3. Rollup still works in a simple query when group-by columns have only one > column. For example, "select b, sum(a), GROUPING__ID from mytable group by b > with rollup" > 4. The buckets in "HiveDataFrameAnalytcisSuite" are misleading. > Unfortunately, they hide the bugs. Although the buckets passed, they just > compare the results of SQL and Dataframe. This way is unable to capture the > regression when both return the same wrong results. > 5. The same problem also exists in cube. I have not started the investigation > in cube, but I believe the root causes should be the same. > 6. It looks like all the logical plans are correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11842) Cleanups to existing Readers and Writers
[ https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11842. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9829 [https://github.com/apache/spark/pull/9829] > Cleanups to existing Readers and Writers > > > Key: SPARK-11842 > URL: https://issues.apache.org/jira/browse/SPARK-11842 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > Fix For: 1.6.0 > > > Small cleanups to existing Readers and Writers > * Add {{repartition(1)}} to save() methods' saving of data for > LogisticRegressionModel, LinearRegressionModel. > * Strengthen privacy to class and companion object for Writers and Readers > * Change LogisticRegressionSuite read/write test to fit intercept > * Add Since versions for read/write methods in Pipeline, LogisticRegression > * Switch from hand-written class names in Readers to using getClass -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7457) Perf test for ALS.recommendAll
[ https://issues.apache.org/jira/browse/SPARK-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012843#comment-15012843 ] Jeff Zhang commented on SPARK-7457: --- [~mengxr] I can't find the ALS.recommendAll API; has it been removed? And is this ticket intended to produce a performance report? > Perf test for ALS.recommendAll > -- > > Key: SPARK-7457 > URL: https://issues.apache.org/jira/browse/SPARK-7457 > Project: Spark > Issue Type: Sub-task > Components: MLlib, Tests >Affects Versions: 1.4.0 >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
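For what it's worth, the closest existing API to a bulk "recommendAll" appears to be the recommendProductsForUsers/recommendUsersForProducts pair on MatrixFactorizationModel; a small sketch of what a perf test would presumably exercise (toy data, and a SparkContext named sc is assumed):

{code}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Toy ratings, just enough to obtain a fitted MatrixFactorizationModel.
val ratings = sc.parallelize(Seq(
  Rating(0, 0, 4.0), Rating(0, 1, 1.0),
  Rating(1, 0, 5.0), Rating(1, 1, 2.0)
))
val model = ALS.train(ratings, 2, 5, 0.01)  // rank = 2, iterations = 5, lambda = 0.01

// Bulk recommendations: top 10 products per user and top 10 users per product.
val productsForUsers = model.recommendProductsForUsers(10)
val usersForProducts = model.recommendUsersForProducts(10)
println(s"users covered: ${productsForUsers.count()}, products covered: ${usersForProducts.count()}")
{code}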
[jira] [Comment Edited] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012815#comment-15012815 ] Earthson Lu edited comment on SPARK-6725 at 11/19/15 5:14 AM: -- I'm glad to give some help:) Does it mean to do some unit tests? was (Author: earthsonlu): I'm glad to give help:) > Model export/import for Pipeline API > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! > [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs
[ https://issues.apache.org/jira/browse/SPARK-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012817#comment-15012817 ] Apache Spark commented on SPARK-11848: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/9832 > [SQL] Support EXPLAIN in DataSet APIs > - > > Key: SPARK-11848 > URL: https://issues.apache.org/jira/browse/SPARK-11848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li > > Prints the plans (logical and physical) to the console for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
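Until that lands, a Dataset's logical and physical plans can already be printed by going through the DataFrame side; a minimal sketch (assuming a spark-shell style sqlContext), showing the output the proposed API would expose directly on Dataset:

{code}
import sqlContext.implicits._

case class Person(name: String, age: Int)
val ds = Seq(Person("a", 1), Person("b", 2)).toDS().filter(_.age > 1)

// Print both the logical and physical plans for the Dataset's query;
// a Dataset-native explain() would print the same information without the toDF().
ds.toDF().explain(true)
{code}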
[jira] [Assigned] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs
[ https://issues.apache.org/jira/browse/SPARK-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11848: Assignee: Apache Spark > [SQL] Support EXPLAIN in DataSet APIs > - > > Key: SPARK-11848 > URL: https://issues.apache.org/jira/browse/SPARK-11848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Prints the plans (logical and physical) to the console for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs
[ https://issues.apache.org/jira/browse/SPARK-11848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11848: Assignee: (was: Apache Spark) > [SQL] Support EXPLAIN in DataSet APIs > - > > Key: SPARK-11848 > URL: https://issues.apache.org/jira/browse/SPARK-11848 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li > > Prints the plans (logical and physical) to the console for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012815#comment-15012815 ] Earthson Lu commented on SPARK-6725: I'm glad to give help:) > Model export/import for Pipeline API > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! > [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11848) [SQL] Support EXPLAIN in DataSet APIs
Xiao Li created SPARK-11848: --- Summary: [SQL] Support EXPLAIN in DataSet APIs Key: SPARK-11848 URL: https://issues.apache.org/jira/browse/SPARK-11848 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.0 Reporter: Xiao Li Prints the plans (logical and physical) to the console for debugging purposes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11829) Model export/import for spark.ml: estimators under ml.feature (II)
[ https://issues.apache.org/jira/browse/SPARK-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012785#comment-15012785 ] Yanbo Liang commented on SPARK-11829: - Sure, I can take this. > Model export/import for spark.ml: estimators under ml.feature (II) > -- > > Key: SPARK-11829 > URL: https://issues.apache.org/jira/browse/SPARK-11829 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Add read/write support to the following estimators and models under spark.ml: > * ChiSqSelector > * PCA > * QuantileDiscretizer > * VectorIndexer > * Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
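A sketch of the read/write surface this ticket asks for, mirroring estimators that already implement MLWritable/MLReadable; note that {{pca.write}} and {{PCA.load}} below describe the intended result of the work, not an API that existed before it, and paths and column names are assumptions:
{code}
// Intended usage once the ml.feature estimators listed above gain read/write (sketch only).
import org.apache.spark.ml.feature.PCA

val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(3)
pca.write.overwrite().save("/tmp/pca-estimator")   // target API per this ticket
val restored = PCA.load("/tmp/pca-estimator")      // target API per this ticket
{code}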
[jira] [Commented] (SPARK-11847) Model export/import for spark.ml: LDA
[ https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012776#comment-15012776 ] yuhao yang commented on SPARK-11847: Sure, I can take it. > Model export/import for spark.ml: LDA > - > > Key: SPARK-11847 > URL: https://issues.apache.org/jira/browse/SPARK-11847 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: yuhao yang > > Add read/write support to LDA, similar to ALS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11847) Model export/import for spark.ml: LDA
[ https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11847: -- Description: Add read/write support to LDA, similar to ALS. > Model export/import for spark.ml: LDA > - > > Key: SPARK-11847 > URL: https://issues.apache.org/jira/browse/SPARK-11847 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: yuhao yang > > Add read/write support to LDA, similar to ALS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11847) Model export/import for spark.ml: LDA
[ https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012775#comment-15012775 ] Xiangrui Meng commented on SPARK-11847: --- [~yuhaoyan] We need some help on pipeline persistence. Do you have time to work on this JIRA? > Model export/import for spark.ml: LDA > - > > Key: SPARK-11847 > URL: https://issues.apache.org/jira/browse/SPARK-11847 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: yuhao yang > > Add read/write support to LDA, similar to ALS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
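To make the "similar to ALS" reference concrete, a sketch of the intended LDA surface; the calls below are the target of this ticket rather than an existing API, and the path is an assumption:
{code}
// Target API sketched for illustration; LDA read/write is what this ticket adds.
import org.apache.spark.ml.clustering.LDA

val lda = new LDA().setK(10).setMaxIter(20)
lda.write.overwrite().save("/tmp/lda-estimator")   // intended, per this ticket
val restored = LDA.load("/tmp/lda-estimator")      // intended, per this ticket
{code}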
[jira] [Created] (SPARK-11847) Model export/import for spark.ml: LDA
Xiangrui Meng created SPARK-11847: - Summary: Model export/import for spark.ml: LDA Key: SPARK-11847 URL: https://issues.apache.org/jira/browse/SPARK-11847 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: yuhao yang -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6725) Model export/import for Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012770#comment-15012770 ] Xiangrui Meng commented on SPARK-6725: -- [~EarthsonLu] We are adding more import/export to existing algorithms. Could you help test them in both Scala and Java and let us know if you find any issues? Thanks! > Model export/import for Pipeline API > > > Key: SPARK-6725 > URL: https://issues.apache.org/jira/browse/SPARK-6725 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This is an umbrella JIRA for adding model export/import to the spark.ml API. > This JIRA is for adding the internal Saveable/Loadable API and Parquet-based > format, not for other formats like PMML. > This will require the following steps: > * Add export/import for all PipelineStages supported by spark.ml > ** This will include some Transformers which are not Models. > ** These can use almost the same format as the spark.mllib model save/load > functions, but the model metadata must store a different class name (marking > the class as a spark.ml class). > * After all PipelineStages support save/load, add an interface which forces > future additions to support save/load. > *UPDATE*: In spark.ml, we could save feature metadata using DataFrames. > Other libraries and formats can support this, and it would be great if we > could too. We could do either of the following: > * save() optionally takes a dataset (or schema), and load will return a > (model, schema) pair. > * Models themselves save the input schema. > Both options would mean inheriting from new Saveable, Loadable types. > *UPDATE: DESIGN DOC*: Here's a design doc which I wrote. If you have > comments about the planned implementation, please comment in this JIRA. > Thanks! > [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11846: -- Description: Add read/write support to AFTSurvivalRegression and IsotonicRegression using LinearRegression read/write as reference. > CLONE - Model export/import for spark.ml: AFTSurvivalRegression and > IsotonicRegression > -- > > Key: SPARK-11846 > URL: https://issues.apache.org/jira/browse/SPARK-11846 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > Add read/write support to AFTSurvivalRegression and IsotonicRegression using > LinearRegression read/write as reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11846) Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11846: -- Summary: Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression (was: CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression) > Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression > -- > > Key: SPARK-11846 > URL: https://issues.apache.org/jira/browse/SPARK-11846 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > > Add read/write support to AFTSurvivalRegression and IsotonicRegression using > LinearRegression read/write as reference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11846: -- Assignee: Xusen Yin (was: Wenjian Huang) > CLONE - Model export/import for spark.ml: AFTSurvivalRegression and > IsotonicRegression > -- > > Key: SPARK-11846 > URL: https://issues.apache.org/jira/browse/SPARK-11846 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
Xiangrui Meng created SPARK-11846: - Summary: CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression Key: SPARK-11846 URL: https://issues.apache.org/jira/browse/SPARK-11846 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Wenjian Huang Fix For: 1.6.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11846) CLONE - Model export/import for spark.ml: AFTSurvivalRegression and IsotonicRegression
[ https://issues.apache.org/jira/browse/SPARK-11846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11846: -- Fix Version/s: (was: 1.6.0) > CLONE - Model export/import for spark.ml: AFTSurvivalRegression and > IsotonicRegression > -- > > Key: SPARK-11846 > URL: https://issues.apache.org/jira/browse/SPARK-11846 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Xusen Yin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD
[ https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11845: Assignee: Tathagata Das (was: Apache Spark) > Add unit tests to verify correct checkpointing of TrackStateRDD > --- > > Key: SPARK-11845 > URL: https://issues.apache.org/jira/browse/SPARK-11845 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD
[ https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012738#comment-15012738 ] Apache Spark commented on SPARK-11845: -- User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/9831 > Add unit tests to verify correct checkpointing of TrackStateRDD > --- > > Key: SPARK-11845 > URL: https://issues.apache.org/jira/browse/SPARK-11845 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD
[ https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11845: Assignee: Apache Spark (was: Tathagata Das) > Add unit tests to verify correct checkpointing of TrackStateRDD > --- > > Key: SPARK-11845 > URL: https://issues.apache.org/jira/browse/SPARK-11845 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Tathagata Das >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD
[ https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-11845: -- Target Version/s: 1.6.0 > Add unit tests to verify correct checkpointing of TrackStateRDD > --- > > Key: SPARK-11845 > URL: https://issues.apache.org/jira/browse/SPARK-11845 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD
Tathagata Das created SPARK-11845: - Summary: Add unit tests to verify correct checkpointing of TrackStateRDD Key: SPARK-11845 URL: https://issues.apache.org/jira/browse/SPARK-11845 Project: Spark Issue Type: Bug Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11845) Add unit tests to verify correct checkpointing of TrackStateRDD
[ https://issues.apache.org/jira/browse/SPARK-11845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-11845: -- Issue Type: Test (was: Bug) > Add unit tests to verify correct checkpointing of TrackStateRDD > --- > > Key: SPARK-11845 > URL: https://issues.apache.org/jira/browse/SPARK-11845 > Project: Spark > Issue Type: Test > Components: Streaming >Reporter: Tathagata Das >Assignee: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11838) Spark SQL query fragment RDD reuse
[ https://issues.apache.org/jira/browse/SPARK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012724#comment-15012724 ] Mark Hamstra commented on SPARK-11838: -- One significant difference between this and CacheManager is that what is proposed here is caching and reuse of the RDD itself, not the blocks of data computed by that RDD. As Mikhail noted, that can avoid significant amounts of duplicate computation even when nothing is explicitly persisted/cached. > Spark SQL query fragment RDD reuse > -- > > Key: SPARK-11838 > URL: https://issues.apache.org/jira/browse/SPARK-11838 > Project: Spark > Issue Type: Improvement >Reporter: Mikhail Bautin > > With many analytical Spark SQL workloads against slowly changing tables, > successive queries frequently share fragments that produce the same result. > Instead of re-computing those fragments for every query, it makes sense to > detect similar fragments and substitute RDDs previously created for matching > SparkPlan fragments into every new SparkPlan at the execution time whenever > possible. Even if no RDDs are persist()-ed to memory/disk/off-heap memory, > many stages can still be skipped due to map output files being present on > executor nodes. > The implementation involves the following steps: > (1) Logical plan "canonicalization". > Logical plans mapping to the same "canonical" logical plan should always > produce the same results (except for possible output column reordering), > although the inverse statement won't always be true. > - Re-mapping expression ids to "canonical expression ids" (successively > increasing numbers always starting with 1). > - Eliminating alias names that are unimportant after analysis completion. > Only the names that are necessary to determine the Hive table columns to be > scanned are retained. > - Reordering columns in projections, grouping/aggregation expressions, etc. > This can be done e.g. by using the string representation as a sort key. Union > inputs always have to be reordered the same way. > - Tree traversal has to happen starting from leaves and progressing towards > the root, because we need to already have identified canonical expression ids > for children of a node before we can come up with sort keys that would allow > to reorder expressions in a node deterministically. This is a bit more > complicated for Union nodes. > - Special handling for MetastoreRelations. We replace MetastoreRelation > with a special class CanonicalMetastoreRelation that uses attributes and > partitionKeys as part of its equals() and hashCode() implementation, but the > visible attributes and aprtitionKeys are restricted to expression ids that > the rest of the query actually needs from that MetastoreRelation. > An example of logical plans and corresponding canonical logical plans: > https://gist.githubusercontent.com/mbautin/ef1317b341211d9606cf/raw > (2) Tracking LogicalPlan fragments corresponding to SparkPlan fragments. When > generating a SparkPlan, we keep an optional reference to a LogicalPlan > instance in every node. This allows us to populate the cache with mappings > from canonical logical plans of query fragments to the corresponding RDDs > generated as part of query execution. Note that there is no new work > necessary to generate the RDDs, we are merely utilizing the RDDs that would > have been produced as part of SparkPlan execution anyway. > (3) SparkPlan fragment substitution. 
After generating a SparkPlan and before > calling prepare() or execute() on it, we check if any of its nodes have an > associated LogicalPlan that maps to a canonical logical plan matching a cache > entry. If so, we substitute a PhysicalRDD (or a new class UnsafePhysicalRDD > wrapping an RDD of UnsafeRow) scanning the previously created RDD instead of > the current query fragment. If the expected column order differs from what > the current SparkPlan fragment produces, we add a projection to reorder the > columns. We also add safe/unsafe row conversions as necessary to match the > row type that is expected by the parent of the current SparkPlan fragment. > (4) The execute() method of SparkPlan also needs to perform the cache lookup > and substitution described above before producing a new RDD for the current > SparkPlan node. The "loading cache" pattern (e.g. as implemented in Guava) > allows to reuse query fragments between simultaneously submitted queries: > whichever query runs execute() for a particular fragment's canonical logical > plan starts producing an RDD first, and if another query has a fragment with > the same canonical logical plan, it waits for the RDD to be produced by the > first query and inserts it in its SparkPlan instead. > This kind of query fragment caching will mostly be useful
[jira] [Created] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13
Yin Huai created SPARK-11844: Summary: can not read class org.apache.parquet.format.PageHeader: don't know what type: 13 Key: SPARK-11844 URL: https://issues.apache.org/jira/browse/SPARK-11844 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Priority: Minor I got the following error once when I was running a query {code} java.io.IOException: can not read class org.apache.parquet.format.PageHeader: don't know what type: 13 at org.apache.parquet.format.Util.read(Util.java:216) at org.apache.parquet.format.Util.readPageHeader(Util.java:65) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534) at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546) at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496) at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127) at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208) at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201) at org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88) at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704) at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300) at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know what type: 13 at parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806) at parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500) at 
org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158) at parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108) {code} The next retry was good. Right now, seems not critical. But, let's still track it in case we see it in future. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11614) serde parameters should be set only when all params are ready
[ https://issues.apache.org/jira/browse/SPARK-11614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11614: - Assignee: Navis > serde parameters should be set only when all params are ready > - > > Key: SPARK-11614 > URL: https://issues.apache.org/jira/browse/SPARK-11614 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Navis >Assignee: Navis >Priority: Minor > Fix For: 1.6.0 > > > see HIVE-7975 and HIVE-12373 > With the changed semantics of setters in thrift objects in Hive, setters should be > called only after all parameters are set. It is not a problem today, but it will > become one some day. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11614) serde parameters should be set only when all params are ready
[ https://issues.apache.org/jira/browse/SPARK-11614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11614. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9580 [https://github.com/apache/spark/pull/9580] > serde parameters should be set only when all params are ready > - > > Key: SPARK-11614 > URL: https://issues.apache.org/jira/browse/SPARK-11614 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Navis >Priority: Minor > Fix For: 1.6.0 > > > see HIVE-7975 and HIVE-12373 > With the changed semantics of setters in thrift objects in Hive, setters should be > called only after all parameters are set. It is not a problem today, but it will > become one some day. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
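To illustrate the pattern the fix moves to (assemble everything first, then one setter call on the thrift object), a small hedged sketch; the property names are illustrative and this is not the actual patched code from PR 9580:
{code}
// Build the complete serde property map locally, then hand it over in a single call.
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.metastore.api.SerDeInfo

val serdeProps = scala.collection.mutable.Map[String, String]()
serdeProps += "serialization.format" -> "1"
serdeProps += "field.delim" -> ","
// ...gather every remaining property before touching the thrift object...

val serdeInfo = new SerDeInfo()
serdeInfo.setParameters(serdeProps.asJava)  // set only once all params are ready
{code}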
[jira] [Commented] (SPARK-11544) sqlContext doesn't use PathFilter
[ https://issues.apache.org/jira/browse/SPARK-11544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012718#comment-15012718 ] Apache Spark commented on SPARK-11544: -- User 'dilipbiswal' has created a pull request for this issue: https://github.com/apache/spark/pull/9830 > sqlContext doesn't use PathFilter > - > > Key: SPARK-11544 > URL: https://issues.apache.org/jira/browse/SPARK-11544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 > Environment: AWS EMR 4.1.0, Spark 1.5.0 >Reporter: Frank Dai >Assignee: Dilip Biswal > Fix For: 1.6.0 > > > When sqlContext reads JSON files, it doesn't use {{PathFilter}} in the > underlying SparkContext > {code:java} > val sc = new SparkContext(conf) > sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", > classOf[TmpFileFilter], classOf[PathFilter]) > val sqlContext = new org.apache.spark.sql.SQLContext(sc) > {code} > The definition of {{TmpFileFilter}} is: > {code:title=TmpFileFilter.scala|borderStyle=solid} > import org.apache.hadoop.fs.{Path, PathFilter} > class TmpFileFilter extends PathFilter { > override def accept(path : Path): Boolean = !path.getName.endsWith(".tmp") > } > {code} > When use {{sqlContext}} to read JSON files, e.g., > {{sqlContext.read.schema(mySchema).json(s3Path)}}, Spark will throw out an > exception: > {quote} > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > s3://chef-logstash-access-backup/2015/10/21/00/logstash-172.18.68.59-s3.1445388158944.gz.tmp > {quote} > It seems {{sqlContext}} can see {{.tmp}} files while {{sc}} can not, which > causes the above exception -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
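As a hedged workaround sketch while the JSON reader ignores the configured PathFilter: list and filter the input files by hand, then feed the survivors to the reader as an RDD[String]. Here {{mySchema}} is the schema from the report above, and the bucket path is illustrative:
{code}
// Manually skip ".tmp" files before handing data to sqlContext.read.json.
import org.apache.hadoop.fs.{FileSystem, Path}

val inputDir = new Path("s3://chef-logstash-access-backup/2015/10/21/00/")
val fs = FileSystem.get(inputDir.toUri, sc.hadoopConfiguration)
val goodFiles = fs.listStatus(inputDir).map(_.getPath.toString).filterNot(_.endsWith(".tmp"))

// sc.textFile accepts the comma-separated file list (and decompresses .gz transparently).
val df = sqlContext.read.schema(mySchema).json(sc.textFile(goodFiles.mkString(",")))
{code}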
[jira] [Reopened] (SPARK-11669) Python interface to SparkR GLM module
[ https://issues.apache.org/jira/browse/SPARK-11669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shubhanshu Mishra reopened SPARK-11669: --- What I meant when I asked for a Python API to GLM was that the GLM module is something implemented in Spark and should be made part of the MLlib module rather than just being a SparkR feature. This will allow users who come from a Python statsmodels background to use the GLM module in their Python code as well. I know the current GLM module is just built using SparkR, but I feel it should be a core module with a common API for multiple languages. > Python interface to SparkR GLM module > - > > Key: SPARK-11669 > URL: https://issues.apache.org/jira/browse/SPARK-11669 > Project: Spark > Issue Type: Improvement > Components: PySpark, SparkR >Affects Versions: 1.5.0, 1.5.1 > Environment: LINUX > MAC > WINDOWS >Reporter: Shubhanshu Mishra >Priority: Minor > Labels: GLM, pyspark, sparkR, statistics > > There should be a Python interface to the SparkR GLM module. Currently the > only Python library that produces R-style GLM results is statsmodels. > Inspiration for the API can be taken from the following page. > http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/formulas.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark
[ https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012710#comment-15012710 ] Apache Spark commented on SPARK-5682: - User 'winningsix' has created a pull request for this issue: https://github.com/apache/spark/pull/8880 > Add encrypted shuffle in spark > -- > > Key: SPARK-5682 > URL: https://issues.apache.org/jira/browse/SPARK-5682 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Reporter: liyunzhang_intel > Attachments: Design Document of Encrypted Spark > Shuffle_20150209.docx, Design Document of Encrypted Spark > Shuffle_20150318.docx, Design Document of Encrypted Spark > Shuffle_20150402.docx, Design Document of Encrypted Spark > Shuffle_20150506.docx > > > Encrypted shuffle is enabled in Hadoop 2.6, which makes shuffle data safer; the feature is needed in Spark as well. AES is a specification for > the encryption of electronic data, with five common modes; CTR is > one of them. We use two codecs, JceAesCtrCryptoCodec and > OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; both are also used > in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption algorithms the JDK > provides, while OpensslAesCtrCryptoCodec uses the encryption algorithms OpenSSL > provides. > Because UGI credential info is used in the process of encrypted shuffle, we > first enable encrypted shuffle on the Spark-on-YARN framework. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
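As background for the codec names in the description: JceAesCtrCryptoCodec wraps the JDK's own AES/CTR transformation. The snippet below is only a plain-JCE sketch of that mode with throwaway key and IV values, not Spark's shuffle encryption code:
{code}
// AES/CTR via the JDK: the same keyed keystream both encrypts and decrypts a block.
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

val key = new SecretKeySpec(Array.fill[Byte](16)(0x11.toByte), "AES")
val iv  = new IvParameterSpec(Array.fill[Byte](16)(0x22.toByte))

val enc = Cipher.getInstance("AES/CTR/NoPadding")
enc.init(Cipher.ENCRYPT_MODE, key, iv)
val cipherText = enc.doFinal("shuffle block bytes".getBytes("UTF-8"))

val dec = Cipher.getInstance("AES/CTR/NoPadding")
dec.init(Cipher.DECRYPT_MODE, key, iv)
println(new String(dec.doFinal(cipherText), "UTF-8"))  // round-trips to the original text
{code}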
[jira] [Commented] (SPARK-11838) Spark SQL query fragment RDD reuse
[ https://issues.apache.org/jira/browse/SPARK-11838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012701#comment-15012701 ] Michael Armbrust commented on SPARK-11838: -- As I said to [~markhamstra] offline, my biggest questions are about how we expose this to users. (what are the interfaces to opt into this, what are the interfaces to invalidate, etc) That said, you should look at how we do in-memory caching as its very similar to what you are proposing. Some relevant parts of the code to look at: - [sameResult|https://github.com/apache/spark/blob/67c75828ff4df2e305bdf5d6be5a11201d1da3f3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala#L122] a less ambitious version of the query subsumption calculation you describe. Ideally we would just improve this. - [CacheManager|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala] Where we do substitution into a logical plan (I think i like this better than mixing it into execution). - [QueryExecution|https://github.com/apache/spark/blob/67c75828ff4df2e305bdf5d6be5a11201d1da3f3/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala#L38] How it plugs into execution. > Spark SQL query fragment RDD reuse > -- > > Key: SPARK-11838 > URL: https://issues.apache.org/jira/browse/SPARK-11838 > Project: Spark > Issue Type: Improvement >Reporter: Mikhail Bautin > > With many analytical Spark SQL workloads against slowly changing tables, > successive queries frequently share fragments that produce the same result. > Instead of re-computing those fragments for every query, it makes sense to > detect similar fragments and substitute RDDs previously created for matching > SparkPlan fragments into every new SparkPlan at the execution time whenever > possible. Even if no RDDs are persist()-ed to memory/disk/off-heap memory, > many stages can still be skipped due to map output files being present on > executor nodes. > The implementation involves the following steps: > (1) Logical plan "canonicalization". > Logical plans mapping to the same "canonical" logical plan should always > produce the same results (except for possible output column reordering), > although the inverse statement won't always be true. > - Re-mapping expression ids to "canonical expression ids" (successively > increasing numbers always starting with 1). > - Eliminating alias names that are unimportant after analysis completion. > Only the names that are necessary to determine the Hive table columns to be > scanned are retained. > - Reordering columns in projections, grouping/aggregation expressions, etc. > This can be done e.g. by using the string representation as a sort key. Union > inputs always have to be reordered the same way. > - Tree traversal has to happen starting from leaves and progressing towards > the root, because we need to already have identified canonical expression ids > for children of a node before we can come up with sort keys that would allow > to reorder expressions in a node deterministically. This is a bit more > complicated for Union nodes. > - Special handling for MetastoreRelations. 
We replace MetastoreRelation > with a special class CanonicalMetastoreRelation that uses attributes and > partitionKeys as part of its equals() and hashCode() implementation, but the > visible attributes and aprtitionKeys are restricted to expression ids that > the rest of the query actually needs from that MetastoreRelation. > An example of logical plans and corresponding canonical logical plans: > https://gist.githubusercontent.com/mbautin/ef1317b341211d9606cf/raw > (2) Tracking LogicalPlan fragments corresponding to SparkPlan fragments. When > generating a SparkPlan, we keep an optional reference to a LogicalPlan > instance in every node. This allows us to populate the cache with mappings > from canonical logical plans of query fragments to the corresponding RDDs > generated as part of query execution. Note that there is no new work > necessary to generate the RDDs, we are merely utilizing the RDDs that would > have been produced as part of SparkPlan execution anyway. > (3) SparkPlan fragment substitution. After generating a SparkPlan and before > calling prepare() or execute() on it, we check if any of its nodes have an > associated LogicalPlan that maps to a canonical logical plan matching a cache > entry. If so, we substitute a PhysicalRDD (or a new class UnsafePhysicalRDD > wrapping an RDD of UnsafeRow) scanning the previously created RDD instead of > the current query fragment. If the expected column order differs from what > the current SparkPlan fragment produces, we add a projection to reorder the > columns. We also a
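For readers following the CacheManager pointer above, a small sketch of the substitution it already performs: once a plan is cached, later queries whose fragments {{sameResult}} the cached plan should scan the in-memory relation instead of recomputing it. Table name and data are assumptions:
{code}
// Existing in-memory caching path, for comparison with the proposed fragment reuse.
val people = sqlContext.range(0, 1000).toDF("id")
people.registerTempTable("people")

sqlContext.cacheTable("people")                          // registers the plan with CacheManager
sqlContext.sql("SELECT count(*) FROM people").show()     // first run materializes the cache
sqlContext.sql("SELECT max(id) FROM people").explain()   // plan should now scan the cached relation
{code}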
[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012691#comment-15012691 ] Nishkam Ravi commented on SPARK-11278: -- [~andrewor14] This was last tested on Nov 11th, which would include the commit you mentioned. Each node has 16 vcores and 48GB memory. > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.1 >Reporter: Nishkam Ravi >Assignee: Andrew Or >Priority: Critical > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11842) Cleanups to existing Readers and Writers
[ https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11842: -- Description: Small cleanups to existing Readers and Writers * Add {{repartition(1)}} to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept * Add Since versions for read/write methods in Pipeline, LogisticRegression * Switch from hand-written class names in Readers to using getClass was: Small cleanups to existing Readers and Writers * Add {{repartition(1)}} to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept * Add Since versions for read/write methods in Pipeline, LogisticRegression > Cleanups to existing Readers and Writers > > > Key: SPARK-11842 > URL: https://issues.apache.org/jira/browse/SPARK-11842 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Small cleanups to existing Readers and Writers > * Add {{repartition(1)}} to save() methods' saving of data for > LogisticRegressionModel, LinearRegressionModel. > * Strengthen privacy to class and companion object for Writers and Readers > * Change LogisticRegressionSuite read/write test to fit intercept > * Add Since versions for read/write methods in Pipeline, LogisticRegression > * Switch from hand-written class names in Readers to using getClass -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
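To illustrate the first bullet in the description: model data is usually a handful of rows, so repartitioning to one partition before writing keeps save() from emitting many near-empty Parquet part files. A minimal sketch with an assumed case class and path, not the actual writer code:
{code}
// Coalesce tiny model data to a single Parquet part file on save.
case class ModelData(intercept: Double, coefficients: Seq[Double])

val data = Seq(ModelData(0.5, Seq(1.0, -2.0, 0.3)))
sqlContext.createDataFrame(data).repartition(1).write.parquet("/tmp/lr-model/data")
{code}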
[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6791: - Description: Updated to be for CrossValidator only (was: Algorithms: Pipeline, CrossValidator (and associated models) This task will block on all other subtasks for [SPARK-6725]. This task will also include adding export/import as a required part of the PipelineStage interface since meta-algorithms will depend on sub-algorithms supporting save/load.) > Model export/import for spark.ml: CrossValidator > > > Key: SPARK-6791 > URL: https://issues.apache.org/jira/browse/SPARK-6791 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Updated to be for CrossValidator only -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11829) Model export/import for spark.ml: estimators under ml.feature (II)
[ https://issues.apache.org/jira/browse/SPARK-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012676#comment-15012676 ] Xiangrui Meng commented on SPARK-11829: --- [~yanboliang] Could you help on this JIRA? I don't think we can make `RFormula` work in 1.6, but the rest of them should be very similar to SPARK-6787 and SPARK-11839. > Model export/import for spark.ml: estimators under ml.feature (II) > -- > > Key: SPARK-11829 > URL: https://issues.apache.org/jira/browse/SPARK-11829 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Add read/write support to the following estimators under spark.ml: > * ChiSqSelector > * PCA > * QuantileDiscretizer > * VectorIndexer > * Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6791) Model export/import for spark.ml: CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-6791: Assignee: Joseph K. Bradley > Model export/import for spark.ml: CrossValidator > > > Key: SPARK-6791 > URL: https://issues.apache.org/jira/browse/SPARK-6791 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > Updated to be for CrossValidator only -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11829) Model export/import for spark.ml: estimators under ml.feature (II)
[ https://issues.apache.org/jira/browse/SPARK-11829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11829: -- Description: Add read/write support to the following estimators and models under spark.ml: * ChiSqSelector * PCA * QuantileDiscretizer * VectorIndexer * Word2Vec was: Add read/write support to the following estimators under spark.ml: * ChiSqSelector * PCA * QuantileDiscretizer * VectorIndexer * Word2Vec > Model export/import for spark.ml: estimators under ml.feature (II) > -- > > Key: SPARK-11829 > URL: https://issues.apache.org/jira/browse/SPARK-11829 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Add read/write support to the following estimators and models under spark.ml: > * ChiSqSelector > * PCA > * QuantileDiscretizer > * VectorIndexer > * Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6791) Model export/import for spark.ml: CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6791: - Summary: Model export/import for spark.ml: CrossValidator (was: Model export/import for spark.ml: meta-algorithms) > Model export/import for spark.ml: CrossValidator > > > Key: SPARK-6791 > URL: https://issues.apache.org/jira/browse/SPARK-6791 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley > > Algorithms: Pipeline, CrossValidator (and associated models) > This task will block on all other subtasks for [SPARK-6725]. This task will > also include adding export/import as a required part of the PipelineStage > interface since meta-algorithms will depend on sub-algorithms supporting > save/load. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11816) fix some style issue in ML/MLlib examples
[ https://issues.apache.org/jira/browse/SPARK-11816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11816. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9808 [https://github.com/apache/spark/pull/9808] > fix some style issue in ML/MLlib examples > - > > Key: SPARK-11816 > URL: https://issues.apache.org/jira/browse/SPARK-11816 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > Fix For: 1.6.0 > > > Currently I have only fixed some obvious comment issues, like > // scalastyle:off println > at the bottom. > Yet the style in the examples is not quite consistent; for instance, only half of the > examples include > // Example usage: ./bin/run-example mllib.FPGrowthExample \ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11816) fix some style issue in ML/MLlib examples
[ https://issues.apache.org/jira/browse/SPARK-11816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11816: -- Target Version/s: 1.6.0 Component/s: Documentation > fix some style issue in ML/MLlib examples > - > > Key: SPARK-11816 > URL: https://issues.apache.org/jira/browse/SPARK-11816 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > > Currently I have only fixed some obvious comment issues, like > // scalastyle:off println > at the bottom. > Yet the style in the examples is not quite consistent; for instance, only half of the > examples include > // Example usage: ./bin/run-example mllib.FPGrowthExample \ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11816) fix some style issue in ML/MLlib examples
[ https://issues.apache.org/jira/browse/SPARK-11816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11816: -- Assignee: yuhao yang > fix some style issue in ML/MLlib examples > - > > Key: SPARK-11816 > URL: https://issues.apache.org/jira/browse/SPARK-11816 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > > Currently I have only fixed some obvious comment issues, like > // scalastyle:off println > at the bottom. > Yet the style in the examples is not quite consistent; for instance, only half of the > examples include > // Example usage: ./bin/run-example mllib.FPGrowthExample \ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11842) Cleanups to existing Readers and Writers
[ https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11842: -- Target Version/s: 1.6.0 > Cleanups to existing Readers and Writers > > > Key: SPARK-11842 > URL: https://issues.apache.org/jira/browse/SPARK-11842 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Small cleanups to existing Readers and Writers > * Add {{repartition(1)}} to save() methods' saving of data for > LogisticRegressionModel, LinearRegressionModel. > * Strengthen privacy to class and companion object for Writers and Readers > * Change LogisticRegressionSuite read/write test to fit intercept > * Add Since versions for read/write methods in Pipeline, LogisticRegression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11842) Cleanups to existing Readers and Writers
[ https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012667#comment-15012667 ] Apache Spark commented on SPARK-11842: -- User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/9829 > Cleanups to existing Readers and Writers > > > Key: SPARK-11842 > URL: https://issues.apache.org/jira/browse/SPARK-11842 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Small cleanups to existing Readers and Writers > * Add {{repartition(1)}} to save() methods' saving of data for > LogisticRegressionModel, LinearRegressionModel. > * Strengthen privacy to class and companion object for Writers and Readers > * Change LogisticRegressionSuite read/write test to fit intercept > * Add Since versions for read/write methods in Pipeline, LogisticRegression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11835) Add a menu to the documentation of MLlib
[ https://issues.apache.org/jira/browse/SPARK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11835: -- Component/s: Documentation > Add a menu to the documentation of MLlib > > > Key: SPARK-11835 > URL: https://issues.apache.org/jira/browse/SPARK-11835 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 1.5.1 >Reporter: Tim Hunter >Assignee: Tim Hunter > Attachments: Screen Shot 2015-11-18 at 4.50.45 PM.png > > > Right now, the table of contents gets generated on a page-by-page basis, > which makes it hard to navigate between different topics in a project. We > should make use of the empty space on the left of the documentation to put a > navigation menu. > A picture is worth a thousand words: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11835) Add a menu to the documentation of MLlib
[ https://issues.apache.org/jira/browse/SPARK-11835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-11835: -- Assignee: Tim Hunter > Add a menu to the documentation of MLlib > > > Key: SPARK-11835 > URL: https://issues.apache.org/jira/browse/SPARK-11835 > Project: Spark > Issue Type: Improvement > Components: Documentation, MLlib >Affects Versions: 1.5.1 >Reporter: Tim Hunter >Assignee: Tim Hunter > Attachments: Screen Shot 2015-11-18 at 4.50.45 PM.png > > > Right now, the table of contents gets generated on a page-by-page basis, > which makes it hard to navigate between different topics in a project. We > should make use of the empty space on the left of the documentation to put a > navigation menu. > A picture is worth a thousand words: -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11787) Speed up parquet reader for flat schemas
[ https://issues.apache.org/jira/browse/SPARK-11787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11787. - Resolution: Fixed Assignee: Nong Li Fix Version/s: 1.6.0 > Speed up parquet reader for flat schemas > > > Key: SPARK-11787 > URL: https://issues.apache.org/jira/browse/SPARK-11787 > Project: Spark > Issue Type: Task > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > Fix For: 1.6.0 > > > Measuring the performance of running some of the TPCDS and anecdotally, > parquet scan and record reconstruction performance is a bottleneck. > For simple schemas, we can do better using the lower level parquet-mr APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11842) Cleanups to existing Readers and Writers
[ https://issues.apache.org/jira/browse/SPARK-11842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11842: -- Description: Small cleanups to existing Readers and Writers * Add {{repartition(1)}} to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept * Add Since versions for read/write methods in Pipeline, LogisticRegression was: Small cleanups to existing Readers and Writers * Add {{repartition(1)}} to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept > Cleanups to existing Readers and Writers > > > Key: SPARK-11842 > URL: https://issues.apache.org/jira/browse/SPARK-11842 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > Small cleanups to existing Readers and Writers > * Add {{repartition(1)}} to save() methods' saving of data for > LogisticRegressionModel, LinearRegressionModel. > * Strengthen privacy to class and companion object for Writers and Readers > * Change LogisticRegressionSuite read/write test to fit intercept > * Add Since versions for read/write methods in Pipeline, LogisticRegression -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11839) Renames traits to avoid collision with java.util.* and use default traits to simplify the impl
[ https://issues.apache.org/jira/browse/SPARK-11839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11839. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9827 [https://github.com/apache/spark/pull/9827] > Renames traits to avoid collision with java.util.* and use default traits > to simplify the impl > -- > > Key: SPARK-11839 > URL: https://issues.apache.org/jira/browse/SPARK-11839 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 1.6.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.6.0 > > > This helps simplify development. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
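The "default traits" part is plain Scala: put the shared behavior in the trait so each concrete reader/writer only implements what differs. A minimal sketch of the pattern, with illustrative names rather than the actual spark.ml interfaces:
{code}
// Illustrative only: traits supplying default save()/load() so implementors
// only need to provide the underlying writer or reader.
abstract class MLWriter {
  def save(path: String): Unit
}

trait MLWritable {
  def write: MLWriter
  // default implementation shared by every writable model
  def save(path: String): Unit = write.save(path)
}

abstract class MLReader[T] {
  def load(path: String): T
}

trait MLReadable[T] {
  def read: MLReader[T]
  // default convenience method built on top of read
  def load(path: String): T = read.load(path)
}
{code}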
[jira] [Resolved] (SPARK-11833) Add Java tests for Kryo/Java encoders
[ https://issues.apache.org/jira/browse/SPARK-11833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11833. - Resolution: Fixed Fix Version/s: 1.6.0 > Add Java tests for Kryo/Java encoders > - > > Key: SPARK-11833 > URL: https://issues.apache.org/jira/browse/SPARK-11833 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
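For reference, the encoders being tested are the generic serialization-based ones, which treat the whole object as opaque bytes. A small Scala sketch of the usage the Java tests mirror, with a made-up record class:
{code}
import org.apache.spark.sql.{Dataset, Encoders, SQLContext}

// Illustrative record type; any class Kryo can serialize works here.
class Record(val id: Int, val name: String) extends Serializable

def kryoDataset(sqlContext: SQLContext): Dataset[Record] = {
  // Encoders.kryo encodes the object as a single binary column;
  // Encoders.javaSerialization is the Java-serialization analogue.
  implicit val enc = Encoders.kryo[Record]
  sqlContext.createDataset(Seq(new Record(1, "a"), new Record(2, "b")))
}
{code}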
[jira] [Commented] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE
[ https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012655#comment-15012655 ] Xin Wu commented on SPARK-9761: --- One thing I notice is that if I create the table explicitly before letting the dataframe write into it, describe table does show the column added by ALTER TABLE, even though I created the table stored as parquet and verified that the saved data file is in parquet format.
{code}
hiveContext.sql("drop table Orders")
val df = hiveContext.read.json("/home/xwu0226/spark-tables/Orders.json")
df.show()
hiveContext.sql("create table orders(customerID int, orderID int) stored as parquet")
df.write.mode(SaveMode.Append).saveAsTable("Orders")
hiveContext.sql("ALTER TABLE Orders add columns (z string)")
hiveContext.sql("describe extended Orders").show
{code}
output:
{code}
+----------+---------+-------+
|  col_name|data_type|comment|
+----------+---------+-------+
|customerid|      int|       |
|   orderid|      int|       |
|         z|   string|       |
+----------+---------+-------+
{code}
So with the explicit creation of the table, describe seems to use schema merging, while the other case does not merge the schema. The "spark.sql.sources.provider" property is defined for the explicitly created table, so the lookupRelation logic in HiveMetastoreCatalog.scala looks the relation up in cachedDataSourceTables; when it is not found there, the relation is reloaded from the parquet file, and the column schemas are built from the parquet content. It would be nice if the schema were merged when constructing this new relation before handing it back to the caller. Looking deeper into this. > Inconsistent metadata handling with ALTER TABLE > --- > > Key: SPARK-9761 > URL: https://issues.apache.org/jira/browse/SPARK-9761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > Labels: hive, sql > > Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. > The table in question was created with {{HiveContext.read.json()}}. > Steps: > # {{alter table dimension_components add columns (z string);}} succeeds. > # {{describe dimension_components;}} does not show the new column, even after > restarting spark-sql. > # A second {{alter table dimension_components add columns (z string);}} fails > with ERROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: > Duplicate column name: z > Full spark-sql output > [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9761) Inconsistent metadata handling with ALTER TABLE
[ https://issues.apache.org/jira/browse/SPARK-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012655#comment-15012655 ] Xin Wu edited comment on SPARK-9761 at 11/19/15 2:31 AM: - One thing I notice is that if I create the table explicitly before letting the dataframe to write into the table, describe table will show the alter added column. Even though I created the table stored as parquet and I verified that the saved data file is parquet format. {code} hiveContext.sql("drop table Orders") val df = hiveContext.read.json("/home/xwu0226/spark-tables/Orders.json") df.show() hiveContext.sql("create table orders(customerID int, orderID int) stored as parquet") df.write.mode(SaveMode.Append).saveAsTable("Orders") hiveContext.sql("ALTER TABLE Orders add columns (z string)") hiveContext.sql("describe extended Orders").show {code} output: {code} +--+-+---+ | col_name|data_type|comment| +--+-+---+ |customerid| int| | | orderid| int| | | z| string| | +--+-+---+ {code} So with the explicit creation of the table, the describe seems to use the schema merging, while the other case does not merge schema.. "spark.sql.sources.provider" property is defined for implicitly created table, such that the logic of lookupRelation in HiveMetastoreCatalog.scala goes to look up from the cachedDataSrouceTables, where the relation is not found then, get reloaded from parquet file, resulting in column schemas created according to parquet content.. It would be nice the schema is merged when constructing this new relation before giving it back to caller. Looking deeper into this.. was (Author: xwu0226): One thing I notice is that if I create the table explicitly before letting the dataframe to write into the table, describe table will show the alter added column. Even though I created the table stored as parquet and I verified that the saved data file is parquet format. {code} hiveContext.sql("drop table Orders") val df = hiveContext.read.json("/home/xwu0226/spark-tables/Orders.json") df.show() hiveContext.sql("create table orders(customerID int, orderID int) stored as parquet") df.write.mode(SaveMode.Append).saveAsTable("Orders") hiveContext.sql("ALTER TABLE Orders add columns (z string)") hiveContext.sql("describe extended Orders").show {code} output: {code} +--+-+---+ | col_name|data_type|comment| +--+-+---+ |customerid| int| | | orderid| int| | | z| string| | +--+-+---+ {code} So with the explicit creation of the table, the describe seems to use the schema merging, while the other case does not merge schema.. "spark.sql.sources.provider" property is defined for explicitly created table, such that the logic of lookupRelation in HiveMetastoreCatalog.scala goes to look up from the cachedDataSrouceTables, where the relation is not found then, get reloaded from parquet file, resulting in column schemas created according to parquet content.. It would be nice the schema is merged when constructing this new relation before giving it back to caller. Looking deeper into this.. > Inconsistent metadata handling with ALTER TABLE > --- > > Key: SPARK-9761 > URL: https://issues.apache.org/jira/browse/SPARK-9761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 > Environment: Ubuntu on AWS >Reporter: Simeon Simeonov > Labels: hive, sql > > Schema changes made with {{ALTER TABLE}} are not shown in {{DESCRIBE TABLE}}. > The table in question was created with {{HiveContext.read.json()}}. 
> Steps: > # {{alter table dimension_components add columns (z string);}} succeeds. > # {{describe dimension_components;}} does not show the new column, even after > restarting spark-sql. > # A second {{alter table dimension_components add columns (z string);}} fails > with RROR exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: > Duplicate column name: z > Full spark-sql output > [here|https://gist.github.com/ssimeonov/d9af4b8bb76b9d7befde]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11843) Isolate staging directory across applications on same YARN cluster
[ https://issues.apache.org/jira/browse/SPARK-11843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012653#comment-15012653 ] Marcelo Vanzin commented on SPARK-11843: Client.scala appends the application ID to ".sparkStaging" to form the actual name of the staging directory. Where have you seen two applications using the same staging dir? > Isolate staging directory across applications on same YARN cluster > -- > > Key: SPARK-11843 > URL: https://issues.apache.org/jira/browse/SPARK-11843 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Andrew Or >Priority: Minor > > If multiple clients share the same YARN cluster and file system they may end > up using the same `.sparkStaging` directory. This may be a problem if their > jars are called something similar, for instance. It would be easier to > enforce isolation for both security and user experience if the staging > directories are isolated. We can either: > (1) allow users to configure the directory name > (2) add an identifier to the directory name, which I prefer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
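The isolation discussed here boils down to building the staging path from the YARN application id. A rough sketch of that construction (not the actual Client.scala code, just the idea, with an invented example id):
{code}
import org.apache.hadoop.fs.Path

// The staging dir lives under the submitting user's home directory and is
// suffixed with the application id, so two applications never share it.
def appStagingDir(homeDir: Path, appId: String): Path =
  new Path(homeDir, ".sparkStaging" + Path.SEPARATOR + appId)

// e.g. appStagingDir(new Path("/user/alice"), "application_1447894127000_0001")
//      => /user/alice/.sparkStaging/application_1447894127000_0001
{code}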
[jira] [Created] (SPARK-11843) Isolate staging directory across applications on same YARN cluster
Andrew Or created SPARK-11843: - Summary: Isolate staging directory across applications on same YARN cluster Key: SPARK-11843 URL: https://issues.apache.org/jira/browse/SPARK-11843 Project: Spark Issue Type: Bug Components: YARN Reporter: Andrew Or Priority: Minor If multiple clients share the same YARN cluster and file system they may end up using the same `.sparkStaging` directory. This may be a problem if their jars are called something similar, for instance. It would be easier to enforce isolation for both security and user experience if the staging directories are isolated. We can either: (1) allow users to configure the directory name (2) add an identifier to the directory name, which I prefer -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7286) Precedence of operator not behaving properly
[ https://issues.apache.org/jira/browse/SPARK-7286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012643#comment-15012643 ] Jakob Odersky commented on SPARK-7286: -- Going through the code, I saw that catalyst also defines !== in its dsl, so it seems this operator has quite wide-spread usage. Would deprecating it in favor of something else be a viable option? > Precedence of operator not behaving properly > > > Key: SPARK-7286 > URL: https://issues.apache.org/jira/browse/SPARK-7286 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.1 > Environment: Linux >Reporter: DevilJetha >Priority: Critical > > The precedence of the operators ( especially with !== and && ) in Dataframe > Columns seems to be messed up. > Example Snippet > .where( $"col1" === "val1" && ($"col2" !== "val2") ) works fine. > whereas .where( $"col1" === "val1" && $"col2" !== "val2" ) > evaluates as ( $"col1" === "val1" && $"col2" ) !== "val2" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
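The grouping reported above follows from Scala's operator precedence rules: precedence is decided by an operator's first character, and an operator that ends in '=' but does not start with one (apart from a few exceptions such as {{!=}}) is treated as an assignment operator with the lowest precedence. So {{!==}} binds more loosely than {{&&}}, while {{===}} binds more tightly. A small sketch of the intended and surprising forms, with illustrative column names:
{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Intended meaning: explicit parentheses keep !== attached to col2.
def intended(df: DataFrame): DataFrame =
  df.where(col("col1") === "val1" && (col("col2") !== "val2"))

// Without the parentheses, !== has the lowest precedence, so this parses as
// ((col("col1") === "val1") && col("col2")) !== "val2", the behavior reported above.
def surprising(df: DataFrame): DataFrame =
  df.where(col("col1") === "val1" && col("col2") !== "val2")
{code}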
[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012638#comment-15012638 ] Andrew Or commented on SPARK-11278: --- also, when you said 6 nodes what kind of nodes are they? How much memory / cores per node? > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.1 >Reporter: Nishkam Ravi >Priority: Critical > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or reassigned SPARK-11278: - Assignee: Andrew Or > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.1 >Reporter: Nishkam Ravi >Assignee: Andrew Or >Priority: Critical > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012632#comment-15012632 ] Andrew Or commented on SPARK-11278: --- [~nravi] can you try again with the latest 1.6 branch to see if this is still an issue? I wonder how this is different with https://github.com/apache/spark/commit/56419cf11f769c80f391b45dc41b3c7101cc5ff4. > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.1 >Reporter: Nishkam Ravi >Priority: Critical > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
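One way to narrow this down while re-testing on branch-1.6 is to run the same job under both memory managers; the legacy (pre-unified) manager can still be selected with {{spark.memory.useLegacyMode}}. A hedged sketch of the two configurations, with placeholder sizes:
{code}
import org.apache.spark.SparkConf

// Placeholder memory settings; only the manager toggle matters here.
val unified = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.memory.fraction", "0.75")        // unified manager (1.6 default behavior)

val legacy = new SparkConf()
  .set("spark.executor.memory", "8g")
  .set("spark.memory.useLegacyMode", "true")   // fall back to the static 1.5-style manager
  .set("spark.storage.memoryFraction", "0.6")
  .set("spark.shuffle.memoryFraction", "0.2")
{code}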
[jira] [Created] (SPARK-11842) Cleanups to existing Readers and Writers
Joseph K. Bradley created SPARK-11842: - Summary: Cleanups to existing Readers and Writers Key: SPARK-11842 URL: https://issues.apache.org/jira/browse/SPARK-11842 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor Small cleanups to existing Readers and Writers * Add {{repartition(1)}} to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow
[ https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012618#comment-15012618 ] Josh Rosen commented on SPARK-11649: [~vanzin], we actually _did_ see this fail in the master builds, too, and it's also really slow there, so this change is also relevant for master and 1.6. > "SparkListenerSuite.onTaskGettingResult() called when result fetched > remotely" test is very slow > > > Key: SPARK-11649 > URL: https://issues.apache.org/jira/browse/SPARK-11649 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.0 > > > The SparkListenerSuite "onTaskGettingResult() called when result fetched > remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which > seems excessively slow; we should see if there's an easy way to speed this up: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11841) None of start-all.sh, start-master.sh or start-slaves.sh takes -m, -c or -d configuration options as per the document
Xiangyu Li created SPARK-11841: -- Summary: None of start-all.sh, start-master.sh or start-slaves.sh takes -m, -c or -d configuration options as per the document Key: SPARK-11841 URL: https://issues.apache.org/jira/browse/SPARK-11841 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2 Reporter: Xiangyu Li I was trying to set up Spark Standalone Mode following the tutorial at http://spark.apache.org/docs/latest/spark-standalone.html. The tutorial says that we can pass "-c CORES" to the worker to set the total number of CPU cores allowed. However, none of start-all.sh, start-master.sh, or start-slaves.sh accepts those options as arguments: start-all.sh and start-slaves.sh simply ignore them, while start-master.sh only takes -h, -i, -p, and --properties-file according to https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala. So the only way I can limit the number of cores for an application at the moment is to set SPARK_WORKER_CORES in ${SPARK_HOME}/conf/spark-env.sh and then run start-all.sh. This looks like an error in either the documentation or the scripts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.
[ https://issues.apache.org/jira/browse/SPARK-11840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11840: Assignee: Apache Spark (was: Yin Huai) > Restore the 1.5's behavior of planning a single distinct aggregation. > - > > Key: SPARK-11840 > URL: https://issues.apache.org/jira/browse/SPARK-11840 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > The impact of this change is for a query that has a single distinct column > and does not have any grouping expression like > {{SELECT COUNT(DISTINCT a) FROM table}} > The plan will be changed from > {code} > AGG-2 (count distinct) > Shuffle to a single reducer > Partial-AGG-2 (count distinct) > AGG-1 (grouping on a) > Shuffle by a > Partial-AGG-1 (grouping on 1) > {code} > to the following one (1.5 uses this) > {code} > AGG-2 > AGG-1 (grouping on a) > Shuffle to a single reducer > Partial-AGG-1(grouping on a) > {code} > The first plan is more robust. However, to better benchmark the impact of > this change, we should use 1.5's plan and use the conf of > {{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.
[ https://issues.apache.org/jira/browse/SPARK-11840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11840: Assignee: Yin Huai (was: Apache Spark) > Restore the 1.5's behavior of planning a single distinct aggregation. > - > > Key: SPARK-11840 > URL: https://issues.apache.org/jira/browse/SPARK-11840 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > The impact of this change is for a query that has a single distinct column > and does not have any grouping expression like > {{SELECT COUNT(DISTINCT a) FROM table}} > The plan will be changed from > {code} > AGG-2 (count distinct) > Shuffle to a single reducer > Partial-AGG-2 (count distinct) > AGG-1 (grouping on a) > Shuffle by a > Partial-AGG-1 (grouping on 1) > {code} > to the following one (1.5 uses this) > {code} > AGG-2 > AGG-1 (grouping on a) > Shuffle to a single reducer > Partial-AGG-1(grouping on a) > {code} > The first plan is more robust. However, to better benchmark the impact of > this change, we should use 1.5's plan and use the conf of > {{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.
[ https://issues.apache.org/jira/browse/SPARK-11840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012607#comment-15012607 ] Apache Spark commented on SPARK-11840: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/9828 > Restore the 1.5's behavior of planning a single distinct aggregation. > - > > Key: SPARK-11840 > URL: https://issues.apache.org/jira/browse/SPARK-11840 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > The impact of this change is for a query that has a single distinct column > and does not have any grouping expression like > {{SELECT COUNT(DISTINCT a) FROM table}} > The plan will be changed from > {code} > AGG-2 (count distinct) > Shuffle to a single reducer > Partial-AGG-2 (count distinct) > AGG-1 (grouping on a) > Shuffle by a > Partial-AGG-1 (grouping on 1) > {code} > to the following one (1.5 uses this) > {code} > AGG-2 > AGG-1 (grouping on a) > Shuffle to a single reducer > Partial-AGG-1(grouping on a) > {code} > The first plan is more robust. However, to better benchmark the impact of > this change, we should use 1.5's plan and use the conf of > {{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11840) Restore the 1.5's behavior of planning a single distinct aggregation.
Yin Huai created SPARK-11840: Summary: Restore the 1.5's behavior of planning a single distinct aggregation. Key: SPARK-11840 URL: https://issues.apache.org/jira/browse/SPARK-11840 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai The impact of this change is for a query that has a single distinct column and does not have any grouping expression like {{SELECT COUNT(DISTINCT a) FROM table}} The plan will be changed from {code} AGG-2 (count distinct) Shuffle to a single reducer Partial-AGG-2 (count distinct) AGG-1 (grouping on a) Shuffle by a Partial-AGG-1 (grouping on 1) {code} to the following one (1.5 uses this) {code} AGG-2 AGG-1 (grouping on a) Shuffle to a single reducer Partial-AGG-1(grouping on a) {code} The first plan is more robust. However, to better benchmark the impact of this change, we should use 1.5's plan and use the conf of {{spark.sql.specializeSingleDistinctAggPlanning}} to control the plan. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
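To see which plan a given build produces, the conf named in the description can be flipped and the physical plan inspected with {{explain()}}; a small sketch with an illustrative table name:
{code}
import org.apache.spark.sql.SQLContext

// Assumes a registered table `events` with a column `a`; both names are made up.
def comparePlans(sqlContext: SQLContext): Unit = {
  val query = "SELECT COUNT(DISTINCT a) FROM events"
  for (specialize <- Seq("true", "false")) {
    sqlContext.setConf("spark.sql.specializeSingleDistinctAggPlanning", specialize)
    println(s"specializeSingleDistinctAggPlanning = $specialize")
    sqlContext.sql(query).explain()
  }
}
{code}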
[jira] [Commented] (SPARK-11649) "SparkListenerSuite.onTaskGettingResult() called when result fetched remotely" test is very slow
[ https://issues.apache.org/jira/browse/SPARK-11649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012601#comment-15012601 ] Andrew Or commented on SPARK-11649: --- I back ported it into 1.5. > "SparkListenerSuite.onTaskGettingResult() called when result fetched > remotely" test is very slow > > > Key: SPARK-11649 > URL: https://issues.apache.org/jira/browse/SPARK-11649 > Project: Spark > Issue Type: Sub-task > Components: Tests >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.5.3, 1.6.0 > > > The SparkListenerSuite "onTaskGettingResult() called when result fetched > remotely" test seems to take between 1 to 4 minutes to run in Jenkins, which > seems excessively slow; we should see if there's an easy way to speed this up: > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/Spark-1.5-Maven-pre-YARN/938/HADOOP_VERSION=1.2.1,label=spark-test/testReport/org.apache.spark.scheduler/SparkListenerSuite/onTaskGettingResult___called_when_result_fetched_remotely/history/ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org