[jira] [Commented] (SPARK-8621) crosstab exception when one of the value is empty

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607904#comment-14607904
 ] 

Apache Spark commented on SPARK-8621:
-

User 'animeshbaranawal' has created a pull request for this issue:
https://github.com/apache/spark/pull/7117

 crosstab exception when one of the value is empty
 -

 Key: SPARK-8621
 URL: https://issues.apache.org/jira/browse/SPARK-8621
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 I think this happened because some value is empty.
 {code}
 scala> df1.stat.crosstab("role", "lang")
 org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
   at 
 org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
   at 
 org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
 {code}
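 
 A minimal sketch of a likely reproduction, assuming the spark-shell with sqlContext in scope; the column and value names are illustrative only. One of the "role" values is the empty string, which crosstab later turns into an empty column name that DataFrameNaFunctions.fill cannot resolve.
 {code}
 val df1 = sqlContext.createDataFrame(Seq(
   ("", "scala"),
   ("dev", "python"),
   ("dev", "scala")
 )).toDF("role", "lang")

 // Fails with the AnalysisException shown in the stack trace above.
 df1.stat.crosstab("role", "lang")
 {code}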






[jira] [Commented] (SPARK-8552) Using incorrect database in multiple sessions

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607910#comment-14607910
 ] 

Apache Spark commented on SPARK-8552:
-

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/7118

 Using incorrect database in multiple sessions
 -

 Key: SPARK-8552
 URL: https://issues.apache.org/jira/browse/SPARK-8552
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yi Tian
Priority: Critical

 To reproduce this problem:
 * 1. start thrift server
 {quote}
 sbin/start-thriftserver.sh
 {quote}
 * 2. first connection: execute "use test"
 {quote}
 bin/beeline -u jdbc:hive2://localhost:1/default -n any -p any -e "use test"
 {quote}
 * 3. second connection: execute "show tables"
 {quote}
 bin/beeline -u jdbc:hive2://localhost:1/default -n any -p any -e "show tables"
 {quote}
 * 4. you will find that the result lists the tables of the {{test}} database






[jira] [Assigned] (SPARK-8552) Using incorrect database in multiple sessions

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8552:
---

Assignee: (was: Apache Spark)

 Using incorrect database in multiple sessions
 -

 Key: SPARK-8552
 URL: https://issues.apache.org/jira/browse/SPARK-8552
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yi Tian
Priority: Critical

 To reproduce this problem:
 * 1. start thrift server
 {quote}
 sbin/start-thriftserver.sh
 {quote}
 * 2. first connection: execute "use test"
 {quote}
 bin/beeline -u jdbc:hive2://localhost:1/default -n any -p any -e "use test"
 {quote}
 * 3. second connection: execute "show tables"
 {quote}
 bin/beeline -u jdbc:hive2://localhost:1/default -n any -p any -e "show tables"
 {quote}
 * 4. you will find that the result lists the tables of the {{test}} database






[jira] [Assigned] (SPARK-8552) Using incorrect database in multiple sessions

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8552:
---

Assignee: Apache Spark

 Using incorrect database in multiple sessions
 -

 Key: SPARK-8552
 URL: https://issues.apache.org/jira/browse/SPARK-8552
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yi Tian
Assignee: Apache Spark
Priority: Critical

 To reproduce this problem:
 * 1. start thrift server
 {quote}
 sbin/start-thriftserver.sh
 {quote}
 * 2. first connection: execute "use test"
 {quote}
 bin/beeline -u jdbc:hive2://localhost:1/default -n any -p any -e "use test"
 {quote}
 * 3. second connection: execute "show tables"
 {quote}
 bin/beeline -u jdbc:hive2://localhost:1/default -n any -p any -e "show tables"
 {quote}
 * 4. you will find that the result lists the tables of the {{test}} database






[jira] [Created] (SPARK-8728) Add configuration for limiting the maximum number of active stages in a fair scheduling queue

2015-06-30 Thread Keuntae Park (JIRA)
Keuntae Park created SPARK-8728:
---

 Summary: Add configuration for limiting the maximum number of 
active stages in a fair scheduling queue
 Key: SPARK-8728
 URL: https://issues.apache.org/jira/browse/SPARK-8728
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Keuntae Park
Priority: Minor


Currently, every TaskSetManager in a fair queue is scheduled concurrently.
This may harm the interactivity of all jobs when the number of queued jobs 
becomes large.
I think it would be useful to add a configuration like YARN's 'maxRunningApps'.
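
A hedged sketch of the kind of setting being proposed; the property name below does not exist in Spark and is purely illustrative of a per-pool cap analogous to YARN's maxRunningApps:
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  // Hypothetical option: limit how many stages of one fair pool run at once.
  .set("spark.scheduler.pool.default.maxRunningStages", "4")
{code}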






[jira] [Commented] (SPARK-8041) Consistently pass SparkR library directory to SparkR application

2015-06-30 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607877#comment-14607877
 ] 

Sun Rui commented on SPARK-8041:


Sorry, this JIRA is obsolete as we are addressing SPARK-6797. You can take a 
look at it.

 Consistently pass SparkR library directory to SparkR application
 

 Key: SPARK-8041
 URL: https://issues.apache.org/jira/browse/SPARK-8041
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Sun Rui

 The SparkR package library directory path (RLibDir) is needed by SparkR 
 applications for loading the SparkR package and locating R helper files inside 
 the package.
 Currently, there are several places where the RLibDir needs to be specified.
 First of all, when you write a SparkR application, sparkR.init() allows 
 you to pass an RLibDir parameter (by default, it is the same as the SparkR 
 package's libname on the driver host). However, it seems unreasonable to 
 hard-code RLibDir in a program. Instead, it would be more flexible to pass 
 RLibDir via the command line or an env variable.
 Additionally, in YARN cluster mode, RRunner depends on the SPARK_HOME env 
 variable to get the RLibDir (assumed to be $SPARK_HOME/R/lib). 
 So it would be better to define a consistent way to pass RLibDir to a SparkR 
 application in all deployment modes. It could be a command-line option for 
 bin/sparkR or an env variable. It can be passed to a SparkR application, and 
 we can remove the RLibDir parameter of sparkR.init(). In YARN cluster mode, 
 it can be passed to the AM using the 
 spark.yarn.appMasterEnv.[EnvironmentVariableName] configuration option.
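 
 As a sketch of that last point, assuming a hypothetical SPARKR_RLIB_DIR environment variable were chosen, it could be forwarded to the AM in YARN cluster mode like this (the variable name and path are assumptions):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .set("spark.yarn.appMasterEnv.SPARKR_RLIB_DIR", "/opt/spark/R/lib")
 {code}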






[jira] [Assigned] (SPARK-8621) crosstab exception when one of the value is empty

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8621:
---

Assignee: (was: Apache Spark)

 crosstab exception when one of the value is empty
 -

 Key: SPARK-8621
 URL: https://issues.apache.org/jira/browse/SPARK-8621
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Critical

 I think this happened because some value is empty.
 {code}
 scala> df1.stat.crosstab("role", "lang")
 org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
   at 
 org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
   at 
 org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
 {code}






[jira] [Assigned] (SPARK-8621) crosstab exception when one of the value is empty

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8621:
---

Assignee: Apache Spark

 crosstab exception when one of the value is empty
 -

 Key: SPARK-8621
 URL: https://issues.apache.org/jira/browse/SPARK-8621
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark
Priority: Critical

 I think this happened because some value is empty.
 {code}
 scala> df1.stat.crosstab("role", "lang")
 org.apache.spark.sql.AnalysisException: syntax error in attribute name: ;
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.parseAttributeName(LogicalPlan.scala:145)
   at 
 org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:135)
   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:157)
   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:603)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.org$apache$spark$sql$DataFrameNaFunctions$$fillCol(DataFrameNaFunctions.scala:394)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:160)
   at 
 org.apache.spark.sql.DataFrameNaFunctions$$anonfun$2.apply(DataFrameNaFunctions.scala:157)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:157)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:147)
   at 
 org.apache.spark.sql.DataFrameNaFunctions.fill(DataFrameNaFunctions.scala:132)
   at 
 org.apache.spark.sql.execution.stat.StatFunctions$.crossTabulate(StatFunctions.scala:132)
   at 
 org.apache.spark.sql.DataFrameStatFunctions.crosstab(DataFrameStatFunctions.scala:91)
 {code}






[jira] [Updated] (SPARK-8717) Update mllib-data-types docs to include missing matrix Python examples

2015-06-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8717:
-
Component/s: PySpark
 Documentation

[~Rosstin] Please set components

 Update mllib-data-types docs to include missing matrix Python examples
 

 Key: SPARK-8717
 URL: https://issues.apache.org/jira/browse/SPARK-8717
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, PySpark
Reporter: Rosstin Murphy
Priority: Minor

 Currently, the documentation for MLlib Data Types (docs/mllib-data-types.md 
 in the repo, https://spark.apache.org/docs/latest/mllib-data-types.html in 
 the latest online docs) stops listing Python examples at Labeled point.
 Local vector and Labeled point have Python examples; however, none of the 
 matrix entries have Python examples.
 The matrix entries could be updated to include Python examples.
 I'm not 100% sure that all the matrices currently have implemented Python 
 equivalents, but I'm pretty sure that at least the first one (Local matrix) 
 could have an entry:
 from pyspark.mllib.linalg import DenseMatrix
 dm = DenseMatrix(3, 2, [1.0, 3.0, 5.0, 2.0, 4.0, 6.0])






[jira] [Commented] (SPARK-8560) The Executors page will have negative if having resubmitted tasks

2015-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607889#comment-14607889
 ] 

Sean Owen commented on SPARK-8560:
--

Please fix the title

 The Executors page will have negative if having resubmitted tasks
 -

 Key: SPARK-8560
 URL: https://issues.apache.org/jira/browse/SPARK-8560
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Web UI
Affects Versions: 1.0.0
Reporter: meiyoula
 Attachments: screenshot-1.png









[jira] [Assigned] (SPARK-8271) string function: soundex

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8271:
---

Assignee: Apache Spark  (was: Cheng Hao)

 string function: soundex
 

 Key: SPARK-8271
 URL: https://issues.apache.org/jira/browse/SPARK-8271
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 soundex(string A): string
 Returns soundex code of the string. For example, soundex('Miller') results in 
 M460.
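 
 For reference, a simplified plain-Scala sketch of the soundex encoding; it ignores the H/W adjacency rule and assumes non-empty alphabetic input, and it is not the proposed Spark implementation:
 {code}
 def soundex(s: String): String = {
   val codes = Map('b' -> '1', 'f' -> '1', 'p' -> '1', 'v' -> '1',
     'c' -> '2', 'g' -> '2', 'j' -> '2', 'k' -> '2', 'q' -> '2',
     's' -> '2', 'x' -> '2', 'z' -> '2', 'd' -> '3', 't' -> '3',
     'l' -> '4', 'm' -> '5', 'n' -> '5', 'r' -> '6')
   val lower = s.toLowerCase
   // Encode every letter; '0' marks vowels and other letters that are dropped.
   val encoded = lower.map(c => codes.getOrElse(c, '0'))
   // Collapse adjacent duplicates, drop the first letter's code, remove placeholders.
   val collapsed = encoded.foldLeft("")((acc, c) => if (acc.nonEmpty && acc.last == c) acc else acc + c)
   val digits = collapsed.tail.filter(_ != '0')
   (lower.head.toUpper.toString + digits).padTo(4, '0').take(4)
 }

 soundex("Miller")  // returns "M460", matching the example above
 {code}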






[jira] [Commented] (SPARK-8271) string function: soundex

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607899#comment-14607899
 ] 

Apache Spark commented on SPARK-8271:
-

User 'HuJiayin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7115

 string function: soundex
 

 Key: SPARK-8271
 URL: https://issues.apache.org/jira/browse/SPARK-8271
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 soundex(string A): string
 Returns soundex code of the string. For example, soundex('Miller') results in 
 M460.






[jira] [Assigned] (SPARK-8728) Add configuration for limiting the maximum number of active stages in a fair scheduling queue

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8728:
---

Assignee: Apache Spark

 Add configuration for limiting the maximum number of active stages in a fair 
 scheduling queue
 -

 Key: SPARK-8728
 URL: https://issues.apache.org/jira/browse/SPARK-8728
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Keuntae Park
Assignee: Apache Spark
Priority: Minor

 Currently, every TaskSetManager in a fair queue is scheduled concurrently.
 This may harm the interactivity of all jobs when the number of queued jobs 
 becomes large.
 I think it would be useful to add a configuration like YARN's 'maxRunningApps'.






[jira] [Commented] (SPARK-8728) Add configuration for limiting the maximum number of active stages in a fair scheduling queue

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607985#comment-14607985
 ] 

Apache Spark commented on SPARK-8728:
-

User 'sirpkt' has created a pull request for this issue:
https://github.com/apache/spark/pull/7119

 Add configuration for limiting the maximum number of active stages in a fair 
 scheduling queue
 -

 Key: SPARK-8728
 URL: https://issues.apache.org/jira/browse/SPARK-8728
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Keuntae Park
Priority: Minor

 Currently, every TaskSetManager in a fair queue is scheduled concurrently.
 This may harm the interactivity of all jobs when the number of queued jobs 
 becomes large.
 I think it would be useful to add a configuration like YARN's 'maxRunningApps'.






[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction

2015-06-30 Thread Sebastian Alfers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607994#comment-14607994
 ] 

Sebastian Alfers commented on SPARK-7334:
-

[~josephkb] any progress on this one?

 Implement RandomProjection for Dimensionality Reduction
 ---

 Key: SPARK-7334
 URL: https://issues.apache.org/jira/browse/SPARK-7334
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Sebastian Alfers
Priority: Minor

 Implement RandomProjection (RP) for dimensionality reduction.
 RP is a popular approach to reduce the amount of data while preserving a 
 reasonable amount of information (pairwise distances) in your data [1][2].
 - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
 - [2] 
 http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
 I compared different implementations of that algorithm:
 - https://github.com/sebastian-alfers/random-projection-python
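 
 A minimal sketch of the core idea (not the proposed MLlib API), using Breeze, which Spark already depends on: project d-dimensional rows onto k random Gaussian directions so that pairwise distances are approximately preserved.
 {code}
 import breeze.linalg.DenseMatrix
 import scala.util.Random

 def randomProjection(x: DenseMatrix[Double], k: Int, seed: Long = 42L): DenseMatrix[Double] = {
   val rnd = new Random(seed)
   // Entries drawn from N(0, 1/k) so expected squared distances are preserved.
   val r = DenseMatrix.fill[Double](x.cols, k)(rnd.nextGaussian() / math.sqrt(k))
   x * r  // n x d times d x k = n x k
 }
 {code}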






[jira] [Commented] (SPARK-7402) JSON serialization of params

2015-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14607890#comment-14607890
 ] 

Sean Owen commented on SPARK-7402:
--

Is this still critical and targeted for 1.4.1 now that the RC is in progress?

 JSON serialization of params
 

 Key: SPARK-7402
 URL: https://issues.apache.org/jira/browse/SPARK-7402
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical

 Add JSON support to Param in order to persist parameters with transformers, 
 estimators, and models.
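 
 A sketch only of what persisting a set of param values as JSON might look like, using json4s (already a Spark dependency); the eventual Param API may differ:
 {code}
 import org.json4s.JsonDSL._
 import org.json4s.jackson.JsonMethods._

 // Hypothetical param values keyed by param name.
 val params = Map("maxIter" -> "10", "regParam" -> "0.01")
 val json = compact(render(params))   // {"maxIter":"10","regParam":"0.01"}
 {code}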






[jira] [Assigned] (SPARK-2505) Weighted Regularizer

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2505:
---

Assignee: Apache Spark

 Weighted Regularizer
 

 Key: SPARK-2505
 URL: https://issues.apache.org/jira/browse/SPARK-2505
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
Assignee: Apache Spark

 The current implementation of regularization in the linear models uses 
 `Updater`, and this design has a couple of issues:
 1) It penalizes all the weights, including the intercept. In the machine 
 learning training process, people typically don't penalize the intercept. 
 2) The `Updater` also contains the adaptive step size logic for gradient descent, 
 and we would like to clean it up by separating the regularization logic out 
 of the updater into a regularizer, so that in the LBFGS optimizer we don't need 
 the trick for getting the loss and gradient of the objective function.
 In this work, a weighted regularizer will be implemented, and users can 
 exclude the intercept or any weight from regularization by giving that term 
 a zero penalty weight. Since the regularizer will return a tuple of loss 
 and gradient, the adaptive step size logic and the soft thresholding for L1 in 
 Updater will be moved to the SGD optimizer.
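 
 A sketch of the interface this description implies; the trait name and signature are assumptions, meant only to show the shape of a per-term penalty (zero entries exclude the intercept) and a (loss, gradient) return value:
 {code}
 import org.apache.spark.mllib.linalg.Vector

 trait WeightedRegularizer {
   /** Returns the regularization loss and its gradient for the given weights. */
   def compute(weights: Vector, penaltyWeights: Vector): (Double, Vector)
 }
 {code}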






[jira] [Assigned] (SPARK-2505) Weighted Regularizer

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-2505:
---

Assignee: (was: Apache Spark)

 Weighted Regularizer
 

 Key: SPARK-2505
 URL: https://issues.apache.org/jira/browse/SPARK-2505
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai

 The current implementation of regularization in the linear models uses 
 `Updater`, and this design has a couple of issues:
 1) It penalizes all the weights, including the intercept. In the machine 
 learning training process, people typically don't penalize the intercept. 
 2) The `Updater` also contains the adaptive step size logic for gradient descent, 
 and we would like to clean it up by separating the regularization logic out 
 of the updater into a regularizer, so that in the LBFGS optimizer we don't need 
 the trick for getting the loss and gradient of the objective function.
 In this work, a weighted regularizer will be implemented, and users can 
 exclude the intercept or any weight from regularization by giving that term 
 a zero penalty weight. Since the regularizer will return a tuple of loss 
 and gradient, the adaptive step size logic and the soft thresholding for L1 in 
 Updater will be moved to the SGD optimizer.






[jira] [Closed] (SPARK-8041) Consistently pass SparkR library directory to SparkR application

2015-06-30 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-8041.
--
Resolution: Duplicate

This issue is covered by SPARK-6797

 Consistently pass SparkR library directory to SparkR application
 

 Key: SPARK-8041
 URL: https://issues.apache.org/jira/browse/SPARK-8041
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Sun Rui

 The SparkR package library directory path (RLibDir) is needed by SparkR 
 applications for loading the SparkR package and locating R helper files inside 
 the package.
 Currently, there are several places where the RLibDir needs to be specified.
 First of all, when you write a SparkR application, sparkR.init() allows 
 you to pass an RLibDir parameter (by default, it is the same as the SparkR 
 package's libname on the driver host). However, it seems unreasonable to 
 hard-code RLibDir in a program. Instead, it would be more flexible to pass 
 RLibDir via the command line or an env variable.
 Additionally, in YARN cluster mode, RRunner depends on the SPARK_HOME env 
 variable to get the RLibDir (assumed to be $SPARK_HOME/R/lib). 
 So it would be better to define a consistent way to pass RLibDir to a SparkR 
 application in all deployment modes. It could be a command-line option for 
 bin/sparkR or an env variable. It can be passed to a SparkR application, and 
 we can remove the RLibDir parameter of sparkR.init(). In YARN cluster mode, 
 it can be passed to the AM using the 
 spark.yarn.appMasterEnv.[EnvironmentVariableName] configuration option.






[jira] [Updated] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0

2015-06-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8699:
-
Target Version/s:   (was: 1.4.0)

[~kamlesh.kumar] Please first read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before 
filing a JIRA.

Don't set Target Version; in any event, 1.4.0 is already released and the 
version you say it affects.

 Select command not working for SparkR built on Spark Version: 1.4.0 and R 
 3.2.0
 ---

 Key: SPARK-8699
 URL: https://issues.apache.org/jira/browse/SPARK-8699
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows 7, 64 bit 
Reporter: Kamlesh Kumar
Priority: Critical
  Labels: test

 I can successfully run showDF and head on the rrdd data frame in R, but it throws 
 an unexpected error for select commands. 
 R console output after running a select command on the rrdd data object is as 
 follows:
 command:
 head(select(df, df$eruptions))
 output: 
 Error in head(select(df, df$eruptions)) : 
   error in evaluating the argument 'x' in selecting a method for function 
 'head': Error in UseMethod("select_") : 
   no applicable method for 'select_' applied to an object of class "DataFrame"






[jira] [Assigned] (SPARK-8271) string function: soundex

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8271:
---

Assignee: Cheng Hao  (was: Apache Spark)

 string function: soundex
 

 Key: SPARK-8271
 URL: https://issues.apache.org/jira/browse/SPARK-8271
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Cheng Hao

 soundex(string A): string
 Returns soundex code of the string. For example, soundex('Miller') results in 
 M460.






[jira] [Created] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed

2015-06-30 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-8743:


 Summary: Deregister Codahale metrics for streaming when 
StreamingContext is closed 
 Key: SPARK-8743
 URL: https://issues.apache.org/jira/browse/SPARK-8743
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.4.1
Reporter: Tathagata Das


Currently, when the StreamingContext is closed, the registered metrics are not 
deregistered. If another streaming context is started, it throws a warning 
saying that the metrics are already registered. 

The solution is to deregister the metrics when the StreamingContext is stopped.
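
A sketch of the symptom under local mode assumptions (app name, batch interval, and the dummy output operation are arbitrary): stopping the first context leaves its Codahale sources registered, so starting a second one logs the "already registered" warning.
{code}
import scala.collection.mutable.Queue
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("metrics-demo").setMaster("local[2]")

def dummyJob(ssc: StreamingContext): Unit =
  ssc.queueStream(Queue.empty[RDD[Int]]).print()  // an output op so start() is legal

val ssc1 = new StreamingContext(conf, Seconds(1))
dummyJob(ssc1)
ssc1.start()
ssc1.stop(stopSparkContext = false)

// ssc1's streaming metrics were never deregistered, so this second start()
// warns that the metrics are already registered.
val ssc2 = new StreamingContext(ssc1.sparkContext, Seconds(1))
dummyJob(ssc2)
ssc2.start()
{code}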






[jira] [Commented] (SPARK-8529) Set metadata for MinMaxScaler

2015-06-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609497#comment-14609497
 ] 

Joseph K. Bradley commented on SPARK-8529:
--

Here's an example of setting the metadata (but for a NominalAttribute): 
[https://github.com/apache/spark/blob/3bee0f1466ddd69f26e95297b5e0d2398b6c6268/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L135]

MinMaxScaler should actually use a NumericAttribute, setting its relevant 
fields.
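
A sketch of the kind of metadata that suggestion points at; the field values are illustrative, and exactly which attributes MinMaxScaler should set is up to the implementation:
{code}
import org.apache.spark.ml.attribute.NumericAttribute

val attr = NumericAttribute.defaultAttr
  .withName("scaledFeature")
  .withMin(0.0)
  .withMax(1.0)

// Metadata that the transformer could attach to its output column.
val metadata = attr.toMetadata()
{code}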

 Set metadata for MinMaxScaler
 -

 Key: SPARK-8529
 URL: https://issues.apache.org/jira/browse/SPARK-8529
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang
Priority: Minor

 Adding this as a reminder to complete the output metadata for the MinMaxScaler 
 transformer.






[jira] [Updated] (SPARK-8366) When task fails and append a new one, the ExecutorAllocationManager can't sense the new tasks

2015-06-30 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-8366:

Description: 
I use the *dynamic executor allocation* function. 
When an executor is killed, all running tasks on it will be failed. Until reach 
the maxTaskFailures, this failed task will re-run with a new task id. 
But the `ExecutorAllocationManager` won't concern this new tasks to total and 
pending tasks, because the total stage task number only set when stage 
submitted.

  was:
I use the *dynamic executor allocation* function. 
When an executor is killed, all running tasks on it will be failed. Until reach 
the maxTaskFailures, this failed task will re-run with a new task id. 
But the `ExecutorAllocationManager` won't concern this new tasks to pending 
tasks, because the total stage task number only set when stage submitted.


 When task fails and append a new one, the ExecutorAllocationManager can't 
 sense the new tasks
 -

 Key: SPARK-8366
 URL: https://issues.apache.org/jira/browse/SPARK-8366
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: meiyoula

 I use the *dynamic executor allocation* function. 
 When an executor is killed, all running tasks on it will be failed. Until 
 reach the maxTaskFailures, this failed task will re-run with a new task id. 
 But the `ExecutorAllocationManager` won't concern this new tasks to total and 
 pending tasks, because the total stage task number only set when stage 
 submitted.






[jira] [Updated] (SPARK-8366) When task fails and append a new one, the ExecutorAllocationManager can't sense the new tasks

2015-06-30 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-8366:

Description: 
I use the *dynamic executor allocation* function. 
When an executor is killed, all running tasks on it will be failed. Until reach 
the maxTaskFailures, this failed task will re-run with a new task id. 
But the `ExecutorAllocationManager` won't concern this new tasks to pending 
tasks, because the total stage task number only set when stage submitted.

  was:I use the *dynamic executor allocation* function. Then one executor is 
killed, all running tasks on it are failed. When the new tasks are appended, 
the new executor won't added.


 When task fails and append a new one, the ExecutorAllocationManager can't 
 sense the new tasks
 -

 Key: SPARK-8366
 URL: https://issues.apache.org/jira/browse/SPARK-8366
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: meiyoula

 I use the *dynamic executor allocation* function. 
 When an executor is killed, all running tasks on it will be failed. Until 
 reach the maxTaskFailures, this failed task will re-run with a new task id. 
 But the `ExecutorAllocationManager` won't concern this new tasks to pending 
 tasks, because the total stage task number only set when stage submitted.






[jira] [Updated] (SPARK-8366) When task fails and append a new one, the ExecutorAllocationManager can't sense the new tasks

2015-06-30 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-8366:

Description: 
I use the *dynamic executor allocation* function. 
When an executor is killed, all running tasks on it will be failed. Until reach 
the maxTaskFailures, this failed task will re-run with a new task id. 
But the *ExecutorAllocationManager* won't concern this new tasks to total and 
pending tasks, because the total stage task number only set when stage 
submitted.

  was:
I use the *dynamic executor allocation* function. 
When an executor is killed, all running tasks on it will be failed. Until reach 
the maxTaskFailures, this failed task will re-run with a new task id. 
But the `ExecutorAllocationManager` won't concern this new tasks to total and 
pending tasks, because the total stage task number only set when stage 
submitted.


 When task fails and append a new one, the ExecutorAllocationManager can't 
 sense the new tasks
 -

 Key: SPARK-8366
 URL: https://issues.apache.org/jira/browse/SPARK-8366
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: meiyoula

 I use the *dynamic executor allocation* function. 
 When an executor is killed, all running tasks on it will be failed. Until 
 reach the maxTaskFailures, this failed task will re-run with a new task id. 
 But the *ExecutorAllocationManager* won't concern this new tasks to total and 
 pending tasks, because the total stage task number only set when stage 
 submitted.






[jira] [Commented] (SPARK-8743) Deregister Codahale metrics for streaming when StreamingContext is closed

2015-06-30 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609437#comment-14609437
 ] 

Neelesh Srinivas Salian commented on SPARK-8743:


I would like to work on this JIRA.
Could you please assign this to me?

Thank you.


 Deregister Codahale metrics for streaming when StreamingContext is closed 
 --

 Key: SPARK-8743
 URL: https://issues.apache.org/jira/browse/SPARK-8743
 Project: Spark
  Issue Type: Sub-task
  Components: Streaming
Affects Versions: 1.4.1
Reporter: Tathagata Das
  Labels: starter

 Currently, when the StreamingContext is closed, the registered metrics are 
 not deregistered. If another streaming context is started, it throws a 
 warning saying that the metrics are already registered. 
 The solution is to deregister the metrics when the StreamingContext is stopped.






[jira] [Resolved] (SPARK-8727) Add missing python api

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8727.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7114
[https://github.com/apache/spark/pull/7114]

 Add missing python api
 --

 Key: SPARK-8727
 URL: https://issues.apache.org/jira/browse/SPARK-8727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Tarek Auel
 Fix For: 1.5.0


 Add the python api that is missing for
 https://issues.apache.org/jira/browse/SPARK-8248
 https://issues.apache.org/jira/browse/SPARK-8234
 https://issues.apache.org/jira/browse/SPARK-8217
 https://issues.apache.org/jira/browse/SPARK-8215
 https://issues.apache.org/jira/browse/SPARK-8212






[jira] [Closed] (SPARK-6892) Recovery from checkpoint will also reuse the application id when write eventLog in yarn-cluster mode

2015-06-30 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das closed SPARK-6892.

Resolution: Not A Problem

 Recovery from checkpoint will also reuse the application id when write 
 eventLog in yarn-cluster mode
 

 Key: SPARK-6892
 URL: https://issues.apache.org/jira/browse/SPARK-6892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.0
Reporter: yangping wu
Priority: Critical

 When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, 
 I found it reuses the application id from before the failure (in my case 
 application_1428664056212_0016) when writing the Spark eventLog. But 
 now my application id is application_1428664056212_0017, so writing the 
 eventLog fails, with the stacktrace as follows:
 {code}
 15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' 
 failed, java.io.IOException: Target log file already exists 
 (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
 java.io.IOException: Target log file already exists 
 (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
   at 
 org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
   at 
 org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
   at scala.Option.foreach(Option.scala:236)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
   at 
 org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
 {code}
 This exception causes the job to fail.






[jira] [Commented] (SPARK-8318) Spark Streaming Starter JIRAs

2015-06-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609399#comment-14609399
 ] 

Tathagata Das commented on SPARK-8318:
--

I think the starter label is not very easy to find, and most people search by 
JIRAs. In this way, we get the benefit of both, starter label as well finding a 
JIRA.
Case in point, the subtasks got solved pretty fast. 


 Spark Streaming Starter JIRAs
 -

 Key: SPARK-8318
 URL: https://issues.apache.org/jira/browse/SPARK-8318
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Priority: Minor
  Labels: starter

 This is a master JIRA to collect together all starter tasks related to Spark 
 Streaming. These are simple tasks that contributors can do to get familiar 
 with the process of contributing.






[jira] [Commented] (SPARK-8313) Support Spark Packages containing R code with --packages

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609403#comment-14609403
 ] 

Apache Spark commented on SPARK-8313:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/7139

 Support Spark Packages containing R code with --packages
 

 Key: SPARK-8313
 URL: https://issues.apache.org/jira/browse/SPARK-8313
 Project: Spark
  Issue Type: New Feature
  Components: Spark Submit, SparkR
Reporter: Burak Yavuz








[jira] [Assigned] (SPARK-8313) Support Spark Packages containing R code with --packages

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8313:
---

Assignee: (was: Apache Spark)

 Support Spark Packages containing R code with --packages
 

 Key: SPARK-8313
 URL: https://issues.apache.org/jira/browse/SPARK-8313
 Project: Spark
  Issue Type: New Feature
  Components: Spark Submit, SparkR
Reporter: Burak Yavuz








[jira] [Assigned] (SPARK-8313) Support Spark Packages containing R code with --packages

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8313:
---

Assignee: Apache Spark

 Support Spark Packages containing R code with --packages
 

 Key: SPARK-8313
 URL: https://issues.apache.org/jira/browse/SPARK-8313
 Project: Spark
  Issue Type: New Feature
  Components: Spark Submit, SparkR
Reporter: Burak Yavuz
Assignee: Apache Spark








[jira] [Commented] (SPARK-6990) Add Java linting script

2015-06-30 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609509#comment-14609509
 ] 

Yu Ishikawa commented on SPARK-6990:


Could you please assign this issue to me?

 Add Java linting script
 ---

 Key: SPARK-6990
 URL: https://issues.apache.org/jira/browse/SPARK-6990
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Josh Rosen
Priority: Minor
  Labels: starter

 It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
 Spark's Java code.






[jira] [Commented] (SPARK-3444) Provide a way to easily change the log level in the Spark shell while running

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609455#comment-14609455
 ] 

Apache Spark commented on SPARK-3444:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7140

 Provide a way to easily change the log level in the Spark shell while running
 -

 Key: SPARK-3444
 URL: https://issues.apache.org/jira/browse/SPARK-3444
 Project: Spark
  Issue Type: Improvement
  Components: Spark Shell
Reporter: holdenk
Assignee: Holden Karau
Priority: Minor
 Fix For: 1.4.0


 Right now it's difficult to change the log level while running. Our log 
 messages can be quite verbose at the more detailed levels, and some users 
 want to run at WARN until they encounter an issue and then increase the 
 logging level to debug without restarting the shell.
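 
 For reference, a short sketch of adjusting verbosity from a running shell: the log4j calls are standard, and sc.setLogLevel is presumably the helper this issue added (its Fix Version is 1.4.0):
 {code}
 import org.apache.log4j.{Level, Logger}

 Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG) // plain log4j
 sc.setLogLevel("DEBUG")                                     // SparkContext helper
 {code}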






[jira] [Created] (SPARK-8742) Improve SparkR error messages for DataFrame API

2015-06-30 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-8742:
-

 Summary: Improve SparkR error messages for DataFrame API
 Key: SPARK-8742
 URL: https://issues.apache.org/jira/browse/SPARK-8742
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.1
Reporter: Hossein Falaki
Priority: Blocker


Currently all DataFrame API errors result in the following generic error:

{code}
Error: returnStatus == 0 is not TRUE
{code}

This is because invokeJava in backend.R does not inspect error messages. For 
most use cases it is critical to return better error messages. Initially, we 
can return the stack trace from the JVM. In the future we can inspect the errors 
and translate them to human-readable error messages.






[jira] [Comment Edited] (SPARK-8318) Spark Streaming Starter JIRAs

2015-06-30 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609399#comment-14609399
 ] 

Tathagata Das edited comment on SPARK-8318 at 7/1/15 1:19 AM:
--

I think the starter label is not very easy to find, and most people search by 
JIRAs. In this way, we get the benefit of both, starter label as well as a easy 
to find JIRA
Case in point, the subtasks got solved pretty fast. 



was (Author: tdas):
I think the starter label is not very easy to find, and most people search by 
JIRAs. In this way, we get the benefit of both, starter label as well finding a 
JIRA.
Case in point, the subtasks got solved pretty fast. 


 Spark Streaming Starter JIRAs
 -

 Key: SPARK-8318
 URL: https://issues.apache.org/jira/browse/SPARK-8318
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Tathagata Das
Priority: Minor
  Labels: starter

 This is a master JIRA to collect together all starter tasks related to Spark 
 Streaming. These are simple tasks that contributors can do to get familiar 
 with the process of contributing.






[jira] [Created] (SPARK-8744) StringIndexerModel should have public constructor

2015-06-30 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-8744:


 Summary: StringIndexerModel should have public constructor
 Key: SPARK-8744
 URL: https://issues.apache.org/jira/browse/SPARK-8744
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Trivial


It would be helpful to allow users to pass a pre-computed index to create an 
indexer.
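
A sketch of the usage being proposed; the exact constructor signature is an assumption (today the model is produced only by StringIndexer.fit):
{code}
import org.apache.spark.ml.feature.StringIndexerModel

val labels = Array("a", "b", "c")                 // pre-computed index
val model = new StringIndexerModel(labels)        // hypothetical public constructor
  .setInputCol("category")
  .setOutputCol("categoryIndex")
{code}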






[jira] [Commented] (SPARK-8628) Race condition in AbstractSparkSQLParser.parse

2015-06-30 Thread Vinod KC (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609477#comment-14609477
 ] 

Vinod KC commented on SPARK-8628:
-

Can you please assign this to me

 Race condition in AbstractSparkSQLParser.parse
 --

 Key: SPARK-8628
 URL: https://issues.apache.org/jira/browse/SPARK-8628
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1, 1.4.0
Reporter: Santiago M. Mola
Priority: Critical
  Labels: regression
 Fix For: 1.5.0, 1.4.2


 SPARK-5009 introduced the following code in AbstractSparkSQLParser:
 {code}
 def parse(input: String): LogicalPlan = {
 // Initialize the Keywords.
 lexical.initialize(reservedWords)
 phrase(start)(new lexical.Scanner(input)) match {
   case Success(plan, _) => plan
   case failureOrError => sys.error(failureOrError.toString)
 }
   }
 {code}
 The corresponding initialize method in SqlLexical is not thread-safe:
 {code}
   /* This is a work around to support the lazy setting */
   def initialize(keywords: Seq[String]): Unit = {
 reserved.clear()
 reserved ++= keywords
   }
 {code}
 I'm hitting this when parsing multiple SQL queries concurrently. When one 
 query's parsing starts, it empties the reserved keyword list; then a 
 race condition occurs and other queries fail to parse because they recognize 
 keywords as identifiers.
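 
 A sketch of a concurrent workload that can hit the race, assuming the spark-shell's sqlContext and a registered "src" table (sqlContext.sql() goes through AbstractSparkSQLParser.parse):
 {code}
 import scala.concurrent.Future
 import scala.concurrent.ExecutionContext.Implicits.global

 val queries = Seq("SELECT key FROM src", "SELECT value FROM src WHERE key > 1")
 // Each parse() re-runs lexical.initialize(), clearing the shared reserved-keyword
 // set; a concurrently running parse may then treat keywords as identifiers.
 queries.foreach(q => Future(sqlContext.sql(q)))
 {code}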






[jira] [Commented] (SPARK-6602) Replace direct use of Akka with Spark RPC interface

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609481#comment-14609481
 ] 

Apache Spark commented on SPARK-6602:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/7141

 Replace direct use of Akka with Spark RPC interface
 ---

 Key: SPARK-6602
 URL: https://issues.apache.org/jira/browse/SPARK-6602
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Reynold Xin
Assignee: Shixiong Zhu
Priority: Critical
 Fix For: 1.5.0









[jira] [Resolved] (SPARK-8535) PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name

2015-06-30 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-8535.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 7124
[https://github.com/apache/spark/pull/7124]

 PySpark : Can't create DataFrame from Pandas dataframe with no explicit 
 column name
 ---

 Key: SPARK-8535
 URL: https://issues.apache.org/jira/browse/SPARK-8535
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Christophe Bourguignat
 Fix For: 1.5.0


 Trying to create a Spark DataFrame from a pandas dataframe with no explicit 
 column name : 
 pandasDF = pd.DataFrame([[1, 2], [5, 6]])
 sparkDF = sqlContext.createDataFrame(pandasDF)
 ***
  1 sparkDF = sqlContext.createDataFrame(pandasDF)
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/pyspark/sql/context.pyc 
 in createDataFrame(self, data, schema, samplingRatio)
 344 
 345 jrdd = 
 self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
 -- 346 df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), 
 schema.json())
 347 return DataFrame(df, self)
 348 
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
  in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 -- 538 self.target_id, self.name)
 539 
 540 for temp_arg in temp_args:
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 -- 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o87.applySchemaToPythonRDD.






[jira] [Comment Edited] (SPARK-6990) Add Java linting script

2015-06-30 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609531#comment-14609531
 ] 

Yu Ishikawa edited comment on SPARK-6990 at 7/1/15 3:59 AM:


I think it would be nice to execute {{mvn checkstyle:checkstyle}} with the 
checkstyle maven plugin.
What do you think about that? And do you have any good idea to realize the 
linter?


was (Author: yuu.ishik...@gmail.com):
I think it would be nice to execute `mvn checkstyle: checkstyle` with the 
checkstyle maven plugin.
What do you think about that? And do you have any good idea to realize the 
linter?

 Add Java linting script
 ---

 Key: SPARK-6990
 URL: https://issues.apache.org/jira/browse/SPARK-6990
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Josh Rosen
Priority: Minor
  Labels: starter

 It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
 Spark's Java code.






[jira] [Commented] (SPARK-8742) Improve SparkR error messages for DataFrame API

2015-06-30 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609558#comment-14609558
 ] 

Shivaram Venkataraman commented on SPARK-8742:
--

Thanks [~falaki] for creating this. This is a pretty important issue and I 
think there might be a bunch of things to improve here.

I think the most important thing is to filter out the Netty stack trace that 
comes from the RBackend handler. Typically the netty server throws an error 
when some other Java function call has failed and the error is rarely in the 
Netty call itself.  One way to do this might be to return a string message 
that encodes part of the actual exception when the return status is non-zero. 


 Improve SparkR error messages for DataFrame API
 ---

 Key: SPARK-8742
 URL: https://issues.apache.org/jira/browse/SPARK-8742
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.1
Reporter: Hossein Falaki
Priority: Blocker

 Currently all DataFrame API errors result in the following generic error:
 {code}
 Error: returnStatus == 0 is not TRUE
 {code}
 This is because invokeJava in backend.R does not inspect error messages. For 
 most use cases it is critical to return better error messages. Initially, we 
 can return the stack trace from the JVM. In the future we can inspect the errors 
 and translate them into human-readable error messages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609563#comment-14609563
 ] 

Apache Spark commented on SPARK-8746:
-

User 'ckadner' has created a pull request for this issue:
https://github.com/apache/spark/pull/7144

 Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
 --

 Key: SPARK-8746
 URL: https://issues.apache.org/jira/browse/SPARK-8746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Christian Kadner
Priority: Trivial
  Labels: documentation, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) 
 describes how to generate golden answer files for new hive comparison test 
 cases. However the download link for the Hive 0.13.1 jars points to 
 https://hive.apache.org/downloads.html but none of the linked mirror sites 
 still has the 0.13.1 version.
 We need to update the link to 
 https://archive.apache.org/dist/hive/hive-0.13.1/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8748) Move castability test out from Cast case class into Cast object

2015-06-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8748:
--

 Summary: Move castability test out from Cast case class into Cast 
object
 Key: SPARK-8748
 URL: https://issues.apache.org/jira/browse/SPARK-8748
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


So we can use it as static methods in the analyzer.
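
A minimal, self-contained Scala sketch of the idea (type names and cast rules below are simplified stand-ins, not Spark's actual implementation): the castability check lives on an object, so an analyzer rule can call it without constructing a Cast expression.

{code}
// Hedged sketch only: simplified stand-in types and rules, not Spark's Cast.
sealed trait SqlType
case object IntT    extends SqlType
case object DoubleT extends SqlType
case object StringT extends SqlType

object CastUtil {
  // Static-style castability check usable from analyzer rules.
  def canCast(from: SqlType, to: SqlType): Boolean = (from, to) match {
    case (f, t) if f == t => true
    case (_, StringT)     => true   // anything can be cast to string
    case (IntT, DoubleT)  => true   // widening numeric cast
    case _                => false
  }
}

// usage, e.g. inside a hypothetical analyzer rule:
// if (CastUtil.canCast(IntT, DoubleT)) { /* insert an implicit cast */ }
{code}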




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-06-30 Thread venu k tangirala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609569#comment-14609569
 ] 

venu k tangirala edited comment on SPARK-6101 at 7/1/15 5:21 AM:
-

Hi Chris, does this include writing back to dynamoDB ? 
Is someone working on this?
Does this work in pyspark too?


was (Author: venuktan):
Hi Chris, does this include writing back to dynamoDB ? 
Is someone working on this?

 Create a SparkSQL DataSource API implementation for DynamoDB
 

 Key: SPARK-6101
 URL: https://issues.apache.org/jira/browse/SPARK-6101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.5.0


 similar to https://github.com/databricks/spark-avro  and 
 https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-06-30 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609579#comment-14609579
 ] 

Davies Liu commented on SPARK-8653:
---

[~rxin] With the new `ExpectsInputTypes`, we still need a way to tell how to do 
the conversion; it's ugly to do the type switch in eval() or codegen().

Maybe we could improve `AutoCastInputType` to have a method `acceptedTypes`, 
which returns a list of lists of data types, specifying which types can be cast 
into the expected types. By default, it would accept all types that can be 
cast to the expected types. 
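
A rough, self-contained Scala sketch of what such an `acceptedTypes` hook could look like; the names and the castability rule below are hypothetical, not an existing Spark API.

{code}
// Hedged sketch only: stand-in types and a toy cast rule for illustration.
sealed trait DataType
case object IntegerType extends DataType
case object DoubleType  extends DataType
case object StringType  extends DataType

trait AutoCastInputType {
  def expectedTypes: Seq[DataType]

  // Default: for each expected type, accept every type that canCast says is
  // convertible; concrete expressions could override with a narrower list.
  def acceptedTypes: Seq[Seq[DataType]] = {
    val allTypes = Seq(IntegerType, DoubleType, StringType)
    expectedTypes.map(expected => allTypes.filter(from => canCast(from, expected)))
  }

  // Stand-in castability rule for the sketch only.
  protected def canCast(from: DataType, to: DataType): Boolean =
    from == to || to == StringType || (from == IntegerType && to == DoubleType)
}
{code}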

 Add constraint for Children expression for data type
 

 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao

 Currently, we have traits in Expression like `ExpectsInputTypes` and also 
 `checkInputDataTypes`, but we cannot convert the children expressions 
 automatically unless we write new rules in `HiveTypeCoercion`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8647) Potential issues with the constant hashCode

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609580#comment-14609580
 ] 

Apache Spark commented on SPARK-8647:
-

User 'aloknsingh' has created a pull request for this issue:
https://github.com/apache/spark/pull/7146

 Potential issues with the constant hashCode 
 

 Key: SPARK-8647
 URL: https://issues.apache.org/jira/browse/SPARK-8647
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alok Singh
Priority: Minor
  Labels: performance

 Hi,
 This may be a potential bug, a performance issue, or just missing code docs.
 The issue is with the MatrixUDT class, if we decide to put instances of MatrixUDT 
 into a hash-based collection.
 The hashCode function returns a constant, and even though the equals method is 
 consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a 
 constant) has been used.
 I was expecting it to be similar to the other matrix classes or the vector 
 class.
 If there is a reason why we have this code, we should document it properly 
 in the code so that others reading it are fine.
 regards,
 Alok
 Details
 =
 a)
 In reference to the file 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
 lines 188-197, i.e.
  override def equals(o: Any): Boolean = {
    o match {
      case v: MatrixUDT => true
      case _ => false
    }
  }
  override def hashCode(): Int = 1994
 b) the commit is 
 https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436
 on March 20.
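
For illustration, a hedged Scala sketch of the kind of alternative the reporter seems to expect (this is not the actual MLlib code): deriving the hash from the class keeps it consistent with an equals() that only checks the type, without hard-coding a literal.

{code}
// Hedged sketch: MatrixUDTExample is a stand-in class, not Spark's MatrixUDT.
class MatrixUDTExample {
  override def equals(o: Any): Boolean = o match {
    case _: MatrixUDTExample => true   // all instances are considered equal
    case _                   => false
  }
  // Consistent with equals above, but derived from the class instead of 1994.
  override def hashCode(): Int = classOf[MatrixUDTExample].getName.hashCode()
}
{code}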



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8647) Potential issues with the constant hashCode

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8647:
---

Assignee: (was: Apache Spark)

 Potential issues with the constant hashCode 
 

 Key: SPARK-8647
 URL: https://issues.apache.org/jira/browse/SPARK-8647
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alok Singh
Priority: Minor
  Labels: performance

 Hi,
 This may be a potential bug, a performance issue, or just missing code docs.
 The issue is with the MatrixUDT class, if we decide to put instances of MatrixUDT 
 into a hash-based collection.
 The hashCode function returns a constant, and even though the equals method is 
 consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a 
 constant) has been used.
 I was expecting it to be similar to the other matrix classes or the vector 
 class.
 If there is a reason why we have this code, we should document it properly 
 in the code so that others reading it are fine.
 regards,
 Alok
 Details
 =
 a)
 In reference to the file 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
 lines 188-197, i.e.
  override def equals(o: Any): Boolean = {
    o match {
      case v: MatrixUDT => true
      case _ => false
    }
  }
  override def hashCode(): Int = 1994
 b) the commit is 
 https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436
 on March 20.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8647) Potential issues with the constant hashCode

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8647:
---

Assignee: Apache Spark

 Potential issues with the constant hashCode 
 

 Key: SPARK-8647
 URL: https://issues.apache.org/jira/browse/SPARK-8647
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Alok Singh
Assignee: Apache Spark
Priority: Minor
  Labels: performance

 Hi,
 This may be a potential bug, a performance issue, or just missing code docs.
 The issue is with the MatrixUDT class, if we decide to put instances of MatrixUDT 
 into a hash-based collection.
 The hashCode function returns a constant, and even though the equals method is 
 consistent with hashCode, I don't see the reason why hashCode() = 1994 (i.e. a 
 constant) has been used.
 I was expecting it to be similar to the other matrix classes or the vector 
 class.
 If there is a reason why we have this code, we should document it properly 
 in the code so that others reading it are fine.
 regards,
 Alok
 Details
 =
 a)
 In reference to the file 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
 lines 188-197, i.e.
  override def equals(o: Any): Boolean = {
    o match {
      case v: MatrixUDT => true
      case _ => false
    }
  }
  override def hashCode(): Int = 1994
 b) the commit is 
 https://github.com/apache/spark/commit/11e025956be3818c00effef0d650734f8feeb436
 on March 20.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only

2015-06-30 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608169#comment-14608169
 ] 

Antony Mayi commented on SPARK-8708:


The real case is about 13M users, a few hundred products and about 500 
partitions. The rdd returned by .predictAll() utilizes a single partition, as in 
my example (btw. why do you say I have one partition in my toy example? It is 
using 5 partitions, all of them utilized before it comes to ALS - to me it 
replicates the real issue I am facing).

 MatrixFactorizationModel.predictAll() populates single partition only
 -

 Key: SPARK-8708
 URL: https://issues.apache.org/jira/browse/SPARK-8708
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Antony Mayi

 When using mllib.recommendation.ALS the RDD returned by .predictAll() has all 
 values pushed into single partition despite using quite high parallelism.
 This degrades performance of further processing (I can obviously run 
 .partitionBy() to balance it, but that's still too costly, e.g. if running 
 .predictAll() in a loop for thousands of products), and it should be possible to 
 handle this somehow on the model (automatically).
 Below is an example on a tiny sample (same on a large dataset):
 {code:title=pyspark}
  r1 = (1, 1, 1.0)
  r2 = (1, 2, 2.0)
  r3 = (2, 1, 2.0)
  r4 = (2, 2, 2.0)
  r5 = (3, 1, 1.0)
  ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
  ratings.getNumPartitions()
 5
  users = ratings.map(itemgetter(0)).distinct()
  model = ALS.trainImplicit(ratings, 1, seed=10)
  predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
  predictions_for_2.glom().map(len).collect()
 [0, 0, 3, 0, 0]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8044) Avoid to use directMemory while put or get disk level block from file

2015-06-30 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8044.
--
Resolution: Won't Fix

 Avoid to use directMemory while put or get disk level block from file
 -

 Key: SPARK-8044
 URL: https://issues.apache.org/jira/browse/SPARK-8044
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.3.1
Reporter: SuYan
Priority: Critical

 1. I found that if we use getChannel to put or get data, it will create a 
 DirectBuffer anyway, which is not controllable.
 According to the OpenJDK source code: it creates a ThreadLocal 
 direct-buffer pool and does not provide a 100% reliable way to ensure the direct 
 buffer is released; it stays cached in the pool.
 {code}
  sun.nio.ch.FileChannelImpl.java
 public int write(ByteBuffer src) throws IOException {
 210 ensureOpen();
 211 if (!writable)
 212 throw new NonWritableChannelException();
 213 synchronized (positionLock) {
 214 int n = 0;
 215 int ti = -1;
 216 try {
 217 begin();
 218 if (!isOpen())
 219 return 0;
 220 ti = threads.add();
 221 if (appending)
 222 position(size());
 223 do {
 224 n = IOUtil.write(fd, src, -1, nd, positionLock);
 225 } while ((n == IOStatus.INTERRUPTED) && isOpen());
 226 return IOStatus.normalize(n);
 227 } finally {
 228 threads.remove(ti);
 229 end(n > 0);
 230 assert IOStatus.check(n);
 231 }
 232 }
 233 }
 {code}
 {code}
 IOUtil.java
 static int write(FileDescriptor fd, ByteBuffer src, long position,
 74  NativeDispatcher nd, Object lock)
 75 throws IOException
 76 {
 77 if (src instanceof DirectBuffer)
 78 return writeFromNativeBuffer(fd, src, position, nd, lock);
 79 
 80 // Substitute a native buffer
 81 int pos = src.position();
 82 int lim = src.limit();
 83 assert (pos <= lim);
 84 int rem = (pos <= lim ? lim - pos : 0);
 85 ByteBuffer bb = null;
 86 try {
 87 bb = Util.getTemporaryDirectBuffer(rem);
 88 bb.put(src);
 89 bb.flip();
 90 // Do not update src until we see how many bytes were written
 91 src.position(pos);
 92 
 93 int n = writeFromNativeBuffer(fd, bb, position, nd, lock);
 94 if (n > 0) {
 95 // now update src
 96 src.position(pos + n);
 97 }
 98 return n;
 99 } finally {
 100Util.releaseTemporaryDirectBuffer(bb);
 101}
 102}
 {code}
 {code}
 Util.java
  static ByteBuffer getTemporaryDirectBuffer(int size) {
 61 ByteBuffer buf = null;
 62 // Grab a buffer if available
 63 for (int i=0; i<TEMP_BUF_POOL_SIZE; i++) {
 64 SoftReference ref = (SoftReference)(bufferPool[i].get());
 65 if ((ref != null) && ((buf = (ByteBuffer)ref.get()) != null) &&
 66 (buf.capacity() >= size)) {
 67 buf.rewind();
 68 buf.limit(size);
 69 bufferPool[i].set(null);
 70 return buf;
 71 }
 72 }
 73 
 74 // Make a new one
 75 return ByteBuffer.allocateDirect(size);
 76 }
 {code}
 {code}
  private static final int TEMP_BUF_POOL_SIZE = 3;
 50 
 51 // Per-thread soft cache of the last temporary direct buffer
 52 private static ThreadLocal[] bufferPool;
 53 
 54 static {
 55 bufferPool = new ThreadLocal[TEMP_BUF_POOL_SIZE];
 56 for (int i=0; i<TEMP_BUF_POOL_SIZE; i++)
 57 bufferPool[i] = new ThreadLocal();
 58 }
 59 
 60 static ByteBuffer getTemporaryDirectBuffer(int size) {
 61 ByteBuffer buf = null;
 62 // Grab a buffer if available
 63 for (int i=0; i<TEMP_BUF_POOL_SIZE; i++) {
 64 SoftReference ref = (SoftReference)(bufferPool[i].get());
 65 if ((ref != null) && ((buf = (ByteBuffer)ref.get()) != null) &&
 66 (buf.capacity() >= size)) {
 67 buf.rewind();
 68 buf.limit(size);
 69 bufferPool[i].set(null);
 70 return buf;
 71 }
 72 }
 73 
 74 // Make a new one
 75 return ByteBuffer.allocateDirect(size);
 76 }
 77 
 78 static void releaseTemporaryDirectBuffer(ByteBuffer buf) {
 79 if (buf == null)
 80 return;
 81 // Put it in an empty slot if such 

[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only

2015-06-30 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608214#comment-14608214
 ] 

Antony Mayi commented on SPARK-8708:


ok, a more detailed example showing there really are 5 partitions used in this 
case, but eventually .predictAll() pushes everything to just one. This is 
exactly what I am seeing in production - out of 500 partitions a single one gets 
all the millions of predictions in it; all the other partitions are empty.

{code}
>>> from operator import itemgetter
>>> from pyspark.mllib.recommendation import ALS
>>> from pyspark import SparkConf
>>> sconf = SparkConf()
>>> sconf.get('spark.default.parallelism')
u'5'
>>> r1 = (1, 1, 1.0)
>>> r2 = (1, 2, 2.0)
>>> r3 = (2, 1, 2.0)
>>> r4 = (2, 2, 2.0)
>>> r5 = (3, 1, 1.0)
>>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
>>> ratings.glom().map(len).collect()
[1, 1, 1, 1, 1]
>>> users = ratings.map(itemgetter(0)).distinct()
>>> users.glom().map(len).collect()
[0, 1, 1, 1, 0]
>>> model = ALS.trainImplicit(ratings, 1, seed=10)
>>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
>>> predictions_for_2.glom().map(len).collect()
[0, 0, 3, 0, 0]
{code}
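
As an interim workaround (shown here with the Scala API; the PySpark equivalent would be a repartition() on the returned RDD), the skewed output can be explicitly redistributed. This is only a sketch of the workaround, not a fix for the underlying skew, and the rank/iteration values are illustrative only.

{code}
// Hedged Scala sketch, assuming a spark-shell `sc`: rebalance the prediction RDD.
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 1, 1.0), Rating(1, 2, 2.0), Rating(2, 1, 2.0),
  Rating(2, 2, 2.0), Rating(3, 1, 1.0)), 5)
val model = ALS.trainImplicit(ratings, 1, 10)
val users = ratings.map(_.user).distinct()
val predictions = model
  .predict(users.map(u => (u, 2)))
  .repartition(sc.defaultParallelism)   // spread the skewed result back out

predictions.glom().map(_.length).collect()   // partitions are now balanced
{code}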

 MatrixFactorizationModel.predictAll() populates single partition only
 -

 Key: SPARK-8708
 URL: https://issues.apache.org/jira/browse/SPARK-8708
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Antony Mayi

 When using mllib.recommendation.ALS the RDD returned by .predictAll() has all 
 values pushed into single partition despite using quite high parallelism.
 This degrades performance of further processing (I can obviously run 
 .partitionBy() to balance it, but that's still too costly, e.g. if running 
 .predictAll() in a loop for thousands of products), and it should be possible to 
 handle this somehow on the model (automatically).
 Below is an example on a tiny sample (same on a large dataset):
 {code:title=pyspark}
  r1 = (1, 1, 1.0)
  r2 = (1, 2, 2.0)
  r3 = (2, 1, 2.0)
  r4 = (2, 2, 2.0)
  r5 = (3, 1, 1.0)
  ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
  ratings.getNumPartitions()
 5
  users = ratings.map(itemgetter(0)).distinct()
  model = ALS.trainImplicit(ratings, 1, seed=10)
  predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
  predictions_for_2.glom().map(len).collect()
 [0, 0, 3, 0, 0]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only

2015-06-30 Thread Antony Mayi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608169#comment-14608169
 ] 

Antony Mayi edited comment on SPARK-8708 at 6/30/15 11:55 AM:
--

The real case is about 13M users, a few hundred products and about 500 
partitions. The rdd returned by .predictAll() utilizes a single partition, as in 
my example (btw. why do you say I have one partition in my toy example? It is 
using 5 partitions, all of them utilized before it comes to ALS - to me it 
replicates the real issue I am facing).


was (Author: antonymayi):
The real case is about 13M of users, few hundreds of products and about 500 
partitions. The rdd returned by .predictAll() utilizes single partition as in 
my example (btw. why do you say I have one partition in my toy example? It is 
using 5 partitions, all of them utilized before it comes to ALS - to me it 
replicate the real issue I am facing).

 MatrixFactorizationModel.predictAll() populates single partition only
 -

 Key: SPARK-8708
 URL: https://issues.apache.org/jira/browse/SPARK-8708
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Antony Mayi

 When using mllib.recommendation.ALS the RDD returned by .predictAll() has all 
 values pushed into single partition despite using quite high parallelism.
 This degrades performance of further processing (I can obviously run 
 .partitionBy() to balance it, but that's still too costly, e.g. if running 
 .predictAll() in a loop for thousands of products), and it should be possible to 
 handle this somehow on the model (automatically).
 Below is an example on a tiny sample (same on a large dataset):
 {code:title=pyspark}
  r1 = (1, 1, 1.0)
  r2 = (1, 2, 2.0)
  r3 = (2, 1, 2.0)
  r4 = (2, 2, 2.0)
  r5 = (3, 1, 1.0)
  ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
  ratings.getNumPartitions()
 5
  users = ratings.map(itemgetter(0)).distinct()
  model = ALS.trainImplicit(ratings, 1, seed=10)
  predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
  predictions_for_2.glom().map(len).collect()
 [0, 0, 3, 0, 0]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8708) MatrixFactorizationModel.predictAll() populates single partition only

2015-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608194#comment-14608194
 ] 

Sean Owen commented on SPARK-8708:
--

Does that actually make 5 partitions? I see that's what's requested, but are 
the items evenly distributed?
The computation doesn't use 1 partition, so the question is why the result 
would have 1 partition. It might if you have a small number of products that 
all get into one partition for whatever reason, I think, since the final join 
is on product. I have one more idea on the PR ...

 MatrixFactorizationModel.predictAll() populates single partition only
 -

 Key: SPARK-8708
 URL: https://issues.apache.org/jira/browse/SPARK-8708
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Antony Mayi

 When using mllib.recommendation.ALS the RDD returned by .predictAll() has all 
 values pushed into single partition despite using quite high parallelism.
 This degrades performance of further processing (I can obviously run 
 .partitionBy() to balance it, but that's still too costly, e.g. if running 
 .predictAll() in a loop for thousands of products), and it should be possible to 
 handle this somehow on the model (automatically).
 Below is an example on a tiny sample (same on a large dataset):
 {code:title=pyspark}
  r1 = (1, 1, 1.0)
  r2 = (1, 2, 2.0)
  r3 = (2, 1, 2.0)
  r4 = (2, 2, 2.0)
  r5 = (3, 1, 1.0)
  ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
  ratings.getNumPartitions()
 5
  users = ratings.map(itemgetter(0)).distinct()
  model = ALS.trainImplicit(ratings, 1, seed=10)
  predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
  predictions_for_2.glom().map(len).collect()
 [0, 0, 3, 0, 0]
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8437) Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608174#comment-14608174
 ] 

Apache Spark commented on SPARK-8437:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7126

 Using directory path without wildcard for filename slow for large number of 
 files with wholeTextFiles and binaryFiles
 -

 Key: SPARK-8437
 URL: https://issues.apache.org/jira/browse/SPARK-8437
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 1.3.1, 1.4.0
 Environment: Ubuntu 15.04 + local filesystem
 Amazon EMR + S3 + HDFS
Reporter: Ewan Leith
Assignee: Sean Owen
Priority: Minor

 When calling wholeTextFiles or binaryFiles with a directory path with 10,000s 
 of files in it, Spark hangs for a few minutes before processing the files.
 If you add a * to the end of the path, there is no delay.
 This happens for me on Spark 1.3.1 and 1.4 on the local filesystem, HDFS, and 
 on S3.
 To reproduce, create a directory with 50,000 files in it, then run:
 val a = sc.binaryFiles("file:/path/to/files/")
 a.count()
 val b = sc.binaryFiles("file:/path/to/files/*")
 b.count()
 and monitor the different startup times.
 For example, in the spark-shell these commands are pasted in together, so the 
 delay at f.count() is from 10:11:08 to 10:13:29 to output "Total input paths 
 to process : 4", then until 10:15:42 to begin processing files:
 scala> val f = sc.binaryFiles("file:/home/ewan/large/")
 15/06/18 10:11:07 INFO MemoryStore: ensureFreeSpace(160616) called with 
 curMem=0, maxMem=278019440
 15/06/18 10:11:07 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 156.9 KB, free 265.0 MB)
 15/06/18 10:11:08 INFO MemoryStore: ensureFreeSpace(17282) called with 
 curMem=160616, maxMem=278019440
 15/06/18 10:11:08 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.9 KB, free 265.0 MB)
 15/06/18 10:11:08 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:40430 (size: 16.9 KB, free: 265.1 MB)
 15/06/18 10:11:08 INFO SparkContext: Created broadcast 0 from binaryFiles at 
 console:21
 f: org.apache.spark.rdd.RDD[(String, 
 org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/ 
 BinaryFileRDD[0] at binaryFiles at console:21
 scala> f.count()
 15/06/18 10:13:29 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:15:42 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:15:42 INFO CombineFileInputFormat: DEBUG: Terminated node 
 allocation with : CompletedNodes: 1, size left: 0
 15/06/18 10:15:42 INFO SparkContext: Starting job: count at console:24
 15/06/18 10:15:42 INFO DAGScheduler: Got job 0 (count at console:24) with 
 4 output partitions (allowLocal=false)
 15/06/18 10:15:42 INFO DAGScheduler: Final stage: ResultStage 0(count at 
 console:24)
 15/06/18 10:15:42 INFO DAGScheduler: Parents of final stage: List()
 Adding a * to the end of the path removes the delay:
 scala> val f = sc.binaryFiles("file:/home/ewan/large/*")
 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(160616) called with 
 curMem=0, maxMem=278019440
 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0 stored as values in 
 memory (estimated size 156.9 KB, free 265.0 MB)
 15/06/18 10:08:29 INFO MemoryStore: ensureFreeSpace(17309) called with 
 curMem=160616, maxMem=278019440
 15/06/18 10:08:29 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes 
 in memory (estimated size 16.9 KB, free 265.0 MB)
 15/06/18 10:08:29 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory 
 on localhost:42825 (size: 16.9 KB, free: 265.1 MB)
 15/06/18 10:08:29 INFO SparkContext: Created broadcast 0 from binaryFiles at 
 console:21
 f: org.apache.spark.rdd.RDD[(String, 
 org.apache.spark.input.PortableDataStream)] = file:/home/ewan/large/* 
 BinaryFileRDD[0] at binaryFiles at console:21
 scala> f.count()
 15/06/18 10:08:32 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:08:33 INFO FileInputFormat: Total input paths to process : 4
 15/06/18 10:08:35 INFO CombineFileInputFormat: DEBUG: Terminated node 
 allocation with : CompletedNodes: 1, size left: 0
 15/06/18 10:08:35 INFO SparkContext: Starting job: count at console:24
 15/06/18 10:08:35 INFO DAGScheduler: Got job 0 (count at console:24) with 
 4 output partitions 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608242#comment-14608242
 ] 

Apache Spark commented on SPARK-8707:
-

User 'navis' has created a pull request for this issue:
https://github.com/apache/spark/pull/7127

 RDD#toDebugString fails if any cached RDD has invalid partitions
 

 Key: SPARK-8707
 URL: https://issues.apache.org/jira/browse/SPARK-8707
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0, 1.4.1
Reporter: Aaron Davidson
  Labels: starter

 Repro:
 {code}
 sc.textFile("/ThisFileDoesNotExist").cache()
 sc.parallelize(0 until 100).toDebugString
 {code}
 Output:
 {code}
 java.io.IOException: Not a file: /ThisFileDoesNotExist
   at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
   at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
   at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
   at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637
 {code}
 This is because toDebugString gets all the partitions from all RDDs, which 
 fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be 
 resilient to other RDDs being invalid (and getRDDStorageInfo should probably 
 also be).
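
A hedged Scala sketch of the kind of guard being asked for (not the actual change in any PR): when gathering storage info, skip RDDs whose partitions cannot be computed instead of letting the exception escape.

{code}
// Hedged sketch: tolerate RDDs whose getPartitions throws (e.g. missing path).
import scala.util.Try
import org.apache.spark.rdd.RDD

def safeNumPartitions(rdd: RDD[_]): Option[Int] =
  Try(rdd.partitions.length).toOption   // None when partition resolution fails
{code}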



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7820) Java8-tests suite compile error under SBT

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7820:
---

Assignee: (was: Apache Spark)

 Java8-tests suite compile error under SBT
 -

 Key: SPARK-7820
 URL: https://issues.apache.org/jira/browse/SPARK-7820
 Project: Spark
  Issue Type: Bug
  Components: Build, Streaming
Affects Versions: 1.4.0
Reporter: Saisai Shao
Priority: Critical

 Lots of compilation errors are shown when the Java 8 test suite is enabled in SBT:
 {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 
 -Dhadoop.version=2.6.0 -Pjava8-tests}}
 {code}
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43:
  error: cannot find symbol
 [error] public class Java8APISuite extends LocalJavaStreamingContext 
 implements Serializable {
 [error]^
 [error]   symbol: class LocalJavaStreamingContext
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55:
  error: cannot find symbol
 [error] JavaDStream<String> stream = 
 JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
 [error]  ^
 [error]   symbol:   variable ssc
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55:
  error: cannot find symbol
 [error] JavaDStream<String> stream = 
 JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
 [error]  ^
 [error]   symbol:   variable JavaTestUtils
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57:
  error: cannot find symbol
 [error] JavaTestUtils.attachTestOutputStream(letterCount);
 [error] ^
 [error]   symbol:   variable JavaTestUtils
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58:
  error: cannot find symbol
 [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2);
 [error]   ^
 [error]   symbol:   variable ssc
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58:
  error: cannot find symbol
 [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2);
 [error]  ^
 [error]   symbol:   variable JavaTestUtils
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73:
  error: cannot find symbol
 [error] JavaDStream<String> stream = 
 JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
 [error]  ^
 [error]   symbol:   variable ssc
 [error]   location: class Java8APISuite
 {code}
 The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}}, which 
 exists in the streaming test jar. It is OK for the Maven compile, since Maven will 
 generate the test jar, but it fails in the sbt test compile because sbt does not 
 generate a test jar by default.
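
A hedged sbt sketch of the usual remedy (module names here are illustrative, not Spark's actual build definition): let the java8-tests project depend on streaming's test classes so LocalJavaStreamingContext and JavaTestUtils are on its test classpath, mirroring what the Maven test-jar provides.

{code}
// Hedged sbt sketch only; project names and paths are assumptions.
lazy val streaming = project in file("streaming")

lazy val java8Tests = (project in file("extras/java8-tests"))
  .dependsOn(streaming % "test->test")   // pull in streaming's test classes
{code}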



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7820) Java8-tests suite compile error under SBT

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608063#comment-14608063
 ] 

Apache Spark commented on SPARK-7820:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/7120

 Java8-tests suite compile error under SBT
 -

 Key: SPARK-7820
 URL: https://issues.apache.org/jira/browse/SPARK-7820
 Project: Spark
  Issue Type: Bug
  Components: Build, Streaming
Affects Versions: 1.4.0
Reporter: Saisai Shao
Priority: Critical

 Lots of compilation errors are shown when the Java 8 test suite is enabled in SBT:
 {{JAVA_HOME=/usr/java/jdk1.8.0_45 ./sbt/sbt -Pyarn -Phadoop-2.4 
 -Dhadoop.version=2.6.0 -Pjava8-tests}}
 {code}
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:43:
  error: cannot find symbol
 [error] public class Java8APISuite extends LocalJavaStreamingContext 
 implements Serializable {
 [error]^
 [error]   symbol: class LocalJavaStreamingContext
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55:
  error: cannot find symbol
 [error] JavaDStream<String> stream = 
 JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
 [error]  ^
 [error]   symbol:   variable ssc
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:55:
  error: cannot find symbol
 [error] JavaDStream<String> stream = 
 JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
 [error]  ^
 [error]   symbol:   variable JavaTestUtils
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:57:
  error: cannot find symbol
 [error] JavaTestUtils.attachTestOutputStream(letterCount);
 [error] ^
 [error]   symbol:   variable JavaTestUtils
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58:
  error: cannot find symbol
 [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2);
 [error]   ^
 [error]   symbol:   variable ssc
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:58:
  error: cannot find symbol
 [error] List<List<Integer>> result = JavaTestUtils.runStreams(ssc, 2, 2);
 [error]  ^
 [error]   symbol:   variable JavaTestUtils
 [error]   location: class Java8APISuite
 [error] 
 /mnt/data/project/apache-spark/extras/java8-tests/src/test/java/org/apache/spark/streaming/Java8APISuite.java:73:
  error: cannot find symbol
 [error] JavaDStream<String> stream = 
 JavaTestUtils.attachTestInputStream(ssc, inputData, 1);
 [error]  ^
 [error]   symbol:   variable ssc
 [error]   location: class Java8APISuite
 {code}
 The class {{JavaAPISuite}} relies on {{LocalJavaStreamingContext}}, which 
 exists in the streaming test jar. It is OK for the Maven compile, since Maven will 
 generate the test jar, but it fails in the sbt test compile because sbt does not 
 generate a test jar by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8731) Beeline doesn't work with -e option when started in background

2015-06-30 Thread Wang Yiguang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608079#comment-14608079
 ] 

Wang Yiguang commented on SPARK-8731:
-

Here is more discussion about this issue
https://issues.apache.org/jira/browse/HIVE-6758

I looked into it a bit and will give more information later.

 Beeline doesn't work with -e option when started in background
 --

 Key: SPARK-8731
 URL: https://issues.apache.org/jira/browse/SPARK-8731
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Wang Yiguang
Priority: Minor

 Beeline stops when run in the background like this:
 beeline -e "some query" &
 It doesn't work even with the -f switch.
 For example:
 this works:
 beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;"
 however this does not:
 beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" &



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8731) Beeline doesn't work with -e option when started in background

2015-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608084#comment-14608084
 ] 

Sean Owen commented on SPARK-8731:
--

Is this Spark-specific or just about beeline?

 Beeline doesn't work with -e option when started in background
 --

 Key: SPARK-8731
 URL: https://issues.apache.org/jira/browse/SPARK-8731
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Wang Yiguang
Priority: Minor

 Beeline stops when run in the background like this:
 beeline -e "some query" &
 It doesn't work even with the -f switch.
 For example:
 this works:
 beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;"
 however this does not:
 beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" &



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8729) Spark app unable to instantiate the classes using the reflection

2015-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608087#comment-14608087
 ] 

Sean Owen commented on SPARK-8729:
--

[~kmurt...@gmail.com] This isn't really useful as you have no info about how 
you are deploying this. I'm going to close it unless you can provide something 
much more reproducible.

 Spark app unable to instantiate the classes using the reflection
 

 Key: SPARK-8729
 URL: https://issues.apache.org/jira/browse/SPARK-8729
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.3.0
Reporter: Murthy Chelankuri
Priority: Critical

 SPARK 1.3.0 is unable to instantiate classes using reflection (using 
 Class.forName). It says class not found even though the class is available in the 
 list of jars.
 The following is the exception I am getting on the executors:
 java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:264)
 at kafka.utils.Utils$.createObject(Utils.scala:438)
 at kafka.producer.Producer.<init>(Producer.scala:61)
 The application is working fine without any issues with the 1.2.0 version. 
 I am planning to upgrade to 1.3.0 and found that it is not working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8729) Spark app unable to instantiate the classes using the reflection

2015-06-30 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608111#comment-14608111
 ] 

Sean Owen commented on SPARK-8729:
--

It just sounds like one of your jars is not deployed in the right place. I 
don't think this code helps analyze that. You need to verify how you are 
shipping your app and that you package all necessary classes in your app and 
submit it through spark-submit.

 Spark app unable to instantiate the classes using the reflection
 

 Key: SPARK-8729
 URL: https://issues.apache.org/jira/browse/SPARK-8729
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.3.0
Reporter: Murthy Chelankuri
Priority: Critical

 SPARK 1.3.0 is unable to instantiate classes using reflection (using 
 Class.forName). It says class not found even though the class is available in the 
 list of jars.
 The following is the exception I am getting on the executors:
 java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:264)
 at kafka.utils.Utils$.createObject(Utils.scala:438)
 at kafka.producer.Producer.<init>(Producer.scala:61)
 The application is working fine without any issues with the 1.2.0 version. 
 I am planning to upgrade to 1.3.0 and found that it is not working.
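
A hedged Scala sketch of a common workaround in this kind of situation (assuming the class really is inside a jar shipped with the application): resolve user classes through the current thread's context classloader rather than the default Class.forName lookup, since an executor may load user jars in a child classloader that the default lookup does not see.

{code}
// Hedged sketch of the workaround, not a statement about the root cause here.
def loadUserClass(name: String): Class[_] = {
  val loader = Option(Thread.currentThread().getContextClassLoader)
    .getOrElse(getClass.getClassLoader)   // fall back to the defining loader
  Class.forName(name, true, loader)
}
{code}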



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8732) Compilation warning for existentials code

2015-06-30 Thread Tijo Thomas (JIRA)
Tijo Thomas created SPARK-8732:
--

 Summary: Compilation warning for existentials code
 Key: SPARK-8732
 URL: https://issues.apache.org/jira/browse/SPARK-8732
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Tijo Thomas
Priority: Trivial


Compilation warnings for Scala code that uses existential types in:
1. RBackendHandler.scala
2. CatalystTypeConverters.scala

The missing import needs to be added.
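
A minimal Scala sketch of the usual fix, assuming the warning is the standard existential-types feature warning: add the feature import to the affected files.

{code}
// Assumed fix: the feature import silences the existential-types warning.
import scala.language.existentials

// Example of a type that would otherwise trigger the warning.
val pair: (Class[T], T) forSome { type T } = (classOf[String], "example")
{code}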



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6735) Provide options to make maximum executor failure count ( which kills the application ) relative to a window duration or disable it.

2015-06-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608280#comment-14608280
 ] 

Thomas Graves commented on SPARK-6735:
--

A pull request was up, but I didn't have time to rework it to address some comments, 
so someone else is welcome to take this over.
https://github.com/apache/spark/pull/5449

 Provide options to make maximum executor failure count ( which kills the 
 application ) relative to a window duration or disable it.
 ---

 Key: SPARK-6735
 URL: https://issues.apache.org/jira/browse/SPARK-6735
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit, YARN
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Twinkle Sachdeva

 Currently there is a setting (spark.yarn.max.executor.failures) which sets the 
 maximum number of executor failures, after which the application fails.
 For long-running applications, a user may want not to kill the application 
 at all, or may want such a setting to be relative to a window duration. This 
 improvement is to provide options to make the maximum executor failure count 
 (which kills the application) relative to a window duration, or to disable it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8707:
---

Assignee: (was: Apache Spark)

 RDD#toDebugString fails if any cached RDD has invalid partitions
 

 Key: SPARK-8707
 URL: https://issues.apache.org/jira/browse/SPARK-8707
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0, 1.4.1
Reporter: Aaron Davidson
  Labels: starter

 Repro:
 {code}
 sc.textFile("/ThisFileDoesNotExist").cache()
 sc.parallelize(0 until 100).toDebugString
 {code}
 Output:
 {code}
 java.io.IOException: Not a file: /ThisFileDoesNotExist
   at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
   at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
   at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
   at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637
 {code}
 This is because toDebugString gets all the partitions from all RDDs, which 
 fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be 
 resilient to other RDDs being invalid (and getRDDStorageInfo should probably 
 also be).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8707) RDD#toDebugString fails if any cached RDD has invalid partitions

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8707:
---

Assignee: Apache Spark

 RDD#toDebugString fails if any cached RDD has invalid partitions
 

 Key: SPARK-8707
 URL: https://issues.apache.org/jira/browse/SPARK-8707
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0, 1.4.1
Reporter: Aaron Davidson
Assignee: Apache Spark
  Labels: starter

 Repro:
 {code}
 sc.textFile("/ThisFileDoesNotExist").cache()
 sc.parallelize(0 until 100).toDebugString
 {code}
 Output:
 {code}
 java.io.IOException: Not a file: /ThisFileDoesNotExist
   at 
 org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:215)
   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
   at org.apache.spark.storage.RDDInfo$.fromRdd(RDDInfo.scala:59)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 org.apache.spark.SparkContext$$anonfun$34.apply(SparkContext.scala:1455)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.MapLike$DefaultValuesIterable.foreach(MapLike.scala:206)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.SparkContext.getRDDStorageInfo(SparkContext.scala:1455)
   at org.apache.spark.rdd.RDD.debugSelf$1(RDD.scala:1573)
   at org.apache.spark.rdd.RDD.firstDebugString$1(RDD.scala:1607)
   at org.apache.spark.rdd.RDD.toDebugString(RDD.scala:1637
 {code}
 This is because toDebugString gets all the partitions from all RDDs, which 
 fails (via SparkContext#getRDDStorageInfo). This pathway should definitely be 
 resilient to other RDDs being invalid (and getRDDStorageInfo should probably 
 also be).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8729) Spark app unable to instantiate the classes using the reflection

2015-06-30 Thread Murthy Chelankuri (JIRA)
Murthy Chelankuri created SPARK-8729:


 Summary: Spark app unable to instantiate the classes using the 
reflection
 Key: SPARK-8729
 URL: https://issues.apache.org/jira/browse/SPARK-8729
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.3.0
Reporter: Murthy Chelankuri
Priority: Critical


SPARK 1.3.0 is unable to instantiate classes using reflection (using 
Class.forName). It says class not found even though the class is available in the 
list of jars.

The following is the exception I am getting on the executors:

java.lang.ClassNotFoundException: com.abc.mq.msg.ObjectEncoder
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at kafka.utils.Utils$.createObject(Utils.scala:438)
at kafka.producer.Producer.<init>(Producer.scala:61)

The application is working fine without any issues with the 1.2.0 version. 
I am planning to upgrade to 1.3.0 and found that it is not working.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8731) Beeline doesn't work with -e option when started in background

2015-06-30 Thread Wang Yiguang (JIRA)
Wang Yiguang created SPARK-8731:
---

 Summary: Beeline doesn't work with -e option when started in 
background
 Key: SPARK-8731
 URL: https://issues.apache.org/jira/browse/SPARK-8731
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Wang Yiguang
Priority: Minor


Beeline stops when run in the background like this:
beeline -e "some query" &
It doesn't work even with the -f switch.

For example:
this works:
beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;"

however this does not:
beeline -u jdbc:hive2://0.0.0.0:8000 -e "show databases;" &




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8730) Deser primitive class with Java serialization

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8730:
---

Assignee: Apache Spark

 Deser primitive class with Java serialization
 -

 Key: SPARK-8730
 URL: https://issues.apache.org/jira/browse/SPARK-8730
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Eugen Cepoi
Assignee: Apache Spark
Priority: Critical

 Objects that contain a primitive Class as a property cannot be deserialized 
 using the Java serde, because Class.forName does not work for primitives.
 Example object:
 class Foo extends Serializable {
   val intClass = classOf[Int]
 }
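
A minimal sketch of the underlying problem and the usual remedy (illustrative only, assumed names; not the patched Spark code): Class.forName("int") throws, so a deserializer has to map primitive type names to their classes explicitly before falling back to Class.forName when resolving classes.

{code}
object PrimitiveClassResolution {
  // classOf[Int], classOf[Boolean], etc. are the primitive classes (Integer.TYPE, ...).
  val primitiveClasses: Map[String, Class[_]] = Map(
    "boolean" -> classOf[Boolean], "byte" -> classOf[Byte], "char" -> classOf[Char],
    "short" -> classOf[Short], "int" -> classOf[Int], "long" -> classOf[Long],
    "float" -> classOf[Float], "double" -> classOf[Double], "void" -> classOf[Unit]
  )

  // Resolve primitives from the map; everything else goes through Class.forName as before.
  def resolveClass(name: String, loader: ClassLoader): Class[_] =
    primitiveClasses.getOrElse(name, Class.forName(name, false, loader))
}
{code}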



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8730) Deser primitive class with Java serialization

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8730:
---

Assignee: (was: Apache Spark)

 Deser primitive class with Java serialization
 -

 Key: SPARK-8730
 URL: https://issues.apache.org/jira/browse/SPARK-8730
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Eugen Cepoi
Priority: Critical

 Objects that contain a primitive Class as a property cannot be deserialized 
 using the Java serde, because Class.forName does not work for primitives.
 Example object:
 class Foo extends Serializable {
   val intClass = classOf[Int]
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8730) Deser primitive class with Java serialization

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608076#comment-14608076
 ] 

Apache Spark commented on SPARK-8730:
-

User 'EugenCepoi' has created a pull request for this issue:
https://github.com/apache/spark/pull/7122

 Deser primitive class with Java serialization
 -

 Key: SPARK-8730
 URL: https://issues.apache.org/jira/browse/SPARK-8730
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Eugen Cepoi
Priority: Critical

 Objects that contain a primitive Class as a property cannot be deserialized 
 using the Java serde, because Class.forName does not work for primitives.
 Example object:
 class Foo extends Serializable {
   val intClass = classOf[Int]
 }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6951) History server slow startup if the event log directory is large

2015-06-30 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608287#comment-14608287
 ] 

Thomas Graves commented on SPARK-6951:
--

This actually happens more than just at startup. If you have a large number of 
files, especially in-progress files, or even just large history files, it takes 
forever for the history server to pick up newly completed applications and show 
them on the UI.

 History server slow startup if the event log directory is large
 ---

 Key: SPARK-6951
 URL: https://issues.apache.org/jira/browse/SPARK-6951
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Matt Cheah

 I started my history server, then navigated to the web UI where I expected to 
 be able to view some completed applications, but the webpage was not 
 available. It turned out that the History Server was not finished parsing all 
 of the event logs in the event log directory that I had specified. I had 
 accumulated a lot of event logs from months of running Spark, so it would 
 have taken a very long time for the History Server to crunch through them 
 all. I purged the event log directory and started from scratch, and the UI 
 loaded immediately.
 We should have a pagination strategy or parse the directory lazily to avoid 
 needing to wait after starting the history server.
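
A minimal sketch of the lazy approach (assumed layout and names; not the History Server's actual implementation): enumerate the event-log directory cheaply at startup and defer the expensive replay of each log until that application's UI is first requested.

{code}
import java.io.File
import scala.collection.mutable

object LazyHistoryIndex {
  private val replayed = mutable.Map.empty[String, String]

  // Cheap metadata-only pass at startup: no event log is parsed here.
  def listEventLogs(dir: File): Seq[File] =
    Option(dir.listFiles()).getOrElse(Array.empty[File]).toSeq.sortBy(f => -f.lastModified())

  // The expensive replay happens only the first time a given application is requested.
  def appSummary(logFile: File): String =
    replayed.getOrElseUpdate(logFile.getPath, s"replayed ${logFile.getName}")
}
{code}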



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8464) Consider separating aggregator and non-aggregator paths in ExternalSorter

2015-06-30 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14608292#comment-14608292
 ] 

Ilya Ganelin commented on SPARK-8464:
-

Josh - I'd be happy to look into this, I'll submit a PR shortly.

 Consider separating aggregator and non-aggregator paths in ExternalSorter
 -

 Key: SPARK-8464
 URL: https://issues.apache.org/jira/browse/SPARK-8464
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Josh Rosen

 ExternalSorter is still really complicated and hard to understand.  We should 
 investigate whether separating the aggregator and non-aggregator paths into 
 separate files would make the code easier to understand without introducing 
 significant duplication.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0

2015-06-30 Thread Kamlesh Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609545#comment-14609545
 ] 

Kamlesh Kumar commented on SPARK-8699:
--

Thanks Shivaram, it works. 

 Select command not working for SparkR built on Spark Version: 1.4.0 and R 
 3.2.0
 ---

 Key: SPARK-8699
 URL: https://issues.apache.org/jira/browse/SPARK-8699
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows 7, 64 bit 
Reporter: Kamlesh Kumar
Priority: Critical
  Labels: test

 I can successfully run showDF and head on a SparkR DataFrame, but select 
 commands throw an unexpected error. 
 R console output after running a select command on the DataFrame:
 command:
 head(select(df, df$eruptions))
 output: 
 Error in head(select(df, df$eruptions)) : 
   error in evaluating the argument 'x' in selecting a method for function 
 'head': Error in UseMethod("select_") : 
   no applicable method for 'select_' applied to an object of class "DataFrame"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0

2015-06-30 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-8699.
--
Resolution: Not A Problem

 Select command not working for SparkR built on Spark Version: 1.4.0 and R 
 3.2.0
 ---

 Key: SPARK-8699
 URL: https://issues.apache.org/jira/browse/SPARK-8699
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows 7, 64 bit 
Reporter: Kamlesh Kumar
Priority: Critical
  Labels: test

 I can successfully run showDF and head on a SparkR DataFrame, but select 
 commands throw an unexpected error. 
 R console output after running a select command on the DataFrame:
 command:
 head(select(df, df$eruptions))
 output: 
 Error in head(select(df, df$eruptions)) : 
   error in evaluating the argument 'x' in selecting a method for function 
 'head': Error in UseMethod("select_") : 
   no applicable method for 'select_' applied to an object of class "DataFrame"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0

2015-06-30 Thread Kamlesh Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609545#comment-14609545
 ] 

Kamlesh Kumar edited comment on SPARK-8699 at 7/1/15 4:39 AM:
--

Thanks Shivaram, it works; some other package was masking the select command. 


was (Author: kamlesh.kumar):
Thanks Shivaram, it works. 

 Select command not working for SparkR built on Spark Version: 1.4.0 and R 
 3.2.0
 ---

 Key: SPARK-8699
 URL: https://issues.apache.org/jira/browse/SPARK-8699
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows 7, 64 bit 
Reporter: Kamlesh Kumar
Priority: Critical
  Labels: test

 I can successfully run showDF and head on a SparkR DataFrame, but select 
 commands throw an unexpected error. 
 R console output after running a select command on the DataFrame:
 command:
 head(select(df, df$eruptions))
 output: 
 Error in head(select(df, df$eruptions)) : 
   error in evaluating the argument 'x' in selecting a method for function 
 'head': Error in UseMethod("select_") : 
   no applicable method for 'select_' applied to an object of class "DataFrame"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8699) Select command not working for SparkR built on Spark Version: 1.4.0 and R 3.2.0

2015-06-30 Thread Kamlesh Kumar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamlesh Kumar updated SPARK-8699:
-
Priority: Trivial  (was: Critical)

 Select command not working for SparkR built on Spark Version: 1.4.0 and R 
 3.2.0
 ---

 Key: SPARK-8699
 URL: https://issues.apache.org/jira/browse/SPARK-8699
 Project: Spark
  Issue Type: Bug
  Components: R
Affects Versions: 1.4.0
 Environment: Windows 7, 64 bit 
Reporter: Kamlesh Kumar
Priority: Trivial
  Labels: test

 I can successfully run showDF and head on a SparkR DataFrame, but select 
 commands throw an unexpected error. 
 R console output after running a select command on the DataFrame:
 command:
 head(select(df, df$eruptions))
 output: 
 Error in head(select(df, df$eruptions)) : 
   error in evaluating the argument 'x' in selecting a method for function 
 'head': Error in UseMethod("select_") : 
   no applicable method for 'select_' applied to an object of class "DataFrame"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8747) fix EqualNullSafe for binary type

2015-06-30 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-8747:
--

 Summary: fix EqualNullSafe for binary type
 Key: SPARK-8747
 URL: https://issues.apache.org/jira/browse/SPARK-8747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8747) fix EqualNullSafe for binary type

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609562#comment-14609562
 ] 

Apache Spark commented on SPARK-8747:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7143

 fix EqualNullSafe for binary type
 -

 Key: SPARK-8747
 URL: https://issues.apache.org/jira/browse/SPARK-8747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8748) Move castability test out from Cast case class into Cast object

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8748:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Move castability test out from Cast case class into Cast object
 ---

 Key: SPARK-8748
 URL: https://issues.apache.org/jira/browse/SPARK-8748
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 So we can use it as static methods in the analyzer.
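
A minimal sketch of the refactor (illustrative only; not the actual Catalyst Cast code): a castability predicate defined on the companion object can be called statically from analyzer rules, without constructing a Cast instance first.

{code}
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType

case class Cast(from: DataType, to: DataType)

object Cast {
  // "Castability" lives on the companion object, so callers need no Cast instance.
  def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
    case (a, b) if a == b      => true
    case (IntType, StringType) => true
    case _                     => false
  }
}

// e.g. inside an analyzer rule, checked statically:
// if (Cast.canCast(childType, expectedType)) { /* insert a cast */ }
{code}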



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-06-30 Thread Murtaza Kanchwala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609598#comment-14609598
 ] 

Murtaza Kanchwala commented on SPARK-6101:
--

https://github.com/cfregly/spark-dynamodb

Read is implemented, but save is not.

I'd suggest using Amazon's DynamoDB Mapper.

 Create a SparkSQL DataSource API implementation for DynamoDB
 

 Key: SPARK-6101
 URL: https://issues.apache.org/jira/browse/SPARK-6101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.5.0


 similar to https://github.com/databricks/spark-avro  and 
 https://github.com/databricks/spark-csv
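
A minimal sketch of what such a data source looks like on the Spark 1.x DataSource API, mirroring spark-avro and spark-csv (skeleton only; the DynamoDB calls are stubbed, the "table" option and schema are assumptions, and this is not the spark-dynamodb project's code): a RelationProvider plus a scannable BaseRelation.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    DynamoDBRelation(parameters("table"))(sqlContext)   // "table" option is an assumption
}

case class DynamoDBRelation(table: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  // Fixed schema purely for illustration; a real source would derive it from the table.
  override def schema: StructType =
    StructType(Seq(StructField("key", StringType), StructField("value", StringType)))

  // Stubbed scan; a real implementation would issue DynamoDB scan/query requests here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("k1", "v1"), Row("k2", "v2")))
}
{code}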



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8535) PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name

2015-06-30 Thread Yuri Saito (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609552#comment-14609552
 ] 

Yuri Saito commented on SPARK-8535:
---

Could you change the assignee from unassigned to me?

 PySpark : Can't create DataFrame from Pandas dataframe with no explicit 
 column name
 ---

 Key: SPARK-8535
 URL: https://issues.apache.org/jira/browse/SPARK-8535
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Christophe Bourguignat
 Fix For: 1.5.0


 Trying to create a Spark DataFrame from a pandas dataframe with no explicit 
 column name : 
 pandasDF = pd.DataFrame([[1, 2], [5, 6]])
 sparkDF = sqlContext.createDataFrame(pandasDF)
 ***
  1 sparkDF = sqlContext.createDataFrame(pandasDF)
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/pyspark/sql/context.pyc 
 in createDataFrame(self, data, schema, samplingRatio)
 344 
 345 jrdd = 
 self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
 --> 346 df = self._ssql_ctx.applySchemaToPythonRDD(jrdd.rdd(), 
 schema.json())
 347 return DataFrame(df, self)
 348 
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py
  in __call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer, self.gateway_client,
 --> 538 self.target_id, self.name)
 539 
 540 for temp_arg in temp_args:
 /usr/local/Cellar/apache-spark/1.4.0/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py
  in get_return_value(answer, gateway_client, target_id, name)
 298 raise Py4JJavaError(
 299 'An error occurred while calling {0}{1}{2}.\n'.
 --> 300 format(target_id, '.', name), value)
 301 else:
 302 raise Py4JError(
 Py4JJavaError: An error occurred while calling o87.applySchemaToPythonRDD.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8746:
---

Assignee: (was: Apache Spark)

 Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
 --

 Key: SPARK-8746
 URL: https://issues.apache.org/jira/browse/SPARK-8746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Christian Kadner
Priority: Trivial
  Labels: documentation, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) 
 describes how to generate golden answer files for new hive comparison test 
 cases. However the download link for the Hive 0.13.1 jars points to 
 https://hive.apache.org/downloads.html but none of the linked mirror sites 
 still has the 0.13.1 version.
 We need to update the link to 
 https://archive.apache.org/dist/hive/hive-0.13.1/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8746:
---

Assignee: Apache Spark

 Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
 --

 Key: SPARK-8746
 URL: https://issues.apache.org/jira/browse/SPARK-8746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Christian Kadner
Assignee: Apache Spark
Priority: Trivial
  Labels: documentation, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) 
 describes how to generate golden answer files for new hive comparison test 
 cases. However the download link for the Hive 0.13.1 jars points to 
 https://hive.apache.org/downloads.html but none of the linked mirror sites 
 still has the 0.13.1 version.
 We need to update the link to 
 https://archive.apache.org/dist/hive/hive-0.13.1/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6101) Create a SparkSQL DataSource API implementation for DynamoDB

2015-06-30 Thread venu k tangirala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609569#comment-14609569
 ] 

venu k tangirala commented on SPARK-6101:
-

Hi Chris, does this include writing back to DynamoDB? 
Is someone working on this?

 Create a SparkSQL DataSource API implementation for DynamoDB
 

 Key: SPARK-6101
 URL: https://issues.apache.org/jira/browse/SPARK-6101
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.0
Reporter: Chris Fregly
Assignee: Chris Fregly
 Fix For: 1.5.0


 similar to https://github.com/databricks/spark-avro  and 
 https://github.com/databricks/spark-csv



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8748) Move castability test out from Cast case class into Cast object

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609576#comment-14609576
 ] 

Apache Spark commented on SPARK-8748:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7145

 Move castability test out from Cast case class into Cast object
 ---

 Key: SPARK-8748
 URL: https://issues.apache.org/jira/browse/SPARK-8748
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 So we can use it as static methods in the analyzer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8748) Move castability test out from Cast case class into Cast object

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8748:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Move castability test out from Cast case class into Cast object
 ---

 Key: SPARK-8748
 URL: https://issues.apache.org/jira/browse/SPARK-8748
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 So we can use it as static methods in the analyzer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8749) Remove HiveTypeCoercion trait

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8749:
---

Assignee: Reynold Xin  (was: Apache Spark)

 Remove HiveTypeCoercion trait
 -

 Key: SPARK-8749
 URL: https://issues.apache.org/jira/browse/SPARK-8749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 It is easier to test rules if they are in the companion object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8749) Remove HiveTypeCoercion trait

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8749:
---

Assignee: Apache Spark  (was: Reynold Xin)

 Remove HiveTypeCoercion trait
 -

 Key: SPARK-8749
 URL: https://issues.apache.org/jira/browse/SPARK-8749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 It is easier to test rules if they are in the companion object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8749) Remove HiveTypeCoercion trait

2015-06-30 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609594#comment-14609594
 ] 

Apache Spark commented on SPARK-8749:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7147

 Remove HiveTypeCoercion trait
 -

 Key: SPARK-8749
 URL: https://issues.apache.org/jira/browse/SPARK-8749
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 It is easier to test rules if they are in the companion object.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8750) Remove the closure in functions.callUdf

2015-06-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8750:
--

 Summary: Remove the closure in functions.callUdf
 Key: SPARK-8750
 URL: https://issues.apache.org/jira/browse/SPARK-8750
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


{code}
[warn] 
/Users/yhuai/Projects/Spark/yin-spark-1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala:1829:
 Class org.apache.spark.sql.functions$$anonfun$callUDF$1 differs only in case 
from org.apache.spark.sql.functions$$anonfun$callUdf$1. Such classes will 
overwrite one another on case-insensitive filesystems.
{code}
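
A minimal sketch of the class-name clash the warning describes (illustrative only; this is not the actual functions.scala code): in Scala 2.10/2.11, each closure compiles to an anonymous function class named after its enclosing method, so two methods whose names differ only in case emit .class files that collide on case-insensitive filesystems. Removing the closure from one of the methods avoids generating the conflicting class.

{code}
object Example {
  // Compiles to Example$$anonfun$callUDF$1.class (the anonymous function class).
  def callUDF(names: Seq[String]): Seq[Int] = names.map(n => n.length)

  // Compiles to Example$$anonfun$callUdf$1.class, which differs from the file above
  // only in case and therefore overwrites it on case-insensitive filesystems.
  def callUdf(names: Seq[String]): Seq[Int] = names.map(n => n.length)
}
{code}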





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6990) Add Java linting script

2015-06-30 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609531#comment-14609531
 ] 

Yu Ishikawa commented on SPARK-6990:


I think it would be nice to execute `mvn checkstyle:checkstyle` with the 
Checkstyle Maven plugin.
What do you think about that? And do you have any good ideas on how to implement 
the linter?

 Add Java linting script
 ---

 Key: SPARK-6990
 URL: https://issues.apache.org/jira/browse/SPARK-6990
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Josh Rosen
Priority: Minor
  Labels: starter

 It would be nice to add a {{dev/lint-java}} script to enforce style rules for 
 Spark's Java code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8745) Remove GenerateMutableProjection

2015-06-30 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8745:
--

 Summary: Remove GenerateMutableProjection
 Key: SPARK-8745
 URL: https://issues.apache.org/jira/browse/SPARK-8745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


Based on discussion offline with [~marmbrus], we should remove 
GenerateMutableProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)

2015-06-30 Thread Christian Kadner (JIRA)
Christian Kadner created SPARK-8746:
---

 Summary: Need to update download link for Hive 0.13.1 jars 
(HiveComparisonTest)
 Key: SPARK-8746
 URL: https://issues.apache.org/jira/browse/SPARK-8746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Christian Kadner
Priority: Trivial


The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) 
describes how to generate golden answer files for new hive comparison test 
cases. However the download link for the Hive 0.13.1 jars points to 
https://hive.apache.org/downloads.html but none of the linked mirror sites 
still has the 0.13.1 version.

We need to update the link to https://archive.apache.org/dist/hive/hive-0.13.1/




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8747) fix EqualNullSafe for binary type

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8747:
---

Assignee: Apache Spark

 fix EqualNullSafe for binary type
 -

 Key: SPARK-8747
 URL: https://issues.apache.org/jira/browse/SPARK-8747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Apache Spark
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8747) fix EqualNullSafe for binary type

2015-06-30 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8747:
---

Assignee: (was: Apache Spark)

 fix EqualNullSafe for binary type
 -

 Key: SPARK-8747
 URL: https://issues.apache.org/jira/browse/SPARK-8747
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8745) Remove GenerateMutableProjection

2015-06-30 Thread Akhil Thatipamula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609566#comment-14609566
 ] 

Akhil Thatipamula commented on SPARK-8745:
--

[~rxin] I will work on this.

 Remove GenerateMutableProjection
 

 Key: SPARK-8745
 URL: https://issues.apache.org/jira/browse/SPARK-8745
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin

 Based on discussion offline with [~marmbrus], we should remove 
 GenerateMutableProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8653) Add constraint for Children expression for data type

2015-06-30 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609583#comment-14609583
 ] 

Reynold Xin commented on SPARK-8653:


Implicit type casts should be up to the query engine itself, not each 
individual expression. So we really just need one rule in the TypeCoercion 
file to handle implicit type casts, and each expression can simply specify 
its expected input types.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-AllowedImplicitConversions

https://msdn.microsoft.com/en-us/library/ms191530.aspx
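
A minimal sketch of that idea (assumed simplified types, not Catalyst's actual classes): each expression only declares the input types it expects, and a single generic rule wraps children in casts when their types do not already match.

{code}
sealed trait DataType
case object IntType extends DataType
case object DoubleType extends DataType

trait Expression {
  def dataType: DataType
  def children: Seq[Expression]
}
case class Literal(value: Any, dataType: DataType) extends Expression {
  val children: Seq[Expression] = Nil
}
case class Cast(child: Expression, dataType: DataType) extends Expression {
  val children: Seq[Expression] = Seq(child)
}

// The expression only declares what it expects; it contains no coercion logic itself.
case class Add(left: Expression, right: Expression) extends Expression {
  val expectedInputTypes: Seq[DataType] = Seq(DoubleType, DoubleType)
  val dataType: DataType = DoubleType
  val children: Seq[Expression] = Seq(left, right)
}

object TypeCoercionSketch {
  // One generic rule: wrap any child whose type differs from the expectation in a Cast.
  def coerce(add: Add): Add = {
    val fixed = add.children.zip(add.expectedInputTypes).map {
      case (child, expected) if child.dataType != expected => Cast(child, expected)
      case (child, _)                                      => child
    }
    Add(fixed(0), fixed(1))
  }
}

// TypeCoercionSketch.coerce(Add(Literal(1, IntType), Literal(2.0, DoubleType)))
//   == Add(Cast(Literal(1, IntType), DoubleType), Literal(2.0, DoubleType))
{code}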



 Add constraint for Children expression for data type
 

 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Reynold Xin

 Currently, we have traits on Expression like `ExpectsInputTypes` and also the 
 `checkInputDataTypes` method, but they cannot convert the children expressions 
 automatically unless we write new rules in `HiveTypeCoercion`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8653) Add constraint for Children expression for data type

2015-06-30 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reassigned SPARK-8653:
--

Assignee: Reynold Xin

 Add constraint for Children expression for data type
 

 Key: SPARK-8653
 URL: https://issues.apache.org/jira/browse/SPARK-8653
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Cheng Hao
Assignee: Reynold Xin

 Currently, we have traits on Expression like `ExpectsInputTypes` and also the 
 `checkInputDataTypes` method, but they cannot convert the children expressions 
 automatically unless we write new rules in `HiveTypeCoercion`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


