[jira] [Commented] (SPARK-7129) Add generic boosting algorithm to spark.ml

2015-09-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877275#comment-14877275
 ] 

Meihua Wu commented on SPARK-7129:
--

[~josephkb] As weighting has been added to logistic regression and linear 
regression recently, I think we are in a good position to work on the boosting 
algorithms. Are there any plans to have it for 1.6? If so, I would like to work 
on this. Thanks!

> Add generic boosting algorithm to spark.ml
> --
>
> Key: SPARK-7129
> URL: https://issues.apache.org/jira/browse/SPARK-7129
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.
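To make the reweighting idea concrete, here is a minimal, framework-free Scala sketch of an AdaBoost-style loop over an arbitrary base learner. It is not a proposed spark.ml API: the base learner is abstracted as a plain function, labels are assumed to be in {-1, +1}, and all names are illustrative.

{code}
// Illustrative only: generic boosting expressed over an arbitrary base learner,
// independent of trees or any particular spark.ml Estimator.
object GenericBoostingSketch {

  // A base learner trains on (features, label, weight) triples and returns a
  // prediction function. Labels are assumed to be -1.0 or +1.0.
  type Learner = Seq[(Array[Double], Double, Double)] => (Array[Double] => Double)

  def boost(data: Seq[(Array[Double], Double)],
            baseLearner: Learner,
            numIterations: Int): Array[Double] => Double = {
    var weights = Array.fill(data.size)(1.0 / data.size)
    val ensemble = scala.collection.mutable.ArrayBuffer.empty[(Double, Array[Double] => Double)]

    for (_ <- 0 until numIterations) {
      val weighted = data.zip(weights).map { case ((x, y), w) => (x, y, w) }
      val model = baseLearner(weighted)

      // Weighted training error of this weak model.
      val err = data.zip(weights).collect {
        case ((x, y), w) if model(x) * y <= 0.0 => w
      }.sum
      val alpha = 0.5 * math.log((1.0 - err) / math.max(err, 1e-12))
      ensemble += ((alpha, model))

      // Reweight: misclassified examples get heavier, correct ones lighter.
      val reweighted = data.zip(weights).map { case ((x, y), w) =>
        w * math.exp(-alpha * y * model(x))
      }
      val norm = reweighted.sum
      weights = reweighted.map(_ / norm).toArray
    }

    // Final prediction is the sign of the weighted vote.
    (x: Array[Double]) => math.signum(ensemble.map { case (a, m) => a * m(x) }.sum)
  }
}
{code}

A spark.ml version would replace the Seq with a DataFrame and the function with any Classifier that supports instance weights, which is why the recently added weight support in logistic and linear regression matters here.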






[jira] [Commented] (SPARK-10706) Add java wrapper for random vector rdd

2015-09-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877274#comment-14877274
 ] 

Sean Owen commented on SPARK-10706:
---

Let's follow https://github.com/apache/spark/pull/8782 in any event. I think 
the idea is to add developer APIs, yes, but fair question: if it's just for 
consumption by the Scala core then does it need a wrapper?

> Add java wrapper for random vector rdd
> --
>
> Key: SPARK-10706
> URL: https://issues.apache.org/jira/browse/SPARK-10706
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: holdenk
>
> Similar to SPARK-3136 also wrap the random vector API to make it callable 
> easily from Java.
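For reference, a Java-friendly wrapper would likely just adapt the existing Scala method, much like the SPARK-3136 helpers. A rough sketch, where the object and method names are placeholders rather than the final API:

{code}
import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.random.{RandomDataGenerator, RandomRDDs}

// Placeholder names: delegates to the Scala RandomRDDs.randomVectorRDD and
// converts the result to a JavaRDD so it is easy to call from Java.
object JavaRandomVectorRDDSketch {
  def randomJavaVectorRDD(
      jsc: JavaSparkContext,
      generator: RandomDataGenerator[Double],
      numRows: Long,
      numCols: Int,
      numPartitions: Int,
      seed: Long): JavaRDD[Vector] = {
    RandomRDDs.randomVectorRDD(jsc.sc, generator, numRows, numCols, numPartitions, seed)
      .toJavaRDD()
  }
}
{code}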






[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877371#comment-14877371
 ] 

Apache Spark commented on SPARK-8632:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8835

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html






[jira] [Commented] (SPARK-10685) Misaligned data with RDD.zip and DataFrame.withColumn after repartition

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877372#comment-14877372
 ] 

Apache Spark commented on SPARK-10685:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8835

> Misaligned data with RDD.zip and DataFrame.withColumn after repartition
> ---
>
> Key: SPARK-10685
> URL: https://issues.apache.org/jira/browse/SPARK-10685
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.3.0, 1.4.1, 1.5.0
> Environment: - OSX 10.10.4, java 1.7.0_51, hadoop 2.6.0-cdh5.4.5
> - Ubuntu 12.04, java 1.7.0_80, hadoop 2.6.0-cdh5.4.5
>Reporter: Dan Brown
>Priority: Blocker
>
> Here's a weird behavior where {{RDD.zip}} or {{DataFrame.withColumn}} after a 
> {{repartition}} produces "misaligned" data, meaning different column values 
> in the same row aren't matched, as if a zip shuffled the collections before 
> zipping them. It's difficult to reproduce because it's nondeterministic, 
> doesn't occur in local mode, and requires ≥2 workers (≥3 in one case). I was 
> able to repro it using pyspark 1.3.0 (cdh5.4.5), 1.4.1 (bin-without-hadoop), 
> and 1.5.0 (bin-without-hadoop).
> Here's the most similar issue I was able to find. It appears to not have been 
> repro'd and then closed optimistically, and it smells like it could have been 
> the same underlying cause that was never fixed:
> - https://issues.apache.org/jira/browse/SPARK-9131
> Also, this {{DataFrame.zip}} issue is related in spirit, since we were trying 
> to build it ourselves when we ran into this problem. Let me put in my vote 
> for reopening the issue and supporting {{DataFrame.zip}} in the standard lib.
> - https://issues.apache.org/jira/browse/SPARK-7460
> h3. Brief repro
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> [r for r in df.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> [Row(a=39, b=639), Row(a=139, b=739), Row(a=239, b=839)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> []
> [Row(a=641, b=41), Row(a=741, b=141), Row(a=841, b=241)]
> [Row(a=641, b=1343), Row(a=741, b=1443), Row(a=841, b=1543)]
> [Row(a=639, b=39), Row(a=739, b=139), Row(a=839, b=239)]
> {code}
> Fail: RDD.zip after DataFrame.repartition
> {code}
> df  = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df  = df.repartition(100)
> rdd = df.rdd.zip(df.map(lambda r: Row(b=r.a))).map(lambda (x,y): Row(a=x.a, 
> b=y.b))
> [r for r in rdd.collect() if r.a != r.b][:3] # Should be []
> {code}
> Sample outputs (nondeterministic):
> {code}
> []
> [Row(a=50, b=6947), Row(a=150, b=7047), Row(a=250, b=7147)]
> []
> []
> [Row(a=44, b=644), Row(a=144, b=744), Row(a=244, b=844)]
> []
> {code}
> Test setup:
> - local\[8]: {{MASTER=local\[8]}}
> - dist\[N]: 1 driver + 1 master + N workers
> {code}
> "Fail" tests pass?  cluster mode  spark version
> 
> yes local[8]  1.3.0-cdh5.4.5
> no  dist[4]   1.3.0-cdh5.4.5
> yes local[8]  1.4.1
> yes dist[1]   1.4.1
> no  dist[2]   1.4.1
> no  dist[4]   1.4.1
> yes local[8]  1.5.0
> yes dist[1]   1.5.0
> no  dist[2]   1.5.0
> no  dist[4]   1.5.0
> {code}
> h3. Detailed repro
> Start `pyspark` and run these imports:
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType, StructType, StructField
> {code}
> Fail: withColumn(udf) after DataFrame.repartition
> {code}
> df = sqlCtx.createDataFrame(Row(a=a) for a in xrange(1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Ok: withColumn(udf) after DataFrame.repartition(100) after 1 starting 
> partition
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=1))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a != r.b]) # Should be 0
> {code}
> Fail: withColumn(udf) after DataFrame.repartition(100) after 100 starting 
> partitions
> {code}
> df = sqlCtx.createDataFrame(sc.parallelize((Row(a=a) for a in xrange(1)), 
> numSlices=100))
> df = df.repartition(100)
> df = df.withColumn('b', udf(lambda r: r, IntegerType())(df.a))
> len([r for r in df.collect() if r.a 

[jira] [Resolved] (SPARK-10710) Remove ability to set spark.shuffle.spill=false and spark.sql.planner.externalSort=false

2015-09-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10710.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Remove ability to set spark.shuffle.spill=false and 
> spark.sql.planner.externalSort=false
> 
>
> Key: SPARK-10710
> URL: https://issues.apache.org/jira/browse/SPARK-10710
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The {{spark.shuffle.spill=false}} configuration doesn't make much sense 
> nowadays: I think that this configuration was only added as an escape-hatch 
> to guard against bugs when spilling was first added. Similarly, setting 
> {{spark.sql.planner.externalSort=false}} doesn't make sense in newer 
> releases: many new implementations, such as Tungsten, completely ignore this 
> flag, so it's not applied in a consistent way.
> In order to reduce complexity, I think that we should remove the ability to 
> disable spilling. Note that the {{tungsten-shuffle}} manager already does not 
> respect this setting, so removing this configuration is a blocker to being 
> able to unify the two sort-shuffle implementations.






[jira] [Updated] (SPARK-10710) Remove ability to set spark.shuffle.spill=false and spark.sql.planner.externalSort=false

2015-09-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10710:

Labels: releasenotes  (was: )

> Remove ability to set spark.shuffle.spill=false and 
> spark.sql.planner.externalSort=false
> 
>
> Key: SPARK-10710
> URL: https://issues.apache.org/jira/browse/SPARK-10710
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The {{spark.shuffle.spill=false}} configuration doesn't make much sense 
> nowadays: I think that this configuration was only added as an escape-hatch 
> to guard against bugs when spilling was first added. Similarly, setting 
> {{spark.sql.planner.externalSort=false}} doesn't make sense in newer 
> releases: many new implementations, such as Tungsten, completely ignore this 
> flag, so it's not applied in a consistent way.
> In order to reduce complexity, I think that we should remove the ability to 
> disable spilling. Note that the {{tungsten-shuffle}} manager already does not 
> respect this setting, so removing this configuration is a blocker to being 
> able to unify the two sort-shuffle implementations.






[jira] [Updated] (SPARK-10155) Memory leak in SQL parsers

2015-09-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10155:
-
Assignee: Shixiong Zhu

> Memory leak in SQL parsers
> --
>
> Key: SPARK-10155
> URL: https://issues.apache.org/jira/browse/SPARK-10155
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
> Attachments: Screen Shot 2015-08-21 at 5.45.24 PM.png
>
>
> I saw a lot of `ThreadLocal` objects in the following app:
> {code}
> import org.apache.spark._
> import org.apache.spark.sql._
> object SparkApp {
>   def foo(sqlContext: SQLContext): Unit = {
> import sqlContext.implicits._
> sqlContext.sparkContext.parallelize(Seq("aaa", "bbb", 
> "ccc")).toDF().filter("length(_1) > 0").count()
>   }
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("sql-memory-leak")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> while (true) {
>   foo(sqlContext)
> }
>   }
> }
> {code}
> Running the above code for a long time will eventually cause an OOM.
> These "ThreadLocal"s are from 
> "scala.util.parsing.combinator.Parsers.lastNoSuccessVar", which stores 
> `Failure("end of input", ...)`.
> There is an issue in Scala here: https://issues.scala-lang.org/browse/SI-9010
> and some discussions here: https://issues.scala-lang.org/browse/SI-4929
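For background on why this leaks: a ThreadLocal set on a long-lived (e.g. pooled) thread keeps its value reachable until the thread dies or the slot is cleared. The sketch below is a generic illustration of the pattern and its mitigation, not Spark's actual fix:

{code}
// Generic illustration: a ThreadLocal that caches the last parse failure (as
// scala.util.parsing.combinator.Parsers.lastNoSuccessVar does) pins that value
// for the lifetime of the thread unless it is explicitly cleared.
object ThreadLocalLeakSketch {
  private val lastFailure = new ThreadLocal[AnyRef]()

  def parseLike(input: String): Unit = {
    try {
      // Stand-in for a parse that records its last failure in the ThreadLocal.
      lastFailure.set(new Array[Byte](1 << 20))
    } finally {
      // Without this, a pooled thread keeps the last failure alive indefinitely.
      lastFailure.remove()
    }
  }
}
{code}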






[jira] [Resolved] (SPARK-10155) Memory leak in SQL parsers

2015-09-19 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10155.
--
   Resolution: Fixed
Fix Version/s: 1.5.1
   1.6.0

Issue resolved by pull request 8357
[https://github.com/apache/spark/pull/8357]

> Memory leak in SQL parsers
> --
>
> Key: SPARK-10155
> URL: https://issues.apache.org/jira/browse/SPARK-10155
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Shixiong Zhu
>Priority: Critical
> Fix For: 1.6.0, 1.5.1
>
> Attachments: Screen Shot 2015-08-21 at 5.45.24 PM.png
>
>
> I saw a lot of `ThreadLocal` objects in the following app:
> {code}
> import org.apache.spark._
> import org.apache.spark.sql._
> object SparkApp {
>   def foo(sqlContext: SQLContext): Unit = {
> import sqlContext.implicits._
> sqlContext.sparkContext.parallelize(Seq("aaa", "bbb", 
> "ccc")).toDF().filter("length(_1) > 0").count()
>   }
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("sql-memory-leak")
> val sc = new SparkContext(conf)
> val sqlContext = new SQLContext(sc)
> while (true) {
>   foo(sqlContext)
> }
>   }
> }
> {code}
> Running the above code for a long time will eventually cause an OOM.
> These "ThreadLocal"s are from 
> "scala.util.parsing.combinator.Parsers.lastNoSuccessVar", which stores 
> `Failure("end of input", ...)`.
> There is an issue in Scala here: https://issues.scala-lang.org/browse/SI-9010
> and some discussions here: https://issues.scala-lang.org/browse/SI-4929






[jira] [Updated] (SPARK-10718) Check License should not verify conf files for license

2015-09-19 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Summary: Check License should not verify conf files for license  (was: Post 
Apache license header missing from multiple script and required files)

> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> On the latest 1.6.0-SNAPSHOT, ./dev/run-tests fails because the Apache license 
> header is missing from scripts and other files.
> This seems to be a side effect of changes made to check-license.
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}






[jira] [Updated] (SPARK-10718) Check License should not verify conf files for license

2015-09-19 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Description: 
Check License should not verify conf files for license
{code}
Apache license header missing from multiple script and required files
Could not find Apache license headers in the following files:
 !? <>spark/conf/spark-defaults.conf
[error] running <>spark/dev/check-license ; received return code 1
{code}

  was:
On the latest 1.6.0-SNAPSHOT, ./dev/run-tests fails because the Apache license 
header is missing from scripts and other files.
This seems to be a side effect of changes made to check-license.
{code}
Apache license header missing from multiple script and required files
Could not find Apache license headers in the following files:
 !? <>spark/conf/spark-defaults.conf
[error] running <>spark/dev/check-license ; received return code 1
{code}


> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}






[jira] [Commented] (SPARK-10718) Check License should not verify conf files for license

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877413#comment-14877413
 ] 

Apache Spark commented on SPARK-10718:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/8842

> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}






[jira] [Assigned] (SPARK-10718) Check License should not verify conf files for license

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10718:


Assignee: Apache Spark

> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>Assignee: Apache Spark
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}






[jira] [Updated] (SPARK-10716) http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz file broken?

2015-09-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10716:
--
  Priority: Minor  (was: Major)
Issue Type: Bug  (was: Question)

Hm, I see the same thing on OS X at the command line. The built-in OS X Finder 
unpacks it successfully though. It works OK on Linux too, for me.

It appears to be this file:

{code}
-rw-r--r-- jenkins/jenkins  8 2015-08-31 23:22 
spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/._SUCCESS.crc
{code}

That's in the archive but SUCCESS.crc isn't. Sounds like it's looking for a 
file that's not there. I think this file should just be removed from the repo:

https://github.com/apache/spark/blob/master/python/test_support/sql/orc_partitioned/._SUCCESS.crc

It's just a temp file. For the current artifacts there's at least a workaround.

> http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz file 
> broken?
> ---
>
> Key: SPARK-10716
> URL: https://issues.apache.org/jira/browse/SPARK-10716
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Deploy
>Affects Versions: 1.5.0
> Environment: Yosemite 10.10.5
>Reporter: Jack Jack
>Priority: Minor
>
> I directly downloaded the prebuilt binaries from 
> http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz 
> and got an error when running tar xvzf on it. I tried downloading and extracting twice.
> error log:
> ..
> x spark-1.5.0-bin-hadoop2.6/lib/
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
> x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
> x spark-1.5.0-bin-hadoop2.6/README.md
> tar: copyfile unpack 
> (spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc)
>  failed: No such file or directory
> ~ :>






[jira] [Commented] (SPARK-10626) Create a Java friendly method for randomRDD & RandomDataGenerator on RandomRDDs.

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877398#comment-14877398
 ] 

Apache Spark commented on SPARK-10626:
--

User 'rotationsymmetry' has created a pull request for this issue:
https://github.com/apache/spark/pull/8841

> Create a Java friendly method for randomRDD & RandomDataGenerator on 
> RandomRDDs.
> 
>
> Key: SPARK-10626
> URL: https://issues.apache.org/jira/browse/SPARK-10626
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: holdenk
>Priority: Minor
>
> SPARK-3136 added a large number of functions for creating Java RandomRDDs, 
> but for people that want to use custom RandomDataGenerators we should make a 
> Java friendly method.






[jira] [Commented] (SPARK-10717) remove the with Logging in the NioBlockTransferService

2015-09-19 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877397#comment-14877397
 ] 

DjvuLee commented on SPARK-10717:
-

Change to:

final class NioBlockTransferService(conf: SparkConf, securityManager: SecurityManager)
  extends BlockTransferService {}

> remove the with Logging in the NioBlockTransferService
> -
>
> Key: SPARK-10717
> URL: https://issues.apache.org/jira/browse/SPARK-10717
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: DjvuLee
>Priority: Minor
>
> Since BlockTransferService already implements the Logging trait, we can remove 
> the {{with Logging}} from NioBlockTransferService, keeping it consistent with 
> NettyBlockTransferService:
> abstract class BlockTransferService extends ShuffleClient with Closeable with Logging {}
> final class NioBlockTransferService(conf: SparkConf, securityManager: SecurityManager)
>   extends BlockTransferService with Logging {}
> class NettyBlockTransferService(conf: SparkConf, securityManager: SecurityManager, numCores: Int)
>   extends BlockTransferService {}






[jira] [Created] (SPARK-10717) remove the with Logging in the NioBlockTransferService

2015-09-19 Thread DjvuLee (JIRA)
DjvuLee created SPARK-10717:
---

 Summary: remove the with Logging in the NioBlockTransferService
 Key: SPARK-10717
 URL: https://issues.apache.org/jira/browse/SPARK-10717
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: DjvuLee
Priority: Minor


Since BlockTransferService already implements the Logging trait, we can remove 
the {{with Logging}} from NioBlockTransferService, keeping it consistent with 
NettyBlockTransferService:

abstract class BlockTransferService extends ShuffleClient with Closeable with Logging {}

final class NioBlockTransferService(conf: SparkConf, securityManager: SecurityManager)
  extends BlockTransferService with Logging {}

class NettyBlockTransferService(conf: SparkConf, securityManager: SecurityManager, numCores: Int)
  extends BlockTransferService {}
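A minimal, self-contained illustration of why the extra mix-in is redundant (simplified stand-in types, not the actual Spark classes):

{code}
// Simplified stand-ins: if the abstract base class already mixes in Logging,
// every subclass inherits it, so repeating `with Logging` adds nothing.
trait Logging { def logInfo(msg: String): Unit = println(msg) }

abstract class BaseService extends Logging

class RedundantService extends BaseService with Logging // legal, but redundant
class CleanService extends BaseService                  // same logging capability
{code}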






[jira] [Commented] (SPARK-8567) Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars

2015-09-19 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877338#comment-14877338
 ] 

Rekha Joshi commented on SPARK-8567:


Also, [~mengxr], there is a worker shutdown issue (SPARK-4300?) for 
HiveSparkSubmitSuite in the SPARK-8368 timeout log. Could that be the cause of 
the SPARK-8368 failure? [~yhuai] Thanks!
{code}
[info] HiveSparkSubmitSuite:Exception in thread "redirect stderr for command 
./bin/spark-submit" java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:283)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:325)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:177)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.fill(BufferedReader.java:154)
at java.io.BufferedReader.readLine(BufferedReader.java:317)
at java.io.BufferedReader.readLine(BufferedReader.java:382)
at 
scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:67)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.util.Utils$$anon$4.run(Utils.scala:1081)
{code}

> Flaky test: o.a.s.sql.hive.HiveSparkSubmitSuite --jars
> --
>
> Key: SPARK-8567
> URL: https://issues.apache.org/jira/browse/SPARK-8567
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 1.4.1
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
>  Labels: flaky-test
> Fix For: 1.4.1, 1.5.0
>
>
> It seems that tests in HiveSparkSubmitSuite fail with a timeout pretty frequently.






[jira] [Updated] (SPARK-4503) The history server is not compatible with HDFS HA

2015-09-19 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-4503:
---
Attachment: historyserver1.png

Hi [~xjxyxgq] MarsXu, I was not able to replicate the issue on Spark 1.5.0. I also 
checked with the latest 1.6.0-SNAPSHOT, and with YARN configured correctly the 
namenode works fine for me (screenshot attached). I agree with [~vanzin] that it 
could be a setup issue at your end. Please confirm whether we can close this. Thanks!

> The history server is not compatible with HDFS HA
> -
>
> Key: SPARK-4503
> URL: https://issues.apache.org/jira/browse/SPARK-4503
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.1.0
>Reporter: MarsXu
>Priority: Minor
> Attachments: historyserver1.png
>
>
>   I use an HDFS HA cluster to store the history server data.
>   The event log can be written to HDFS, but the history server cannot be started.
>   
>   Error log when executing "sbin/start-history-server.sh":
> {quote}
> 
> 14/11/20 10:25:04 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root, ); users 
> with modify permissions: Set(root, )
> 14/11/20 10:25:04 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at 
> org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:187)
> at 
> org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
> Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> appcluster
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> 
> {quote}
> When I set SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://s161.zw.db.d:53310/spark_history"
> in spark-env.sh, it can start, but without high availability.
> Environment
> {quote}
> spark-1.1.0-bin-hadoop2.4
> hadoop-2.5.1
> zookeeper-3.4.6
> {quote}
>   The config files are as follows:
> {quote}
> !### spark-defaults.conf ###
> spark.eventLog.dir hdfs://appcluster/history_server/
> spark.yarn.historyServer.address s161.zw.db.d:18080
> !### spark-env.sh ###
> export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://appcluster/history_server"
> !### core-site.xml ###
> <property>
>   <name>fs.defaultFS</name>
>   <value>hdfs://appcluster</value>
> </property>
> !### hdfs-site.xml ###
> <property>
>   <name>dfs.nameservices</name>
>   <value>appcluster</value>
> </property>
> <property>
>   <name>dfs.ha.namenodes.appcluster</name>
>   <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.appcluster.nn1</name>
>   <value>s161.zw.db.d:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.appcluster.nn2</name>
>   <value>s162.zw.db.d:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.servicerpc-address.appcluster.nn1</name>
>   <value>s161.zw.db.d:53310</value>
> </property>
> <property>
>   <name>dfs.namenode.servicerpc-address.appcluster.nn2</name>
>   <value>s162.zw.db.d:53310</value>
> </property>
> {quote}






[jira] [Comment Edited] (SPARK-10706) Add java wrapper for random vector rdd

2015-09-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877220#comment-14877220
 ] 

Meihua Wu edited comment on SPARK-10706 at 9/19/15 6:48 PM:


[~mengxr] I notice the Scala API `randomVectorRDD` has a DeveloperAPI 
annotation. I am checking if there is a reason to not expose the java wrapper. 
If not, I will submit a PR to resolve this JIRA. Thanks. 


was (Author: meihuawu):
I will work on this.

> Add java wrapper for random vector rdd
> --
>
> Key: SPARK-10706
> URL: https://issues.apache.org/jira/browse/SPARK-10706
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: holdenk
>
> Similar to SPARK-3136 also wrap the random vector API to make it callable 
> easily from Java.






[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8632:
---
Target Version/s: 1.6.0, 1.5.1  (was: 1.6.0)

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html






[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-09-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8632:
---
Priority: Blocker  (was: Major)

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>Assignee: Davies Liu
>Priority: Blocker
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html






[jira] [Assigned] (SPARK-10706) Add java wrapper for random vector rdd

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10706:


Assignee: (was: Apache Spark)

> Add java wrapper for random vector rdd
> --
>
> Key: SPARK-10706
> URL: https://issues.apache.org/jira/browse/SPARK-10706
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: holdenk
>
> Similar to SPARK-3136 also wrap the random vector API to make it callable 
> easily from Java.






[jira] [Assigned] (SPARK-10706) Add java wrapper for random vector rdd

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10706:


Assignee: Apache Spark

> Add java wrapper for random vector rdd
> --
>
> Key: SPARK-10706
> URL: https://issues.apache.org/jira/browse/SPARK-10706
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: holdenk
>Assignee: Apache Spark
>
> Similar to SPARK-3136 also wrap the random vector API to make it callable 
> easily from Java.






[jira] [Assigned] (SPARK-10718) Check License should not verify conf files for license

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10718:


Assignee: (was: Apache Spark)

> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}






[jira] [Issue Comment Deleted] (SPARK-10718) Check License should not verify conf files for license

2015-09-19 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-10718:

Comment: was deleted

(was: pull request in progress. thanks)

> Check License should not verify conf files for license
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> Check License should not verify conf files for license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}






[jira] [Assigned] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10631:


Assignee: Vinod KC  (was: Apache Spark)

> Add missing API doc in pyspark.mllib.linalg.Vector
> --
>
> Key: SPARK-10631
> URL: https://issues.apache.org/jira/browse/SPARK-10631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Vinod KC
>Priority: Minor
>
> There are some missing API docs in pyspark.mllib.linalg.Vector (including 
> DenseVector and SparseVector). We should add them based on their Scala 
> counterparts.






[jira] [Commented] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876914#comment-14876914
 ] 

Apache Spark commented on SPARK-10631:
--

User 'vinodkc' has created a pull request for this issue:
https://github.com/apache/spark/pull/8834

> Add missing API doc in pyspark.mllib.linalg.Vector
> --
>
> Key: SPARK-10631
> URL: https://issues.apache.org/jira/browse/SPARK-10631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Vinod KC
>Priority: Minor
>
> There are some missing API docs in pyspark.mllib.linalg.Vector (including 
> DenseVector and SparseVector). We should add them based on their Scala 
> counterparts.






[jira] [Assigned] (SPARK-10631) Add missing API doc in pyspark.mllib.linalg.Vector

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10631:


Assignee: Apache Spark  (was: Vinod KC)

> Add missing API doc in pyspark.mllib.linalg.Vector
> --
>
> Key: SPARK-10631
> URL: https://issues.apache.org/jira/browse/SPARK-10631
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> There are some missing API docs in pyspark.mllib.linalg.Vector (including 
> DenseVector and SparseVector). We should add them based on their Scala 
> counterparts.






[jira] [Resolved] (SPARK-10629) Gradient boosted trees: mapPartitions input size increasing

2015-09-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10629.
---
Resolution: Duplicate

> Gradient boosted trees: mapPartitions input size increasing 
> 
>
> Key: SPARK-10629
> URL: https://issues.apache.org/jira/browse/SPARK-10629
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1
>Reporter: Wenmin Wu
>
> First of all, I think my problem is quite different from 
> https://issues.apache.org/jira/browse/SPARK-10433, which points out that the 
> input size increases at each iteration.
> My problem is that the mapPartitions input size increases within a single 
> iteration. My training samples have 2958359 features in total. Within one 
> iteration, 3 collectAsMap operations are called. Here is a summary of each call.
> | Stage Id | Description | Duration | Input | Shuffle Read | Shuffle Write |
> |:--:|:---:|:---:|:---:|:--:|:--:|
> | 4 | mapPartitions at DecisionTree.scala:613 | 1.6 h | 710.2 MB | | 2.8 GB |
> | 5 | collectAsMap at DecisionTree.scala:642 | 1.8 min | | 2.8 GB | |
> | 6 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 27.0 GB | | 5.6 GB |
> | 7 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 5.6 GB | |
> | 8 | mapPartitions at DecisionTree.scala:613 | 1.2 h | 26.5 GB | | 11.1 GB |
> | 9 | collectAsMap at DecisionTree.scala:642 | 2.0 min | | 8.3 GB | |
> The mapPartitions operations take far too long, which is strange. I wonder 
> whether there is a bug here.






[jira] [Resolved] (SPARK-10661) The PipelineModel class inherits from Serializable twice.

2015-09-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10661.
---
Resolution: Not A Problem

i.e. "not a problem in *Spark* as far as I can tell". Still does show up twice 
for some reason.

> The PipelineModel class inherits from Serializable twice.
> -
>
> Key: SPARK-10661
> URL: https://issues.apache.org/jira/browse/SPARK-10661
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Matt Hagen
>Priority: Trivial
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The Scaladoc shows that org.apache.spark.ml.PipelineModel inherits from 
> Serializable twice. 






[jira] [Updated] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquare

2015-09-19 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-10715:
---
Labels: ML  (was: )

> Duplicate initialization flag in WeightedLeastSquare
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Bug
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: ML
>
> There is a duplicate assignment of the initialization flag in {{WeightedLeastSquares#add}}.






[jira] [Created] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquare

2015-09-19 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-10715:
--

 Summary: Duplicate initialization flag in WeightedLeastSquare
 Key: SPARK-10715
 URL: https://issues.apache.org/jira/browse/SPARK-10715
 Project: Spark
  Issue Type: Bug
Reporter: Kai Sasaki
Priority: Trivial


There is a duplicate assignment of the initialization flag in {{WeightedLeastSquares#add}}.






[jira] [Updated] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquare

2015-09-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10715:
--
 Labels:   (was: ML)
Component/s: ML
 Issue Type: Improvement  (was: Bug)

> Duplicate initialization flag in WeightedLeastSquare
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Priority: Trivial
>
> There is a duplicate assignment of the initialization flag in {{WeightedLeastSquares#add}}.






[jira] [Assigned] (SPARK-10686) Add quantileCol to AFTSurvivalRegression

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10686:


Assignee: Apache Spark  (was: Yanbo Liang)

> Add quantileCol to AFTSurvivalRegression
> 
>
> Key: SPARK-10686
> URL: https://issues.apache.org/jira/browse/SPARK-10686
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> By default `quantileCol` should be empty. If both `quantileProbabilities` and 
> `quantileCol` are set, we should append quantiles as a new column (of type 
> Vector).






[jira] [Commented] (SPARK-10686) Add quantileCol to AFTSurvivalRegression

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876932#comment-14876932
 ] 

Apache Spark commented on SPARK-10686:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8836

> Add quantileCol to AFTSurvivalRegression
> 
>
> Key: SPARK-10686
> URL: https://issues.apache.org/jira/browse/SPARK-10686
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> By default `quantileCol` should be empty. If both `quantileProbabilities` and 
> `quantileCol` are set, we should append quantiles as a new column (of type 
> Vector).






[jira] [Assigned] (SPARK-10686) Add quantileCol to AFTSurvivalRegression

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10686:


Assignee: Yanbo Liang  (was: Apache Spark)

> Add quantileCol to AFTSurvivalRegression
> 
>
> Key: SPARK-10686
> URL: https://issues.apache.org/jira/browse/SPARK-10686
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> By default `quantileCol` should be empty. If both `quantileProbabilities` and 
> `quantileCol` are set, we should append quantiles as a new column (of type 
> Vector).






[jira] [Commented] (SPARK-10714) Refactor PythonRDD to decouple iterator computation from PythonRDD

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876947#comment-14876947
 ] 

Apache Spark commented on SPARK-10714:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8835

> Refactor PythonRDD to decouple iterator computation from PythonRDD
> --
>
> Key: SPARK-10714
> URL: https://issues.apache.org/jira/browse/SPARK-10714
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The idea is that most of the logic of calling Python actually has nothing to 
> do with RDD (it is really just communicating with a socket -- there is 
> nothing distributed about it), and it is only currently depending on RDD 
> because it was written this way.
> If we extract that functionality out, we can apply it to area of the code 
> that doesn't depend on RDDs, and also make it easier to test.
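One way to picture the refactoring (purely illustrative; the trait below is not an actual Spark API): express the Python-invocation logic as an Iterator-to-Iterator computation, so an RDD becomes just one possible caller.

{code}
import org.apache.spark.TaskContext

// Illustrative shape only: the evaluation logic depends on plain iterators and
// the task context, not on RDD, so it can be reused by non-RDD code paths and
// unit-tested without a cluster.
trait PythonEvaluationLike[IN, OUT] {
  def compute(input: Iterator[IN], partitionIndex: Int, context: TaskContext): Iterator[OUT]
}

// An RDD-based caller would then simply delegate from its compute() method:
//   override def compute(split: Partition, context: TaskContext): Iterator[OUT] =
//     evaluator.compute(firstParent[IN].iterator(split, context), split.index, context)
{code}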






[jira] [Commented] (SPARK-10686) Add quantileCol to AFTSurvivalRegression

2015-09-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876935#comment-14876935
 ] 

Yanbo Liang commented on SPARK-10686:
-

I think naming `quantileCol` as `quantilesCol` would be better; please feel free 
to comment on my PR.

> Add quantileCol to AFTSurvivalRegression
> 
>
> Key: SPARK-10686
> URL: https://issues.apache.org/jira/browse/SPARK-10686
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> By default `quantileCol` should be empty. If both `quantileProbabilities` and 
> `quantileCol` are set, we should append quantiles as a new column (of type 
> Vector).
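A rough sketch of how the proposed parameter might be used, assuming the {{setQuantilesCol}} setter follows the naming suggested above (both setters here are illustrative, not a confirmed API):

{code}
import org.apache.spark.ml.regression.AFTSurvivalRegression

object AFTQuantilesSketch {
  def main(args: Array[String]): Unit = {
    val aft = new AFTSurvivalRegression()
      .setQuantileProbabilities(Array(0.25, 0.5, 0.75)) // quantiles to compute
      .setQuantilesCol("quantiles")                     // proposed output column (type Vector)
    // With quantilesCol set, transform() would append the quantiles column;
    // leaving it empty (the default) would skip it.
    println(aft.explainParams())
  }
}
{code}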






[jira] [Resolved] (SPARK-10529) When creating multiple HiveContext objects in one jvm, jdbc connections to metastore can't be released and it may cause PermGen OutOfMemoryError.

2015-09-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10529.
---
Resolution: Not A Problem

Resolving per the PR discussion: don't make more than one context in a JVM 
(i.e. run multiple isolated apps) as that's not the usual mode anyway. If you 
must, then increase PermGen.

> When creating multiple HiveContext objects in one jvm, jdbc connections to 
> metastore can't be released and it may cause PermGen OutOfMemoryError.
> --
>
> Key: SPARK-10529
> URL: https://issues.apache.org/jira/browse/SPARK-10529
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: ZhengYaofeng
>
> Test code as follows:
> object SqlTest {
>   def main(args: Array[String]) {
>     def createSc = {
>       val sparkConf = new SparkConf().setAppName(s"SqlTest")
>         .setMaster("spark://zdh221:7077")
>         .set("spark.executor.memory", "4g")
>         .set("spark.executor.cores", "2")
>         .set("spark.cores.max", "6")
>       new SparkContext(sparkConf)
>     }
>     for (index <- 1 to 200) {
>       println(s"Current Index:${index}=")
>       val hc = new HiveContext(createSc)
>       hc.sql("show databases").collect().foreach(println)
>       hc.sparkContext.stop()
>       Thread.sleep(3000)
>     }
>     Thread.sleep(100)
>   }
> }
> Testing on Spark 1.4.1 with the run command below.
>   export CLASSPATH="$CLASSPATH:/home/hadoop/spark/conf:/home/hadoop/spark/lib/*:/home/hadoop/zyf/lib/*"
>   java -Xmx8096m -Xms1024m -XX:MaxPermSize=1024m -cp $CLASSPATH SqlTest
> Files list:
>   
> /home/hadoop/spark/conf:core-site.xml;hdfs-site.xml;hive-site.xml;slaves;spark-defaults.conf;spark-env.sh
>   
> /home/hadoop/zyf/lib:hadoop-lzo-0.4.20.jar;mysql-connector-java-5.1.28-bin.jar;sqltest-1.0-SNAPSHOT.jar
>   
> MySQL is used as the metastore. You can clearly see that JDBC connections 
> to MySQL grow constantly via the command 'show status like 
> 'Threads_connected';' while my test app is running. Even if you invoke 
> 'Hive.closeCurrent()', it can't release the current JDBC connections, and I 
> cannot find another possible way. If you test against Spark 1.3.1, the JDBC 
> connections don't grow.
> Meanwhile, it ends with 'java.lang.OutOfMemoryError: PermGen space' after 
> cycling 45 times, which means 45 HiveContext objects were created. Interestingly, 
> if you set MaxPermSize to '2048m' it can cycle 93 times, and if you set 
> MaxPermSize to '3072m' it can cycle 141 times. This indicates that each time a 
> HiveContext object is created, it loads the same amount of new classes, and 
> they are never released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquares

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14876953#comment-14876953
 ] 

Apache Spark commented on SPARK-10715:
--

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/8837

> Duplicate initialization flag in WeightedLeastSquares
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Bug
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: ML
>
> There is a duplicate set of the initialization flag in 
> {{WeightedLeastSquares#add}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquares

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10715:


Assignee: Apache Spark

> Duplicate initialization flag in WeightedLeastSquares
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Bug
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: ML
>
> There is a duplicate set of the initialization flag in 
> {{WeightedLeastSquares#add}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10715) Duplicate initialization flag in WeightedLeastSquares

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10715:


Assignee: (was: Apache Spark)

> Duplicate initialization flag in WeightedLeastSquares
> ---
>
> Key: SPARK-10715
> URL: https://issues.apache.org/jira/browse/SPARK-10715
> Project: Spark
>  Issue Type: Bug
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: ML
>
> There is a duplicate set of the initialization flag in 
> {{WeightedLeastSquares#add}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10714) Refactor PythonRDD to decouple iterator computation from PythonRDD

2015-09-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10714:

Summary: Refactor PythonRDD to decouple iterator computation from PythonRDD 
 (was: Refactor PythonRDD to extract Python invocation out of "RDD")

> Refactor PythonRDD to decouple iterator computation from PythonRDD
> --
>
> Key: SPARK-10714
> URL: https://issues.apache.org/jira/browse/SPARK-10714
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The idea is that most of the logic of calling Python actually has nothing to 
> do with RDD (it is really just communicating with a socket -- there is 
> nothing distributed about it); it currently depends on RDD only because it 
> was written this way.
> If we extract that functionality out, we can apply it to areas of the code 
> that don't depend on RDDs, and also make it easier to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10714) Refactor PythonRDD to extract Python invocation out of "RDD"

2015-09-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-10714:
---

 Summary: Refactor PythonRDD to extract Python invocation out of 
"RDD"
 Key: SPARK-10714
 URL: https://issues.apache.org/jira/browse/SPARK-10714
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Reporter: Reynold Xin
Assignee: Reynold Xin


The idea is that most of the logic of calling Python actually has nothing to do 
with RDD (it is really just communicating with a socket -- there is nothing 
distributed about it); it currently depends on RDD only because it was written 
this way.

If we extract that functionality out, we can apply it to areas of the code that 
don't depend on RDDs, and also make it easier to test.
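
As a rough illustration of that direction (hypothetical names, not the actual refactoring), the socket-facing logic could be expressed over plain iterators so that non-RDD code paths and tests can drive it directly:
{code}
import org.apache.spark.TaskContext

// Hedged sketch: a runner that knows nothing about RDDs and only consumes/produces iterators.
class PythonRunner(command: Array[Byte] /*, envVars, pythonExec, ... */) {
  def compute(
      input: Iterator[Array[Byte]],
      partitionIndex: Int,
      context: TaskContext): Iterator[Array[Byte]] = {
    // open a socket to the Python worker, write `command` and `input`, stream the results back
    ???
  }
}

// PythonRDD.compute would then become a thin wrapper:
//   new PythonRunner(command).compute(firstParent.iterator(split, context), split.index, context)
{code}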




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10714) Refactor PythonRDD to decouple iterator computation from PythonRDD

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10714:


Assignee: Reynold Xin  (was: Apache Spark)

> Refactor PythonRDD to decouple iterator computation from PythonRDD
> --
>
> Key: SPARK-10714
> URL: https://issues.apache.org/jira/browse/SPARK-10714
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The idea is that most of the logic of calling Python actually has nothing to 
> do with RDD (it is really just communicating with a socket -- there is 
> nothing distributed about it); it currently depends on RDD only because it 
> was written this way.
> If we extract that functionality out, we can apply it to areas of the code 
> that don't depend on RDDs, and also make it easier to test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5905) Note requirements for certain RowMatrix methods in docs

2015-09-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-5905:
-
Priority: Trivial  (was: Minor)
 Summary: Note requirements for certain RowMatrix methods in docs  (was: 
Improve RowMatrix user guide and doc.)

I'll make a PR. I'm sure row x col is always the convention, but the methods 
could use a doc stating what they require.
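
For context, a minimal sketch of the calls in question (assumes an existing SparkContext {{sc}}, e.g. in spark-shell); the row x column convention and the limits described below are what the doc note would spell out:
{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// RowMatrix is m rows x n columns, where n is the dimension of each row vector.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 9.0),
  Vectors.dense(10.0, 11.0, 12.0)))
val mat = new RowMatrix(rows)                 // m = 4, n = 3

val svd = mat.computeSVD(2, computeU = true)  // documented for the "tall-and-skinny" case, n <= m
val pca = mat.computePrincipalComponents(2)   // throws for n > 65535 (covariance-based path)
{code}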

> Note requirements for certain RowMatrix methods in docs
> ---
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Priority: Trivial
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. As a 
> reader interested in applying SVD, I would prefer the more common m x n 
> convention of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix.computePrincipalComponents, or RowMatrix in general:
> I got an exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> It would be nice to document this 65535-column restriction (if it 
> still applies in 1.3).
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5905) Note requirements for certain RowMatrix methods in docs

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877047#comment-14877047
 ] 

Apache Spark commented on SPARK-5905:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8839

> Note requirements for certain RowMatrix methods in docs
> ---
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Priority: Trivial
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. As a 
> reader interested in applying SVD, I would prefer the more common m x n 
> convention of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix.computePrincipalComponents, or RowMatrix in general:
> I got an exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> It would be nice to document this 65535-column restriction (if this 
> still applies in 1.3).
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5905) Note requirements for certain RowMatrix methods in docs

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5905:
---

Assignee: (was: Apache Spark)

> Note requirements for certain RowMatrix methods in docs
> ---
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Priority: Trivial
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. As a 
> reader interested in applying SVD, I would prefer the more common m x n 
> convention of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix.computePrincipalComponents, or RowMatrix in general:
> I got an exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> It would be nice to document this 65535-column restriction (if this 
> still applies in 1.3).
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5905) Note requirements for certain RowMatrix methods in docs

2015-09-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5905:
---

Assignee: Apache Spark

> Note requirements for certain RowMatrix methods in docs
> ---
>
> Key: SPARK-5905
> URL: https://issues.apache.org/jira/browse/SPARK-5905
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 1.3.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Trivial
>
> From mbofb's comment in PR https://github.com/apache/spark/pull/4680:
> {code}
> The description of RowMatrix.computeSVD and 
> mllib-dimensionality-reduction.html should be more precise/explicit regarding 
> the m x n matrix. In the current description I would conclude that n refers 
> to the rows. According to 
> http://math.stackexchange.com/questions/191711/how-many-rows-and-columns-are-in-an-m-x-n-matrix
>  this way of describing a matrix is only used in particular domains. As a 
> reader interested in applying SVD, I would prefer the more common m x n 
> convention of rows x columns (e.g. 
> http://en.wikipedia.org/wiki/Matrix_%28mathematics%29 ) which is also used in 
> http://en.wikipedia.org/wiki/Latent_semantic_analysis (and also within the 
> ARPACK manual:
> “
> N Integer. (INPUT) - Dimension of the eigenproblem. 
> NEV Integer. (INPUT) - Number of eigenvalues of OP to be computed. 0 < NEV < 
> N. 
> NCV Integer. (INPUT) - Number of columns of the matrix V (less than or equal 
> to N).
> “
> ).
> description of RowMatrix.computeSVD and mllib-dimensionality-reduction.html:
> "We assume n is smaller than m." Is this just a recommendation or a hard 
> requirement. This condition seems not to be checked and causing an 
> IllegalArgumentException – the processing finishes even though the vectors 
> have a higher dimension than the number of vectors.
> description of RowMatrix.computePrincipalComponents, or RowMatrix in general:
> I got an exception.
> java.lang.IllegalArgumentException: Argument with more than 65535 cols: 
> 7949273
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.checkNumColumns(RowMatrix.scala:131)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computeCovariance(RowMatrix.scala:318)
> at 
> org.apache.spark.mllib.linalg.distributed.RowMatrix.computePrincipalComponents(RowMatrix.scala:373)
> It would be nice to document this 65535-column restriction (if this 
> still applies in 1.3).
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10568) Error thrown in stopping one component in SparkContext.stop() doesn't allow other components to be stopped

2015-09-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877037#comment-14877037
 ] 

Sean Owen commented on SPARK-10568:
---

[~mcheah] are you working on a PR for this?

> Error thrown in stopping one component in SparkContext.stop() doesn't allow 
> other components to be stopped
> --
>
> Key: SPARK-10568
> URL: https://issues.apache.org/jira/browse/SPARK-10568
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Matt Cheah
>Priority: Minor
>
> When I shut down a Java process that is running a SparkContext, it invokes a 
> shutdown hook that eventually calls SparkContext.stop(), and inside 
> SparkContext.stop() each individual component (DiskBlockManager, Scheduler 
> Backend) is stopped. If an exception is thrown in stopping one of these 
> components, none of the other components will be stopped cleanly either. This 
> caused problems when I stopped a Java process running a Spark context in 
> yarn-client mode, because not properly stopping YarnSchedulerBackend leads to 
> problems.
> The steps I ran are as follows:
> 1. Create one job which fills the cluster
> 2. Kick off another job which creates a Spark Context
> 3. Kill the Java process with the Spark Context in #2
> 4. The job remains in the YARN UI as ACCEPTED
> Looking in the logs we see the following:
> {code}
> 2015-09-07 10:32:43,446 ERROR [Thread-3] o.a.s.u.Utils - Uncaught exception 
> in thread Thread-3
> java.lang.NullPointerException: null
> at 
> org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:162)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at 
> org.apache.spark.storage.DiskBlockManager$$anonfun$addShutdownHook$1.apply$mcV$sp(DiskBlockManager.scala:144)
>  ~[spark-core_2.10-1.4.1.jar:1.4.1]
> at org.apache.spark.util.SparkShutdownHook.run(Utils.scala:2308) 
> ~[spark-core_2.10-1.4.1.jar:1.4.1]
> {code}
> I think what's going on is that when we kill the application in the queued 
> state, it tries to run the SparkContext.stop() method on the driver and stop 
> each component. It dies trying to stop the DiskBlockManager because it hasn't 
> been initialized yet - the application is still waiting to be scheduled by 
> the Yarn RM - but YarnClient.stop() is not invoked as a result, leaving the 
> application sticking around in the accepted state.
> Because of what appears to be bugs in the YARN scheduler, entering this state 
> makes it so that the YARN scheduler is unable to schedule any more jobs 
> unless we manually remove this application via the YARN CLI. We can tackle 
> the YARN stuck state separately, but ensuring that all components get at 
> least some chance to stop when a SparkContext stops seems like a good idea. 
> Of course we can still throw some exception and/or log exceptions for 
> everything that goes wrong at the end of stopping the context.
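
A minimal sketch of the pattern being asked for here (hypothetical helper, not the actual SparkContext code): attempt every component's stop(), log failures, and keep going so a single bad component cannot block the rest.
{code}
import scala.util.control.NonFatal

// Hedged sketch: best-effort shutdown of a list of named components.
def stopAll(stops: Seq[(String, () => Unit)]): Unit = {
  val failures = stops.flatMap { case (name, stop) =>
    try { stop(); None }
    catch { case NonFatal(e) =>
      System.err.println(s"Error stopping $name: ${e.getMessage}")  // log and continue
      Some(e)
    }
  }
  // Optionally surface the first failure once everything has had a chance to stop.
  failures.headOption.foreach(e => throw e)
}

// Usage sketch (component names hypothetical):
// stopAll(Seq(
//   "diskBlockManager" -> (() => diskBlockManager.stop()),
//   "schedulerBackend" -> (() => schedulerBackend.stop()),
//   "yarnClient"       -> (() => yarnClient.stop())))
{code}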



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10712) JVM crashes with spark.sql.tungsten.enabled = true

2015-09-19 Thread Mauro Pirrone (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mauro Pirrone updated SPARK-10712:
--
Description: 
When turning on tungsten, I get the following error when executing a query/job 
with a few joins. When tungsten is turned off, the error does not appear. Also 
note that tungsten works for me in other cases.

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7ffadaf59200, pid=7598, tid=140710015645440
#
# JRE version: Java(TM) SE Runtime Environment (8.0_45-b14) (build 1.8.0_45-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.45-b02 mixed mode linux-amd64 
compressed oops)
# Problematic frame:
# V  [libjvm.so+0x7eb200]
#
# Core dump written. Default location: //core or core.7598 (max size 100 
kB). To ensure a full core dump, try "ulimit -c unlimited" before starting Java 
again
#
# An error report file with more information is saved as:
# //hs_err_pid7598.log
Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
(native)
 total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
 relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
 main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
(native)
 total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
 relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
 main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#





---  T H R E A D  ---

Current thread (0x7ff7902e7800):  JavaThread "broadcast-hash-join-1" daemon 
[_thread_in_vm, id=16548, stack(0x7ff66bd98000,0x7ff66be99000)]

siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 
0x00069f572b10

Registers:
RAX=0x00069f672b08, RBX=0x7ff7902e7800, RCX=0x000394132140, 
RDX=0xfffe0004
RSP=0x7ff66be97048, RBP=0x7ff66be970a0, RSI=0x000394032148, 
RDI=0x00069f572b10
R8 =0x7ff66be970d0, R9 =0x0028, R10=0x7ff79cc0e1e7, 
R11=0x7ff79cc0e198
R12=0x7ff66be970c0, R13=0x7ff66be970d0, R14=0x0028, 
R15=0x30323048
RIP=0x7ff7b0dae200, EFLAGS=0x00010282, CSGSFS=0xe033, 
ERR=0x0004
  TRAPNO=0x000e

Top of Stack: (sp=0x7ff66be97048)
0x7ff66be97048:   7ff7b1042b1a 7ff7902e7800
0x7ff66be97058:   7ff7 7ff7902e7800
0x7ff66be97068:   7ff7902e7800 7ff7ad2846a0
0x7ff66be97078:   7ff7897048d8 
0x7ff66be97088:   7ff66be97110 7ff66be971f0
0x7ff66be97098:   7ff7902e7800 7ff66be970f0
0x7ff66be970a8:   7ff79cc0e261 0010
0x7ff66be970b8:   000390c04048 00066f24fac8
0x7ff66be970c8:   7ff7902e7800 000394032120
0x7ff66be970d8:   7ff7902e7800 7ff66f971af0
0x7ff66be970e8:   7ff7902e7800 7ff66be97198
0x7ff66be970f8:   7ff79c9d4c4d 7ff66a454b10
0x7ff66be97108:   7ff79c9d4c4d 0010
0x7ff66be97118:   7ff7902e5a90 0028
0x7ff66be97128:   7ff79c9d4760 000394032120
0x7ff66be97138:   30323048 7ff66be97160
0x7ff66be97148:   00066f24fac8 000390c04048
0x7ff66be97158:   7ff66be97158 7ff66f978eeb
0x7ff66be97168:   7ff66be971f0 7ff66f9791c8
0x7ff66be97178:   7ff668e90c60 7ff66f978f60
0x7ff66be97188:   7ff66be97110 7ff66be971b8
0x7ff66be97198:   7ff66be97238 7ff79c9d4c4d
0x7ff66be971a8:   0010 
0x7ff66be971b8:   38363130 38363130
0x7ff66be971c8:   0028 7ff66f973388
0x7ff66be971d8:   000394032120 30323048
0x7ff66be971e8:   000665823080 00066f24fac8
0x7ff66be971f8:   7ff66be971f8 7ff66f973357
0x7ff66be97208:   7ff66be97260 7ff66f976fe0
0x7ff66be97218:    7ff66f973388
0x7ff66be97228:   7ff66be971b8 7ff66be97248
0x7ff66be97238:   7ff66be972a8 7ff79c9d4c4d 

Instructions: (pc=0x7ff7b0dae200)
0x7ff7b0dae1e0:   00 00 00 48 8d 4c d6 f8 48 f7 da eb 39 48 8b 74
0x7ff7b0dae1f0:   d0 08 48 89 74 d1 08 48 83 c2 01 75 f0 c3 66 90
0x7ff7b0dae200:   48 8b 74 d0 e8 48 89 74 d1 e8 48 8b 74 d0 f0 48
0x7ff7b0dae210:   89 74 d1 f0 48 8b 74 d0 f8 48 89 74 d1 f8 48 8b 

Register to memory mapping:

RAX=0x00069f672b08 is an unallocated location in the heap
RBX=0x7ff7902e7800 is a thread
RCX=0x000394132140 is pointing into object: 0x000394032120
[B 
 - klass: {type array byte}
 - length: 1886151312
RDX=0xfffe0004 is an unknown value
RSP=0x7ff66be97048 is pointing 

[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-09-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877105#comment-14877105
 ] 

Apache Spark commented on SPARK-10304:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8840

> Partition discovery does not throw an exception if the dir structure is 
> invalid
> ---
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But if it is Parquet, 
> we can even return rows. We should complain to users about the directory 
> structure because {{table1}} does not meet our expected format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}
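
For illustration, a hedged sketch of the difference in question (assumes an existing SQLContext {{sqlContext}}; the paths are the hypothetical ones from the description): partition discovery expects the table directory whose children are {{partition_column=value}} directories, not that directory's parent, and the parent case should fail with a clear error rather than an NPE or silently returned rows.
{code}
// Table layout from the description: /path/table1/partition_column=1/...
val ok  = sqlContext.read.format("orc").load("/path/table1")  // children are partition dirs: fine
val bad = sqlContext.read.format("orc").load("/path")         // child is `table1`, not k=v: should be rejected
{code}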



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6067) Spark sql hive dynamic partitions job will fail if task fails

2015-09-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6067.
--
Resolution: Duplicate

From the PR discussion, it sounds like this is the same as SPARK-8379.

> Spark sql hive dynamic partitions job will fail if task fails
> -
>
> Key: SPARK-6067
> URL: https://issues.apache.org/jira/browse/SPARK-6067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Jason Hubbard
>Priority: Minor
> Attachments: job.log
>
>
> When inserting into a Hive table from Spark SQL while using dynamic 
> partitioning, if a task fails, its retries will continue to fail and 
> eventually fail the job.
> /mytable/.hive-staging_hive_2015-02-27_11-53-19_573_222-3/-ext-1/partition=2015-02-04/part-1
>  for client  already exists
> The retry may need to clean up the output location left behind by the 
> previously failed task before it can write there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10663) Change test.toDF to test in Spark ML Programming Guide

2015-09-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877127#comment-14877127
 ] 

Sean Owen commented on SPARK-10663:
---

[~jzhang] I see model.transform(test.toDF) in the current docs 
(http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline) and I think 
this is indeed redundant; only one instance uses .toDF.

> Change test.toDF to test in Spark ML Programming Guide
> --
>
> Key: SPARK-10663
> URL: https://issues.apache.org/jira/browse/SPARK-10663
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Matt Hagen
>Priority: Trivial
>
> Spark 1.5.0 > Spark ML Programming Guide > Example: Pipeline
> I believe model.transform(test.toDF) should be model.transform(test).
> Note that "test" is already a DataFrame.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10662) Code snippets are not properly formatted in tables

2015-09-19 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-10662:

Summary: Code snippets are not properly formatted in tables  (was: Code 
snippets are not properly formatted in docs)

> Code snippets are not properly formatted in tables
> --
>
> Key: SPARK-10662
> URL: https://issues.apache.org/jira/browse/SPARK-10662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.0
>Reporter: Jacek Laskowski
> Attachments: spark-docs-backticks-tables.png
>
>
> Backticks (markdown) in tables are not processed and hence not formatted 
> properly. See 
> http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
>  and search for {{`yarn-client`}}.
> As per [Sean's 
> suggestion|https://github.com/apache/spark/pull/8795#issuecomment-141019047] 
> I'm creating the JIRA task.
> {quote}
> This is a good fix, but this is another instance where I suspect the same 
> issue exists in several markup files, like configuration.html. It's worth a 
> JIRA since I think catching and fixing all of these is one non-trivial 
> logical change.
> If you can, avoid whitespace changes like stripping or adding space at the 
> end of lines. It just adds to the diff and makes for a tiny extra chance of 
> merge conflicts.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10706) Add java wrapper for random vector rdd

2015-09-19 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877220#comment-14877220
 ] 

Meihua Wu commented on SPARK-10706:
---

I will work on this.

> Add java wrapper for random vector rdd
> --
>
> Key: SPARK-10706
> URL: https://issues.apache.org/jira/browse/SPARK-10706
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API, MLlib
>Reporter: holdenk
>
> Similar to SPARK-3136 also wrap the random vector API to make it callable 
> easily from Java.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10712) JVM crashes with spark.sql.tungsten.enabled = true

2015-09-19 Thread Mauro Pirrone (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877223#comment-14877223
 ] 

Mauro Pirrone commented on SPARK-10712:
---

A workaround for this problem is to increase 
spark.sql.autoBroadcastJoinThreshold or set the value to -1. 
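
A hedged sketch of applying that workaround programmatically (assumes an existing SQLContext {{sqlContext}}); the property can equally be passed via --conf at submit time:
{code}
// -1 disables broadcast hash joins entirely; a larger byte threshold just raises the cutoff.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
// or, e.g., allow broadcasting tables up to ~100 MB:
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)
{code}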

> JVM crashes with spark.sql.tungsten.enabled = true
> --
>
> Key: SPARK-10712
> URL: https://issues.apache.org/jira/browse/SPARK-10712
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: 1 node - Linux, 64GB ram, 8 core
>Reporter: Mauro Pirrone
>Priority: Critical
>
> When turning on tungsten, I get the following error when executing a 
> query/job with a few joins. When tungsten is turned off, the error does not 
> appear. Also note that tungsten works for me in other cases.
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ffadaf59200, pid=7598, tid=140710015645440
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_45-b14) (build 
> 1.8.0_45-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.45-b02 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x7eb200]
> #
> # Core dump written. Default location: //core or core.7598 (max size 100 
> kB). To ensure a full core dump, try "ulimit -c unlimited" before starting 
> Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid7598.log
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x7ff7902e7800):  JavaThread "broadcast-hash-join-1" 
> daemon [_thread_in_vm, id=16548, stack(0x7ff66bd98000,0x7ff66be99000)]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 
> 0x00069f572b10
> Registers:
> RAX=0x00069f672b08, RBX=0x7ff7902e7800, RCX=0x000394132140, 
> RDX=0xfffe0004
> RSP=0x7ff66be97048, RBP=0x7ff66be970a0, RSI=0x000394032148, 
> RDI=0x00069f572b10
> R8 =0x7ff66be970d0, R9 =0x0028, R10=0x7ff79cc0e1e7, 
> R11=0x7ff79cc0e198
> R12=0x7ff66be970c0, R13=0x7ff66be970d0, R14=0x0028, 
> R15=0x30323048
> RIP=0x7ff7b0dae200, EFLAGS=0x00010282, CSGSFS=0xe033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x7ff66be97048)
> 0x7ff66be97048:   7ff7b1042b1a 7ff7902e7800
> 0x7ff66be97058:   7ff7 7ff7902e7800
> 0x7ff66be97068:   7ff7902e7800 7ff7ad2846a0
> 0x7ff66be97078:   7ff7897048d8 
> 0x7ff66be97088:   7ff66be97110 7ff66be971f0
> 0x7ff66be97098:   7ff7902e7800 7ff66be970f0
> 0x7ff66be970a8:   7ff79cc0e261 0010
> 0x7ff66be970b8:   000390c04048 00066f24fac8
> 0x7ff66be970c8:   7ff7902e7800 000394032120
> 0x7ff66be970d8:   7ff7902e7800 7ff66f971af0
> 0x7ff66be970e8:   7ff7902e7800 7ff66be97198
> 0x7ff66be970f8:   7ff79c9d4c4d 7ff66a454b10
> 0x7ff66be97108:   7ff79c9d4c4d 0010
> 0x7ff66be97118:   7ff7902e5a90 0028
> 0x7ff66be97128:   7ff79c9d4760 000394032120
> 0x7ff66be97138:   30323048 7ff66be97160
> 0x7ff66be97148:   00066f24fac8 000390c04048
> 0x7ff66be97158:   7ff66be97158 7ff66f978eeb
> 0x7ff66be97168:   7ff66be971f0 7ff66f9791c8
> 0x7ff66be97178:   7ff668e90c60 7ff66f978f60
> 0x7ff66be97188:   7ff66be97110 7ff66be971b8
> 0x7ff66be97198:   7ff66be97238 7ff79c9d4c4d
> 0x7ff66be971a8:   0010 
> 0x7ff66be971b8:   38363130 38363130
> 0x7ff66be971c8:   0028 7ff66f973388
> 0x7ff66be971d8:   000394032120 30323048
> 0x7ff66be971e8:   000665823080 00066f24fac8
> 0x7ff66be971f8:   7ff66be971f8 7ff66f973357
> 0x7ff66be97208:   7ff66be97260 7ff66f976fe0
> 0x7ff66be97218:    7ff66f973388
> 

[jira] [Comment Edited] (SPARK-10712) JVM crashes with spark.sql.tungsten.enabled = true

2015-09-19 Thread Mauro Pirrone (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877223#comment-14877223
 ] 

Mauro Pirrone edited comment on SPARK-10712 at 9/19/15 5:40 PM:


A workaround to this problem is to increase 
spark.sql.autoBroadcastJoinThreshold or set the value to -1. 


was (Author: mauro.pirrone):
A workaround to this problem is set increase 
spark.sql.autoBroadcastJoinThreshold or set the value to -1. 

> JVM crashes with spark.sql.tungsten.enabled = true
> --
>
> Key: SPARK-10712
> URL: https://issues.apache.org/jira/browse/SPARK-10712
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
> Environment: 1 node - Linux, 64GB ram, 8 core
>Reporter: Mauro Pirrone
>Priority: Critical
>
> When turning on tungsten, I get the following error when executing a 
> query/job with a few joins. When tungsten is turned off, the error does not 
> appear. Also note that tungsten works for me in other cases.
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ffadaf59200, pid=7598, tid=140710015645440
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_45-b14) (build 
> 1.8.0_45-b14)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.45-b02 mixed mode 
> linux-amd64 compressed oops)
> # Problematic frame:
> # V  [libjvm.so+0x7eb200]
> #
> # Core dump written. Default location: //core or core.7598 (max size 100 
> kB). To ensure a full core dump, try "ulimit -c unlimited" before starting 
> Java again
> #
> # An error report file with more information is saved as:
> # //hs_err_pid7598.log
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> Compiled method (nm)   44403 10436 n 0   sun.misc.Unsafe::copyMemory 
> (native)
>  total in heap  [0x7ffac6b49290,0x7ffac6b495f8] = 872
>  relocation [0x7ffac6b493b8,0x7ffac6b49400] = 72
>  main code  [0x7ffac6b49400,0x7ffac6b495f8] = 504
> #
> # If you would like to submit a bug report, please visit:
> #   http://bugreport.java.com/bugreport/crash.jsp
> #
> ---  T H R E A D  ---
> Current thread (0x7ff7902e7800):  JavaThread "broadcast-hash-join-1" 
> daemon [_thread_in_vm, id=16548, stack(0x7ff66bd98000,0x7ff66be99000)]
> siginfo: si_signo: 11 (SIGSEGV), si_code: 2 (SEGV_ACCERR), si_addr: 
> 0x00069f572b10
> Registers:
> RAX=0x00069f672b08, RBX=0x7ff7902e7800, RCX=0x000394132140, 
> RDX=0xfffe0004
> RSP=0x7ff66be97048, RBP=0x7ff66be970a0, RSI=0x000394032148, 
> RDI=0x00069f572b10
> R8 =0x7ff66be970d0, R9 =0x0028, R10=0x7ff79cc0e1e7, 
> R11=0x7ff79cc0e198
> R12=0x7ff66be970c0, R13=0x7ff66be970d0, R14=0x0028, 
> R15=0x30323048
> RIP=0x7ff7b0dae200, EFLAGS=0x00010282, CSGSFS=0xe033, 
> ERR=0x0004
>   TRAPNO=0x000e
> Top of Stack: (sp=0x7ff66be97048)
> 0x7ff66be97048:   7ff7b1042b1a 7ff7902e7800
> 0x7ff66be97058:   7ff7 7ff7902e7800
> 0x7ff66be97068:   7ff7902e7800 7ff7ad2846a0
> 0x7ff66be97078:   7ff7897048d8 
> 0x7ff66be97088:   7ff66be97110 7ff66be971f0
> 0x7ff66be97098:   7ff7902e7800 7ff66be970f0
> 0x7ff66be970a8:   7ff79cc0e261 0010
> 0x7ff66be970b8:   000390c04048 00066f24fac8
> 0x7ff66be970c8:   7ff7902e7800 000394032120
> 0x7ff66be970d8:   7ff7902e7800 7ff66f971af0
> 0x7ff66be970e8:   7ff7902e7800 7ff66be97198
> 0x7ff66be970f8:   7ff79c9d4c4d 7ff66a454b10
> 0x7ff66be97108:   7ff79c9d4c4d 0010
> 0x7ff66be97118:   7ff7902e5a90 0028
> 0x7ff66be97128:   7ff79c9d4760 000394032120
> 0x7ff66be97138:   30323048 7ff66be97160
> 0x7ff66be97148:   00066f24fac8 000390c04048
> 0x7ff66be97158:   7ff66be97158 7ff66f978eeb
> 0x7ff66be97168:   7ff66be971f0 7ff66f9791c8
> 0x7ff66be97178:   7ff668e90c60 7ff66f978f60
> 0x7ff66be97188:   7ff66be97110 7ff66be971b8
> 0x7ff66be97198:   7ff66be97238 7ff79c9d4c4d
> 0x7ff66be971a8:   0010 
> 0x7ff66be971b8:   38363130 38363130
> 0x7ff66be971c8:   0028 7ff66f973388
> 0x7ff66be971d8:   000394032120 30323048
> 0x7ff66be971e8:   000665823080 

[jira] [Created] (SPARK-10716) http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz file broken?

2015-09-19 Thread Jack Jack (JIRA)
Jack Jack created SPARK-10716:
-

 Summary: 
http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz file broken?
 Key: SPARK-10716
 URL: https://issues.apache.org/jira/browse/SPARK-10716
 Project: Spark
  Issue Type: Question
  Components: Build, Deploy
Affects Versions: 1.5.0
 Environment: Yosemite 10.10.5
Reporter: Jack Jack


Directly downloaded the prebuilt binaries from 
http://d3kbcqa49mib13.cloudfront.net/spark-1.5.0-bin-hadoop2.6.tgz 
and got an error when running tar xvzf on it. Tried downloading and extracting twice.

error log:
..
x spark-1.5.0-bin-hadoop2.6/lib/
x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar
x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar
x spark-1.5.0-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar
x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
x spark-1.5.0-bin-hadoop2.6/README.md
tar: copyfile unpack 
(spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc) 
failed: No such file or directory
~ :>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10718) Post Apache license header missing from multiple script and required files

2015-09-19 Thread Rekha Joshi (JIRA)
Rekha Joshi created SPARK-10718:
---

 Summary: Post Apache license header missing from multiple script 
and required files
 Key: SPARK-10718
 URL: https://issues.apache.org/jira/browse/SPARK-10718
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.5.0
Reporter: Rekha Joshi


On the latest 1.6.0-SNAPSHOT, ./dev/run-tests fails because the Apache license 
header is missing from scripts and other files.
This seems to be a side effect of changes made to check-license
{code}
Apache license header missing from multiple script and required files
Could not find Apache license headers in the following files:
 !? <>spark/conf/spark-defaults.conf
[error] running <>spark/dev/check-license ; received return code 1
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10718) Post Apache license header missing from multiple script and required files

2015-09-19 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14877408#comment-14877408
 ] 

Rekha Joshi commented on SPARK-10718:
-

pull request in progress. thanks

> Post Apache license header missing from multiple script and required files
> --
>
> Key: SPARK-10718
> URL: https://issues.apache.org/jira/browse/SPARK-10718
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> On the latest 1.6.0-SNAPSHOT, ./dev/run-tests fails because the Apache license 
> header is missing from scripts and other files.
> This seems to be a side effect of changes made to check-license
> {code}
> Apache license header missing from multiple script and required files
> Could not find Apache license headers in the following files:
>  !? <>spark/conf/spark-defaults.conf
> [error] running <>spark/dev/check-license ; received return code 1
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org