[jira] [Commented] (SPARK-4497) HiveThriftServer2 does not exit properly on failure

2015-12-15 Thread Yana Kadiyska (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058324#comment-15058324
 ] 

Yana Kadiyska commented on SPARK-4497:
--

[~jeffzhang] I have moved on to 1.2, so I cannot comment on this any longer. I'm
OK with closing it.

> HiveThriftServer2 does not exit properly on failure
> ---
>
> Key: SPARK-4497
> URL: https://issues.apache.org/jira/browse/SPARK-4497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yana Kadiyska
>Priority: Critical
>
> start thriftserver with 
> {{sbin/start-thriftserver.sh --master ...}}
> If there is an error (in my case namenode is in standby mode) the driver 
> shuts down properly:
> {code}
> 14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2
> 
> 14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
> 14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
> stopped!
> 14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
> 14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
> 14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
> 14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext
> {code}
> but trying to run {{sbin/start-thriftserver.sh --master ...}} again results 
> in an error that the Thrift server is already running.
> {{ps -aef|grep }} shows
> {code}
> root 32334 1  0 16:32 ?00:00:00 /usr/local/bin/java 
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master 
> spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc 
> -XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf 
> hive.root.logger=INFO,console
> {code}
> This is problematic since we have a process that tries to restart the driver 
> if it dies



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058459#comment-15058459
 ] 

Josh Rosen commented on SPARK-11255:


Just to be clear, my proposal was to bump from 3.1.1 to 3.1.2. If they use 
semantic versioning and haven't introduced regressions, then I'd hope that 
would be okay. I'm also fine with holding off on that until 2.0.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> The test build should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test build is 
> running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12010) Spark JDBC requires support for column-name-free INSERT syntax

2015-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058409#comment-15058409
 ] 

Apache Spark commented on SPARK-12010:
--

User 'CK50' has created a pull request for this issue:
https://github.com/apache/spark/pull/10312

> Spark JDBC requires support for column-name-free INSERT syntax
> --
>
> Key: SPARK-12010
> URL: https://issues.apache.org/jira/browse/SPARK-12010
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Christian Kurz
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Spark JDBC write only works with technologies which support the following 
> INSERT statement syntax (JdbcUtils.scala: insertStatement()):
> INSERT INTO $table VALUES ( ?, ?, ..., ? )
> Some technologies require a list of column names:
> INSERT INTO $table ( $colNameList ) VALUES ( ?, ?, ..., ? )
> Therefore technologies like Progress JDBC Driver for Cassandra do not work 
> with Spark JDBC write.
> Idea for fix:
> Move JdbcUtils.scala:insertStatement() into SqlDialect and add a SqlDialect 
> for Progress JDBC Driver for Cassandra
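To make the two syntaxes concrete, here is a minimal sketch (plain Scala, not
Spark's actual JdbcUtils code) of building the column-name-qualified form; the
table and column names are illustrative.
{code}
// Sketch only: builds the column-name-qualified INSERT that drivers such as the
// Progress JDBC Driver for Cassandra require. Not Spark's JdbcUtils code; the
// table and column names below are illustrative.
object InsertSqlSketch {
  def insertStatement(table: String, columns: Seq[String]): String = {
    val cols = columns.mkString(", ")
    val placeholders = columns.map(_ => "?").mkString(", ")
    s"INSERT INTO $table ($cols) VALUES ($placeholders)"
  }

  def main(args: Array[String]): Unit = {
    // Prints: INSERT INTO people (name, age) VALUES (?, ?)
    println(insertStatement("people", Seq("name", "age")))
  }
}
{code}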



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4497) HiveThriftServer2 does not exit properly on failure

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4497.
--
Resolution: Cannot Reproduce

> HiveThriftServer2 does not exit properly on failure
> ---
>
> Key: SPARK-4497
> URL: https://issues.apache.org/jira/browse/SPARK-4497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Yana Kadiyska
>Priority: Critical
>
> start thriftserver with 
> {{sbin/start-thriftserver.sh --master ...}}
> If there is an error (in my case namenode is in standby mode) the driver 
> shuts down properly:
> {code}
> 14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2
> 
> 14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
> 14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all 
> executors
> 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to 
> shut down
> 14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
> stopped!
> 14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
> 14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
> 14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
> 14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext
> {code}
> but trying to run {{sbin/start-thriftserver.sh --master ...}} again results 
> in an error that the Thrift server is already running.
> {{ps -aef|grep }} shows
> {code}
> root 32334 1  0 16:32 ?00:00:00 /usr/local/bin/java 
> org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master 
> spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc 
> -XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf 
> hive.root.logger=INFO,console
> {code}
> This is problematic since we have a process that tries to restart the driver 
> if it dies



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12263) IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit

2015-12-15 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058429#comment-15058429
 ] 

Neelesh Srinivas Salian commented on SPARK-12263:
-

I would like to work on this.

Please let me know if anyone has started working on it already.
Else I can go ahead and submit a PR.

Thank you.

> IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit
> -
>
> Key: SPARK-12263
> URL: https://issues.apache.org/jira/browse/SPARK-12263
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> When starting a worker with the following command (note 
> {{SPARK_WORKER_MEMORY=1024}}), it fails, saying that the memory was 0 even 
> though it was set to 1024 (without a size unit).
> {code}
> ➜  spark git:(master) ✗ SPARK_WORKER_MEMORY=1024 SPARK_WORKER_CORES=5 
> ./sbin/start-slave.sh spark://localhost:7077
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
>   INFO ShutdownHookManager: Shutdown hook called
>   INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> full log in 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}
> The full stack trace is as follows:
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> INFO Worker: Registered signal handlers for [TERM, HUP, INT]
> Exception in thread "main" java.lang.IllegalStateException: Memory can't be 
> 0, missing a M or G on the end of the memory specification?
> at 
> org.apache.spark.deploy.worker.WorkerArguments.checkWorkerMemory(WorkerArguments.scala:179)
> at 
> org.apache.spark.deploy.worker.WorkerArguments.(WorkerArguments.scala:64)
> at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:691)
> at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
> INFO ShutdownHookManager: Shutdown hook called
> INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> {code}
> The following command starts spark standalone worker successfully:
> {code}
> SPARK_WORKER_MEMORY=1g SPARK_WORKER_CORES=5 ./sbin/start-slave.sh 
> spark://localhost:7077
> {code}
> The master reports:
> {code}
> INFO Master: Registering worker 192.168.1.6:63884 with 5 cores, 1024.0 MB RAM
> {code}
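For context, here is a minimal sketch of why a unit-less value can end up as
0 MB, assuming (as the error message suggests) that a bare number is
interpreted as bytes; this is illustrative plain Scala, not Spark's
WorkerArguments parsing code.
{code}
// Illustrative only, not Spark's parsing code: if a bare number is read as
// bytes, integer division truncates 1024 bytes to 0 MB, which trips the
// "Memory can't be 0" check; a value with a unit survives.
object MemoryStringSketch {
  def toMegabytes(s: String): Int = {
    val lower = s.trim.toLowerCase
    if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
    else if (lower.endsWith("m")) lower.dropRight(1).toInt
    else lower.toInt / (1024 * 1024) // bare number treated as bytes
  }

  def main(args: Array[String]): Unit = {
    println(toMegabytes("1024")) // 0    -> rejected
    println(toMegabytes("1g"))   // 1024 -> accepted
  }
}
{code}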



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12061) Persist for Map/filter with Lambda Functions don't always read from Cache

2015-12-15 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058419#comment-15058419
 ] 

Xiao Li commented on SPARK-12061:
-

Start working on it. Thanks!

> Persist for Map/filter with Lambda Functions don't always read from Cache
> -
>
> Key: SPARK-12061
> URL: https://issues.apache.org/jira/browse/SPARK-12061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> So far, the existing caching mechanisms do not work on dataset operations 
> when using map/filter with lambda functions. For example, 
> {code}
>   test("persist and then map/filter with lambda functions") {
> val f = (i: Int) => i + 1
> val ds = Seq(1, 2, 3).toDS()
> val mapped = ds.map(f)
> mapped.cache()
> val mapped2 = ds.map(f)
> assertCached(mapped2)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12336:

Assignee: Cheng Lian  (was: Apache Spark)

> Outer join using multiple columns results in wrong nullability
> --
>
> Key: SPARK-12336
> URL: https://issues.apache.org/jira/browse/SPARK-12336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When joining two DataFrames using multiple columns, a temporary inner join is 
> used to compute the join output. Then a real join operator is created and 
> projected. However, the final projection list is based on the inner join 
> rather than the real join operator. When the real join operator is an outer 
> join, the nullability of the final projection can be wrong, since an outer 
> join may alter the nullability of its child plan(s).
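A spark-shell sketch (sqlContext and made-up column names assumed) of where the
wrong nullability would surface: after a full outer join on multiple columns,
every joined field should be reported as nullable.
{code}
// spark-shell sketch; sqlContext and the column names are assumed for
// illustration. After a full outer join, unmatched rows carry nulls, so each
// field's nullable flag should be true; per this issue the final projection may
// instead report the nullability of the temporary inner join.
import sqlContext.implicits._

val left  = Seq((1, "a", 10)).toDF("k1", "k2", "v1")
val right = Seq((2, "b", 20)).toDF("k1", "k2", "v2")
val joined = left.join(right, Seq("k1", "k2"), "outer")
joined.schema.fields.foreach(f => println(s"${f.name}: nullable=${f.nullable}"))
{code}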



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12317:


Assignee: Apache Spark

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Assignee: Apache Spark
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.
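A spark-shell sketch of the difference (sqlContext assumed); the unit-suffixed
form is what this issue proposes, not something Spark 1.5.2 accepts.
{code}
// spark-shell sketch; sqlContext is assumed.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760") // today: raw bytes
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10MB")     // proposed: value with unit
{code}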



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12317:


Assignee: (was: Apache Spark)

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11255) R Test build should run on R 3.1.2

2015-12-15 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-11255:

Summary: R Test build should run on R 3.1.2  (was: R Test build should run 
on R 3.1.1)

> R Test build should run on R 3.1.2
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> The test build should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test build is 
> running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9886) Validate usages of Runtime.getRuntime.addShutdownHook

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9886:
---

Assignee: Apache Spark

> Validate usages of Runtime.getRuntime.addShutdownHook
> -
>
> Key: SPARK-9886
> URL: https://issues.apache.org/jira/browse/SPARK-9886
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Michel Lemay
>Assignee: Apache Spark
>Priority: Minor
>
> While refactoring calls to Utils.addShutdownHook to spark ShutdownHookManager 
> in PR #8109, I've seen instances of calls to 
> Runtime.getRuntime.addShutdownHook:
> org\apache\spark\deploy\ExternalShuffleService.scala:126
> org\apache\spark\deploy\mesos\MesosClusterDispatcher.scala:113
> org\apache\spark\storage\ExternalBlockStore.scala:181
> Comment from @vanzin:
> "From a quick look, it seems that at least the one in ExternalBlockStore 
> should be changed; the other two seem to be separate processes (i.e. they are 
> not part of a Spark application) so that's questionable. But I'd say leave it 
> for a different change (maybe file a separate bug so it doesn't fall through 
> the cracks)."
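For reference, a plain-JVM sketch of the pattern being audited; Spark's own
ShutdownHookManager (an internal API, not shown here) additionally orders hooks
by priority, which is why raw hooks inside a Spark application are suspect.
{code}
// Plain JVM example of the pattern this ticket audits: a raw shutdown hook
// registered directly with the runtime, bypassing Spark's shutdown-hook
// ordering.
object RawShutdownHookExample {
  def main(args: Array[String]): Unit = {
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = println("cleaning up on JVM exit")
    }))
    println("exiting; the hook runs during JVM shutdown")
  }
}
{code}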



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12317:


Assignee: (was: Apache Spark)

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-15 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058522#comment-15058522
 ] 

Narine Kokhlikyan commented on SPARK-12325:
---

Thank you for your generous kindness, [~srowen]. I appreciate it!


> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I mentioned this issue earlier in one of my pull requests for the SQL 
> component, but I've never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call the DataFrame correlation method and it complains about covariance.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance 
> calculation for columns with dataType StringType not supported.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations for both correlation 
> and covariance. This might be convenient from a certain perspective; however, 
> something like this is harder to understand and extend, especially if you 
> want to add another algorithm, e.g. Spearman correlation, or something else.
> There are many possible solutions here, starting from:
> 1. just fixing the message
> 2. fixing the message and renaming CovarianceCounter and the corresponding 
> methods
> 3. creating a CorrelationCounter and splitting the computations for 
> correlation and covariance
> and many more.
> Since I'm not getting any response, and according to GitHub all five of you 
> have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you please explain this behavior of the stat functions, or 
> communicate more about it?
> In case you are planning to remove it or something else, we'd truly 
> appreciate it if you would communicate that.
> In fact, I would like to do a pull request on this, but since my pull 
> requests in SQL/ML components are just staying there without any response, 
> I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12344) Remove env-based configurations

2015-12-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-12344:
--

 Summary: Remove env-based configurations
 Key: SPARK-12344
 URL: https://issues.apache.org/jira/browse/SPARK-12344
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, YARN
Reporter: Marcelo Vanzin
Assignee: Reynold Xin


We should remove as many env-based configurations as makes sense, since they 
are deprecated and we prefer Spark's configuration.

Tools available through the command line should consistently support both a 
properties file with configuration keys and the {{--conf}} command-line 
argument, such as the one SparkSubmit supports.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-15 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp reopened SPARK-11255:
-

testing 3.1.2 on our staging instance right now.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> The test build should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test build is 
> running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12331) R^2 for regression through the origin

2015-12-15 Thread Imran Younus (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058439#comment-15058439
 ] 

Imran Younus commented on SPARK-12331:
--

Sure, I can take care of this.

> R^2 for regression through the origin
> -
>
> Key: SPARK-12331
> URL: https://issues.apache.org/jira/browse/SPARK-12331
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Imran Younus
>Priority: Minor
>
> The value of R^2 (coefficient of determination) obtained from 
> LinearRegressionModel is not consistent with R and statsmodels when 
> fitIntercept is false, i.e., regression through the origin. In this case, both 
> R and statsmodels use the definition of R^2 given by eq. (4') in the following 
> review paper:
> https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf
> Here is the definition from this paper:
> R^2 = \sum \hat{y}_i^2 / \sum y_i^2
> The paper also describes why this should be the case. I've double checked 
> that the value of R^2 from statsmodels and R are consistent with this 
> definition. On the other hand, scikit-learn doesn't use the above definition. 
> I would recommend using the above definition in Spark.
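A small self-contained sketch of that definition with made-up fitted values,
just to make the formula concrete:
{code}
// R^2 for regression through the origin, per the definition above:
//   R^2 = sum(yhat_i^2) / sum(y_i^2)
// The numbers below are made up purely for illustration.
object R2ThroughOrigin {
  def r2NoIntercept(y: Array[Double], yhat: Array[Double]): Double =
    yhat.map(v => v * v).sum / y.map(v => v * v).sum

  def main(args: Array[String]): Unit = {
    val y    = Array(1.0, 2.0, 3.0)   // observed values
    val yhat = Array(0.9, 2.1, 2.8)   // hypothetical fitted values
    println(r2NoIntercept(y, yhat))   // ~0.93
  }
}
{code}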



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12327) lint-r checks fail with commented code

2015-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058464#comment-15058464
 ] 

Shivaram Venkataraman commented on SPARK-12327:
---

Do you have a link to the PR? These are showing up now, as I think the lint-r 
package got upgraded during the R version change that happened yesterday.

> lint-r checks fail with commented code
> --
>
> Key: SPARK-12327
> URL: https://issues.apache.org/jira/browse/SPARK-12327
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> We get this after our R version downgrade
> {code}
> R/RDD.R:183:68: style: Commented code should be removed.
> rdd@env$jrdd_val <- callJMethod(rddRef, "asJavaRDD") # 
> rddRef$asJavaRDD()
>
> ^~
> R/RDD.R:228:63: style: Commented code should be removed.
> #' http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence.
>   ^~~~
> R/RDD.R:388:24: style: Commented code should be removed.
> #' collectAsMap(rdd) # list(`1` = 2, `3` = 4)
>^~
> R/RDD.R:603:61: style: Commented code should be removed.
> #' unlist(collect(filterRDD(rdd, function (x) { x < 3 }))) # c(1, 2)
> ^~~~
> R/RDD.R:762:20: style: Commented code should be removed.
> #' take(rdd, 2L) # list(1, 2)
>^~
> R/RDD.R:830:42: style: Commented code should be removed.
> #' sort(unlist(collect(distinct(rdd # c(1, 2, 3)
>  ^~~
> R/RDD.R:980:47: style: Commented code should be removed.
> #' collect(keyBy(rdd, function(x) { x*x })) # list(list(1, 1), list(4, 2), 
> list(9, 3))
>   
> ^~~~
> R/RDD.R:1194:27: style: Commented code should be removed.
> #' takeOrdered(rdd, 6L) # list(1, 2, 3, 4, 5, 6)
>   ^~
> R/RDD.R:1215:19: style: Commented code should be removed.
> #' top(rdd, 6L) # list(10, 9, 7, 6, 5, 4)
>   ^~~
> R/RDD.R:1270:50: style: Commented code should be removed.
> #' aggregateRDD(rdd, zeroValue, seqOp, combOp) # list(10, 4)
>  ^~~
> R/RDD.R:1374:6: style: Commented code should be removed.
> #' # list(list("a", 0), list("b", 3), list("c", 1), list("d", 4), list("e", 
> 2))
>  
> ^~
> R/RDD.R:1415:6: style: Commented code should be removed.
> #' # list(list("a", 0), list("b", 1), list("c", 2), list("d", 3), list("e", 
> 4))
>  
> ^~
> R/RDD.R:1461:6: style: Commented code should be removed.
> #' # list(list(1, 2), list(3, 4))
>  ^~~~
> R/RDD.R:1527:6: style: Commented code should be removed.
> #' # list(list(0, 1000), list(1, 1001), list(2, 1002), list(3, 1003), list(4, 
> 1004))
>  
> ^~~
> R/RDD.R:1564:6: style: Commented code should be removed.
> #' # list(list(1, 1), list(1, 2), list(2, 1), list(2, 2))
>  ^~~~
> R/RDD.R:1595:6: style: Commented code should be removed.
> #' # list(1, 1, 3)
>  ^
> R/RDD.R:1627:6: style: Commented code should be removed.
> #' # list(1, 2, 3)
>  ^
> R/RDD.R:1663:6: style: Commented code should be removed.
> #' # list(list(1, c(1,2), c(1,2,3)), list(2, c(3,4), c(4,5,6)))
>  ^~
> R/deserialize.R:22:3: style: Commented code should be removed.
> # void -> NULL
>   ^~~~
> R/deserialize.R:23:3: style: Commented code should be removed.
> # Int -> integer
>   ^~
> R/deserialize.R:24:3: style: Commented code should be removed.
> # String -> character
>   ^~~
> R/deserialize.R:25:3: style: Commented code should be removed.
> # Boolean -> logical
>   ^~
> R/deserialize.R:26:3: style: Commented code should be removed.
> # Float -> double
>   ^~~
> R/deserialize.R:27:3: style: Commented code should be removed.
> # Double -> double
>   ^~~~
> R/deserialize.R:28:3: style: Commented code should be removed.
> # Long -> double
>   ^~
> R/deserialize.R:29:3: style: Commented code should be removed.
> # Array[Byte] -> raw
>   ^~
> R/deserialize.R:30:3: style: Commented code should be removed.
> # Date -> Date
>   ^~~~
> 

[jira] [Created] (SPARK-12345) Mesos cluster mode is broken

2015-12-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12345:
-

 Summary: Mesos cluster mode is broken
 Key: SPARK-12345
 URL: https://issues.apache.org/jira/browse/SPARK-12345
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.6.0
Reporter: Andrew Or
Assignee: Iulian Dragos
Priority: Critical


The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
(Iulian: please edit this to provide more detail)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11808) Remove Bagel

2015-12-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11808.

Resolution: Duplicate

> Remove Bagel
> 
>
> Key: SPARK-11808
> URL: https://issues.apache.org/jira/browse/SPARK-11808
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.2

2015-12-15 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058613#comment-15058613
 ] 

shane knapp commented on SPARK-11255:
-

on staging, i ran the following shell commands:
{quote}
for x in $(rpm -qa | grep R); do yum remove $x; done
export PATH=/home/anaconda/bin/:$PATH
conda install -c https://conda.anaconda.org/asmeurer r=3.1.2
R CMD javareconf
Rscript -e 'install.packages("digest", repos="http://cran.stat.ucla.edu/;)'
Rscript -e 'install.packages("testthat", repos="http://cran.stat.ucla.edu/;)'
Rscript -e 'install.packages("knitr", repos="http://cran.stat.ucla.edu/;)'
Rscript -e 'install.packages("devtools", repos="http://cran.stat.ucla.edu/;)'
Rscript -e 'devtools::install_github("jimhester/lintr")'
Rscript -e 'install.packages("plyr", repos="http://cran.stat.ucla.edu/;)'
{quote}

then i ran our test build, which passed:
https://amplab.cs.berkeley.edu/jenkins/view/AMPLab%20Infra/job/SparkR-testing/6/

i'll update R the next time we have downtime scheduled (probably not until 
early next year).

> R Test build should run on R 3.1.2
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> The test build should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test build is 
> running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058629#comment-15058629
 ] 

Apache Spark commented on SPARK-12317:
--

User 'kevinyu98' has created a pull request for this issue:
https://github.com/apache/spark/pull/10314

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10250) Scala PairRDDFunctions.groupByKey() should be fault-tolerant of single large groups

2015-12-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10250:
--
Summary: Scala PairRDDFunctions.groupByKey() should be fault-tolerant of 
single large groups  (was: Scala PairRDDFuncitons.groupByKey() should be 
fault-tolerant of single large groups)

> Scala PairRDDFunctions.groupByKey() should be fault-tolerant of single large 
> groups
> ---
>
> Key: SPARK-10250
> URL: https://issues.apache.org/jira/browse/SPARK-10250
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Matt Cheah
>Priority: Minor
>
> PairRDDFunctions.groupByKey() is less robust than Python's equivalent, as 
> PySpark's groupByKey can spill single large groups to disk. We should bring 
> the Scala implementation up to parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12343) Remove YARN Client / ClientArguments

2015-12-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-12343:
--

 Summary: Remove YARN Client / ClientArguments
 Key: SPARK-12343
 URL: https://issues.apache.org/jira/browse/SPARK-12343
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Marcelo Vanzin
Assignee: Reynold Xin


We should remove the YARN {{Client}} object - or at least make it private - and 
its {{ClientArguments}} class. Those are only public for backwards 
compatibility reasons.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results

2015-12-15 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058514#comment-15058514
 ] 

Yin Huai commented on SPARK-11885:
--

Thank you for the update!

> UDAF may nondeterministically generate wrong results
> 
>
> Key: SPARK-11885
> URL: https://issues.apache.org/jira/browse/SPARK-11885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3
>
>
> I could not reproduce it in the 1.6 branch (it can be easily reproduced in 
> 1.5). I think it is an issue in the 1.5 branch.
> Try the following in Spark 1.5 (with a cluster) and you can see the problem.
> {code}
> import java.math.BigDecimal
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, DataType, 
> DoubleType, LongType}
> class GeometricMean extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(
> StructField("count", LongType) ::
>   StructField("product", DoubleType) :: Nil
>   )
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
> buffer(1) = 1.0
>   }
>   def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
> buffer(0) = buffer.getAs[Long](0) + 1
> buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
> buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
>   }
>   def evaluate(buffer: Row): Any = {
> math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0))
>   }
> }
> sqlContext.udf.register("gm", new GeometricMean)
> val df = Seq(
>   (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"),
>   (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"),
>   (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"),
>   (4, "italy", "emilia", 42, BigDecimal.valueOf(75 ,0), "jack"),
>   (5, "uk", "london", 42, BigDecimal.valueOf(200 ,0), "carl"),
>   (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")).
>   toDF("receipt_id", "store_country", "store_region", "store_id", "amount", 
> "seller_name")
> df.registerTempTable("receipts")
>   
> val q = sql("""
> select   store_country,
>  store_region,
>  avg(amount),
>  sum(amount),
>  gm(amount)
> from receipts
> whereamount > 50
>  and store_country = 'italy'
> group by store_country, store_region
> """)
> q.show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10347) Investigate the usage of normalizePath()

2015-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058520#comment-15058520
 ] 

Shivaram Venkataraman commented on SPARK-10347:
---

I think it's a good idea to fix this, and also a good idea to check whether the 
provided input is a URI or not. However, I think this code might be complex to 
maintain in R -- do you think this might be easier to write in Scala?

> Investigate the usage of normalizePath()
> 
>
> Key: SPARK-10347
> URL: https://issues.apache.org/jira/browse/SPARK-10347
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Sun Rui
>Priority: Minor
>
> Currently normalizePath() is used in several places to allow users to specify 
> paths via tilde expansion, or to normalize a relative path to an absolute 
> path. However, normalizePath() is also used for paths that are actually 
> expected to be URIs. normalizePath() may display warning messages when it 
> does not recognize a URI as a local file path, so suppressWarnings() is used 
> to suppress the possible warnings.
> Worse than warnings, calling normalizePath() on a URI may cause errors, 
> because it may turn a user-specified relative path into an absolute path 
> using the local current directory; this can be wrong because the path is 
> actually relative to the working directory of the default file system rather 
> than the local file system (depending on the Hadoop configuration of Spark).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058456#comment-15058456
 ] 

Shivaram Venkataraman commented on SPARK-11255:
---

Yeah we could bump up the R version as a part of Spark 2.0 (we should open a 
new JIRA to discuss this). R releases happen every 3 months or so (see 
https://cran.r-project.org/src/base/R-3/), so 3.1.1 is around 1.5 years old now.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> The test build should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test build is 
> running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10312) Enhance SerDe to handle atomic vector

2015-12-15 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058478#comment-15058478
 ] 

Shivaram Venkataraman commented on SPARK-10312:
---

Since the SerDe is an internal API, I think it's fine to not support some of 
the use cases. We can close this issue as `Will not fix` if you think that's OK.

> Enhance SerDe to handle atomic vector
> -
>
> Key: SPARK-10312
> URL: https://issues.apache.org/jira/browse/SPARK-10312
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.4.1
>Reporter: Sun Rui
>
> Currently, writeObject() does not handle atomic vectors well. It treats an 
> atomic vector like a scalar object. For example, if you pass c(1:10) into 
> writeObject, it will write only a single integer, 1. You have to explicitly 
> cast an atomic vector to a list, for example as.list(1:10), if you want to 
> write the whole vector.
> Could we enhance the SerDe so that when the object is an atomic vector whose 
> length is > 1, it is converted to a list and then written?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12317) Support configurate value with unit(e.g. kb/mb/gb) in SQL

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12317:


Assignee: Apache Spark

> Support configurate value with unit(e.g. kb/mb/gb) in SQL
> -
>
> Key: SPARK-12317
> URL: https://issues.apache.org/jira/browse/SPARK-12317
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yadong Qi
>Assignee: Apache Spark
>Priority: Minor
>
> e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` 
> instead of `10485760`, because `10MB` is easier to read than `10485760`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2015-12-15 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12334:
--

 Summary: Support read from multiple input paths for orc file in 
DataFrameReader.orc
 Key: SPARK-12334
 URL: https://issues.apache.org/jira/browse/SPARK-12334
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Jeff Zhang
Priority: Minor


DataFrameReader.json/text/parquet support multiple input paths, orc should be 
consistent with that 
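A spark-shell sketch of the kind of API the issue asks for (sqlContext assumed,
paths are placeholders); the single-path form exists today, while the
multi-path varargs form is the proposal, not an existing 1.x method.
{code}
// spark-shell sketch; sqlContext is assumed and the paths are placeholders.
val single = sqlContext.read.orc("/data/logs/2015-12-14")   // single path: exists today
val multi  = sqlContext.read.orc(                           // multiple paths: proposed
  "/data/logs/2015-12-14",
  "/data/logs/2015-12-15")
{code}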



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057580#comment-15057580
 ] 

Josh Rosen commented on SPARK-11255:


Hey, just curious: is 3.1.1 a hard requirement or would 3.1.2 also work? I'm 
experimenting with some different build automation stuff and it would be nice 
to be able to use 3.1.2, which is the earliest version that's supported by 
Anaconda ({{conda install r}}).

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> The test build should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test build is 
> running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8519) Blockify distance computation in k-means

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8519:
---

Assignee: Apache Spark

> Blockify distance computation in k-means
> 
>
> Key: SPARK-8519
> URL: https://issues.apache.org/jira/browse/SPARK-8519
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: advanced
>
> The performance of pairwise distance computation in k-means can benefit from 
> BLAS Level 3 matrix-matrix multiplications. It requires we update the 
> implementation to use blocks. Even for sparse data, we might be able to see 
> some performance gain.
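A minimal sketch of the idea (plain Scala, not MLlib code): squared distances
for a block of points against all centers decompose as
||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2, so the dominant cost is the cross term
X * C^T, which a single BLAS level-3 gemm could compute for the whole block;
plain loops stand in for the gemm call here.
{code}
// Sketch only, not MLlib code: pairwise squared Euclidean distances via the
// decomposition ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2. The dot-product loop
// is exactly what a blocked BLAS level-3 gemm (X * C^T) would replace.
object BlockifiedDistances {
  def pairwiseSqDist(points: Array[Array[Double]],
                     centers: Array[Array[Double]]): Array[Array[Double]] = {
    val pNorm = points.map(p => p.map(v => v * v).sum)
    val cNorm = centers.map(c => c.map(v => v * v).sum)
    points.indices.map { i =>
      centers.indices.map { j =>
        val dot = points(i).zip(centers(j)).map { case (a, b) => a * b }.sum
        pNorm(i) - 2.0 * dot + cNorm(j)
      }.toArray
    }.toArray
  }

  def main(args: Array[String]): Unit = {
    val pts = Array(Array(0.0, 0.0), Array(1.0, 2.0))
    val ctr = Array(Array(1.0, 0.0), Array(0.0, 2.0))
    pairwiseSqDist(pts, ctr).foreach(row => println(row.mkString(", ")))
    // 1.0, 4.0
    // 4.0, 1.0
  }
}
{code}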



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8519) Blockify distance computation in k-means

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8519:
---

Assignee: (was: Apache Spark)

> Blockify distance computation in k-means
> 
>
> Key: SPARK-8519
> URL: https://issues.apache.org/jira/browse/SPARK-8519
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
>
> The performance of pairwise distance computation in k-means can benefit from 
> BLAS Level 3 matrix-matrix multiplications. It requires we update the 
> implementation to use blocks. Even for sparse data, we might be able to see 
> some performance gain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12270) JDBC Where clause comparison doesn't work for DB2 char(n)

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12270:


Assignee: Apache Spark

> JDBC Where clause comparison doesn't work for DB2 char(n) 
> --
>
> Key: SPARK-12270
> URL: https://issues.apache.org/jira/browse/SPARK-12270
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> I am doing some Spark jdbc test against DB2. My test is like this: 
> {code}
>  conn.prepareStatement(
> "create table people (name char(32)").executeUpdate()
>  conn.prepareStatement("insert into people values 
> ('fred')").executeUpdate()
>  sql(
>s"""
>   |CREATE TEMPORARY TABLE foobar
>   |USING org.apache.spark.sql.jdbc
>   |OPTIONS (url '$url', dbtable 'PEOPLE', user 'testuser', password 
> 'testpassword')
>   """.stripMargin.replaceAll("\n", " "))
>  val df = sqlContext.sql("SELECT * FROM foobar WHERE NAME = 'fred'")
> {code}
> I am expecting to see one row with content 'fred' in df. However, no row is 
> returned. If I change the data type to varchar(32) in the create table DDL, 
> then I get the row back correctly. The cause of the problem is that DB2 
> defines char(num) as a fixed-length character string, so if I have char(32), 
> when doing "SELECT * FROM foobar WHERE NAME = 'fred'", DB2 returns 'fred' 
> padded with 28 spaces. Spark treats 'fred' padded with spaces as not the same 
> as 'fred', so df doesn't have any rows. If I have varchar(32), DB2 just 
> returns 'fred' for the select statement and df has the right row. In order to 
> make DB2 char(num) work with Spark, I suggest changing the Spark code to trim 
> the trailing spaces after getting the data from the database.
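A small sketch of the padding behavior and the suggested trim (plain Scala; the
32-character width matches the example above):
{code}
// Sketch of the CHAR(n) padding problem: DB2 returns the value right-padded to
// the declared width, so plain equality fails unless trailing spaces are
// trimmed after reading.
object CharPaddingSketch {
  def rtrim(s: String): String = s.replaceAll("\\s+$", "")

  def main(args: Array[String]): Unit = {
    val fromDb2 = "fred".padTo(32, ' ')  // what a CHAR(32) column comes back as
    println(fromDb2 == "fred")           // false -> the filter matches no rows
    println(rtrim(fromDb2) == "fred")    // true  -> trimming restores the match
  }
}
{code}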



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9042) Spark SQL incompatibility if security is enforced on the Hive warehouse

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9042:
-
Summary: Spark SQL incompatibility if security is enforced on the Hive 
warehouse  (was: Spark SQL incompatibility with Apache Sentry)

> Spark SQL incompatibility if security is enforced on the Hive warehouse
> ---
>
> Key: SPARK-9042
> URL: https://issues.apache.org/jira/browse/SPARK-9042
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Nitin Kak
>
> Hive queries executed from Spark using HiveContext use the CLI to create the 
> query plan and then access the Hive table directories (under 
> /user/hive/warehouse/) directly. This gives an AccessControlException if 
> Apache Sentry is installed:
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kakn, access=READ_EXECUTE, 
> inode="/user/hive/warehouse/mastering.db/sample_table":hive:hive:drwxrwx--t 
> With Apache Sentry, only the "hive" user (created only for Sentry) has 
> permission to access the Hive warehouse directory. After Sentry is installed, 
> all queries are directed to HiveServer2, which changes the invoking user to 
> "hive" and then accesses the Hive warehouse directory. However, HiveContext 
> does not execute the query through HiveServer2, which leads to the issue. 
> Here is an example of executing a Hive query through HiveContext:
> val hqlContext = new HiveContext(sc) // Create context to run Hive queries 
> val pairRDD = hqlContext.sql(hql) // where hql is the string with the Hive query 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8519) Blockify distance computation in k-means

2015-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057720#comment-15057720
 ] 

Apache Spark commented on SPARK-8519:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10306

> Blockify distance computation in k-means
> 
>
> Key: SPARK-8519
> URL: https://issues.apache.org/jira/browse/SPARK-8519
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
>
> The performance of pairwise distance computation in k-means can benefit from 
> BLAS Level 3 matrix-matrix multiplications. It requires we update the 
> implementation to use blocks. Even for sparse data, we might be able to see 
> some performance gain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-12-15 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057603#comment-15057603
 ] 

Sun Rui commented on SPARK-11255:
-

In http://spark.apache.org/docs/latest/, it is claimed that Spark runs on Java 
7+, Python 2.6+ and R 3.1+. So to stay consistent with the documentation, we 
need to guarantee that SparkR works on R 3.1+. Otherwise, SparkR may use some 
new features of the latest R versions (for example, 3.2+) that are not 
available in versions like 3.1.

But I won't object to bumping up the supported R version if needed :)

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Assignee: shane knapp
>Priority: Minor
>
> The test build should run on R 3.1.1, which is the version listed as supported.
> Apparently a few R changes can go undetected since the Jenkins test build is 
> running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results

2015-12-15 Thread Milad Bourhani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057641#comment-15057641
 ] 

Milad Bourhani commented on SPARK-11885:


Hi,

I can no longer replicate the bug on branch 1.5 :)

I've tried running my example several times, in both local and cluster mode, 
and the result is consistently the same. Specifically, I have:
- built Spark,
- then built my project using Spark dependencies with version 1.5.3-SNAPSHOT,
- run {{./make-distribution.sh}},
- used the result in the {{dist}} directory to start up the cluster.

Thank you!

> UDAF may nondeterministically generate wrong results
> 
>
> Key: SPARK-11885
> URL: https://issues.apache.org/jira/browse/SPARK-11885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.3
>
>
> I could not reproduce it in the 1.6 branch (it can be easily reproduced in 
> 1.5). I think it is an issue in the 1.5 branch.
> Try the following in Spark 1.5 (with a cluster) and you can see the problem.
> {code}
> import java.math.BigDecimal
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, DataType, 
> DoubleType, LongType}
> class GeometricMean extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(
> StructField("count", LongType) ::
>   StructField("product", DoubleType) :: Nil
>   )
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
> buffer(1) = 1.0
>   }
>   def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
> buffer(0) = buffer.getAs[Long](0) + 1
> buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
> buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
>   }
>   def evaluate(buffer: Row): Any = {
> math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0))
>   }
> }
> sqlContext.udf.register("gm", new GeometricMean)
> val df = Seq(
>   (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"),
>   (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"),
>   (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"),
>   (4, "italy", "emilia", 42, BigDecimal.valueOf(75 ,0), "jack"),
>   (5, "uk", "london", 42, BigDecimal.valueOf(200 ,0), "carl"),
>   (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")).
>   toDF("receipt_id", "store_country", "store_region", "store_id", "amount", 
> "seller_name")
> df.registerTempTable("receipts")
>   
> val q = sql("""
> select   store_country,
>  store_region,
>  avg(amount),
>  sum(amount),
>  gm(amount)
> from receipts
> whereamount > 50
>  and store_country = 'italy'
> group by store_country, store_region
> """)
> q.show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12334:


Assignee: Apache Spark

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that.
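
A rough PySpark sketch of the asymmetry being described (the multi-path {{orc()}} call 
in the last comment line is the proposed behavior, not the current API; a SQLContext 
named {{sqlContext}} and the file names are assumed for illustration):

{code}
# Parquet already accepts several input paths (varargs):
df_parquet = sqlContext.read.parquet("data/day1.parquet", "data/day2.parquet")

# ORC currently takes a single path (and requires Hive support in 1.x):
df_orc = sqlContext.read.orc("data/day1.orc")

# Proposed, for consistency with json/text/parquet:
# df_orc = sqlContext.read.orc("data/day1.orc", "data/day2.orc")
{code}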



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12334:


Assignee: (was: Apache Spark)

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Jeff Zhang
>Priority: Minor
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6521) Bypass network shuffle read if both endpoints are local

2015-12-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033016#comment-15033016
 ] 

Takeshi Yamamuro edited comment on SPARK-6521 at 12/15/15 9:23 AM:
---

The performance of current Spark heavily depends on CPU, so this shuffle 
optimization has little effect (benchmark results can be found in pull request 
#9478). For now, this ticket does not need to be considered.


was (Author: maropu):
Performance of the current Spark heavily depends on CPU, so this shuffle 
optimization has little effect on that ( benchmark results could be found in 
pullreq #9478). For a while, this ticket needs not to be considered and

> Bypass network shuffle read if both endpoints are local
> ---
>
> Key: SPARK-6521
> URL: https://issues.apache.org/jira/browse/SPARK-6521
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0
>Reporter: xukun
>
> In the past, executor read other executor's shuffle file in the same node by 
> net. This pr make that executors in the same node read local shuffle file In 
> sort-based Shuffle. It will reduce net transport.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2015-12-15 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12334:
---
Component/s: PySpark

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Jeff Zhang
>Priority: Minor
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12332) Typo in ResetSystemProperties.scala's comments

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12332.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10303
[https://github.com/apache/spark/pull/10303]

> Typo in ResetSystemProperties.scala's comments
> --
>
> Key: SPARK-12332
> URL: https://issues.apache.org/jira/browse/SPARK-12332
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> There is a minor typo (a missing close bracket) inside of 
> ResetSystemProperties.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12332) Typo in ResetSystemProperties.scala's comments

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12332:
--
  Assignee: holdenk
Issue Type: Improvement  (was: Bug)

> Typo in ResetSystemProperties.scala's comments
> --
>
> Key: SPARK-12332
> URL: https://issues.apache.org/jira/browse/SPARK-12332
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Reporter: holdenk
>Assignee: holdenk
>Priority: Trivial
> Fix For: 2.0.0
>
>
> There is a minor typo (a missing close bracket) inside of 
> ResetSystemProperties.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12339) NullPointerException on stage kill from web UI

2015-12-15 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-12339:
---

 Summary: NullPointerException on stage kill from web UI
 Key: SPARK-12339
 URL: https://issues.apache.org/jira/browse/SPARK-12339
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.6.0
Reporter: Jacek Laskowski


The following message is in the logs after killing a stage:

{code}
scala> INFO Executor: Executor killed task 1.0 in stage 7.0 (TID 33)
INFO Executor: Executor killed task 0.0 in stage 7.0 (TID 32)
WARN TaskSetManager: Lost task 1.0 in stage 7.0 (TID 33, localhost): TaskKilled 
(killed intentionally)
WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 32, localhost): TaskKilled 
(killed intentionally)
INFO TaskSchedulerImpl: Removed TaskSet 7.0, whose tasks have all completed, 
from pool
ERROR LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
ERROR LiveListenerBus: Listener SQLListener threw an exception
java.lang.NullPointerException
at 
org.apache.spark.sql.execution.ui.SQLListener.onTaskEnd(SQLListener.scala:167)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1169)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
{code}

To reproduce, start a job and kill the stage from web UI, e.g.:

{code}
val rdd = sc.parallelize(0 to 9, 2)
rdd.mapPartitionsWithIndex { case (n, it) => Thread.sleep(10 * 1000); it }.count
{code}

Go to the web UI and, in the Stages tab, click "kill" for the stage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12336:


Assignee: Apache Spark  (was: Cheng Lian)

> Outer join using multiple columns results in wrong nullability
> --
>
> Key: SPARK-12336
> URL: https://issues.apache.org/jira/browse/SPARK-12336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> When joining two DataFrames using multiple columns, a temporary inner join is 
> used to compute the join output. Then the real join operator is created and 
> projected. However, the final projection list is based on the inner join 
> rather than on the real join operator. When the real join operator is an outer 
> join, the nullability of the final projection can be wrong, since an outer join 
> may alter the nullability of its child plan(s).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12219) Spark 1.5.2 code does not build on Scala 2.11.7 with SBT assembly

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12219.
---
Resolution: Not A Problem

Master builds fine with SBT / Scala 2.11. Since building against 2.11 is the issue 
here, I'm resolving this.

> Spark 1.5.2 code does not build on Scala 2.11.7 with SBT assembly
> -
>
> Key: SPARK-12219
> URL: https://issues.apache.org/jira/browse/SPARK-12219
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.5.2
>Reporter: Rodrigo Boavida
>
> I've tried, without success, to build Spark with Scala 2.11.7. I'm getting build 
> errors using sbt due to the issues described in the thread below, from July of this 
> year.
> https://mail-archives.apache.org/mod_mbox/spark-dev/201507.mbox/%3CCA+3qhFSJGmZToGmBU1=ivy7kr6eb7k8t6dpz+ibkstihryw...@mail.gmail.com%3E
> It seems some minor fixes are needed to make the Scala 2.11 compiler happy.
> I needed to build with SBT, as suggested in the thread below, to get around an 
> apparent maven shade plugin issue which changed some classes when I switched 
> to akka 2.4.0.
> https://groups.google.com/forum/#!topic/akka-user/iai6whR6-xU
> I've set this bug to Major priority assuming that the Spark community wants 
> to keep fully supporting SBT builds, including Scala 2.11 compatibility.
> Thanks,
> Rod



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12331) R^2 for regression through the origin

2015-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057863#comment-15057863
 ] 

Sean Owen commented on SPARK-12331:
---

I'd support this change, if you're willing to do the legwork and create a PR 
and add a few tests for this behavior vs R's output.

> R^2 for regression through the origin
> -
>
> Key: SPARK-12331
> URL: https://issues.apache.org/jira/browse/SPARK-12331
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Imran Younus
>Priority: Minor
>
> The value of R^2 (the coefficient of determination) obtained from 
> LinearRegressionModel is not consistent with R and statsmodels when fitIntercept 
> is false, i.e., for regression through the origin. In this case, both 
> R and statsmodels use the definition of R^2 given by eq. (4') in the following 
> review paper:
> https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf
> Here is the definition from this paper:
> R^2 = \sum(\hat{y}_i^2) / \sum(y_i^2)
> The paper also explains why this should be the case. I've double checked 
> that the values of R^2 from statsmodels and R are consistent with this 
> definition. On the other hand, scikit-learn doesn't use the above definition. 
> I would recommend using the above definition in Spark.
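
A minimal NumPy sketch (not Spark code) of the two definitions being compared, using 
hypothetical observations and no-intercept predictions:

{code}
import numpy as np

# Hypothetical data: y are observed responses, y_hat are predictions from a
# regression-through-the-origin fit.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

# Definition used by R and statsmodels for regression through the origin
# (eq. (4') in the paper above): R^2 = sum(y_hat_i^2) / sum(y_i^2)
r2_origin = np.sum(y_hat ** 2) / np.sum(y ** 2)

# Conventional centered definition: 1 - SS_res / SS_tot
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r2_centered = 1.0 - ss_res / ss_tot

print(r2_origin, r2_centered)  # the two values generally differ
{code}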



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-15 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12336:
--

 Summary: Outer join using multiple columns results in wrong 
nullability
 Key: SPARK-12336
 URL: https://issues.apache.org/jira/browse/SPARK-12336
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.2, 1.4.1, 1.6.0, 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


When joining two DataFrames using multiple columns, a temporary inner join is 
used to compute the join output. Then the real join operator is created and 
projected. However, the final projection list is based on the inner join rather 
than on the real join operator. When the real join operator is an outer join, the 
nullability of the final projection can be wrong, since an outer join may alter 
the nullability of its child plan(s).
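
A minimal PySpark sketch for inspecting the nullability that the planner reports for a 
multi-column outer join (a SQLContext named sqlContext is assumed; the data and column 
names are made up):

{code}
# Two small DataFrames joined on two columns. With an outer join, values coming
# from either side can be missing, so the reported schema fields should be nullable.
left = sqlContext.createDataFrame([(1, 1, "a")], ["k1", "k2", "v_left"])
right = sqlContext.createDataFrame([(1, 2, "b")], ["k1", "k2", "v_right"])

joined = left.join(right, ["k1", "k2"], "outer")

# Print the nullability claimed by the final projection.
for field in joined.schema.fields:
    print(field.name, field.nullable)
{code}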



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12337) Implement dropDuplicates() method of DataFrame in SparkR

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12337:


Assignee: (was: Apache Spark)

> Implement dropDuplicates() method of DataFrame in SparkR
> 
>
> Key: SPARK-12337
> URL: https://issues.apache.org/jira/browse/SPARK-12337
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> distinct() and unique() drop duplicated rows on all columns, while 
> dropDuplicates() can drop duplicated rows on selected columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12337) Implement dropDuplicates() method of DataFrame in SparkR

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12337:


Assignee: Apache Spark

> Implement dropDuplicates() method of DataFrame in SparkR
> 
>
> Key: SPARK-12337
> URL: https://issues.apache.org/jira/browse/SPARK-12337
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>Assignee: Apache Spark
>
> distinct() and unique() drop duplicated rows on all columns, while 
> dropDuplicates() can drop duplicated rows on selected columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12263) IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12263:
--
Labels: starter  (was: )

> IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit
> -
>
> Key: SPARK-12263
> URL: https://issues.apache.org/jira/browse/SPARK-12263
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Trivial
>  Labels: starter
>
> When starting a worker with the following command (note 
> {{SPARK_WORKER_MEMORY=1024}}), it fails, saying that the memory was 0 while it 
> was actually 1024, just given without a size unit.
> {code}
> ➜  spark git:(master) ✗ SPARK_WORKER_MEMORY=1024 SPARK_WORKER_CORES=5 
> ./sbin/start-slave.sh spark://localhost:7077
> starting org.apache.spark.deploy.worker.Worker, logging to 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> failed to launch org.apache.spark.deploy.worker.Worker:
>   INFO ShutdownHookManager: Shutdown hook called
>   INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> full log in 
> /Users/jacek/dev/oss/spark/logs/spark-jacek-org.apache.spark.deploy.worker.Worker-1-japila.local.out
> {code}
> The full stack trace is as follows:
> {code}
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel).
> INFO Worker: Registered signal handlers for [TERM, HUP, INT]
> Exception in thread "main" java.lang.IllegalStateException: Memory can't be 
> 0, missing a M or G on the end of the memory specification?
> at 
> org.apache.spark.deploy.worker.WorkerArguments.checkWorkerMemory(WorkerArguments.scala:179)
> at 
> org.apache.spark.deploy.worker.WorkerArguments.(WorkerArguments.scala:64)
> at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:691)
> at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
> INFO ShutdownHookManager: Shutdown hook called
> INFO ShutdownHookManager: Deleting directory 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-f4e5f222-e938-46b2-a189-241453cf1f50
> {code}
> The following command starts spark standalone worker successfully:
> {code}
> SPARK_WORKER_MEMORY=1g SPARK_WORKER_CORES=5 ./sbin/start-slave.sh 
> spark://localhost:7077
> {code}
> The master reports:
> {code}
> INFO Master: Registering worker 192.168.1.6:63884 with 5 cores, 1024.0 MB RAM
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12335) CentralMomentAgg should be nullable

2015-12-15 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12335:
--

 Summary: CentralMomentAgg should be nullable
 Key: SPARK-12335
 URL: https://issues.apache.org/jira/browse/SPARK-12335
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0, 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


According to the {{getStatistics}} method overridden in all its subclasses, 
{{CentralMomentAgg}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8519) Blockify distance computation in k-means

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8519:
---

Assignee: Apache Spark

> Blockify distance computation in k-means
> 
>
> Key: SPARK-8519
> URL: https://issues.apache.org/jira/browse/SPARK-8519
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: advanced
>
> The performance of pairwise distance computation in k-means can benefit from 
> BLAS Level 3 matrix-matrix multiplications. This requires updating the 
> implementation to use blocks of points. Even for sparse data, we might be able 
> to see some performance gain.
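
To illustrate the idea (this is only a sketch, not the MLlib implementation), the 
pairwise squared distances between a block of points and the centers reduce to a single 
matrix-matrix product, which is exactly what BLAS Level 3 (GEMM) accelerates; a NumPy 
sketch with made-up data:

{code}
import numpy as np

def blocked_sq_distances(points_block, centers):
    """Squared Euclidean distances between an (n x d) block of points and
    (k x d) centers, via ||x - c||^2 = ||x||^2 - 2 * x.c + ||c||^2, so the
    dominant cost is one (n x d) * (d x k) matrix product (BLAS level 3)."""
    point_norms = np.sum(points_block ** 2, axis=1)[:, np.newaxis]  # (n, 1)
    center_norms = np.sum(centers ** 2, axis=1)[np.newaxis, :]      # (1, k)
    cross = points_block.dot(centers.T)                             # (n, k) GEMM
    return point_norms - 2.0 * cross + center_norms

# Hypothetical data: a block of 1000 points with 20 features, 5 centers.
rng = np.random.RandomState(0)
points = rng.rand(1000, 20)
centers = rng.rand(5, 20)

dists = blocked_sq_distances(points, centers)
assignments = dists.argmin(axis=1)  # nearest center for each point in the block
{code}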



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12335) CentralMomentAgg should be nullable

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12335:


Assignee: Apache Spark  (was: Cheng Lian)

> CentralMomentAgg should be nullable
> ---
>
> Key: SPARK-12335
> URL: https://issues.apache.org/jira/browse/SPARK-12335
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> According to the {{getStatistics}} method overridden in all its subclasses, 
> {{CentralMomentAgg}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10157) Add ability to specify s3 bootstrap script to spark-ec2

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10157.
---
Resolution: Won't Fix

> Add ability to specify s3 bootstrap script to spark-ec2
> ---
>
> Key: SPARK-10157
> URL: https://issues.apache.org/jira/browse/SPARK-10157
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>Reporter: Michelangelo D'Agostino
>Priority: Minor
>  Labels: ec2
>
> One of the nice features of using EMR to start spark clusters is that you can 
> specify bootstrap actions to run on each node. This is useful for installing 
> custom libraries, etc. However, you pay a per-node/per-hour premium for using 
> EMR (https://aws.amazon.com/elasticmapreduce/pricing/).
> This pull request adds the --bootstrap-script flag to the spark-ec2 script. 
> This flag specifies an s3 path to a shell script. The script is downloaded 
> and run on each node during the setup process. The flag can be specified 
> multiple times for multiple bootstrap actions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12336:


Assignee: Apache Spark  (was: Cheng Lian)

> Outer join using multiple columns results in wrong nullability
> --
>
> Key: SPARK-12336
> URL: https://issues.apache.org/jira/browse/SPARK-12336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> When joining two DataFrames using multiple columns, a temporary inner join is 
> used to compute the join output. Then the real join operator is created and 
> projected. However, the final projection list is based on the inner join 
> rather than on the real join operator. When the real join operator is an outer 
> join, the nullability of the final projection can be wrong, since an outer join 
> may alter the nullability of its child plan(s).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12336:


Assignee: Cheng Lian  (was: Apache Spark)

> Outer join using multiple columns results in wrong nullability
> --
>
> Key: SPARK-12336
> URL: https://issues.apache.org/jira/browse/SPARK-12336
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> When joining two DataFrames using multiple columns, a temporary inner join is 
> used to compute the join output. Then the real join operator is created and 
> projected. However, the final projection list is based on the inner join 
> rather than on the real join operator. When the real join operator is an outer 
> join, the nullability of the final projection can be wrong, since an outer join 
> may alter the nullability of its child plan(s).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12338) Support dropping duplicated rows on selected columns in DataFrame in R style

2015-12-15 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12338:
---

 Summary: Support dropping duplicated rows on selected columns in 
DataFrame in R style
 Key: SPARK-12338
 URL: https://issues.apache.org/jira/browse/SPARK-12338
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 1.5.2
Reporter: Sun Rui


In R, unique() can drop duplicated rows on all columns, and something like
{code}
df[!duplicated(df[,c('x1', 'x2')]),]
{code}
is used to drop duplicated rows on selected columns. It would be better if we 
could support duplicated(), and subsetting a DataFrame using the result of 
duplicated().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12325.
---
Resolution: Invalid

[~Narine] I'm going to push back on this, since it's inappropriate to open a 
"Critical Bug" on JIRA just to get attention for your question. This is, at 
best, suggesting a very minor change to an error message.

Please first read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  Then 
consider whether you clearly explained your problem and proposed change -- I 
think it's a lot simpler than this.

When you're ready to open a pull request with a change to the message, *then* 
make a "Minor Improvement" / "Documentation" JIRA which your PR references.

> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for the SQL 
> component, but I've never received feedback on any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call the DataFrame correlation method and it says that the covariance is wrong.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance 
> calculation for columns with dataType StringType not supported.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations for both correlation 
> and covariance. This might be convenient from a certain perspective; however, 
> something like this is harder to understand and extend, especially if you want 
> to add another algorithm, e.g. Spearman correlation, or something else.
> There are many possible solutions here, starting from:
> 1. just fixing the message
> 2. fixing the message and renaming CovarianceCounter and the corresponding methods
> 3. creating a CorrelationCounter and splitting the computations for correlation 
> and covariance
> and many more.
> Since I'm not getting any response and according to github all five of you 
> have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you please explain this behavior of the stat functions, or 
> communicate more about it?
> In case you are planning to remove it, or something else, we'd truly 
> appreciate it if you would let us know.
> In fact, I would like to open a pull request for this, but since my pull 
> requests in the SQL/ML components are just sitting there without any response, 
> I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12332) Typo in ResetSystemProperties.scala's comments

2015-12-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057826#comment-15057826
 ] 

Sean Owen commented on SPARK-12332:
---

(I don't think this is worth your time for a JIRA, just a PR)

> Typo in ResetSystemProperties.scala's comments
> --
>
> Key: SPARK-12332
> URL: https://issues.apache.org/jira/browse/SPARK-12332
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: holdenk
>Priority: Trivial
>
> There is a minor typo (a missing close bracket) inside of 
> ResetSystemProperties.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12332) Typo in ResetSystemProperties.scala's comments

2015-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15057835#comment-15057835
 ] 

Apache Spark commented on SPARK-12332:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10303

> Typo in ResetSystemProperties.scala's comments
> --
>
> Key: SPARK-12332
> URL: https://issues.apache.org/jira/browse/SPARK-12332
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Reporter: holdenk
>Priority: Trivial
>
> There is a minor typo (a missing close bracket) inside of 
> ResetSystemProperties.scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12334) Support read from multiple input paths for orc file in DataFrameReader.orc

2015-12-15 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12334:
---
Affects Version/s: 1.6.0
 Target Version/s: 1.6.1

> Support read from multiple input paths for orc file in DataFrameReader.orc
> --
>
> Key: SPARK-12334
> URL: https://issues.apache.org/jira/browse/SPARK-12334
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
>Reporter: Jeff Zhang
>Priority: Minor
>
> DataFrameReader.json/text/parquet support multiple input paths; orc should be 
> consistent with that.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12337) Implement dropDuplicates() method of DataFrame in SparkR

2015-12-15 Thread Sun Rui (JIRA)
Sun Rui created SPARK-12337:
---

 Summary: Implement dropDuplicates() method of DataFrame in SparkR
 Key: SPARK-12337
 URL: https://issues.apache.org/jira/browse/SPARK-12337
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Affects Versions: 1.5.2
Reporter: Sun Rui


distinct() and unique() drop duplicated rows on all columns, while 
dropDuplicates() can drop duplicated rows on selected columns.
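
For comparison, the semantics to mirror already exist in the Scala/Python DataFrame 
API; a PySpark sketch (a SQLContext named sqlContext and the toy data are assumed):

{code}
# distinct() deduplicates on all columns; dropDuplicates(subset) deduplicates
# on the selected columns only.
df = sqlContext.createDataFrame(
    [("alice", 5, 80), ("alice", 5, 80), ("alice", 5, 95)],
    ["name", "age", "height"])

df.distinct().show()                       # 2 rows: exact duplicates removed
df.dropDuplicates(["name", "age"]).show()  # 1 row: deduplicated on name, age
{code}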



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12335) CentralMomentAgg should be nullable

2015-12-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12335:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-12323

> CentralMomentAgg should be nullable
> ---
>
> Key: SPARK-12335
> URL: https://issues.apache.org/jira/browse/SPARK-12335
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> According to the {{getStatistics}} method overridden in all its subclasses, 
> {{CentralMomentAgg}} should be nullable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12336) Outer join using multiple columns results in wrong nullability

2015-12-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12336:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-12323

> Outer join using multiple columns results in wrong nullability
> --
>
> Key: SPARK-12336
> URL: https://issues.apache.org/jira/browse/SPARK-12336
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0, 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> When joining two DataFrames using multiple columns, a temporary inner join is 
> used to compute the join output. Then the real join operator is created and 
> projected. However, the final projection list is based on the inner join 
> rather than on the real join operator. When the real join operator is an outer 
> join, the nullability of the final projection can be wrong, since an outer join 
> may alter the nullability of its child plan(s).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11293) Spillable collections leak shuffle memory

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11293.
---
Resolution: Fixed

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-15 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058819#comment-15058819
 ] 

Charles Allen commented on SPARK-12330:
---

Looks like the Mesos coarse scheduler underwent a lot of changes in 1.6 vs 1.5. 
In the 1.6 branch I'm getting errors when terminating the coarse tasks, and the 
tasks are reported as failed.

https://github.com/metamx/spark/commit/338f511e3f6ef03f457555d838ed3a694a77dece

The same code against 1.5.1 gives a nice and clean shutdown of the executors on 
Mesos, reporting FINISHED instead of FAILED or KILLED.

> Mesos coarse executor does not cleanup blockmgr properly on termination if 
> data is stored on disk
> -
>
> Key: SPARK-12330
> URL: https://issues.apache.org/jira/browse/SPARK-12330
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 1.5.1
>Reporter: Charles Allen
>
> A simple line count example can be launched similarly to the following:
> {code}
> SPARK_HOME=/mnt/tmp/spark 
> MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
> ./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
> --conf spark.mesos.executor.memoryOverhead=2048 --conf 
> spark.mesos.executor.home=/mnt/tmp/spark --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
> -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
> -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
> -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
> spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
> spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
> spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
> spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
> {code}
> In the shell the following lines can be executed:
> {code}
> val text_file = 
> sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
> {code}
> {code}
> text_file.map(l => 1).sum
> {code}
> which will result in
> {code}
> res0: Double = 6001215.0
> {code}
> for the TPCH 1GB dataset
> Unfortunately the blockmgr directory remains on the executor node after 
> termination of the spark context.
> The log on the executor looks like this near the termination:
> {code}
> I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
> 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
> from slave(1)@172.19.67.30:5051
> I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
> executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
> I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
> __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
> I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
> I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
> __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
> I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
> (2)@172.19.67.30:58604
> I1215 02:12:31.191040 130702 process.cpp:2392] Resuming 
> (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
> I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
> I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
> I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
> (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
> 02:12:36.191062016+00:00)
> I1215 02:12:31.191066 130720 process.cpp:2392] Resuming 
> (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
> 15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
> 02:12:31.240091136+00:00
> I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
> 02:12:31.240036096+00:00
> I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
> reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
> I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
> reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
> 02:12:31.340212992+00:00)
> I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
> (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
> 02:12:34.247005952+00:00)
> 15/12/15 02:12:31 ERROR 

[jira] [Assigned] (SPARK-12345) Mesos cluster mode is broken

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12345:


Assignee: Apache Spark  (was: Iulian Dragos)

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8745) Remove GenerateProjection

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8745:
---

Assignee: Apache Spark  (was: Davies Liu)

> Remove GenerateProjection
> -
>
> Key: SPARK-8745
> URL: https://issues.apache.org/jira/browse/SPARK-8745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Based on discussion offline with [~marmbrus], we should remove 
> GenerateProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-15 Thread Charles Allen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058729#comment-15058729
 ] 

Charles Allen commented on SPARK-12330:
---

This is because the CoarseMesosSchedulerBackend does not wait for the 
environment to report tasks finished before shutting down the Mesos driver. I 
have a fix that seems to be working against 1.5.1; I'll see if I can 
cherry-pick it to master.

> Mesos coarse executor does not cleanup blockmgr properly on termination if 
> data is stored on disk
> -
>
> Key: SPARK-12330
> URL: https://issues.apache.org/jira/browse/SPARK-12330
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 1.5.1
>Reporter: Charles Allen
>
> A simple line count example can be launched similarly to the following:
> {code}
> SPARK_HOME=/mnt/tmp/spark 
> MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
> ./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
> --conf spark.mesos.executor.memoryOverhead=2048 --conf 
> spark.mesos.executor.home=/mnt/tmp/spark --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
> -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
> -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
> -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
> spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
> spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
> spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
> spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
> {code}
> In the shell the following lines can be executed:
> {code}
> val text_file = 
> sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
> {code}
> {code}
> text_file.map(l => 1).sum
> {code}
> which will result in
> {code}
> res0: Double = 6001215.0
> {code}
> for the TPCH 1GB dataset
> Unfortunately the blockmgr directory remains on the executor node after 
> termination of the spark context.
> The log on the executor looks like this near the termination:
> {code}
> I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
> 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
> from slave(1)@172.19.67.30:5051
> I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
> executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
> I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
> __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
> I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
> I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
> __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
> I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
> (2)@172.19.67.30:58604
> I1215 02:12:31.191040 130702 process.cpp:2392] Resuming 
> (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
> I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
> I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
> I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
> (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
> 02:12:36.191062016+00:00)
> I1215 02:12:31.191066 130720 process.cpp:2392] Resuming 
> (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
> 15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
> 02:12:31.240091136+00:00
> I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
> 02:12:31.240036096+00:00
> I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
> reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
> I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
> reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
> 02:12:31.340212992+00:00)
> I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
> (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
> 02:12:34.247005952+00:00)
> 15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
> SIGTERM
> 15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called
> no more java logs
> {code}
> If the 

[jira] [Updated] (SPARK-11293) Spillable collections leak shuffle memory

2015-12-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11293:
--
Target Version/s: 1.6.0  (was: 1.5.3, 1.6.0)

Since the back-port for 1.5 was cancelled, I think this is Fixed.

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.6.0
>
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2015-12-15 Thread Tristan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058987#comment-15058987
 ] 

Tristan commented on SPARK-10915:
-

Would the analogy to UDAF support in Python be lambdas, as mentioned in the 
description? Or would an abstract base class be more appropriate? From the Scala 
example, it seems the most direct implementation would follow the same 
interface.
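
Purely as a sketch of the abstract-base-class option (none of this API exists; class 
and method names here are hypothetical), mirroring the Scala 
UserDefinedAggregateFunction interface could look something like:

{code}
class UserDefinedAggregateFunction(object):
    """Hypothetical Python mirror of the Scala UDAF interface (sketch only)."""

    def initialize(self):
        """Return the initial aggregation buffer."""
        raise NotImplementedError

    def update(self, buffer, row):
        """Fold one input row into the buffer; return the new buffer."""
        raise NotImplementedError

    def merge(self, buffer1, buffer2):
        """Combine two partial buffers; return the merged buffer."""
        raise NotImplementedError

    def evaluate(self, buffer):
        """Produce the final result from the buffer."""
        raise NotImplementedError


class GeometricMean(UserDefinedAggregateFunction):
    """Example subclass: geometric mean over the first column of each row."""

    def initialize(self):
        return (0, 1.0)  # (count, product)

    def update(self, buffer, row):
        count, product = buffer
        return (count + 1, product * row[0])

    def merge(self, buffer1, buffer2):
        return (buffer1[0] + buffer2[0], buffer1[1] * buffer2[1])

    def evaluate(self, buffer):
        count, product = buffer
        return product ** (1.0 / count)
{code}

A lambda-based API could then be a thin convenience wrapper over the same four 
operations.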

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2015-12-15 Thread Dean Wampler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058856#comment-15058856
 ] 

Dean Wampler commented on SPARK-12177:
--

Since the new Kafka 0.9 consumer API supports SSL, would the implementation of 
#12177 enable the use of SSL, or would additional work be required in Spark's 
integration?

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that is 
> not compatible with the old one. So, I added the new consumer API. I made separate 
> classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I 
> didn't remove the old classes, for better backward compatibility. Users will not 
> need to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12345) Mesos cluster mode is broken

2015-12-15 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058873#comment-15058873
 ] 

Iulian Dragos commented on SPARK-12345:
---

[~skonto] pointed out this commit: 
https://github.com/apache/spark/commit/8aff36e91de0fee2f3f56c6d240bb203b5bb48ba 
it could be part of the problem.

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8745) Remove GenerateProjection

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8745:
---

Assignee: Davies Liu  (was: Apache Spark)

> Remove GenerateProjection
> -
>
> Key: SPARK-8745
> URL: https://issues.apache.org/jira/browse/SPARK-8745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>
> Based on discussion offline with [~marmbrus], we should remove 
> GenerateProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12345) Mesos cluster mode is broken

2015-12-15 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos updated SPARK-12345:
--
Description: 
The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.

The driver is confused about where SPARK_HOME is. It resolves 
`mesos.executor.uri` or `spark.mesos.executor.home` relative to the filesystem 
where the driver runs, which is wrong.

{code}
I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
130bdc39-44e7-4256-8c22-602040d337f1-S1
bin/spark-submit: line 27: 
/Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
 No such file or directory
{code}

  was:
The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
(Iulian: please edit this to provide more detail)


> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Iulian Dragos
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12346) GLM summary crashes with NoSuchElementException if attributes are missing names

2015-12-15 Thread Eric Liang (JIRA)
Eric Liang created SPARK-12346:
--

 Summary: GLM summary crashes with NoSuchElementException if 
attributes are missing names
 Key: SPARK-12346
 URL: https://issues.apache.org/jira/browse/SPARK-12346
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang


In getModelFeatures() of SparkRWrappers.scala, we call _.name.get on all the 
feature column attributes. This fails when the attribute name is not defined.

One way of reproducing this is to perform glm() in R with a vector-type input 
feature that lacks ML attrs and then call summary() on it, for example:

{code}
df <- sql(sqlContext, "SELECT * FROM testData")
df2 <- withColumnRenamed(df, "f1", "f2") # This drops the ML attrs from f1

lrModel <- glm(hours_per_week ~ f2, data = df2, family = "gaussian")
summary(lrModel) # NoSuchElementException
{code}
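
A defensive variant of the name lookup is sketched below. This is a hypothetical 
illustration, not the actual SparkRWrappers code; the featureNames helper and the 
feature_N fallback names are made up for the example.

{code}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame

// Hypothetical helper: read feature names from the ML attrs of a vector column,
// falling back to positional names when an attribute (or the whole group) has none.
def featureNames(df: DataFrame, featuresCol: String = "features"): Array[String] = {
  val group = AttributeGroup.fromStructField(df.schema(featuresCol))
  group.attributes match {
    case Some(attrs) =>
      attrs.zipWithIndex.map { case (attr, i) => attr.name.getOrElse(s"feature_$i") }
    case None =>
      // No ML attrs at all: synthesize names from the vector size if it is known.
      Array.tabulate(math.max(group.size, 0))(i => s"feature_$i")
  }
}
{code}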



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8745) Remove GenerateProjection

2015-12-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-8745:
-

Assignee: Davies Liu

> Remove GenerateProjection
> -
>
> Key: SPARK-8745
> URL: https://issues.apache.org/jira/browse/SPARK-8745
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Davies Liu
>
> Based on discussion offline with [~marmbrus], we should remove 
> GenerateProjection.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9690) Add random seed Param to PySpark CrossValidator

2015-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9690:
-
Target Version/s: 2.0.0  (was: )

> Add random seed Param to PySpark CrossValidator
> ---
>
> Key: SPARK-9690
> URL: https://issues.apache.org/jira/browse/SPARK-9690
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 1.4.1
>Reporter: Martin Menestret
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The fold assignment in the ML CrossValidator depends on a rand whose seed is 
> set to 0, which leads sql.functions.rand to call sc._jvm.functions.rand() with 
> no seed.
> In order to be able to unit test a cross validation, it would be useful to be 
> able to set this seed so that the output of the cross validation (with a 
> featureSubsetStrategy set to "all") is always the same.
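
The principle is sketched below in Scala (the DataFrame API has the same shape in 
PySpark). This only illustrates seeding the fold assignment, not the 
CrossValidator implementation; the withFold helper and the "fold" column name are 
made up for the example.

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.rand

// Illustrative only: assign each row to one of numFolds folds using a seeded
// rand column, so repeated runs produce the same split.
def withFold(df: DataFrame, numFolds: Int, seed: Long): DataFrame =
  df.withColumn("fold", (rand(seed) * numFolds).cast("int"))
{code}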



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10915) Add support for UDAFs in Python

2015-12-15 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059055#comment-15059055
 ] 

Justin Uang edited comment on SPARK-10915 at 12/15/15 11:07 PM:


Either an abstract base class or something like F.udaf(initialize_func, 
update_func, merge_func) would be fine. It depends on which is more 
convenient/idiomatic in Python.


was (Author: justin.uang):
An abstract base class would be fine, or something like F.udaf(initialize_func, 
update_func, merge_func) would both be fine. Depends on how which is more 
convenient.

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12309) Use sqlContext from MLlibTestSparkContext for spark.ml test suites

2015-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12309:
--
Shepherd: Joseph K. Bradley
Assignee: Yanbo Liang
Target Version/s: 2.0.0

> Use sqlContext from MLlibTestSparkContext for spark.ml test suites
> --
>
> Key: SPARK-12309
> URL: https://issues.apache.org/jira/browse/SPARK-12309
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Use the sqlContext from MLlibTestSparkContext rather than creating a new one 
> for spark.ml test cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12324:
--
Assignee: Timothy Hunter  (was: Apache Spark)

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>Priority: Minor
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse. Proposed fixes:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-8471) Implement Discrete Cosine Transform feature transformer

2015-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8471:
-
Comment: was deleted

(was: User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/7138)

> Implement Discrete Cosine Transform feature transformer
> ---
>
> Key: SPARK-8471
> URL: https://issues.apache.org/jira/browse/SPARK-8471
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Feynman Liang
>Priority: Minor
> Fix For: 1.5.0
>
>
> Discrete cosine transform (DCT) is an invertible matrix transformation 
> commonly used to analyze signals (e.g. audio, images, video) in the frequency 
> domain. In contrast to the FFT, the DCT maps real vectors to real vectors. 
> The DCT is oftentimes used to provide an alternative feature representation 
> (e.g. spectrogram representations of audio and video) useful for 
> classification and frequency-domain analysis.
> Ideally, an implementation of the DCT should allow both forward and inverse 
> transforms. It should also work for any numeric datatype and both 1D and 2D 
> data.
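
For reference, the forward transform typically meant here is the DCT-II; a sketch 
of the formula is below (normalization and scaling conventions vary by 
implementation).

{code}
X_k = \sum_{n=0}^{N-1} x_n \cos\!\left[ \frac{\pi}{N} \left(n + \tfrac{1}{2}\right) k \right],
\qquad k = 0, 1, \ldots, N-1
{code}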



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12348) PySpark _inferSchema crashes with incorrect exception on an empty RDD

2015-12-15 Thread Hurshal Patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hurshal Patel updated SPARK-12348:
--
Description: 
{code}
>>> rdd = sc.emptyRDD()
>>> df = sqlContext.createDataFrame(rdd)
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
_createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
_inferSchema
first = rdd.first()
  File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
{code}
throws "RDD is empty" in rdd.first() instead of the intended message "The first 
row in RDD is empty, can not infer schema" in sqlContext._inferSchema

  was:
{code:python}
>>> rdd = sc.emptyRDD()
>>> df = sqlContext.createDataFrame(rdd)
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
_createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
_inferSchema
first = rdd.first()
  File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
{code}
throws "RDD is empty" in rdd.first() instead of the intended message "The first 
row in RDD is empty, can not infer schema" in sqlContext._inferSchema


> PySpark _inferSchema crashes with incorrect exception on an empty RDD
> -
>
> Key: SPARK-12348
> URL: https://issues.apache.org/jira/browse/SPARK-12348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Hurshal Patel
>Priority: Minor
>
> {code}
> >>> rdd = sc.emptyRDD()
> >>> df = sqlContext.createDataFrame(rdd)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
> createDataFrame
> rdd, schema = self._createFromRDD(data, schema, samplingRatio)
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
> _createFromRDD
> struct = self._inferSchema(rdd, samplingRatio)
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
> _inferSchema
> first = rdd.first()
>   File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
> raise ValueError("RDD is empty")
> ValueError: RDD is empty
> {code}
> throws "RDD is empty" in rdd.first() instead of the intended message "The 
> first row in RDD is empty, can not infer schema" in sqlContext._inferSchema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12304) Make Spark Streaming web UI display more friendly Receiver graphs

2015-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059136#comment-15059136
 ] 

Apache Spark commented on SPARK-12304:
--

User 'proflin' has created a pull request for this issue:
https://github.com/apache/spark/pull/10318

> Make Spark Streaming web UI display more friendly Receiver graphs
> -
>
> Key: SPARK-12304
> URL: https://issues.apache.org/jira/browse/SPARK-12304
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.5.2
>Reporter: Liwei Lin
>Priority: Minor
> Attachments: after-5.png, before-5.png
>
>
> Currently, the Spark Streaming web UI uses the same maxY when displaying the 
> 'Input Rate Times & Histograms' and 'Per-Receiver Times & Histograms' graphs.
> This may lead to somewhat unfriendly graphs: once we have tens of Receivers or 
> more, every 'Per-Receiver Times' line almost hits the ground.
> This issue proposes to calculate a new maxY, instead of reusing the original 
> one, to be shared among all the 'Per-Receiver Times & Histograms' graphs.
> Before:
> !before-5.png!
> After:
> !after-5.png!
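
A minimal sketch of the proposed idea (illustrative data only, not the Streaming 
UI code): derive the per-receiver y-axis maximum from the per-receiver rates 
themselves rather than reusing the aggregate input-rate maximum.

{code}
// Hypothetical per-receiver event rates (records/sec), keyed by receiver id.
val perReceiverRates: Map[Int, Seq[Double]] = Map(
  0 -> Seq(120.0, 95.5, 130.0),
  1 -> Seq(3.0, 4.5, 2.0))

// y-axis maximum shared by all 'Per-Receiver' graphs, independent of the
// (much larger) aggregate 'Input Rate' maximum.
val perReceiverMaxY: Double =
  perReceiverRates.values.flatten.foldLeft(0.0)(math.max)
{code}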



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12271) Improve error message for Dataset.as[] when the schema is incompatible.

2015-12-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12271.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10260
[https://github.com/apache/spark/pull/10260]

> Improve error message for Dataset.as[] when the schema is incompatible.
> ---
>
> Key: SPARK-12271
> URL: https://issues.apache.org/jira/browse/SPARK-12271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Nong Li
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> It currently fails with an unexecutable exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12347) Write script to run all MLlib examples for testing

2015-12-15 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12347:
-

 Summary: Write script to run all MLlib examples for testing
 Key: SPARK-12347
 URL: https://issues.apache.org/jira/browse/SPARK-12347
 Project: Spark
  Issue Type: Test
  Components: ML, MLlib, PySpark, R, Tests
Reporter: Joseph K. Bradley


It would facilitate testing to have a script which runs all MLlib examples for 
all languages.

Design sketch to ensure all examples are run:
* Generate a list of examples to run programmatically (not from a fixed list).
* Use a list of special examples to handle examples which require command line 
arguments.
* Make sure data, etc. used are small to keep the tests quick.

This could be broken into subtasks for each language, though it would be nice 
to provide a single script.

Not sure where the script should live; perhaps in {{bin/}}?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12348) PySpark _inferSchema crashes with incorrect exception on an empty RDD

2015-12-15 Thread Hurshal Patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hurshal Patel updated SPARK-12348:
--
Description: 
{code}
>>> rdd = sc.emptyRDD()
>>> df = sqlContext.createDataFrame(rdd)
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
_createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
_inferSchema
first = rdd.first()
  File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
{code}
-throws "RDD is empty" in rdd.first() instead of the intended message "The 
first row in RDD is empty, can not infer schema" in sqlContext._inferSchema-
throws "RDD is empty" in rdd.first() but could throw "The RDD is empty, can not 
infer schema" in sqlContext._inferSchema

  was:
{code}
>>> rdd = sc.emptyRDD()
>>> df = sqlContext.createDataFrame(rdd)
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
_createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
_inferSchema
first = rdd.first()
  File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
{code}
throws "RDD is empty" in rdd.first() instead of the intended message "The first 
row in RDD is empty, can not infer schema" in sqlContext._inferSchema


> PySpark _inferSchema crashes with incorrect exception on an empty RDD
> -
>
> Key: SPARK-12348
> URL: https://issues.apache.org/jira/browse/SPARK-12348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Hurshal Patel
>Priority: Minor
>
> {code}
> >>> rdd = sc.emptyRDD()
> >>> df = sqlContext.createDataFrame(rdd)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
> createDataFrame
> rdd, schema = self._createFromRDD(data, schema, samplingRatio)
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
> _createFromRDD
> struct = self._inferSchema(rdd, samplingRatio)
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
> _inferSchema
> first = rdd.first()
>   File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
> raise ValueError("RDD is empty")
> ValueError: RDD is empty
> {code}
> -throws "RDD is empty" in rdd.first() instead of the intended message "The 
> first row in RDD is empty, can not infer schema" in sqlContext._inferSchema-
> throws "RDD is empty" in rdd.first() but could throw "The RDD is empty, can 
> not infer schema" in sqlContext._inferSchema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12330:


Assignee: Apache Spark

> Mesos coarse executor does not cleanup blockmgr properly on termination if 
> data is stored on disk
> -
>
> Key: SPARK-12330
> URL: https://issues.apache.org/jira/browse/SPARK-12330
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 1.5.1
>Reporter: Charles Allen
>Assignee: Apache Spark
>
> A simple line count example can be launched similarly to the following:
> {code}
> SPARK_HOME=/mnt/tmp/spark 
> MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
> ./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
> --conf spark.mesos.executor.memoryOverhead=2048 --conf 
> spark.mesos.executor.home=/mnt/tmp/spark --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
> -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
> -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
> -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
> spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
> spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
> spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
> spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
> {code}
> In the shell the following lines can be executed:
> {code}
> val text_file = 
> sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
> {code}
> {code}
> text_file.map(l => 1).sum
> {code}
> which will result in
> {code}
> res0: Double = 6001215.0
> {code}
> for the TPCH 1GB dataset
> Unfortunately the blockmgr directory remains on the executor node after 
> termination of the spark context.
> The log on the executor looks like this near the termination:
> {code}
> I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
> 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
> from slave(1)@172.19.67.30:5051
> I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
> executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
> I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
> __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
> I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
> I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
> __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
> I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
> (2)@172.19.67.30:58604
> I1215 02:12:31.191040 130702 process.cpp:2392] Resuming 
> (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
> I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
> I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
> I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
> (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
> 02:12:36.191062016+00:00)
> I1215 02:12:31.191066 130720 process.cpp:2392] Resuming 
> (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
> 15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
> 02:12:31.240091136+00:00
> I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
> 02:12:31.240036096+00:00
> I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
> reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
> I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
> reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
> 02:12:31.340212992+00:00)
> I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
> (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
> 02:12:34.247005952+00:00)
> 15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
> SIGTERM
> 15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called
> no more java logs
> {code}
> If the shuffle fraction is NOT set to 0.0, and the data is allowed to stay in 
> memory, then the following log can be seen at termination instead:
> {code}
> I1215 01:19:16.247705 120052 process.cpp:566] Parsed message name 
> 

[jira] [Updated] (SPARK-12281) Fixed potential exceptions when exiting a local cluster.

2015-12-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12281:
-
Fix Version/s: (was: 1.6.1)
   (was: 2.0.0)
   1.6.0

> Fixed potential exceptions when exiting a local cluster.
> 
>
> Key: SPARK-12281
> URL: https://issues.apache.org/jira/browse/SPARK-12281
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
> Fixed the following potential exceptions when exiting a local cluster.
> {code}
> java.lang.AssertionError: assertion failed: executor 4 state transfer from 
> RUNNING to RUNNING is illegal
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> {code}
> java.lang.IllegalStateException: Shutdown hooks cannot be modified during 
> shutdown.
>   at 
> org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
>   at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
>   at 
> org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
>   at 
> org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
>   at 
> org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12267) Standalone master keeps references to disassociated workers until they sent no heartbeats

2015-12-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12267:
-
Fix Version/s: (was: 1.6.1)
   (was: 2.0.0)
   1.6.0

> Standalone master keeps references to disassociated workers until they sent 
> no heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12330) Mesos coarse executor does not cleanup blockmgr properly on termination if data is stored on disk

2015-12-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059169#comment-15059169
 ] 

Apache Spark commented on SPARK-12330:
--

User 'drcrallen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10319

> Mesos coarse executor does not cleanup blockmgr properly on termination if 
> data is stored on disk
> -
>
> Key: SPARK-12330
> URL: https://issues.apache.org/jira/browse/SPARK-12330
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Mesos
>Affects Versions: 1.5.1
>Reporter: Charles Allen
>Assignee: Apache Spark
>
> A simple line count example can be launched similarly to the following:
> {code}
> SPARK_HOME=/mnt/tmp/spark 
> MASTER=mesos://zk://zk.metamx-prod.com:2181/mesos-druid/metrics 
> ./bin/spark-shell --conf spark.mesos.coarse=true --conf spark.cores.max=7 
> --conf spark.mesos.executor.memoryOverhead=2048 --conf 
> spark.mesos.executor.home=/mnt/tmp/spark --conf 
> spark.executor.extraJavaOptions='-Duser.timezone=UTC -Dfile.encoding=UTF-8 
> -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 
> -XX:+PrintGCApplicationStoppedTime -XX:+PrintTenuringDistribution 
> -XX:+PrintFlagsFinal -XX:+PrintAdaptiveSizePolicy -XX:+PrintReferenceGC 
> -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:MaxDirectMemorySize=1024m 
> -verbose:gc -XX:+PrintFlagsFinal -Djava.io.tmpdir=/mnt/tmp/scratch' --conf 
> spark.hadoop.fs.s3n.awsAccessKeyId='REDACTED' --conf 
> spark.hadoop.fs.s3n.awsSecretAccessKey='REDACTED' --conf 
> spark.executor.memory=7g --conf spark.executorEnv.GLOG_v=9 --conf 
> spark.storage.memoryFraction=0.0 --conf spark.shuffle.memoryFraction=0.0
> {code}
> In the shell the following lines can be executed:
> {code}
> val text_file = 
> sc.textFile("s3n://REDACTED/charlesallen/tpch/lineitem.tbl").persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
> {code}
> {code}
> text_file.map(l => 1).sum
> {code}
> which will result in
> {code}
> res0: Double = 6001215.0
> {code}
> for the TPCH 1GB dataset
> Unfortunately the blockmgr directory remains on the executor node after 
> termination of the spark context.
> The log on the executor looks like this near the termination:
> {code}
> I1215 02:12:31.190878 130732 process.cpp:566] Parsed message name 
> 'mesos.internal.ShutdownExecutorMessage' for executor(1)@172.19.67.30:58604 
> from slave(1)@172.19.67.30:5051
> I1215 02:12:31.190928 130732 process.cpp:2382] Spawned process 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.190932 130721 process.cpp:2392] Resuming 
> executor(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.190924800+00:00
> I1215 02:12:31.190958 130702 process.cpp:2392] Resuming 
> __http__(4)@172.19.67.30:58604 at 2015-12-15 02:12:31.190951936+00:00
> I1215 02:12:31.190976 130721 exec.cpp:381] Executor asked to shutdown
> I1215 02:12:31.190943 130727 process.cpp:2392] Resuming 
> __gc__@172.19.67.30:58604 at 2015-12-15 02:12:31.190937088+00:00
> I1215 02:12:31.190991 130702 process.cpp:2497] Cleaning up 
> __http__(4)@172.19.67.30:58604
> I1215 02:12:31.191032 130721 process.cpp:2382] Spawned process 
> (2)@172.19.67.30:58604
> I1215 02:12:31.191040 130702 process.cpp:2392] Resuming 
> (2)@172.19.67.30:58604 at 2015-12-15 02:12:31.191037952+00:00
> I1215 02:12:31.191054 130702 exec.cpp:80] Scheduling shutdown of the executor
> I1215 02:12:31.191069 130721 exec.cpp:396] Executor::shutdown took 21572ns
> I1215 02:12:31.191073 130702 clock.cpp:260] Created a timer for 
> (2)@172.19.67.30:58604 in 5secs in the future (2015-12-15 
> 02:12:36.191062016+00:00)
> I1215 02:12:31.191066 130720 process.cpp:2392] Resuming 
> (1)@172.19.67.30:58604 at 2015-12-15 02:12:31.191059200+00:00
> 15/12/15 02:12:31 INFO CoarseGrainedExecutorBackend: Driver commanded a 
> shutdown
> I1215 02:12:31.240103 130732 clock.cpp:151] Handling timers up to 2015-12-15 
> 02:12:31.240091136+00:00
> I1215 02:12:31.240123 130732 clock.cpp:158] Have timeout(s) at 2015-12-15 
> 02:12:31.240036096+00:00
> I1215 02:12:31.240183 130730 process.cpp:2392] Resuming 
> reaper(1)@172.19.67.30:58604 at 2015-12-15 02:12:31.240178176+00:00
> I1215 02:12:31.240226 130730 clock.cpp:260] Created a timer for 
> reaper(1)@172.19.67.30:58604 in 100ms in the future (2015-12-15 
> 02:12:31.340212992+00:00)
> I1215 02:12:31.247019 130720 clock.cpp:260] Created a timer for 
> (1)@172.19.67.30:58604 in 3secs in the future (2015-12-15 
> 02:12:34.247005952+00:00)
> 15/12/15 02:12:31 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: 
> SIGTERM
> 15/12/15 02:12:31 INFO ShutdownHookManager: Shutdown hook called
> no more java logs
> {code}
> If the shuffle fraction is NOT set to 0.0, and the data is allowed to stay in 
> memory, then the following log can be 

[jira] [Commented] (SPARK-10915) Add support for UDAFs in Python

2015-12-15 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059055#comment-15059055
 ] 

Justin Uang commented on SPARK-10915:
-

Either an abstract base class or something like F.udaf(initialize_func, 
update_func, merge_func) would be fine. It depends on which is more convenient.

> Add support for UDAFs in Python
> ---
>
> Key: SPARK-10915
> URL: https://issues.apache.org/jira/browse/SPARK-10915
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: Justin Uang
>
> This should support python defined lambdas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12348) PySpark _inferSchema crashes with incorrect exception on an empty RDD

2015-12-15 Thread Hurshal Patel (JIRA)
Hurshal Patel created SPARK-12348:
-

 Summary: PySpark _inferSchema crashes with incorrect exception on 
an empty RDD
 Key: SPARK-12348
 URL: https://issues.apache.org/jira/browse/SPARK-12348
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.0
Reporter: Hurshal Patel
Priority: Minor


{code:python}
>>> rdd = sc.emptyRDD()
>>> df = sqlContext.createDataFrame(rdd)
Traceback (most recent call last):
  File "", line 1, in 
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
createDataFrame
rdd, schema = self._createFromRDD(data, schema, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
_createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
  File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
_inferSchema
first = rdd.first()
  File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty
{code}
throws "RDD is empty" in rdd.first() instead of the intended message "The first 
row in RDD is empty, can not infer schema" in sqlContext._inferSchema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12345) Mesos cluster mode is broken

2015-12-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-12345:
--
Target Version/s: 1.6.1  (was: 1.6.0)

> Mesos cluster mode is broken
> 
>
> Key: SPARK-12345
> URL: https://issues.apache.org/jira/browse/SPARK-12345
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.6.0
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Critical
>
> The same setup worked in 1.5.2 but is now failing for 1.6.0-RC2.
> The driver is confused about where SPARK_HOME is. It resolves 
> `mesos.executor.uri` or `spark.mesos.executor.home` relative to the 
> filesystem where the driver runs, which is wrong.
> {code}
> I1215 15:00:39.411212 28032 exec.cpp:134] Version: 0.25.0
> I1215 15:00:39.413512 28037 exec.cpp:208] Executor registered on slave 
> 130bdc39-44e7-4256-8c22-602040d337f1-S1
> bin/spark-submit: line 27: 
> /Users/dragos/workspace/Spark/dev/rc-tests/spark-1.6.0-bin-hadoop2.6/bin/spark-class:
>  No such file or directory
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12348) PySpark _inferSchema crashes with incorrect exception on an empty RDD

2015-12-15 Thread Hurshal Patel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059119#comment-15059119
 ] 

Hurshal Patel commented on SPARK-12348:
---

Whoops, I think this was intentional, but there is still value in returning a 
more complete error like "The RDD is empty, can not infer schema" instead of 
raising the generic error.

> PySpark _inferSchema crashes with incorrect exception on an empty RDD
> -
>
> Key: SPARK-12348
> URL: https://issues.apache.org/jira/browse/SPARK-12348
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Hurshal Patel
>Priority: Minor
>
> {code}
> >>> rdd = sc.emptyRDD()
> >>> df = sqlContext.createDataFrame(rdd)
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 404, in 
> createDataFrame
> rdd, schema = self._createFromRDD(data, schema, samplingRatio)
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 285, in 
> _createFromRDD
> struct = self._inferSchema(rdd, samplingRatio)
>   File "/home/memsql/spark/python/pyspark/sql/context.py", line 229, in 
> _inferSchema
> first = rdd.first()
>   File "/home/memsql/spark/python/pyspark/rdd.py", line 1320, in first
> raise ValueError("RDD is empty")
> ValueError: RDD is empty
> {code}
> throws "RDD is empty" in rdd.first() instead of the intended message "The 
> first row in RDD is empty, can not infer schema" in sqlContext._inferSchema



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12349:
--
Priority: Major  (was: Critical)

> Make spark.ml PCAModel load backwards compatible
> 
>
> Key: SPARK-12349
> URL: https://issues.apache.org/jira/browse/SPARK-12349
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Joseph K. Bradley
>
> [SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
> PCAModel and modified the save/load method.  This method needs to be updated 
> in order to load models saved with Spark 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12349) Make spark.ml PCAModel load backwards compatible

2015-12-15 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12349:
-

 Summary: Make spark.ml PCAModel load backwards compatible
 Key: SPARK-12349
 URL: https://issues.apache.org/jira/browse/SPARK-12349
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.0.0
Reporter: Joseph K. Bradley
Priority: Critical


[SPARK-11530] introduced a new data member {{explainedVariance}} to spark.ml 
PCAModel and modified the save/load method.  This method needs to be updated in 
order to load models saved with Spark 1.6.
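
A sketch of one backwards-compatible approach is below. This is not necessarily 
the actual fix: the "pc" and "explainedVariance" column names are assumptions 
based on the model's save schema, and `path` and `sqlContext` are assumed to be 
in scope.

{code}
// Assumes `sqlContext: SQLContext` and `path: String` point at a saved PCAModel.
val data = sqlContext.read.parquet(s"$path/data")

val row =
  if (data.schema.fieldNames.contains("explainedVariance")) {
    // Newer format: both the principal components and the explained variance.
    data.select("pc", "explainedVariance").head()
  } else {
    // Spark 1.6 format: no explainedVariance column was written.
    data.select("pc").head()
  }
{code}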



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12350) VectorAssembler#transform() initially throws an exception

2015-12-15 Thread Jakob Odersky (JIRA)
Jakob Odersky created SPARK-12350:
-

 Summary: VectorAssembler#transform() initially throws an exception
 Key: SPARK-12350
 URL: https://issues.apache.org/jira/browse/SPARK-12350
 Project: Spark
  Issue Type: Bug
  Components: ML
 Environment: sparkShell command from sbt
Reporter: Jakob Odersky


Calling VectorAssembler.transform() initially throws an exception; subsequent 
calls work.

h3. Steps to reproduce
In spark-shell,
1. Create a dummy dataframe and define an assembler
{code}
import org.apache.spark.ml.feature.VectorAssembler
val df = sc.parallelize(List((1,2), (3,4))).toDF
val assembler = new VectorAssembler().setInputCols(Array("_1", 
"_2")).setOutputCol("features")
{code}

2. Run
{code}
assembler.transform(df).show
{code}
Initially the following exception is thrown:
{code}
15/12/15 16:20:19 ERROR TransportRequestHandler: Error opening stream 
/classes/org/apache/spark/sql/catalyst/expressions/Object.class for request 
from /9.72.139.102:60610
java.lang.IllegalArgumentException: requirement failed: File not found: 
/classes/org/apache/spark/sql/catalyst/expressions/Object.class
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.rpc.netty.NettyStreamManager.openStream(NettyStreamManager.scala:60)
at 
org.apache.spark.network.server.TransportRequestHandler.processStreamRequest(TransportRequestHandler.java:136)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:106)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
{code}
Subsequent calls work:
{code}
+---+---+-+
| _1| _2| features|
+---+---+-+
|  1|  2|[1.0,2.0]|
|  3|  4|[3.0,4.0]|
+---+---+-+
{code}

It seems as though there is some internal state that is not initialized.

[~iyounus] originally found this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12236) JDBC filter tests all pass if filters are not really pushed down

2015-12-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-12236.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10221
[https://github.com/apache/spark/pull/10221]

> JDBC filter tests all pass if filters are not really pushed down
> 
>
> Key: SPARK-12236
> URL: https://issues.apache.org/jira/browse/SPARK-12236
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Hyukjin Kwon
> Fix For: 2.0.0
>
>
> It is similar to https://issues.apache.org/jira/browse/SPARK-11676 and 
> https://issues.apache.org/jira/browse/SPARK-11677.
> Currently the JDBC predicate tests all pass even if filters are not pushed 
> down, because Spark-side filtering masks the missing pushdown.
> Moreover, {{Not(Equal)}} is also being tested, which is actually not pushed 
> down to the JDBC datasource.
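
One way to check pushdown by hand is sketched below; the connection URL and table 
name are placeholders, and the exact wording of the explain() output (e.g. a 
PushedFilters list on the scan node) varies across Spark versions.

{code}
// Build a JDBC-backed DataFrame (placeholder URL/table) and inspect the plan of
// a filtered query; a pushed-down predicate should show up on the scan node.
val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:h2:mem:testdb")   // placeholder connection URL
  .option("dbtable", "people")           // placeholder table name
  .load()

jdbcDF.filter(jdbcDF("name") === "fred").explain(true)
{code}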



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


