[jira] [Commented] (SPARK-14914) Test Cases fail on Windows

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604304#comment-15604304
 ] 

Apache Spark commented on SPARK-14914:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/15618

> Test Cases fail on Windows
> --
>
> Key: SPARK-14914
> URL: https://issues.apache.org/jira/browse/SPARK-14914
> Project: Spark
>  Issue Type: Test
>Reporter: Tao LI
>Assignee: Tao LI
> Fix For: 2.1.0
>
>
> There are many test failures on Windows, mainly for the following reasons:
> 1. Resources (e.g., temp files) are not properly closed after use, so an 
> IOException is raised while deleting the temp directory. 
> 2. The command line is too long on Windows. 
> 3. File path problems caused by the drive letter and backslashes in absolute 
> Windows paths. 
> 4. Plain Java bugs on Windows:
> a. setReadable doesn't work on Windows for directories. 
> b. setWritable doesn't work on Windows. 
> c. A memory-mapped file can't be read at the same time. 
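A small illustration of item 4a above (a hedged sketch, not code from the linked pull request): on Windows, File.setReadable has no effect on directories, so permission-based assertions have to be skipped there, and temp resources must be closed before deletion.

{code:java}
import java.io.File
import java.nio.file.Files

val isWindows = System.getProperty("os.name").toLowerCase.contains("windows")
val dir: File = Files.createTempDirectory("perm-test").toFile
try {
  if (!isWindows) {
    // On POSIX systems this succeeds; on Windows setReadable is a no-op for directories.
    assert(dir.setReadable(false, false))
    // ... assertions that rely on the directory being unreadable ...
    assert(dir.setReadable(true, false))
  }
} finally {
  // Any streams opened under `dir` must be closed before this delete,
  // otherwise Windows keeps the handles open and deletion fails with an IOException.
  dir.delete()
}
{code}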






[jira] [Commented] (SPARK-17984) Add support for numa aware feature

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604300#comment-15604300
 ] 

Apache Spark commented on SPARK-17984:
--

User 'sheepduke' has created a pull request for this issue:
https://github.com/apache/spark/pull/15579

> Add support for numa aware feature
> --
>
> Key: SPARK-17984
> URL: https://issues.apache.org/jira/browse/SPARK-17984
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos, YARN
>Affects Versions: 2.0.1
> Environment: Cluster Topo: 1 Master + 4 Slaves
> CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores)
> Memory: 128GB(2 NUMA Nodes)
> SW Version: Hadoop-5.7.0 + Spark-2.0.0
>Reporter: quanfuwang
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> This JIRA targets adding a NUMA-aware feature, which can help improve 
> performance by having cores access local memory rather than remote memory. 
> A patch is being developed; see https://github.com/apache/spark/pull/15524.
> The whole task includes 3 subtasks and will be developed iteratively:
> * NUMA-aware support for the YARN-based deployment mode
> * NUMA-aware support for the Mesos-based deployment mode
> * NUMA-aware support for the Standalone-based deployment mode
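A rough sketch of the general idea (an assumption about how such a feature could work, not the contents of the linked patch): prefix each executor launch command with numactl so the process is bound to a single NUMA node.

{code:java}
// Hypothetical helper: wrap an executor launch command with a numactl binding.
// numaNodeId would be assigned per executor on the host, e.g. round-robin.
def bindToNumaNode(command: Seq[String], numaNodeId: Int): Seq[String] =
  Seq("numactl", s"--cpunodebind=$numaNodeId", s"--membind=$numaNodeId") ++ command

val executorCmd = Seq("java", "-Xmx8g",
  "org.apache.spark.executor.CoarseGrainedExecutorBackend" /* plus its usual args */)
val numaAwareCmd = bindToNumaNode(executorCmd, numaNodeId = 0)
{code}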






[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18088:
--
Description: 
There are several cleanups I'd like to make as a follow-up to the PRs from 
[SPARK-17017]:
* Clarify FPR, alpha, p-value relationship in docs and param naming
** I'd like to remove "alpha" since it is such a generic name.
* Rename selectorType values to match corresponding Params
* Add Since tags where missing
* a few minor cleanups

  was:
There are several cleanups I'd like to make as a follow-up to the PRs from 
[SPARK-17017]:
* Clarify FPR, alpha, p-value relationship in docs and param naming
* Rename selectorType values to match corresponding Params
* Add Since tags where missing
* a few minor cleanups


> ChiSqSelector FPR PR cleanups
> -
>
> Key: SPARK-18088
> URL: https://issues.apache.org/jira/browse/SPARK-18088
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
>
> There are several cleanups I'd like to make as a follow-up to the PRs from 
> [SPARK-17017]:
> * Clarify FPR, alpha, p-value relationship in docs and param naming
> ** I'd like to remove "alpha" since it is such a generic name.
> * Rename selectorType values to match corresponding Params
> * Add Since tags where missing
> * a few minor cleanups
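For context on the FPR/alpha/p-value relationship mentioned above, a minimal illustration (not the ChiSqSelector API itself): under false-positive-rate selection, a feature is kept when the p-value of its chi-squared test against the label is below the threshold currently named "alpha", so that threshold is the expected false positive rate.

{code:java}
// Illustrative only: FPR-style selection over precomputed chi-squared p-values.
val pValues: Array[Double] = Array(0.001, 0.20, 0.04, 0.73)  // p-value per feature
val fpr = 0.05  // the threshold this JIRA proposes to stop calling "alpha"
val selected = pValues.zipWithIndex.collect { case (p, i) if p < fpr => i }
// selected == Array(0, 2)
{code}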






[jira] [Created] (SPARK-18088) ChiSqSelector FPR PR cleanups

2016-10-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18088:
-

 Summary: ChiSqSelector FPR PR cleanups
 Key: SPARK-18088
 URL: https://issues.apache.org/jira/browse/SPARK-18088
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor


There are several cleanups I'd like to make as a follow-up to the PRs from 
[SPARK-17017]:
* Clarify FPR, alpha, p-value relationship in docs and param naming
* Rename selectorType values to match corresponding Params
* Add Since tags where missing
* a few minor cleanups






[jira] [Updated] (SPARK-18019) Log instrumentation in GBTs

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18019:
--
Assignee: Seth Hendrickson

> Log instrumentation in GBTs
> ---
>
> Key: SPARK-18019
> URL: https://issues.apache.org/jira/browse/SPARK-18019
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Sub-task for adding instrumentation to GBTs.






[jira] [Updated] (SPARK-18019) Log instrumentation in GBTs

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18019:
--
Shepherd: Joseph K. Bradley  (was: Timothy Hunter)

> Log instrumentation in GBTs
> ---
>
> Key: SPARK-18019
> URL: https://issues.apache.org/jira/browse/SPARK-18019
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> Sub-task for adding instrumentation to GBTs.






[jira] [Updated] (SPARK-18078) Add option for customize zipPartition task preferred locations

2016-10-24 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18078:
---
Priority: Minor  (was: Major)

> Add option for customize zipPartition task preferred locations
> --
>
> Key: SPARK-18078
> URL: https://issues.apache.org/jira/browse/SPARK-18078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The `RDD.zipPartitions` task preferred-locations strategy uses the 
> intersection of the corresponding zipped partitions' locations; if the 
> intersection is empty, it uses the union of those locations.
> But in special cases I want to customize the task preferred locations for 
> better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
> operator: a distributed matrix (DMatrix) multiplying a distributed 
> vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
> partitioned in the same way beforehand).
> https://github.com/databricks/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala
> Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want 
> the zipPartitions task to always be located at the `DMatrix` partition's location; 
> that gives better data locality than the default preferred-location strategy.
> I think it makes sense to add an option for this.
>  
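A conceptual sketch of the default strategy described above and the kind of override being proposed (my simplification, not the actual ZippedPartitionsBaseRDD code):

{code:java}
// Current behavior (simplified): intersection of the zipped partitions' locations,
// falling back to the union when the intersection is empty.
def defaultPreferredLocations(locsA: Seq[String], locsB: Seq[String]): Seq[String] = {
  val common = locsA.intersect(locsB)
  if (common.nonEmpty) common else (locsA ++ locsB).distinct
}

// The proposed option would let the caller pick one side explicitly,
// e.g. always preferring the large DMatrix partitions' locations.
def preferLeftLocations(locsA: Seq[String], locsB: Seq[String]): Seq[String] = locsA
{code}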






[jira] [Updated] (SPARK-18078) Add option for customize zipPartition task preferred locations

2016-10-24 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18078:
---
Description: 
The `RDD.zipPartitions` task preferred-locations strategy uses the intersection 
of the corresponding zipped partitions' locations; if the intersection is empty, it 
uses the union of those locations.

But in special cases I want to customize the task preferred locations for 
better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
operator: a distributed matrix (DMatrix) multiplying a distributed 
vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
partitioned in the same way beforehand).
https://github.com/databricks/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala

Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want the 
zipPartitions task to always be located at the `DMatrix` partition's location; that gives 
better data locality than the default preferred-location strategy.

I think it makes sense to add an option for this.
 

  was:
The `RDD.zipPartitions` task preferred-locations strategy uses the intersection 
of the corresponding zipped partitions' locations; if the intersection is empty, it 
uses the union of those locations.

But in special cases I want to customize the task preferred locations for 
better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
operator: a distributed matrix (DMatrix) multiplying a distributed 
vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
partitioned in the same way beforehand).
https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala

Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want the 
zipPartitions task to always be located at the `DMatrix` partition's location; that gives 
better data locality than the default preferred-location strategy.

I think it makes sense to add an option for this.
 


> Add option for customize zipPartition task preferred locations
> --
>
> Key: SPARK-18078
> URL: https://issues.apache.org/jira/browse/SPARK-18078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The `RDD.zipPartitions` task preferred-locations strategy uses the 
> intersection of the corresponding zipped partitions' locations; if the 
> intersection is empty, it uses the union of those locations.
> But in special cases I want to customize the task preferred locations for 
> better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
> operator: a distributed matrix (DMatrix) multiplying a distributed 
> vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
> partitioned in the same way beforehand).
> https://github.com/databricks/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala
> Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want 
> the zipPartitions task to always be located at the `DMatrix` partition's location; 
> that gives better data locality than the default preferred-location strategy.
> I think it makes sense to add an option for this.
>  






[jira] [Updated] (SPARK-17183) put hive serde table schema to table properties like data source table

2016-10-24 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17183:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17861

> put hive serde table schema to table properties like data source table
> --
>
> Key: SPARK-17183
> URL: https://issues.apache.org/jira/browse/SPARK-17183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Created] (SPARK-18087) Optimize insert to not require REPAIR TABLE

2016-10-24 Thread Eric Liang (JIRA)
Eric Liang created SPARK-18087:
--

 Summary: Optimize insert to not require REPAIR TABLE
 Key: SPARK-18087
 URL: https://issues.apache.org/jira/browse/SPARK-18087
 Project: Spark
  Issue Type: Sub-task
Reporter: Eric Liang









[jira] [Updated] (SPARK-18026) should not always lowercase partition columns of partition spec in parser

2016-10-24 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-18026:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-17861

> should not always lowercase partition columns of partition spec in parser
> -
>
> Key: SPARK-18026
> URL: https://issues.apache.org/jira/browse/SPARK-18026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Updated] (SPARK-17970) Use metastore for managing filesource table partitions as well

2016-10-24 Thread Eric Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Liang updated SPARK-17970:
---
Summary: Use metastore for managing filesource table partitions as well  
(was: store partition spec in metastore for data source table)

> Use metastore for managing filesource table partitions as well
> --
>
> Key: SPARK-17970
> URL: https://issues.apache.org/jira/browse/SPARK-17970
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-17894) Ensure uniqueness of TaskSetManager name

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603860#comment-15603860
 ] 

Apache Spark commented on SPARK-17894:
--

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/15617

> Ensure uniqueness of TaskSetManager name
> 
>
> Key: SPARK-17894
> URL: https://issues.apache.org/jira/browse/SPARK-17894
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
> Fix For: 2.1.0
>
>
> TaskSetManager should have a unique name to avoid adding duplicate ones to the 
> *Pool* via *SchedulableBuilder*. This problem surfaced with 
> https://issues.apache.org/jira/browse/SPARK-17759; see the discussion at 
> https://github.com/apache/spark/pull/15326
> *Proposal*:
> There is a 1:1 relationship between a stage attempt and a TaskSetManager, so 
> taskSet.id, which covers both stageId and stageAttemptId, can be used for the 
> TaskSetManager name as well. 
> *Current TaskSetManager name*: 
> {code:java} var name = "TaskSet_" + taskSet.stageId.toString{code}
> *Sample*: TaskSet_0
> *Proposed TaskSetManager name*: 
> {code:java} var name = "TaskSet_" + taskSet.id  // taskSet.id is stageId + "." + 
> stageAttemptId {code}
> *Sample*: TaskSet_0.0
> cc [~kayousterhout] [~markhamstra]






[jira] [Resolved] (SPARK-18028) simplify TableFileCatalog

2016-10-24 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-18028.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15568
[https://github.com/apache/spark/pull/15568]

> simplify TableFileCatalog
> -
>
> Key: SPARK-18028
> URL: https://issues.apache.org/jira/browse/SPARK-18028
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>







[jira] [Created] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0

2016-10-24 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-18086:
-

 Summary: Regression: Hive variables no longer work in Spark 2.0
 Key: SPARK-18086
 URL: https://issues.apache.org/jira/browse/SPARK-18086
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Ryan Blue


The behavior of variables in the SQL shell has changed from 1.6 to 2.0. 
Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer 
work. Queries that worked correctly in 1.6 will either fail or produce 
unexpected results in 2.0, so I think this is a regression that should be 
addressed.

Hive and Spark 1.6 work like this:
1. Command-line args --hiveconf and --hivevar can be used to set session 
properties. --hiveconf properties are added to the Hadoop Configuration.
2. {{SET}} adds a Hive Configuration property, {{SET hivevar:name=value}} 
adds a Hive var.
3. Hive vars can be substituted into queries by name, and Configuration 
properties can be substituted using {{hiveconf:name}}.

In 2.0, hiveconf, sparkconf, and conf variable prefixes are all removed, then 
the value in SQLConf for the rest of the key is returned. SET adds properties 
to the session config and (according to [a 
comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28])
 the Hadoop configuration "during I/O".

{code:title=Hive and Spark 1.6.1 behavior}
[user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
spark-sql> select "${hiveconf:test.conf}";
1
spark-sql> select "${test.conf}";
${test.conf}
spark-sql> select "${hivevar:test.var}";
2
spark-sql> select "${test.var}";
2
spark-sql> set test.set=3;
SET test.set=3
spark-sql> select "${test.set}"
"${test.set}"
spark-sql> select "${hivevar:test.set}"
"${hivevar:test.set}"
spark-sql> select "${hiveconf:test.set}"
3
spark-sql> set hivevar:test.setvar=4;
SET hivevar:test.setvar=4
spark-sql> select "${hivevar:test.setvar}";
4
spark-sql> select "${test.setvar}";
4
{code}

{code:title=Spark 2.0.0 behavior}
[user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2
spark-sql> select "${hiveconf:test.conf}";
1
spark-sql> select "${test.conf}";
1
spark-sql> select "${hivevar:test.var}";
${hivevar:test.var}
spark-sql> select "${test.var}";
${test.var}
spark-sql> set test.set=3;
test.set3
spark-sql> select "${test.set}";
3
spark-sql> set hivevar:test.setvar=4;
hivevar:test.setvar  4
spark-sql> select "${hivevar:test.setvar}";
4
spark-sql> select "${test.setvar}";
${test.setvar}
{code}






[jira] [Resolved] (SPARK-17624) Flaky test? StateStoreSuite maintenance

2016-10-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-17624.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

> Flaky test? StateStoreSuite maintenance
> ---
>
> Key: SPARK-17624
> URL: https://issues.apache.org/jira/browse/SPARK-17624
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.1, 2.1.0
>Reporter: Adam Roberts
>Assignee: Tathagata Das
>Priority: Minor
> Fix For: 2.0.2, 2.1.0
>
>
> I've noticed this test failing consistently (25x in a row) on a two-core 
> machine but not on an eight-core machine.
> If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 
> being the default in Spark), the test reliably passes; we can also gain 
> reliability by setting the master to anything other than just local.
> Is there a reason spark.rpc.numRetries is set to 1?
> I see this failure is also mentioned here, so it's been flaky for a while: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html
> If we run without the "quietly" code so that we get debug info:
> {code}
> 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error 
> sending message [message = 
> VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)]
>  in 1 attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at java.lang.Thread.run(Thread.java:785)
> Caused by: org.apache.spark.SparkException: Could not find 
> StateStoreCoordinator.
> at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154)
> at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129)
> at 

[jira] [Commented] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-10-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603677#comment-15603677
 ] 

Joseph K. Bradley commented on SPARK-14300:
---

[~yinxusen] It looks like quite a few of these are not duplicates.  Can you 
please check again and update the JIRA description?  Thanks!

> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * Unsure code duplications (need double check)
> ** AbstractParams.scala
> ** BinaryClassification.scala
> ** Correlations.scala
> ** CosineSimilarity.scala
> ** DenseGaussianMixture.scala
> ** FPGrowthExample.scala
> ** MovieLensALS.scala
> ** MultivariateSummarizer.scala
> ** RandomRDDGeneration.scala
> ** SampledRDDs.scala
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Updated] (SPARK-14300) Scala MLlib examples code merge and clean up

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14300:
--
Shepherd: Joseph K. Bradley

> Scala MLlib examples code merge and clean up
> 
>
> Key: SPARK-14300
> URL: https://issues.apache.org/jira/browse/SPARK-14300
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples
>Reporter: Xusen Yin
>Priority: Minor
>  Labels: starter
>
> Duplicated code that I found in scala/examples/mllib:
> * scala/mllib
> ** DecisionTreeRunner.scala 
> ** DenseGaussianMixture.scala
> ** DenseKMeans.scala
> ** GradientBoostedTreesRunner.scala
> ** LDAExample.scala
> ** LinearRegression.scala
> ** SparseNaiveBayes.scala
> ** StreamingLinearRegression.scala
> ** StreamingLogisticRegression.scala
> ** TallSkinnyPCA.scala
> ** TallSkinnySVD.scala
> * Unsure code duplications (need double check)
> ** AbstractParams.scala
> ** BinaryClassification.scala
> ** Correlations.scala
> ** CosineSimilarity.scala
> ** DenseGaussianMixture.scala
> ** FPGrowthExample.scala
> ** MovieLensALS.scala
> ** MultivariateSummarizer.scala
> ** RandomRDDGeneration.scala
> ** SampledRDDs.scala
> When merging and cleaning that code, be sure not to disturb the existing 
> example on/off blocks.






[jira] [Updated] (SPARK-17950) Match SparseVector behavior with DenseVector

2016-10-24 Thread AbderRahman Sobh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

AbderRahman Sobh updated SPARK-17950:
-
Description: 
What changes were proposed in this pull request?

Simply added the __getattr__ that DenseVector has to SparseVector, but it delegates to 
a SciPy sparse representation instead of storing a full vector in 
self.array all the time.

This allows functions to be used on the values of an entire SparseVector in the 
same direct way that users interact with DenseVectors,
i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.

Note: the functions do vary slightly because they call SciPy rather 
than NumPy. However, the majority of useful functions (sums, means, max, etc.) 
are available in both packages anyway.

How was this patch tested?

Manual testing on local machine.
Passed ./python/run-tests
No UI changes.

  was:
Simply added the `__getattr__` that DenseVector has to SparseVector, but it calls 
self.toArray() instead of storing a vector in self.array all the time.

This allows NumPy functions to be used on the values of a SparseVector in the 
same direct way that users interact with DenseVectors,
i.e. you can simply call SparseVector.mean() to average the values in the 
entire vector.

Component/s: ML

> Match SparseVector behavior with DenseVector
> 
>
> Key: SPARK-17950
> URL: https://issues.apache.org/jira/browse/SPARK-17950
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Affects Versions: 2.0.1
>Reporter: AbderRahman Sobh
>Priority: Minor
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> What changes were proposed in this pull request?
> Simply added the __getattr__ that DenseVector has to SparseVector, but it delegates 
> to a SciPy sparse representation instead of storing a full vector in 
> self.array all the time.
> This allows functions to be used on the values of an entire SparseVector in 
> the same direct way that users interact with DenseVectors,
> i.e. you can simply call SparseVector.mean() to average the values in the 
> entire vector.
> Note: the functions do vary slightly because they call SciPy rather than 
> NumPy. However, the majority of useful functions (sums, means, max, etc.) 
> are available in both packages anyway.
> How was this patch tested?
> Manual testing on local machine.
> Passed ./python/run-tests
> No UI changes.






[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603596#comment-15603596
 ] 

Marcelo Vanzin commented on SPARK-18085:


Finally, I actually wrote some code for the first two milestones described in 
the document, to show what this could look like. The code is at:
https://github.com/vanzin/spark/tree/shs-ng/M2
https://github.com/vanzin/spark/tree/shs-ng/M1

(M2 actually has all the changes in M1, this is just to follow the 
implementation path described in the document.)

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly that describes the issues and suggests a 
> path to solving them.






[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603588#comment-15603588
 ] 

Marcelo Vanzin commented on SPARK-18085:


Pinging a few people who've worked / complained about the history server in the 
past: [~tgraves] [~andrewor14] [~steve_l] [~ajbozarth]

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly that describes the issues and suggests a 
> path to solving them.






[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server

2016-10-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603585#comment-15603585
 ] 

Marcelo Vanzin commented on SPARK-18085:


Also, I'm not sure if we're labeling things as "enhancement proposals" yet, but 
given the scope of this work, this would be a good candidate for that.

> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly that describes the issues and suggests a 
> path to solving them.






[jira] [Updated] (SPARK-18085) Scalability enhancements for the History Server

2016-10-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-18085:
---
Attachment: spark_hs_next_gen.pdf

Here's an initial document to bootstrap the discussion of how to fix the 
History Server. It discusses the current issues and proposes solutions to them, 
providing an implementation path that will hopefully allow for finer-grained 
reviews with as little disruption to functionality as possible while the work 
is under review.

I could also share a google doc link if people prefer (although I believe the 
ASF likes documents to be stored on their servers).

I think there are two points that might be a little controversial, so I'll list 
them up front:

- use of a JNI library for data storage: this is not a requirement, but there 
were no good pure Java alternatives I could find. The functionality is not 
dependent on the actual library, though, so we could switch the library if 
desired.

- move of Spark UI code to a separate module: I think this is good both as a 
way to isolate these changes, and for future maintainability; it makes it 
clearer that the UI is sort of a separate component, and that there should be a 
clearer separation between the code in core and the ui.


> Scalability enhancements for the History Server
> ---
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly that describes the issues and suggests a 
> path to solving them.






[jira] [Created] (SPARK-18085) Scalability enhancements for the History Server

2016-10-24 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-18085:
--

 Summary: Scalability enhancements for the History Server
 Key: SPARK-18085
 URL: https://issues.apache.org/jira/browse/SPARK-18085
 Project: Spark
  Issue Type: Umbrella
  Components: Spark Core, Web UI
Affects Versions: 2.0.0
Reporter: Marcelo Vanzin


It's a known fact that the History Server currently has some annoying issues 
when serving lots of applications, and when serving large applications.

I'm filing this umbrella to track work related to addressing those issues. I'll 
be attaching a document shortly that describes the issues and suggests a path 
to solving them.






[jira] [Resolved] (SPARK-11375) History Server "no histories" message to be dynamically generated by ApplicationHistoryProviders

2016-10-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11375.

Resolution: Duplicate

Looks like a dupe.

> History Server "no histories" message to be dynamically generated by 
> ApplicationHistoryProviders
> 
>
> Key: SPARK-11375
> URL: https://issues.apache.org/jira/browse/SPARK-11375
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Steve Loughran
>Priority: Minor
>
> When there are no histories, the {{HistoryPage}} displays error text which 
> assumes that the provider is the {{FsHistoryProvider}} and that its sole failure 
> mode is "directory not found":
> {code}
> Did you specify the correct logging directory?
> Please verify your setting of spark.history.fs.logDirectory
> {code}
> Different providers have different failure modes, and even the filesystem 
> provider has some, such as an access-control exception or the specified 
> directory path actually being a file.
> If the {{ApplicationHistoryProvider}} was itself asked to provide an error 
> message, then it could
> * be dynamically generated to show the current state of the history provider
> * potentially include any exceptions to list
> * display the actual values of settings such as the log directory property.






[jira] [Resolved] (SPARK-17894) Ensure uniqueness of TaskSetManager name

2016-10-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout resolved SPARK-17894.

   Resolution: Fixed
Fix Version/s: 2.1.0

Resolved by https://github.com/apache/spark/pull/15463

> Ensure uniqueness of TaskSetManager name
> 
>
> Key: SPARK-17894
> URL: https://issues.apache.org/jira/browse/SPARK-17894
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
> Fix For: 2.1.0
>
>
> TaskSetManager should have a unique name to avoid adding duplicate ones to the 
> *Pool* via *SchedulableBuilder*. This problem surfaced with 
> https://issues.apache.org/jira/browse/SPARK-17759; see the discussion at 
> https://github.com/apache/spark/pull/15326
> *Proposal*:
> There is a 1:1 relationship between a stage attempt and a TaskSetManager, so 
> taskSet.id, which covers both stageId and stageAttemptId, can be used for the 
> TaskSetManager name as well. 
> *Current TaskSetManager name*: 
> {code:java} var name = "TaskSet_" + taskSet.stageId.toString{code}
> *Sample*: TaskSet_0
> *Proposed TaskSetManager name*: 
> {code:java} var name = "TaskSet_" + taskSet.id  // taskSet.id is stageId + "." + 
> stageAttemptId {code}
> *Sample*: TaskSet_0.0
> cc [~kayousterhout] [~markhamstra]






[jira] [Updated] (SPARK-17894) Ensure uniqueness of TaskSetManager name

2016-10-24 Thread Kay Ousterhout (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-17894:
---
Assignee: Eren Avsarogullari

> Ensure uniqueness of TaskSetManager name
> 
>
> Key: SPARK-17894
> URL: https://issues.apache.org/jira/browse/SPARK-17894
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Eren Avsarogullari
>Assignee: Eren Avsarogullari
>
> TaskSetManager should have a unique name to avoid adding duplicate ones to the 
> *Pool* via *SchedulableBuilder*. This problem surfaced with 
> https://issues.apache.org/jira/browse/SPARK-17759; see the discussion at 
> https://github.com/apache/spark/pull/15326
> *Proposal*:
> There is a 1:1 relationship between a stage attempt and a TaskSetManager, so 
> taskSet.id, which covers both stageId and stageAttemptId, can be used for the 
> TaskSetManager name as well. 
> *Current TaskSetManager name*: 
> {code:java} var name = "TaskSet_" + taskSet.stageId.toString{code}
> *Sample*: TaskSet_0
> *Proposed TaskSetManager name*: 
> {code:java} var name = "TaskSet_" + taskSet.id  // taskSet.id is stageId + "." + 
> stageAttemptId {code}
> *Sample*: TaskSet_0.0
> cc [~kayousterhout] [~markhamstra]






[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-24 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-18084:
-
Issue Type: Bug  (was: Improvement)

> write.partitionBy() does not recognize nested columns that select() can access
> --
>
> Key: SPARK-18084
> URL: https://issues.apache.org/jira/browse/SPARK-18084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Here's a simple repro in the PySpark shell:
> {code}
> from pyspark.sql import Row
> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> df = spark.createDataFrame(rdd)
> df.printSchema()
> df.select('a.b').show()  # works
> df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
> {code}
> Here's what I see when I run this:
> {code}
> >>> from pyspark.sql import Row
> >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
> >>> df = spark.createDataFrame(rdd)
> >>> df.printSchema()
> root
>  |-- a: struct (nullable = true)
>  ||-- b: long (nullable = true)
> >>> df.show()
> +---+
> |  a|
> +---+
> |[5]|
> +---+
> >>> df.select('a.b').show()
> +---+
> |  b|
> +---+
> |  5|
> +---+
> >>> df.write.partitionBy('a.b').text('/tmp/test')
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
>  line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
> schema 
> StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
>   at 
> org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
>   at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:745)
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",
>  line 656, in text
> self._jwrite.text(path)
>   File 
> 

[jira] [Created] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access

2016-10-24 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-18084:


 Summary: write.partitionBy() does not recognize nested columns 
that select() can access
 Key: SPARK-18084
 URL: https://issues.apache.org/jira/browse/SPARK-18084
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.1, 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


Here's a simple repro in the PySpark shell:

{code}
from pyspark.sql import Row
rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
df = spark.createDataFrame(rdd)
df.printSchema()
df.select('a.b').show()  # works
df.write.partitionBy('a.b').text('/tmp/test')  # doesn't work
{code}

Here's what I see when I run this:

{code}
>>> from pyspark.sql import Row
>>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))])
>>> df = spark.createDataFrame(rdd)
>>> df.printSchema()
root
 |-- a: struct (nullable = true)
 ||-- b: long (nullable = true)

>>> df.show()
+---+
|  a|
+---+
|[5]|
+---+

>>> df.select('a.b').show()
+---+
|  b|
+---+
|  5|
+---+

>>> df.write.partitionBy('a.b').text('/tmp/test')
Traceback (most recent call last):
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
line 63, in deco
return f(*a, **kw)
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o233.text.
: org.apache.spark.sql.AnalysisException: Partition column a.b not found in 
schema StructType(StructField(a,StructType(StructField(b,LongType,true)),true));
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366)
at 
org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py",
 line 656, in text
self._jwrite.text(path)
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
  File 
"/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", 
line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'Partition column a.b not 

[jira] [Commented] (SPARK-16827) Stop reporting spill metrics as shuffle metrics

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603249#comment-15603249
 ] 

Apache Spark commented on SPARK-16827:
--

User 'dreamworks007' has created a pull request for this issue:
https://github.com/apache/spark/pull/15616

> Stop reporting spill metrics as shuffle metrics
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Brian Cho
>  Labels: performance
> Fix For: 2.0.2, 2.1.0
>
>
> One of our Hive jobs looks like this:
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrading to Spark 2.0 the job is significantly slower. Digging a little 
> into it, we found that one of the stages produces an excessive amount of 
> shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 
> of the job, which used to produce 32KB of shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole-stage code 
> generation, but that did not help. 
> PS - Even though the intermediate shuffle data size is huge, the job still 
> produces correct output.






[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads

2016-10-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603211#comment-15603211
 ] 

Nicholas Chammas commented on SPARK-12757:
--

Just to link back, [~josephkb] is reporting that [this GraphFrames 
issue|https://github.com/graphframes/graphframes/issues/116] may be related to 
the work done here.

> Use reference counting to prevent blocks from being evicted during reads
> 
>
> Key: SPARK-12757
> URL: https://issues.apache.org/jira/browse/SPARK-12757
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> As a pre-requisite to off-heap caching of blocks, we need a mechanism to 
> prevent pages / blocks from being evicted while they are being read. With 
> on-heap objects, evicting a block while it is being read merely leads to 
> memory-accounting problems (because we assume that an evicted block is a 
> candidate for garbage-collection, which will not be true during a read), but 
> with off-heap memory this will lead to either data corruption or segmentation 
> faults.
> To address this, we should add a reference-counting mechanism to track which 
> blocks/pages are being read in order to prevent them from being evicted 
> prematurely. I propose to do this in two phases: first, add a safe, 
> conservative approach in which all BlockManager.get*() calls implicitly 
> increment the reference count of blocks and where tasks' references are 
> automatically freed upon task completion. This will be correct but may have 
> adverse performance impacts because it will prevent legitimate block 
> evictions. In phase two, we should incrementally add release() calls in order 
> to fix the eviction of unreferenced blocks. The latter change may need to 
> touch many different components, which is why I propose to do it separately 
> in order to make the changes easier to reason about and review.
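
To make the pin-counting idea concrete, here is a minimal sketch in Scala. It only illustrates the reference-counting scheme described above; the class and method names (BlockPinCounts, retain, release, canEvict) are invented for this example and are not the actual BlockManager API.

{code}
import scala.collection.mutable

// Hypothetical sketch of the reference-counting idea: a block may only be
// evicted while its pin count is zero.
class BlockPinCounts {
  private val pins = mutable.Map.empty[String, Int].withDefaultValue(0)

  // Called implicitly by get*(): mark the block as in use.
  def retain(blockId: String): Unit = synchronized { pins(blockId) += 1 }

  // Called by an explicit release(), or automatically at task completion.
  def release(blockId: String): Unit = synchronized {
    pins(blockId) = math.max(0, pins(blockId) - 1)
  }

  // The eviction policy consults this before dropping a block.
  def canEvict(blockId: String): Boolean = synchronized { pins(blockId) == 0 }
}
{code}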



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work

2016-10-24 Thread Yuehua Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603181#comment-15603181
 ] 

Yuehua Zhang commented on SPARK-18017:
--

Yeah, that is what I did: "spark-submit --conf 
spark.hadoop.fs.s3n.block.size=524288000 ...". It did get rid of the non-Spark 
config warning, though. 
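
For reference, a minimal sketch of the two ways of passing the property that have been discussed in this thread, assuming a plain Spark 2.0 session; the bucket path is a placeholder and the block-size value is just the one from this report:

{code}
// 1) At submit time, spark.hadoop.* entries are copied into the Hadoop Configuration:
//    spark-submit --conf spark.hadoop.fs.s3n.block.size=524288000 ...

// 2) Programmatically, after the session is created:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3n-block-size-sketch").getOrCreate()
spark.sparkContext.hadoopConfiguration.setLong("fs.s3n.block.size", 524288000L)

// Whether either route still controls the partition count of a CSV read in 2.0.0
// is exactly what is in question in this ticket.
val df = spark.read.option("header", "true").csv("s3n://some-bucket/some-path/")
println(df.rdd.getNumPartitions)
{code}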

> Changing Hadoop parameter through 
> sparkSession.sparkContext.hadoopConfiguration doesn't work
> 
>
> Key: SPARK-18017
> URL: https://issues.apache.org/jira/browse/SPARK-18017
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Scala version 2.11.8; Java 1.8.0_91; 
> com.databricks:spark-csv_2.11:1.2.0
>Reporter: Yuehua Zhang
>
> My Spark job tries to read CSV files on S3. I need to control the number of 
> partitions created, so I set the Hadoop parameter fs.s3n.block.size. However, it 
> stopped working after we upgraded Spark from 1.6.1 to 2.0.0. Not sure if it is 
> related to https://issues.apache.org/jira/browse/SPARK-15991. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-10-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18081:
-

 Summary: Locality Sensitive Hashing (LSH) User Guide
 Key: SPARK-18081
 URL: https://issues.apache.org/jira/browse/SPARK-18081
 Project: Spark
  Issue Type: New Feature
  Components: Documentation, ML
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18081:
--
Issue Type: Documentation  (was: New Feature)

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18083) Locality Sensitive Hashing (LSH) - BitSampling

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18083:
--
Assignee: Yun Ni

> Locality Sensitive Hashing (LSH) - BitSampling
> --
>
> Key: SPARK-18083
> URL: https://issues.apache.org/jira/browse/SPARK-18083
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>Priority: Minor
>
> See linked JIRA for original LSH for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5992:
-
Assignee: Yun Ni

> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18082) Locality Sensitive Hashing (LSH) - SignRandomProjection

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18082:
--
Assignee: Yun Ni

> Locality Sensitive Hashing (LSH) - SignRandomProjection
> ---
>
> Key: SPARK-18082
> URL: https://issues.apache.org/jira/browse/SPARK-18082
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>Priority: Minor
>
> See linked JIRA for original LSH for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18081:
--
Assignee: Yun Ni

> Locality Sensitive Hashing (LSH) User Guide
> ---
>
> Key: SPARK-18081
> URL: https://issues.apache.org/jira/browse/SPARK-18081
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yun Ni
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2016-10-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18080:
-

 Summary: Locality Sensitive Hashing (LSH) Python API
 Key: SPARK-18080
 URL: https://issues.apache.org/jira/browse/SPARK-18080
 Project: Spark
  Issue Type: New Feature
Reporter: Joseph K. Bradley






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction

2016-10-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603030#comment-15603030
 ] 

Joseph K. Bradley commented on SPARK-7334:
--

[~sebalf] I'm sorry we weren't able to get your PR in.  I do appreciate your 
work on this!  Looking back, I believe the functionality in this JIRA should be 
a subset of what is in the PR for [SPARK-5992], so I'll go ahead and close this 
JIRA issue.  If you have time, feedback on the current PR for [SPARK-5992] 
would be very valuable.  Thanks very much.

> Implement RandomProjection for Dimensionality Reduction
> ---
>
> Key: SPARK-7334
> URL: https://issues.apache.org/jira/browse/SPARK-7334
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sebastian Alfers
>Priority: Minor
>
> Implement RandomProjection (RP) for dimensionality reduction
> RP is a popular approach to reduce the amount of data while preserving a 
> reasonable amount of information (pairwise distance) of your data [1][2]
> - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
> - [2] 
> http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
> I compared different implementations of that algorithm:
> - https://github.com/sebastian-alfers/random-projection-python
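
As a rough illustration of the technique under discussion (not a proposed MLlib API), a Gaussian random projection can be sketched on top of the existing mllib distributed linear algebra; randomProject and its parameters are invented names for this example:

{code}
import scala.util.Random
import org.apache.spark.mllib.linalg.{DenseMatrix, Vector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

// Project d-dimensional rows down to k dimensions with a random Gaussian matrix.
// Pairwise distances are approximately preserved (Johnson-Lindenstrauss).
def randomProject(rows: RDD[Vector], d: Int, k: Int, seed: Long = 42L): RowMatrix = {
  val rng = new Random(seed)
  // d x k projection matrix with N(0, 1/k) entries, stored column-major on the driver.
  val values = Array.fill(d * k)(rng.nextGaussian() / math.sqrt(k))
  val projection = new DenseMatrix(d, k, values)
  new RowMatrix(rows).multiply(projection)
}
{code}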



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7334:
-
Issue Type: New Feature  (was: Improvement)

> Implement RandomProjection for Dimensionality Reduction
> ---
>
> Key: SPARK-7334
> URL: https://issues.apache.org/jira/browse/SPARK-7334
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Sebastian Alfers
>Priority: Minor
>
> Implement RandomProjection (RP) for dimensionality reduction
> RP is a popular approach to reduce the amount of data while preserving a 
> reasonable amount of information (pairwise distance) of your data [1][2]
> - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
> - [2] 
> http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
> I compared different implementations of that algorithm:
> - https://github.com/sebastian-alfers/random-projection-python



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18080:
--
Component/s: PySpark
 ML

> Locality Sensitive Hashing (LSH) Python API
> ---
>
> Key: SPARK-18080
> URL: https://issues.apache.org/jira/browse/SPARK-18080
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18082) Locality Sensitive Hashing (LSH) - SignRandomProjection

2016-10-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18082:
-

 Summary: Locality Sensitive Hashing (LSH) - SignRandomProjection
 Key: SPARK-18082
 URL: https://issues.apache.org/jira/browse/SPARK-18082
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


See linked JIRA for original LSH for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18083) Locality Sensitive Hashing (LSH) - BitSampling

2016-10-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-18083:
-

 Summary: Locality Sensitive Hashing (LSH) - BitSampling
 Key: SPARK-18083
 URL: https://issues.apache.org/jira/browse/SPARK-18083
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor


See linked JIRA for original LSH for details



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction

2016-10-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-7334.

Resolution: Duplicate

> Implement RandomProjection for Dimensionality Reduction
> ---
>
> Key: SPARK-7334
> URL: https://issues.apache.org/jira/browse/SPARK-7334
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Sebastian Alfers
>Priority: Minor
>
> Implement RandomProjection (RP) for dimensionality reduction
> RP is a popular approach to reduce the amount of data while preserving a 
> reasonable amount of information (pairwise distance) of your data [1][2]
> - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf
> - [2] 
> http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf
> I compared different implementations of that algorithm:
> - https://github.com/sebastian-alfers/random-projection-python



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602974#comment-15602974
 ] 

Cheng Lian commented on SPARK-18053:


Yea, reproduced using 2.0.

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>  Labels: correctness
>
> The following Spark shell session reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18053) ARRAY equality is broken in Spark 2.0

2016-10-24 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602969#comment-15602969
 ] 

Cheng Lian commented on SPARK-18053:


Hm, the user mailing list thread said that it fails under 2.0 
https://lists.apache.org/thread.html/%3c1476953644701-27926.p...@n3.nabble.com%3E

I haven't verified it under 2.0 yet.

> ARRAY equality is broken in Spark 2.0
> -
>
> Key: SPARK-18053
> URL: https://issues.apache.org/jira/browse/SPARK-18053
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>  Labels: correctness
>
> The following Spark shell session reproduces this issue:
> {code}
> case class Test(a: Seq[Int])
> Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t")
> sql("SELECT a FROM t WHERE a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // +---+
> sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show()
> // +---+
> // |  a|
> // +---+
> // |[1]|
> // +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work

2016-10-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602917#comment-15602917
 ] 

Sean Owen commented on SPARK-18017:
---

You need to set it with --conf, not programmatically, I'd imagine.

> Changing Hadoop parameter through 
> sparkSession.sparkContext.hadoopConfiguration doesn't work
> 
>
> Key: SPARK-18017
> URL: https://issues.apache.org/jira/browse/SPARK-18017
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Scala version 2.11.8; Java 1.8.0_91; 
> com.databricks:spark-csv_2.11:1.2.0
>Reporter: Yuehua Zhang
>
> My Spark job tries to read CSV files on S3. I need to control the number of 
> partitions created, so I set the Hadoop parameter fs.s3n.block.size. However, it 
> stopped working after we upgraded Spark from 1.6.1 to 2.0.0. Not sure if it is 
> related to https://issues.apache.org/jira/browse/SPARK-15991. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18073) Migrate wiki to spark.apache.org web site

2016-10-24 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602747#comment-15602747
 ] 

Alex Bozarth commented on SPARK-18073:
--

I think we should keep the Internals pages but open some JIRAs to update them. 
I'd be willing to look at updating the Web UI Internals page; it helped me when I 
was first starting, but it's pretty out of date now.

> Migrate wiki to spark.apache.org web site
> -
>
> Key: SPARK-18073
> URL: https://issues.apache.org/jira/browse/SPARK-18073
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>
> Per 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
>  , let's consider migrating all wiki pages to documents at 
> github.com/apache/spark-website (i.e. spark.apache.org).
> Some reasons:
> * No pull request system or history for changes to the wiki
> * Separate, not-so-clear system for granting write access to wiki
> * Wiki doesn't change much
> * One less place to maintain or look for docs
> The idea would be to then update all wiki pages with a message pointing to 
> the new home of the information (or message saying it's obsolete).
> Here are the current wikis and my general proposal for what to do with the 
> content:
> * Additional Language Bindings -> roll this into wherever Third Party 
> Projects ends up
> * Committers -> Migrate to a new /committers.html page, linked under 
> Community menu (already exists)
> * Contributing to Spark -> Make this CONTRIBUTING.md? or a new 
> /contributing.html page under Community menu
> ** Jira Permissions Scheme -> obsolete
> ** Spark Code Style Guide -> roll this into new contributing.html page
> * Development Discussions -> obsolete?
> * Powered By Spark -> Make into new /powered-by.html linked by the existing 
> Community menu item
> * Preparing Spark Releases -> see below; roll into where "versioning policy" 
> goes?
> * Profiling Spark Applications -> roll into where Useful Developer Tools goes
> ** Profiling Spark Applications Using YourKit -> ditto
> * Spark Internals -> all of these look somewhat to very stale; remove?
> ** Java API Internals
> ** PySpark Internals
> ** Shuffle Internals
> ** Spark SQL Internals
> ** Web UI Internals
> * Spark QA Infrastructure -> tough one. Good info to document; does it belong 
> on the website? we can just migrate it
> * Spark Versioning Policy -> new page living under Community (?) that 
> documents release policy and process (better menu?)
> ** spark-ec2 AMI list and install file version mappings -> obsolete
> ** Spark-Shark version mapping -> obsolete
> * Third Party Projects -> new Community menu item
> * Useful Developer Tools -> new page under new Developer menu? Community?
> ** Jenkins -> obsolete, remove
> Of course, another outcome is to just remove outdated wikis, migrate some, 
> leave the rest.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18044) FileStreamSource should not infer partitions in every batch

2016-10-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-18044:
-
Fix Version/s: 2.0.2

> FileStreamSource should not infer partitions in every batch
> ---
>
> Key: SPARK-18044
> URL: https://issues.apache.org/jira/browse/SPARK-18044
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.2, 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17153) [Structured streams] readStream ignores partition columns

2016-10-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-17153:
-
Fix Version/s: 2.0.2

> [Structured streams] readStream ignores partition columns
> -
>
> Key: SPARK-17153
> URL: https://issues.apache.org/jira/browse/SPARK-17153
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Dmitri Carpov
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.2, 2.1.0
>
>
> When Parquet files are persisted with partitions, Spark's `readStream` 
> returns data with all `null`s for the partitioned columns.
> For example:
> {noformat}
> case class A(id: Int, value: Int)
> val data = spark.createDataset(Seq(
>   A(1, 1), 
>   A(2, 2), 
>   A(2, 3))
> )
> val url = "/mnt/databricks/test"
> data.write.partitionBy("id").parquet(url)
> {noformat}
> when data is read as stream:
> {noformat}
> spark.readStream.schema(spark.read.load(url).schema).parquet(url)
> {noformat}
> it reads:
> {noformat}
> id, value
> null, 1
> null, 2
> null, 3
> {noformat}
> A possible reason is that `readStream` reads the Parquet files directly, but when 
> those are stored, the columns they are partitioned by are excluded from the files 
> themselves. In the given example the Parquet files contain only the `value` 
> column, since `id` is a partition column.
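
That explanation matches how partitioned Parquet output is laid out on disk: the partition column lives in the directory names (id=1/, id=2/) rather than inside the data files, and the batch reader recovers it via partition discovery when pointed at the parent directory. A small sketch reusing `spark` and the `url` from the example above, for illustration only:

{code}
// Layout produced by data.write.partitionBy("id").parquet(url):
//   /mnt/databricks/test/id=1/part-....parquet   (rows contain only `value`)
//   /mnt/databricks/test/id=2/part-....parquet

// The batch reader recovers `id` from the directory names via partition discovery:
val batch = spark.read.parquet(url)
batch.printSchema()   // both `id` and `value` present, `id` populated

// The report above is that the streaming reader does not do the same
// when given an explicit schema:
val stream = spark.readStream.schema(batch.schema).parquet(url)
{code}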



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work

2016-10-24 Thread Yuehua Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602708#comment-15602708
 ] 

Yuehua Zhang commented on SPARK-18017:
--

Yeah, I tried that also. Not working either...

> Changing Hadoop parameter through 
> sparkSession.sparkContext.hadoopConfiguration doesn't work
> 
>
> Key: SPARK-18017
> URL: https://issues.apache.org/jira/browse/SPARK-18017
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Scala version 2.11.8; Java 1.8.0_91; 
> com.databricks:spark-csv_2.11:1.2.0
>Reporter: Yuehua Zhang
>
> My Spark job tries to read csv files on S3. I need to control the number of 
> partitions created so I set Hadoop parameter fs.s3n.block.size. However, it 
> stopped working after we upgrade Spark from 1.6.1 to 2.0.0. Not sure if it is 
> related to https://issues.apache.org/jira/browse/SPARK-15991. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18079:


Assignee: (was: Apache Spark)

> CollectLimitExec.executeToIterator() should perform per-partition limits
> 
>
> Key: SPARK-18079
> URL: https://issues.apache.org/jira/browse/SPARK-18079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Patrick Woody
>
> Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for 
> executeToIterator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602645#comment-15602645
 ] 

Apache Spark commented on SPARK-18079:
--

User 'pwoody' has created a pull request for this issue:
https://github.com/apache/spark/pull/15614

> CollectLimitExec.executeToIterator() should perform per-partition limits
> 
>
> Key: SPARK-18079
> URL: https://issues.apache.org/jira/browse/SPARK-18079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Patrick Woody
>
> Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for 
> executeToIterator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18079:


Assignee: Apache Spark

> CollectLimitExec.executeToIterator() should perform per-partition limits
> 
>
> Key: SPARK-18079
> URL: https://issues.apache.org/jira/browse/SPARK-18079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Patrick Woody
>Assignee: Apache Spark
>
> Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for 
> executeToIterator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits

2016-10-24 Thread Patrick Woody (JIRA)
Patrick Woody created SPARK-18079:
-

 Summary: CollectLimitExec.executeToIterator() should perform 
per-partition limits
 Key: SPARK-18079
 URL: https://issues.apache.org/jira/browse/SPARK-18079
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.1
Reporter: Patrick Woody


Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for 
executeToIterator.
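
For background, the per-partition limit idea can be sketched in plain RDD terms; this only illustrates the concept, not the CollectLimitExec/executeToIterator code path, and `limitedCollect` is an invented helper:

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Cap each partition at `limit` rows before the final take(), so no task has to
// produce more rows than could possibly be returned to the driver.
def limitedCollect[T: ClassTag](rdd: RDD[T], limit: Int): Array[T] =
  rdd.mapPartitions(_.take(limit)).take(limit)
{code}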



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602525#comment-15602525
 ] 

Apache Spark commented on SPARK-16988:
--

User 'hayashidac' has created a pull request for this issue:
https://github.com/apache/spark/pull/15611

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16988:


Assignee: Apache Spark

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Assignee: Apache Spark
>Priority: Minor
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18078) Add option for customize zipPartition task preferred locations

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602537#comment-15602537
 ] 

Apache Spark commented on SPARK-18078:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/15612

> Add option for customize zipPartition task preferred locations
> --
>
> Key: SPARK-18078
> URL: https://issues.apache.org/jira/browse/SPARK-18078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The `RDD.zipPartitions` task preferred-locations strategy uses the 
> intersection of the corresponding zipped partitions' locations; if the 
> intersection is empty, it uses the union of these locations.
> But in a special case, I want to customize the task preferred locations for 
> better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
> operator: a distributed matrix (DMatrix) multiplying a distributed 
> vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
> partitioned in the same way beforehand).
> https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala
> Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we 
> want the zipPartitions task always to be placed at the `DMatrix` partition's 
> location; this gives better data locality than the default preferred-location strategy.
> I think it makes sense to add an option for this.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18078) Add option for customize zipPartition task preferred locations

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18078:


Assignee: Apache Spark

> Add option for customize zipPartition task preferred locations
> --
>
> Key: SPARK-18078
> URL: https://issues.apache.org/jira/browse/SPARK-18078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>Assignee: Apache Spark
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The `RDD.zipPartitions` task preferred-locations strategy uses the 
> intersection of the corresponding zipped partitions' locations; if the 
> intersection is empty, it uses the union of these locations.
> But in a special case, I want to customize the task preferred locations for 
> better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
> operator: a distributed matrix (DMatrix) multiplying a distributed 
> vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
> partitioned in the same way beforehand).
> https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala
> Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we 
> want the zipPartitions task always to be placed at the `DMatrix` partition's 
> location; this gives better data locality than the default preferred-location strategy.
> I think it makes sense to add an option for this.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18078) Add option for customize zipPartition task preferred locations

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18078:


Assignee: (was: Apache Spark)

> Add option for customize zipPartition task preferred locations
> --
>
> Key: SPARK-18078
> URL: https://issues.apache.org/jira/browse/SPARK-18078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The `RDD.zipPartitions` task preferred-locations strategy uses the 
> intersection of the corresponding zipped partitions' locations; if the 
> intersection is empty, it uses the union of these locations.
> But in a special case, I want to customize the task preferred locations for 
> better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
> operator: a distributed matrix (DMatrix) multiplying a distributed 
> vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
> partitioned in the same way beforehand).
> https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala
> Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we 
> want the zipPartitions task always to be placed at the `DMatrix` partition's 
> location; this gives better data locality than the default preferred-location strategy.
> I think it makes sense to add an option for this.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16988:


Assignee: (was: Apache Spark)

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18078) Add option for customize zipPartition task preferred locations

2016-10-24 Thread Weichen Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-18078:
---
Description: 
The `RDD.zipPartitions` task preferred-locations strategy uses the intersection 
of the corresponding zipped partitions' locations; if the intersection is empty, 
it uses the union of these locations.

But in a special case, I want to customize the task preferred locations for 
better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
operator: a distributed matrix (DMatrix) multiplying a distributed 
vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
partitioned in the same way beforehand).
https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala

Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want 
the zipPartitions task always to be placed at the `DMatrix` partition's location; 
this gives better data locality than the default preferred-location strategy.

I think it makes sense to add an option for this.
 

  was:
The `RDD.zipPartitions` task preferred-locations strategy uses the intersection 
of the corresponding zipped partitions' locations; if the intersection is empty, 
it uses the union of these locations.

But in a special case, I want to customize the task preferred locations for 
better performance. A typical case is in spark-tfocs: a distributed 
matrix (DMatrix) multiplying a vector (DVector) uses RDD.zipPartitions.
https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/DVectorFunctions.scala

Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want 
the zipPartitions task always to be placed at the `DMatrix` partition's location; 
this gives better data locality than the default preferred-location strategy.

I think it makes sense to add an option for this.
 


> Add option for customize zipPartition task preferred locations
> --
>
> Key: SPARK-18078
> URL: https://issues.apache.org/jira/browse/SPARK-18078
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Weichen Xu
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The `RDD.zipPartitions` task preferred-locations strategy uses the 
> intersection of the corresponding zipped partitions' locations; if the 
> intersection is empty, it uses the union of these locations.
> But in a special case, I want to customize the task preferred locations for 
> better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* 
> operator: a distributed matrix (DMatrix) multiplying a distributed 
> vector (DVector) uses RDD.zipPartitions (the DMatrix and DVector RDDs must be 
> partitioned in the same way beforehand).
> https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala
> Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we 
> want the zipPartitions task always to be placed at the `DMatrix` partition's 
> location; this gives better data locality than the default preferred-location strategy.
> I think it makes sense to add an option for this.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602484#comment-15602484
 ] 

chie hayashida commented on SPARK-16988:


Can I work on this issue?

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread chie hayashida (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chie hayashida updated SPARK-16988:
---
Comment: was deleted

(was: Can I work on it?)

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18078) Add option for customize zipPartition task preferred locations

2016-10-24 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-18078:
--

 Summary: Add option for customize zipPartition task preferred 
locations
 Key: SPARK-18078
 URL: https://issues.apache.org/jira/browse/SPARK-18078
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Weichen Xu


The `RDD.zipPartitions` task preferred-locations strategy uses the intersection 
of the corresponding zipped partitions' locations; if the intersection is empty, 
it uses the union of these locations.

But in a special case, I want to customize the task preferred locations for 
better performance. A typical case is in spark-tfocs: a distributed 
matrix (DMatrix) multiplying a vector (DVector) uses RDD.zipPartitions.
https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/DVectorFunctions.scala

Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want 
the zipPartitions task always to be placed at the `DMatrix` partition's location; 
this gives better data locality than the default preferred-location strategy.

I think it makes sense to add an option for this.
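
For context, here is a minimal runnable sketch of the zipPartitions pattern described above, with toy local data standing in for the DMatrix and DVector RDDs and a fresh local SparkContext assumed; today the zipped task's preferred locations are derived from both parents, and the requested option would bias them toward the larger side:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("zip-locality-sketch").setMaster("local[2]"))

// Toy stand-ins for the DMatrix / DVector RDDs, partitioned the same way (2 partitions).
val matrixRows  = sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)), 2)
val vectorParts = sc.parallelize(Seq(Array(0.5, 0.5), Array(0.25, 0.75)), 2)

// zipPartitions pairs partitions with the same index; the zipped task's preferred
// locations currently come from both parents (intersection, else union).
val products = matrixRows.zipPartitions(vectorParts) { (mIter, vIter) =>
  val v = vIter.next()
  mIter.map(row => row.zip(v).map { case (a, b) => a * b }.sum)
}
products.collect().foreach(println)

// Biasing placement toward the (larger) matrix side would need either the proposed
// option or a custom RDD that overrides getPreferredLocations.
{code}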
 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled

2016-10-24 Thread chie hayashida (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602442#comment-15602442
 ] 

chie hayashida commented on SPARK-16988:


Can I work on it?

> spark history server log needs to be fixed to show https url when ssl is 
> enabled
> 
>
> Key: SPARK-16988
> URL: https://issues.apache.org/jira/browse/SPARK-16988
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Yesha Vora
>Priority: Minor
>
> When Spark SSL is enabled, the Spark history server UI (http://host:port) is 
> redirected to https://host:port+400. 
> So, the Spark history server log should be updated to print the https URL 
> instead of the http URL. 
> {code:title=spark HS log}
> 16/08/09 15:21:11 INFO ServerConnector: Started 
> ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481}
> 16/08/09 15:21:11 INFO Server: Started @4023ms
> 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081.
> 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and 
> started at http://xxx:18081
> 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: 
> hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18073) Migrate wiki to spark.apache.org web site

2016-10-24 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602432#comment-15602432
 ] 

Shivaram Venkataraman commented on SPARK-18073:
---

I think most of the list looks good to me. The only thing I am not sure about 
is the Spark Internals pages. While I agree they are obsolete, I wonder if they 
are still a useful starting point for understanding the code. 

> Migrate wiki to spark.apache.org web site
> -
>
> Key: SPARK-18073
> URL: https://issues.apache.org/jira/browse/SPARK-18073
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>
> Per 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
>  , let's consider migrating all wiki pages to documents at 
> github.com/apache/spark-website (i.e. spark.apache.org).
> Some reasons:
> * No pull request system or history for changes to the wiki
> * Separate, not-so-clear system for granting write access to wiki
> * Wiki doesn't change much
> * One less place to maintain or look for docs
> The idea would be to then update all wiki pages with a message pointing to 
> the new home of the information (or message saying it's obsolete).
> Here are the current wikis and my general proposal for what to do with the 
> content:
> * Additional Language Bindings -> roll this into wherever Third Party 
> Projects ends up
> * Committers -> Migrate to a new /committers.html page, linked under 
> Community menu (already exists)
> * Contributing to Spark -> Make this CONTRIBUTING.md? or a new 
> /contributing.html page under Community menu
> ** Jira Permissions Scheme -> obsolete
> ** Spark Code Style Guide -> roll this into new contributing.html page
> * Development Discussions -> obsolete?
> * Powered By Spark -> Make into new /powered-by.html linked by the existing 
> Community menu item
> * Preparing Spark Releases -> see below; roll into where "versioning policy" 
> goes?
> * Profiling Spark Applications -> roll into where Useful Developer Tools goes
> ** Profiling Spark Applications Using YourKit -> ditto
> * Spark Internals -> all of these look somewhat to very stale; remove?
> ** Java API Internals
> ** PySpark Internals
> ** Shuffle Internals
> ** Spark SQL Internals
> ** Web UI Internals
> * Spark QA Infrastructure -> tough one. Good info to document; does it belong 
> on the website? we can just migrate it
> * Spark Versioning Policy -> new page living under Community (?) that 
> documents release policy and process (better menu?)
> ** spark-ec2 AMI list and install file version mappings -> obsolete
> ** Spark-Shark version mapping -> obsolete
> * Third Party Projects -> new Community menu item
> * Useful Developer Tools -> new page under new Developer menu? Community?
> ** Jenkins -> obsolete, remove
> Of course, another outcome is to just remove outdated wikis, migrate some, 
> leave the rest.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow

2016-10-24 Thread J.P Feng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

J.P Feng updated SPARK-18077:
-
Description: 
Hello, all. I am facing a strange issue in my project.

There is a table:

CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` 
string, `recdate` string)
PARTITIONED BY (`pt` string, `dt` string) stored as orc;

Another table:
CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` 
string, `happened_time` int)
PARTITIONED BY (`pt` string, `dt` string)


--
Test-1:

I executed this SQL in spark-shell or spark-sql (before I ran it, there was already a lot 
of data in partition (pt='mix_en', dt='2016-10-21') of table login4game):

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
select distinct account_name,role_id,server,'1476979200' as recdate from 
tbllog_login  where pt='mix_en' and  dt='2016-10-21' 

It takes a lot of time; below is part of the logs:

/
[Stage 5:===>   (144 + 8) / 
200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 
893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, 
real=0.08 secs] 
[Stage 5:=> (152 + 8) / 
200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 
872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, 
real=0.08 secs] 
[Stage 5:>  (160 + 8) / 
200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 
854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, 
real=0.07 secs] 
[Stage 5:>  (176 + 8) / 
200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 
797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, 
real=0.06 secs] 
[Stage 5:>  (177 + 8) / 
200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 
810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, 
real=0.06 secs] 
[Stage 5:>  (192 + 8) / 
200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 
821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, 
real=0.06 secs] 
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-3'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-4'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
...
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
/


I can see that the original data is moved to the .Trash directory.

Then there is no log output, and after about 10 minutes the logs start printing again:

/
16/10/24 17:24:15 INFO Hive: Replacing 
src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-0,
 dest: 
hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0,
 Status:true
16/10/24 17:24:15 INFO Hive: Replacing 
src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-1,
 dest: 
hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1,
 Status:true
16/10/24 17:24:15 INFO Hive: Replacing 
src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-2,
 dest: 
hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2,
 Status:true
16/10/24 17:24:15 INFO Hive: Replacing 

[jira] [Created] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow

2016-10-24 Thread J.P Feng (JIRA)
J.P Feng created SPARK-18077:


 Summary: Run insert overwrite statements in spark  to overwrite a 
partitioned table is very slow
 Key: SPARK-18077
 URL: https://issues.apache.org/jira/browse/SPARK-18077
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
 Environment: spark 2.0
hive 2.0.1
driver memory: 4g
total executors: 4
executor memory: 10g
total cores: 13
Reporter: J.P Feng


Hello, all. I am facing a strange issue in my project.

There is a table:

CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` 
string, `recdate` string)
PARTITIONED BY (`pt` string, `dt` string) stored as orc;

Another table:
CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` 
string, `happened_time` int)
PARTITIONED BY (`pt` string, `dt` string)



Test-1:

I executed this SQL in spark-shell or spark-sql (before I ran it, there was already a lot 
of data in partition (pt='mix_en', dt='2016-10-21') of table login4game):

insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21')
select distinct account_name,role_id,server,'1476979200' as recdate from 
tbllog_login  where pt='mix_en' and  dt='2016-10-21' 

It takes a lot of time; below is part of the logs:

/
[Stage 5:===>   (144 + 8) / 
200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 
893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, 
real=0.08 secs] 
[Stage 5:=> (152 + 8) / 
200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 
872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, 
real=0.08 secs] 
[Stage 5:>  (160 + 8) / 
200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 
854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, 
real=0.07 secs] 
[Stage 5:>  (176 + 8) / 
200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 
797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, 
real=0.06 secs] 
[Stage 5:>  (177 + 8) / 
200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 
810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, 
real=0.06 secs] 
[Stage 5:>  (192 + 8) / 
200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 
821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, 
real=0.06 secs] 
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-3'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-4'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
...
Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199'
 to trash at: hdfs://master.com/user/hadoop/.Trash/Current
/


I can see that the original data has been moved to .Trash.

After that, nothing is logged for about 10 minutes, and then the log output resumes:

/
16/10/24 17:24:15 INFO Hive: Replacing 
src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-0,
 dest: 
hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0,
 Status:true
16/10/24 17:24:15 INFO Hive: Replacing 
src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-1,
 dest: 
hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1,
 Status:true
16/10/24 17:24:15 INFO Hive: Replacing 
src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-2,
 dest: 

[jira] [Assigned] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18076:


Assignee: Apache Spark

> Fix default Locale used in DateFormat, NumberFormat to Locale.US
> 
>
> Key: SPARK-18076
> URL: https://issues.apache.org/jira/browse/SPARK-18076
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Apache Spark
>
> Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. 
> Although the behavior of these formats is mostly determined by things like 
> format strings, the exact behavior can vary according to the platform's 
> default locale. Although the locale defaults to "en", it can be set to 
> something else by env variables. And if it does, it can cause the same code 
> to succeed or fail based just on locale:
> {code}
> import java.text._
> import java.util._
> def parse(s: String, l: Locale) = new SimpleDateFormat("MMMdd", 
> l).parse(s)
> parse("1989Dec31", Locale.US)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.UK)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.CHINA)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(:18)
>   ... 32 elided
> parse("1989Dec31", Locale.GERMANY)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(:18)
>   ... 32 elided
> {code}
> Where not otherwise specified, I believe all instances in the code should 
> default to some fixed value, and that should probably be {{Locale.US}}. This 
> matches the JVM's default, and specifies both language ("en") and region 
> ("US") to remove ambiguity. This most closely matches what the current code 
> behavior would be (unless default locale was changed), because it will 
> currently default to "en".
> This affects SQL date/time functions. At the moment, the only SQL function 
> that lets the user specify language/country is "sentences", which is 
> consistent with Hive.
> It affects dates passed in the JSON API. 
> It affects some strings rendered in the UI, potentially. Although this isn't 
> a correctness issue, there may be an argument for not letting that vary (?)
> It affects a bunch of instances where dates are formatted into strings for 
> things like IDs or file names, which is far less likely to cause a problem, 
> but worth making consistent.
> The other occurrences are in tests.
> The downside to this change is also its upside: the behavior doesn't depend 
> on default JVM locale, but, also can't be affected by the default JVM locale. 
> For example, if you wanted to parse some dates in a way that depended on an 
> non-US locale (not just the format string) then it would no longer be 
> possible. There's no means of specifying this, for example, in SQL functions 
> for parsing dates. However, controlling this by globally changing the locale 
> isn't exactly great either.
> The purpose of this change is to make the current default behavior 
> deterministic and fixed. PR coming.
> CC [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US

2016-10-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602233#comment-15602233
 ] 

Apache Spark commented on SPARK-18076:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15610

> Fix default Locale used in DateFormat, NumberFormat to Locale.US
> 
>
> Key: SPARK-18076
> URL: https://issues.apache.org/jira/browse/SPARK-18076
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>
> Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. 
> Although the behavior of these formats is mostly determined by things like 
> format strings, the exact behavior can vary according to the platform's 
> default locale. Although the locale defaults to "en", it can be set to 
> something else by env variables. And if it does, it can cause the same code 
> to succeed or fail based just on locale:
> {code}
> import java.text._
> import java.util._
> def parse(s: String, l: Locale) = new SimpleDateFormat("MMMdd", 
> l).parse(s)
> parse("1989Dec31", Locale.US)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.UK)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.CHINA)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(:18)
>   ... 32 elided
> parse("1989Dec31", Locale.GERMANY)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(:18)
>   ... 32 elided
> {code}
> Where not otherwise specified, I believe all instances in the code should 
> default to some fixed value, and that should probably be {{Locale.US}}. This 
> matches the JVM's default, and specifies both language ("en") and region 
> ("US") to remove ambiguity. This most closely matches what the current code 
> behavior would be (unless default locale was changed), because it will 
> currently default to "en".
> This affects SQL date/time functions. At the moment, the only SQL function 
> that lets the user specify language/country is "sentences", which is 
> consistent with Hive.
> It affects dates passed in the JSON API. 
> It affects some strings rendered in the UI, potentially. Although this isn't 
> a correctness issue, there may be an argument for not letting that vary (?)
> It affects a bunch of instances where dates are formatted into strings for 
> things like IDs or file names, which is far less likely to cause a problem, 
> but worth making consistent.
> The other occurrences are in tests.
> The downside to this change is also its upside: the behavior doesn't depend 
> on default JVM locale, but, also can't be affected by the default JVM locale. 
> For example, if you wanted to parse some dates in a way that depended on an 
> non-US locale (not just the format string) then it would no longer be 
> possible. There's no means of specifying this, for example, in SQL functions 
> for parsing dates. However, controlling this by globally changing the locale 
> isn't exactly great either.
> The purpose of this change is to make the current default behavior 
> deterministic and fixed. PR coming.
> CC [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US

2016-10-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18076:


Assignee: (was: Apache Spark)

> Fix default Locale used in DateFormat, NumberFormat to Locale.US
> 
>
> Key: SPARK-18076
> URL: https://issues.apache.org/jira/browse/SPARK-18076
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core, SQL
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>
> Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. 
> Although the behavior of these formats is mostly determined by things like 
> format strings, the exact behavior can vary according to the platform's 
> default locale. Although the locale defaults to "en", it can be set to 
> something else by env variables. And if it does, it can cause the same code 
> to succeed or fail based just on locale:
> {code}
> import java.text._
> import java.util._
> def parse(s: String, l: Locale) = new SimpleDateFormat("MMMdd", 
> l).parse(s)
> parse("1989Dec31", Locale.US)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.UK)
> Sun Dec 31 00:00:00 GMT 1989
> parse("1989Dec31", Locale.CHINA)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(:18)
>   ... 32 elided
> parse("1989Dec31", Locale.GERMANY)
> java.text.ParseException: Unparseable date: "1989Dec31"
>   at java.text.DateFormat.parse(DateFormat.java:366)
>   at .parse(:18)
>   ... 32 elided
> {code}
> Where not otherwise specified, I believe all instances in the code should 
> default to some fixed value, and that should probably be {{Locale.US}}. This 
> matches the JVM's default, and specifies both language ("en") and region 
> ("US") to remove ambiguity. This most closely matches what the current code 
> behavior would be (unless default locale was changed), because it will 
> currently default to "en".
> This affects SQL date/time functions. At the moment, the only SQL function 
> that lets the user specify language/country is "sentences", which is 
> consistent with Hive.
> It affects dates passed in the JSON API. 
> It affects some strings rendered in the UI, potentially. Although this isn't 
> a correctness issue, there may be an argument for not letting that vary (?)
> It affects a bunch of instances where dates are formatted into strings for 
> things like IDs or file names, which is far less likely to cause a problem, 
> but worth making consistent.
> The other occurrences are in tests.
> The downside to this change is also its upside: the behavior doesn't depend 
> on default JVM locale, but, also can't be affected by the default JVM locale. 
> For example, if you wanted to parse some dates in a way that depended on an 
> non-US locale (not just the format string) then it would no longer be 
> possible. There's no means of specifying this, for example, in SQL functions 
> for parsing dates. However, controlling this by globally changing the locale 
> isn't exactly great either.
> The purpose of this change is to make the current default behavior 
> deterministic and fixed. PR coming.
> CC [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US

2016-10-24 Thread Sean Owen (JIRA)
Sean Owen created SPARK-18076:
-

 Summary: Fix default Locale used in DateFormat, NumberFormat to 
Locale.US
 Key: SPARK-18076
 URL: https://issues.apache.org/jira/browse/SPARK-18076
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core, SQL
Affects Versions: 2.0.1
Reporter: Sean Owen


Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. 
Although the behavior of these formats is mostly determined by things like 
format strings, the exact behavior can vary according to the platform's default 
locale. Although the locale defaults to "en", it can be set to something else 
by env variables. And if it does, it can cause the same code to succeed or fail 
based just on locale:

{code}
import java.text._
import java.util._

def parse(s: String, l: Locale) = new SimpleDateFormat("MMMdd", l).parse(s)

parse("1989Dec31", Locale.US)
Sun Dec 31 00:00:00 GMT 1989

parse("1989Dec31", Locale.UK)
Sun Dec 31 00:00:00 GMT 1989

parse("1989Dec31", Locale.CHINA)
java.text.ParseException: Unparseable date: "1989Dec31"
  at java.text.DateFormat.parse(DateFormat.java:366)
  at .parse(:18)
  ... 32 elided

parse("1989Dec31", Locale.GERMANY)
java.text.ParseException: Unparseable date: "1989Dec31"
  at java.text.DateFormat.parse(DateFormat.java:366)
  at .parse(:18)
  ... 32 elided
{code}

Where not otherwise specified, I believe all instances in the code should 
default to some fixed value, and that should probably be {{Locale.US}}. This 
matches the JVM's default, and specifies both language ("en") and region ("US") 
to remove ambiguity. This most closely matches what the current code behavior 
would be (unless default locale was changed), because it will currently default 
to "en".

This affects SQL date/time functions. At the moment, the only SQL function that 
lets the user specify language/country is "sentences", which is consistent with 
Hive.

It affects dates passed in the JSON API. 

It affects some strings rendered in the UI, potentially. Although this isn't a 
correctness issue, there may be an argument for not letting that vary (?)

It affects a bunch of instances where dates are formatted into strings for 
things like IDs or file names, which is far less likely to cause a problem, but 
worth making consistent.

The other occurrences are in tests.


The downside to this change is also its upside: the behavior doesn't depend on 
default JVM locale, but, also can't be affected by the default JVM locale. For 
example, if you wanted to parse some dates in a way that depended on an non-US 
locale (not just the format string) then it would no longer be possible. 
There's no means of specifying this, for example, in SQL functions for parsing 
dates. However, controlling this by globally changing the locale isn't exactly 
great either.

The purpose of this change is to make the current default behavior 
deterministic and fixed. PR coming.

CC [~hyukjin.kwon]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-24 Thread Nick Orka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602164#comment-15602164
 ] 

Nick Orka commented on SPARK-9219:
--

Here is the UDF-dedicated ticket: https://issues.apache.org/jira/browse/SPARK-18075

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at 

[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-10-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602160#comment-15602160
 ] 

Steve Loughran commented on SPARK-2984:
---

Alexey, can you describe your layout a bit more:

# are you using Amazon EMR?
# why s3:// URLs? 
# what hadoop-* JARs are you using (it's not that easy to tell in 
spark-1.6...what did you download?)

If you can, use Hadoop 2.7.x+ and then switch to s3a.
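
For illustration, reading the same data through s3a from spark-shell looks roughly like this; the bucket, path and credential handling are placeholders and assumptions on my part, not taken from this report:

{code}
// Assumes spark-shell (an existing SparkSession named `spark`) and hadoop-aws 2.7+ on the classpath.
val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

// Same listing, but via the s3a connector (placeholder bucket/path).
val rdd = sc.textFile("s3a://bucket/path/part-*")
println(rdd.count())
{code}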

What I actually suspect here, looking at the code listing, is that the 
following sequence is happening

* previous work completes, deletes/renames _temporary
* next code gets a list of all files (globStatus(pat))
* Then iterates through, getting info about each one.
* During that listing, the _temp dir goes away, which breaks the code.


Seems to me the listing logic in {{FileInputFormat.singleThreadedListStatus}} 
could be more forgiving about FNFEs: if the file is no longer there, well, skip 
it.
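
The kind of defensiveness meant here would look roughly like the following sketch against the Hadoop FileSystem API (the glob pattern is a placeholder, and this is not the actual FileInputFormat code):

{code}
import java.io.FileNotFoundException
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val statuses: Array[FileStatus] =
  Option(fs.globStatus(new Path("/data/output/part-*"))).getOrElse(Array.empty[FileStatus])

// Fetch block locations, skipping any file that vanished between the glob and this call.
val located = statuses.flatMap { status =>
  try Some(fs.getFileBlockLocations(status, 0, status.getLen))
  catch { case _: FileNotFoundException => None }
}
println(s"resolved ${located.length} of ${statuses.length} listed files")
{code}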

However, S3A in Hadoop 2.8 (and in HDP2.5, I'll note) replaces this very 
inefficient directory listing with one optimised for S3A (HADOOP-13208), which 
is much less likely to exhibit this problem.

Even so, some more resilience would be good; I'll make a note in the relevant 
bits of the Hadoop code

> FileNotFoundException on _temporary directory
> -
>
> Key: SPARK-2984
> URL: https://issues.apache.org/jira/browse/SPARK-2984
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
>
> We've seen several stacktraces and threads on the user mailing list where 
> people are having issues with a {{FileNotFoundException}} stemming from an 
> HDFS path containing {{_temporary}}.
> I ([~aash]) think this may be related to {{spark.speculation}}.  I think the 
> error condition might manifest in this circumstance:
> 1) task T starts on a executor E1
> 2) it takes a long time, so task T' is started on another executor E2
> 3) T finishes in E1 so moves its data from {{_temporary}} to the final 
> destination and deletes the {{_temporary}} directory during cleanup
> 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but 
> those files no longer exist!  exception
> Some samples:
> {noformat}
> 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 
> 140774430 ms.0
> java.io.FileNotFoundException: File 
> hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07
>  does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310)
> at 
> org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136)
> at 
> org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724)
> at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643)
> at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at 
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
> at scala.util.Try$.apply(Try.scala:161)
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
> at 
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
> at 
> 

[jira] [Commented] (SPARK-18073) Migrate wiki to spark.apache.org web site

2016-10-24 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602152#comment-15602152
 ] 

holdenk commented on SPARK-18073:
-

I like the idea of migrating everything off of the wiki. The fact that it's locked 
down makes sense from a spam point of view, but one of the best places for people 
to "get their toes wet" is documentation fixes, and making that experience more 
consistent across the different types of documentation seems like it could be 
quite beneficial. (Not to mention the resulting ability for more than just 
committers to contribute suggestions on how to improve these parts of our 
documentation.)

> Migrate wiki to spark.apache.org web site
> -
>
> Key: SPARK-18073
> URL: https://issues.apache.org/jira/browse/SPARK-18073
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>
> Per 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
>  , let's consider migrating all wiki pages to documents at 
> github.com/apache/spark-website (i.e. spark.apache.org).
> Some reasons:
> * No pull request system or history for changes to the wiki
> * Separate, not-so-clear system for granting write access to wiki
> * Wiki doesn't change much
> * One less place to maintain or look for docs
> The idea would be to then update all wiki pages with a message pointing to 
> the new home of the information (or message saying it's obsolete).
> Here are the current wikis and my general proposal for what to do with the 
> content:
> * Additional Language Bindings -> roll this into wherever Third Party 
> Projects ends up
> * Committers -> Migrate to a new /committers.html page, linked under 
> Community menu (already exists)
> * Contributing to Spark -> Make this CONTRIBUTING.md? or a new 
> /contributing.html page under Community menu
> ** Jira Permissions Scheme -> obsolete
> ** Spark Code Style Guide -> roll this into new contributing.html page
> * Development Discussions -> obsolete?
> * Powered By Spark -> Make into new /powered-by.html linked by the existing 
> Community menu item
> * Preparing Spark Releases -> see below; roll into where "versioning policy" 
> goes?
> * Profiling Spark Applications -> roll into where Useful Developer Tools goes
> ** Profiling Spark Applications Using YourKit -> ditto
> * Spark Internals -> all of these look somewhat to very stale; remove?
> ** Java API Internals
> ** PySpark Internals
> ** Shuffle Internals
> ** Spark SQL Internals
> ** Web UI Internals
> * Spark QA Infrastructure -> tough one. Good info to document; does it belong 
> on the website? we can just migrate it
> * Spark Versioning Policy -> new page living under Community (?) that 
> documents release policy and process (better menu?)
> ** spark-ec2 AMI list and install file version mappings -> obsolete
> ** Spark-Shark version mapping -> obsolete
> * Third Party Projects -> new Community menu item
> * Useful Developer Tools -> new page under new Developer menu? Community?
> ** Jenkins -> obsolete, remove
> Of course, another outcome is to just remove outdated wikis, migrate some, 
> leave the rest.
> Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18075) UDF doesn't work on non-local spark

2016-10-24 Thread Nick Orka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Orka updated SPARK-18075:
--
Description: 
I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz)

According to that ticket, https://issues.apache.org/jira/browse/SPARK-9219, I've 
given all Spark dependencies the provided scope. I use exactly the same Spark 
version in the app as on the Spark server.

Here is my pom:
{code:title=pom.xml}
<!-- element names inferred from context; only the values survived in the archive -->
<properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.0.0</spark.version>
    <hadoop.version>2.7.0</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
{code}

As you can see, all Spark dependencies have the provided scope.

And this is the code for reproduction:
{code:title=udfTest.scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
  * Created by nborunov on 10/19/16.
  */
object udfTest {

  class Seq extends Serializable {
var i = 0

def getVal: Int = {
  i = i + 1
  i
}
  }

  def main(args: Array[String]) {

val spark = SparkSession
  .builder()
.master("spark://nborunov-mbp.local:7077")
//  .master("local")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))

val schema = StructType(Array(StructField("name", StringType)))

val df = spark.createDataFrame(rdd, schema)

df.show()

spark.udf.register("func", (name: String) => name.toUpperCase)

import org.apache.spark.sql.functions.expr

val newDf = df.withColumn("upperName", expr("func(name)"))

newDf.show()

val seq = new Seq

spark.udf.register("seq", () => seq.getVal)

val seqDf = df.withColumn("id", expr("seq()"))

seqDf.show()

df.createOrReplaceTempView("df")

spark.sql("select *, seq() as sql_id from df").show()

  }

}
{code}

When .master("local") - everything works fine. When 
.master("spark://...:7077"), it fails on line:
{code}
newDf.show()
{code}
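
A detail worth double-checking when reproducing this (an assumption on my side, not something established in this ticket) is whether the compiled application classes ever reach the executors: when the session is created from an IDE or the REPL against a remote master, that usually means packaging the app as a jar and pointing spark.jars at it, e.g.:

{code}
import org.apache.spark.sql.SparkSession

// Hypothetical sketch only; the jar path is a placeholder and this is not a confirmed fix
// for the reported ClassCastException.
val spark = SparkSession.builder()
  .master("spark://nborunov-mbp.local:7077")
  .config("spark.jars", "/path/to/udf-test-assembly.jar")
  .getOrCreate()
{code}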

The error is exactly the same:
{code}
scala> udfTest.main(Array())
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0
16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov
16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov
16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 
16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 
16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(nborunov); groups 
with view permissions: Set(); users  with modify permissions: Set(nborunov); 
groups with modify permissions: Set()
16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on 
port 57828.
16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker
16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster
16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at 
/private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264
16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
http://192.168.2.202:4040
16/10/19 19:37:54 INFO StandaloneAppClient$ClientEndpoint: Connecting to master 
spark://nborunov-mbp.local:7077...
16/10/19 19:37:54 INFO TransportClientFactory: Successfully created connection 
to nborunov-mbp.local/192.168.2.202:7077 after 74 ms (0 ms spent in bootstraps)
16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Connected to Spark cluster 
with app ID app-20161019153755-0017
16/10/19 19:37:55 INFO StandaloneAppClient$ClientEndpoint: Executor added: 
app-20161019153755-0017/0 on 

[jira] [Created] (SPARK-18075) UDF doesn't work on non-local spark

2016-10-24 Thread Nick Orka (JIRA)
Nick Orka created SPARK-18075:
-

 Summary: UDF doesn't work on non-local spark
 Key: SPARK-18075
 URL: https://issues.apache.org/jira/browse/SPARK-18075
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0, 1.6.1
Reporter: Nick Orka


I've decided to clone the ticket because it showed the same problem for another 
Spark version and the provided workaround doesn't fix the issue.
I'm duplicating my case here.

I have the same issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz)
Here is my pom:
{code:title=pom.xml}
<!-- element names inferred from context; only the values survived in the archive -->
<properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.0.0</spark.version>
    <hadoop.version>2.7.0</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
{code}

As you can see, all Spark dependencies have the provided scope.

And this is the code for reproduction:
{code:title=udfTest.scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
  * Created by nborunov on 10/19/16.
  */
object udfTest {

  class Seq extends Serializable {
var i = 0

def getVal: Int = {
  i = i + 1
  i
}
  }

  def main(args: Array[String]) {

val spark = SparkSession
  .builder()
.master("spark://nborunov-mbp.local:7077")
//  .master("local")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))

val schema = StructType(Array(StructField("name", StringType)))

val df = spark.createDataFrame(rdd, schema)

df.show()

spark.udf.register("func", (name: String) => name.toUpperCase)

import org.apache.spark.sql.functions.expr

val newDf = df.withColumn("upperName", expr("func(name)"))

newDf.show()

val seq = new Seq

spark.udf.register("seq", () => seq.getVal)

val seqDf = df.withColumn("id", expr("seq()"))

seqDf.show()

df.createOrReplaceTempView("df")

spark.sql("select *, seq() as sql_id from df").show()

  }

}
{code}

When .master("local") - everything works fine. When 
.master("spark://...:7077"), it fails on line:
{code}
newDf.show()
{code}

The error is exactly the same:
{code}
scala> udfTest.main(Array())
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0
16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov
16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov
16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 
16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 
16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(nborunov); groups 
with view permissions: Set(); users  with modify permissions: Set(nborunov); 
groups with modify permissions: Set()
16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on 
port 57828.
16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker
16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster
16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at 
/private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264
16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
http://192.168.2.202:4040
16/10/19 19:37:54 INFO StandaloneAppClient$ClientEndpoint: Connecting to master 
spark://nborunov-mbp.local:7077...
16/10/19 19:37:54 INFO TransportClientFactory: Successfully created connection 
to nborunov-mbp.local/192.168.2.202:7077 after 74 ms (0 ms spent in bootstraps)
16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Connected to Spark cluster 
with app ID app-20161019153755-0017

[jira] [Updated] (SPARK-18075) UDF doesn't work on non-local spark

2016-10-24 Thread Nick Orka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Orka updated SPARK-18075:
--
Description: 
I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz)
Here is my pom:
{code:title=pom.xml}
<!-- element names inferred from context; only the values survived in the archive -->
<properties>
    <maven.compiler.source>1.6</maven.compiler.source>
    <maven.compiler.target>1.6</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.8</scala.version>
    <spark.version>2.0.0</spark.version>
    <hadoop.version>2.7.0</hadoop.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
{code}

As you can see, all Spark dependencies have the provided scope.

And this is the code for reproduction:
{code:title=udfTest.scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
  * Created by nborunov on 10/19/16.
  */
object udfTest {

  class Seq extends Serializable {
var i = 0

def getVal: Int = {
  i = i + 1
  i
}
  }

  def main(args: Array[String]) {

val spark = SparkSession
  .builder()
.master("spark://nborunov-mbp.local:7077")
//  .master("local")
  .getOrCreate()

val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))

val schema = StructType(Array(StructField("name", StringType)))

val df = spark.createDataFrame(rdd, schema)

df.show()

spark.udf.register("func", (name: String) => name.toUpperCase)

import org.apache.spark.sql.functions.expr

val newDf = df.withColumn("upperName", expr("func(name)"))

newDf.show()

val seq = new Seq

spark.udf.register("seq", () => seq.getVal)

val seqDf = df.withColumn("id", expr("seq()"))

seqDf.show()

df.createOrReplaceTempView("df")

spark.sql("select *, seq() as sql_id from df").show()

  }

}
{code}

When .master("local") - everything works fine. When 
.master("spark://...:7077"), it fails on line:
{code}
newDf.show()
{code}

The error is exactly the same:
{code}
scala> udfTest.main(Array())
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in 
[jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in 
[jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0
16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov
16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov
16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 
16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 
16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(nborunov); groups 
with view permissions: Set(); users  with modify permissions: Set(nborunov); 
groups with modify permissions: Set()
16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on 
port 57828.
16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker
16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster
16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at 
/private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264
16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator
16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at 
http://192.168.2.202:4040
16/10/19 19:37:54 INFO StandaloneAppClient$ClientEndpoint: Connecting to master 
spark://nborunov-mbp.local:7077...
16/10/19 19:37:54 INFO TransportClientFactory: Successfully created connection 
to nborunov-mbp.local/192.168.2.202:7077 after 74 ms (0 ms spent in bootstraps)
16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Connected to Spark cluster 
with app ID app-20161019153755-0017
16/10/19 19:37:55 INFO StandaloneAppClient$ClientEndpoint: Executor added: 
app-20161019153755-0017/0 on worker-20161018232014-192.168.2.202-61437 
(192.168.2.202:61437) with 4 cores
16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Granted executor ID 
app-20161019153755-0017/0 on hostPort 192.168.2.202:61437 with 4 

[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory

2016-10-24 Thread Alexey Balchunas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602045#comment-15602045
 ] 

Alexey Balchunas commented on SPARK-2984:
-

I'm getting a similar exception on Spark 1.6.0:
{code}
java.io.FileNotFoundException: No such file or directory 
's3://bucket/path/part-1'
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:507)
at 
org.apache.hadoop.fs.FileSystem.getFileBlockLocations(FileSystem.java:704)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1696)
at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681)
at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268)
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at 
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 

[jira] [Created] (SPARK-18074) UDFs don't work on non-local environment

2016-10-24 Thread Alberto Andreotti (JIRA)
Alberto Andreotti created SPARK-18074:
-

 Summary: UDFs don't work on non-local environment
 Key: SPARK-18074
 URL: https://issues.apache.org/jira/browse/SPARK-18074
 Project: Spark
  Issue Type: Bug
Reporter: Alberto Andreotti


It seems that UDFs fail to deserialize when they are sent to the remote cluster. 
This is an app that can help reproduce the problem:

https://github.com/albertoandreottiATgmail/spark_udf

and this is the stack trace with the exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 
77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance of 
scala.collection.immutable.List$SerializationProxy to field 
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at 
java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at 

[jira] [Created] (SPARK-18073) Migrate wiki to spark.apache.org web site

2016-10-24 Thread Sean Owen (JIRA)
Sean Owen created SPARK-18073:
-

 Summary: Migrate wiki to spark.apache.org web site
 Key: SPARK-18073
 URL: https://issues.apache.org/jira/browse/SPARK-18073
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 2.0.1
Reporter: Sean Owen


Per 
http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html
 , let's consider migrating all wiki pages to documents at 
github.com/apache/spark-website (i.e. spark.apache.org).

Some reasons:
* No pull request system or history for changes to the wiki
* Separate, not-so-clear system for granting write access to wiki
* Wiki doesn't change much
* One less place to maintain or look for docs

The idea would be to then update all wiki pages with a message pointing to the 
new home of the information (or message saying it's obsolete).

Here are the current wikis and my general proposal for what to do with the 
content:

* Additional Language Bindings -> roll this into wherever Third Party Projects 
ends up
* Committers -> Migrate to a new /committers.html page, linked under Community 
menu (already exists)
* Contributing to Spark -> Make this CONTRIBUTING.md? or a new 
/contributing.html page under Community menu
** Jira Permissions Scheme -> obsolete
** Spark Code Style Guide -> roll this into new contributing.html page
* Development Discussions -> obsolete?
* Powered By Spark -> Make into new /powered-by.html linked by the existing 
Community menu item
* Preparing Spark Releases -> see below; roll into where "versioning policy" 
goes?
* Profiling Spark Applications -> roll into where Useful Developer Tools goes
** Profiling Spark Applications Using YourKit -> ditto
* Spark Internals -> all of these look somewhat to very stale; remove?
** Java API Internals
** PySpark Internals
** Shuffle Internals
** Spark SQL Internals
** Web UI Internals
* Spark QA Infrastructure -> tough one. Good info to document; does it belong 
on the website? we can just migrate it
* Spark Versioning Policy -> new page living under Community (?) that documents 
release policy and process (better menu?)
** spark-ec2 AMI list and install file version mappings -> obsolete
** Spark-Shark version mapping -> obsolete
* Third Party Projects -> new Community menu item
* Useful Developer Tools -> new page under new Developer menu? Community?
** Jenkins -> obsolete, remove


Of course, another outcome is to just remove outdated wikis, migrate some, 
leave the rest.

Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-24 Thread Alberto Andreotti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601892#comment-15601892
 ] 

Alberto Andreotti commented on SPARK-9219:
--

Please paste the ticket number here. Thanks.

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at 

[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD

2016-10-24 Thread Nick Orka (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601836#comment-15601836
 ] 

Nick Orka commented on SPARK-9219:
--

This UDF functionality doesn't work in a non-local environment. I would re-open 
the issue or create a new one dedicated to UDFs.

> ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
> ---
>
> Key: SPARK-9219
> URL: https://issues.apache.org/jira/browse/SPARK-9219
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.4.1
>Reporter: Mohsen Zainalpour
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 
> (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
>   at 
> java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477)
>   at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> 

[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601827#comment-15601827
 ] 

Cédric Hernalsteens commented on SPARK-15487:
-

Sure, this one (reverse proxy to access the workers and apps from the master web 
UI) is resolved.

I'll open another one.

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently when running in Standalone mode, Spark UI's link to workers and 
> application drivers are pointing to internal/protected network endpoints. So 
> to access workers/application UI user's machine has to connect to VPN or need 
> to have access to internal network directly.
> Therefore the proposal is to make Spark master UI reverse proxy this 
> information back to the user. So only Spark master UI needs to be opened up 
> to internet. 
> The minimal changes can be done by adding another route e.g. 
> http://spark-master.com/target// so when request goes to target, 
> ProxyServlet kicks in and takes the  and forwards the request to it 
> and send response back to user.
> More information about discussions for this features can be found on this 
> mailing list thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-18052) Spark Job failing with org.apache.spark.rpc.RpcTimeoutException

2016-10-24 Thread Srikanth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth closed SPARK-18052.

Resolution: Not A Bug

> Spark Job failing with org.apache.spark.rpc.RpcTimeoutException
> ---
>
> Key: SPARK-18052
> URL: https://issues.apache.org/jira/browse/SPARK-18052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.0.0
> Environment: 3 node spark cluster, all AWS r3.xlarge instances 
> running on ubuntu.
>Reporter: Srikanth
> Attachments: sparkErrorLog.txt
>
>
> Spark submit jobs are failing with org.apache.spark.rpc.RpcTimeoutException. 
> increased the spark.executor.heartbeatInterval value from 10s to 60s, but 
> still the same issue.
> This is happening while saving a dataframe to a mounted network drive. Not 
> using HDFS here. We are able to write successfully for smaller size files 
> under 10G, the data here we are reading is nearly 20G
> driver memory = 10G
> executor memory = 25G
> Please see the attached log file.
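
As an aside for anyone who hits the same symptom: the heartbeat interval has to 
stay well below spark.network.timeout, so the two are usually raised together. A 
minimal sketch (the values and app name below are examples only, not a 
recommendation for this particular job):

{code}
import org.apache.spark.sql.SparkSession

// Sketch: raise both timeouts together for jobs doing long blocking writes.
val spark = SparkSession.builder()
  .appName("large-network-write")                       // example name
  .config("spark.executor.heartbeatInterval", "60s")    // keep well below spark.network.timeout
  .config("spark.network.timeout", "600s")
  .getOrCreate()
{code}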



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-24 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601794#comment-15601794
 ] 

Matthew Farrellee commented on SPARK-15487:
---

Well, unless you're putting another proxy in front of your master and want it 
to show up in a subsection of your domain, you should only need "/"; that works 
and would be a great default. In the case of a site proxy on 
www.mydomain.com/spark, I'd expect you only need to set the url to "/spark".

FYI, the master passes the proxy url to the workers - 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L401
 - so you should not need to set it on the workers. A sketch of the resulting 
master configuration is below.

If you're continuing to have a problem, you should definitely open another issue 
and leave this one as resolved.
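
A minimal sketch of the standalone setup being discussed, assuming both settings 
live in the master's spark-defaults.conf (the /spark prefix is only an example 
for the www.mydomain.com/spark site-proxy case):

{code}
# spark-defaults.conf on the master (sketch)
spark.ui.reverseProxy      true
# only needed when the master itself sits behind another proxy
spark.ui.reverseProxyUrl   /spark
{code}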

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently when running in Standalone mode, Spark UI's link to workers and 
> application drivers are pointing to internal/protected network endpoints. So 
> to access workers/application UI user's machine has to connect to VPN or need 
> to have access to internal network directly.
> Therefore the proposal is to make Spark master UI reverse proxy this 
> information back to the user. So only Spark master UI needs to be opened up 
> to internet. 
> The minimal changes can be done by adding another route e.g. 
> http://spark-master.com/target// so when request goes to target, 
> ProxyServlet kicks in and takes the  and forwards the request to it 
> and send response back to user.
> More information about discussions for this features can be found on this 
> mailing list thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-24 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601783#comment-15601783
 ] 

zhangxinyu commented on SPARK-17935:


I wrote a short design doc for a KafkaSink for kafka-0.10, as follows. KafkaSink 
is designed based on the kafka-0.10-sql module.
Could you please take a look?

> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module

2016-10-24 Thread zhangxinyu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778
 ] 

zhangxinyu commented on SPARK-17935:


h2. KafkaSink Design Doc

h4. Goal
Output results to a Kafka cluster (version 0.10.0.0) from the Structured 
Streaming module.

h4. Implement
Four classes are implemented to output data to a Kafka cluster from the 
Structured Streaming module.
* *KafkaSinkProvider*
This class extends the traits *StreamSinkProvider* and *DataSourceRegister* and 
overrides the functions *shortName* and *createSink*. In *createSink*, a 
*KafkaSink* is created.
* *KafkaSink*
*KafkaSink* extends *Sink* and overrides the function *addBatch*. A 
*KafkaSinkRDD* is created in *addBatch*.
* *KafkaSinkRDD*
*KafkaSinkRDD* sends results to the Kafka cluster in a distributed way. It 
extends *RDD*. In its *compute* function, *CachedKafkaProducer* is called to get 
or create a producer and send the data.
* *CachedKafkaProducer*
*CachedKafkaProducer* caches producers on the executors so that they can be 
reused.

h4. Configuration
* *Kafka Producer Configuration*
Kafka producer settings are passed via "*.option()*" and all start with the 
"*kafka.*" prefix. For example, the producer setting *bootstrap.servers* can be 
configured with *.option("kafka.bootstrap.servers", kafka-servers)*.
* *Other Configuration*
Other settings are also passed via ".option()"; the difference is that they do 
not start with "kafka.".

h4. Usage
{code}
val query = input.writeStream
  .format("kafkaSink")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()
{code}
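
Purely to make the design above concrete, here is a condensed Scala sketch of how 
the pieces could fit together. It is not the actual patch: it collapses 
*KafkaSinkRDD* and *CachedKafkaProducer* into a plain {{foreachPartition}} that 
builds one producer per partition, renders each row as a comma-separated string, 
and assumes the internal {{Sink}} / {{StreamSinkProvider}} traits as they exist in 
the 2.0 code base.

{code}
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSinkProvider}
import org.apache.spark.sql.streaming.OutputMode

class KafkaSinkProvider extends StreamSinkProvider with DataSourceRegister {
  override def shortName(): String = "kafkaSink"

  override def createSink(
      sqlContext: SQLContext,
      parameters: Map[String, String],
      partitionColumns: Seq[String],
      outputMode: OutputMode): Sink = new KafkaSink(parameters)
}

class KafkaSink(parameters: Map[String, String]) extends Sink {
  // Options starting with "kafka." are forwarded to the producer; "topic" is a sink option.
  private val topic = parameters("topic")
  private val producerParams = parameters
    .filter { case (k, _) => k.startsWith("kafka.") }
    .map { case (k, v) => (k.stripPrefix("kafka."), v) }

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    val topicName = topic              // copy to locals so the closure does not
    val kafkaParams = producerParams   // capture the non-serializable sink itself
    data.foreachPartition { rows: Iterator[Row] =>
      val props = new Properties()
      kafkaParams.foreach { case (k, v) => props.put(k, v) }
      props.put("key.serializer", classOf[StringSerializer].getName)
      props.put("value.serializer", classOf[StringSerializer].getName)
      // Simplification: one producer per partition instead of a CachedKafkaProducer.
      val producer = new KafkaProducer[String, String](props)
      try {
        rows.foreach { row =>
          producer.send(new ProducerRecord[String, String](topicName, row.mkString(",")))
        }
        producer.flush()
      } finally {
        producer.close()
      }
    }
  }
}
{code}

For {{.format("kafkaSink")}} to resolve the short name, the provider would also 
need a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister entry; 
otherwise the fully qualified provider class name can be passed to {{.format()}}.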



> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
> --
>
> Key: SPARK-17935
> URL: https://issues.apache.org/jira/browse/SPARK-17935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Streaming
>Affects Versions: 2.0.0
>Reporter: zhangxinyu
>
> Now spark already supports kafkaInputStream. It would be useful that we add 
> `KafkaForeachWriter` to output results to kafka in structured streaming 
> module.
> `KafkaForeachWriter.scala` is put in external kafka-0.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema

2016-10-24 Thread Matthew Scruggs (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Scruggs updated SPARK-18065:

Priority: Minor  (was: Major)

> Spark 2 allows filter/where on columns not in current schema
> 
>
> Key: SPARK-18065
> URL: https://issues.apache.org/jira/browse/SPARK-18065
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1
>Reporter: Matthew Scruggs
>Priority: Minor
>
> I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a 
> DataFrame that previously had a column, but no longer has it in its schema 
> due to a select() operation.
> In Spark 1.6.2, in spark-shell, we see that an exception is thrown when 
> attempting to filter/where using the selected-out column:
> {code:title=Spark 1.6.2}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.2
>   /_/
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), 
> (2, "two")))).selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---+----+
> | id|word|
> +---+----+
> |  1| one|
> |  2| two|
> +---+----+
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input 
> columns: [id];
> {code}
> However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds 
> (no AnalysisException) and seems to filter out data as if the column remains:
> {code:title=Spark 2.0.1}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.1
>   /_/
>  
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val df1 = sc.parallelize(Seq((1, "one"), (2, 
> "two"))).toDF().selectExpr("_1 as id", "_2 as word")
> df1: org.apache.spark.sql.DataFrame = [id: int, word: string]
> scala> df1.show()
> +---+----+
> | id|word|
> +---+----+
> |  1| one|
> |  2| two|
> +---+----+
> scala> val df2 = df1.select("id")
> df2: org.apache.spark.sql.DataFrame = [id: int]
> scala> df2.printSchema()
> root
>  |-- id: integer (nullable = false)
> scala> df2.where("word = 'one'").show()
> +---+
> | id|
> +---+
> |  1|
> +---+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI

2016-10-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601746#comment-15601746
 ] 

Cédric Hernalsteens commented on SPARK-15487:
-

@Matthew: that would be too easy ;) Seriously, I don't think I'll be the only 
one wanting to reverse proxy to something like www.mydomain/spark.

@Gurvinder: For spark.ui.reverseProxy I got it; it works well indeed. Maybe I 
should open another issue, because I believe I set spark.ui.reverseProxyUrl 
correctly on the worker and it's not taken into account.

Thanks for looking at this though.

> Spark Master UI to reverse proxy Application and Workers UI
> ---
>
> Key: SPARK-15487
> URL: https://issues.apache.org/jira/browse/SPARK-15487
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.6.0, 1.6.1
>Reporter: Gurvinder
>Assignee: Gurvinder
>Priority: Minor
> Fix For: 2.1.0
>
>
> Currently when running in Standalone mode, Spark UI's link to workers and 
> application drivers are pointing to internal/protected network endpoints. So 
> to access workers/application UI user's machine has to connect to VPN or need 
> to have access to internal network directly.
> Therefore the proposal is to make Spark master UI reverse proxy this 
> information back to the user. So only Spark master UI needs to be opened up 
> to internet. 
> The minimal changes can be done by adding another route e.g. 
> http://spark-master.com/target// so when request goes to target, 
> ProxyServlet kicks in and takes the  and forwards the request to it 
> and send response back to user.
> More information about discussions for this features can be found on this 
> mailing list thread 
> http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2016-10-24 Thread Dhananjay Patkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601625#comment-15601625
 ] 

Dhananjay Patkar commented on SPARK-4105:
-

I see this error intermittently.
I am using:
* Spark 1.6.2
* Hadoop/YARN 2.7.3
{quote}
 Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 40 in stage 7.0 failed 4 times, most recent failure: Lost task 40.3 in 
stage 7.0 (TID 411, 10.0.0.12): java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:98)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:474)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:513)
at 
org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:422)
at 
org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:469)
at java.io.BufferedInputStream.read(BufferedInputStream.java:353)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899)
at 
org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119)
at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
{quote}

> FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
> shuffle
> -
>
> Key: SPARK-4105
> URL: https://issues.apache.org/jira/browse/SPARK-4105
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Attachments: JavaObjectToSerialize.java, 
> SparkFailedToUncompressGenerator.scala
>
>
> We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
> shuffle read.  Here's a sample stacktrace from an executor:
> {code}
> 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
> 33053)
> java.io.IOException: FAILED_TO_UNCOMPRESS(5)
>   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
>   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
>   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
>   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
>   at 
> org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
>   at 
> org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
>   at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58)
>   at 
> org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
>   at 
> org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>   at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> 

[jira] [Resolved] (SPARK-17810) Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path

2016-10-24 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17810.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.2

Issue resolved by pull request 15382
[https://github.com/apache/spark/pull/15382]

> Default spark.sql.warehouse.dir is relative to local FS but can resolve as 
> HDFS path
> 
>
> Key: SPARK-17810
> URL: https://issues.apache.org/jira/browse/SPARK-17810
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
>Reporter: Sean Owen
>Assignee: Sean Owen
> Fix For: 2.0.2, 2.1.0
>
>
> Following SPARK-15899 and 
> https://github.com/apache/spark/pull/13868#discussion_r82252372 we have a 
> slightly different problem. 
> The change removed the {{file:}} scheme from the default 
> {{spark.sql.warehouse.dir}} as part of its fix, though the path is still 
> clearly intended to be a local FS path and defaults to "spark-warehouse" in 
> the user's home dir. However when running on HDFS this path will be resolved 
> as an HDFS path, where it almost surely doesn't exist. 
> Although it can be fixed by overriding {{spark.sql.warehouse.dir}} to a path 
> like "file:/tmp/spark-warehouse", or any valid HDFS path, this probably won't 
> work on Windows (the original problem) and of course means the default fails 
> to work for most HDFS use cases.
> There's a related problem here: the docs say the default should be 
> spark-warehouse relative to the current working dir, not the user home dir. 
> We can adjust that.
> PR coming shortly.
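
As a concrete sketch of the workaround mentioned above (the path and app name are 
examples only):

{code}
import org.apache.spark.sql.SparkSession

// Pin the warehouse dir to an explicit local-FS path; any valid HDFS path works too.
val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "file:/tmp/spark-warehouse")
  .getOrCreate()
{code}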



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17847) Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml

2016-10-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-17847:
---

Assignee: Yanbo Liang

> Reduce shuffled data size of GaussianMixture & copy the implementation from 
> mllib to ml
> ---
>
> Key: SPARK-17847
> URL: https://issues.apache.org/jira/browse/SPARK-17847
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
> features to it.
> I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
> wrap the ml implementation. For the following reasons:
> * mllib {{GaussianMixture}} allow k == 1, but ml does not.
> * mllib {{GaussianMixture}} supports setting initial model, but ml does not 
> support currently. (We will definitely add this feature for ml in the future)
> Meanwhile, There is a big performance improvement for {{GaussianMixture}} in 
> this task. Since the covariance matrix of multivariate gaussian distribution 
> is symmetric, we can only store the upper triangular part of the matrix and 
> it will greatly reduce the shuffled data size.
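
For intuition only (this is not the patch itself): storing just the upper 
triangle keeps n(n+1)/2 values instead of n^2 per covariance matrix, which is 
where the roughly 2x reduction in shuffled bytes per Gaussian comes from. A small 
Scala sketch, assuming a column-major dense layout:

{code}
// Pack the upper triangle (including the diagonal) of a symmetric n x n matrix
// stored column-major into an array of n * (n + 1) / 2 values.
def packUpperTriangle(n: Int, full: Array[Double]): Array[Double] = {
  val packed = new Array[Double](n * (n + 1) / 2)
  var idx = 0
  var j = 0
  while (j < n) {
    var i = 0
    while (i <= j) {
      packed(idx) = full(j * n + i)  // element (i, j) in column-major layout
      idx += 1
      i += 1
    }
    j += 1
  }
  packed
}
{code}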



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17847) Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml

2016-10-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17847:

Description: 
Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
features to it.
I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
wrap the ml implementation. For the following reasons:
* mllib {{GaussianMixture}} allow k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting initial model, but ml does not 
support currently. (We will definitely add this feature for ml in the future)

Meanwhile, There is a big performance improvement for {{GaussianMixture}} in 
this task. Since the covariance matrix of multivariate gaussian distribution is 
symmetric, we can only store the upper triangular part of the matrix and it 
will greatly reduce the shuffled data size.

  was:
Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
features to it.
I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
wrap the ml implementation. For the following reasons:
* mllib {{GaussianMixture}} allow k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting initial model, but ml does not 
support currently. (We will definitely add this feature for ml in the future)
Meanwhile, There is a big performance improvement for {{GaussianMixture}} in 
this task. Since the covariance matrix of multivariate gaussian distribution is 
symmetric, we can only store the upper triangular part of the matrix and it 
will greatly reduce the shuffled data size.


> Reduce shuffled data size of GaussianMixture & copy the implementation from 
> mllib to ml
> ---
>
> Key: SPARK-17847
> URL: https://issues.apache.org/jira/browse/SPARK-17847
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>
> Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
> features to it.
> I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
> wrap the ml implementation. For the following reasons:
> * mllib {{GaussianMixture}} allow k == 1, but ml does not.
> * mllib {{GaussianMixture}} supports setting initial model, but ml does not 
> support currently. (We will definitely add this feature for ml in the future)
> Meanwhile, There is a big performance improvement for {{GaussianMixture}} in 
> this task. Since the covariance matrix of multivariate gaussian distribution 
> is symmetric, we can only store the upper triangular part of the matrix and 
> it will greatly reduce the shuffled data size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17847) Copy GaussianMixture implementation from mllib to ml

2016-10-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17847:

Description: 
Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
features to it.
I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
wrap the ml implementation. For the following reasons:
* mllib {{GaussianMixture}} allow k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting initial model, but ml does not 
support currently. (We will definitely add this feature for ml in the future)
Meanwhile, There is a big performance improvement for {{GaussianMixture}} in 
this task. Since the covariance matrix of multivariate gaussian distribution is 
symmetric, we can only store the upper triangular part of the matrix and it 
will greatly reduce the shuffled data size.

  was:
Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
features to it.
I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
wrap the ml implementation. For the following reasons:
* mllib {{GaussianMixture}} allow k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting initial model, but ml does not 
support currently. (We will definitely add this feature for ml in the future)


> Copy GaussianMixture implementation from mllib to ml
> 
>
> Key: SPARK-17847
> URL: https://issues.apache.org/jira/browse/SPARK-17847
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>
> Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
> features to it.
> I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
> wrap the ml implementation. For the following reasons:
> * mllib {{GaussianMixture}} allow k == 1, but ml does not.
> * mllib {{GaussianMixture}} supports setting initial model, but ml does not 
> support currently. (We will definitely add this feature for ml in the future)
> Meanwhile, There is a big performance improvement for {{GaussianMixture}} in 
> this task. Since the covariance matrix of multivariate gaussian distribution 
> is symmetric, we can only store the upper triangular part of the matrix and 
> it will greatly reduce the shuffled data size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17847) Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml

2016-10-24 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-17847:

Summary: Reduce shuffled data size of GaussianMixture & copy the 
implementation from mllib to ml  (was: Copy GaussianMixture implementation from 
mllib to ml)

> Reduce shuffled data size of GaussianMixture & copy the implementation from 
> mllib to ml
> ---
>
> Key: SPARK-17847
> URL: https://issues.apache.org/jira/browse/SPARK-17847
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Yanbo Liang
>
> Copy {{GaussianMixture}} implementation from mllib to ml, then we can add new 
> features to it.
> I left mllib {{GaussianMixture}} untouched, unlike some other algorithms to 
> wrap the ml implementation. For the following reasons:
> * mllib {{GaussianMixture}} allow k == 1, but ml does not.
> * mllib {{GaussianMixture}} supports setting initial model, but ml does not 
> support currently. (We will definitely add this feature for ml in the future)
> Meanwhile, There is a big performance improvement for {{GaussianMixture}} in 
> this task. Since the covariance matrix of multivariate gaussian distribution 
> is symmetric, we can only store the upper triangular part of the matrix and 
> it will greatly reduce the shuffled data size.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work

2016-10-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601473#comment-15601473
 ] 

Sean Owen commented on SPARK-18017:
---

Ah, try spark.hadoop.fs.s3n.block.size=...
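
For anyone who finds this later: properties prefixed with spark.hadoop. are 
copied into the SparkContext's Hadoop configuration, so they can be set on the 
SparkConf/SparkSession instead of mutating hadoopConfiguration afterwards. A 
sketch (the 512 MB value and app name are examples only):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3n-block-size-example")
  .config("spark.hadoop.fs.s3n.block.size", (512L * 1024 * 1024).toString)  // example value
  .getOrCreate()

// The prefixed property is visible in the Hadoop configuration used by the readers:
println(spark.sparkContext.hadoopConfiguration.get("fs.s3n.block.size"))
{code}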

> Changing Hadoop parameter through 
> sparkSession.sparkContext.hadoopConfiguration doesn't work
> 
>
> Key: SPARK-18017
> URL: https://issues.apache.org/jira/browse/SPARK-18017
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: Scala version 2.11.8; Java 1.8.0_91; 
> com.databricks:spark-csv_2.11:1.2.0
>Reporter: Yuehua Zhang
>
> My Spark job tries to read csv files on S3. I need to control the number of 
> partitions created so I set Hadoop parameter fs.s3n.block.size. However, it 
> stopped working after we upgrade Spark from 1.6.1 to 2.0.0. Not sure if it is 
> related to https://issues.apache.org/jira/browse/SPARK-15991. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2016-10-24 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601455#comment-15601455
 ] 

Herman van Hovell commented on SPARK-18067:
---

[~tejasp] This makes sense to me. However, there are a few potential problems:
- You generally have a better chance of getting nicely distributed data if you 
hash by multiple values. If the {{key}} in your example has a relatively low 
cardinality, we can hit significant performance problems and OOMs if we need to 
buffer a lot of rows.
- I am pretty sure this will break {{outputPartitioning/requiredChildDistribution}}. 
This would allow EnsureRequirements to give us a different distribution than we 
have asked for. This can be extremely problematic in the case of shuffle joins, 
since we need to make sure that both the left and the right relation have exactly 
the same distribution.

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if use a filter statement that can't be pushed 
> in the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


