[jira] [Commented] (SPARK-14914) Test Cases fail on Windows
[ https://issues.apache.org/jira/browse/SPARK-14914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604304#comment-15604304 ] Apache Spark commented on SPARK-14914: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/15618 > Test Cases fail on Windows > -- > > Key: SPARK-14914 > URL: https://issues.apache.org/jira/browse/SPARK-14914 > Project: Spark > Issue Type: Test >Reporter: Tao LI >Assignee: Tao LI > Fix For: 2.1.0 > > > There are lots of test failures on Windows, mainly for the following reasons: > 1. Resources (e.g., temp files) haven't been properly closed after use, so an > IOException is raised while deleting the temp directory. > 2. The command line is too long on Windows. > 3. File path problems caused by the drive label and backslash in absolute > Windows paths. > 4. Plain Java bugs on Windows: >a. setReadable doesn't work on Windows for directories. >b. setWritable doesn't work on Windows. >c. A memory-mapped file can't be read at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
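Failure class 1 above (unclosed resources breaking temp-directory deletion) comes down to ordering: on Windows, a file with an open handle cannot be deleted, so a temp directory containing it cannot be removed either. A minimal, hypothetical Python sketch of the correct pattern (close first, then delete); on the JVM the analogous fix is try-with-resources or try/finally around the streams before deleting the directory:

```python
import os
import tempfile

# Create a temp directory and a file inside it, as many of the tests do.
tmp_dir = tempfile.mkdtemp(prefix="spark-test-")
tmp_file = os.path.join(tmp_dir, "data.txt")

# Close the handle before deleting: on Windows, removing a file that still
# has an open handle fails, which is what makes the temp-directory cleanup
# raise an IOException in the failing tests. The `with` block guarantees
# the handle is released before the deletes below run.
with open(tmp_file, "w") as f:
    f.write("hello")

os.remove(tmp_file)
os.rmdir(tmp_dir)
```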
[jira] [Commented] (SPARK-17984) Add support for numa aware feature
[ https://issues.apache.org/jira/browse/SPARK-17984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604300#comment-15604300 ] Apache Spark commented on SPARK-17984: -- User 'sheepduke' has created a pull request for this issue: https://github.com/apache/spark/pull/15579 > Add support for numa aware feature > -- > > Key: SPARK-17984 > URL: https://issues.apache.org/jira/browse/SPARK-17984 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos, YARN >Affects Versions: 2.0.1 > Environment: Cluster Topo: 1 Master + 4 Slaves > CPU: Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz(72 Cores) > Memory: 128GB(2 NUMA Nodes) > SW Version: Hadoop-5.7.0 + Spark-2.0.0 >Reporter: quanfuwang > Original Estimate: 672h > Remaining Estimate: 672h > > This Jira targets adding support for a NUMA-aware feature, which can help improve > performance by making cores access local memory rather than remote memory. > A patch is being developed; see https://github.com/apache/spark/pull/15524. > The whole task includes 3 subtasks and will be developed iteratively: > NUMA-aware support for YARN-based deployment mode > NUMA-aware support for Mesos-based deployment mode > NUMA-aware support for Standalone-based deployment mode
[jira] [Updated] (SPARK-18088) ChiSqSelector FPR PR cleanups
[ https://issues.apache.org/jira/browse/SPARK-18088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18088: -- Description: There are several cleanups I'd like to make as a follow-up to the PRs from [SPARK-17017]: * Clarify FPR, alpha, p-value relationship in docs and param naming ** I'd like to remove "alpha" since it is such a generic name. * Rename selectorType values to match corresponding Params * Add Since tags where missing * a few minor cleanups was: There are several cleanups I'd like to make as a follow-up to the PRs from [SPARK-17017]: * Clarify FPR, alpha, p-value relationship in docs and param naming * Rename selectorType values to match corresponding Params * Add Since tags where missing * a few minor cleanups > ChiSqSelector FPR PR cleanups > - > > Key: SPARK-18088 > URL: https://issues.apache.org/jira/browse/SPARK-18088 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Minor > > There are several cleanups I'd like to make as a follow-up to the PRs from > [SPARK-17017]: > * Clarify FPR, alpha, p-value relationship in docs and param naming > ** I'd like to remove "alpha" since it is such a generic name. > * Rename selectorType values to match corresponding Params > * Add Since tags where missing > * a few minor cleanups -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18088) ChiSqSelector FPR PR cleanups
Joseph K. Bradley created SPARK-18088: - Summary: ChiSqSelector FPR PR cleanups Key: SPARK-18088 URL: https://issues.apache.org/jira/browse/SPARK-18088 Project: Spark Issue Type: Improvement Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor There are several cleanups I'd like to make as a follow-up to the PRs from [SPARK-17017]: * Clarify FPR, alpha, p-value relationship in docs and param naming * Rename selectorType values to match corresponding Params * Add Since tags where missing * a few minor cleanups
[jira] [Updated] (SPARK-18019) Log instrumentation in GBTs
[ https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18019: -- Assignee: Seth Hendrickson > Log instrumentation in GBTs > --- > > Key: SPARK-18019 > URL: https://issues.apache.org/jira/browse/SPARK-18019 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > Sub-task for adding instrumentation to GBTs.
[jira] [Updated] (SPARK-18019) Log instrumentation in GBTs
[ https://issues.apache.org/jira/browse/SPARK-18019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18019: -- Shepherd: Joseph K. Bradley (was: Timothy Hunter) > Log instrumentation in GBTs > --- > > Key: SPARK-18019 > URL: https://issues.apache.org/jira/browse/SPARK-18019 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > Sub-task for adding instrumentation to GBTs.
[jira] [Updated] (SPARK-18078) Add option for customize zipPartition task preferred locations
[ https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-18078: --- Priority: Minor (was: Major) > Add option for customize zipPartition task preferred locations > -- > > Key: SPARK-18078 > URL: https://issues.apache.org/jira/browse/SPARK-18078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Weichen Xu >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > The `RDD.zipPartitions` task preferred-locations strategy uses the > intersection of the corresponding zipped partitions' locations; if the > intersection is empty, it uses the union of those locations. > But in special cases, I want to customize the task preferred locations for > better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* > operator: a distributed matrix (DMatrix) multiplying a distributed > vector (DVector), which uses RDD.zipPartitions (the DMatrix and DVector RDDs must be > partitioned in the same way beforehand). > https://github.com/databricks/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala > Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want > the zipPartitions task to always be located with the `DMatrix` partition; this gives > better data locality than the default preferred-location strategy. > I think it makes sense to add an option for this. >
[jira] [Updated] (SPARK-18078) Add option for customize zipPartition task preferred locations
[ https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-18078: --- Description: The `RDD.zipPartitions` task preferred-locations strategy uses the intersection of the corresponding zipped partitions' locations; if the intersection is empty, it uses the union of those locations. But in special cases, I want to customize the task preferred locations for better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* operator: a distributed matrix (DMatrix) multiplying a distributed vector (DVector), which uses RDD.zipPartitions (the DMatrix and DVector RDDs must be partitioned in the same way beforehand). https://github.com/databricks/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want the zipPartitions task to always be located with the `DMatrix` partition; this gives better data locality than the default preferred-location strategy. I think it makes sense to add an option for this. was: The `RDD.zipPartitions` task preferred-locations strategy uses the intersection of the corresponding zipped partitions' locations; if the intersection is empty, it uses the union of those locations. But in special cases, I want to customize the task preferred locations for better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* operator: a distributed matrix (DMatrix) multiplying a distributed vector (DVector), which uses RDD.zipPartitions (the DMatrix and DVector RDDs must be partitioned in the same way beforehand). https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want the zipPartitions task to always be located with the `DMatrix` partition;
this gives better data locality than the default preferred-location strategy. I think it makes sense to add an option for this. > Add option for customize zipPartition task preferred locations > -- > > Key: SPARK-18078 > URL: https://issues.apache.org/jira/browse/SPARK-18078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The `RDD.zipPartitions` task preferred-locations strategy uses the > intersection of the corresponding zipped partitions' locations; if the > intersection is empty, it uses the union of those locations. > But in special cases, I want to customize the task preferred locations for > better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* > operator: a distributed matrix (DMatrix) multiplying a distributed > vector (DVector), which uses RDD.zipPartitions (the DMatrix and DVector RDDs must be > partitioned in the same way beforehand). > https://github.com/databricks/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala > Usually the `DMatrix` RDD will be much larger than the `DVector` RDD, so we want > the zipPartitions task to always be located with the `DMatrix` partition; this gives > better data locality than the default preferred-location strategy. > I think it makes sense to add an option for this. >
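For reference, the default placement rule described above (intersection of the zipped partitions' locations, falling back to their union when the intersection is empty) can be sketched in a few lines of plain Python; the function name is illustrative, not Spark API:

```python
def zip_preferred_locations(locs_a, locs_b):
    """Default zipPartitions placement: intersection if non-empty, else union."""
    inter = set(locs_a) & set(locs_b)
    # Sorted output just to make the result deterministic for this sketch.
    return sorted(inter) if inter else sorted(set(locs_a) | set(locs_b))

# Shared host -> the intersection wins.
print(zip_preferred_locations(["host1", "host2"], ["host2", "host3"]))  # ['host2']
# No overlap -> fall back to the union of both partitions' locations.
print(zip_preferred_locations(["host1"], ["host4"]))  # ['host1', 'host4']
```

The requested option would let a caller bypass this rule entirely, e.g. always returning the locations of the larger (`DMatrix`) side.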
[jira] [Updated] (SPARK-17183) put hive serde table schema to table properties like data source table
[ https://issues.apache.org/jira/browse/SPARK-17183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17183: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > put hive serde table schema to table properties like data source table > -- > > Key: SPARK-17183 > URL: https://issues.apache.org/jira/browse/SPARK-17183 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Created] (SPARK-18087) Optimize insert to not require REPAIR TABLE
Eric Liang created SPARK-18087: -- Summary: Optimize insert to not require REPAIR TABLE Key: SPARK-18087 URL: https://issues.apache.org/jira/browse/SPARK-18087 Project: Spark Issue Type: Sub-task Reporter: Eric Liang
[jira] [Updated] (SPARK-18026) should not always lowercase partition columns of partition spec in parser
[ https://issues.apache.org/jira/browse/SPARK-18026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-18026: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-17861 > should not always lowercase partition columns of partition spec in parser > - > > Key: SPARK-18026 > URL: https://issues.apache.org/jira/browse/SPARK-18026 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Updated] (SPARK-17970) Use metastore for managing filesource table partitions as well
[ https://issues.apache.org/jira/browse/SPARK-17970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Liang updated SPARK-17970: --- Summary: Use metastore for managing filesource table partitions as well (was: store partition spec in metastore for data source table) > Use metastore for managing filesource table partitions as well > -- > > Key: SPARK-17970 > URL: https://issues.apache.org/jira/browse/SPARK-17970 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan >
[jira] [Commented] (SPARK-17894) Ensure uniqueness of TaskSetManager name
[ https://issues.apache.org/jira/browse/SPARK-17894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603860#comment-15603860 ] Apache Spark commented on SPARK-17894: -- User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/15617 > Ensure uniqueness of TaskSetManager name > > > Key: SPARK-17894 > URL: https://issues.apache.org/jira/browse/SPARK-17894 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari > Fix For: 2.1.0 > > > TaskSetManager should have a unique name to avoid adding duplicate ones to the > *Pool* via *SchedulableBuilder*. This problem surfaced with > https://issues.apache.org/jira/browse/SPARK-17759; please find the discussion here: > https://github.com/apache/spark/pull/15326 > *Proposal*: > There is a one-to-one relationship between stage attempt id and TaskSetManager, so > taskSet.Id, which covers both stageId and stageAttemptId, can be used for the > TaskSetManager name as well. > *Current TaskSetManager Name*: > {code:java} var name = "TaskSet_" + taskSet.stageId.toString{code} > *Sample*: TaskSet_0 > *Proposed TaskSetManager Name*: > {code:java} var name = "TaskSet_" + taskSet.Id (stageId + "." + > stageAttemptId) {code} > *Sample*: TaskSet_0.0 > cc [~kayousterhout] [~markhamstra]
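The proposed name format can be sketched as follows (a small Python illustration of the string format only; the real change would live in the Scala scheduler code, and the function name here is hypothetical):

```python
def task_set_name(stage_id, stage_attempt_id):
    # Proposed format: include the attempt id so that a retried stage
    # attempt never collides with an earlier attempt's TaskSetManager name.
    return "TaskSet_{}.{}".format(stage_id, stage_attempt_id)

print(task_set_name(0, 0))  # TaskSet_0.0 -- first attempt of stage 0
print(task_set_name(0, 1))  # TaskSet_0.1 -- the retry gets a distinct name
```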
[jira] [Resolved] (SPARK-18028) simplify TableFileCatalog
[ https://issues.apache.org/jira/browse/SPARK-18028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-18028. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15568 [https://github.com/apache/spark/pull/15568] > simplify TableFileCatalog > - > > Key: SPARK-18028 > URL: https://issues.apache.org/jira/browse/SPARK-18028 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > >
[jira] [Created] (SPARK-18086) Regression: Hive variables no longer work in Spark 2.0
Ryan Blue created SPARK-18086: - Summary: Regression: Hive variables no longer work in Spark 2.0 Key: SPARK-18086 URL: https://issues.apache.org/jira/browse/SPARK-18086 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Ryan Blue The behavior of variables in the SQL shell has changed from 1.6 to 2.0. Specifically, --hivevar name=value and {{SET hivevar:name=value}} no longer work. Queries that worked correctly in 1.6 will either fail or produce unexpected results in 2.0, so I think this is a regression that should be addressed. Hive and Spark 1.6 work like this: 1. The command-line args --hiveconf and --hivevar can be used to set session properties; --hiveconf properties are added to the Hadoop Configuration. 2. {{SET name=value}} adds a Hive Configuration property, and {{SET hivevar:name=value}} adds a Hive var. 3. Hive vars can be substituted into queries by name, and Configuration properties can be substituted using {{hiveconf:name}}. In 2.0, the hiveconf, sparkconf, and conf variable prefixes are all removed, then the value in SQLConf for the rest of the key is returned. SET adds properties to the session config and (according to [a comment|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RuntimeConfig.scala#L28]) the Hadoop configuration "during I/O".
{code:title=Hive and Spark 1.6.1 behavior} [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 spark-sql> select "${hiveconf:test.conf}"; 1 spark-sql> select "${test.conf}"; ${test.conf} spark-sql> select "${hivevar:test.var}"; 2 spark-sql> select "${test.var}"; 2 spark-sql> set test.set=3; SET test.set=3 spark-sql> select "${test.set}" "${test.set}" spark-sql> select "${hivevar:test.set}" "${hivevar:test.set}" spark-sql> select "${hiveconf:test.set}" 3 spark-sql> set hivevar:test.setvar=4; SET hivevar:test.setvar=4 spark-sql> select "${hivevar:test.setvar}"; 4 spark-sql> select "${test.setvar}"; 4 {code} {code:title=Spark 2.0.0 behavior} [user@host:~]: spark-sql --hiveconf test.conf=1 --hivevar test.var=2 spark-sql> select "${hiveconf:test.conf}"; 1 spark-sql> select "${test.conf}"; 1 spark-sql> select "${hivevar:test.var}"; ${hivevar:test.var} spark-sql> select "${test.var}"; ${test.var} spark-sql> set test.set=3; test.set3 spark-sql> select "${test.set}"; 3 spark-sql> set hivevar:test.setvar=4; hivevar:test.setvar 4 spark-sql> select "${hivevar:test.setvar}"; 4 spark-sql> select "${test.setvar}"; ${test.setvar} {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
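To make the 1.6 rules in the first transcript concrete, here is a small, hypothetical Python sketch of the substitution behavior (not the actual Hive/Spark implementation): an explicit {{hiveconf:}} or {{hivevar:}} prefix selects its own namespace, a bare name resolves against hivevars only, and unresolved references are left in the query text untouched.

```python
import re

def substitute(text, hiveconf, hivevar):
    # Hypothetical sketch of the 1.6-era lookup rules shown in the first
    # transcript; not the real implementation in Hive or Spark.
    def repl(match):
        key = match.group(1)
        if key.startswith("hiveconf:"):
            value = hiveconf.get(key[len("hiveconf:"):])
        elif key.startswith("hivevar:"):
            value = hivevar.get(key[len("hivevar:"):])
        else:
            # A bare ${name} resolves against hivevars only.
            value = hivevar.get(key)
        # Unresolved references stay literal, as in the transcript.
        return match.group(0) if value is None else value
    return re.sub(r"\$\{([^}]+)\}", repl, text)

conf = {"test.conf": "1"}
var = {"test.var": "2"}
print(substitute("${hiveconf:test.conf}", conf, var))  # 1
print(substitute("${test.var}", conf, var))            # 2
print(substitute("${test.conf}", conf, var))           # ${test.conf}
```

The 2.0 transcript shows the opposite resolution order: conf keys resolve bare while hivevars do not, which is the regression being reported.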
[jira] [Resolved] (SPARK-17624) Flaky test? StateStoreSuite maintenance
[ https://issues.apache.org/jira/browse/SPARK-17624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-17624. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 > Flaky test? StateStoreSuite maintenance > --- > > Key: SPARK-17624 > URL: https://issues.apache.org/jira/browse/SPARK-17624 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.0.1, 2.1.0 >Reporter: Adam Roberts >Assignee: Tathagata Das >Priority: Minor > Fix For: 2.0.2, 2.1.0 > > > I've noticed this test failing consistently (25x in a row) with a two core > machine but not on an eight core machine > If we increase the spark.rpc.numRetries value used in the test from 1 to 2 (3 > being the default in Spark), the test reliably passes, we can also gain > reliability by setting the master to be anything other than just local. > Is there a reason spark.rpc.numRetries is set to be 1? > I see this failure is also mentioned here so it's been flaky for a while > http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-2-0-0-RC5-td18367.html > If we run without the "quietly" code so we get debug info: > {code} > 16:26:15.213 WARN org.apache.spark.rpc.netty.NettyRpcEndpointRef: Error > sending message [message = > VerifyIfInstanceActive(StateStoreId(/home/aroberts/Spark-DK/sql/core/target/tmp/spark-cc44f5fa-b675-426f-9440-76785c365507/ૺꎖ鮎衲넅-28e9196f-8b2d-43ba-8421-44a5c5e98ceb,0,0),driver)] > in 1 attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at 
scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.sql.execution.streaming.state.StateStoreCoordinatorRef.verifyIfInstanceActive(StateStoreCoordinator.scala:91) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$3.apply(StateStore.scala:227) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$verifyIfStoreInstanceActive(StateStore.scala:227) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:199) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anonfun$org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance$2.apply(StateStore.scala:197) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.sql.execution.streaming.state.StateStore$.org$apache$spark$sql$execution$streaming$state$StateStore$$doMaintenance(StateStore.scala:197) > at > org.apache.spark.sql.execution.streaming.state.StateStore$$anon$1.run(StateStore.scala:180) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:522) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:319) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:191) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.lang.Thread.run(Thread.java:785) > Caused by: org.apache.spark.SparkException: Could not find > StateStoreCoordinator. > at > org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:154) > at > org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:129) > at
[jira] [Commented] (SPARK-14300) Scala MLlib examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603677#comment-15603677 ] Joseph K. Bradley commented on SPARK-14300: --- [~yinxusen] It looks like quite a few of these are not duplicates. Can you please check again and update the JIRA description? Thanks! > Scala MLlib examples code merge and clean up > > > Key: SPARK-14300 > URL: https://issues.apache.org/jira/browse/SPARK-14300 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/mllib: > * scala/mllib > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * Unsure code duplications (need double check) > ** AbstractParams.scala > ** BinaryClassification.scala > ** Correlations.scala > ** CosineSimilarity.scala > ** DenseGaussianMixture.scala > ** FPGrowthExample.scala > ** MovieLensALS.scala > ** MultivariateSummarizer.scala > ** RandomRDDGeneration.scala > ** SampledRDDs.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Updated] (SPARK-14300) Scala MLlib examples code merge and clean up
[ https://issues.apache.org/jira/browse/SPARK-14300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14300: -- Shepherd: Joseph K. Bradley > Scala MLlib examples code merge and clean up > > > Key: SPARK-14300 > URL: https://issues.apache.org/jira/browse/SPARK-14300 > Project: Spark > Issue Type: Sub-task > Components: Examples >Reporter: Xusen Yin >Priority: Minor > Labels: starter > > Duplicated code that I found in scala/examples/mllib: > * scala/mllib > ** DecisionTreeRunner.scala > ** DenseGaussianMixture.scala > ** DenseKMeans.scala > ** GradientBoostedTreesRunner.scala > ** LDAExample.scala > ** LinearRegression.scala > ** SparseNaiveBayes.scala > ** StreamingLinearRegression.scala > ** StreamingLogisticRegression.scala > ** TallSkinnyPCA.scala > ** TallSkinnySVD.scala > * Unsure code duplications (need double check) > ** AbstractParams.scala > ** BinaryClassification.scala > ** Correlations.scala > ** CosineSimilarity.scala > ** DenseGaussianMixture.scala > ** FPGrowthExample.scala > ** MovieLensALS.scala > ** MultivariateSummarizer.scala > ** RandomRDDGeneration.scala > ** SampledRDDs.scala > When merging and cleaning that code, be sure not to disturb the previous > example on and off blocks.
[jira] [Updated] (SPARK-17950) Match SparseVector behavior with DenseVector
[ https://issues.apache.org/jira/browse/SPARK-17950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] AbderRahman Sobh updated SPARK-17950: - Description: What changes were proposed in this pull request? Simply added to SparseVector the __getattr__ that DenseVector has, but it delegates to a SciPy sparse representation instead of storing a vector all the time in self.array. This allows functions to be used on the values of an entire SparseVector in the same direct way that users interact with DenseVectors; i.e., you can simply call SparseVector.mean() to average the values in the entire vector. Note: the functions do have a slight bit of variance due to calling SciPy and not NumPy. However, the majority of useful functions (sums, means, max, etc.) are available in both packages anyway. How was this patch tested? Manual testing on a local machine. Passed ./python/run-tests. No UI changes. was: Simply added to SparseVector the `__getattr__` that DenseVector has, but it calls self.toArray() instead of storing a vector all the time in self.array. This allows NumPy functions to be used on the values of a SparseVector in the same direct way that users interact with DenseVectors; i.e., you can simply call SparseVector.mean() to average the values in the entire vector. Component/s: ML > Match SparseVector behavior with DenseVector > > > Key: SPARK-17950 > URL: https://issues.apache.org/jira/browse/SPARK-17950 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, PySpark >Affects Versions: 2.0.1 >Reporter: AbderRahman Sobh >Priority: Minor > Original Estimate: 0h > Remaining Estimate: 0h > > What changes were proposed in this pull request? > Simply added to SparseVector the __getattr__ that DenseVector has, but it > delegates to a SciPy sparse representation instead of storing a vector all the time in > self.array. > This allows functions to be used on the values of an entire SparseVector in > the same direct way that users interact with DenseVectors; > i.e.,
you can simply call SparseVector.mean() to average the values in the > entire vector. > Note: the functions do have a slight bit of variance due to calling SciPy and > not NumPy. However, the majority of useful functions (sums, means, max, etc.) > are available in both packages anyway. > How was this patch tested? > Manual testing on a local machine. > Passed ./python/run-tests. > No UI changes.
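The delegation pattern being described can be sketched with a simplified, self-contained stand-in (not PySpark's actual SparseVector, and it densifies into a tiny helper class instead of a SciPy/NumPy array, purely to show the __getattr__ forwarding):

```python
class DenseView:
    """Minimal dense stand-in exposing a couple of reductions."""
    def __init__(self, values):
        self.values = values

    def mean(self):
        return sum(self.values) / len(self.values)

    def max(self):
        return max(self.values)


class SparseVector:
    def __init__(self, size, indices, values):
        self.size = size
        self.indices = list(indices)
        self.values = list(values)

    def to_array(self):
        # Densify: place each stored value at its index, zeros elsewhere.
        dense = [0.0] * self.size
        for i, v in zip(self.indices, self.values):
            dense[i] = v
        return DenseView(dense)

    def __getattr__(self, name):
        # Unknown attribute lookups (mean, max, ...) fall through to the
        # densified representation, mirroring the behavior described above.
        return getattr(self.to_array(), name)


v = SparseVector(4, [1, 3], [2.0, 6.0])
print(v.mean())  # 2.0: (0 + 2 + 0 + 6) / 4
print(v.max())   # 6.0
```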
[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603596#comment-15603596 ] Marcelo Vanzin commented on SPARK-18085: Finally, I actually wrote some code for the first two milestones described in the document, to show what this could look like. The code is at: https://github.com/vanzin/spark/tree/shs-ng/M2 https://github.com/vanzin/spark/tree/shs-ng/M1 (M2 actually has all the changes in M1, this is just to follow the implementation path described in the document.) > Scalability enhancements for the History Server > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them.
[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603588#comment-15603588 ] Marcelo Vanzin commented on SPARK-18085: Pinging a few people who've worked / complained about the history server in the past: [~tgraves] [~andrewor14] [~steve_l] [~ajbozarth] > Scalability enhancements for the History Server > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them.
[jira] [Commented] (SPARK-18085) Scalability enhancements for the History Server
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603585#comment-15603585 ] Marcelo Vanzin commented on SPARK-18085: Also, I'm not sure if we're labeling things as "enhancement proposals" yet, but given the scope of this work, this would be a good candidate for that. > Scalability enhancements for the History Server > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18085) Scalability enhancements for the History Server
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-18085: --- Attachment: spark_hs_next_gen.pdf Here's an initial document to bootstrap the discussion of how to fix the History Server. It discusses the current issues and proposes solutions to them, providing an implementation path that will hopefully allow for finer-grained reviews with as little disruption to functionality as possible while the work is under review. I could also share a google doc link if people prefer (although I believe the ASF likes documents to be stored on their servers). I think there are two points that might be a little controversial, so I'll list them up front: - use of a JNI library for data storage: this is not a requirement, but there were no good pure Java alternatives I could find. The functionality is not dependent on the actual library, though, so we could switch the library if desired. - move of Spark UI code to a separate module: I think this is good both as a way to isolate these changes, and for future maintainability; it makes it clearer that the UI is sort of a separate component, and that there should be a clearer separation between the code in core and the ui. > Scalability enhancements for the History Server > --- > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Attachments: spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. 
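The general shape of the first point above (moving per-application UI state off the heap into a local store) can be illustrated with a tiny stand-in. The sketch below is plain Python over stdlib sqlite3, not the JNI library or any Spark code from the proposal; the class name, schema, and key layout are invented for illustration only.

```python
import sqlite3

# Illustrative stand-in for the proposal's idea: parsed application state
# lives in a local key-value store and is looked up lazily per request,
# instead of being kept resident in the History Server's memory.
class DiskBackedAppStore:
    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS kv ("
            "app_id TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (app_id, key))"
        )

    def put(self, app_id, key, value):
        self.db.execute(
            "INSERT OR REPLACE INTO kv VALUES (?, ?, ?)",
            (app_id, key, value),
        )

    def get(self, app_id, key):
        row = self.db.execute(
            "SELECT value FROM kv WHERE app_id = ? AND key = ?",
            (app_id, key),
        ).fetchone()
        return row[0] if row else None

store = DiskBackedAppStore()
store.put("app-20161024-0001", "jobs/0/status", "SUCCEEDED")
print(store.get("app-20161024-0001", "jobs/0/status"))  # SUCCEEDED
```

With a disk-backed store like this, only the entries a page actually requests are loaded, which is the scalability property the document is after; the real design additionally has to replay event logs to populate the store.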
[jira] [Created] (SPARK-18085) Scalability enhancements for the History Server
Marcelo Vanzin created SPARK-18085: -- Summary: Scalability enhancements for the History Server Key: SPARK-18085 URL: https://issues.apache.org/jira/browse/SPARK-18085 Project: Spark Issue Type: Umbrella Components: Spark Core, Web UI Affects Versions: 2.0.0 Reporter: Marcelo Vanzin It's a known fact that the History Server currently has some annoying issues when serving lots of applications, and when serving large applications. I'm filing this umbrella to track work related to addressing those issues. I'll be attaching a document shortly describing the issues and suggesting a path to how to solve them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11375) History Server "no histories" message to be dynamically generated by ApplicationHistoryProviders
[ https://issues.apache.org/jira/browse/SPARK-11375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-11375. Resolution: Duplicate Looks like a dupe. > History Server "no histories" message to be dynamically generated by > ApplicationHistoryProviders > > > Key: SPARK-11375 > URL: https://issues.apache.org/jira/browse/SPARK-11375 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Steve Loughran >Priority: Minor > > When there are no histories, the {{HistoryPage}} displays an error text which > assumes that the provider is the {{FsHistoryProvider}}, and its sole failure > mode is "directory not found" > {code} > Did you specify the correct logging directory? > Please verify your setting of spark.history.fs.logDirectory > {code} > Different providers have different failure modes, and even the filesystem > provider has some, such as an access control exception, or the specified > directly path actually being a file. > If the {{ApplicationHistoryProvider}} was itself asked to provide an error > message, then it could > * be dynamically generated to show the current state of the history provider > * potentially include any exceptions to list > * display the actual values of settings such as the log directory property. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17894) Ensure uniqueness of TaskSetManager name
[ https://issues.apache.org/jira/browse/SPARK-17894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-17894. Resolution: Fixed Fix Version/s: 2.1.0 Resolved by https://github.com/apache/spark/pull/15463 > Ensure uniqueness of TaskSetManager name > > > Key: SPARK-17894 > URL: https://issues.apache.org/jira/browse/SPARK-17894 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari > Fix For: 2.1.0 > > > TaskSetManager should have unique name to avoid adding duplicate ones to > *Pool* via *SchedulableBuilder*. This problem surfaced with > https://issues.apache.org/jira/browse/SPARK-17759 and please find discussion: > https://github.com/apache/spark/pull/15326 > *Proposal* : > There is 1x1 relationship between Stage Attempt Id and TaskSetManager so > taskSet.Id covering both stageId and stageAttemptId looks to be used for > TaskSetManager as well. > *Current TaskSetManager Name* : > {code:java} var name = "TaskSet_" + taskSet.stageId.toString{code} > *Sample*: TaskSet_0 > *Proposed TaskSetManager Name* : > {code:java} var name = "TaskSet_" + taskSet.Id (stageId + "." + > stageAttemptId) {code} > *Sample* : TaskSet_0.0 > cc [~kayousterhout] [~markhamstra] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
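The collision the proposal fixes can be shown with a small sketch (plain Python, not Spark's Scala code): a retried stage produces a second TaskSetManager with the same stageId but a new attempt id, so a name built from stageId alone is duplicated, while the proposed stageId.attemptId name stays unique.

```python
# Why "TaskSet_<stageId>" collides but "TaskSet_<stageId>.<attemptId>" does not.
def name_by_stage(stage_id, attempt_id):
    return "TaskSet_" + str(stage_id)

def name_by_stage_and_attempt(stage_id, attempt_id):
    return "TaskSet_{}.{}".format(stage_id, attempt_id)

# Stage 0 run twice, e.g. attempt 1 after a fetch failure re-submits the stage.
attempts = [(0, 0), (0, 1)]

old_names = {name_by_stage(s, a) for s, a in attempts}
new_names = {name_by_stage_and_attempt(s, a) for s, a in attempts}

print(len(old_names))  # 1: both attempts map to "TaskSet_0"
print(len(new_names))  # 2: "TaskSet_0.0" and "TaskSet_0.1"
```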
[jira] [Updated] (SPARK-17894) Ensure uniqueness of TaskSetManager name
[ https://issues.apache.org/jira/browse/SPARK-17894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-17894: --- Assignee: Eren Avsarogullari > Ensure uniqueness of TaskSetManager name > > > Key: SPARK-17894 > URL: https://issues.apache.org/jira/browse/SPARK-17894 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari > > TaskSetManager should have unique name to avoid adding duplicate ones to > *Pool* via *SchedulableBuilder*. This problem surfaced with > https://issues.apache.org/jira/browse/SPARK-17759 and please find discussion: > https://github.com/apache/spark/pull/15326 > *Proposal* : > There is 1x1 relationship between Stage Attempt Id and TaskSetManager so > taskSet.Id covering both stageId and stageAttemptId looks to be used for > TaskSetManager as well. > *Current TaskSetManager Name* : > {code:java} var name = "TaskSet_" + taskSet.stageId.toString{code} > *Sample*: TaskSet_0 > *Proposed TaskSetManager Name* : > {code:java} var name = "TaskSet_" + taskSet.Id (stageId + "." + > stageAttemptId) {code} > *Sample* : TaskSet_0.0 > cc [~kayousterhout] [~markhamstra] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access
[ https://issues.apache.org/jira/browse/SPARK-18084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicholas Chammas updated SPARK-18084: - Issue Type: Bug (was: Improvement) > write.partitionBy() does not recognize nested columns that select() can access > -- > > Key: SPARK-18084 > URL: https://issues.apache.org/jira/browse/SPARK-18084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Nicholas Chammas >Priority: Minor > > Here's a simple repro in the PySpark shell: > {code} > from pyspark.sql import Row > rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > df = spark.createDataFrame(rdd) > df.printSchema() > df.select('a.b').show() # works > df.write.partitionBy('a.b').text('/tmp/test') # doesn't work > {code} > Here's what I see when I run this: > {code} > >>> from pyspark.sql import Row > >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) > >>> df = spark.createDataFrame(rdd) > >>> df.printSchema() > root > |-- a: struct (nullable = true) > ||-- b: long (nullable = true) > >>> df.show() > +---+ > | a| > +---+ > |[5]| > +---+ > >>> df.select('a.b').show() > +---+ > | b| > +---+ > | 5| > +---+ > >>> df.write.partitionBy('a.b').text('/tmp/test') > Traceback (most recent call last): > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", > line 319, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o233.text. 
> : org.apache.spark.sql.AnalysisException: Partition column a.b not found in > schema > StructType(StructField(a,StructType(StructField(b,LongType,true)),true)); > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.AbstractTraversable.map(Traversable.scala:104) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366) > at > org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:214) > at java.lang.Thread.run(Thread.java:745) > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "", line 1, in > File > "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py", > line 656, in text > self._jwrite.text(path) > File >
[jira] [Created] (SPARK-18084) write.partitionBy() does not recognize nested columns that select() can access
Nicholas Chammas created SPARK-18084: Summary: write.partitionBy() does not recognize nested columns that select() can access Key: SPARK-18084 URL: https://issues.apache.org/jira/browse/SPARK-18084 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.1, 2.0.0 Reporter: Nicholas Chammas Priority: Minor Here's a simple repro in the PySpark shell: {code} from pyspark.sql import Row rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) df = spark.createDataFrame(rdd) df.printSchema() df.select('a.b').show() # works df.write.partitionBy('a.b').text('/tmp/test') # doesn't work {code} Here's what I see when I run this: {code} >>> from pyspark.sql import Row >>> rdd = spark.sparkContext.parallelize([Row(a=Row(b=5))]) >>> df = spark.createDataFrame(rdd) >>> df.printSchema() root |-- a: struct (nullable = true) ||-- b: long (nullable = true) >>> df.show() +---+ | a| +---+ |[5]| +---+ >>> df.select('a.b').show() +---+ | b| +---+ | 5| +---+ >>> df.write.partitionBy('a.b').text('/tmp/test') Traceback (most recent call last): File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o233.text. 
: org.apache.spark.sql.AnalysisException: Partition column a.b not found in schema StructType(StructField(a,StructType(StructField(b,LongType,true)),true)); at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1$$anonfun$apply$10.apply(PartitioningUtils.scala:368) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:367) at org.apache.spark.sql.execution.datasources.PartitioningUtils$$anonfun$partitionColumnsSchema$1.apply(PartitioningUtils.scala:366) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at scala.collection.AbstractTraversable.map(Traversable.scala:104) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.partitionColumnsSchema(PartitioningUtils.scala:366) at org.apache.spark.sql.execution.datasources.PartitioningUtils$.validatePartitionColumn(PartitioningUtils.scala:349) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:458) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) at org.apache.spark.sql.DataFrameWriter.text(DataFrameWriter.scala:534) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:280) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:214) at java.lang.Thread.run(Thread.java:745) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "", line 1, in File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/readwriter.py", line 656, in text self._jwrite.text(path) File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__ File "/usr/local/Cellar/apache-spark/2.0.1/libexec/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: 'Partition column a.b not
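As context for the failure above: partitionBy looks the given name up among top-level schema columns, while select resolves it as a field path, which is why 'a.b' works in one and not the other. A commonly suggested workaround (not verified against Spark here) is to promote the nested field to a top-level column first, roughly df.select(col('a.b').alias('b')).write.partitionBy('b'). The plain-Python sketch below, with invented helper names and no Spark dependency, mimics that flatten-then-partition step:

```python
from collections import defaultdict

# Rows shaped like the repro: each has a struct column "a" with field "b".
rows = [{"a": {"b": 5}}, {"a": {"b": 7}}, {"a": {"b": 5}}]

def flatten(row, path, out_name):
    # Resolve a dotted field path like "a.b" and expose it as a top-level key,
    # which is what aliasing col('a.b') to 'b' does at the DataFrame level.
    value = row
    for part in path.split("."):
        value = value[part]
    new_row = dict(row)
    new_row[out_name] = value
    return new_row

flat_rows = [flatten(r, "a.b", "b") for r in rows]

# partitionBy("b") can now see "b" as a real top-level column; grouping rows
# by it mirrors the directory-per-value layout partitionBy produces.
partitions = defaultdict(list)
for r in flat_rows:
    partitions[r["b"]].append(r)

print(sorted(partitions))  # [5, 7]
```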
[jira] [Commented] (SPARK-16827) Stop reporting spill metrics as shuffle metrics
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603249#comment-15603249 ] Apache Spark commented on SPARK-16827: -- User 'dreamworks007' has created a pull request for this issue: https://github.com/apache/spark/pull/15616 > Stop reporting spill metrics as shuffle metrics > --- > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia >Assignee: Brian Cho > Labels: performance > Fix For: 2.0.2, 2.1.0 > > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help. > PS - Even if the intermediate shuffle data size is huge, the job still > produces accurate output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12757) Use reference counting to prevent blocks from being evicted during reads
[ https://issues.apache.org/jira/browse/SPARK-12757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603211#comment-15603211 ] Nicholas Chammas commented on SPARK-12757: -- Just to link back, [~josephkb] is reporting that [this GraphFrames issue|https://github.com/graphframes/graphframes/issues/116] may be related to the work done here. > Use reference counting to prevent blocks from being evicted during reads > > > Key: SPARK-12757 > URL: https://issues.apache.org/jira/browse/SPARK-12757 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > As a pre-requisite to off-heap caching of blocks, we need a mechanism to > prevent pages / blocks from being evicted while they are being read. With > on-heap objects, evicting a block while it is being read merely leads to > memory-accounting problems (because we assume that an evicted block is a > candidate for garbage-collection, which will not be true during a read), but > with off-heap memory this will lead to either data corruption or segmentation > faults. > To address this, we should add a reference-counting mechanism to track which > blocks/pages are being read in order to prevent them from being evicted > prematurely. I propose to do this in two phases: first, add a safe, > conservative approach in which all BlockManager.get*() calls implicitly > increment the reference count of blocks and where tasks' references are > automatically freed upon task completion. This will be correct but may have > adverse performance impacts because it will prevent legitimate block > evictions. In phase two, we should incrementally add release() calls in order > to fix the eviction of unreferenced blocks. The latter change may need to > touch many different components, which is why I propose to do it separately > in order to make the changes easier to reason about and review. 
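The phase-one scheme the description lays out can be sketched in miniature (plain Python, not Spark's BlockManager; all names here are invented): every get() implicitly pins a block by bumping its reference count, release() unpins it, and eviction is only allowed once the count is back to zero.

```python
# Minimal pin-on-read cache: a block read in progress cannot be evicted.
class PinnedCache:
    def __init__(self):
        self.blocks = {}
        self.refs = {}

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.refs[block_id] = 0

    def get(self, block_id):
        self.refs[block_id] += 1  # implicit pin; caller must release()
        return self.blocks[block_id]

    def release(self, block_id):
        self.refs[block_id] -= 1  # in Spark, done automatically at task end

    def try_evict(self, block_id):
        if self.refs[block_id] > 0:
            return False  # a reader still holds a reference
        del self.blocks[block_id]
        del self.refs[block_id]
        return True

cache = PinnedCache()
cache.put("rdd_0_0", b"payload")
cache.get("rdd_0_0")
print(cache.try_evict("rdd_0_0"))  # False: block is pinned by a reader
cache.release("rdd_0_0")
print(cache.try_evict("rdd_0_0"))  # True: no outstanding references
```

This also shows why the description calls the conservative approach safe but potentially costly: an unreleased reference blocks a legitimate eviction, which is what the later release() calls of phase two are meant to fix.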
[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work
[ https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603181#comment-15603181 ] Yuehua Zhang commented on SPARK-18017: -- Yeah, that is what I did: "spark-submit --conf spark.hadoop.fs.s3n.block.size=524288000 ...". It did get rid of the non-Spark config warning, though. > Changing Hadoop parameter through > sparkSession.sparkContext.hadoopConfiguration doesn't work > > > Key: SPARK-18017 > URL: https://issues.apache.org/jira/browse/SPARK-18017 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Scala version 2.11.8; Java 1.8.0_91; > com.databricks:spark-csv_2.11:1.2.0 >Reporter: Yuehua Zhang > > My Spark job tries to read CSV files on S3. I need to control the number of > partitions created, so I set the Hadoop parameter fs.s3n.block.size. However, it > stopped working after we upgraded Spark from 1.6.1 to 2.0.0. Not sure if it is > related to https://issues.apache.org/jira/browse/SPARK-15991. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
Joseph K. Bradley created SPARK-18081: - Summary: Locality Sensitive Hashing (LSH) User Guide Key: SPARK-18081 URL: https://issues.apache.org/jira/browse/SPARK-18081 Project: Spark Issue Type: New Feature Components: Documentation, ML Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18081: -- Issue Type: Documentation (was: New Feature) > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18083) Locality Sensitive Hashing (LSH) - BitSampling
[ https://issues.apache.org/jira/browse/SPARK-18083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18083: -- Assignee: Yun Ni > Locality Sensitive Hashing (LSH) - BitSampling > -- > > Key: SPARK-18083 > URL: https://issues.apache.org/jira/browse/SPARK-18083 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni >Priority: Minor > > See linked JIRA for original LSH for details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5992: - Assignee: Yun Ni > Locality Sensitive Hashing (LSH) for MLlib > -- > > Key: SPARK-5992 > URL: https://issues.apache.org/jira/browse/SPARK-5992 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Yun Ni > > Locality Sensitive Hashing (LSH) would be very useful for ML. It would be > great to discuss some possible algorithms here, choose an API, and make a PR > for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18082) Locality Sensitive Hashing (LSH) - SignRandomProjection
[ https://issues.apache.org/jira/browse/SPARK-18082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18082: -- Assignee: Yun Ni > Locality Sensitive Hashing (LSH) - SignRandomProjection > --- > > Key: SPARK-18082 > URL: https://issues.apache.org/jira/browse/SPARK-18082 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni >Priority: Minor > > See linked JIRA for original LSH for details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18081) Locality Sensitive Hashing (LSH) User Guide
[ https://issues.apache.org/jira/browse/SPARK-18081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18081: -- Assignee: Yun Ni > Locality Sensitive Hashing (LSH) User Guide > --- > > Key: SPARK-18081 > URL: https://issues.apache.org/jira/browse/SPARK-18081 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yun Ni > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API
Joseph K. Bradley created SPARK-18080: - Summary: Locality Sensitive Hashing (LSH) Python API Key: SPARK-18080 URL: https://issues.apache.org/jira/browse/SPARK-18080 Project: Spark Issue Type: New Feature Reporter: Joseph K. Bradley -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction
[ https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603030#comment-15603030 ] Joseph K. Bradley commented on SPARK-7334: -- [~sebalf] I'm sorry we weren't able to get your PR in. I do appreciate your work on this! Looking back, I believe the functionality in this JIRA should be a subset of what is in the PR for [SPARK-5992], so I'll go ahead and close this JIRA issue. If you have time, feedback on the current PR for [SPARK-5992] would be very valuable. Thanks very much. > Implement RandomProjection for Dimensionality Reduction > --- > > Key: SPARK-7334 > URL: https://issues.apache.org/jira/browse/SPARK-7334 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Sebastian Alfers >Priority: Minor > > Implement RandomProjection (RP) for dimensionality reduction > RP is a popular approach to reduce the amount of data while preserving a > reasonable amount of information (pairwise distance) of you data [1][2] > - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf > - [2] > http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf > I compared different implementations of that algorithm: > - https://github.com/sebastian-alfers/random-projection-python -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction
[ https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7334: - Issue Type: New Feature (was: Improvement) > Implement RandomProjection for Dimensionality Reduction > --- > > Key: SPARK-7334 > URL: https://issues.apache.org/jira/browse/SPARK-7334 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Sebastian Alfers >Priority: Minor > > Implement RandomProjection (RP) for dimensionality reduction > RP is a popular approach to reduce the amount of data while preserving a > reasonable amount of information (pairwise distance) of you data [1][2] > - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf > - [2] > http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf > I compared different implementations of that algorithm: > - https://github.com/sebastian-alfers/random-projection-python -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18080) Locality Sensitive Hashing (LSH) Python API
[ https://issues.apache.org/jira/browse/SPARK-18080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-18080: -- Component/s: PySpark ML > Locality Sensitive Hashing (LSH) Python API > --- > > Key: SPARK-18080 > URL: https://issues.apache.org/jira/browse/SPARK-18080 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18082) Locality Sensitive Hashing (LSH) - SignRandomProjection
Joseph K. Bradley created SPARK-18082: - Summary: Locality Sensitive Hashing (LSH) - SignRandomProjection Key: SPARK-18082 URL: https://issues.apache.org/jira/browse/SPARK-18082 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Priority: Minor See linked JIRA for original LSH for details -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18083) Locality Sensitive Hashing (LSH) - BitSampling
Joseph K. Bradley created SPARK-18083: - Summary: Locality Sensitive Hashing (LSH) - BitSampling Key: SPARK-18083 URL: https://issues.apache.org/jira/browse/SPARK-18083 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Priority: Minor See the linked original LSH JIRA for details
[jira] [Closed] (SPARK-7334) Implement RandomProjection for Dimensionality Reduction
[ https://issues.apache.org/jira/browse/SPARK-7334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-7334. Resolution: Duplicate > Implement RandomProjection for Dimensionality Reduction > --- > > Key: SPARK-7334 > URL: https://issues.apache.org/jira/browse/SPARK-7334 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Sebastian Alfers >Priority: Minor > > Implement RandomProjection (RP) for dimensionality reduction > RP is a popular approach to reduce the amount of data while preserving a > reasonable amount of information (pairwise distance) of your data [1][2] > - [1] http://www.yaroslavvb.com/papers/achlioptas-database.pdf > - [2] > http://people.inf.elte.hu/fekete/algoritmusok_msc/dimenzio_csokkentes/randon_projection_kdd.pdf > I compared different implementations of that algorithm: > - https://github.com/sebastian-alfers/random-projection-python
[jira] [Commented] (SPARK-18053) ARRAY equality is broken in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602974#comment-15602974 ] Cheng Lian commented on SPARK-18053: Yea, reproduced using 2.0. > ARRAY equality is broken in Spark 2.0 > - > > Key: SPARK-18053 > URL: https://issues.apache.org/jira/browse/SPARK-18053 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Cheng Lian >Assignee: Wenchen Fan > Labels: correctness > > The following Spark shell session reproduces this issue: > {code} > case class Test(a: Seq[Int]) > Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t") > sql("SELECT a FROM t WHERE a = array(1)").show() > // +---+ > // | a| > // +---+ > // +---+ > sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show() > // +---+ > // | a| > // +---+ > // |[1]| > // +---+ > {code}
[jira] [Commented] (SPARK-18053) ARRAY equality is broken in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-18053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602969#comment-15602969 ] Cheng Lian commented on SPARK-18053: Hm, the user mailing list thread said that it fails under 2.0: https://lists.apache.org/thread.html/%3c1476953644701-27926.p...@n3.nabble.com%3E I haven't verified it under 2.0 yet. > ARRAY equality is broken in Spark 2.0 > - > > Key: SPARK-18053 > URL: https://issues.apache.org/jira/browse/SPARK-18053 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: Cheng Lian >Assignee: Wenchen Fan > Labels: correctness > > The following Spark shell session reproduces this issue: > {code} > case class Test(a: Seq[Int]) > Seq(Test(Seq(1))).toDF().createOrReplaceTempView("t") > sql("SELECT a FROM t WHERE a = array(1)").show() > // +---+ > // | a| > // +---+ > // +---+ > sql("SELECT a FROM (SELECT array(1) AS a) x WHERE x.a = array(1)").show() > // +---+ > // | a| > // +---+ > // |[1]| > // +---+ > {code}
[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work
[ https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602917#comment-15602917 ] Sean Owen commented on SPARK-18017: --- You need to set it with --conf, not programmatically, I'd imagine. > Changing Hadoop parameter through > sparkSession.sparkContext.hadoopConfiguration doesn't work > > > Key: SPARK-18017 > URL: https://issues.apache.org/jira/browse/SPARK-18017 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Scala version 2.11.8; Java 1.8.0_91; > com.databricks:spark-csv_2.11:1.2.0 >Reporter: Yuehua Zhang > > My Spark job tries to read csv files on S3. I need to control the number of > partitions created, so I set the Hadoop parameter fs.s3n.block.size. However, it > stopped working after we upgraded Spark from 1.6.1 to 2.0.0. Not sure if it is > related to https://issues.apache.org/jira/browse/SPARK-15991.
[jira] [Commented] (SPARK-18073) Migrate wiki to spark.apache.org web site
[ https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602747#comment-15602747 ] Alex Bozarth commented on SPARK-18073: -- I think we should keep the Internals pages but (open some JIRAs to) update them. I'd be willing to look at updating the Web UI Internals page; it helped me when I was first starting, but it's pretty out of date now. > Migrate wiki to spark.apache.org web site > - > > Key: SPARK-18073 > URL: https://issues.apache.org/jira/browse/SPARK-18073 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Sean Owen > > Per > http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html > , let's consider migrating all wiki pages to documents at > github.com/apache/spark-website (i.e. spark.apache.org). > Some reasons: > * No pull request system or history for changes to the wiki > * Separate, not-so-clear system for granting write access to wiki > * Wiki doesn't change much > * One less place to maintain or look for docs > The idea would be to then update all wiki pages with a message pointing to > the new home of the information (or message saying it's obsolete). > Here are the current wikis and my general proposal for what to do with the > content: > * Additional Language Bindings -> roll this into wherever Third Party > Projects ends up > * Committers -> Migrate to a new /committers.html page, linked under > Community menu (already exists) > * Contributing to Spark -> Make this CONTRIBUTING.md? or a new > /contributing.html page under Community menu > ** Jira Permissions Scheme -> obsolete > ** Spark Code Style Guide -> roll this into new contributing.html page > * Development Discussions -> obsolete? 
> * Powered By Spark -> Make into new /powered-by.html linked by the existing > Community menu item > * Preparing Spark Releases -> see below; roll into where "versioning policy" > goes? > * Profiling Spark Applications -> roll into where Useful Developer Tools goes > ** Profiling Spark Applications Using YourKit -> ditto > * Spark Internals -> all of these look somewhat to very stale; remove? > ** Java API Internals > ** PySpark Internals > ** Shuffle Internals > ** Spark SQL Internals > ** Web UI Internals > * Spark QA Infrastructure -> tough one. Good info to document; does it belong > on the website? we can just migrate it > * Spark Versioning Policy -> new page living under Community (?) that > documents release policy and process (better menu?) > ** spark-ec2 AMI list and install file version mappings -> obsolete > ** Spark-Shark version mapping -> obsolete > * Third Party Projects -> new Community menu item > * Useful Developer Tools -> new page under new Developer menu? Community? > ** Jenkins -> obsolete, remove > Of course, another outcome is to just remove outdated wikis, migrate some, > leave the rest. > Thoughts?
[jira] [Updated] (SPARK-18044) FileStreamSource should not infer partitions in every batch
[ https://issues.apache.org/jira/browse/SPARK-18044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-18044: - Fix Version/s: 2.0.2 > FileStreamSource should not infer partitions in every batch > --- > > Key: SPARK-18044 > URL: https://issues.apache.org/jira/browse/SPARK-18044 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.2, 2.1.0 > >
[jira] [Updated] (SPARK-17153) [Structured streams] readStream ignores partition columns
[ https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-17153: - Fix Version/s: 2.0.2 > [Structured streams] readStream ignores partition columns > - > > Key: SPARK-17153 > URL: https://issues.apache.org/jira/browse/SPARK-17153 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Dmitri Carpov >Assignee: Liang-Chi Hsieh > Fix For: 2.0.2, 2.1.0 > > > When parquet files are persisted using partitions, spark's `readStream` > returns data with all `null`s for the partitioned columns. > For example: > {noformat} > case class A(id: Int, value: Int) > val data = spark.createDataset(Seq( > A(1, 1), > A(2, 2), > A(2, 3)) > ) > val url = "/mnt/databricks/test" > data.write.partitionBy("id").parquet(url) > {noformat} > when the data is read as a stream: > {noformat} > spark.readStream.schema(spark.read.load(url).schema).parquet(url) > {noformat} > it reads: > {noformat} > id, value > null, 1 > null, 2 > null, 3 > {noformat} > A possible reason is that `readStream` reads parquet files directly, but when those > are stored, the columns they are partitioned by are excluded from the file > itself. In the given example the parquet files contain `value` information > only, since `id` is a partition column.
[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work
[ https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602708#comment-15602708 ] Yuehua Zhang commented on SPARK-18017: -- Yeah, I tried that also. Not working either... > Changing Hadoop parameter through > sparkSession.sparkContext.hadoopConfiguration doesn't work > > > Key: SPARK-18017 > URL: https://issues.apache.org/jira/browse/SPARK-18017 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 > Environment: Scala version 2.11.8; Java 1.8.0_91; > com.databricks:spark-csv_2.11:1.2.0 >Reporter: Yuehua Zhang > > My Spark job tries to read csv files on S3. I need to control the number of > partitions created, so I set the Hadoop parameter fs.s3n.block.size. However, it > stopped working after we upgraded Spark from 1.6.1 to 2.0.0. Not sure if it is > related to https://issues.apache.org/jira/browse/SPARK-15991.
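For reference, the two configuration routes discussed in this thread can be sketched side by side. This is only an illustration of the discussion, not a confirmed fix for the issue: the block-size value below is made up, and the `spark.hadoop.*` prefix is Spark's documented way of forwarding a property into the Hadoop `Configuration` before the context is created:

```scala
// Programmatic route from the bug report: mutate the Hadoop Configuration
// after the SparkSession already exists (reported not to take effect
// for file reads after the 1.6.1 -> 2.0.0 upgrade):
//
//   spark.sparkContext.hadoopConfiguration.set("fs.s3n.block.size", "33554432")
//
// Route suggested in Sean Owen's comment: pass the Hadoop property at
// submit time with the spark.hadoop.* prefix, so it is present before
// the context is created:
//
//   spark-submit --conf spark.hadoop.fs.s3n.block.size=33554432 ...
```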
[jira] [Assigned] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18079: Assignee: (was: Apache Spark) > CollectLimitExec.executeToIterator() should perform per-partition limits > > > Key: SPARK-18079 > URL: https://issues.apache.org/jira/browse/SPARK-18079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Patrick Woody > > Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for > executeToIterator.
[jira] [Commented] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602645#comment-15602645 ] Apache Spark commented on SPARK-18079: -- User 'pwoody' has created a pull request for this issue: https://github.com/apache/spark/pull/15614 > CollectLimitExec.executeToIterator() should perform per-partition limits > > > Key: SPARK-18079 > URL: https://issues.apache.org/jira/browse/SPARK-18079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Patrick Woody > > Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for > executeToIterator.
[jira] [Assigned] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits
[ https://issues.apache.org/jira/browse/SPARK-18079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18079: Assignee: Apache Spark > CollectLimitExec.executeToIterator() should perform per-partition limits > > > Key: SPARK-18079 > URL: https://issues.apache.org/jira/browse/SPARK-18079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Patrick Woody >Assignee: Apache Spark > > Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for > executeToIterator.
[jira] [Created] (SPARK-18079) CollectLimitExec.executeToIterator() should perform per-partition limits
Patrick Woody created SPARK-18079: - Summary: CollectLimitExec.executeToIterator() should perform per-partition limits Key: SPARK-18079 URL: https://issues.apache.org/jira/browse/SPARK-18079 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.1 Reporter: Patrick Woody Analogous PR to https://issues.apache.org/jira/browse/SPARK-17515 for executeToIterator.
[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602525#comment-15602525 ] Apache Spark commented on SPARK-16988: -- User 'hayashidac' has created a pull request for this issue: https://github.com/apache/spark/pull/15611 > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Priority: Minor > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print the https url instead of > the http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}
[jira] [Assigned] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16988: Assignee: Apache Spark > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Assignee: Apache Spark >Priority: Minor > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print the https url instead of > the http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}
[jira] [Commented] (SPARK-18078) Add option for customize zipPartition task preferred locations
[ https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602537#comment-15602537 ] Apache Spark commented on SPARK-18078: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/15612 > Add option for customize zipPartition task preferred locations > -- > > Key: SPARK-18078 > URL: https://issues.apache.org/jira/browse/SPARK-18078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The `RDD.zipPartitions` task preferred locations strategy uses the > intersection of the corresponding zipped partitions' locations; if the > intersection is empty, it uses the union of these locations. > But in special cases, I want to customize the task preferred locations for > better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* > operator: a distributed matrix (DMatrix) multiplying a distributed > vector (DVector); it uses RDD.zipPartitions (the DMatrix and DVector RDDs must be > partitioned in the same way beforehand). > https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala > Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we hope > the zipPartition task always locates on the `DMatrix` partition's location. It > will get better data locality than the default preferred location strategy. > I think it makes sense to add an option for this. >
[jira] [Assigned] (SPARK-18078) Add option for customize zipPartition task preferred locations
[ https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18078: Assignee: Apache Spark > Add option for customize zipPartition task preferred locations > -- > > Key: SPARK-18078 > URL: https://issues.apache.org/jira/browse/SPARK-18078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Weichen Xu >Assignee: Apache Spark > Original Estimate: 24h > Remaining Estimate: 24h > > The `RDD.zipPartitions` task preferred locations strategy uses the > intersection of the corresponding zipped partitions' locations; if the > intersection is empty, it uses the union of these locations. > But in special cases, I want to customize the task preferred locations for > better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* > operator: a distributed matrix (DMatrix) multiplying a distributed > vector (DVector); it uses RDD.zipPartitions (the DMatrix and DVector RDDs must be > partitioned in the same way beforehand). > https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala > Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we hope > the zipPartition task always locates on the `DMatrix` partition's location. It > will get better data locality than the default preferred location strategy. > I think it makes sense to add an option for this. >
[jira] [Assigned] (SPARK-18078) Add option for customize zipPartition task preferred locations
[ https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18078: Assignee: (was: Apache Spark) > Add option for customize zipPartition task preferred locations > -- > > Key: SPARK-18078 > URL: https://issues.apache.org/jira/browse/SPARK-18078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The `RDD.zipPartitions` task preferred locations strategy uses the > intersection of the corresponding zipped partitions' locations; if the > intersection is empty, it uses the union of these locations. > But in special cases, I want to customize the task preferred locations for > better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* > operator: a distributed matrix (DMatrix) multiplying a distributed > vector (DVector); it uses RDD.zipPartitions (the DMatrix and DVector RDDs must be > partitioned in the same way beforehand). > https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala > Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we hope > the zipPartition task always locates on the `DMatrix` partition's location. It > will get better data locality than the default preferred location strategy. > I think it makes sense to add an option for this. >
[jira] [Assigned] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16988: Assignee: (was: Apache Spark) > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Priority: Minor > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print the https url instead of > the http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}
[jira] [Updated] (SPARK-18078) Add option for customize zipPartition task preferred locations
[ https://issues.apache.org/jira/browse/SPARK-18078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-18078: --- Description: The `RDD.zipPartitions` task preferred locations strategy uses the intersection of the corresponding zipped partitions' locations; if the intersection is empty, it uses the union of these locations. But in special cases, I want to customize the task preferred locations for better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* operator: a distributed matrix (DMatrix) multiplying a distributed vector (DVector); it uses RDD.zipPartitions (the DMatrix and DVector RDDs must be partitioned in the same way beforehand). https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we hope the zipPartition task always locates on the `DMatrix` partition's location. It will get better data locality than the default preferred location strategy. I think it makes sense to add an option for this. was: The `RDD.zipPartitions` task preferred locations strategy uses the intersection of the corresponding zipped partitions' locations; if the intersection is empty, it uses the union of these locations. But in special cases, I want to customize the task preferred locations for better performance. A typical case is in spark-tfocs: a distributed matrix (DMatrix) multiplying a vector (DVector); it uses RDD.zipPartitions. https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/DVectorFunctions.scala Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we hope the zipPartition task always locates on the `DMatrix` partition's location. It will get better data locality than the default preferred location strategy. I think it makes sense to add an option for this. 
> Add option for customize zipPartition task preferred locations > -- > > Key: SPARK-18078 > URL: https://issues.apache.org/jira/browse/SPARK-18078 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Weichen Xu > Original Estimate: 24h > Remaining Estimate: 24h > > The `RDD.zipPartitions` task preferred locations strategy uses the > intersection of the corresponding zipped partitions' locations; if the > intersection is empty, it uses the union of these locations. > But in special cases, I want to customize the task preferred locations for > better performance. A typical case is the spark-tfocs *LinopMatrixAdjoint* > operator: a distributed matrix (DMatrix) multiplying a distributed > vector (DVector); it uses RDD.zipPartitions (the DMatrix and DVector RDDs must be > partitioned in the same way beforehand). > https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/fs/dvector/vector/LinopMatrixAdjoint.scala > Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we hope > the zipPartition task always locates on the `DMatrix` partition's location. It > will get better data locality than the default preferred location strategy. > I think it makes sense to add an option for this. >
[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602484#comment-15602484 ] chie hayashida commented on SPARK-16988: Can I work on this issue? > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Priority: Minor > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print the https url instead of > the http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}
[jira] [Issue Comment Deleted] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] chie hayashida updated SPARK-16988: --- Comment: was deleted (was: Can I work on it?) > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Priority: Minor > > When spark ssl is enabled, spark history server ui ( http://host:port) is > redirected to https://host:port+400. > So, spark history server log should be updated to print the https url instead of > the http url > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code}
[jira] [Created] (SPARK-18078) Add option for customize zipPartition task preferred locations
Weichen Xu created SPARK-18078: -- Summary: Add option for customize zipPartition task preferred locations Key: SPARK-18078 URL: https://issues.apache.org/jira/browse/SPARK-18078 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weichen Xu The `RDD.zipPartitions` task preferred locations strategy uses the intersection of the corresponding zipped partitions' locations; if the intersection is empty, it uses the union of these locations. But in special cases, I want to customize the task preferred locations for better performance. A typical case is in spark-tfocs: a distributed matrix (DMatrix) multiplying a vector (DVector); it uses RDD.zipPartitions. https://github.com/WeichenXu123/spark-tfocs/blob/master/src/main/scala/org/apache/spark/mllib/optimization/tfocs/DVectorFunctions.scala Usually, the `DMatrix` RDD will be much larger than the `DVector` RDD, so we hope the zipPartition task always locates on the `DMatrix` partition's location. It will get better data locality than the default preferred location strategy. I think it makes sense to add an option for this.
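The location-choice rule this issue describes (take the intersection of the zipped partitions' preferred locations, and fall back to the union when the intersection is empty) can be sketched in plain Scala. `PreferredLocations.choose` is an illustrative stand-in, not Spark's actual internal API:

```scala
// Sketch of the default preferred-location rule for zipped partitions:
// prefer hosts holding both partitions; if there are none, accept any
// host holding either partition.
object PreferredLocations {
  def choose(locsA: Set[String], locsB: Set[String]): Set[String] = {
    val common = locsA intersect locsB
    if (common.nonEmpty) common else locsA union locsB
  }
}
```

The proposal amounts to letting the caller bypass this rule and pin the zipped task to one side's locations, e.g. always the hosts of the much larger `DMatrix` RDD.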
[jira] [Commented] (SPARK-16988) spark history server log needs to be fixed to show https url when ssl is enabled
[ https://issues.apache.org/jira/browse/SPARK-16988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602442#comment-15602442 ] chie hayashida commented on SPARK-16988: Can I work on it? > spark history server log needs to be fixed to show https url when ssl is > enabled > > > Key: SPARK-16988 > URL: https://issues.apache.org/jira/browse/SPARK-16988 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Yesha Vora >Priority: Minor > > When spark ssl is enabled, the spark history server ui (http://host:port) is > redirected to https://host:(port+400). > So, the spark history server log should be updated to print the https url instead > of the http url. > {code:title=spark HS log} > 16/08/09 15:21:11 INFO ServerConnector: Started > ServerConnector@3970a5ee{SSL-HTTP/1.1}{0.0.0.0:18481} > 16/08/09 15:21:11 INFO Server: Started @4023ms > 16/08/09 15:21:11 INFO Utils: Successfully started service on port 18081. > 16/08/09 15:21:11 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and > started at http://xxx:18081 > 16/08/09 15:22:52 INFO FsHistoryProvider: Replaying log path: > hdfs://xxx:8020/yy/application_1470756121646_0001.inprogress{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18073) Migrate wiki to spark.apache.org web site
[ https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602432#comment-15602432 ] Shivaram Venkataraman commented on SPARK-18073: --- I think most of the list looks good to me. The only thing I am not sure about is the Spark Internals pages. While I agree they are obsolete, I wonder if they are still a useful starting point for understanding the code. > Migrate wiki to spark.apache.org web site > - > > Key: SPARK-18073 > URL: https://issues.apache.org/jira/browse/SPARK-18073 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Sean Owen > > Per > http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html > , let's consider migrating all wiki pages to documents at > github.com/apache/spark-website (i.e. spark.apache.org). > Some reasons: > * No pull request system or history for changes to the wiki > * Separate, not-so-clear system for granting write access to the wiki > * Wiki doesn't change much > * One less place to maintain or look for docs > The idea would be to then update all wiki pages with a message pointing to > the new home of the information (or a message saying it's obsolete). > Here are the current wikis and my general proposal for what to do with the > content: > * Additional Language Bindings -> roll this into wherever Third Party > Projects ends up > * Committers -> Migrate to a new /committers.html page, linked under > Community menu (already exists) > * Contributing to Spark -> Make this CONTRIBUTING.md? or a new > /contributing.html page under Community menu > ** Jira Permissions Scheme -> obsolete > ** Spark Code Style Guide -> roll this into new contributing.html page > * Development Discussions -> obsolete? 
> * Powered By Spark -> Make into new /powered-by.html linked by the existing > Community menu item > * Preparing Spark Releases -> see below; roll into where "versioning policy" > goes? > * Profiling Spark Applications -> roll into where Useful Developer Tools goes > ** Profiling Spark Applications Using YourKit -> ditto > * Spark Internals -> all of these look somewhat to very stale; remove? > ** Java API Internals > ** PySpark Internals > ** Shuffle Internals > ** Spark SQL Internals > ** Web UI Internals > * Spark QA Infrastructure -> tough one. Good info to document; does it belong > on the website? We can just migrate it > * Spark Versioning Policy -> new page living under Community (?) that > documents release policy and process (better menu?) > ** spark-ec2 AMI list and install file version mappings -> obsolete > ** Spark-Shark version mapping -> obsolete > * Third Party Projects -> new Community menu item > * Useful Developer Tools -> new page under new Developer menu? Community? > ** Jenkins -> obsolete, remove > Of course, another outcome is to just remove outdated wikis, migrate some, and leave the rest. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow
[ https://issues.apache.org/jira/browse/SPARK-18077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] J.P Feng updated SPARK-18077: - Description: Hello, all. I have run into a strange issue in my project. There is a table: CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` string, `recdate` string) PARTITIONED BY (`pt` string, `dt` string) stored as orc; and another table: CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` string, `happened_time` int) PARTITIONED BY (`pt` string, `dt` string) -- Test-1: I executed this SQL in spark-shell or spark-sql (before running it, partition (pt='mix_en', dt='2016-10-21') of table login4game already held a lot of data): insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login where pt='mix_en' and dt='2016-10-21' It takes a very long time; below is part of the log: / [Stage 5:===> (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] [Stage 5:=> (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] [Stage 5:> (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] [Stage 5:> (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] [Stage 5:> (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] [Stage 5:> (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] Moved: 
'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-3' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-4' to trash at: hdfs://master.com/user/hadoop/.Trash/Current ... Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current / I can see that the original data is moved to .Trash; then nothing is logged for about 10 minutes, after which the log prints again: / 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-0, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0, Status:true 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-1, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1, Status:true 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-2, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2, Status:true 16/10/24 17:24:15 INFO Hive: Replacing
[jira] [Created] (SPARK-18077) Run insert overwrite statements in spark to overwrite a partitioned table is very slow
J.P Feng created SPARK-18077: Summary: Run insert overwrite statements in spark to overwrite a partitioned table is very slow Key: SPARK-18077 URL: https://issues.apache.org/jira/browse/SPARK-18077 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Environment: spark 2.0 hive 2.0.1 driver memory: 4g total executors: 4 executor memory: 10g total cores: 13 Reporter: J.P Feng Hello, all. I have run into a strange issue in my project. There is a table: CREATE TABLE `login4game`(`account_name` string, `role_id` string, `server_id` string, `recdate` string) PARTITIONED BY (`pt` string, `dt` string) stored as orc; and another table: CREATE TABLE `tbllog_login`(`server` string,`role_id` bigint, `account_name` string, `happened_time` int) PARTITIONED BY (`pt` string, `dt` string) Test-1: I executed this SQL in spark-shell or spark-sql (before running it, partition (pt='mix_en', dt='2016-10-21') of table login4game already held a lot of data): insert overwrite table login4game partition(pt='mix_en',dt='2016-10-21') select distinct account_name,role_id,server,'1476979200' as recdate from tbllog_login where pt='mix_en' and dt='2016-10-21' It takes a very long time; below is part of the log: / [Stage 5:===> (144 + 8) / 200]15127.974: [GC [PSYoungGen: 587153K->103638K(572416K)] 893021K->412112K(1259008K), 0.0740800 secs] [Times: user=0.18 sys=0.00, real=0.08 secs] [Stage 5:=> (152 + 8) / 200]15128.441: [GC [PSYoungGen: 564438K->82692K(580096K)] 872912K->393836K(1266688K), 0.0808380 secs] [Times: user=0.16 sys=0.00, real=0.08 secs] [Stage 5:> (160 + 8) / 200]15128.854: [GC [PSYoungGen: 543297K->28369K(573952K)] 854441K->341282K(1260544K), 0.0674920 secs] [Times: user=0.12 sys=0.00, real=0.07 secs] [Stage 5:> (176 + 8) / 200]15129.152: [GC [PSYoungGen: 485073K->40441K(497152K)] 797986K->353651K(1183744K), 0.0588420 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] [Stage 5:> (177 + 8) / 200]15129.460: [GC [PSYoungGen: 496966K->50692K(579584K)] 810176K->364126K(1266176K), 
0.0555160 secs] [Times: user=0.15 sys=0.00, real=0.06 secs] [Stage 5:> (192 + 8) / 200]15129.777: [GC [PSYoungGen: 508420K->57213K(515072K)] 821854K->371717K(1201664K), 0.0641580 secs] [Times: user=0.16 sys=0.00, real=0.06 secs] Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-2' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-3' to trash at: hdfs://master.com/user/hadoop/.Trash/Current Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-4' to trash at: hdfs://master.com/user/hadoop/.Trash/Current ... 
Moved: 'hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-00199' to trash at: hdfs://master.com/user/hadoop/.Trash/Current / I can see that the original data is moved to .Trash; then nothing is logged for about 10 minutes, after which the log prints again: / 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-0, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-0, Status:true 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-1, dest: hdfs://master.com/data/hivedata/warehouse/my_log.db/login4game/pt=mix_en/dt=2016-10-21/part-1, Status:true 16/10/24 17:24:15 INFO Hive: Replacing src:hdfs://master.com/data/hivedata/warehouse/staging/.hive-staging_hive_2016-10-24_17-15-48_033_4875949055726164713-1/-ext-1/part-2, dest:
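The log above shows two file-at-a-time phases: every existing partition file is first moved to trash, then every staged file is moved into the partition. A toy local-filesystem model of that shape (hypothetical layout and helper names, not Hive's implementation) makes clear why a partition with many files means many individual filesystem calls:

```java
import java.io.File;
import java.io.IOException;

public class OverwriteModel {
    // Create an empty temp directory (demo helper, ignores race conditions).
    static File mkTempDir(String prefix) throws IOException {
        File d = File.createTempFile(prefix, "");
        d.delete();
        d.mkdir();
        return d;
    }

    // Returns {files left in the partition, files moved to trash}.
    static int[] overwritePartition() throws IOException {
        File part = mkTempDir("partition"); // existing partition data
        File staging = mkTempDir("staging"); // .hive-staging output
        File trash = mkTempDir("trash");     // stands in for .Trash/Current
        for (int i = 0; i < 5; i++) {
            new File(part, String.format("part-%05d", i)).createNewFile();
            new File(staging, String.format("part-%05d", i)).createNewFile();
        }
        // Phase 1: move every original file to trash, one call per file
        // (the "Moved: ... to trash" lines).
        for (File f : part.listFiles()) {
            f.renameTo(new File(trash, f.getName()));
        }
        // Phase 2: move each staged file into the partition, one call per
        // file (the "Replacing src ... dest ..." lines that appear later).
        for (File f : staging.listFiles()) {
            f.renameTo(new File(part, f.getName()));
        }
        return new int[] { part.listFiles().length, trash.listFiles().length };
    }

    public static void main(String[] args) throws IOException {
        int[] counts = overwritePartition();
        System.out.println(counts[0] + " files in partition, "
                + counts[1] + " in trash");
    }
}
```

On a local disk each rename is cheap; on HDFS with trash enabled, 200 per-file moves plus 200 per-file replacements is consistent with the long quiet period seen in the log.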
[jira] [Assigned] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US
[ https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18076: Assignee: Apache Spark > Fix default Locale used in DateFormat, NumberFormat to Locale.US > > > Key: SPARK-18076 > URL: https://issues.apache.org/jira/browse/SPARK-18076 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: Sean Owen >Assignee: Apache Spark > > Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. > Although the behavior of these format is mostly determined by things like > format strings, the exact behavior can vary according to the platform's > default locale. Although the locale defaults to "en", it can be set to > something else by env variables. And if it does, it can cause the same code > to succeed or fail based just on locale: > {code} > import java.text._ > import java.util._ > def parse(s: String, l: Locale) = new SimpleDateFormat("MMMdd", > l).parse(s) > parse("1989Dec31", Locale.US) > Sun Dec 31 00:00:00 GMT 1989 > parse("1989Dec31", Locale.UK) > Sun Dec 31 00:00:00 GMT 1989 > parse("1989Dec31", Locale.CHINA) > java.text.ParseException: Unparseable date: "1989Dec31" > at java.text.DateFormat.parse(DateFormat.java:366) > at .parse(:18) > ... 32 elided > parse("1989Dec31", Locale.GERMANY) > java.text.ParseException: Unparseable date: "1989Dec31" > at java.text.DateFormat.parse(DateFormat.java:366) > at .parse(:18) > ... 32 elided > {code} > Where not otherwise specified, I believe all instances in the code should > default to some fixed value, and that should probably be {{Locale.US}}. This > matches the JVM's default, and specifies both language ("en") and region > ("US") to remove ambiguity. This most closely matches what the current code > behavior would be (unless default locale was changed), because it will > currently default to "en". > This affects SQL date/time functions. 
At the moment, the only SQL function > that lets the user specify language/country is "sentences", which is > consistent with Hive. > It affects dates passed in the JSON API. > It affects some strings rendered in the UI, potentially. Although this isn't > a correctness issue, there may be an argument for not letting that vary (?) > It affects a bunch of instances where dates are formatted into strings for > things like IDs or file names, which is far less likely to cause a problem, > but worth making consistent. > The other occurrences are in tests. > The downside to this change is also its upside: the behavior doesn't depend > on default JVM locale, but, also can't be affected by the default JVM locale. > For example, if you wanted to parse some dates in a way that depended on an > non-US locale (not just the format string) then it would no longer be > possible. There's no means of specifying this, for example, in SQL functions > for parsing dates. However, controlling this by globally changing the locale > isn't exactly great either. > The purpose of this change is to make the current default behavior > deterministic and fixed. PR coming. > CC [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US
[ https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602233#comment-15602233 ] Apache Spark commented on SPARK-18076: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/15610 > Fix default Locale used in DateFormat, NumberFormat to Locale.US > > > Key: SPARK-18076 > URL: https://issues.apache.org/jira/browse/SPARK-18076 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: Sean Owen > > Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. > Although the behavior of these format is mostly determined by things like > format strings, the exact behavior can vary according to the platform's > default locale. Although the locale defaults to "en", it can be set to > something else by env variables. And if it does, it can cause the same code > to succeed or fail based just on locale: > {code} > import java.text._ > import java.util._ > def parse(s: String, l: Locale) = new SimpleDateFormat("MMMdd", > l).parse(s) > parse("1989Dec31", Locale.US) > Sun Dec 31 00:00:00 GMT 1989 > parse("1989Dec31", Locale.UK) > Sun Dec 31 00:00:00 GMT 1989 > parse("1989Dec31", Locale.CHINA) > java.text.ParseException: Unparseable date: "1989Dec31" > at java.text.DateFormat.parse(DateFormat.java:366) > at .parse(:18) > ... 32 elided > parse("1989Dec31", Locale.GERMANY) > java.text.ParseException: Unparseable date: "1989Dec31" > at java.text.DateFormat.parse(DateFormat.java:366) > at .parse(:18) > ... 32 elided > {code} > Where not otherwise specified, I believe all instances in the code should > default to some fixed value, and that should probably be {{Locale.US}}. This > matches the JVM's default, and specifies both language ("en") and region > ("US") to remove ambiguity. 
This most closely matches what the current code > behavior would be (unless default locale was changed), because it will > currently default to "en". > This affects SQL date/time functions. At the moment, the only SQL function > that lets the user specify language/country is "sentences", which is > consistent with Hive. > It affects dates passed in the JSON API. > It affects some strings rendered in the UI, potentially. Although this isn't > a correctness issue, there may be an argument for not letting that vary (?) > It affects a bunch of instances where dates are formatted into strings for > things like IDs or file names, which is far less likely to cause a problem, > but worth making consistent. > The other occurrences are in tests. > The downside to this change is also its upside: the behavior doesn't depend > on default JVM locale, but, also can't be affected by the default JVM locale. > For example, if you wanted to parse some dates in a way that depended on an > non-US locale (not just the format string) then it would no longer be > possible. There's no means of specifying this, for example, in SQL functions > for parsing dates. However, controlling this by globally changing the locale > isn't exactly great either. > The purpose of this change is to make the current default behavior > deterministic and fixed. PR coming. > CC [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US
[ https://issues.apache.org/jira/browse/SPARK-18076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18076: Assignee: (was: Apache Spark) > Fix default Locale used in DateFormat, NumberFormat to Locale.US > > > Key: SPARK-18076 > URL: https://issues.apache.org/jira/browse/SPARK-18076 > Project: Spark > Issue Type: Bug > Components: MLlib, Spark Core, SQL >Affects Versions: 2.0.1 >Reporter: Sean Owen > > Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. > Although the behavior of these format is mostly determined by things like > format strings, the exact behavior can vary according to the platform's > default locale. Although the locale defaults to "en", it can be set to > something else by env variables. And if it does, it can cause the same code > to succeed or fail based just on locale: > {code} > import java.text._ > import java.util._ > def parse(s: String, l: Locale) = new SimpleDateFormat("MMMdd", > l).parse(s) > parse("1989Dec31", Locale.US) > Sun Dec 31 00:00:00 GMT 1989 > parse("1989Dec31", Locale.UK) > Sun Dec 31 00:00:00 GMT 1989 > parse("1989Dec31", Locale.CHINA) > java.text.ParseException: Unparseable date: "1989Dec31" > at java.text.DateFormat.parse(DateFormat.java:366) > at .parse(:18) > ... 32 elided > parse("1989Dec31", Locale.GERMANY) > java.text.ParseException: Unparseable date: "1989Dec31" > at java.text.DateFormat.parse(DateFormat.java:366) > at .parse(:18) > ... 32 elided > {code} > Where not otherwise specified, I believe all instances in the code should > default to some fixed value, and that should probably be {{Locale.US}}. This > matches the JVM's default, and specifies both language ("en") and region > ("US") to remove ambiguity. This most closely matches what the current code > behavior would be (unless default locale was changed), because it will > currently default to "en". > This affects SQL date/time functions. 
At the moment, the only SQL function > that lets the user specify language/country is "sentences", which is > consistent with Hive. > It affects dates passed in the JSON API. > It affects some strings rendered in the UI, potentially. Although this isn't > a correctness issue, there may be an argument for not letting that vary (?) > It affects a bunch of instances where dates are formatted into strings for > things like IDs or file names, which is far less likely to cause a problem, > but worth making consistent. > The other occurrences are in tests. > The downside to this change is also its upside: the behavior doesn't depend > on default JVM locale, but, also can't be affected by the default JVM locale. > For example, if you wanted to parse some dates in a way that depended on an > non-US locale (not just the format string) then it would no longer be > possible. There's no means of specifying this, for example, in SQL functions > for parsing dates. However, controlling this by globally changing the locale > isn't exactly great either. > The purpose of this change is to make the current default behavior > deterministic and fixed. PR coming. > CC [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-18076) Fix default Locale used in DateFormat, NumberFormat to Locale.US
Sean Owen created SPARK-18076: - Summary: Fix default Locale used in DateFormat, NumberFormat to Locale.US Key: SPARK-18076 URL: https://issues.apache.org/jira/browse/SPARK-18076 Project: Spark Issue Type: Bug Components: MLlib, Spark Core, SQL Affects Versions: 2.0.1 Reporter: Sean Owen Many parts of the code use {{DateFormat}} and {{NumberFormat}} instances. Although the behavior of these formats is mostly determined by things like format strings, the exact behavior can vary according to the platform's default locale. Although the locale defaults to "en", it can be set to something else by env variables. And if it does, it can cause the same code to succeed or fail based just on locale: {code} import java.text._ import java.util._ def parse(s: String, l: Locale) = new SimpleDateFormat("yyyyMMMdd", l).parse(s) parse("1989Dec31", Locale.US) Sun Dec 31 00:00:00 GMT 1989 parse("1989Dec31", Locale.UK) Sun Dec 31 00:00:00 GMT 1989 parse("1989Dec31", Locale.CHINA) java.text.ParseException: Unparseable date: "1989Dec31" at java.text.DateFormat.parse(DateFormat.java:366) at .parse(:18) ... 32 elided parse("1989Dec31", Locale.GERMANY) java.text.ParseException: Unparseable date: "1989Dec31" at java.text.DateFormat.parse(DateFormat.java:366) at .parse(:18) ... 32 elided {code} Where not otherwise specified, I believe all instances in the code should default to some fixed value, and that should probably be {{Locale.US}}. This matches the JVM's default, and specifies both language ("en") and region ("US") to remove ambiguity. This most closely matches what the current code behavior would be (unless the default locale was changed), because it will currently default to "en". This affects SQL date/time functions. At the moment, the only SQL function that lets the user specify language/country is "sentences", which is consistent with Hive. It affects dates passed in the JSON API. It affects some strings rendered in the UI, potentially. 
Although this isn't a correctness issue, there may be an argument for not letting that vary (?) It affects a bunch of instances where dates are formatted into strings for things like IDs or file names, which is far less likely to cause a problem, but worth making consistent. The other occurrences are in tests. The downside to this change is also its upside: the behavior doesn't depend on the default JVM locale, but also can't be affected by the default JVM locale. For example, if you wanted to parse some dates in a way that depended on a non-US locale (not just the format string), then it would no longer be possible. There's no means of specifying this, for example, in SQL functions for parsing dates. However, controlling this by globally changing the locale isn't exactly great either. The purpose of this change is to make the current default behavior deterministic and fixed. PR coming. CC [~hyukjin.kwon] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
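The locale sensitivity is easy to reproduce outside the REPL. Below is a self-contained Java version of the snippet in the description; the pattern is assumed to be "yyyyMMMdd", since the example parses a four-digit year, and the month name "Dec" only resolves under English-language locales:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

public class LocaleDemo {
    // Returns whether the date string parses under the given Locale. With
    // a text month field (MMM), "Dec" matches English month abbreviations
    // but not, e.g., German ("Dez").
    static boolean parses(String s, Locale l) {
        try {
            new SimpleDateFormat("yyyyMMMdd", l).parse(s);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("1989Dec31", Locale.US));      // true
        System.out.println(parses("1989Dec31", Locale.GERMANY)); // false
    }
}
```

Passing an explicit {{Locale.US}} to every such constructor, rather than relying on the JVM default, is exactly the determinism the issue proposes.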
[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602164#comment-15602164 ] Nick Orka commented on SPARK-9219: -- Here is UDF dedicated ticket https://issues.apache.org/jira/browse/SPARK-18075 > ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD > --- > > Key: SPARK-9219 > URL: https://issues.apache.org/jira/browse/SPARK-9219 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Mohsen Zainalpour > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) > at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602160#comment-15602160 ] Steve Loughran commented on SPARK-2984: --- Alexy, can you describe your layout a bit more # are you using Amazon EMR? # why s3:// URLs? # what hadoop-* JARs are you using (it's not that easy to tell in spark-1.6...what did you download?) If you can, use Hadoop 2.7.x+ and then switch to s3a. What I actually suspect here, looking at the code listing, is that the following sequence is happening * previous work completes, deletes/renames _temporary * next code gets a list of all files (globStatus(pat)) * Then iterates through, getting info about each one. * During that listing, the _temp dir goes away, which breaks the code. Seems to me the listing logic in {{FileInputFormat.singleThreadedListStatus}} could be more forgiving about FNFEs: if the file is no longer there, well, skip it. However, S3A in Hadoop 2.8 (and in HDP2.5, I'll note) replaces this very inefficient directory listing with one optimised for S3A (HADOOP-13208), which is much less likely to exhibit this problem. Even so, some more resilience would be good; I'll make a note in the relevant bits of the Hadoop code > FileNotFoundException on _temporary directory > - > > Key: SPARK-2984 > URL: https://issues.apache.org/jira/browse/SPARK-2984 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.3.0 > > > We've seen several stacktraces and threads on the user mailing list where > people are having issues with a {{FileNotFoundException}} stemming from an > HDFS path containing {{_temporary}}. > I ([~aash]) think this may be related to {{spark.speculation}}. 
I think the > error condition might manifest in this circumstance: > 1) task T starts on a executor E1 > 2) it takes a long time, so task T' is started on another executor E2 > 3) T finishes in E1 so moves its data from {{_temporary}} to the final > destination and deletes the {{_temporary}} directory during cleanup > 4) T' finishes in E2 and attempts to move its data from {{_temporary}}, but > those files no longer exist! exception > Some samples: > {noformat} > 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job > 140774430 ms.0 > java.io.FileNotFoundException: File > hdfs://hadoopc/user/csong/output/human_bot/-140774430.out/_temporary/0/task_201408110805__m_07 > does not exist. > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:654) > at > org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:102) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:712) > at > org.apache.hadoop.hdfs.DistributedFileSystem$14.doCall(DistributedFileSystem.java:708) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:708) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:360) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:310) > at > org.apache.hadoop.mapred.FileOutputCommitter.commitJob(FileOutputCommitter.java:136) > at > org.apache.spark.SparkHadoopWriter.commitJob(SparkHadoopWriter.scala:126) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:841) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:724) > at > org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:643) > at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1068) 
> at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:773) > at > org.apache.spark.streaming.dstream.DStream$$anonfun$8.apply(DStream.scala:771) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at > org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40) > at scala.util.Try$.apply(Try.scala:161) > at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32) > at > org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172) > at >
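The FNFE-tolerant listing behavior suggested in the comment above can be sketched without Hadoop at all. This is a hypothetical illustration using java.nio, not the actual FileInputFormat code — the class and method names here are made up for the sketch:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an FNFE-tolerant listing loop, in the spirit of what
// FileInputFormat.singleThreadedListStatus could do: a file that vanishes
// between the glob and the per-file stat is skipped instead of failing the job.
public class ResilientListing {
    public static List<Long> statSurvivors(List<Path> candidates) {
        List<Long> sizes = new ArrayList<>();
        for (Path p : candidates) {
            try {
                sizes.add(Files.size(p)); // may race with a concurrent delete/rename
            } catch (NoSuchFileException gone) {
                // File disappeared after listing (e.g. _temporary cleanup): skip it.
            } catch (IOException e) {
                throw new UncheckedIOException(e); // other I/O failures stay fatal
            }
        }
        return sizes;
    }
}
```

The point of the design is that only the "file no longer exists" case is swallowed; any other I/O error still propagates, so genuine storage problems are not hidden.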
[jira] [Commented] (SPARK-18073) Migrate wiki to spark.apache.org web site
[ https://issues.apache.org/jira/browse/SPARK-18073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602152#comment-15602152 ] holdenk commented on SPARK-18073: - I like the idea of migrating everything off of the wiki. The fact it's locked down makes sense from a spam point of view, but one of the best places for people to "get their toes wet" is documentation fixes, and making that experience more consistent across the different types of documentation seems like it could be quite beneficial. (Not to mention the resulting ability for more than just committers to contribute suggestions on how to improve these parts of our documentation.) > Migrate wiki to spark.apache.org web site > - > > Key: SPARK-18073 > URL: https://issues.apache.org/jira/browse/SPARK-18073 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.0.1 >Reporter: Sean Owen > > Per > http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html > , let's consider migrating all wiki pages to documents at > github.com/apache/spark-website (i.e. spark.apache.org). > Some reasons: > * No pull request system or history for changes to the wiki > * Separate, not-so-clear system for granting write access to wiki > * Wiki doesn't change much > * One less place to maintain or look for docs > The idea would be to then update all wiki pages with a message pointing to > the new home of the information (or a message saying it's obsolete). > Here are the current wikis and my general proposal for what to do with the > content: > * Additional Language Bindings -> roll this into wherever Third Party > Projects ends up > * Committers -> Migrate to a new /committers.html page, linked under > Community menu (already exists) > * Contributing to Spark -> Make this CONTRIBUTING.md? 
or a new > /contributing.html page under Community menu > ** Jira Permissions Scheme -> obsolete > ** Spark Code Style Guide -> roll this into new contributing.html page > * Development Discussions -> obsolete? > * Powered By Spark -> Make into new /powered-by.html linked by the existing > Community menu item > * Preparing Spark Releases -> see below; roll into where "versioning policy" > goes? > * Profiling Spark Applications -> roll into where Useful Developer Tools goes > ** Profiling Spark Applications Using YourKit -> ditto > * Spark Internals -> all of these look somewhat to very stale; remove? > ** Java API Internals > ** PySpark Internals > ** Shuffle Internals > ** Spark SQL Internals > ** Web UI Internals > * Spark QA Infrastructure -> tough one. Good info to document; does it belong > on the website? We can just migrate it > * Spark Versioning Policy -> new page living under Community (?) that > documents release policy and process (better menu?) > ** spark-ec2 AMI list and install file version mappings -> obsolete > ** Spark-Shark version mapping -> obsolete > * Third Party Projects -> new Community menu item > * Useful Developer Tools -> new page under new Developer menu? Community? > ** Jenkins -> obsolete, remove > Of course, another outcome is to just remove outdated wikis, migrate some, > leave the rest. > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Orka updated SPARK-18075: -- Description: I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz). According to this ticket, https://issues.apache.org/jira/browse/SPARK-9219, I've given all Spark dependencies the PROVIDED scope. I use exactly the same Spark version in the app as on the Spark server. Here is my pom:
{code:title=pom.xml}
1.6 1.6 UTF-8 2.11.8 2.0.0 2.7.0
org.apache.spark spark-core_2.11 ${spark.version} provided
org.apache.spark spark-sql_2.11 ${spark.version} provided
org.apache.spark spark-hive_2.11 ${spark.version} provided
{code}
As you can see, all Spark dependencies have provided scope. And this is the code for reproduction:
{code:title=udfTest.scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
 * Created by nborunov on 10/19/16.
 */
object udfTest {

  class Seq extends Serializable {
    var i = 0

    def getVal: Int = {
      i = i + 1
      i
    }
  }

  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .master("spark://nborunov-mbp.local:7077")
      // .master("local")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))
    val schema = StructType(Array(StructField("name", StringType)))
    val df = spark.createDataFrame(rdd, schema)
    df.show()

    spark.udf.register("func", (name: String) => name.toUpperCase)
    import org.apache.spark.sql.functions.expr
    val newDf = df.withColumn("upperName", expr("func(name)"))
    newDf.show()

    val seq = new Seq
    spark.udf.register("seq", () => seq.getVal)
    val seqDf = df.withColumn("id", expr("seq()"))
    seqDf.show()

    df.createOrReplaceTempView("df")
    spark.sql("select *, seq() as sql_id from df").show()
  }
}
{code}
When .master("local") is used, everything works fine. 
When .master("spark://...:7077"), it fails on line: {code} newDf.show() {code} The error is exactly the same: {code} scala> udfTest.main(Array()) SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nborunov); groups with view permissions: Set(); users with modify permissions: Set(nborunov); groups with modify permissions: Set() 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on port 57828. 
16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB 16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator 16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.2.202:4040 16/10/19 19:37:54 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://nborunov-mbp.local:7077... 16/10/19 19:37:54 INFO TransportClientFactory: Successfully created connection to nborunov-mbp.local/192.168.2.202:7077 after 74 ms (0 ms spent in bootstraps) 16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20161019153755-0017 16/10/19 19:37:55 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20161019153755-0017/0 on
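For what it's worth, a common cause of UDF failures that appear only with a remote master is that the job is launched from a REPL with .master("spark://...") hard-coded, so the application jar (containing the UDF closure's classes) is never shipped to the executors. A hedged sketch of one way around that — the jar name and build coordinates below are hypothetical, not taken from this report:

```shell
# Hypothetical jar name: package the app, then let spark-submit ship the jar
# (and with it the compiled UDF classes) to every executor, instead of
# running the job from a REPL with the master hard-coded in code.
mvn -q package
spark-submit \
  --master spark://nborunov-mbp.local:7077 \
  --class udfTest \
  target/udf-test-1.0.jar
```

Equivalently, when the session must be built programmatically, pointing the `spark.jars` configuration at the packaged jar's path has the same effect of putting the classes on the executor classpath.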
[jira] [Created] (SPARK-18075) UDF doesn't work on non-local spark
Nick Orka created SPARK-18075: - Summary: UDF doesn't work on non-local spark Key: SPARK-18075 URL: https://issues.apache.org/jira/browse/SPARK-18075 Project: Spark Issue Type: Bug Affects Versions: 2.0.0, 1.6.1 Reporter: Nick Orka I've decided to clone the ticket because it had the same problem for another Spark version and the provided workaround doesn't fix the issue. I'm duplicating my case here. I have the same issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz). Here is my pom:
{code:title=pom.xml}
1.6 1.6 UTF-8 2.11.8 2.0.0 2.7.0
org.apache.spark spark-core_2.11 ${spark.version} provided
org.apache.spark spark-sql_2.11 ${spark.version} provided
org.apache.spark spark-hive_2.11 ${spark.version} provided
{code}
As you can see, all Spark dependencies have provided scope. And this is the code for reproduction:
{code:title=udfTest.scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
 * Created by nborunov on 10/19/16.
 */
object udfTest {

  class Seq extends Serializable {
    var i = 0

    def getVal: Int = {
      i = i + 1
      i
    }
  }

  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .master("spark://nborunov-mbp.local:7077")
      // .master("local")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))
    val schema = StructType(Array(StructField("name", StringType)))
    val df = spark.createDataFrame(rdd, schema)
    df.show()

    spark.udf.register("func", (name: String) => name.toUpperCase)
    import org.apache.spark.sql.functions.expr
    val newDf = df.withColumn("upperName", expr("func(name)"))
    newDf.show()

    val seq = new Seq
    spark.udf.register("seq", () => seq.getVal)
    val seqDf = df.withColumn("id", expr("seq()"))
    seqDf.show()

    df.createOrReplaceTempView("df")
    spark.sql("select *, seq() as sql_id from df").show()
  }
}
{code}
When .master("local") is used, everything works fine. 
When .master("spark://...:7077"), it fails on line: {code} newDf.show() {code} The error is exactly the same: {code} scala> udfTest.main(Array()) SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nborunov); groups with view permissions: Set(); users with modify permissions: Set(nborunov); groups with modify permissions: Set() 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on port 57828. 
16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB 16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator 16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 4040. 16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.2.202:4040 16/10/19 19:37:54 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://nborunov-mbp.local:7077... 16/10/19 19:37:54 INFO TransportClientFactory: Successfully created connection to nborunov-mbp.local/192.168.2.202:7077 after 74 ms (0 ms spent in bootstraps) 16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20161019153755-0017
[jira] [Updated] (SPARK-18075) UDF doesn't work on non-local spark
[ https://issues.apache.org/jira/browse/SPARK-18075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Orka updated SPARK-18075: -- Description: I have the issue with Spark 2.0.0 (spark-2.0.0-bin-hadoop2.7.tar.gz) Here is my pom:
{code:title=pom.xml}
1.6 1.6 UTF-8 2.11.8 2.0.0 2.7.0
org.apache.spark spark-core_2.11 ${spark.version} provided
org.apache.spark spark-sql_2.11 ${spark.version} provided
org.apache.spark spark-hive_2.11 ${spark.version} provided
{code}
As you can see all spark dependencies have provided scope And this is a code for reproduction:
{code:title=udfTest.scala}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
 * Created by nborunov on 10/19/16.
 */
object udfTest {

  class Seq extends Serializable {
    var i = 0

    def getVal: Int = {
      i = i + 1
      i
    }
  }

  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .master("spark://nborunov-mbp.local:7077")
      // .master("local")
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(Seq(Row("one"), Row("two")))
    val schema = StructType(Array(StructField("name", StringType)))
    val df = spark.createDataFrame(rdd, schema)
    df.show()

    spark.udf.register("func", (name: String) => name.toUpperCase)
    import org.apache.spark.sql.functions.expr
    val newDf = df.withColumn("upperName", expr("func(name)"))
    newDf.show()

    val seq = new Seq
    spark.udf.register("seq", () => seq.getVal)
    val seqDf = df.withColumn("id", expr("seq()"))
    seqDf.show()

    df.createOrReplaceTempView("df")
    spark.sql("select *, seq() as sql_id from df").show()
  }
}
{code}
When .master("local") - everything works fine. When .master("spark://...:7077"), it fails on line: {code} newDf.show() {code} The error is exactly the same: {code} scala> udfTest.main(Array()) SLF4J: Class path contains multiple SLF4J bindings. 
SLF4J: Found binding in [jar:file:/Users/nborunov/.m2/repository/org/slf4j/slf4j-log4j12/1.7.16/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/Users/nborunov/.m2/repository/ch/qos/logback/logback-classic/1.1.7/logback-classic-1.1.7.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 16/10/19 19:37:52 INFO SparkContext: Running Spark version 2.0.0 16/10/19 19:37:52 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 16/10/19 19:37:52 INFO SecurityManager: Changing view acls to: nborunov 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls to: nborunov 16/10/19 19:37:52 INFO SecurityManager: Changing view acls groups to: 16/10/19 19:37:52 INFO SecurityManager: Changing modify acls groups to: 16/10/19 19:37:52 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(nborunov); groups with view permissions: Set(); users with modify permissions: Set(nborunov); groups with modify permissions: Set() 16/10/19 19:37:53 INFO Utils: Successfully started service 'sparkDriver' on port 57828. 16/10/19 19:37:53 INFO SparkEnv: Registering MapOutputTracker 16/10/19 19:37:53 INFO SparkEnv: Registering BlockManagerMaster 16/10/19 19:37:53 INFO DiskBlockManager: Created local directory at /private/var/folders/hl/2fv6555n2w92272zywwvpbzhgq/T/blockmgr-f2d05423-b7f7-4525-b41e-10dfe2f88264 16/10/19 19:37:53 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB 16/10/19 19:37:53 INFO SparkEnv: Registering OutputCommitCoordinator 16/10/19 19:37:54 INFO Utils: Successfully started service 'SparkUI' on port 4040. 
16/10/19 19:37:54 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.2.202:4040 16/10/19 19:37:54 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://nborunov-mbp.local:7077... 16/10/19 19:37:54 INFO TransportClientFactory: Successfully created connection to nborunov-mbp.local/192.168.2.202:7077 after 74 ms (0 ms spent in bootstraps) 16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Connected to Spark cluster with app ID app-20161019153755-0017 16/10/19 19:37:55 INFO StandaloneAppClient$ClientEndpoint: Executor added: app-20161019153755-0017/0 on worker-20161018232014-192.168.2.202-61437 (192.168.2.202:61437) with 4 cores 16/10/19 19:37:55 INFO StandaloneSchedulerBackend: Granted executor ID app-20161019153755-0017/0 on hostPort 192.168.2.202:61437 with 4
[jira] [Commented] (SPARK-2984) FileNotFoundException on _temporary directory
[ https://issues.apache.org/jira/browse/SPARK-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602045#comment-15602045 ] Alexey Balchunas commented on SPARK-2984: - I'm getting a similar exception on Spark 1.6.0: {code} java.io.FileNotFoundException: No such file or directory 's3://bucket/path/part-1' at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:507) at org.apache.hadoop.fs.FileSystem.getFileBlockLocations(FileSystem.java:704) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1696) at org.apache.hadoop.fs.FileSystem$4.next(FileSystem.java:1681) at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:268) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at
[jira] [Created] (SPARK-18074) UDFs don't work on non-local environment
Alberto Andreotti created SPARK-18074: - Summary: UDFs don't work on non-local environment Key: SPARK-18074 URL: https://issues.apache.org/jira/browse/SPARK-18074 Project: Spark Issue Type: Bug Reporter: Alberto Andreotti It seems that UDFs fail to deserialize when they are sent to the remote cluster. This is an app that can help reproduce the problem, https://github.com/albertoandreottiATgmail/spark_udf and this is the stack trace with the exception, org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at
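The repeated List$SerializationProxy.readObject frames in the trace above are ordinary Java serialization at work. A minimal, Spark-free round-trip sketch (class and method names are made up for illustration) shows the mechanism: deserialization only succeeds when every class in the object graph is visible to the deserializing classloader, which on an executor means the application jar must be on its classpath:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper: serialize a value and read it straight back, the same
// defaultReadFields/readObject path the executor walks when it deserializes a
// task. On a cluster the reading side runs in a different JVM, so any class in
// the graph that is missing from the executor classpath fails at this point.
public class RoundTrip {
    @SuppressWarnings("unchecked")
    public static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value); // List uses a writeReplace serialization proxy
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject(); // proxy's readObject rebuilds the value
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

In-process the round trip always succeeds because reader and writer share one classloader; the cluster failure mode is exactly the case where they don't.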
[jira] [Created] (SPARK-18073) Migrate wiki to spark.apache.org web site
Sean Owen created SPARK-18073: - Summary: Migrate wiki to spark.apache.org web site Key: SPARK-18073 URL: https://issues.apache.org/jira/browse/SPARK-18073 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 2.0.1 Reporter: Sean Owen Per http://apache-spark-developers-list.1001551.n3.nabble.com/Mini-Proposal-Make-it-easier-to-contribute-to-the-contributing-to-Spark-Guide-td19493.html , let's consider migrating all wiki pages to documents at github.com/apache/spark-website (i.e. spark.apache.org). Some reasons: * No pull request system or history for changes to the wiki * Separate, not-so-clear system for granting write access to wiki * Wiki doesn't change much * One less place to maintain or look for docs The idea would be to then update all wiki pages with a message pointing to the new home of the information (or a message saying it's obsolete). Here are the current wikis and my general proposal for what to do with the content: * Additional Language Bindings -> roll this into wherever Third Party Projects ends up * Committers -> Migrate to a new /committers.html page, linked under Community menu (already exists) * Contributing to Spark -> Make this CONTRIBUTING.md? or a new /contributing.html page under Community menu ** Jira Permissions Scheme -> obsolete ** Spark Code Style Guide -> roll this into new contributing.html page * Development Discussions -> obsolete? * Powered By Spark -> Make into new /powered-by.html linked by the existing Community menu item * Preparing Spark Releases -> see below; roll into where "versioning policy" goes? * Profiling Spark Applications -> roll into where Useful Developer Tools goes ** Profiling Spark Applications Using YourKit -> ditto * Spark Internals -> all of these look somewhat to very stale; remove? ** Java API Internals ** PySpark Internals ** Shuffle Internals ** Spark SQL Internals ** Web UI Internals * Spark QA Infrastructure -> tough one. 
Good info to document; does it belong on the website? We can just migrate it. * Spark Versioning Policy -> new page living under Community (?) that documents release policy and process (better menu?) ** spark-ec2 AMI list and install file version mappings -> obsolete ** Spark-Shark version mapping -> obsolete * Third Party Projects -> new Community menu item * Useful Developer Tools -> new page under new Developer menu? Community? ** Jenkins -> obsolete, remove Of course, another outcome is to just remove outdated wikis, migrate some, leave the rest. Thoughts?
[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601892#comment-15601892 ] Alberto Andreotti commented on SPARK-9219: -- Please paste the ticket number here. Thanks. > ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD > --- > > Key: SPARK-9219 > URL: https://issues.apache.org/jira/browse/SPARK-9219 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Mohsen Zainalpour > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) > at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at
[jira] [Commented] (SPARK-9219) ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD
[ https://issues.apache.org/jira/browse/SPARK-9219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601836#comment-15601836 ] Nick Orka commented on SPARK-9219: -- This UDF functionality doesn't work on non-local environment. I would re-open the issue or create a new one dedicated to UDF. > ClassCastException in instance of org.apache.spark.rdd.MapPartitionsRDD > --- > > Key: SPARK-9219 > URL: https://issues.apache.org/jira/browse/SPARK-9219 > Project: Spark > Issue Type: Bug >Affects Versions: 1.4.1 >Reporter: Mohsen Zainalpour > > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 > (TID 77, 192.168.1.194): java.lang.ClassCastException: cannot assign instance > of scala.collection.immutable.List$SerializationProxy to field > org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type > scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD > at > java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083) > at > java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at 
sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at > scala.collection.immutable.List$SerializationProxy.readObject(List.scala:477) > at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > at > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > at >
[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601827#comment-15601827 ] Cédric Hernalsteens commented on SPARK-15487: - Sure, this one (reverse proxy to access the workers and apps from the master web UI) is resolved. I'll open another one. > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Assignee: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently when running in Standalone mode, the Spark UI's links to workers and > application drivers point to internal/protected network endpoints. So to > access the worker/application UIs, the user's machine has to connect to a VPN > or needs direct access to the internal network. > Therefore the proposal is to make the Spark master UI reverse proxy this > information back to the user, so only the Spark master UI needs to be opened up > to the internet. > The minimal change can be done by adding another route, e.g. > http://spark-master.com/target// so when a request goes to that route, > the ProxyServlet kicks in, takes the target address, forwards the request to > it, and sends the response back to the user. > More information about the discussion of this feature can be found in this > mailing list thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-18052) Spark Job failing with org.apache.spark.rpc.RpcTimeoutException
[ https://issues.apache.org/jira/browse/SPARK-18052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srikanth closed SPARK-18052. Resolution: Not A Bug > Spark Job failing with org.apache.spark.rpc.RpcTimeoutException > --- > > Key: SPARK-18052 > URL: https://issues.apache.org/jira/browse/SPARK-18052 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.0.0 > Environment: 3 node spark cluster, all AWS r3.xlarge instances > running on ubuntu. >Reporter: Srikanth > Attachments: sparkErrorLog.txt > > > Spark submit jobs are failing with org.apache.spark.rpc.RpcTimeoutException. > increased the spark.executor.heartbeatInterval value from 10s to 60s, but > still the same issue. > This is happening while saving a dataframe to a mounted network drive. Not > using HDFS here. We are able to write successfully for smaller size files > under 10G, the data here we are reading is nearly 20G > driver memory = 10G > executor memory = 25G > Please see the attached log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
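The timeout-related settings mentioned in the report above live in Spark's configuration. A hedged sketch of raising them in spark-defaults.conf follows; the values are illustrative, and per the report, raising the heartbeat interval alone did not resolve the failure:

```
spark.executor.heartbeatInterval  60s
spark.network.timeout             600s
```

spark.network.timeout is the umbrella default for several RPC ask timeouts, so raising it is a common first step when RpcTimeoutException appears under heavy I/O.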
[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601794#comment-15601794 ] Matthew Farrellee commented on SPARK-15487: --- Well, unless you're putting another proxy in front of your master and want it to show up in a subsection of your domain, you should only need "/", and it would be a great default. In the case of a site proxy on www.mydomain.com/spark, I'd expect you only need to set the URL to "/spark". FYI, the master passes the proxy URL to the workers - https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L401 - so you should not need to set it on the workers. If you're continuing to have a problem, you should definitely open another issue and leave this one as resolved. > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Assignee: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently when running in Standalone mode, the Spark UI's links to workers and > application drivers point to internal/protected network endpoints. So to > access the worker/application UIs, the user's machine has to connect to a VPN > or needs direct access to the internal network. > Therefore the proposal is to make the Spark master UI reverse proxy this > information back to the user, so only the Spark master UI needs to be opened up > to the internet. > The minimal change can be done by adding another route, e.g. > http://spark-master.com/target// so when a request goes to that route, > the ProxyServlet kicks in, takes the target address, forwards the request to > it, and sends the response back to the user. 
> More information about the discussion of this feature can be found in this > mailing list thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601783#comment-15601783 ] zhangxinyu commented on SPARK-17935: I wrote a short design doc of KafkaSink for kafka-0.10, as follows. KafkaSink is designed based on the kafka-0.10-sql module. Could you please take a look? > Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now Spark already supports kafkaInputStream. It would be useful to add > `KafkaForeachWriter` to output results to Kafka in the structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17935) Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module
[ https://issues.apache.org/jira/browse/SPARK-17935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601778#comment-15601778 ] zhangxinyu commented on SPARK-17935: h2. KafkaSink Design Doc h4. Goal Output results to a Kafka cluster (version 0.10.0.0) in the structured streaming module. h4. Implementation Four classes are implemented to output data to a Kafka cluster in the structured streaming module. * *KafkaSinkProvider* This class extends the traits *StreamSinkProvider* and *DataSourceRegister* and overrides the functions *shortName* and *createSink*. In *createSink*, a *KafkaSink* is created. * *KafkaSink* *KafkaSink* extends *Sink* and overrides the function *addBatch*. A *KafkaSinkRDD* will be created in *addBatch*. * *KafkaSinkRDD* *KafkaSinkRDD* is designed to send results to Kafka clusters in a distributed way. It extends *RDD*. In its *compute* function, *CachedKafkaProducer* is called to get or create a producer to send data. * *CachedKafkaProducer* *CachedKafkaProducer* is used to store producers on the executors so that these producers can be reused. h4. Configuration * *Kafka Producer Configuration* "*.option()*" is used to set Kafka producer configurations, which all start with "*kafka.*". For example, the producer configuration *bootstrap.servers* can be set with *.option("kafka.bootstrap.servers", kafka-servers)*. * *Other Configuration* Other configuration is also set by ".option()". The difference is that these configurations don't start with "kafka.". h4. 
Usage
{code}
val query = input.writeStream
  .format("kafkaSink")
  .outputMode("append")
  .option("kafka.bootstrap.servers", kafka-servers)
  .option("topic", topic)
  .start()
{code}
> Add KafkaForeachWriter in external kafka-0.8.0 for structured streaming module > -- > > Key: SPARK-17935 > URL: https://issues.apache.org/jira/browse/SPARK-17935 > Project: Spark > Issue Type: Improvement > Components: SQL, Streaming >Affects Versions: 2.0.0 >Reporter: zhangxinyu > > Now Spark already supports kafkaInputStream. It would be useful to add > `KafkaForeachWriter` to output results to Kafka in the structured streaming > module. > `KafkaForeachWriter.scala` is put in external kafka-0.8.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
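The caching behavior described for *CachedKafkaProducer* above (one producer per distinct configuration per executor, reused across batches) can be sketched roughly as follows. This is a language-neutral illustration in Java, not Spark's actual implementation (which is Scala); the class shape and the generic producer type are assumptions, and the producer itself is stubbed so the sketch stays self-contained:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the CachedKafkaProducer idea: cache producers keyed by their
// configuration so that each executor creates at most one producer per
// distinct config and reuses it across micro-batches.
class ProducerCache<P> {
    private final ConcurrentHashMap<Map<String, String>, P> cache = new ConcurrentHashMap<>();
    private final Function<Map<String, String>, P> factory;

    ProducerCache(Function<Map<String, String>, P> factory) {
        this.factory = factory;
    }

    P getOrCreate(Map<String, String> params) {
        // computeIfAbsent runs the factory at most once per key, even when
        // several task threads ask for the same configuration concurrently.
        return cache.computeIfAbsent(params, factory);
    }
}
```

The design choice here is that the cache key is the full option map, which matches the doc's rule that producer settings are exactly the options starting with "kafka.".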
[jira] [Updated] (SPARK-18065) Spark 2 allows filter/where on columns not in current schema
[ https://issues.apache.org/jira/browse/SPARK-18065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matthew Scruggs updated SPARK-18065: Priority: Minor (was: Major) > Spark 2 allows filter/where on columns not in current schema > > > Key: SPARK-18065 > URL: https://issues.apache.org/jira/browse/SPARK-18065 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1 >Reporter: Matthew Scruggs >Priority: Minor > > I noticed in Spark 2 (unlike 1.6) it's possible to use filter/where on a > DataFrame that previously had a column, but no longer has it in its schema > due to a select() operation. > In Spark 1.6.2, in spark-shell, we see that an exception is thrown when > attempting to filter/where using the selected-out column: > {code:title=Spark 1.6.2} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 1.6.2 > /_/ > Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. > Spark context available as sc. > SQL context available as sqlContext. 
> scala> val df1 = sqlContext.createDataFrame(sc.parallelize(Seq((1, "one"), > (2, "two")))).selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---+----+ > | id|word| > +---+----+ > | 1| one| > | 2| two| > +---+----+ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > org.apache.spark.sql.AnalysisException: cannot resolve 'word' given input > columns: [id]; > {code} > However in Spark 2.0.0 and 2.0.1, we see that the same filter/where succeeds > (no AnalysisException) and seems to filter out data as if the column remains: > {code:title=Spark 2.0.1} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.0.1 > /_/ > > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val df1 = sc.parallelize(Seq((1, "one"), (2, > "two"))).toDF().selectExpr("_1 as id", "_2 as word") > df1: org.apache.spark.sql.DataFrame = [id: int, word: string] > scala> df1.show() > +---+----+ > | id|word| > +---+----+ > | 1| one| > | 2| two| > +---+----+ > scala> val df2 = df1.select("id") > df2: org.apache.spark.sql.DataFrame = [id: int] > scala> df2.printSchema() > root > |-- id: integer (nullable = false) > scala> df2.where("word = 'one'").show() > +---+ > | id| > +---+ > | 1| > +---+ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15487) Spark Master UI to reverse proxy Application and Workers UI
[ https://issues.apache.org/jira/browse/SPARK-15487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601746#comment-15601746 ] Cédric Hernalsteens commented on SPARK-15487: - @Matthew: that would be too easy ;) Seriously, I don't think I'll be the only one wanting to reverse proxy to something like www.mydomain/spark . @Gurvinder: For spark.ui.reverseProxy, I got it; it works well indeed. Maybe I should open another issue, because I believe I set spark.ui.reverseProxyUrl correctly on the worker and it's not taken into account. Thanks for looking at this though. > Spark Master UI to reverse proxy Application and Workers UI > --- > > Key: SPARK-15487 > URL: https://issues.apache.org/jira/browse/SPARK-15487 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.0, 1.6.1 >Reporter: Gurvinder >Assignee: Gurvinder >Priority: Minor > Fix For: 2.1.0 > > > Currently when running in Standalone mode, the Spark UI's links to workers and > application drivers point to internal/protected network endpoints. So to > access the worker/application UIs, the user's machine has to connect to a VPN > or needs direct access to the internal network. > Therefore the proposal is to make the Spark master UI reverse proxy this > information back to the user, so only the Spark master UI needs to be opened up > to the internet. > The minimal change can be done by adding another route, e.g. > http://spark-master.com/target// so when a request goes to that route, > the ProxyServlet kicks in, takes the target address, forwards the request to > it, and sends the response back to the user. > More information about the discussion of this feature can be found in this > mailing list thread: > http://apache-spark-developers-list.1001551.n3.nabble.com/spark-on-kubernetes-tc17599.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
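For reference, the two settings discussed in this thread go into spark-defaults.conf on the master. A minimal sketch, with an illustrative URL; spark.ui.reverseProxyUrl matters when a further proxy fronts the master under a sub-path, as in the www.mydomain.com/spark example above:

```
spark.ui.reverseProxy     true
spark.ui.reverseProxyUrl  http://www.mydomain.com/spark
```

As noted in the comments, the master passes the proxy URL to the workers, so setting it on the master alone should suffice.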
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601625#comment-15601625 ] Dhananjay Patkar commented on SPARK-4105: - I see this error intermittently. I am using spark :1.6.2 hadoop/yarn :2.7.3 {quote} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 40 in stage 7.0 failed 4 times, most recent failure: Lost task 40.3 in stage 7.0 (TID 411, 10.0.0.12): java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:98) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:474) at org.xerial.snappy.Snappy.uncompress(Snappy.java:513) at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:422) at org.xerial.snappy.SnappyInputStream.available(SnappyInputStream.java:469) at java.io.BufferedInputStream.read(BufferedInputStream.java:353) at java.io.DataInputStream.read(DataInputStream.java:149) at org.spark-project.guava.io.ByteStreams.read(ByteStreams.java:899) at org.spark-project.guava.io.ByteStreams.readFully(ByteStreams.java:733) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:119) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:102) at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {quote} > FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based > shuffle > - > > Key: SPARK-4105 > URL: https://issues.apache.org/jira/browse/SPARK-4105 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.4.1, 1.5.1, 1.6.1, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Attachments: JavaObjectToSerialize.java, > SparkFailedToUncompressGenerator.scala > > > We 
have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during > shuffle read. Here's a sample stacktrace from an executor: > {code} > 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID > 33053) > java.io.IOException: FAILED_TO_UNCOMPRESS(5) > at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) > at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) > at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) > at > org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) > at > org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) > at org.xerial.snappy.SnappyInputStream.(SnappyInputStream.java:58) > at > org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) > at > org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at > org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) > at > 
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) > at > org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) >
[jira] [Resolved] (SPARK-17810) Default spark.sql.warehouse.dir is relative to local FS but can resolve as HDFS path
[ https://issues.apache.org/jira/browse/SPARK-17810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17810. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.2 Issue resolved by pull request 15382 [https://github.com/apache/spark/pull/15382] > Default spark.sql.warehouse.dir is relative to local FS but can resolve as > HDFS path > > > Key: SPARK-17810 > URL: https://issues.apache.org/jira/browse/SPARK-17810 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Sean Owen >Assignee: Sean Owen > Fix For: 2.0.2, 2.1.0 > > > Following SPARK-15899 and > https://github.com/apache/spark/pull/13868#discussion_r82252372 we have a > slightly different problem. > The change removed the {{file:}} scheme from the default > {{spark.sql.warehouse.dir}} as part of its fix, though the path is still > clearly intended to be a local FS path and defaults to "spark-warehouse" in > the user's home dir. However when running on HDFS this path will be resolved > as an HDFS path, where it almost surely doesn't exist. > Although it can be fixed by overriding {{spark.sql.warehouse.dir}} to a path > like "file:/tmp/spark-warehouse", or any valid HDFS path, this probably won't > work on Windows (the original problem) and of course means the default fails > to work for most HDFS use cases. > There's a related problem here: the docs say the default should be > spark-warehouse relative to the current working dir, not the user home dir. > We can adjust that. > PR coming shortly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17847) Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
[ https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang reassigned SPARK-17847: --- Assignee: Yanbo Liang > Reduce shuffled data size of GaussianMixture & copy the implementation from > mllib to ml > --- > > Key: SPARK-17847 > URL: https://issues.apache.org/jira/browse/SPARK-17847 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang > > Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can > add new features to it. > I left mllib {{GaussianMixture}} untouched, unlike some other algorithms that > wrap the ml implementation, for the following reasons: > * mllib {{GaussianMixture}} allows k == 1, but ml does not. > * mllib {{GaussianMixture}} supports setting an initial model, but ml > currently does not. (We will definitely add this feature to ml in the future.) > Meanwhile, there is a big performance improvement for {{GaussianMixture}} in > this task. Since the covariance matrix of a multivariate Gaussian distribution > is symmetric, we can store only the upper triangular part of the matrix, which > greatly reduces the shuffled data size. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
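The storage trick behind the shuffle-size reduction (a symmetric d x d matrix has only d*(d+1)/2 distinct entries) can be sketched as follows. Java is used for a self-contained illustration; the helper names are hypothetical, not Spark's actual API, and Spark's real code is Scala:

```java
// Sketch: pack the upper triangle of a symmetric matrix into a flat array of
// length d*(d+1)/2, roughly halving what has to be shuffled, and unpack it
// back into the full matrix on the receiving side.
class PackedSymmetric {
    // Pack the row-major upper triangle (entries with i <= j).
    static double[] pack(double[][] m) {
        int d = m.length;
        double[] packed = new double[d * (d + 1) / 2];
        int idx = 0;
        for (int i = 0; i < d; i++) {
            for (int j = i; j < d; j++) {
                packed[idx++] = m[i][j];
            }
        }
        return packed;
    }

    // Reconstruct the full symmetric matrix by mirroring each packed entry.
    static double[][] unpack(double[] packed, int d) {
        double[][] m = new double[d][d];
        int idx = 0;
        for (int i = 0; i < d; i++) {
            for (int j = i; j < d; j++) {
                m[i][j] = packed[idx];
                m[j][i] = packed[idx];
                idx++;
            }
        }
        return m;
    }
}
```

Nothing is lost in the round trip precisely because the matrix is symmetric, which is why the optimization applies to Gaussian covariance matrices.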
[jira] [Updated] (SPARK-17847) Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
[ https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17847:
Description:
Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can add new features to it.
Unlike some other algorithms that wrap the ml implementation, I left the mllib {{GaussianMixture}} untouched, for the following reasons:
* mllib {{GaussianMixture}} allows k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting an initial model, but ml does not currently. (We will definitely add this feature to ml in the future.)
Meanwhile, this task also brings a big performance improvement for {{GaussianMixture}}: since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size.

was:
Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can add new features to it.
Unlike some other algorithms that wrap the ml implementation, I left the mllib {{GaussianMixture}} untouched, for the following reasons:
* mllib {{GaussianMixture}} allows k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting an initial model, but ml does not currently. (We will definitely add this feature to ml in the future.)
Meanwhile, this task also brings a big performance improvement for {{GaussianMixture}}: since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size.

> Reduce shuffled data size of GaussianMixture & copy the implementation from
> mllib to ml
> ---
>
> Key: SPARK-17847
> URL: https://issues.apache.org/jira/browse/SPARK-17847
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Reporter: Yanbo Liang
>
> Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can
> add new features to it.
> Unlike some other algorithms that wrap the ml implementation, I left the
> mllib {{GaussianMixture}} untouched, for the following reasons:
> * mllib {{GaussianMixture}} allows k == 1, but ml does not.
> * mllib {{GaussianMixture}} supports setting an initial model, but ml does
> not currently. (We will definitely add this feature to ml in the future.)
> Meanwhile, this task also brings a big performance improvement for
> {{GaussianMixture}}: since the covariance matrix of a multivariate Gaussian
> distribution is symmetric, we can store only the upper triangular part of
> the matrix, which greatly reduces the shuffled data size.
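The upper-triangular trick above can be sketched in plain Scala (a hypothetical illustration of the idea, not Spark's actual {{GaussianMixture}} code): for a d x d symmetric covariance matrix, only d * (d + 1) / 2 values need to be shuffled instead of d * d.

```scala
object TriangularPacking {
  // Pack the upper triangle (including the diagonal) in row-major order.
  def pack(m: Array[Array[Double]]): Array[Double] = {
    val d = m.length
    val out = new Array[Double](d * (d + 1) / 2)
    var k = 0
    for (i <- 0 until d; j <- i until d) { out(k) = m(i)(j); k += 1 }
    out
  }

  // Reconstruct the full symmetric matrix from the packed triangle,
  // mirroring each off-diagonal entry across the diagonal.
  def unpack(packed: Array[Double], d: Int): Array[Array[Double]] = {
    val m = Array.ofDim[Double](d, d)
    var k = 0
    for (i <- 0 until d; j <- i until d) {
      m(i)(j) = packed(k); m(j)(i) = packed(k); k += 1
    }
    m
  }
}
```

For d = 100 this shuffles 5,050 doubles per matrix instead of 10,000, roughly halving the shuffled data size for this term.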
[jira] [Updated] (SPARK-17847) Copy GaussianMixture implementation from mllib to ml
[ https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17847:
Description:
Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can add new features to it.
Unlike some other algorithms that wrap the ml implementation, I left the mllib {{GaussianMixture}} untouched, for the following reasons:
* mllib {{GaussianMixture}} allows k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting an initial model, but ml does not currently. (We will definitely add this feature to ml in the future.)
Meanwhile, this task also brings a big performance improvement for {{GaussianMixture}}: since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size.

was:
Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can add new features to it.
Unlike some other algorithms that wrap the ml implementation, I left the mllib {{GaussianMixture}} untouched, for the following reasons:
* mllib {{GaussianMixture}} allows k == 1, but ml does not.
* mllib {{GaussianMixture}} supports setting an initial model, but ml does not currently. (We will definitely add this feature to ml in the future.)

> Copy GaussianMixture implementation from mllib to ml
> ---
>
> Key: SPARK-17847
> URL: https://issues.apache.org/jira/browse/SPARK-17847
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Reporter: Yanbo Liang
>
> Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can
> add new features to it.
> Unlike some other algorithms that wrap the ml implementation, I left the
> mllib {{GaussianMixture}} untouched, for the following reasons:
> * mllib {{GaussianMixture}} allows k == 1, but ml does not.
> * mllib {{GaussianMixture}} supports setting an initial model, but ml does
> not currently. (We will definitely add this feature to ml in the future.)
> Meanwhile, this task also brings a big performance improvement for
> {{GaussianMixture}}: since the covariance matrix of a multivariate Gaussian
> distribution is symmetric, we can store only the upper triangular part of
> the matrix, which greatly reduces the shuffled data size.
[jira] [Updated] (SPARK-17847) Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
[ https://issues.apache.org/jira/browse/SPARK-17847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-17847:
Summary: Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml (was: Copy GaussianMixture implementation from mllib to ml)

> Reduce shuffled data size of GaussianMixture & copy the implementation from
> mllib to ml
> ---
>
> Key: SPARK-17847
> URL: https://issues.apache.org/jira/browse/SPARK-17847
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Reporter: Yanbo Liang
>
> Copy the {{GaussianMixture}} implementation from mllib to ml, so that we can
> add new features to it.
> Unlike some other algorithms that wrap the ml implementation, I left the
> mllib {{GaussianMixture}} untouched, for the following reasons:
> * mllib {{GaussianMixture}} allows k == 1, but ml does not.
> * mllib {{GaussianMixture}} supports setting an initial model, but ml does
> not currently. (We will definitely add this feature to ml in the future.)
> Meanwhile, this task also brings a big performance improvement for
> {{GaussianMixture}}: since the covariance matrix of a multivariate Gaussian
> distribution is symmetric, we can store only the upper triangular part of
> the matrix, which greatly reduces the shuffled data size.
[jira] [Commented] (SPARK-18017) Changing Hadoop parameter through sparkSession.sparkContext.hadoopConfiguration doesn't work
[ https://issues.apache.org/jira/browse/SPARK-18017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601473#comment-15601473 ] Sean Owen commented on SPARK-18017:
---
Ah, try spark.hadoop.fs.s3n.block.size=...

> Changing Hadoop parameter through
> sparkSession.sparkContext.hadoopConfiguration doesn't work
> ---
>
> Key: SPARK-18017
> URL: https://issues.apache.org/jira/browse/SPARK-18017
> Project: Spark
> Issue Type: Bug
> Affects Versions: 2.0.0
> Environment: Scala version 2.11.8; Java 1.8.0_91;
> com.databricks:spark-csv_2.11:1.2.0
> Reporter: Yuehua Zhang
>
> My Spark job tries to read CSV files on S3. I need to control the number of
> partitions created, so I set the Hadoop parameter fs.s3n.block.size. However,
> it stopped working after we upgraded Spark from 1.6.1 to 2.0.0. Not sure if
> it is related to https://issues.apache.org/jira/browse/SPARK-15991.
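Sean's suggestion relies on Spark 2.x copying any configuration entry prefixed with {{spark.hadoop.}} into the Hadoop {{Configuration}} it builds, so the property is already in place when input splits are planned, rather than being mutated on {{hadoopConfiguration}} after the session exists. A minimal sketch (the block-size value below is illustrative, not from the original report; requires a Spark 2.x runtime):

```scala
import org.apache.spark.sql.SparkSession

// "spark.hadoop.fs.s3n.block.size" is stripped of its "spark.hadoop." prefix
// and copied into the Hadoop Configuration at session creation time.
val spark = SparkSession.builder()
  .appName("s3n-block-size-example")
  .config("spark.hadoop.fs.s3n.block.size", "33554432") // 32 MB, illustrative
  .getOrCreate()
```

The same property can be passed at submit time with `--conf spark.hadoop.fs.s3n.block.size=33554432`.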
[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15601455#comment-15601455 ] Herman van Hovell commented on SPARK-18067:
---
[~tejasp] This makes sense to me. However, there are a few potential problems:
- You generally have a better chance of getting nicely distributed data if you hash by multiple values. If the `key` in your example has a relatively low cardinality, we can hit significant performance problems and OOMs if we need to buffer a lot of rows.
- I am pretty sure this will break {{outputPartitioning/requiredChildDistribution}}. This would allow EnsureRequirements to give us a different distribution than we asked for. This can be extremely problematic in the case of shuffle joins, since we need to make sure that both the left and the right relation have exactly the same distribution.

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> ---
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.6.1
> Reporter: Paul Jones
> Priority: Minor
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
>   .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
>   .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>    :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
>    +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter, our join gets less efficient and we end up with a
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>    :- Sort [value1#1 ASC,key#0 ASC], false, 0
>    :  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>    :     +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
>    +- Sort [value2#13 ASC,key#12 ASC], false, 0
>       +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>          +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if we use a filter statement that can't be
> pushed into the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>    +- SortMergeJoin [key#0], [key#12]
>       :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#0 ASC], false, 0, None
>       +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here?
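Herman's cardinality point can be illustrated with a small, self-contained Scala sketch (hypothetical, not Spark source). It mimics hash partitioning's non-negative modulo and counts how many of the available partitions a dataset actually touches when hashed by a single low-cardinality column versus by multiple columns:

```scala
object PartitionSkew {
  // Non-negative modulo, as used for hash partition assignment:
  // Scala's % can return a negative remainder for negative hash codes.
  def partitionOf(hash: Int, numPartitions: Int): Int = {
    val mod = hash % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Count how many distinct partitions the rows land in when the
  // partitioning key is computed by the given extractor.
  def partitionsUsed(rows: Seq[(String, Int)], numPartitions: Int)(
      keyOf: ((String, Int)) => Any): Int =
    rows.map(r => partitionOf(keyOf(r).hashCode, numPartitions)).distinct.size
}
```

With 100 rows spread over only two distinct `key` values and 200 partitions, hashing by `key` alone can use at most two partitions (every row for a given key is buffered on one machine), while hashing by the full `(key, value)` pair lets the rows spread across many partitions.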