[jira] [Assigned] (SPARK-16765) Add Pipeline API example for KMeans
[ https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16765: Assignee: (was: Apache Spark) > Add Pipeline API example for KMeans > --- > > Key: SPARK-16765 > URL: https://issues.apache.org/jira/browse/SPARK-16765 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.0.0 > Reporter: Manish Mishra > Priority: Trivial > > A pipeline API example for K-Means would be nice to have in the examples package, since it is one of the most widely used ML algorithms. A pipeline API example for Logistic Regression has already been added to the ml examples package. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16765) Add Pipeline API example for KMeans
[ https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401559#comment-15401559 ] Apache Spark commented on SPARK-16765: -- User 'manishatGit' has created a pull request for this issue: https://github.com/apache/spark/pull/14432 > Add Pipeline API example for KMeans > --- > > Key: SPARK-16765 > URL: https://issues.apache.org/jira/browse/SPARK-16765 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.0.0 > Reporter: Manish Mishra > Priority: Trivial > > A pipeline API example for K-Means would be nice to have in the examples package, since it is one of the most widely used ML algorithms. A pipeline API example for Logistic Regression has already been added to the ml examples package.
[jira] [Commented] (SPARK-16765) Add Pipeline API example for KMeans
[ https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401560#comment-15401560 ] Manish Mishra commented on SPARK-16765: --- You are right, [~bryanc]. There was one use case where we were building a pipeline for K-Means, using transformers to index and vectorize categorical features before clustering with K-Means. We were able to build it, and I think it could serve as a Pipeline API example for another ML algorithm. If you think the example added in the PR is redundant as a Pipeline example, please close this issue. Thanks! > Add Pipeline API example for KMeans > --- > > Key: SPARK-16765 > URL: https://issues.apache.org/jira/browse/SPARK-16765 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.0.0 > Reporter: Manish Mishra > Priority: Trivial > > A pipeline API example for K-Means would be nice to have in the examples package, since it is one of the most widely used ML algorithms. A pipeline API example for Logistic Regression has already been added to the ml examples package.
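The use case described in this comment (indexing and vectorizing categorical features before K-Means) can be sketched in plain Python. This is a conceptual illustration only: the class names mirror Spark ML's StringIndexer, VectorAssembler, KMeans and Pipeline stages, but none of this is the real Spark API.

```python
# Conceptual sketch of the pipeline shape described above, NOT the Spark ML
# API: index a categorical column, assemble feature vectors, then cluster.
import random

class StringIndexer:
    """Maps each distinct categorical value to a numeric index."""
    def __init__(self, col):
        self.col = col
    def fit(self, rows):
        self.mapping = {v: float(i) for i, v in
                        enumerate(sorted({r[self.col] for r in rows}))}
        return self
    def transform(self, rows):
        return [{**r, self.col + "_idx": self.mapping[r[self.col]]} for r in rows]

class VectorAssembler:
    """Collects the listed numeric columns into a 'features' vector."""
    def __init__(self, cols):
        self.cols = cols
    def fit(self, rows):
        return self
    def transform(self, rows):
        return [{**r, "features": [r[c] for c in self.cols]} for r in rows]

class KMeans:
    """Tiny Lloyd's-algorithm k-means over the 'features' column."""
    def __init__(self, k, iters=10, seed=0):
        self.k, self.iters = k, iters
        random.seed(seed)
    def _nearest(self, p):
        return min(range(self.k),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(p, self.centers[i])))
    def fit(self, rows):
        points = [r["features"] for r in rows]
        self.centers = random.sample(points, self.k)
        for _ in range(self.iters):
            buckets = [[] for _ in range(self.k)]
            for p in points:
                buckets[self._nearest(p)].append(p)
            # Recompute each center as the mean of its bucket (keep old if empty).
            self.centers = [[sum(dim) / len(b) for dim in zip(*b)] if b else c
                            for b, c in zip(buckets, self.centers)]
        return self
    def transform(self, rows):
        return [{**r, "cluster": self._nearest(r["features"])} for r in rows]

class Pipeline:
    """Runs fit/transform through each stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def fit_transform(self, rows):
        for s in self.stages:
            rows = s.fit(rows).transform(rows)
        return rows

rows = [{"country": "US", "amount": 1.0}, {"country": "US", "amount": 1.2},
        {"country": "IN", "amount": 9.0}, {"country": "IN", "amount": 9.3}]
out = Pipeline([StringIndexer("country"),
                VectorAssembler(["country_idx", "amount"]),
                KMeans(k=2)]).fit_transform(rows)
```

The point is the chaining: each stage's fit/transform feeds the next, so the categorical preprocessing and the clusterer travel together as one model, which is what the Pipeline API example would demonstrate.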
[jira] [Assigned] (SPARK-16765) Add Pipeline API example for KMeans
[ https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16765: Assignee: Apache Spark > Add Pipeline API example for KMeans > --- > > Key: SPARK-16765 > URL: https://issues.apache.org/jira/browse/SPARK-16765 > Project: Spark > Issue Type: Improvement > Components: ML > Affects Versions: 2.0.0 > Reporter: Manish Mishra > Assignee: Apache Spark > Priority: Trivial > > A pipeline API example for K-Means would be nice to have in the examples package, since it is one of the most widely used ML algorithms. A pipeline API example for Logistic Regression has already been added to the ml examples package.
[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401523#comment-15401523 ] Sital Kedia commented on SPARK-16827: - Actually, it seems like this is a bug in the shuffle write metrics calculation; the actual amount of shuffle data written might be the same, because the final output is exactly the same. > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core > Affects Versions: 2.0.0 > Reporter: Sital Kedia > Labels: performance > > One of our Hive jobs, which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ON a.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > is significantly slower after the upgrade to Spark 2.0. Digging a little > into it, we found out that one of the stages produces an excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6: stage 2 > of the job, which used to produce 32KB of shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole-stage code > generation, but that did not help. > PS - Even if the intermediate shuffle data size is huge, the job still > produces accurate output.
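To make the hypothesis concrete, here is a plain-Python illustration (not Spark's actual writer code, and not a confirmed diagnosis of this issue) of how a write-metrics counter can over-report while the final shuffle output stays the same, for example if intermediate spills and the merged final file are both added to the counter:

```python
# Illustrative only: a writer that double-counts bytes. Each spill is added to
# the metric, and the merged final file is added again, so the reported bytes
# can far exceed the bytes actually shuffled.
class ShuffleWriter:
    def __init__(self):
        self.spills = []
        self.metric_bytes = 0

    def spill(self, data: bytes):
        self.spills.append(data)
        self.metric_bytes += len(data)   # counted once here...

    def merge(self) -> bytes:
        final = b"".join(self.spills)
        self.metric_bytes += len(final)  # ...and again here: double-counting
        return final

w = ShuffleWriter()
w.spill(b"a" * 100)
w.spill(b"b" * 100)
final = w.merge()
# The real output is 200 bytes, but the metric reports 400: the job's result
# is unaffected, only the reported shuffle size is inflated.
```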
[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message
[ https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401456#comment-15401456 ] Tao Wang commented on SPARK-14559: -- Hi, we are seeing the same issue, along with another error when sending messages:
{code}
2016-07-13 10:06:10,123 | WARN | [spark-dynamic-executor-allocation] | Error sending message [message = RequestExecutors(32963,133120,Map(9-91-8-220 -> 53119, 9-91-8-219 -> 53306, 9-91-8-208 -> 75446, 9-91-8-218 -> 87637, 9-96-101-254 -> 76229, 9-91-8-217 -> 53623))] in 2 attempts | org.apache.spark.Logging$class.logWarning(Logging.scala:92)
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [300 seconds]. This timeout is controlled by spark.network.timeout
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:57)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:426)
at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1476)
at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:396)
at org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:345)
at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:299)
at org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:228)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 16 more
2016-07-13 10:10:44,928 | ERROR | [shuffle-server-5] | Failed to send RPC 6030874580924985460 to 9-91-8-218/172.18.0.124:46353: java.nio.channels.ClosedChannelException | org.apache.spark.network.client.TransportClient$2.operationComplete(TransportClient.java:174) java.nio.channels.ClosedChannelException
2016-07-13 10:10:44,929 | ERROR | [shuffle-server-1] | Failed to send RPC 8715801397093832896 to 9-91-8-208/172.18.0.115:50585: java.nio.channels.ClosedChannelException | org.apache.spark.network.client.TransportClient$2.operationComplete(TransportClient.java:174) java.nio.channels.ClosedChannelException
2016-07-13 10:10:44,929 | ERROR | [shuffle-server-0] | Failed to send RPC 7130936020310576573 to 9-91-8-208/172.18.0.115:50597: java.nio.channels.ClosedChannelException | org.apache.spark.network.client.TransportClient$2.operationComplete(TransportClient.java:174) java.nio.channels.ClosedChannelException
2016-07-13 10:10:44,929 | ERROR | [shuffle-server-3] | Failed to send RPC 5337112436358516213 to 9-91-8-219/172.18.0.116:56608: java.nio.channels.ClosedChannelException | org.apache.spark.network.client.TransportClient$2.operationComplete(TransportClient.java:174)
{code}
This lasted for a very long time (from 10:00 a.m. to 8:30 p.m.), and the job eventually failed with "too many executors failed" (which is caused by an error sending the RetrieveSparkProps message to the driver side). > Netty RPC didn't check channel is active before sending message >
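The fix the issue title suggests can be sketched as follows. This is a hypothetical, plain-Python model of a channel, not Spark's Netty TransportClient: the point is simply to check liveness before sending and fail fast with useful context instead of surfacing a bare ClosedChannelException later.

```python
# Hypothetical sketch, NOT Spark's TransportClient: guard every send with an
# is_active() check so a dead channel produces an immediate, descriptive error.
class ChannelClosedError(Exception):
    pass

class Channel:
    """Toy stand-in for a network channel."""
    def __init__(self):
        self.active = True
        self.sent = []
    def is_active(self):
        return self.active
    def write(self, msg):
        if not self.active:
            raise ChannelClosedError("channel closed")
        self.sent.append(msg)

def send_rpc(channel, msg):
    # Check before sending: fail fast with context rather than letting a
    # low-level close exception surface from deep inside the write path.
    if not channel.is_active():
        raise ChannelClosedError(f"cannot send {msg!r}: channel is not active")
    channel.write(msg)

ch = Channel()
send_rpc(ch, "RetrieveSparkProps")      # succeeds while the channel is live
ch.active = False
try:
    send_rpc(ch, "RequestExecutors")    # rejected up front
    sent_after_close = True
except ChannelClosedError:
    sent_after_close = False
```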
[jira] [Commented] (SPARK-16817) Enable storing of shuffle data in Alluxio
[ https://issues.apache.org/jira/browse/SPARK-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401432#comment-15401432 ] Saisai Shao commented on SPARK-16817: - What's the difference compared to using a ramdisk to store shuffle data? If you want to store shuffle data in memory, a ramdisk is the simplest way to achieve that. Also, from my understanding, Alluxio may not be faster than a ramdisk because of several layers of unnecessary distributed-communication overhead. > Enable storing of shuffle data in Alluxio > - > > Key: SPARK-16817 > URL: https://issues.apache.org/jira/browse/SPARK-16817 > Project: Spark > Issue Type: New Feature > Reporter: Tim Bisson > > If one is using Alluxio for storage, it would also be useful if Spark could > store shuffle spill data in Alluxio. For example: > spark.local.dir="alluxio://host:port/path" > Several users on the Alluxio mailing list have asked for this feature: > https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/90pRZWRVi0s/mgLWLS5aAgAJ > https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/s9H93PnDebw/v_1_FMjR7vEJ
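For reference, the ramdisk approach Saisai mentions can be set up roughly like this. The mount point and size are illustrative assumptions; spark.local.dir is a real Spark setting, shown above in the issue description:

```shell
# Mount a tmpfs (RAM-backed) filesystem and point Spark's local directory at
# it, so shuffle/spill files live in memory. Requires root; the 64g size and
# /mnt/spark-ramdisk path are illustrative.
sudo mkdir -p /mnt/spark-ramdisk
sudo mount -t tmpfs -o size=64g tmpfs /mnt/spark-ramdisk

# Then, in conf/spark-defaults.conf:
#   spark.local.dir  /mnt/spark-ramdisk
```

This is purely node-local, which is the basis of the comparison: it avoids the distributed coordination a shared storage layer performs.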
[jira] [Resolved] (SPARK-16805) Log timezone when query result does not match
[ https://issues.apache.org/jira/browse/SPARK-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16805. -- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 14413 [https://github.com/apache/spark/pull/14413] > Log timezone when query result does not match > - > > Key: SPARK-16805 > URL: https://issues.apache.org/jira/browse/SPARK-16805 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.1, 2.1.0 > > > It is useful to log the timezone when query result does not match, especially > on build machines that have different timezone from AMPLab Jenkins.
[jira] [Resolved] (SPARK-16731) use StructType in CatalogTable and remove CatalogColumn
[ https://issues.apache.org/jira/browse/SPARK-16731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-16731. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14363 [https://github.com/apache/spark/pull/14363] > use StructType in CatalogTable and remove CatalogColumn > --- > > Key: SPARK-16731 > URL: https://issues.apache.org/jira/browse/SPARK-16731 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.1.0 > >
[jira] [Commented] (SPARK-16761) Fix doc link in docs/ml-guide.md
[ https://issues.apache.org/jira/browse/SPARK-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401419#comment-15401419 ] Dapeng Sun commented on SPARK-16761: Thanks, [~srowen], for your commit. > Fix doc link in docs/ml-guide.md > > > Key: SPARK-16761 > URL: https://issues.apache.org/jira/browse/SPARK-16761 > Project: Spark > Issue Type: Documentation > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Dapeng Sun >Assignee: Dapeng Sun >Priority: Trivial > Fix For: 2.0.1, 2.1.0 > > > Fix the invalid link at http://spark.apache.org/docs/latest/ml-guide.html
[jira] [Commented] (SPARK-16775) Reduce internal warnings from deprecated accumulator API
[ https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401417#comment-15401417 ] holdenk commented on SPARK-16775: - I'm going to get started on this one tomorrow unless someone else has already started on it :) > Reduce internal warnings from deprecated accumulator API > > > Key: SPARK-16775 > URL: https://issues.apache.org/jira/browse/SPARK-16775 > Project: Spark > Issue Type: Sub-task > Components: PySpark, Spark Core, SQL >Reporter: holdenk > > Deprecating the old accumulator API added a large number of warnings - many > of these could be fixed with a bit of refactoring to offer a non-deprecated > internal class while still preserving the external deprecation warnings.
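The refactoring pattern the ticket describes can be sketched like this. The class names are illustrative, not Spark's actual classes: the implementation moves into a non-deprecated internal class, and the public class becomes a thin deprecated wrapper, so internal callers stop triggering warnings while external users still see them.

```python
# Sketch of the ticket's pattern (illustrative names, not Spark's classes):
# internal code uses _InternalAccumulator and emits no warnings; external
# users of the public Accumulator still get a DeprecationWarning.
import warnings

class _InternalAccumulator:
    """Non-deprecated implementation for internal code paths."""
    def __init__(self, value=0):
        self.value = value
    def add(self, term):
        self.value += term

class Accumulator(_InternalAccumulator):
    """Public, deprecated API: warns on construction, delegates everything."""
    def __init__(self, value=0):
        warnings.warn("Accumulator is deprecated",
                      DeprecationWarning, stacklevel=2)
        super().__init__(value)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    internal = _InternalAccumulator()
    internal.add(5)          # internal path: no warning emitted
    public = Accumulator()   # external path: one DeprecationWarning
    public.add(7)
```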
[jira] [Assigned] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
[ https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16258: Assignee: (was: Apache Spark) > Automatically append the grouping keys in SparkR's gapply > - > > Key: SPARK-16258 > URL: https://issues.apache.org/jira/browse/SPARK-16258 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Timothy Hunter > > While working on the group apply function for python [1], we found it easier > to depart from SparkR's gapply function in the following way: > - the keys are appended by default to the spark dataframe being returned > - the output schema that the user provides is the schema of the R data > frame and does not include the keys > Here are the reasons for doing so: > - in most cases, users will want to know the key associated with a result -> > appending the key is the sensible default > - most functions in the SQL interface and in MLlib append columns, and > gapply departs from this philosophy > - for the cases when they do not need it, adding the key is a fraction of > the computation time and of the output size > - from a formal perspective, it makes calling gapply fully transparent to > the type of the key: it is easier to build a function with gapply because it > does not need to know anything about the key > This ticket proposes to change SparkR's gapply function to follow the same > convention as Python's implementation. > cc [~Narine] [~shivaram] > [1] > https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
[jira] [Assigned] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
[ https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16258: Assignee: Apache Spark > Automatically append the grouping keys in SparkR's gapply > - > > Key: SPARK-16258 > URL: https://issues.apache.org/jira/browse/SPARK-16258 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Timothy Hunter >Assignee: Apache Spark > > While working on the group apply function for python [1], we found it easier > to depart from SparkR's gapply function in the following way: > - the keys are appended by default to the spark dataframe being returned > - the output schema that the user provides is the schema of the R data > frame and does not include the keys > Here are the reasons for doing so: > - in most cases, users will want to know the key associated with a result -> > appending the key is the sensible default > - most functions in the SQL interface and in MLlib append columns, and > gapply departs from this philosophy > - for the cases when they do not need it, adding the key is a fraction of > the computation time and of the output size > - from a formal perspective, it makes calling gapply fully transparent to > the type of the key: it is easier to build a function with gapply because it > does not need to know anything about the key > This ticket proposes to change SparkR's gapply function to follow the same > convention as Python's implementation. > cc [~Narine] [~shivaram] > [1] > https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
[jira] [Commented] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
[ https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401414#comment-15401414 ] Apache Spark commented on SPARK-16258: -- User 'NarineK' has created a pull request for this issue: https://github.com/apache/spark/pull/14431 > Automatically append the grouping keys in SparkR's gapply > - > > Key: SPARK-16258 > URL: https://issues.apache.org/jira/browse/SPARK-16258 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Timothy Hunter > > While working on the group apply function for python [1], we found it easier > to depart from SparkR's gapply function in the following way: > - the keys are appended by default to the spark dataframe being returned > - the output schema that the user provides is the schema of the R data > frame and does not include the keys > Here are the reasons for doing so: > - in most cases, users will want to know the key associated with a result -> > appending the key is the sensible default > - most functions in the SQL interface and in MLlib append columns, and > gapply departs from this philosophy > - for the cases when they do not need it, adding the key is a fraction of > the computation time and of the output size > - from a formal perspective, it makes calling gapply fully transparent to > the type of the key: it is easier to build a function with gapply because it > does not need to know anything about the key > This ticket proposes to change SparkR's gapply function to follow the same > convention as Python's implementation. > cc [~Narine] [~shivaram] > [1] > https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
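The proposed convention (the user's function returns only its own output columns, and the framework appends the grouping key to every result row by default) can be sketched in plain Python. This is a conceptual illustration, not the actual SparkR or PySpark API:

```python
# Plain-Python sketch of the ticket's convention: the user function never sees
# or returns the key; gapply appends it to each result row automatically.
from itertools import groupby

def gapply(rows, key_col, func):
    rows = sorted(rows, key=lambda r: r[key_col])
    out = []
    for key, group in groupby(rows, key=lambda r: r[key_col]):
        for result in func(list(group)):          # func is key-agnostic
            out.append({key_col: key, **result})  # key appended by default
    return out

data = [{"dept": "a", "salary": 10}, {"dept": "a", "salary": 30},
        {"dept": "b", "salary": 20}]
avg = gapply(data, "dept",
             lambda g: [{"avg_salary": sum(r["salary"] for r in g) / len(g)}])
# -> [{'dept': 'a', 'avg_salary': 20.0}, {'dept': 'b', 'avg_salary': 20.0}]
```

Note how the lambda's "schema" is just avg_salary; the dept column appears in the output only because the framework appends it, which is the transparency argument made in the description.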
[jira] [Comment Edited] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401413#comment-15401413 ] Sital Kedia edited comment on SPARK-16827 at 8/1/16 12:52 AM: -- That is not the case. There is no broadcast join involved, its using shuffle join in both 1.6 and 2.0. Plan in 1.6 - {code} Project [target_id#136L AS userid#115L] +- SortMergeJoin [source_id#135L], [id#140L] :- Sort [source_id#135L ASC], false, 0 : +- TungstenExchange hashpartitioning(source_id#135L,800), None : +- ConvertToUnsafe :+- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation foo, table1, Some(a), [(ds#134 = 2016-07-15)] +- Sort [id#140L ASC], false, 0 +- TungstenExchange hashpartitioning(id#140L,800), None +- ConvertToUnsafe +- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), [(ds#139 = 2016-07-15)] {code} Plan in 2.0 - {code} *Project [target_id#151L AS userid#111L] +- *SortMergeJoin [source_id#150L], [id#155L], Inner :- *Sort [source_id#150L ASC], false, 0 : +- Exchange hashpartitioning(source_id#150L, 800) : +- *Filter isnotnull(source_id#150L) :+- HiveTableScan [source_id#150L, target_id#151L], MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)] +- *Sort [id#155L ASC], false, 0 +- Exchange hashpartitioning(id#155L, 800) +- *Filter isnotnull(id#155L) +- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, [isnotnull(ds#154), (ds#154 = 2016-07-15)] {code} was (Author: sitalke...@gmail.com): That is not the case. There is no broadcast join involved, its using shuffle join in both 1.6 and 2.0. 
Plan in 1.6 - Project [target_id#136L AS userid#115L] +- SortMergeJoin [source_id#135L], [id#140L] :- Sort [source_id#135L ASC], false, 0 : +- TungstenExchange hashpartitioning(source_id#135L,800), None : +- ConvertToUnsafe :+- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation foo, table1, Some(a), [(ds#134 = 2016-07-15)] +- Sort [id#140L ASC], false, 0 +- TungstenExchange hashpartitioning(id#140L,800), None +- ConvertToUnsafe +- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), [(ds#139 = 2016-07-15)] Plan in 2.0 - *Project [target_id#151L AS userid#111L] +- *SortMergeJoin [source_id#150L], [id#155L], Inner :- *Sort [source_id#150L ASC], false, 0 : +- Exchange hashpartitioning(source_id#150L, 800) : +- *Filter isnotnull(source_id#150L) :+- HiveTableScan [source_id#150L, target_id#151L], MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)] +- *Sort [id#155L ASC], false, 0 +- Exchange hashpartitioning(id#155L, 800) +- *Filter isnotnull(id#155L) +- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, [isnotnull(ds#154), (ds#154 = 2016-07-15)] > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. 
We also tried turning off whole stage code > generation but that did not help. > PS - Even if the intermediate shuffle data size is huge, the job still > produces accurate output.
[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401413#comment-15401413 ] Sital Kedia commented on SPARK-16827: - That is not the case. There is no broadcast join involved; it's using a shuffle join in both 1.6 and 2.0. Plan in 1.6 - Project [target_id#136L AS userid#115L] +- SortMergeJoin [source_id#135L], [id#140L] :- Sort [source_id#135L ASC], false, 0 : +- TungstenExchange hashpartitioning(source_id#135L,800), None : +- ConvertToUnsafe :+- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation foo, table1, Some(a), [(ds#134 = 2016-07-15)] +- Sort [id#140L ASC], false, 0 +- TungstenExchange hashpartitioning(id#140L,800), None +- ConvertToUnsafe +- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), [(ds#139 = 2016-07-15)] Plan in 2.0 - *Project [target_id#151L AS userid#111L] +- *SortMergeJoin [source_id#150L], [id#155L], Inner :- *Sort [source_id#150L ASC], false, 0 : +- Exchange hashpartitioning(source_id#150L, 800) : +- *Filter isnotnull(source_id#150L) :+- HiveTableScan [source_id#150L, target_id#151L], MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)] +- *Sort [id#155L ASC], false, 0 +- Exchange hashpartitioning(id#155L, 800) +- *Filter isnotnull(id#155L) +- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, [isnotnull(ds#154), (ds#154 = 2016-07-15)] > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow.
Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help. > PS - Even if the intermediate shuffle data size is huge, the job still > produces accurate output.
[jira] [Updated] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-16827: Description: One of our hive job which looks like this - {code} SELECT userid FROM table1 a JOIN table2 b ONa.ds = '2016-07-15' AND b.ds = '2016-07-15' AND a.source_id = b.id {code} After upgrade to Spark 2.0 the job is significantly slow. Digging a little into it, we found out that one of the stages produces excessive amount of shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 of the job which used to produce 32KB shuffle data with 1.6, now produces more than 400GB with Spark 2.0. We also tried turning off whole stage code generation but that did not help. PS - Even if the intermediate shuffle data size is huge, the job still produces accurate output. was: One of our hive job which looks like this - {code} SELECT userid FROM table1 a JOIN table2 b ONa.ds = '2016-07-15' AND b.ds = '2016-07-15' AND a.source_id = b.id {code} After upgrade to Spark 2.0 the job is significantly slow. Digging a little into it, we found out that one of the stages produces excessive amount of shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 of the job which used to produce 32KB shuffle data with 1.6, now produces more than 400GB with Spark 2.0. We also tried turning off whole stage code generation but that did not help. > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. 
Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help. > PS - Even if the intermediate shuffle data size is huge, the job still > produces accurate output.
[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401407#comment-15401407 ] Reynold Xin commented on SPARK-16827: - Did the plan in Spark 1.6 use a broadcast join, and in 2.0 use a shuffle join? > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help.
[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401405#comment-15401405 ] Sital Kedia commented on SPARK-16827: - [~rxin] - Any idea how to debug this issue? > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16827) Query with Join produces excessive shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-16827: Description: One of our hive job which looks like this - {code} SELECT userid FROM table1 a JOIN table2 b ONa.ds = '2016-07-15' AND b.ds = '2016-07-15' AND a.source_id = b.id {code} After upgrade to Spark 2.0 the job is significantly slow. Digging a little into it, we found out that one of the stages produces excessive amount of shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 of the job which used to produce 32KB shuffle data with 1.6, now produces more than 400GB with Spark 2.0. We also tried turning off whole stage code generation but that did not help. was: One of our hive job which looks like this - {code] SELECT userid FROM table1 a JOIN table2 b ONa.ds = '2016-07-15' AND b.ds = '2016-07-15' AND a.source_id = b.id {code} After upgrade to Spark 2.0 the job is significantly slow. Digging a little into it, we found out that one of the stages produces excessive amount of shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 of the job which used to produce 32KB shuffle data with 1.6, now produces more than 400GB with Spark 2.0. We also tried turning off whole stage code generation but that did not help. > Query with Join produces excessive shuffle data > --- > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. 
Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16827) Query with Join produces excessive amount of shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-16827: Summary: Query with Join produces excessive amount of shuffle data (was: Query with Join produces excessive shuffle data) > Query with Join produces excessive amount of shuffle data > - > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code} > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. We also tried turning off whole stage code > generation but that did not help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16827) Query with Join produces excessive shuffle data
[ https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sital Kedia updated SPARK-16827: Description: One of our hive job which looks like this - {code] SELECT userid FROM table1 a JOIN table2 b ONa.ds = '2016-07-15' AND b.ds = '2016-07-15' AND a.source_id = b.id {code} After upgrade to Spark 2.0 the job is significantly slow. Digging a little into it, we found out that one of the stages produces excessive amount of shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 of the job which used to produce 32KB shuffle data with 1.6, now produces more than 400GB with Spark 2.0. We also tried turning off whole stage code generation but that did not help. was: One of our hive job which looks like this - SELECT userid FROM table1 a JOIN table2 b ONa.ds = '2016-07-15' AND b.ds = '2016-07-15' AND a.source_id = b.id After upgrade to Spark 2.0 the job is significantly slow. Digging a little into it, we found out that one of the stages produces excessive amount of shuffle data. Please note that this is a regression from Spark 1.6 > Query with Join produces excessive shuffle data > --- > > Key: SPARK-16827 > URL: https://issues.apache.org/jira/browse/SPARK-16827 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 2.0.0 >Reporter: Sital Kedia > Labels: performance > > One of our hive job which looks like this - > {code] > SELECT userid > FROM table1 a > JOIN table2 b > ONa.ds = '2016-07-15' > AND b.ds = '2016-07-15' > AND a.source_id = b.id > {code} > After upgrade to Spark 2.0 the job is significantly slow. Digging a little > into it, we found out that one of the stages produces excessive amount of > shuffle data. Please note that this is a regression from Spark 1.6. Stage 2 > of the job which used to produce 32KB shuffle data with 1.6, now produces > more than 400GB with Spark 2.0. 
We also tried turning off whole stage code > generation but that did not help. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16827) Query with Join produces excessive shuffle data
Sital Kedia created SPARK-16827: --- Summary: Query with Join produces excessive shuffle data Key: SPARK-16827 URL: https://issues.apache.org/jira/browse/SPARK-16827 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 2.0.0 Reporter: Sital Kedia One of our hive job which looks like this - SELECT userid FROM table1 a JOIN table2 b ON a.ds = '2016-07-15' AND b.ds = '2016-07-15' AND a.source_id = b.id After upgrade to Spark 2.0 the job is significantly slow. Digging a little into it, we found out that one of the stages produces excessive amount of shuffle data. Please note that this is a regression from Spark 1.6 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16826: --- Description: Hello! I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of {{parse_url(url, "host")}} in Spark SQL. Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle. When I view the thread dump for my executors, most of the executor threads are "BLOCKED", in that state: {code} java.util.Hashtable.get(Hashtable.java:362) java.net.URL.getURLStreamHandler(URL.java:1135) java.net.URL.(URL.java:599) java.net.URL.(URL.java:490) java.net.URL.(URL.java:439) org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) org.apache.spark.scheduler.Task.run(Task.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) {code} However, when I switch from 1 executor with 36 cores to 9 executors with 4 cores, throughput is almost 10x higher and the CPUs are back at ~100% use. Thanks! was: Hello! I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of {{parse_url(url, "host")}} in Spark SQL. Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle. When I view the thread dump for my executors, most of the 36 cores are in status "BLOCKED", in that state: {code} java.util.Hashtable.get(Hashtable.java:362) java.net.URL.getURLStreamHandler(URL.java:1135) java.net.URL.(URL.java:599) java.net.URL.(URL.java:490) java.net.URL.(URL.java:439) org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) 
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) org.apache.spark.scheduler.Task.run(Task.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16826: --- Description: Hello! I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of {{parse_url(url, "host")}} in Spark SQL. Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle. When I view the thread dump for my executors, most of the 36 cores are in status "BLOCKED", in that state: {code} java.util.Hashtable.get(Hashtable.java:362) java.net.URL.getURLStreamHandler(URL.java:1135) java.net.URL.(URL.java:599) java.net.URL.(URL.java:490) java.net.URL.(URL.java:439) org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) org.apache.spark.scheduler.Task.run(Task.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) {code} However, when I switch from 1 executor with 36 cores to 9 executors with 4 cores, throughput is almost 10x higher and the CPUs are back at ~100% use. Thanks! was: Hello! I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of {{parse_url(url, "host")}} in Spark SQL. Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle. When I view the thread dump for my executors, most of the 36 cores are in status "BLOCKED", in that stage: {code} java.util.Hashtable.get(Hashtable.java:362) java.net.URL.getURLStreamHandler(URL.java:1135) java.net.URL.(URL.java:599) java.net.URL.(URL.java:490) java.net.URL.(URL.java:439) org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) 
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) org.apache.spark.scheduler.Task.run(Task.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
[ https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sylvain Zimmer updated SPARK-16826: --- Description: Hello! I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of {{parse_url(url, "host")}} in Spark SQL. Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle. When I view the thread dump for my executors, most of the 36 cores are in status "BLOCKED", in that stage: {code} java.util.Hashtable.get(Hashtable.java:362) java.net.URL.getURLStreamHandler(URL.java:1135) java.net.URL.(URL.java:599) java.net.URL.(URL.java:490) java.net.URL.(URL.java:439) org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) org.apache.spark.scheduler.Task.run(Task.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) {code} However, when I switch from 1 executor with 36 cores to 9 executors with 4 cores, throughput is almost 10x higher and the CPUs are back at ~100% use. Thanks! was: Hello! I'm using {c4.8xlarge} instances on EC2 with 36 cores and doing lots of {parse_url(url, "host")} in Spark SQL. Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle. When I view the thread dump for my executors, most of the 36 cores are in status "BLOCKED", in that stage: {code} java.util.Hashtable.get(Hashtable.java:362) java.net.URL.getURLStreamHandler(URL.java:1135) java.net.URL.(URL.java:599) java.net.URL.(URL.java:490) java.net.URL.(URL.java:439) org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) 
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) org.apache.spark.scheduler.Task.run(Task.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
[jira] [Created] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()
Sylvain Zimmer created SPARK-16826: -- Summary: java.util.Hashtable limits the throughput of PARSE_URL() Key: SPARK-16826 URL: https://issues.apache.org/jira/browse/SPARK-16826 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Sylvain Zimmer Hello! I'm using {c4.8xlarge} instances on EC2 with 36 cores and doing lots of {parse_url(url, "host")} in Spark SQL. Unfortunately it seems that there is an internal thread-safe cache in there, and the instances end up being 90% idle. When I view the thread dump for my executors, most of the 36 cores are in status "BLOCKED", in that stage: {code} java.util.Hashtable.get(Hashtable.java:362) java.net.URL.getURLStreamHandler(URL.java:1135) java.net.URL.<init>(URL.java:599) java.net.URL.<init>(URL.java:490) java.net.URL.<init>(URL.java:439) org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731) org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772) org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785) org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203) org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202) scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463) org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) org.apache.spark.scheduler.Task.run(Task.scala:85) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745) {code} However, when I switch from 1 executor with 36 cores to 9 executors with 4 cores, throughput is almost 10x higher and the CPUs are back at ~100% use. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
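The stack trace above bottoms out in java.util.Hashtable.get, whose methods are all synchronized, so every URL constructed by parse_url across all 36 task threads contends for one JVM-wide lock. A rough stand-in in plain Python (hypothetical names, not the Spark or JDK code) shows the pattern: a globally locked cache serializes otherwise-independent threads, which is why spreading the same cores over more JVMs — the reporter's 9x4-core executor workaround — restores throughput:

```python
import threading

class GloballyLockedCache:
    """Analogue of java.util.Hashtable: every operation takes one shared
    lock, so concurrent readers serialize even though lookups are cheap."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get_or_create(self, key, factory):
        with self._lock:  # all threads queue on this single lock
            if key not in self._data:
                self._data[key] = factory(key)
            return self._data[key]

handler_cache = GloballyLockedCache()  # like URL's protocol-handler table

def parse_host(url):
    # Stand-in for parse_url(url, "host"): the per-row work is trivial,
    # but every call funnels through the shared cache.
    scheme, rest = url.split("://", 1)
    handler_cache.get_or_create(scheme, lambda s: "handler:" + s)
    return rest.split("/", 1)[0]

def worker(urls, out, idx):
    out[idx] = [parse_host(u) for u in urls]

urls = [f"http://host{i % 5}.example.com/p" for i in range(1000)]
results = [None] * 4
threads = [threading.Thread(target=worker, args=(urls, results, i))
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```

In the JVM case one plausible mitigation is avoiding a new java.net.URL per row (java.net.URI performs the same parsing without consulting the shared handler table), though whether Spark takes that route here is a separate question from this report.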
[jira] [Updated] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat
[ https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-16825: Description: Currently, we are using hive.default.fileformat for specifying the default file format in CREATE TABLE statement. Multiple issues exist: - This parameter value is not from `hive-site.xml`. Thus, even if users change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. To change the parameter values, users have to use Spark interface, (e.g., by a SET command or API). - This parameter is not documented. - This parameter value will not be sent to Hive metastore. It is being used by Spark internals when processing CREATE TABLE statement. - This parameter is case sensitive. Since this is being used by Spark only, it does not make sense to use a parameter starting from `hive`. we might follow the other Hive-related parameters and introduce a new Spark parameter here. It should be public. Thus, how about replacing `hive.default.fileformat` by `spark.sql.default.fileformat`. we also should make it case insensitive. was: Currently, we are using `hive.default.fileformat`, if users do not specify the format in the CREATE TABLE SQL statement. Multiple issues exist: - This parameter value is not from `hive-site.xml`. Thus, even if users change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. To change the parameter values, users have to use Spark interface, (e.g., by a SET command or API). - This parameter is not documented. - This parameter value will not be sent to Hive metastore. It is being used by Spark internals when processing CREATE TABLE statement. - This parameter is case sensitive. Since this is being used by Spark only, it does not make sense to use a parameter starting from `hive`. we might follow the other Hive-related parameters and introduce a new Spark parameter here. It should be public. 
Thus, how about replacing `hive.default.fileformat` by `spark.sql.default.fileformat`. we also should make it case insensitive. > Replace hive.default.fileformat by spark.sql.default.fileformat > --- > > Key: SPARK-16825 > URL: https://issues.apache.org/jira/browse/SPARK-16825 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are using hive.default.fileformat for specifying the default > file format in CREATE TABLE statement. Multiple issues exist: > - This parameter value is not from `hive-site.xml`. Thus, even if users > change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. > To change the parameter values, users have to use Spark interface, (e.g., by > a SET command or API). > - This parameter is not documented. > - This parameter value will not be sent to Hive metastore. It is being used > by Spark internals when processing CREATE TABLE statement. > - This parameter is case sensitive. > Since this is being used by Spark only, it does not make sense to use a > parameter starting from `hive`. we might follow the other Hive-related > parameters and introduce a new Spark parameter here. It should be public. > Thus, how about replacing `hive.default.fileformat` by > `spark.sql.default.fileformat`. we also should make it case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat
[ https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16825: Assignee: Apache Spark > Replace hive.default.fileformat by spark.sql.default.fileformat > --- > > Key: SPARK-16825 > URL: https://issues.apache.org/jira/browse/SPARK-16825 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Currently, we are using `hive.default.fileformat`, if users do not specify > the format in the CREATE TABLE SQL statement. Multiple issues exist: > - This parameter value is not from `hive-site.xml`. Thus, even if users > change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. > To change the parameter values, users have to use Spark interface, (e.g., by > a SET command or API). > - This parameter is not documented. > - This parameter value will not be sent to Hive metastore. It is being used > by Spark internals when processing CREATE TABLE statement. > - This parameter is case sensitive. > Since this is being used by Spark only, it does not make sense to use a > parameter starting from `hive`. we might follow the other Hive-related > parameters and introduce a new Spark parameter here. It should be public. > Thus, how about replacing `hive.default.fileformat` by > `spark.sql.default.fileformat`. we also should make it case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat
Xiao Li created SPARK-16825: --- Summary: Replace hive.default.fileformat by spark.sql.default.fileformat Key: SPARK-16825 URL: https://issues.apache.org/jira/browse/SPARK-16825 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li Currently, we are using `hive.default.fileformat`, if users do not specify the format in the CREATE TABLE SQL statement. Multiple issues exist: - This parameter value is not from `hive-site.xml`. Thus, even if users change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. To change the parameter values, users have to use Spark interface, (e.g., by a SET command or API). - This parameter is not documented. - This parameter value will not be sent to Hive metastore. It is being used by Spark internals when processing CREATE TABLE statement. - This parameter is case sensitive. Since this is being used by Spark only, it does not make sense to use a parameter starting from `hive`. we might follow the other Hive-related parameters and introduce a new Spark parameter here. It should be public. Thus, how about replacing `hive.default.fileformat` by `spark.sql.default.fileformat`. we also should make it case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
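Of the four complaints above, case sensitivity is the most mechanical to fix in a new parameter. A minimal sketch (hypothetical — this is not Spark's SQLConf, and the format list is assumed) of a setting that normalizes both the key and the file-format value:

```python
class FileFormatConf:
    # Hypothetical sketch of a case-insensitive setting like the proposed
    # spark.sql.default.fileformat; not Spark's actual SQLConf.
    _FORMATS = {"textfile", "sequencefile", "rcfile", "orc", "parquet", "avro"}

    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        key = key.lower()  # keys compare case-insensitively
        if key == "spark.sql.default.fileformat":
            if value.lower() not in self._FORMATS:
                raise ValueError("unknown file format: " + value)
            value = value.lower()  # "Parquet", "PARQUET", "parquet" all match
        self._settings[key] = value

    def get(self, key, default=None):
        return self._settings.get(key.lower(), default)
```

Normalizing at set/get time, rather than at every comparison site, also gives the validation error a single home — which addresses the "not documented, not validated" half of the complaint as a side effect.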
[jira] [Commented] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat
[ https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401350#comment-15401350 ] Apache Spark commented on SPARK-16825: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14430 > Replace hive.default.fileformat by spark.sql.default.fileformat > --- > > Key: SPARK-16825 > URL: https://issues.apache.org/jira/browse/SPARK-16825 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are using `hive.default.fileformat`, if users do not specify > the format in the CREATE TABLE SQL statement. Multiple issues exist: > - This parameter value is not from `hive-site.xml`. Thus, even if users > change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. > To change the parameter values, users have to use Spark interface, (e.g., by > a SET command or API). > - This parameter is not documented. > - This parameter value will not be sent to Hive metastore. It is being used > by Spark internals when processing CREATE TABLE statement. > - This parameter is case sensitive. > Since this is being used by Spark only, it does not make sense to use a > parameter starting from `hive`. we might follow the other Hive-related > parameters and introduce a new Spark parameter here. It should be public. > Thus, how about replacing `hive.default.fileformat` by > `spark.sql.default.fileformat`. we also should make it case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat
[ https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16825: Assignee: (was: Apache Spark) > Replace hive.default.fileformat by spark.sql.default.fileformat > --- > > Key: SPARK-16825 > URL: https://issues.apache.org/jira/browse/SPARK-16825 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > Currently, we are using `hive.default.fileformat`, if users do not specify > the format in the CREATE TABLE SQL statement. Multiple issues exist: > - This parameter value is not from `hive-site.xml`. Thus, even if users > change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. > To change the parameter values, users have to use Spark interface, (e.g., by > a SET command or API). > - This parameter is not documented. > - This parameter value will not be sent to Hive metastore. It is being used > by Spark internals when processing CREATE TABLE statement. > - This parameter is case sensitive. > Since this is being used by Spark only, it does not make sense to use a > parameter starting from `hive`. we might follow the other Hive-related > parameters and introduce a new Spark parameter here. It should be public. > Thus, how about replacing `hive.default.fileformat` by > `spark.sql.default.fileformat`. we also should make it case insensitive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16811) spark streaming-Exception in thread “submit-job-thread-pool-40”
[ https://issues.apache.org/jira/browse/SPARK-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16811. --- Resolution: Invalid Read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first please > spark streaming-Exception in thread “submit-job-thread-pool-40” > --- > > Key: SPARK-16811 > URL: https://issues.apache.org/jira/browse/SPARK-16811 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.1 > Environment: spark 1.6.1 >Reporter: BinXu >Priority: Blocker > Labels: maven > Original Estimate: 1,176h > Remaining Estimate: 1,176h > > {color:red}I am using spark 1.6. When I run streaming application for a few > minutes,it cause such Exception Like this:{color} > Exception in thread "submit-job-thread-pool-40" Exception in thread > "submit-job-thread-pool-33" Exception in thread "submit-job-thread-pool-23" > Exception in thread "submit-job-thread-pool-14" Exception in thread > "submit-job-thread-pool-29" Exception in thread "submit-job-thread-pool-39" > Exception in thread "submit-job-thread-pool-2" java.lang.Error: > java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 
2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 
2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at
[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-16816: -- Target Version/s: (was: 2.0.0) Labels: (was: documentation patch) Priority: Trivial (was: Minor) Fix Version/s: (was: 2.0.0) Please see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first > Add documentation to create JavaSparkContext from SparkSession > -- > > Key: SPARK-16816 > URL: https://issues.apache.org/jira/browse/SPARK-16816 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: sandeep purohit >Priority: Trivial > Original Estimate: 3h > Remaining Estimate: 3h > > This issue documents how to create a JavaSparkContext from a SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package
[ https://issues.apache.org/jira/browse/SPARK-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16813: Fix Version/s: 2.0.1 > Remove private[sql] and private[spark] from catalyst package > > > Key: SPARK-16813 > URL: https://issues.apache.org/jira/browse/SPARK-16813 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.1, 2.1.0 > > > The catalyst package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401208#comment-15401208 ] Cody Koeninger commented on SPARK-16534: It's on the PR. Yes, one committer veto is generally sufficient. > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep purohit updated SPARK-16816: Labels: documentation patch (was: patch) Description: In this issue the user can know how to create the JavaSparkContext with spark session. was: In this improvement the user can directly get the JavaSparkContext from the SparkSession. Component/s: (was: SQL) Documentation Issue Type: Documentation (was: Improvement) > Add documentation to create JavaSparkContext from SparkSession > -- > > Key: SPARK-16816 > URL: https://issues.apache.org/jira/browse/SPARK-16816 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.0.0 >Reporter: sandeep purohit >Priority: Minor > Labels: documentation, patch > Fix For: 2.0.0 > > Original Estimate: 3h > Remaining Estimate: 3h > > In this issue the user can know how to create the JavaSparkContext with spark > session. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession
[ https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sandeep purohit updated SPARK-16816: Summary: Add documentation to create JavaSparkContext from SparkSession (was: Add api to get JavaSparkContext from SparkSession) > Add documentation to create JavaSparkContext from SparkSession > -- > > Key: SPARK-16816 > URL: https://issues.apache.org/jira/browse/SPARK-16816 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: sandeep purohit >Priority: Minor > Labels: patch > Fix For: 2.0.0 > > Original Estimate: 3h > Remaining Estimate: 3h > > In this improvement the user can directly get the JavaSparkContext from the > SparkSession. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401198#comment-15401198 ] Nicholas Chammas commented on SPARK-12157: -- OK. I've raised the issue of documenting this in PySpark here: https://issues.apache.org/jira/browse/SPARK-16824 > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
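The root cause described in SPARK-12157 is that numpy scalars such as `np.int64` are not understood by the pickler used on the JVM side. Until automatic casting lands, a common workaround is to convert the UDF's return value to a builtin Python type before it leaves the UDF. The helper below is a hypothetical sketch of that conversion, not part of the PySpark API; it can be run without Spark.

```python
import numpy as np


def as_builtin(value):
    """Convert a numpy scalar to the closest builtin Python type.

    np.generic is the common base class of numpy scalar types such as
    np.int64 and np.float64; .item() unwraps them to int/float.
    Non-numpy values pass through unchanged.
    """
    if isinstance(value, np.generic):
        return value.item()
    return value


# Wrapping the lambda from the report so it returns a plain int,
# which the IntegerType() UDF declaration expects:
argmax_fn = lambda x: as_builtin(np.argmax(x))

result = argmax_fn([1, 5, 3])
print(type(result).__name__, result)  # -> int 1
```

In the original snippet, one would pass `argmax_fn` (instead of the bare numpy lambda) to `F.udf(..., T.IntegerType())`, so Spark's pickler only ever sees builtin Python values.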
[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT
[ https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401197#comment-15401197 ] Nicholas Chammas commented on SPARK-16824: -- cc [~josephkb] [~mengxr] - Should this type be publicly documented in PySpark? > Add API docs for VectorUDT > -- > > Key: SPARK-16824 > URL: https://issues.apache.org/jira/browse/SPARK-16824 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following on the [discussion > here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], > it appears that {{VectorUDT}} is missing documentation, at least in PySpark. > I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16824) Add API docs for VectorUDT
Nicholas Chammas created SPARK-16824: Summary: Add API docs for VectorUDT Key: SPARK-16824 URL: https://issues.apache.org/jira/browse/SPARK-16824 Project: Spark Issue Type: Improvement Components: Documentation, PySpark Affects Versions: 2.0.0 Reporter: Nicholas Chammas Priority: Minor Following on the [discussion here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153], it appears that {{VectorUDT}} is missing documentation, at least in PySpark. I'm not sure if this is intentional or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401192#comment-15401192 ] Jacek Laskowski edited comment on SPARK-16534 at 7/31/16 3:22 PM: -- Mind posting the reason for -1 from [~rxin] ad acta (at the very least). A single -1 should not be the only reason to ditch a proposal, should it? was (Author: jlaskowski): Ming posting the -1 from [~rxin] ad acta (at the very least). A single -1 should not be the only reason to ditch a proposal, should it? > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401192#comment-15401192 ] Jacek Laskowski commented on SPARK-16534: - Mind posting the reason for -1 from [~rxin] ad acta (at the very least). A single -1 should not be the only reason to ditch a proposal, should it? > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401190#comment-15401190 ] Maciej Szymkiewicz commented on SPARK-12157: Well, it is alpha component (see Scala API docs https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.mllib.linalg.VectorUDT). > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16823) One dimensional typed select on DataFrame does not work as expected
TobiasP created SPARK-16823: --- Summary: One dimensional typed select on DataFrame does not work as expected Key: SPARK-16823 URL: https://issues.apache.org/jira/browse/SPARK-16823 Project: Spark Issue Type: Bug Reporter: TobiasP Priority: Minor {code} spark.sqlContext.createDataFrame(for { x <- 1 to 4; y <- 1 to 4 } yield ((x, y),x)).select($"_1".as[(Int,Int)], $"_2".as[Int]) {code} works, so I would expect {code} spark.sqlContext.createDataFrame(for { x <- 1 to 4; y <- 1 to 4 } yield ((x, y),x)).select($"_1".as[(Int,Int)]) {code} to work too, but it expects a wrapping type around the tuple. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401179#comment-15401179 ] Nicholas Chammas commented on SPARK-12157: -- Thanks for the pointer, Maciej. It appears that {{VectorUDT}} is [undocumented|http://spark.apache.org/docs/latest/api/python/search.html?q=VectorUDT_keywords=yes=default]. Do you know if that is intentional? > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16811) spark streaming-Exception in thread “submit-job-thread-pool-40”
[ https://issues.apache.org/jira/browse/SPARK-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BinXu updated SPARK-16811: -- Remaining Estimate: 1,176h (was: 4h) Original Estimate: 1,176h (was: 4h) > spark streaming-Exception in thread “submit-job-thread-pool-40” > --- > > Key: SPARK-16811 > URL: https://issues.apache.org/jira/browse/SPARK-16811 > Project: Spark > Issue Type: Question > Components: Streaming >Affects Versions: 1.6.1 > Environment: spark 1.6.1 >Reporter: BinXu >Priority: Blocker > Labels: maven > Original Estimate: 1,176h > Remaining Estimate: 1,176h > > {color:red}I am using spark 1.6. When I run streaming application for a few > minutes,it cause such Exception Like this:{color} > Exception in thread "submit-job-thread-pool-40" Exception in thread > "submit-job-thread-pool-33" Exception in thread "submit-job-thread-pool-23" > Exception in thread "submit-job-thread-pool-14" Exception in thread > "submit-job-thread-pool-29" Exception in thread "submit-job-thread-pool-39" > Exception in thread "submit-job-thread-pool-2" java.lang.Error: > java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 
2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 
2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) > at > org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165) > at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ... 2 more > java.lang.Error: java.lang.InterruptedException > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.lang.InterruptedException > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at
[jira] [Commented] (SPARK-14155) Hide UserDefinedType in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401156#comment-15401156 ] Maciej Szymkiewicz commented on SPARK-14155: [~rxin] Is there any progress on that or some opened JIRA? > Hide UserDefinedType in Spark 2.0 > - > > Key: SPARK-14155 > URL: https://issues.apache.org/jira/browse/SPARK-14155 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > > UserDefinedType is a developer API in Spark 1.x. > With very high probability we will create a new API for user-defined type > that also works well with column batches as well as encoders (datasets). In > Spark 2.0, let's make UserDefinedType private[spark] first. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs
[ https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401153#comment-15401153 ] Maciej Szymkiewicz commented on SPARK-12157: [~nchammas]You're using incorrect schema. ArrayType(FloatType()) is not a vector. It should be VectorUDT (see http://stackoverflow.com/q/38249291) > Support numpy types as return values of Python UDFs > --- > > Key: SPARK-12157 > URL: https://issues.apache.org/jira/browse/SPARK-12157 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.5.2 >Reporter: Justin Uang > > Currently, if I have a python UDF > {code} > import pyspark.sql.types as T > import pyspark.sql.functions as F > from pyspark.sql import Row > import numpy as np > argmax = F.udf(lambda x: np.argmax(x), T.IntegerType()) > df = sqlContext.createDataFrame([Row(array=[1,2,3])]) > df.select(argmax("array")).count() > {code} > I get an exception that is fairly opaque: > {code} > Caused by: net.razorvine.pickle.PickleException: expected zero arguments for > construction of ClassDict (for numpy.dtype) > at > net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403) > {code} > Numpy types like np.int and np.float64 should automatically be cast to the > proper dtypes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version
[ https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16770: Assignee: (was: Apache Spark) > Spark shell not usable with german keyboard due to JLine version > > > Key: SPARK-16770 > URL: https://issues.apache.org/jira/browse/SPARK-16770 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Stefan Schulze >Priority: Minor > > It is impossible to enter a right square bracket with a single keystroke > using a German keyboard layout. The problem is known from earlier Scala > versions; responsible is jline-2.12.jar (see > https://issues.scala-lang.org/browse/SI-8759). > Workaround: Replace jline-2.12.jar with jline-2.12.1.jar in the jars folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16770) Spark shell not usable with German keyboard due to JLine version
[ https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401151#comment-15401151 ] Apache Spark commented on SPARK-16770: -- User 'stsc-pentasys' has created a pull request for this issue: https://github.com/apache/spark/pull/14429 > Spark shell not usable with German keyboard due to JLine version > > > Key: SPARK-16770 > URL: https://issues.apache.org/jira/browse/SPARK-16770 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Stefan Schulze >Priority: Minor > > It is impossible to enter a right square bracket with a single keystroke > using a German keyboard layout. The problem is known from earlier Scala > versions; the culprit is jline-2.12.jar (see > https://issues.scala-lang.org/browse/SI-8759). > Workaround: Replace jline-2.12.jar with jline-2.12.1.jar in the jars folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16770) Spark shell not usable with German keyboard due to JLine version
[ https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16770: Assignee: Apache Spark > Spark shell not usable with German keyboard due to JLine version > > > Key: SPARK-16770 > URL: https://issues.apache.org/jira/browse/SPARK-16770 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.0.0 >Reporter: Stefan Schulze >Assignee: Apache Spark >Priority: Minor > > It is impossible to enter a right square bracket with a single keystroke > using a German keyboard layout. The problem is known from earlier Scala > versions; the culprit is jline-2.12.jar (see > https://issues.scala-lang.org/browse/SI-8759). > Workaround: Replace jline-2.12.jar with jline-2.12.1.jar in the jars folder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
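The workaround quoted above is a single jar swap under $SPARK_HOME/jars. A sketch of it, run against a throwaway directory so it is safe to execute anywhere (substitute your real Spark install for `spark_home`; the download location of the fixed jar is an assumption):

```python
import os
import shutil
import tempfile

# Throwaway stand-in for a Spark install so the sketch runs anywhere.
spark_home = tempfile.mkdtemp()
jars_dir = os.path.join(spark_home, "jars")
os.makedirs(jars_dir)
open(os.path.join(jars_dir, "jline-2.12.jar"), "w").close()   # the buggy jar
fixed_jar = os.path.join(spark_home, "jline-2.12.1.jar")      # assumed download location
open(fixed_jar, "w").close()

# The actual workaround: remove jline-2.12.jar and drop in jline-2.12.1.jar.
os.remove(os.path.join(jars_dir, "jline-2.12.jar"))
shutil.copy(fixed_jar, jars_dir)

jar_names = sorted(os.listdir(jars_dir))
```

On a real install the same two steps are an `rm` and a `cp` in `$SPARK_HOME/jars`, followed by restarting the shell.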
[jira] [Commented] (SPARK-16822) Support latex in scaladoc with MathJax
[ https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401145#comment-15401145 ] Shuai Lin commented on SPARK-16822: --- I'm working on it and will post a PR soon. > Support latex in scaladoc with MathJax > -- > > Key: SPARK-16822 > URL: https://issues.apache.org/jira/browse/SPARK-16822 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Shuai Lin >Priority: Minor > > The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, > but currently they render very poorly, e.g. [the doc of the LogisticGradient > class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. > We can improve this by including the MathJax JavaScript in the scaladoc pages, > much like what we do for the markdown docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16822) Support latex in scaladoc with MathJax
Shuai Lin created SPARK-16822: - Summary: Support latex in scaladoc with MathJax Key: SPARK-16822 URL: https://issues.apache.org/jira/browse/SPARK-16822 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Shuai Lin Priority: Minor The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, but currently they render very poorly, e.g. [the doc of the LogisticGradient class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient]. We can improve this by including the MathJax JavaScript in the scaladoc pages, much like what we do for the markdown docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16820) Sparse - Sparse matrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-16820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401135#comment-15401135 ] Sean Owen commented on SPARK-16820: --- I think it depends a lot on how big the benefit is and how much complexity you introduce. But simple wins are always good. > Sparse - Sparse matrix multiplication > - > > Key: SPARK-16820 > URL: https://issues.apache.org/jira/browse/SPARK-16820 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.0 >Reporter: Ohad Raviv > > While working on an MCL implementation on Spark we have encountered some > difficulties. > The main part of this process is distributed sparse matrix multiplication > that has two main steps: > 1.Simulate multiply – preparation before the real multiplication in order > to see which blocks should be multiplied. > 2.The actual blocks multiplication and summation. > In our case the sparse matrix has 50M rows and columns, and 2B non-zeros. > The current multiplication suffers from these issues: > 1.A relatively trivial bug, already fixed, in the first step that caused the > process to be very slow [SPARK-16469] > 2.Even after the bug fix, if we have too many blocks the Simulate > multiply will take a very long time and will multiply the data many times. > (O(n^3) where n is the number of blocks) > 3.Spark supports only multiplication with dense matrices. Thus, it > converts a sparse matrix into a dense matrix before the multiplication. > 4.For summing the intermediate block results Spark uses Breeze’s CSC > matrix operations – here the problem is that it is very inefficient to update > a CSC matrix at a zero entry. > That means that with many blocks (default block size is 1024) – in our case > 50M/1024 ~= 50K, the simulate multiply will effectively never finish or will > generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a bigger > block size e.g. 100k we get OutOfMemoryException in the “toDense” method of > the multiply. 
We have worked around that by implementing ourselves both the > sparse multiplication and addition in a very naïve way – but at least it > works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
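The naive sparse-sparse workaround the reporter describes can be sketched in a few lines. This is an illustration of the idea (COO-style dicts standing in for distributed blocks), not Spark's BlockMatrix code; the function names are hypothetical:

```python
from collections import defaultdict


def sparse_multiply(a, b):
    """Multiply two sparse matrices given as {(row, col): value} dicts,
    never densifying either operand."""
    # Index b's entries by row so each entry of a only touches matching rows.
    b_by_row = defaultdict(list)
    for (k, j), v in b.items():
        b_by_row[k].append((j, v))
    out = defaultdict(float)
    for (i, k), va in a.items():
        for j, vb in b_by_row.get(k, ()):
            out[(i, j)] += va * vb
    return dict(out)


def sparse_add(a, b):
    """Sum two sparse matrices without the CSC in-place-update problem:
    building a fresh dict makes inserting a previously-zero entry O(1)."""
    out = dict(a)
    for key, v in b.items():
        out[key] = out.get(key, 0.0) + v
    return out


# [[1,0],[0,2]] @ [[0,3],[4,0]] = [[0,3],[8,0]]
prod = sparse_add(
    sparse_multiply({(0, 0): 1.0, (1, 1): 2.0}, {(0, 1): 3.0, (1, 0): 4.0}),
    {},
)
```

In the distributed setting each dict would be one block's non-zeros, with `sparse_add` used to fold the per-block partial products.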
[jira] [Commented] (SPARK-16821) GraphX MCL algorithm
[ https://issues.apache.org/jira/browse/SPARK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401134#comment-15401134 ] Sean Owen commented on SPARK-16821: --- Generally these don't go into Spark but are made available as your own package first. See https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > GraphX MCL algorithm > > > Key: SPARK-16821 > URL: https://issues.apache.org/jira/browse/SPARK-16821 > Project: Spark > Issue Type: New Feature > Components: GraphX >Affects Versions: 2.0.0 >Reporter: Ohad Raviv > > We have had the need to use the MCL clustering algorithm in a project we are > working on. We have based our implementation on joandre's code: > https://github.com/joandre/MCL_spark > We had a few scaling problems that we had to work around ourselves and > opened a separate JIRA on them. > Since we started to work on the algorithm we have been approached a few times > by different people who also need this algorithm. > Do you think you can add this algorithm to your base code? It looks like > there isn't any graph clustering algorithm yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16819) Exception in thread “main” org.apache.spark.SparkException: Application application finished with failed status
[ https://issues.apache.org/jira/browse/SPARK-16819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16819. --- Resolution: Invalid Questions should go to u...@spark.apache.org, not JIRA. However I'd also say that you probably won't get much help if you just paste huge logs. Narrow down your problem further first. > Exception in thread “main” org.apache.spark.SparkException: Application > application finished with failed status > --- > > Key: SPARK-16819 > URL: https://issues.apache.org/jira/browse/SPARK-16819 > Project: Spark > Issue Type: Question > Components: Streaming, YARN >Reporter: Asmaa Ali > Labels: beginner > Original Estimate: 24h > Remaining Estimate: 24h > > What is the reason of this exception ?! > cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit > --class SparkBWA --master yarn-cluster --deploy-mode cluster --conf > spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory 1500m > --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip --verbose > ./SparkBWA.jar -algorithm mem -reads paired -index /Data/HumanBase/hg38 > -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastqhb > Output_ERR000589 > Using properties file: /usr/lib/spark/conf/spark-defaults.conf > Adding default property: > spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: > spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.eventLog.enabled=true > Adding default property: spark.driver.maxResultSize=1920m > Adding default property: spark.shuffle.service.enabled=true > Adding default property: > spark.yarn.historyServer.address=cluster-cancerdetector-m:18080 > Adding default property: spark.sql.parquet.cacheMetadata=false > Adding default property: spark.driver.memory=3840m > Adding default property: spark.dynamicAllocation.maxExecutors=1 > Adding 
default property: spark.scheduler.minRegisteredResourcesRatio=0.0 > Adding default property: spark.yarn.am.memoryOverhead=558 > Adding default property: spark.yarn.am.memory=5586m > Adding default property: > spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > Adding default property: spark.master=yarn-client > Adding default property: spark.executor.memory=5586m > Adding default property: > spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog > Adding default property: spark.dynamicAllocation.enabled=true > Adding default property: spark.executor.cores=2 > Adding default property: spark.yarn.executor.memoryOverhead=558 > Adding default property: spark.dynamicAllocation.minExecutors=1 > Adding default property: spark.dynamicAllocation.initialExecutors=1 > Adding default property: spark.akka.frameSize=512 > Parsed arguments: > master yarn-cluster > deployMode cluster > executorMemory 1500m > executorCores 1 > totalExecutorCores null > propertiesFile /usr/lib/spark/conf/spark-defaults.conf > driverMemory1500m > driverCores null > driverExtraClassPathnull > driverExtraLibraryPath null > driverExtraJavaOptions > -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar > supervise false > queue null > numExecutorsnull > files null > pyFiles null > archivesfile:/home/cancerdetector/SparkBWA/build/./bwa.zip > mainClass SparkBWA > primaryResource > file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar > nameSparkBWA > childArgs [-algorithm mem -reads paired -index > /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq > ERR000589_2.filt.fastqhb Output_ERR000589] > jarsnull > packagesnull > packagesExclusions null > repositoriesnull > verbose true > Spark properties used, including those specified through > --conf and those from the properties file > /usr/lib/spark/conf/spark-defaults.conf: > spark.yarn.am.memoryOverhead -> 558 > spark.driver.memory -> 1500m > 
spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar > spark.executor.memory -> 5586m > spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080 > spark.eventLog.enabled -> true > spark.scheduler.minRegisteredResourcesRatio -> 0.0 > spark.dynamicAllocation.maxExecutors -> 1 > spark.akka.frameSize -> 512 > spark.executor.extraJavaOptions -> >
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401129#comment-15401129 ] Sean Owen commented on SPARK-6305: -- Yeah, I tried making Spark use log4j 2 and it was almost intractable. Most of the problem is Maven exclusions, to get every single instance of log4j 1.x in the dependencies to map to slf4j, because there are so many and they keep popping up; it's made a little harder because it's one version of log4j remapped to another via slf4j. Updating the actual use of log4j 1.x code was hard, since there are no direct counterparts to some existing methods, but not impossible. The final issue wasn't really a Spark issue but a usability one: lots of people's logging configs would stop working. I abandoned it as a result. I wonder if it would be simpler to map slf4j to the JDK logger if we need to. It doesn't solve the remapping problem but at least means we don't put log4j 2 in the way as well. > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401128#comment-15401128 ] Cody Koeninger commented on SPARK-16534: This idea got a -1 from Reynold, so unless anyone's going to argue for it, the ticket should be closed. > Kafka 0.10 Python support > - > > Key: SPARK-16534 > URL: https://issues.apache.org/jira/browse/SPARK-16534 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Tathagata Das > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16810) Refactor registerSinks with multiple constructors
[ https://issues.apache.org/jira/browse/SPARK-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401062#comment-15401062 ] Apache Spark commented on SPARK-16810: -- User 'lovexi' has created a pull request for this issue: https://github.com/apache/spark/pull/14428 > Refactor registerSinks with multiple constructors > > > Key: SPARK-16810 > URL: https://issues.apache.org/jira/browse/SPARK-16810 > Project: Spark > Issue Type: Improvement >Reporter: YangyangLiu > > Some metrics may require detailed application information from > SparkConf, so for those sinks we need to pass SparkConf via the sink > constructor. Refactored the registerSinks class-reflection logic to allow multiple > types of sink constructor to be initialized. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
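The idea in the ticket, instantiating a sink by reflection and choosing between constructors with and without a conf argument, can be sketched as follows. Spark's actual sinks are Scala; the class names and signatures here are hypothetical stand-ins for the shape of the change:

```python
import inspect


class ConsoleSink:
    """Legacy-style sink: (properties, registry) constructor only."""
    def __init__(self, properties, registry):
        self.conf = None


class ConfAwareSink:
    """New-style sink that also receives the application conf."""
    def __init__(self, properties, registry, conf):
        self.conf = conf


def register_sink(sink_cls, properties, registry, conf):
    # Inspect the constructor's arity and prefer the conf-taking variant,
    # falling back to the two-argument legacy constructor.
    n_params = len(inspect.signature(sink_cls).parameters)
    if n_params == 3:
        return sink_cls(properties, registry, conf)
    return sink_cls(properties, registry)


with_conf = register_sink(ConfAwareSink, {}, object(), {"spark.app.id": "demo"})
legacy = register_sink(ConsoleSink, {}, object(), {"spark.app.id": "demo"})
```

The same try-richer-constructor-first pattern works in Scala/Java reflection via `getConstructor` with the candidate parameter lists.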
[jira] [Resolved] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package
[ https://issues.apache.org/jira/browse/SPARK-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16813. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14418 [https://github.com/apache/spark/pull/14418] > Remove private[sql] and private[spark] from catalyst package > > > Key: SPARK-16813 > URL: https://issues.apache.org/jira/browse/SPARK-16813 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.1.0 > > > The catalyst package is meant to be internal, and as a result it does not > make sense to mark things as private[sql] or private[spark]. It simply makes > debugging harder when Spark developers need to inspect the plans at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16820) Sparse - Sparse matrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-16820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401039#comment-15401039 ] Ohad Raviv commented on SPARK-16820: I will create a PR soon with a suggested fix, but tell me what you think about it. > Sparse - Sparse matrix multiplication > - > > Key: SPARK-16820 > URL: https://issues.apache.org/jira/browse/SPARK-16820 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.0.0 >Reporter: Ohad Raviv > > While working on an MCL implementation on Spark we have encountered some > difficulties. > The main part of this process is distributed sparse matrix multiplication > that has two main steps: > 1.Simulate multiply – preparation before the real multiplication in order > to see which blocks should be multiplied. > 2.The actual blocks multiplication and summation. > In our case the sparse matrix has 50M rows and columns, and 2B non-zeros. > The current multiplication suffers from these issues: > 1.A relatively trivial bug, already fixed, in the first step that caused the > process to be very slow [SPARK-16469] > 2.Even after the bug fix, if we have too many blocks the Simulate > multiply will take a very long time and will multiply the data many times. > (O(n^3) where n is the number of blocks) > 3.Spark supports only multiplication with dense matrices. Thus, it > converts a sparse matrix into a dense matrix before the multiplication. > 4.For summing the intermediate block results Spark uses Breeze’s CSC > matrix operations – here the problem is that it is very inefficient to update > a CSC matrix at a zero entry. > That means that with many blocks (default block size is 1024) – in our case > 50M/1024 ~= 50K, the simulate multiply will effectively never finish or will > generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a bigger > block size e.g. 100k we get OutOfMemoryException in the “toDense” method of > the multiply. 
We have worked around that by implementing ourselves both the > sparse multiplication and addition in a very naïve way – but at least it > works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16821) GraphX MCL algorithm
Ohad Raviv created SPARK-16821: -- Summary: GraphX MCL algorithm Key: SPARK-16821 URL: https://issues.apache.org/jira/browse/SPARK-16821 Project: Spark Issue Type: New Feature Components: GraphX Affects Versions: 2.0.0 Reporter: Ohad Raviv We have had the need to use the MCL clustering algorithm in a project we are working on. We have based our implementation on joandre's code: https://github.com/joandre/MCL_spark We had a few scaling problems that we had to work around ourselves and opened a separate JIRA on them. Since we started to work on the algorithm we have been approached a few times by different people who also need this algorithm. Do you think you can add this algorithm to your base code? It looks like there isn't any graph clustering algorithm yet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16820) Sparse - Sparse matrix multiplication
Ohad Raviv created SPARK-16820: -- Summary: Sparse - Sparse matrix multiplication Key: SPARK-16820 URL: https://issues.apache.org/jira/browse/SPARK-16820 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 2.0.0 Reporter: Ohad Raviv While working on an MCL implementation on Spark we have encountered some difficulties. The main part of this process is distributed sparse matrix multiplication that has two main steps: 1. Simulate multiply – preparation before the real multiplication in order to see which blocks should be multiplied. 2. The actual blocks multiplication and summation. In our case the sparse matrix has 50M rows and columns, and 2B non-zeros. The current multiplication suffers from these issues: 1. A relatively trivial bug, already fixed, in the first step that caused the process to be very slow [SPARK-16469] 2. Even after the bug fix, if we have too many blocks the Simulate multiply will take a very long time and will multiply the data many times. (O(n^3) where n is the number of blocks) 3. Spark supports only multiplication with dense matrices. Thus, it converts a sparse matrix into a dense matrix before the multiplication. 4. For summing the intermediate block results Spark uses Breeze’s CSC matrix operations – here the problem is that it is very inefficient to update a CSC matrix at a zero entry. That means that with many blocks (default block size is 1024) – in our case 50M/1024 ~= 50K, the simulate multiply will effectively never finish or will generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a bigger block size e.g. 100k we get OutOfMemoryException in the “toDense” method of the multiply. We have worked around that by implementing ourselves both the sparse multiplication and addition in a very naïve way – but at least it works. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
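The "simulate multiply" step described in the issue can be illustrated in a few lines: given only the coordinates of non-empty blocks of A and B, emit the block pairs that actually need multiplying instead of scanning all O(n^3) combinations. This is a local sketch of the idea, not Spark's BlockMatrix implementation:

```python
from collections import defaultdict


def simulate_multiply(a_blocks, b_blocks):
    """a_blocks, b_blocks: iterables of (blockRow, blockCol) coordinates of
    non-empty blocks. Returns {(i, j): [k, ...]}: for each output block,
    the inner indices k whose products A[i,k] @ B[k,j] must be computed."""
    # Group B's blocks by their row index so A blocks join only on matches.
    b_cols_by_row = defaultdict(set)
    for (k, j) in b_blocks:
        b_cols_by_row[k].add(j)
    pairs = defaultdict(list)
    for (i, k) in a_blocks:
        for j in sorted(b_cols_by_row.get(k, ())):
            pairs[(i, j)].append(k)
    return dict(pairs)


# A has blocks (0,0) and (0,1); B has only block (1,0): the sole required
# product is A[0,1] @ B[1,0], contributing to output block (0,0).
needed = simulate_multiply({(0, 0), (0, 1)}, {(1, 0)})
```

Keyed by block coordinates, the same join runs as a distributed shuffle, so the work scales with the number of non-empty block pairs rather than the cube of the block-grid size.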
[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark
[ https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400993#comment-15400993 ] Reynold Xin commented on SPARK-6305: [~srowen] looked into this in the past and he didn't get everything working. Sean - can you share more? > Add support for log4j 2.x to Spark > -- > > Key: SPARK-6305 > URL: https://issues.apache.org/jira/browse/SPARK-6305 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Tal Sliwowicz >Priority: Minor > > log4j 2 requires replacing the slf4j binding and adding the log4j jars in the > classpath. Since there are shaded jars, it must be done during the build. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16812) Open up SparkILoop.getAddedJars
[ https://issues.apache.org/jira/browse/SPARK-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16812. - Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Open up SparkILoop.getAddedJars > --- > > Key: SPARK-16812 > URL: https://issues.apache.org/jira/browse/SPARK-16812 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.1, 2.1.0 > > > SparkILoop.getAddedJars is a useful method to use so we can programmatically > get the list of jars added. > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org