[jira] [Assigned] (SPARK-16765) Add Pipeline API example for KMeans

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16765:


Assignee: (was: Apache Spark)

> Add Pipeline API example for KMeans
> ---
>
> Key: SPARK-16765
> URL: https://issues.apache.org/jira/browse/SPARK-16765
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Manish Mishra
>Priority: Trivial
>
> A pipeline API example for K Means would be nicer to have in examples package 
> since it is one of the widely used ML algorithms. A pipeline API example of 
> Logistic Regression is already added to the ml example package. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16765) Add Pipeline API example for KMeans

2016-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401559#comment-15401559
 ] 

Apache Spark commented on SPARK-16765:
--

User 'manishatGit' has created a pull request for this issue:
https://github.com/apache/spark/pull/14432

> Add Pipeline API example for KMeans
> ---
>
> Key: SPARK-16765
> URL: https://issues.apache.org/jira/browse/SPARK-16765
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Manish Mishra
>Priority: Trivial
>
> A pipeline API example for K Means would be nicer to have in examples package 
> since it is one of the widely used ML algorithms. A pipeline API example of 
> Logistic Regression is already added to the ml example package. 






[jira] [Commented] (SPARK-16765) Add Pipeline API example for KMeans

2016-07-31 Thread Manish Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401560#comment-15401560
 ] 

Manish Mishra commented on SPARK-16765:
---

You are right, [~bryanc]. We had a use case where we built a pipeline for K-Means, 
using transformers to index categorical features and turn them into vectors 
before running K-Means on them. We were able to build it, and I think it would 
serve as a Pipeline API example for another ML algorithm. If you think the 
example added in the PR is redundant as a Pipeline example, please close this 
issue. Thanks!
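
For illustration, here is a minimal sketch of the kind of pipeline described 
above: a StringIndexer for a categorical column, a VectorAssembler, and KMeans 
as the final stage. The data and column names are made up, and this is not the 
code from the PR:

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object KMeansPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KMeansPipelineSketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy data with one categorical and one numeric column (names are invented).
    val df = Seq(("US", 1.0), ("US", 1.2), ("EU", 5.0), ("EU", 5.5)).toDF("country", "amount")

    // Index the categorical column, assemble a feature vector, then cluster.
    val indexer   = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
    val assembler = new VectorAssembler().setInputCols(Array("countryIndex", "amount")).setOutputCol("features")
    val kmeans    = new KMeans().setK(2).setFeaturesCol("features").setPredictionCol("cluster")

    val pipeline = new Pipeline().setStages(Array(indexer, assembler, kmeans))
    val model    = pipeline.fit(df)
    model.transform(df).select("country", "amount", "cluster").show()

    spark.stop()
  }
}
{code}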

> Add Pipeline API example for KMeans
> ---
>
> Key: SPARK-16765
> URL: https://issues.apache.org/jira/browse/SPARK-16765
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Manish Mishra
>Priority: Trivial
>
> A pipeline API example for K Means would be nicer to have in examples package 
> since it is one of the widely used ML algorithms. A pipeline API example of 
> Logistic Regression is already added to the ml example package. 






[jira] [Assigned] (SPARK-16765) Add Pipeline API example for KMeans

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16765:


Assignee: Apache Spark

> Add Pipeline API example for KMeans
> ---
>
> Key: SPARK-16765
> URL: https://issues.apache.org/jira/browse/SPARK-16765
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Manish Mishra
>Assignee: Apache Spark
>Priority: Trivial
>
> A pipeline API example for K Means would be nicer to have in examples package 
> since it is one of the widely used ML algorithms. A pipeline API example of 
> Logistic Regression is already added to the ml example package. 






[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401523#comment-15401523
 ] 

Sital Kedia commented on SPARK-16827:
-

Actually, it seems like this is a bug in the shuffle write metrics calculation; 
the actual amount of shuffle data written might be the same, because the final 
output is exactly the same.

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Commented] (SPARK-14559) Netty RPC didn't check channel is active before sending message

2016-07-31 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401456#comment-15401456
 ] 

Tao Wang commented on SPARK-14559:
--

Hi, we have the same issue, along with another error when sending messages:

###
2016-07-13 10:06:10,123 | WARN  | [spark-dynamic-executor-allocation] | Error 
sending message [message = RequestExecutors(32963,133120,Map(9-91-8-220 -> 
53119, 9-91-8-219 -> 53306, 9-91-8-208 -> 75446, 9-91
-8-218 -> 87637, 9-96-101-254 -> 76229, 9-91-8-217 -> 53623))] in 2 attempts | 
org.apache.spark.Logging$class.logWarning(Logging.scala:92)
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [300 
seconds]. This timeout is controlled by spark.network.timeout
at 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
at 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at 
scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:77)
at 
org.apache.spark.scheduler.cluster.YarnSchedulerBackend.doRequestTotalExecutors(YarnSchedulerBackend.scala:57)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:426)
at 
org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1476)
at 
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:396)
at 
org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:345)
at 
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:299)
at 
org.apache.spark.ExecutorAllocationManager$$anon$2.run(ExecutorAllocationManager.scala:228)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [300 
seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 16 more
2016-07-13 10:10:44,928 | ERROR | [shuffle-server-5] | Failed to send RPC 
6030874580924985460 to 9-91-8-218/172.18.0.124:46353: 
java.nio.channels.ClosedChannelException | org.apache.spark.network.client.Tra
nsportClient$2.operationComplete(TransportClient.java:174)
java.nio.channels.ClosedChannelException
2016-07-13 10:10:44,929 | ERROR | [shuffle-server-1] | Failed to send RPC 
8715801397093832896 to 9-91-8-208/172.18.0.115:50585: 
java.nio.channels.ClosedChannelException | org.apache.spark.network.client.Tra
nsportClient$2.operationComplete(TransportClient.java:174)
java.nio.channels.ClosedChannelException
2016-07-13 10:10:44,929 | ERROR | [shuffle-server-0] | Failed to send RPC 
7130936020310576573 to 9-91-8-208/172.18.0.115:50597: 
java.nio.channels.ClosedChannelException | org.apache.spark.network.client.Tra
nsportClient$2.operationComplete(TransportClient.java:174)
java.nio.channels.ClosedChannelException
2016-07-13 10:10:44,929 | ERROR | [shuffle-server-3] | Failed to send RPC 
5337112436358516213 to 9-91-8-219/172.18.0.116:56608: 
java.nio.channels.ClosedChannelException | org.apache.spark.network.client.Tra
nsportClient$2.operationComplete(TransportClient.java:174)
###

It lasted for a very long time (from 10:00 a.m. to 8:30 p.m.) and eventually 
failed because too many executors had failed (which was caused by errors sending 
the RetrieveSparkProps message to the driver side).
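
For reference, the timeout named in the log above is the spark.network.timeout 
setting (default 120s). A minimal sketch of raising it when building the 
SparkConf; this only widens the RPC wait and does not fix the 
ClosedChannelException on the shuffle-server channels:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object NetworkTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("NetworkTimeoutSketch")
      .setMaster("local[*]")
      .set("spark.network.timeout", "600s")  // the setting referenced in the log

    val sc = new SparkContext(conf)
    println(sc.getConf.get("spark.network.timeout"))
    sc.stop()
  }
}
{code}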

> Netty RPC didn't check channel is active before sending message
> 

[jira] [Commented] (SPARK-16817) Enable storing of shuffle data in Alluxio

2016-07-31 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401432#comment-15401432
 ] 

Saisai Shao commented on SPARK-16817:
-

What's the difference compared to using a ramdisk to store the shuffle data? If 
you want to store the shuffle data in memory, a ramdisk is the simplest way to 
achieve that.

Also, from my understanding, Alluxio may not be faster than a ramdisk because of 
the unnecessary distributed communication overhead it adds.
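
For illustration, the kind of setup this refers to: point Spark's scratch space 
at a tmpfs (ramdisk) mount via spark.local.dir. The mount point below is 
hypothetical, and on YARN the node manager's local dirs take precedence over 
this setting:

{code}
import org.apache.spark.{SparkConf, SparkContext}

object RamdiskShuffleSketch {
  def main(args: Array[String]): Unit = {
    // Assumes a tmpfs mount already exists, e.g. (size and path invented):
    //   sudo mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk
    val conf = new SparkConf()
      .setAppName("RamdiskShuffleSketch")
      .setMaster("local[*]")
      .set("spark.local.dir", "/mnt/ramdisk/spark")  // shuffle/spill scratch space

    val sc = new SparkContext(conf)
    // Any shuffle now writes its intermediate files to the ramdisk-backed directory.
    println(sc.parallelize(1 to 1000).map(i => (i % 10, i)).reduceByKey(_ + _).count())
    sc.stop()
  }
}
{code}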

> Enable storing of shuffle data in Alluxio
> -
>
> Key: SPARK-16817
> URL: https://issues.apache.org/jira/browse/SPARK-16817
> Project: Spark
>  Issue Type: New Feature
>Reporter: Tim Bisson
>
> If one is using Alluxio for storage, it would also be useful if Spark can 
> store shuffle spill data in Alluxio. For example:
> spark.local.dir="alluxio://host:port/path"
> Several users on the Alluxio mailing list have asked for this feature:
> https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/90pRZWRVi0s/mgLWLS5aAgAJ
> https://groups.google.com/forum/?fromgroups#!searchin/alluxio-users/shuffle$20spark|sort:relevance/alluxio-users/s9H93PnDebw/v_1_FMjR7vEJ






[jira] [Resolved] (SPARK-16805) Log timezone when query result does not match

2016-07-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16805.
--
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14413
[https://github.com/apache/spark/pull/14413]

> Log timezone when query result does not match
> -
>
> Key: SPARK-16805
> URL: https://issues.apache.org/jira/browse/SPARK-16805
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.1, 2.1.0
>
>
> It is useful to log the timezone when query result does not match, especially 
> on build machines that have different timezone from AMPLab Jenkins.






[jira] [Resolved] (SPARK-16731) use StructType in CatalogTable and remove CatalogColumn

2016-07-31 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-16731.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14363
[https://github.com/apache/spark/pull/14363]

> use StructType in CatalogTable and remove CatalogColumn
> ---
>
> Key: SPARK-16731
> URL: https://issues.apache.org/jira/browse/SPARK-16731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.1.0
>
>







[jira] [Commented] (SPARK-16761) Fix doc link in docs/ml-guide.md

2016-07-31 Thread Dapeng Sun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401419#comment-15401419
 ] 

Dapeng Sun commented on SPARK-16761:


Thank you, [~srowen], for your commit.

> Fix doc link in docs/ml-guide.md
> 
>
> Key: SPARK-16761
> URL: https://issues.apache.org/jira/browse/SPARK-16761
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Dapeng Sun
>Assignee: Dapeng Sun
>Priority: Trivial
> Fix For: 2.0.1, 2.1.0
>
>
> Fix the invalid link at http://spark.apache.org/docs/latest/ml-guide.html






[jira] [Commented] (SPARK-16775) Reduce internal warnings from deprecated accumulator API

2016-07-31 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401417#comment-15401417
 ] 

holdenk commented on SPARK-16775:
-

I'm going to get started on this one tomorrow unless someone else has already 
started on it :)

> Reduce internal warnings from deprecated accumulator API
> 
>
> Key: SPARK-16775
> URL: https://issues.apache.org/jira/browse/SPARK-16775
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, Spark Core, SQL
>Reporter: holdenk
>
> Deprecating the old accumulator API added a large number of warnings - many 
> of these could be fixed with a bit of refactoring to offer a non-deprecated 
> internal class while still preserving the external deprecation warnings.
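
For illustration, a sketch of the refactoring pattern described above, with 
invented class names (these are not Spark's actual accumulator classes): the 
deprecated public API becomes a thin wrapper over a non-deprecated internal 
implementation, so internal call sites can use the internal class without 
triggering warnings while external callers still see the deprecation.

{code}
// Non-deprecated internal implementation; internal code calls this directly.
class InternalCounter {
  private var _value: Long = 0L
  def add(delta: Long): Unit = { _value += delta }
  def value: Long = _value
}

// Old public API kept for compatibility; it only delegates, and the
// @deprecated annotation still warns external users.
@deprecated("Use the new accumulator API instead", "2.0.0")
class LegacyCounter {
  private val impl = new InternalCounter
  def add(delta: Long): Unit = impl.add(delta)
  def value: Long = impl.value
}
{code}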






[jira] [Assigned] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16258:


Assignee: (was: Apache Spark)

> Automatically append the grouping keys in SparkR's gapply
> -
>
> Key: SPARK-16258
> URL: https://issues.apache.org/jira/browse/SPARK-16258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Timothy Hunter
>
> While working on the group apply function for python [1], we found it easier 
> to depart from SparkR's gapply function in the following way:
>  - the keys are appended by default to the spark dataframe being returned
>  - the output schema that the users provides is the schema of the R data 
> frame and does not include the keys
> Here are the reasons for doing so:
>  - in most cases, users will want to know the key associated with a result -> 
> appending the key is the sensible default
>  - most functions in the SQL interface and in MLlib append columns, and 
> gapply departs from this philosophy
>  - for the cases when they do not need it, adding the key is a fraction of 
> the computation time and of the output size
>  - from a formal perspective, it makes calling gapply fully transparent to 
> the type of the key: it is easier to build a function with gapply because it 
> does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same 
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1] 
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
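
For illustration, a rough Scala analogy of the convention proposed above (this 
is not SparkR's gapply API; the helper, data, and column names are invented): 
the per-group function only describes the group's result, and the grouping key 
is prepended to every output row for it.

{code}
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}

object GapplyKeySketch {
  // Hypothetical helper: run a per-group function that knows nothing about the
  // key, then prepend the grouping key to its result.
  def applyPerGroup[K: Encoder, V, R](ds: Dataset[V], key: V => K)(f: Iterator[V] => R)(
      implicit enc: Encoder[(K, R)]): Dataset[(K, R)] =
    ds.groupByKey(key).mapGroups((k, rows) => (k, f(rows)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GapplyKeySketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1.0), ("a", 3.0), ("b", 2.0)).toDS()

    // The user function only computes the per-group mean; the key is appended for it.
    val means = applyPerGroup(ds, (r: (String, Double)) => r._1) { rows =>
      val vs = rows.map(_._2).toSeq
      vs.sum / vs.size
    }

    means.toDF("key", "mean").show()
    spark.stop()
  }
}
{code}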






[jira] [Assigned] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16258:


Assignee: Apache Spark

> Automatically append the grouping keys in SparkR's gapply
> -
>
> Key: SPARK-16258
> URL: https://issues.apache.org/jira/browse/SPARK-16258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>
> While working on the group apply function for python [1], we found it easier 
> to depart from SparkR's gapply function in the following way:
>  - the keys are appended by default to the spark dataframe being returned
>  - the output schema that the users provides is the schema of the R data 
> frame and does not include the keys
> Here are the reasons for doing so:
>  - in most cases, users will want to know the key associated with a result -> 
> appending the key is the sensible default
>  - most functions in the SQL interface and in MLlib append columns, and 
> gapply departs from this philosophy
>  - for the cases when they do not need it, adding the key is a fraction of 
> the computation time and of the output size
>  - from a formal perspective, it makes calling gapply fully transparent to 
> the type of the key: it is easier to build a function with gapply because it 
> does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same 
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1] 
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py






[jira] [Commented] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply

2016-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401414#comment-15401414
 ] 

Apache Spark commented on SPARK-16258:
--

User 'NarineK' has created a pull request for this issue:
https://github.com/apache/spark/pull/14431

> Automatically append the grouping keys in SparkR's gapply
> -
>
> Key: SPARK-16258
> URL: https://issues.apache.org/jira/browse/SPARK-16258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Timothy Hunter
>
> While working on the group apply function for python [1], we found it easier 
> to depart from SparkR's gapply function in the following way:
>  - the keys are appended by default to the spark dataframe being returned
>  - the output schema that the users provides is the schema of the R data 
> frame and does not include the keys
> Here are the reasons for doing so:
>  - in most cases, users will want to know the key associated with a result -> 
> appending the key is the sensible default
>  - most functions in the SQL interface and in MLlib append columns, and 
> gapply departs from this philosophy
>  - for the cases when they do not need it, adding the key is a fraction of 
> the computation time and of the output size
>  - from a formal perspective, it makes calling gapply fully transparent to 
> the type of the key: it is easier to build a function with gapply because it 
> does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same 
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1] 
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py






[jira] [Comment Edited] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401413#comment-15401413
 ] 

Sital Kedia edited comment on SPARK-16827 at 8/1/16 12:52 AM:
--

That is not the case. There is no broadcast join involved; it's using a shuffle 
join in both 1.6 and 2.0.

Plan in 1.6 -
{code}
Project [target_id#136L AS userid#115L]
+- SortMergeJoin [source_id#135L], [id#140L]
   :- Sort [source_id#135L ASC], false, 0
   :  +- TungstenExchange hashpartitioning(source_id#135L,800), None
   : +- ConvertToUnsafe
   :+- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation 
foo, table1, Some(a), [(ds#134 = 2016-07-15)]
   +- Sort [id#140L ASC], false, 0
  +- TungstenExchange hashpartitioning(id#140L,800), None
 +- ConvertToUnsafe
+- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), 
[(ds#139 = 2016-07-15)]

{code}

Plan in 2.0  - 

{code}
*Project [target_id#151L AS userid#111L]
+- *SortMergeJoin [source_id#150L], [id#155L], Inner
   :- *Sort [source_id#150L ASC], false, 0
   :  +- Exchange hashpartitioning(source_id#150L, 800)
   : +- *Filter isnotnull(source_id#150L)
   :+- HiveTableScan [source_id#150L, target_id#151L], 
MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)]
   +- *Sort [id#155L ASC], false, 0
  +- Exchange hashpartitioning(id#155L, 800)
 +- *Filter isnotnull(id#155L)
+- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, 
[isnotnull(ds#154), (ds#154 = 2016-07-15)]
{code}



was (Author: sitalke...@gmail.com):
That is not the case. There is no broadcast join involved, its using shuffle 
join in both 1.6 and 2.0.

Plan in 1.6 -

Project [target_id#136L AS userid#115L]
+- SortMergeJoin [source_id#135L], [id#140L]
   :- Sort [source_id#135L ASC], false, 0
   :  +- TungstenExchange hashpartitioning(source_id#135L,800), None
   : +- ConvertToUnsafe
   :+- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation 
foo, table1, Some(a), [(ds#134 = 2016-07-15)]
   +- Sort [id#140L ASC], false, 0
  +- TungstenExchange hashpartitioning(id#140L,800), None
 +- ConvertToUnsafe
+- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), 
[(ds#139 = 2016-07-15)]


Plan in 2.0  - 

*Project [target_id#151L AS userid#111L]
+- *SortMergeJoin [source_id#150L], [id#155L], Inner
   :- *Sort [source_id#150L ASC], false, 0
   :  +- Exchange hashpartitioning(source_id#150L, 800)
   : +- *Filter isnotnull(source_id#150L)
   :+- HiveTableScan [source_id#150L, target_id#151L], 
MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)]
   +- *Sort [id#155L ASC], false, 0
  +- Exchange hashpartitioning(id#155L, 800)
 +- *Filter isnotnull(id#155L)
+- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, 
[isnotnull(ds#154), (ds#154 = 2016-07-15)]

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401413#comment-15401413
 ] 

Sital Kedia commented on SPARK-16827:
-

That is not the case. There is no broadcast join involved; it's using a shuffle 
join in both 1.6 and 2.0.

Plan in 1.6 -

Project [target_id#136L AS userid#115L]
+- SortMergeJoin [source_id#135L], [id#140L]
   :- Sort [source_id#135L ASC], false, 0
   :  +- TungstenExchange hashpartitioning(source_id#135L,800), None
   : +- ConvertToUnsafe
   :+- HiveTableScan [target_id#136L,source_id#135L], MetastoreRelation 
foo, table1, Some(a), [(ds#134 = 2016-07-15)]
   +- Sort [id#140L ASC], false, 0
  +- TungstenExchange hashpartitioning(id#140L,800), None
 +- ConvertToUnsafe
+- HiveTableScan [id#140L], MetastoreRelation foo, table2, Some(b), 
[(ds#139 = 2016-07-15)]


Plan in 2.0  - 

*Project [target_id#151L AS userid#111L]
+- *SortMergeJoin [source_id#150L], [id#155L], Inner
   :- *Sort [source_id#150L ASC], false, 0
   :  +- Exchange hashpartitioning(source_id#150L, 800)
   : +- *Filter isnotnull(source_id#150L)
   :+- HiveTableScan [source_id#150L, target_id#151L], 
MetastoreRelation foo, table1, a, [isnotnull(ds#149), (ds#149 = 2016-07-15)]
   +- *Sort [id#155L ASC], false, 0
  +- Exchange hashpartitioning(id#155L, 800)
 +- *Filter isnotnull(id#155L)
+- HiveTableScan [id#155L], MetastoreRelation foo, table2, b, 
[isnotnull(ds#154), (ds#154 = 2016-07-15)]

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Updated] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-16827:

Description: 
One of our Hive jobs, which looks like this -

{code}
 SELECT userid
 FROM   table1 a
 JOIN   table2 b
   ON   a.ds = '2016-07-15'
  AND   b.ds = '2016-07-15'
  AND   a.source_id = b.id
{code}

After upgrading to Spark 2.0, the job is significantly slower. Digging a little 
into it, we found out that one of the stages produces an excessive amount of 
shuffle data. Please note that this is a regression from Spark 1.6: Stage 2 of 
the job, which used to produce 32KB of shuffle data with 1.6, now produces more 
than 400GB with Spark 2.0. We also tried turning off whole-stage code 
generation, but that did not help.

PS - Even if the intermediate shuffle data size is huge, the job still produces 
accurate output.

  was:
One of our hive job which looks like this -

{code}
 SELECT  userid
 FROM  table1 a
 JOIN table2 b
  ONa.ds = '2016-07-15'
  AND  b.ds = '2016-07-15'
  AND  a.source_id = b.id
{code}

After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
into it, we found out that one of the stages produces excessive amount of 
shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 of 
the job which used to produce 32KB shuffle data with 1.6, now produces more 
than 400GB with Spark 2.0. We also tried turning off whole stage code 
generation but that did not help.




> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help. 
> PS - Even if the intermediate shuffle data size is huge, the job still 
> produces accurate output.






[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-07-31 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401407#comment-15401407
 ] 

Reynold Xin commented on SPARK-16827:
-

Did the plan in Spark 1.6 use a broadcast join, and in 2.0 use a shuffle join?
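
For reference, the strategy the planner picked can be read off the physical 
plan. A minimal, self-contained sketch on toy data (not the Hive tables from 
the report):

{code}
import org.apache.spark.sql.SparkSession

object JoinPlanCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JoinPlanCheck").master("local[*]").getOrCreate()
    import spark.implicits._

    val a = Seq((1L, 10L), (2L, 20L)).toDF("source_id", "target_id")
    val b = Seq(1L, 2L).toDF("id")

    // The printed physical plan names the join strategy that was chosen,
    // e.g. BroadcastHashJoin vs SortMergeJoin.
    a.join(b, a("source_id") === b("id")).explain()

    spark.stop()
  }
}
{code}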


> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help.






[jira] [Commented] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401405#comment-15401405
 ] 

Sital Kedia commented on SPARK-16827:
-

[~rxin] - Any idea how to debug this issue?

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help.






[jira] [Updated] (SPARK-16827) Query with Join produces excessive shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-16827:

Description: 
One of our hive job which looks like this -

{code}
 SELECT  userid
 FROM  table1 a
 JOIN table2 b
  ONa.ds = '2016-07-15'
  AND  b.ds = '2016-07-15'
  AND  a.source_id = b.id
{code}

After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
into it, we found out that one of the stages produces excessive amount of 
shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 of 
the job which used to produce 32KB shuffle data with 1.6, now produces more 
than 400GB with Spark 2.0. We also tried turning off whole stage code 
generation but that did not help.



  was:
One of our hive job which looks like this -

{code]
 SELECT  userid
 FROM  table1 a
 JOIN table2 b
  ONa.ds = '2016-07-15'
  AND  b.ds = '2016-07-15'
  AND  a.source_id = b.id
{code}

After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
into it, we found out that one of the stages produces excessive amount of 
shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 of 
the job which used to produce 32KB shuffle data with 1.6, now produces more 
than 400GB with Spark 2.0. We also tried turning off whole stage code 
generation but that did not help.




> Query with Join produces excessive shuffle data
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help.






[jira] [Updated] (SPARK-16827) Query with Join produces excessive amount of shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-16827:

Summary: Query with Join produces excessive amount of shuffle data  (was: 
Query with Join produces excessive shuffle data)

> Query with Join produces excessive amount of shuffle data
> -
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code}
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help.






[jira] [Updated] (SPARK-16827) Query with Join produces excessive shuffle data

2016-07-31 Thread Sital Kedia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sital Kedia updated SPARK-16827:

Description: 
One of our hive job which looks like this -

{code]
 SELECT  userid
 FROM  table1 a
 JOIN table2 b
  ONa.ds = '2016-07-15'
  AND  b.ds = '2016-07-15'
  AND  a.source_id = b.id
{code}

After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
into it, we found out that one of the stages produces excessive amount of 
shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 of 
the job which used to produce 32KB shuffle data with 1.6, now produces more 
than 400GB with Spark 2.0. We also tried turning off whole stage code 
generation but that did not help.



  was:
One of our hive job which looks like this -

 SELECT  userid
 FROM  table1 a
 JOIN table2 b
  ONa.ds = '2016-07-15'
  AND  b.ds = '2016-07-15'
  AND  a.source_id = b.id

After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
into it, we found out that one of the stages produces excessive amount of 
shuffle data.  Please note that this is a regression from Spark 1.6


> Query with Join produces excessive shuffle data
> ---
>
> Key: SPARK-16827
> URL: https://issues.apache.org/jira/browse/SPARK-16827
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>  Labels: performance
>
> One of our hive job which looks like this -
> {code]
>  SELECT  userid
>  FROM  table1 a
>  JOIN table2 b
>   ONa.ds = '2016-07-15'
>   AND  b.ds = '2016-07-15'
>   AND  a.source_id = b.id
> {code}
> After upgrade to Spark 2.0 the job is significantly slow.  Digging a little 
> into it, we found out that one of the stages produces excessive amount of 
> shuffle data.  Please note that this is a regression from Spark 1.6. Stage 2 
> of the job which used to produce 32KB shuffle data with 1.6, now produces 
> more than 400GB with Spark 2.0. We also tried turning off whole stage code 
> generation but that did not help.






[jira] [Created] (SPARK-16827) Query with Join produces excessive shuffle data

2016-07-31 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-16827:
---

 Summary: Query with Join produces excessive shuffle data
 Key: SPARK-16827
 URL: https://issues.apache.org/jira/browse/SPARK-16827
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 2.0.0
Reporter: Sital Kedia


One of our Hive jobs, which looks like this -

 SELECT userid
 FROM   table1 a
 JOIN   table2 b
   ON   a.ds = '2016-07-15'
  AND   b.ds = '2016-07-15'
  AND   a.source_id = b.id

After upgrading to Spark 2.0, the job is significantly slower. Digging a little 
into it, we found out that one of the stages produces an excessive amount of 
shuffle data. Please note that this is a regression from Spark 1.6.






[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-07-31 Thread Sylvain Zimmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Zimmer updated SPARK-16826:
---
Description: 
Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the executor threads are 
"BLOCKED", in that state:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 
cores, throughput is almost 10x higher and the CPUs are back at ~100% use.

Thanks!

  was:
Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 cores are in 
status "BLOCKED", in that state:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.(URL.java:599)
java.net.URL.(URL.java:490)
java.net.URL.(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-07-31 Thread Sylvain Zimmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Zimmer updated SPARK-16826:
---
Description: 
Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 cores are in 
status "BLOCKED", in that state:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 
cores, throughput is almost 10x higher and the CPUs are back at ~100% use.

Thanks!

  was:
Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 cores are in 
status "BLOCKED", in that stage:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.(URL.java:599)
java.net.URL.(URL.java:490)
java.net.URL.(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

[jira] [Updated] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-07-31 Thread Sylvain Zimmer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sylvain Zimmer updated SPARK-16826:
---
Description: 
Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 cores are in 
status "BLOCKED", in that stage:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 
cores, throughput is almost 10x higher and the CPUs are back at ~100% use.

Thanks!

  was:
Hello!

I'm using {c4.8xlarge} instances on EC2 with 36 cores and doing lots of 
{parse_url(url, "host")} in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there, 
and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 cores are in 
status "BLOCKED", in that stage:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.(URL.java:599)
java.net.URL.(URL.java:490)
java.net.URL.(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown
 Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

[jira] [Created] (SPARK-16826) java.util.Hashtable limits the throughput of PARSE_URL()

2016-07-31 Thread Sylvain Zimmer (JIRA)
Sylvain Zimmer created SPARK-16826:
--

 Summary: java.util.Hashtable limits the throughput of PARSE_URL()
 Key: SPARK-16826
 URL: https://issues.apache.org/jira/browse/SPARK-16826
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Sylvain Zimmer


Hello!

I'm using {{c4.8xlarge}} instances on EC2 with 36 cores and doing lots of 
{{parse_url(url, "host")}} calls in Spark SQL.

Unfortunately it seems that there is an internal thread-safe cache in there (a 
java.util.Hashtable inside java.net.URL), and the instances end up being 90% idle.

When I view the thread dump for my executors, most of the 36 cores are in 
status "BLOCKED" with this stack:
{code}
java.util.Hashtable.get(Hashtable.java:362)
java.net.URL.getURLStreamHandler(URL.java:1135)
java.net.URL.<init>(URL.java:599)
java.net.URL.<init>(URL.java:490)
java.net.URL.<init>(URL.java:439)
org.apache.spark.sql.catalyst.expressions.ParseUrl.getUrl(stringExpressions.scala:731)
org.apache.spark.sql.catalyst.expressions.ParseUrl.parseUrlWithoutKey(stringExpressions.scala:772)
org.apache.spark.sql.catalyst.expressions.ParseUrl.eval(stringExpressions.scala:785)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$$anonfun$create$2.apply(GeneratePredicate.scala:69)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:203)
org.apache.spark.sql.execution.FilterExec$$anonfun$17$$anonfun$apply$2.apply(basicPhysicalOperators.scala:202)
scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:147)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
org.apache.spark.scheduler.Task.run(Task.scala:85)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
{code}

However, when I switch from 1 executor with 36 cores to 9 executors with 4 
cores, throughput is almost 10x higher and the CPUs are back at ~100% use.

Thanks!
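
For readers who want to try this locally, here is a minimal Scala sketch of the kind 
of query involved and of the executor-splitting workaround described above; the app 
name, view name and sample URLs are illustrative assumptions, not taken from the report:

{code}
import org.apache.spark.sql.SparkSession

// Workaround from the report: several small executors (separate JVMs) instead of one
// 36-core executor, e.g. on YARN submit with --num-executors 9 --executor-cores 4.
// The stream-handler cache in java.net.URL is a JVM-wide java.util.Hashtable, so the
// lock is only contended within a single executor process.
val spark = SparkSession.builder()
  .appName("parse-url-contention-sketch")   // hypothetical app name
  .getOrCreate()

import spark.implicits._

// Tiny illustrative input; the real workload scans a large table of URLs.
Seq(
  "https://spark.apache.org/docs/latest/",
  "https://issues.apache.org/jira/browse/SPARK-16826"
).toDF("url").createOrReplaceTempView("urls")

// Each evaluation of parse_url() constructs a java.net.URL, which consults the
// synchronized handler table and serializes the executor's task threads.
spark.sql("SELECT parse_url(url, 'HOST') AS host FROM urls").show()
{code}

With only two rows this obviously shows the query shape, not the contention; the 
blocking appears once many task threads in one JVM evaluate the expression concurrently.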



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat

2016-07-31 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-16825:

Description: 
Currently, we are using `hive.default.fileformat` to specify the default file 
format in the CREATE TABLE statement. Multiple issues exist:
- This parameter value is not read from `hive-site.xml`. Thus, even if users change 
`hive.default.fileformat` in `hive-site.xml`, Spark will ignore it. To change the 
parameter value, users have to go through a Spark interface (e.g., a SET command 
or the API). 
- This parameter is not documented. 
- This parameter value is not sent to the Hive metastore; it is only used by Spark 
internals when processing the CREATE TABLE statement. 
- This parameter is case sensitive. 

Since this is used by Spark only, it does not make sense to keep a parameter whose 
name starts with `hive`. We could follow the other Hive-related parameters and 
introduce a new, public Spark parameter here. Thus, how about replacing 
`hive.default.fileformat` with `spark.sql.default.fileformat`? We should also make 
its value case insensitive.
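
As a rough illustration of the behaviour described above (a sketch only; 
`spark.sql.default.fileformat` is the key proposed by this ticket, not an existing 
configuration, and the table name is hypothetical):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("default-fileformat-sketch")   // hypothetical app name
  .enableHiveSupport()
  .getOrCreate()

// Today the default format for CREATE TABLE comes from the Hive-prefixed key.
// Per this ticket, it is only read by Spark's SQL layer (not forwarded to the
// metastore), is undocumented, and its value is matched case-sensitively.
spark.sql("SET hive.default.fileformat=Orc")
spark.sql("CREATE TABLE t_sketch (id INT)")   // picks up the default file format

// Proposed replacement, with a case-insensitive value:
// spark.sql("SET spark.sql.default.fileformat=orc")
{code}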


  was:
Currently, `hive.default.fileformat` is used if users do not specify the format in 
the CREATE TABLE SQL statement. Multiple issues exist:
- This parameter value is not read from `hive-site.xml`. Thus, even if users change 
`hive.default.fileformat` in `hive-site.xml`, Spark will ignore it. To change the 
parameter value, users have to go through a Spark interface (e.g., a SET command 
or the API). 
- This parameter is not documented. 
- This parameter value is not sent to the Hive metastore; it is only used by Spark 
internals when processing the CREATE TABLE statement. 
- This parameter is case sensitive. 

Since this is used by Spark only, it does not make sense to keep a parameter whose 
name starts with `hive`. We could follow the other Hive-related parameters and 
introduce a new, public Spark parameter here. Thus, how about replacing 
`hive.default.fileformat` with `spark.sql.default.fileformat`? We should also make 
its value case insensitive.



> Replace hive.default.fileformat by spark.sql.default.fileformat
> ---
>
> Key: SPARK-16825
> URL: https://issues.apache.org/jira/browse/SPARK-16825
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we are using hive.default.fileformat for specifying the default 
> file format in CREATE TABLE statement. Multiple issues exist:
> - This parameter value is not from `hive-site.xml`. Thus, even if users 
> change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. 
> To change the parameter values, users have to use Spark interface, (e.g., by 
> a SET command or API). 
> - This parameter is not documented. 
> - This parameter value will not be sent to Hive metastore. It is being used 
> by Spark internals when processing CREATE TABLE statement. 
> - This parameter is case sensitive. 
> Since this is being used by Spark only, it does not make sense to use a 
> parameter starting from `hive`. we might follow the other Hive-related 
> parameters and introduce a new Spark parameter here. It should be public. 
> Thus, how about replacing `hive.default.fileformat` by 
> `spark.sql.default.fileformat`. we also should make it case insensitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16825:


Assignee: Apache Spark

> Replace hive.default.fileformat by spark.sql.default.fileformat
> ---
>
> Key: SPARK-16825
> URL: https://issues.apache.org/jira/browse/SPARK-16825
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Currently, we are using `hive.default.fileformat`, if users do not specify 
> the format in the CREATE TABLE SQL statement. Multiple issues exist:
> - This parameter value is not from `hive-site.xml`. Thus, even if users 
> change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. 
> To change the parameter values, users have to use Spark interface, (e.g., by 
> a SET command or API). 
> - This parameter is not documented. 
> - This parameter value will not be sent to Hive metastore. It is being used 
> by Spark internals when processing CREATE TABLE statement. 
> - This parameter is case sensitive. 
> Since this is being used by Spark only, it does not make sense to use a 
> parameter starting from `hive`. we might follow the other Hive-related 
> parameters and introduce a new Spark parameter here. It should be public. 
> Thus, how about replacing `hive.default.fileformat` by 
> `spark.sql.default.fileformat`. we also should make it case insensitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat

2016-07-31 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16825:
---

 Summary: Replace hive.default.fileformat by 
spark.sql.default.fileformat
 Key: SPARK-16825
 URL: https://issues.apache.org/jira/browse/SPARK-16825
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


Currently, `hive.default.fileformat` is used if users do not specify the format in 
the CREATE TABLE SQL statement. Multiple issues exist:
- This parameter value is not read from `hive-site.xml`. Thus, even if users change 
`hive.default.fileformat` in `hive-site.xml`, Spark will ignore it. To change the 
parameter value, users have to go through a Spark interface (e.g., a SET command 
or the API). 
- This parameter is not documented. 
- This parameter value is not sent to the Hive metastore; it is only used by Spark 
internals when processing the CREATE TABLE statement. 
- This parameter is case sensitive. 

Since this is used by Spark only, it does not make sense to keep a parameter whose 
name starts with `hive`. We could follow the other Hive-related parameters and 
introduce a new, public Spark parameter here. Thus, how about replacing 
`hive.default.fileformat` with `spark.sql.default.fileformat`? We should also make 
its value case insensitive.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat

2016-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401350#comment-15401350
 ] 

Apache Spark commented on SPARK-16825:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14430

> Replace hive.default.fileformat by spark.sql.default.fileformat
> ---
>
> Key: SPARK-16825
> URL: https://issues.apache.org/jira/browse/SPARK-16825
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we are using `hive.default.fileformat`, if users do not specify 
> the format in the CREATE TABLE SQL statement. Multiple issues exist:
> - This parameter value is not from `hive-site.xml`. Thus, even if users 
> change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. 
> To change the parameter values, users have to use Spark interface, (e.g., by 
> a SET command or API). 
> - This parameter is not documented. 
> - This parameter value will not be sent to Hive metastore. It is being used 
> by Spark internals when processing CREATE TABLE statement. 
> - This parameter is case sensitive. 
> Since this is being used by Spark only, it does not make sense to use a 
> parameter starting from `hive`. we might follow the other Hive-related 
> parameters and introduce a new Spark parameter here. It should be public. 
> Thus, how about replacing `hive.default.fileformat` by 
> `spark.sql.default.fileformat`. we also should make it case insensitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16825) Replace hive.default.fileformat by spark.sql.default.fileformat

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16825:


Assignee: (was: Apache Spark)

> Replace hive.default.fileformat by spark.sql.default.fileformat
> ---
>
> Key: SPARK-16825
> URL: https://issues.apache.org/jira/browse/SPARK-16825
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> Currently, we are using `hive.default.fileformat`, if users do not specify 
> the format in the CREATE TABLE SQL statement. Multiple issues exist:
> - This parameter value is not from `hive-site.xml`. Thus, even if users 
> change the hive.default.fileformat in `hive-site.xml`, Spark will ignore it. 
> To change the parameter values, users have to use Spark interface, (e.g., by 
> a SET command or API). 
> - This parameter is not documented. 
> - This parameter value will not be sent to Hive metastore. It is being used 
> by Spark internals when processing CREATE TABLE statement. 
> - This parameter is case sensitive. 
> Since this is being used by Spark only, it does not make sense to use a 
> parameter starting from `hive`. we might follow the other Hive-related 
> parameters and introduce a new Spark parameter here. It should be public. 
> Thus, how about replacing `hive.default.fileformat` by 
> `spark.sql.default.fileformat`. we also should make it case insensitive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16811) spark streaming-Exception in thread “submit-job-thread-pool-40”

2016-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16811.
---
Resolution: Invalid

Read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
first please

> spark streaming-Exception in thread “submit-job-thread-pool-40”
> ---
>
> Key: SPARK-16811
> URL: https://issues.apache.org/jira/browse/SPARK-16811
> Project: Spark
>  Issue Type: Question
>  Components: Streaming
>Affects Versions: 1.6.1
> Environment: spark 1.6.1
>Reporter: BinXu
>Priority: Blocker
>  Labels: maven
>   Original Estimate: 1,176h
>  Remaining Estimate: 1,176h
>
> {color:red}I am using spark 1.6. When I run streaming application for a few 
> minutes,it cause such Exception Like this:{color}
>  Exception in thread "submit-job-thread-pool-40" Exception in thread 
> "submit-job-thread-pool-33" Exception in thread "submit-job-thread-pool-23" 
> Exception in thread "submit-job-thread-pool-14" Exception in thread 
> "submit-job-thread-pool-29" Exception in thread "submit-job-thread-pool-39" 
> Exception in thread "submit-job-thread-pool-2" java.lang.Error: 
> java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at   
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at 

[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession

2016-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-16816:
--
Target Version/s:   (was: 2.0.0)
  Labels:   (was: documentation patch)
Priority: Trivial  (was: Minor)
   Fix Version/s: (was: 2.0.0)

Please see 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first

> Add documentation to create JavaSparkContext from SparkSession
> --
>
> Key: SPARK-16816
> URL: https://issues.apache.org/jira/browse/SPARK-16816
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: sandeep purohit
>Priority: Trivial
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> In this issue the user can know how to create the JavaSparkContext with spark 
> session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package

2016-07-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16813:

Fix Version/s: 2.0.1

> Remove private[sql] and private[spark] from catalyst package
> 
>
> Key: SPARK-16813
> URL: https://issues.apache.org/jira/browse/SPARK-16813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.1, 2.1.0
>
>
> The catalyst package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-07-31 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401208#comment-15401208
 ] 

Cody Koeninger commented on SPARK-16534:


It's on the PR. Yes, one committer veto is generally sufficient.



> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession

2016-07-31 Thread sandeep purohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep purohit updated SPARK-16816:

 Labels: documentation patch  (was: patch)
Description: 
This issue adds documentation showing how to create a JavaSparkContext from a 
SparkSession.
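
For reference, a minimal Scala sketch of what such an example could show (the 
session setup is illustrative, and the same one-line wrap works from Java):

{code}
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("java-spark-context-sketch")   // hypothetical app name
  .getOrCreate()

// A SparkSession exposes its underlying SparkContext; wrap it for the Java API view.
val jsc = new JavaSparkContext(spark.sparkContext)
// or equivalently:
val jsc2 = JavaSparkContext.fromSparkContext(spark.sparkContext)
{code}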


  was:
In this improvement the user can directly get the JavaSparkContext from the 
SparkSession.


Component/s: (was: SQL)
 Documentation
 Issue Type: Documentation  (was: Improvement)

> Add documentation to create JavaSparkContext from SparkSession
> --
>
> Key: SPARK-16816
> URL: https://issues.apache.org/jira/browse/SPARK-16816
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.0.0
>Reporter: sandeep purohit
>Priority: Minor
>  Labels: documentation, patch
> Fix For: 2.0.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> In this issue the user can know how to create the JavaSparkContext with spark 
> session.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16816) Add documentation to create JavaSparkContext from SparkSession

2016-07-31 Thread sandeep purohit (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sandeep purohit updated SPARK-16816:

Summary: Add documentation to create JavaSparkContext from SparkSession  
(was: Add api to get JavaSparkContext from SparkSession)

> Add documentation to create JavaSparkContext from SparkSession
> --
>
> Key: SPARK-16816
> URL: https://issues.apache.org/jira/browse/SPARK-16816
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: sandeep purohit
>Priority: Minor
>  Labels: patch
> Fix For: 2.0.0
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> In this improvement the user can directly get the JavaSparkContext from the 
> SparkSession.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401198#comment-15401198
 ] 

Nicholas Chammas commented on SPARK-12157:
--

OK. I've raised the issue of documenting this in PySpark here: 
https://issues.apache.org/jira/browse/SPARK-16824

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16824) Add API docs for VectorUDT

2016-07-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401197#comment-15401197
 ] 

Nicholas Chammas commented on SPARK-16824:
--

cc [~josephkb] [~mengxr] - Should this type be publicly documented in PySpark?

> Add API docs for VectorUDT
> --
>
> Key: SPARK-16824
> URL: https://issues.apache.org/jira/browse/SPARK-16824
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following on the [discussion 
> here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
>  it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
> I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16824) Add API docs for VectorUDT

2016-07-31 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-16824:


 Summary: Add API docs for VectorUDT
 Key: SPARK-16824
 URL: https://issues.apache.org/jira/browse/SPARK-16824
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 2.0.0
Reporter: Nicholas Chammas
Priority: Minor


Following on the [discussion 
here|https://issues.apache.org/jira/browse/SPARK-12157?focusedCommentId=15401153=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15401153],
 it appears that {{VectorUDT}} is missing documentation, at least in PySpark. 
I'm not sure if this is intentional or not.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16534) Kafka 0.10 Python support

2016-07-31 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401192#comment-15401192
 ] 

Jacek Laskowski edited comment on SPARK-16534 at 7/31/16 3:22 PM:
--

Mind posting the reason for the -1 from [~rxin], ad acta (at the very least)? A 
single -1 should not be the only reason to ditch a proposal, should it?


was (Author: jlaskowski):
Ming posting the -1 from [~rxin] ad acta (at the very least). A single -1 
should not be the only reason to ditch a proposal, should it?

> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-07-31 Thread Jacek Laskowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401192#comment-15401192
 ] 

Jacek Laskowski commented on SPARK-16534:
-

Ming posting the -1 from [~rxin] ad acta (at the very least). A single -1 
should not be the only reason to ditch a proposal, should it?

> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-31 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401190#comment-15401190
 ] 

Maciej Szymkiewicz commented on SPARK-12157:


Well, it is an alpha component (see the Scala API docs 
https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.mllib.linalg.VectorUDT).
 

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16823) One dimensional typed select on DataFrame does not work as expected

2016-07-31 Thread TobiasP (JIRA)
TobiasP created SPARK-16823:
---

 Summary: One dimensional typed select on DataFrame does not work 
as expected
 Key: SPARK-16823
 URL: https://issues.apache.org/jira/browse/SPARK-16823
 Project: Spark
  Issue Type: Bug
Reporter: TobiasP
Priority: Minor


{code}
spark.sqlContext.createDataFrame(for { x <- 1 to 4; y <- 1 to 4 } yield ((x, 
y),x)).select($"_1".as[(Int,Int)], $"_2".as[Int])
{code}
works, so I would expect
{code}
spark.sqlContext.createDataFrame(for { x <- 1 to 4; y <- 1 to 4 } yield ((x, 
y),x)).select($"_1".as[(Int,Int)])
{code}
to work too, but it expects a wrapping type around the tuple.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-31 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401179#comment-15401179
 ] 

Nicholas Chammas commented on SPARK-12157:
--

Thanks for the pointer, Maciej. It appears that {{VectorUDT}} is 
[undocumented|http://spark.apache.org/docs/latest/api/python/search.html?q=VectorUDT_keywords=yes=default].
 Do you know if that is intentional?

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16811) spark streaming-Exception in thread “submit-job-thread-pool-40”

2016-07-31 Thread BinXu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

BinXu updated SPARK-16811:
--
Remaining Estimate: 1,176h  (was: 4h)
 Original Estimate: 1,176h  (was: 4h)

> spark streaming-Exception in thread “submit-job-thread-pool-40”
> ---
>
> Key: SPARK-16811
> URL: https://issues.apache.org/jira/browse/SPARK-16811
> Project: Spark
>  Issue Type: Question
>  Components: Streaming
>Affects Versions: 1.6.1
> Environment: spark 1.6.1
>Reporter: BinXu
>Priority: Blocker
>  Labels: maven
>   Original Estimate: 1,176h
>  Remaining Estimate: 1,176h
>
> {color:red}I am using spark 1.6. When I run streaming application for a few 
> minutes,it cause such Exception Like this:{color}
>  Exception in thread "submit-job-thread-pool-40" Exception in thread 
> "submit-job-thread-pool-33" Exception in thread "submit-job-thread-pool-23" 
> Exception in thread "submit-job-thread-pool-14" Exception in thread 
> "submit-job-thread-pool-29" Exception in thread "submit-job-thread-pool-39" 
> Exception in thread "submit-job-thread-pool-2" java.lang.Error: 
> java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at   
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
> at 
> org.apache.spark.SimpleFutureAction.org$apache$spark$SimpleFutureAction$$awaitResult(FutureAction.scala:165)
> at org.apache.spark.SimpleFutureAction$$anon$1.run(FutureAction.scala:147)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ... 2 more
> java.lang.Error: java.lang.InterruptedException
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.InterruptedException
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 

[jira] [Commented] (SPARK-14155) Hide UserDefinedType in Spark 2.0

2016-07-31 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401156#comment-15401156
 ] 

Maciej Szymkiewicz commented on SPARK-14155:


[~rxin] Is there any progress on that, or an open JIRA?

> Hide UserDefinedType in Spark 2.0
> -
>
> Key: SPARK-14155
> URL: https://issues.apache.org/jira/browse/SPARK-14155
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> UserDefinedType is a developer API in Spark 1.x.
> With very high probability we will create a new API for user-defined type 
> that also works well with column batches as well as encoders (datasets). In 
> Spark 2.0, let's make UserDefinedType private[spark] first.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12157) Support numpy types as return values of Python UDFs

2016-07-31 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401153#comment-15401153
 ] 

Maciej Szymkiewicz commented on SPARK-12157:


[~nchammas] You're using an incorrect schema. ArrayType(FloatType()) is not a 
vector; it should be VectorUDT (see http://stackoverflow.com/q/38249291).

> Support numpy types as return values of Python UDFs
> ---
>
> Key: SPARK-12157
> URL: https://issues.apache.org/jira/browse/SPARK-12157
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.2
>Reporter: Justin Uang
>
> Currently, if I have a python UDF
> {code}
> import pyspark.sql.types as T
> import pyspark.sql.functions as F
> from pyspark.sql import Row
> import numpy as np
> argmax = F.udf(lambda x: np.argmax(x), T.IntegerType())
> df = sqlContext.createDataFrame([Row(array=[1,2,3])])
> df.select(argmax("array")).count()
> {code}
> I get an exception that is fairly opaque:
> {code}
> Caused by: net.razorvine.pickle.PickleException: expected zero arguments for 
> construction of ClassDict (for numpy.dtype)
> at 
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:85)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:404)
> at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1$$anonfun$apply$3.apply(python.scala:403)
> {code}
> Numpy types like np.int and np.float64 should automatically be cast to the 
> proper dtypes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16770:


Assignee: (was: Apache Spark)

> Spark shell not usable with german keyboard due to JLine version
> 
>
> Key: SPARK-16770
> URL: https://issues.apache.org/jira/browse/SPARK-16770
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Stefan Schulze
>Priority: Minor
>
> It is impossible to enter a right square bracket with a single keystroke 
> using a german keyboard layout. The problem is known from former Scala 
> version, responsible is jline-2.12.jar (see 
> https://issues.scala-lang.org/browse/SI-8759).
> Workaround: Replace jline-2.12.jar by jline.2.12.1.jar in the jars folder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version

2016-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401151#comment-15401151
 ] 

Apache Spark commented on SPARK-16770:
--

User 'stsc-pentasys' has created a pull request for this issue:
https://github.com/apache/spark/pull/14429

> Spark shell not usable with german keyboard due to JLine version
> 
>
> Key: SPARK-16770
> URL: https://issues.apache.org/jira/browse/SPARK-16770
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Stefan Schulze
>Priority: Minor
>
> It is impossible to enter a right square bracket with a single keystroke 
> using a german keyboard layout. The problem is known from former Scala 
> version, responsible is jline-2.12.jar (see 
> https://issues.scala-lang.org/browse/SI-8759).
> Workaround: Replace jline-2.12.jar by jline.2.12.1.jar in the jars folder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16770) Spark shell not usable with german keyboard due to JLine version

2016-07-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16770:


Assignee: Apache Spark

> Spark shell not usable with german keyboard due to JLine version
> 
>
> Key: SPARK-16770
> URL: https://issues.apache.org/jira/browse/SPARK-16770
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.0
>Reporter: Stefan Schulze
>Assignee: Apache Spark
>Priority: Minor
>
> It is impossible to enter a right square bracket with a single keystroke 
> using a german keyboard layout. The problem is known from former Scala 
> version, responsible is jline-2.12.jar (see 
> https://issues.scala-lang.org/browse/SI-8759).
> Workaround: Replace jline-2.12.jar by jline.2.12.1.jar in the jars folder.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16822) Support latex in scaladoc with MathJax

2016-07-31 Thread Shuai Lin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401145#comment-15401145
 ] 

Shuai Lin commented on SPARK-16822:
---

I'm working on it and will post a PR soon.

> Support latex in scaladoc with MathJax
> --
>
> Key: SPARK-16822
> URL: https://issues.apache.org/jira/browse/SPARK-16822
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Shuai Lin
>Priority: Minor
>
> The scaladoc of some classes (mainly ml/mllib classes) include math formulas, 
> but currently it renders very ugly, e.g. [the doc of the LogisticGradient 
> class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].
> We can improve this by including MathJax javascripts in the scaladocs page, 
> much like what we do for the markdown docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16822) Support latex in scaladoc with MathJax

2016-07-31 Thread Shuai Lin (JIRA)
Shuai Lin created SPARK-16822:
-

 Summary: Support latex in scaladoc with MathJax
 Key: SPARK-16822
 URL: https://issues.apache.org/jira/browse/SPARK-16822
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Shuai Lin
Priority: Minor


The scaladoc of some classes (mainly ml/mllib classes) includes math formulas, 
but currently they render very poorly, e.g. [the doc of the LogisticGradient 
class|https://spark.apache.org/docs/2.0.0-preview/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient].

We can improve this by including the MathJax JavaScript in the scaladoc pages, 
much like what we do for the markdown docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16820) Sparse - Sparse matrix multiplication

2016-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401135#comment-15401135
 ] 

Sean Owen commented on SPARK-16820:
---

I think it depends a lot on how big the benefit is and how much complexity you 
introduce. But simple wins are always good.

> Sparse - Sparse matrix multiplication
> -
>
> Key: SPARK-16820
> URL: https://issues.apache.org/jira/browse/SPARK-16820
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Ohad Raviv
>
> While working on MCL implementation on Spark we have encountered some 
> difficulties.
> The main part of this process is distributed sparse matrix multiplication 
> that has two main steps:
> 1.Simulate multiply – preparation before the real multiplication in order 
> to see which blocks should be multiplied.
> 2.The actual blocks multiplication and summation.
> In our case the sparse matrix has 50M rows and columns, and 2B non-zeros.
> The current multiplication suffers from these issues:
> 1.A relatively trivial bug already fixed in the first step the caused the 
> process to be very slow [SPARK-16469]
> 2.Still after the bug fix, if we have too many blocks the Simulate 
> multiply will take very long time and will multiply the data many times. 
> (O(n^3) where n is the number of blocks)
> 3.Spark supports only multiplication with Dense matrices. Thus, it 
> converts a Sparse matrix into a dense matrix before the multiplication.
> 4.For summing the intermediate block results Spark uses Breeze’s CSC 
> matrix operations – here the problem is that it is very inefficient to update 
> a CSC matrix in a zero value.
> That means that with many blocks (default block size is 1024) – in our case 
> 50M/1024 ~= 50K, the simulate multiply will effectively never finish or will 
> generate 50K*16GB ~= 1000TB of data. On the other hand, if we use bigger 
> block size e.g. 100k we get OutOfMemoryException in the “toDense” method of 
> the multiply. We have worked around that by implementing our-selves both the 
> Sparse multiplication and addition in a very naïve way – but at least it 
> works.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16821) GraphX MCL algorithm

2016-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401134#comment-15401134
 ] 

Sean Owen commented on SPARK-16821:
---

Generally these don't go into Spark but are made available as your own package 
first. See 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> GraphX MCL algorithm
> 
>
> Key: SPARK-16821
> URL: https://issues.apache.org/jira/browse/SPARK-16821
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Affects Versions: 2.0.0
>Reporter: Ohad Raviv
>
> we have had the need to use MCL clustering algorithm in a project we are 
> working on. We have based our implementation on joandre's code:
> https://github.com/joandre/MCL_spark
> We had a few scaling problems that we had to work around our selves and 
> opened a seperate Jira on them.
> Since we started to work on the algorithm we have been approached a few times 
> by different people that also have the need for this algorithm.
> Do you think you can add this algorithm to your base code? it looks like now 
> there isn't any graph clustering algorithm yet..



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16819) Exception in thread “main” org.apache.spark.SparkException: Application application finished with failed status

2016-07-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-16819.
---
Resolution: Invalid

Questions should go to u...@spark.apache.org, not JIRA. However I'd also say 
that you probably won't get much help if you just paste huge logs. Narrow down 
your problem further first.

> Exception in thread “main” org.apache.spark.SparkException: Application 
> application finished with failed status
> ---
>
> Key: SPARK-16819
> URL: https://issues.apache.org/jira/browse/SPARK-16819
> Project: Spark
>  Issue Type: Question
>  Components: Streaming, YARN
>Reporter: Asmaa Ali 
>  Labels: beginner
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> What is the reason of this exception ?!
> cancerdetector@cluster-cancerdetector-m:~/SparkBWA/build$ spark-submit 
> --class SparkBWA --master yarn-cluster --deploy-mode cluster --conf 
> spark.yarn.jar=hdfs:///user/spark/spark-assembly.jar --driver-memory 1500m 
> --executor-memory 1500m --executor-cores 1 --archives ./bwa.zip --verbose 
> ./SparkBWA.jar -algorithm mem -reads paired -index /Data/HumanBase/hg38 
> -partitions 32 ERR000589_1.filt.fastq ERR000589_2.filt.fastqhb 
> Output_ERR000589
> Using properties file: /usr/lib/spark/conf/spark-defaults.conf
> Adding default property: 
> spark.executor.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar
> Adding default property: 
> spark.history.fs.logDirectory=hdfs://cluster-cancerdetector-m/user/spark/eventlog
> Adding default property: spark.eventLog.enabled=true
> Adding default property: spark.driver.maxResultSize=1920m
> Adding default property: spark.shuffle.service.enabled=true
> Adding default property: 
> spark.yarn.historyServer.address=cluster-cancerdetector-m:18080
> Adding default property: spark.sql.parquet.cacheMetadata=false
> Adding default property: spark.driver.memory=3840m
> Adding default property: spark.dynamicAllocation.maxExecutors=1
> Adding default property: spark.scheduler.minRegisteredResourcesRatio=0.0
> Adding default property: spark.yarn.am.memoryOverhead=558
> Adding default property: spark.yarn.am.memory=5586m
> Adding default property: 
> spark.driver.extraJavaOptions=-Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar
> Adding default property: spark.master=yarn-client
> Adding default property: spark.executor.memory=5586m
> Adding default property: 
> spark.eventLog.dir=hdfs://cluster-cancerdetector-m/user/spark/eventlog
> Adding default property: spark.dynamicAllocation.enabled=true
> Adding default property: spark.executor.cores=2
> Adding default property: spark.yarn.executor.memoryOverhead=558
> Adding default property: spark.dynamicAllocation.minExecutors=1
> Adding default property: spark.dynamicAllocation.initialExecutors=1
> Adding default property: spark.akka.frameSize=512
> Parsed arguments:
>   master  yarn-cluster
>   deployMode  cluster
>   executorMemory  1500m
>   executorCores   1
>   totalExecutorCores  null
>   propertiesFile  /usr/lib/spark/conf/spark-defaults.conf
>   driverMemory1500m
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  null
>   driverExtraJavaOptions  
> -Xbootclasspath/p:/usr/local/share/google/alpn/alpn-boot-8.1.7.v20160121.jar
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesfile:/home/cancerdetector/SparkBWA/build/./bwa.zip
>   mainClass   SparkBWA
>   primaryResource 
> file:/home/cancerdetector/SparkBWA/build/./SparkBWA.jar
>   nameSparkBWA
>   childArgs   [-algorithm mem -reads paired -index 
> /Data/HumanBase/hg38 -partitions 32 ERR000589_1.filt.fastq 
> ERR000589_2.filt.fastqhb Output_ERR000589]
>   jarsnull
>   packagesnull
>   packagesExclusions  null
>   repositoriesnull
>   verbose true
> Spark properties used, including those specified through
>  --conf and those from the properties file 
> /usr/lib/spark/conf/spark-defaults.conf:
>   spark.yarn.am.memoryOverhead -> 558
>   spark.driver.memory -> 1500m
>   spark.yarn.jar -> hdfs:///user/spark/spark-assembly.jar
>   spark.executor.memory -> 5586m
>   spark.yarn.historyServer.address -> cluster-cancerdetector-m:18080
>   spark.eventLog.enabled -> true
>   spark.scheduler.minRegisteredResourcesRatio -> 0.0
>   spark.dynamicAllocation.maxExecutors -> 1
>   spark.akka.frameSize -> 512
>   spark.executor.extraJavaOptions -> 
> 

[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-07-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401129#comment-15401129
 ] 

Sean Owen commented on SPARK-6305:
--

Yeah, I tried making Spark use log4j 2 and it was almost intractable. Most of 
the problem is the Maven exclusions needed to map every single instance of 
log4j 1.x in the dependencies to slf4j, because there are so many and they keep 
popping up; it's also a little harder because it's one version of log4j being 
remapped to another via slf4j. 

Updating Spark's own use of the log4j 1.x API was hard, though not impossible, 
because some existing methods have no direct counterpart. The final issue 
wasn't really a Spark issue but a usability one: lots of people's logging 
configs would stop working. I abandoned it as a result.

I wonder if it would be simpler to map slf4j to the JDK logger if we need to. 
It doesn't solve the remapping problem but at least means we don't put log4j 2 
in the way as well.
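
For illustration only, the exclusion-and-bridge dance looks roughly like this 
in sbt terms (a sketch, not Spark's actual build, which is Maven; the 
hadoop-client coordinate is just a placeholder for any dependency dragging in 
log4j 1.x):

{code:scala}
// Sketch only: exclude log4j 1.x from every dependency that pulls it in,
// route the log4j 1.x API through slf4j's bridge, and plug in an actual
// backend (here log4j 2 via its slf4j binding).
libraryDependencies ++= Seq(
  ("org.apache.hadoop" % "hadoop-client" % "2.7.2")
    .exclude("log4j", "log4j")
    .exclude("org.slf4j", "slf4j-log4j12"),
  "org.slf4j" % "log4j-over-slf4j" % "1.7.21",
  "org.apache.logging.log4j" % "log4j-slf4j-impl" % "2.6.2"
)
{code}

Every transitive dependency that pulls in log4j 1.x needs the same treatment, 
which is exactly the part that keeps popping up.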

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-07-31 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401128#comment-15401128
 ] 

Cody Koeninger commented on SPARK-16534:


This idea got a -1 from Reynold, so unless anyone is going to argue for it, the 
ticket should be closed.

> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16810) Refactor registerSinks with multiple constructors

2016-07-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401062#comment-15401062
 ] 

Apache Spark commented on SPARK-16810:
--

User 'lovexi' has created a pull request for this issue:
https://github.com/apache/spark/pull/14428

> Refactor registerSinks with multiple constructors
> 
>
> Key: SPARK-16810
> URL: https://issues.apache.org/jira/browse/SPARK-16810
> Project: Spark
>  Issue Type: Improvement
>Reporter: YangyangLiu
>
> Some metrics need detailed application information from SparkConf, so for 
> those sinks we have to pass SparkConf via the sink constructor. Refactored 
> the class-reflection part of registerSinks to allow multiple types of sink 
> constructors to be initialized.
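
As a rough illustration of the idea (a hypothetical sketch, not Spark's actual 
MetricsSystem.registerSinks code; method and parameter names are made up), the 
reflection could prefer a constructor that also takes SparkConf and fall back 
to the properties-only form:

{code:scala}
import java.util.Properties
import org.apache.spark.SparkConf

// Hypothetical sketch: try a (Properties, SparkConf) constructor first, for
// sinks that need application-level settings, then fall back to (Properties).
def instantiateSink(className: String, props: Properties, conf: SparkConf): Any = {
  val clazz = Class.forName(className)
  scala.util.Try {
    clazz.getConstructor(classOf[Properties], classOf[SparkConf])
      .newInstance(props, conf)
  }.getOrElse {
    clazz.getConstructor(classOf[Properties]).newInstance(props)
  }
}
{code}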



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16813) Remove private[sql] and private[spark] from catalyst package

2016-07-31 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16813.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14418
[https://github.com/apache/spark/pull/14418]

> Remove private[sql] and private[spark] from catalyst package
> 
>
> Key: SPARK-16813
> URL: https://issues.apache.org/jira/browse/SPARK-16813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.1.0
>
>
> The catalyst package is meant to be internal, and as a result it does not 
> make sense to mark things as private[sql] or private[spark]. It simply makes 
> debugging harder when Spark developers need to inspect the plans at runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16820) Sparse - Sparse matrix multiplication

2016-07-31 Thread Ohad Raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15401039#comment-15401039
 ] 

Ohad Raviv commented on SPARK-16820:


I will create a PR soon with a suggested fix; tell me what you think about it.

> Sparse - Sparse matrix multiplication
> -
>
> Key: SPARK-16820
> URL: https://issues.apache.org/jira/browse/SPARK-16820
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Ohad Raviv
>
> While working on an MCL implementation on Spark we have encountered some 
> difficulties.
> The main part of this process is distributed sparse matrix multiplication, 
> which has two main steps:
> 1. Simulate multiply – a preparation step before the real multiplication, to 
> determine which blocks should be multiplied.
> 2. The actual block multiplication and summation.
> In our case the sparse matrix has 50M rows and columns, and 2B non-zeros.
> The current multiplication suffers from these issues:
> 1. A relatively trivial bug in the first step, already fixed, that caused the 
> process to be very slow [SPARK-16469].
> 2. Even after the bug fix, if we have too many blocks the simulate multiply 
> takes a very long time and multiplies the data many times (O(n^3), where n is 
> the number of blocks).
> 3. Spark supports block multiplication only with dense matrices, so it 
> converts each sparse block to a dense matrix before the multiplication.
> 4. For summing the intermediate block results Spark uses Breeze’s CSC 
> matrix operations – the problem here is that it is very inefficient to update 
> a zero entry of a CSC matrix.
> That means that with many blocks (the default block size is 1024) – in our 
> case 50M/1024 ~= 50K – the simulate multiply will effectively never finish, 
> or will generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a 
> bigger block size, e.g. 100k, we get an OutOfMemoryException in the “toDense” 
> method of the multiply. We have worked around that by implementing both the 
> sparse multiplication and the addition ourselves in a very naïve way – but at 
> least it works.
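
To make point 4 above concrete, a small sketch (assuming Breeze's CSCMatrix 
API) of the costly per-element updates versus accumulating through a builder 
and materializing once:

{code:scala}
import breeze.linalg.CSCMatrix

// Inserting into previously-zero entries of a CSC matrix one at a time forces
// the internal column/row arrays to be shifted on (almost) every update.
val slow = CSCMatrix.zeros[Double](1000, 1000)
for (i <- 0 until 1000) slow(i, i) = 1.0

// Accumulating into a builder and building the matrix once avoids that cost.
val builder = new CSCMatrix.Builder[Double](rows = 1000, cols = 1000)
for (i <- 0 until 1000) builder.add(i, i, 1.0)
val fast = builder.result()
{code}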



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16821) GraphX MCL algorithm

2016-07-31 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-16821:
--

 Summary: GraphX MCL algorithm
 Key: SPARK-16821
 URL: https://issues.apache.org/jira/browse/SPARK-16821
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Affects Versions: 2.0.0
Reporter: Ohad Raviv


We have had the need to use the MCL clustering algorithm in a project we are 
working on. We have based our implementation on joandre's code:
https://github.com/joandre/MCL_spark
We had a few scaling problems that we had to work around ourselves, and opened 
a separate JIRA for them.
Since we started working on the algorithm, we have been approached a few times 
by different people who also need it.
Do you think you can add this algorithm to the code base? It looks like there 
isn't any graph clustering algorithm in GraphX yet.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16820) Sparse - Sparse matrix multiplication

2016-07-31 Thread Ohad Raviv (JIRA)
Ohad Raviv created SPARK-16820:
--

 Summary: Sparse - Sparse matrix multiplication
 Key: SPARK-16820
 URL: https://issues.apache.org/jira/browse/SPARK-16820
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.0.0
Reporter: Ohad Raviv


While working on an MCL implementation on Spark we have encountered some 
difficulties.
The main part of this process is distributed sparse matrix multiplication, 
which has two main steps:
1.  Simulate multiply – a preparation step before the real multiplication, to 
determine which blocks should be multiplied.
2.  The actual block multiplication and summation.

In our case the sparse matrix has 50M rows and columns, and 2B non-zeros.

The current multiplication suffers from these issues:

1.  A relatively trivial bug in the first step, already fixed, that caused the 
process to be very slow [SPARK-16469].
2.  Even after the bug fix, if we have too many blocks the simulate multiply 
takes a very long time and multiplies the data many times (O(n^3), where n is 
the number of blocks).
3.  Spark supports block multiplication only with dense matrices, so it 
converts each sparse block to a dense matrix before the multiplication.
4.  For summing the intermediate block results Spark uses Breeze’s CSC matrix 
operations – the problem here is that it is very inefficient to update a zero 
entry of a CSC matrix.

That means that with many blocks (the default block size is 1024) – in our case 
50M/1024 ~= 50K – the simulate multiply will effectively never finish, or will 
generate 50K*16GB ~= 1000TB of data. On the other hand, if we use a bigger 
block size, e.g. 100k, we get an OutOfMemoryException in the “toDense” method 
of the multiply. We have worked around that by implementing both the sparse 
multiplication and the addition ourselves in a very naïve way – but at least it 
works.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2016-07-31 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400993#comment-15400993
 ] 

Reynold Xin commented on SPARK-6305:


[~srowen] looked into this in the past and he didn't get everything working. 
Sean - can you share more?


> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16812) Open up SparkILoop.getAddedJars

2016-07-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16812.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Open up SparkILoop.getAddedJars
> ---
>
> Key: SPARK-16812
> URL: https://issues.apache.org/jira/browse/SPARK-16812
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.1, 2.1.0
>
>
> SparkILoop.getAddedJars is a useful method, since it lets us programmatically 
> get the list of jars that have been added.
>  
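
A minimal usage sketch (assuming the method is exposed on the 
org.apache.spark.repl.SparkILoop object and returns the added jar paths as an 
Array[String]; the exact signature isn't shown in this issue):

{code:scala}
import org.apache.spark.repl.SparkILoop

// Sketch only: list the jars the REPL has picked up (e.g. via spark.jars).
val addedJars: Array[String] = SparkILoop.getAddedJars()
addedJars.foreach(println)
{code}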



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org