[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318915#comment-14318915 ] Chris T commented on SPARK-5436: I haven't been able to make headway on this. [~MechCoder], I suggest you take this on. Validate GradientBoostedTrees during training - Key: SPARK-5436 URL: https://issues.apache.org/jira/browse/SPARK-5436 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley For Gradient Boosting, it would be valuable to compute test error on a separate validation set during training. That way, training could stop early based on the test error (or some other metric specified by the user). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
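The validation-based early stopping requested above can be sketched generically. This is a hedged illustration, not MLlib's API: `fit_iteration`, `val_error`, and the patience of 3 rounds are all assumptions standing in for GBT internals.

```python
def boost_with_validation(fit_iteration, val_error, max_iters, tol=1e-5, patience=3):
    """Stop boosting when the validation error stops improving.

    fit_iteration(i) stands in for fitting one more tree; val_error(i)
    returns the error of the current ensemble on a held-out set.
    Both are hypothetical callbacks, not Spark APIs.
    """
    best_val, best_iter = float("inf"), 0
    for i in range(max_iters):
        fit_iteration(i)
        err = val_error(i)
        if err < best_val - tol:
            best_val, best_iter = err, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_iter, best_val
```

The caller would then truncate the ensemble at `best_iter` rather than keeping all `max_iters` trees.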
[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318654#comment-14318654 ] Inigo commented on SPARK-5754: -- So, I did some test of what works and what doesn't: * '-Dspark.driver.port=21390': Error: Could not find or load main class '-Dspark.driver.port=21390' * -Dspark.driver.port=21390: OK * -Dspark.driver.port=21390: OK * -Dspark.driver.port='21390': OK * -Dspark.driver.port=21390: OK Another one that is problematic is: * -XX:OnOutOfMemoryError='kill %p': Error: Could not find or load main class ... * -XX:OnOutOfMemoryError=kill %p: Java cannot parse the option * -XX:OnOutOfMemoryError='kill %p': Java cannot parse the option Spark AM not launching on Windows - Key: SPARK-5754 URL: https://issues.apache.org/jira/browse/SPARK-5754 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.1, 1.2.0 Environment: Windows Server 2012, Hadoop 2.4.1. Reporter: Inigo I'm trying to run Spark Pi on a YARN cluster running on Windows and the AM container fails to start. The problem seems to be in the generation of the YARN command which adds single quotes (') surrounding some of the java options. In particular, the part of the code that is adding those is the escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not like the quotes for these options. 
Here is an example of the command that the container tries to execute:

{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.yarn.secondary.jars=' '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}

Once I transform it into:

{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp -Dspark.yarn.secondary.jars= -Dspark.app.name=org.apache.spark.examples.SparkPi -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}

everything seems to start. How should I deal with this? Should I create a separate function like escapeForShell for Windows and call it whenever I detect this is for Windows? Or should I add some sanity check on YARN? I checked a little and there seem to be people who are able to run Spark on YARN on Windows, so it might be something else. I didn't find anything related on Jira either.
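A platform-aware variant of the escaping step could look like the sketch below. The function name mirrors `escapeForShell` from YarnSparkHadoopUtil, but the Windows rules shown (double quotes, and only when the argument actually contains whitespace or quotes) are an assumption for illustration, not what Spark ships:

```python
import platform

def escape_for_shell(arg, windows=None):
    """Quote one launcher argument for the target shell.

    On Windows, cmd.exe does not strip single quotes, so a '-Dfoo=bar'
    wrapped in them is treated as an unknown main class. Use double
    quotes there, and only when quoting is actually needed.
    """
    if windows is None:
        windows = platform.system() == "Windows"
    if windows:
        if any(c in arg for c in ' \t"'):
            return '"' + arg.replace('"', '\\"') + '"'
        return arg  # plain -D options pass through unquoted
    # POSIX shells: single-quote, escaping embedded single quotes
    return "'" + arg.replace("'", "'\\''") + "'"
```

With these assumed rules, `-Dspark.driver.port=21390` would pass through untouched on Windows, while `kill %p` would be double-quoted.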
[jira] [Resolved] (SPARK-5757) Use json4s instead of DataFrame.toJSON in model export
[ https://issues.apache.org/jira/browse/SPARK-5757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5757. -- Resolution: Fixed Fix Version/s: 1.3.0 Use json4s instead of DataFrame.toJSON in model export -- Key: SPARK-5757 URL: https://issues.apache.org/jira/browse/SPARK-5757 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.3.0 We use DataFrame.toJSON to save/load model metadata, which depends on DataFrame's JSON support and subject to changes made there. To avoid conflicts, e.g., https://github.com/apache/spark/pull/4544, we should use json4s directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318836#comment-14318836 ] Apache Spark commented on SPARK-5780: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4572 The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0, 1.4.0 Reporter: Davies Liu There is a lot of logging coming from the driver and workers; it's noisy and alarming, and it contains many exceptions, so people are confused about whether the tests are failing or not. We should mute the logging during tests and only show it if a test fails.
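One way to get the behavior described, sketched with the standard library rather than PySpark's actual test harness: buffer log output during a test and only surface the buffer when the test fails.

```python
import contextlib
import io
import logging

@contextlib.contextmanager
def captured_logs(logger=None):
    """Temporarily route a logger's output into an in-memory buffer.

    The caller decides what to do with the buffer afterwards: discard
    it when the test passed, print it when the test failed.
    """
    logger = logger or logging.getLogger()
    buf = io.StringIO()
    saved = logger.handlers[:]
    logger.handlers = [logging.StreamHandler(buf)]
    try:
        yield buf
    finally:
        logger.handlers = saved
```

A test runner would wrap each test body in `with captured_logs() as buf:` and print `buf.getvalue()` only in the failure path.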
[jira] [Commented] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318857#comment-14318857 ] Ryan Williams commented on SPARK-5522: -- I think [SPARK-4558|https://issues.apache.org/jira/browse/SPARK-4558] is talking about the same problem. Accelerate the History Server start Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Liangliang Gu When starting the history server, all the log files will be fetched and parsed in order to get the applications' metadata, e.g. App Name, Start Time, Duration, etc. In our production cluster, there are 2600 log files (160G) in HDFS and it costs 3 hours to restart the history server, which is a little too long for us. It would be better if the history server could show logs with missing information during start-up and fill in the missing information after fetching and parsing each log file.
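The proposal can be sketched as a two-phase load; `parse_fn` below is a hypothetical stand-in for the slow fetch-and-parse of one HDFS log file, and the snapshot list stands in for what the web UI could serve at each point in time:

```python
def build_listing(log_files, parse_fn):
    """Serve a complete listing immediately with placeholder metadata,
    then fill in real metadata as each log file is parsed."""
    listing = {f: {"app_name": None, "status": "loading"} for f in log_files}
    snapshots = [dict(listing)]              # what the UI could show at startup
    for f in log_files:                      # slow pass, file by file
        listing[f] = dict(parse_fn(f), status="ready")
        snapshots.append(dict(listing))
    return snapshots
```

In a real server the slow pass would run on a background thread while the UI reads the shared listing; the sequential version only shows the ordering of states.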
[jira] [Created] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
Mark Khaitman created SPARK-5782: Summary: Python Worker / Pyspark Daemon Memory Issue Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.2.1, 1.3.0, 1.2.2 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers as the default 1 task - 1 core configuration in our environment. This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB python worker memory limit and a few gigs of executor memory. A related issue is that the individual python workers are not supposed to exceed 512MB by very much; beyond that they're supposed to spill to disk. I came across this bit of code in shuffle.py which *may* have something to do with allowing some of our python workers to somehow reach 2GB each (which, when multiplied by the number of cores per executor and the number of joins occurring in some cases, causes the out-of-memory killer to step up to its unfortunate job!) :(

{code}
def _next_limit(self):
    """
    Return the next memory limit. If the memory is not released
    after spilling, it will dump the data only when the used memory
    starts to increase.
    """
    return max(self.memory_limit, get_used_memory() * 1.05)
{code}

I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code!
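To see why `_next_limit` matters here: if memory is never released after a spill, the limit ratchets upward to 105% of current usage each time, so a worker configured for 512 MB can legitimately drift far past it. A small simulation (numbers are illustrative, not measurements from the reporter's cluster):

```python
def next_limit(memory_limit, used_memory):
    # Same logic as the quoted shuffle.py snippet: never shrink below the
    # configured limit, but allow 105% of whatever is currently in use.
    return max(memory_limit, used_memory * 1.05)

# Worst case: usage always grows to meet the new ceiling before the next spill.
limit = used = 512.0
for _ in range(10):
    limit = next_limit(512.0, used)
    used = limit
# After 10 spills the effective cap is 512 * 1.05**10, roughly 834 MB,
# and it keeps compounding with every further spill.
```

Multiply that drift by cores per executor and concurrent joins, and the OOM killer scenario above follows.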
[jira] [Created] (SPARK-5783) Include filename, line number in eventlog-parsing error message
Ryan Williams created SPARK-5783: Summary: Include filename, line number in eventlog-parsing error message Key: SPARK-5783 URL: https://issues.apache.org/jira/browse/SPARK-5783 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor While investigating why some recent applications were not showing up in my History Server UI, I found error message blocks like this in the history server logs: {code} 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Exception in parsing Spark event log. java.lang.ClassCastException: org.json4s.JsonAST$JNothing$ cannot be cast to org.json4s.JsonAST$JObject at org.apache.spark.util.JsonProtocol$.mapFromJson(JsonProtocol.scala:814) at org.apache.spark.util.JsonProtocol$.executorInfoFromJson(JsonProtocol.scala:805) ... at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1.run(FsHistoryProvider.scala:84) 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Malformed line: {Event:SparkListenerExecutorAdded,Timestamp:1422897479154,Executor ID:12,Executor Info:{Host:demeter-csmaz11-1.demeter.hpc.mssm.edu,Total Cores:4}} {code} Turns out certain files had some malformed lines due to having been generated by a forked Spark with some WIP event-log functionality. It would be nice if the first line specified the file the error was found in, and the last line specified the line number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
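The requested message shape could look like this sketch, with plain `json` standing in for `JsonProtocol` and the function name purely illustrative:

```python
import json

def replay(path, lines):
    """Parse event-log lines, reporting file name and line number on failure."""
    events, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        try:
            events.append(json.loads(line))
        except ValueError:
            # json.JSONDecodeError subclasses ValueError
            errors.append("Exception parsing Spark event log %s at line %d: %s"
                          % (path, lineno, line.strip()))
    return events, errors
```

With both the file and the line number in one message, a malformed line like the `SparkListenerExecutorAdded` example above is immediately attributable to a specific log.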
[jira] [Resolved] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5776. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4570 [https://github.com/apache/spark/pull/4570] JIRA version not of form x.y.z breaks merge_spark_pr.py --- Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Priority: Minor Fix For: 1.4.0 It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
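The failure mode is simple: the merge script expects three dot-separated numbers. A hedged sketch of the kind of normalization that would tolerate a version name like `2+` (the helper is hypothetical, not the script's actual code):

```python
import re

THREE_PART = re.compile(r"^\d+\.\d+\.\d+$")

def normalize_version(name):
    """Return an x.y.z form of a JIRA version name, or None if hopeless."""
    if THREE_PART.match(name):
        return name
    digits = re.findall(r"\d+", name)           # '2+' -> ['2']
    if not digits:
        return None
    return ".".join((digits + ["0", "0"])[:3])  # pad to three components
```

Under this sketch the JIRA version `2+` would be read as `2.0.0`, matching the fallback Sean mentions.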
[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode
[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5655: - Affects Version/s: (was: 1.3.0) Fix Version/s: 1.2.2 YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode - Key: SPARK-5655 URL: https://issues.apache.org/jira/browse/SPARK-5655 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1 Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2 Reporter: Andrew Rowson Assignee: Andrew Rowson Priority: Critical Labels: hadoop Fix For: 1.3.0, 1.2.2 When running a Spark job on a YARN cluster which doesn't run containers under the same user as the nodemanager, and also when using the YARN auxiliary shuffle service, jobs fail with something similar to: {code:java} java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied) {code} The root cause of this here: https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287 Spark will attempt to chmod 700 any application directories it creates during the job, which includes files created in the nodemanager's usercache directory. The owner of these files is the container UID, which on a secure cluster is the name of the user creating the job, and on an nonsecure cluster but with the yarn.nodemanager.container-executor.class configured is the value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user. The problem with this is that the auxiliary shuffle manager runs as part of the nodemanager, which is typically running as the user 'yarn'. This can't access these files that are only owner-readable. 
YARN already attempts to secure files created under appcache but keep them readable by the nodemanager, by setting the group of the appcache directory to 'yarn' and also setting the setgid flag. This means that files and directories created under this should also have the 'yarn' group. Normally this means that the nodemanager should also be able to read these files, but Spark setting chmod 700 wipes this out.

I'm not sure what the right approach is here. Commenting out the chmod 700 functionality makes this work on YARN, and still makes the application files only readable by the owner and the group:

{code}
/data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah
total 206M
drwxr-s---  2 nobody yarn 4.0K Feb 6 18:30 .
drwxr-s--- 12 nobody yarn 4.0K Feb 6 18:30 ..
-rw-r----- 1 nobody yarn 206M Feb 6 18:30 shuffle_0_0_0.data
{code}

But this may not be the right approach on non-YARN. Perhaps an additional step to see if this chmod 700 step is necessary (i.e., non-YARN) is required. Sadly, I don't have a non-YARN environment to test, otherwise I'd be able to suggest a patch. I believe this is a related issue in the MapReduce framework: https://issues.apache.org/jira/browse/MAPREDUCE-3728
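A minimal demonstration of the suggested relaxation (mode 0750 instead of 0700), assuming POSIX permission semantics; whether Spark should actually do this on non-YARN deployments is exactly the open question above:

```python
import os
import stat
import tempfile

d = tempfile.mkdtemp()
os.chmod(d, 0o750)  # owner: rwx, group (e.g. 'yarn'): r-x, others: none
mode = stat.S_IMODE(os.stat(d).st_mode)

owner_full = (mode & stat.S_IRWXU) == stat.S_IRWXU   # owner keeps full access
group_can_read = bool(mode & stat.S_IRGRP)           # shuffle service's group can read
others_locked_out = (mode & stat.S_IRWXO) == 0       # everyone else still excluded
```

Combined with YARN's setgid appcache directory, group-read is what lets the nodemanager-hosted shuffle service open the index and data files.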
[jira] [Commented] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
[ https://issues.apache.org/jira/browse/SPARK-5778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318823#comment-14318823 ] Apache Spark commented on SPARK-5778: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4571 Throw if nonexistent spark.metrics.conf file is provided -- Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5779) Python broadcast does not work with Kryo serializer
Davies Liu created SPARK-5779: - Summary: Python broadcast does not work with Kryo serializer Key: SPARK-5779 URL: https://issues.apache.org/jira/browse/SPARK-5779 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.1, 1.3.0 Reporter: Davies Liu Priority: Critical The PythonBroadcast, which was introduced in 1.2, cannot be serialized by Kryo.
[jira] [Commented] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318869#comment-14318869 ] Nicholas Chammas commented on SPARK-5765: - FWIW [~srowen], the last time I had this discussion with [~pwendell] I believe the direction was to have separate JIRAs for separate pieces of work. That way they can be assigned and credited to separate people. If 3 people work on fixing Bash word splitting bugs in different places, how do you track those contributions in JIRA in an automated way? Per Patrick's announcement some weeks ago, JIRA is now the source for release notes credits. word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Word split problem in the spark directory path in the scripts run-example and compute-classpath.sh. This was introduced in the fix for SPARK-4504.
[jira] [Commented] (SPARK-4856) Null empty string should not be considered as StringType at begining in Json schema inferring
[ https://issues.apache.org/jira/browse/SPARK-4856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318637#comment-14318637 ] Yin Huai commented on SPARK-4856: - [~chenghao] I think it is fine to use NullType for an empty string during the process of inferring the schema. However, I think we should not always treat an empty string as null (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala#L402). If the inferred data type is StringType, we should return the empty string. Otherwise, we are destroying information. If the inferred data type is any other data type, I think it is reasonable to return a null. Null empty string should not be considered as StringType at begining in Json schema inferring --- Key: SPARK-4856 URL: https://issues.apache.org/jira/browse/SPARK-4856 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.3.0 We have data like:

{noformat}
TestSQLContext.sparkContext.parallelize(
  """{"ip":"27.31.100.29","headers":{"Host":"1.abc.com","Charset":"UTF-8"}}""" ::
  """{"ip":"27.31.100.29","headers":{}}""" ::
  """{"ip":"27.31.100.29","headers":""}""" :: Nil)
{noformat}

As the empty string (the headers in line 3) will be considered a String, the real nested data type (the struct-typed headers in line 1) is ignored, and we then get the headers (in line 1) as StringType, which is not our expectation.
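Yin's merging rule, where NullType yields to any concrete type so an empty `headers` in one record doesn't clobber the struct inferred from another, can be sketched outside Spark like this. The type names are illustrative, not Catalyst's:

```python
def infer_type(value):
    """Very small stand-in for per-record JSON schema inference."""
    if value == "" or value is None:
        return "NullType"          # empty string starts as NullType, per the comment
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    return "StringType" if isinstance(value, str) else "OtherType"

def merge_types(a, b):
    """Combine the types inferred from two records."""
    if a == "NullType":
        return b                   # NullType yields to any concrete type
    if b == "NullType":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        return {k: merge_types(a.get(k, "NullType"), b.get(k, "NullType"))
                for k in set(a) | set(b)}
    return a if a == b else "StringType"   # conflicting scalars: widen to string
```

Applied to the example data, the record with `"headers":""` no longer forces the struct-typed `headers` down to a string.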
[jira] [Commented] (SPARK-5747) Review all Bash scripts for word splitting bugs
[ https://issues.apache.org/jira/browse/SPARK-5747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318717#comment-14318717 ] Apache Spark commented on SPARK-5747: - User 'dyross' has created a pull request for this issue: https://github.com/apache/spark/pull/4540 Review all Bash scripts for word splitting bugs --- Key: SPARK-5747 URL: https://issues.apache.org/jira/browse/SPARK-5747 Project: Spark Issue Type: Umbrella Components: Build Reporter: Nicholas Chammas Triggered by [this discussion|http://apache-spark-developers-list.1001551.n3.nabble.com/1-2-1-start-all-sh-broken-td10583.html]. Bash word splitting is a nefarious problem. http://mywiki.wooledge.org/WordSplitting

Bad: {code}command $variable{code} Good: {code}command "$variable"{code}
Bad: {code}command $variable/path{code} Good: {code}command "$variable/path"{code}
Bad: {code}command $variable/stuff*{code} Good: {code}command "$variable/stuff"*{code}

It's that simple.
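The effect is easy to demonstrate by driving a shell from Python (assuming a POSIX `sh` is available on the test machine):

```python
import subprocess

def run(script):
    """Run a small shell script and return its stdout."""
    return subprocess.run(["sh", "-c", script],
                          capture_output=True, text=True).stdout

# Unquoted: $v undergoes word splitting and becomes two separate words.
unquoted = run('v="a b"; for w in $v; do echo "<$w>"; done')
# Quoted: "$v" stays a single word.
quoted = run('v="a b"; for w in "$v"; do echo "<$w>"; done')
```

A path with a space behaves the same way, which is how an unquoted `$SPARK_HOME` breaks `run-example` and `compute-classpath.sh`.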
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4180: -- Labels: (was: backport-needed) SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.2.0, 1.3.0 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4180. --- Resolution: Fixed Fix Version/s: (was: 1.2.1) 1.2.0 Target Version/s: 1.2.0 (was: 1.2.0, 1.0.3, 1.1.2) I'm going to resolve this as fixed since it was included in 1.2.0. Now that we're about to release 1.3, I don't think that we need to backport this into branch-1.0, so I'm going to remove the {{backport-needed}} label. SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Fix For: 1.3.0, 1.2.0 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
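The guard described above, one active context per process, enforced with a lock in the constructor and `stop()`, amounts to the pattern below. This is a generic sketch, not SparkContext's code:

```python
import threading

class Context:
    """At most one active Context per process; a second constructor call
    before stop() raises instead of leaving behavior unspecified."""
    _lock = threading.Lock()
    _active = None

    def __init__(self):
        with Context._lock:
            if Context._active is not None:
                raise RuntimeError("Another Context is already running; "
                                   "call stop() on it first")
            Context._active = self

    def stop(self):
        with Context._lock:
            if Context._active is self:
                Context._active = None
```

Failing fast in the constructor turns the confusing downstream errors mentioned in SPARK-4080 into one clear message at the point of misuse.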
[jira] [Commented] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318777#comment-14318777 ] Apache Spark commented on SPARK-5776: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/4570 JIRA version not of form x.y.z breaks merge_spark_pr.py --- Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Priority: Minor It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5781) Add metadata files for JSON datasets
Yin Huai created SPARK-5781: --- Summary: Add metadata files for JSON datasets Key: SPARK-5781 URL: https://issues.apache.org/jira/browse/SPARK-5781 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai If we save a dataset in JSON format (e.g. through DataFrame.save), we should also persist the schema of the table, so we can avoid inferring the schema when we want to query it in the future.
[jira] [Commented] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1431#comment-1431 ] Sean Owen commented on SPARK-5765: -- I don't think JIRAs are for splitting up work on one issue among people. This seems like one issue. PRs track implementation and there can be several for a JIRA. You're right that there's this problem of one Assignee field. Usually there's clearly one contributor; where not I hope we'd give kudos to the newer contributor; that's 99% of all cases. Here, no problem, we have JIRAs enough to go around. word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Work split problem in spark directory path in scripts run-example and compute-classpath.sh This was introduced in defect fix SPARK-4504 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-984) SPARK_TOOLS_JAR not set if multiple tools jars exists
[ https://issues.apache.org/jira/browse/SPARK-984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-984: - Assignee: (was: Josh Rosen) SPARK_TOOLS_JAR not set if multiple tools jars exists - Key: SPARK-984 URL: https://issues.apache.org/jira/browse/SPARK-984 Project: Spark Issue Type: Bug Components: Build Affects Versions: 0.8.1, 0.9.0 Reporter: Aaron Davidson Priority: Minor If you have multiple tools assemblies (e.g., if you assembled on 0.8.1 and 0.9.0 before, for instance), then this error is thrown in spark-class: {noformat}./spark-class: line 115: [: /home/aaron/spark/tools/target/scala-2.9.3/spark-tools-assembly-0.8.1-incubating-SNAPSHOT.jar: binary operator expected{noformat} This is because of a flaw in the bash script: {noformat}if [ -e $TOOLS_DIR/target/scala-$SCALA_VERSION/*assembly*[0-9Tg].jar ]; then{noformat} which does not parse correctly if the path resolves to multiple files. The error is non-fatal, but a nuisance and presumably breaks whatever SPARK_TOOLS_JAR is used for. Currently, we error if multiple Spark assemblies are found, so we could do something similar for tools assemblies. The only issue is that means that the user will always have to go through both errors (clean the assembly/ jars then tools/ jar) when it appears that the tools/ jar is not actually important for normal operation. The second possibility is to infer the correct tools jar using the single available assembly jar, but this is slightly complicated by the code path if $FWDIR/RELEASE exists. Since I'm not 100% on what SPARK_TOOLS_JAR is even for, I'm assigning this to Josh who wrote the code initially. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
Sean Owen created SPARK-5776: Summary: JIRA version not of form x.y.z breaks merge_spark_pr.py Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Priority: Minor It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to.
[jira] [Assigned] (SPARK-5776) JIRA version not of form x.y.z breaks merge_spark_pr.py
[ https://issues.apache.org/jira/browse/SPARK-5776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-5776: Assignee: Sean Owen JIRA version not of form x.y.z breaks merge_spark_pr.py --- Key: SPARK-5776 URL: https://issues.apache.org/jira/browse/SPARK-5776 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Fix For: 1.4.0 It appears that the version 2+ I added to JIRA breaks the merge script since it expects x.y.z only. I will try to patch the logic quickly. Worst case, we can name the version 2.0.0 if we have to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4180) SparkContext constructor should throw exception if another SparkContext is already running
[ https://issues.apache.org/jira/browse/SPARK-4180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4180: -- Fix Version/s: 1.2.1 1.3.0 SparkContext constructor should throw exception if another SparkContext is already running -- Key: SPARK-4180 URL: https://issues.apache.org/jira/browse/SPARK-4180 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Labels: backport-needed Fix For: 1.3.0, 1.2.1 Spark does not currently support multiple concurrently-running SparkContexts in the same JVM (see SPARK-2243). Therefore, SparkContext's constructor should throw an exception if there is an active SparkContext that has not been shut down via {{stop()}}. PySpark already does this, but the Scala SparkContext should do the same thing. The current behavior with multiple active contexts is unspecified / not understood and it may be the source of confusing errors (see the user error report in SPARK-4080, for example). This should be pretty easy to add: just add a {{activeSparkContext}} field to the SparkContext companion object and {{synchronize}} on it in the constructor and {{stop()}} methods; see PySpark's {{context.py}} file for an example of this approach. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode
[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5655. -- Resolution: Fixed Fix Version/s: 1.3.0 YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode - Key: SPARK-5655 URL: https://issues.apache.org/jira/browse/SPARK-5655 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0, 1.2.1 Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2 Reporter: Andrew Rowson Assignee: Andrew Rowson Priority: Critical Labels: hadoop Fix For: 1.3.0 When running a Spark job on a YARN cluster which doesn't run containers under the same user as the nodemanager, and also when using the YARN auxiliary shuffle service, jobs fail with something similar to: {code:java} java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied) {code} The root cause of this here: https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287 Spark will attempt to chmod 700 any application directories it creates during the job, which includes files created in the nodemanager's usercache directory. The owner of these files is the container UID, which on a secure cluster is the name of the user creating the job, and on an nonsecure cluster but with the yarn.nodemanager.container-executor.class configured is the value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user. The problem with this is that the auxiliary shuffle manager runs as part of the nodemanager, which is typically running as the user 'yarn'. This can't access these files that are only owner-readable. 
YARN already attempts to secure files created under appcache but keeps them readable by the nodemanager, by setting the group of the appcache directory to 'yarn' and also setting the setgid flag. This means that files and directories created under it should also have the 'yarn' group. Normally this means that the nodemanager should also be able to read these files, but Spark's chmod 700 wipes this out. I'm not sure what the right approach is here. Commenting out the chmod 700 functionality makes this work on YARN, and still leaves the application files readable only by the owner and the group: {code} /data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah total 206M drwxr-s--- 2 nobody yarn 4.0K Feb 6 18:30 . drwxr-s--- 12 nobody yarn 4.0K Feb 6 18:30 .. -rw-r----- 1 nobody yarn 206M Feb 6 18:30 shuffle_0_0_0.data {code} But this may not be the right approach on non-YARN. Perhaps an additional check for whether this chmod 700 step is necessary (i.e. non-YARN) is required. Sadly, I don't have a non-YARN environment to test, otherwise I'd be able to suggest a patch. I believe there is a related issue in the MapReduce framework: https://issues.apache.org/jira/browse/MAPREDUCE-3728 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
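The permission interaction can be sketched in plain shell (locally, with made-up paths; group ownership like 'yarn' cannot be reproduced without root). The point is that an explicit chmod 700 drops the group bits that the setgid appcache scheme relies on:

```shell
# Simulate YARN's appcache layout: a group-accessible directory with the
# setgid bit set, so new entries inherit the directory's group.
workdir=$(mktemp -d)
appcache="$workdir/appcache"
mkdir "$appcache"
chmod 2770 "$appcache"                  # drwxrws---
group_before=$(ls -ld "$appcache" | cut -c5-7)

# What Spark effectively does to its application directories:
chmod 700 "$appcache"
group_after=$(ls -ld "$appcache" | cut -c5-7)

echo "group bits before: $group_before"   # rws: group can read and traverse
echo "group bits after:  $group_after"    # group access gone
rm -rf "$workdir"
```

After the chmod, the group triplet no longer contains the read bit, which is exactly why a nodemanager running as 'yarn' can no longer serve the shuffle files.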
[jira] [Created] (SPARK-5778) Throw if nonexistent spark.metrics.conf file is provided
Ryan Williams created SPARK-5778: Summary: Throw if nonexistent spark.metrics.conf file is provided Key: SPARK-5778 URL: https://issues.apache.org/jira/browse/SPARK-5778 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor Spark looks for a {{MetricsSystem}} configuration file when the {{spark.metrics.conf}} parameter is set, [defaulting to the path {{metrics.properties}} when it's not set|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L52-L55]. In the event of a failure to find or parse the file, [the exception is caught and an error is logged|https://github.com/apache/spark/blob/466b1f671b21f575d28f9c103f51765790914fe3/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#L61]. This seems like reasonable behavior in the general case where the user has not specified a {{spark.metrics.conf}} file. However, I've been bitten several times by having specified a file that all or some executors did not have present (I typo'd the path, or forgot to add an additional {{--files}} flag to make my local metrics config file get shipped to all executors), the error was swallowed and I was confused about why I'd captured no metrics from a job that appeared to have run successfully. I'd like to change the behavior to actually throw if the user has specified a configuration file that doesn't exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
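The proposed behaviour can be sketched as a shell analogy (the actual change would live in MetricsConfig.scala; the function and its messages here are illustrative only): tolerate a missing file only when the user did not explicitly ask for one.

```shell
# Fail fast only when the user explicitly configured a metrics file that
# does not exist; a missing *default* metrics.properties stays non-fatal.
check_metrics_conf() {
  explicit_conf=$1   # value of spark.metrics.conf, or "" if unset
  if [ -n "$explicit_conf" ] && [ ! -f "$explicit_conf" ]; then
    echo "error: spark.metrics.conf=$explicit_conf does not exist" >&2
    return 1
  fi
  return 0
}

check_metrics_conf "" && echo "no explicit conf: tolerated"
check_metrics_conf /no/such/metrics.properties || echo "explicit missing conf: rejected"
```

This preserves the current lenient default while surfacing the typo'd-path and forgotten `--files` cases described above.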
[jira] [Updated] (SPARK-5522) Accelerate the History Server start
[ https://issues.apache.org/jira/browse/SPARK-5522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ryan Williams updated SPARK-5522: - Summary: Accelerate the History Server start (was: Accelerate the Histroty Server start) Accelerate the History Server start --- Key: SPARK-5522 URL: https://issues.apache.org/jira/browse/SPARK-5522 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Reporter: Liangliang Gu When starting the history server, all the log files will be fetched and parsed in order to get the applications' meta data e.g. App Name, Start Time, Duration, etc. In our production cluster, there exist 2600 log files (160G) in HDFS and it costs 3 hours to restart the history server, which is a little bit too long for us. It would be better, if the history server can show logs with missing information during start-up and fill the missing information after fetching and parsing a log file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5210) Support log rolling in EventLogger
[ https://issues.apache.org/jira/browse/SPARK-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5210. --- Resolution: Later Assignee: (was: Josh Rosen) I'm closing this issue for now, since my original motivation for this feature has changed and there's no reason to let it clutter up the JIRA in the meantime. Support log rolling in EventLogger -- Key: SPARK-5210 URL: https://issues.apache.org/jira/browse/SPARK-5210 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Reporter: Josh Rosen For long-running Spark applications (e.g. running for days / weeks), the Spark event log may grow to be very large. As a result, it would be useful if EventLoggingListener supported log file rolling / rotation. Adding this feature will involve changes to the HistoryServer in order to be able to load event logs from a sequence of files instead of a single file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5655) YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode
[ https://issues.apache.org/jira/browse/SPARK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5655: - Assignee: Andrew Rowson YARN Auxiliary Shuffle service can't access shuffle files on Hadoop cluster configured in secure mode - Key: SPARK-5655 URL: https://issues.apache.org/jira/browse/SPARK-5655 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.3.0, 1.2.1 Environment: Both CDH5.3.0 and CDH5.1.3, latest build on branch-1.2 Reporter: Andrew Rowson Assignee: Andrew Rowson Priority: Critical Labels: hadoop Fix For: 1.3.0 When running a Spark job on a YARN cluster which doesn't run containers under the same user as the nodemanager, and also when using the YARN auxiliary shuffle service, jobs fail with something similar to: {code:java} java.io.FileNotFoundException: /data/9/yarn/nm/usercache/username/appcache/application_1423069181231_0032/spark-c434a703-7368-4a05-9e99-41e77e564d1d/3e/shuffle_0_0_0.index (Permission denied) {code} The root cause of this here: https://github.com/apache/spark/blob/branch-1.2/core/src/main/scala/org/apache/spark/util/Utils.scala#L287 Spark will attempt to chmod 700 any application directories it creates during the job, which includes files created in the nodemanager's usercache directory. The owner of these files is the container UID, which on a secure cluster is the name of the user creating the job, and on an nonsecure cluster but with the yarn.nodemanager.container-executor.class configured is the value of yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user. The problem with this is that the auxiliary shuffle manager runs as part of the nodemanager, which is typically running as the user 'yarn'. This can't access these files that are only owner-readable. YARN already attempts to secure files created under appcache but keep them readable by the nodemanager, by setting the group of the appcache directory to 'yarn' and also setting the setgid flag. 
This means that files and directories created under it should also have the 'yarn' group. Normally this means that the nodemanager should also be able to read these files, but Spark's chmod 700 wipes this out. I'm not sure what the right approach is here. Commenting out the chmod 700 functionality makes this work on YARN, and still leaves the application files readable only by the owner and the group: {code} /data/1/yarn/nm/usercache/username/appcache/application_1423247249655_0001/spark-c7a6fc0f-e5df-49cf-a8f5-e51a1ca087df/0c # ls -lah total 206M drwxr-s--- 2 nobody yarn 4.0K Feb 6 18:30 . drwxr-s--- 12 nobody yarn 4.0K Feb 6 18:30 .. -rw-r----- 1 nobody yarn 206M Feb 6 18:30 shuffle_0_0_0.data {code} But this may not be the right approach on non-YARN. Perhaps an additional check for whether this chmod 700 step is necessary (i.e. non-YARN) is required. Sadly, I don't have a non-YARN environment to test, otherwise I'd be able to suggest a patch. I believe there is a related issue in the MapReduce framework: https://issues.apache.org/jira/browse/MAPREDUCE-3728 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5777) Completes data source filter types and remove CatalystScan
Cheng Lian created SPARK-5777: - Summary: Completes data source filter types and remove CatalystScan Key: SPARK-5777 URL: https://issues.apache.org/jira/browse/SPARK-5777 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1, 1.2.0, 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Currently the data source API only supports a fraction of common filters, esp. {{And}} is not supported yet. To workaround this issue and enable full filter push-down optimization in the Parquet data source, {{CatalystScan}} was introduced to receive full Catalyst filter expressions. This class should be removed, since in principle, data source implementations shouldn't touch Catalyst expressions (which are not part of the public developer API). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5767) Migrate Parquet data source to the write support of data source API
Cheng Lian created SPARK-5767: - Summary: Migrate Parquet data source to the write support of data source API Key: SPARK-5767 URL: https://issues.apache.org/jira/browse/SPARK-5767 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Migrate to the newly introduced data source write support API (SPARK-5658). Add support for overwriting and appending to existing tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4819) Remove Guava's Optional from public API
[ https://issues.apache.org/jira/browse/SPARK-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4819: - Target Version/s: 2+ Remove Guava's Optional from public API - Key: SPARK-4819 URL: https://issues.apache.org/jira/browse/SPARK-4819 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Filing this mostly so this isn't forgotten. Spark currently exposes Guava types in its public API (the {{Optional}} class is used in the Java bindings). This makes it hard to properly hide Guava from user applications, and makes mixing different Guava versions with Spark a little sketchy (even if things should work, since those classes are pretty simple in general). Since this changes the public API, it has to be done in a release that allows such breakages. But it would be nice to at least have a transition plan for deprecating the affected APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3369) Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator
[ https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3369: - Priority: Major (was: Critical) Target Version/s: 2+ (was: 1.2.0) Affects Version/s: 1.2.1 Assignee: Sean Owen Java mapPartitions Iterator-Iterable is inconsistent with Scala's Iterator-Iterator - Key: SPARK-3369 URL: https://issues.apache.org/jira/browse/SPARK-3369 Project: Spark Issue Type: Improvement Components: Java API Affects Versions: 1.0.2, 1.2.1 Reporter: Sean Owen Assignee: Sean Owen Labels: breaking_change Attachments: FlatMapIterator.patch {{mapPartitions}} in the Scala RDD API takes a function that transforms an {{Iterator}} to an {{Iterator}}: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD In the Java RDD API, the equivalent is a FlatMapFunction, which operates on an {{Iterator}} but is required to return an {{Iterable}}, which is a stronger condition and appears inconsistent. It's a problematic inconsistency because it seems to require copying all of the input into memory in order to create an object that can be iterated many times, since the input does not afford this itself. Similarly for other {{mapPartitions*}} methods and other {{*FlatMapFunctions}}s in Java. (Is there a reason for this difference that I'm overlooking?) If I'm right that this was an inadvertent inconsistency, then the big issue here is that this is, of course, part of a public API. Workarounds I can think of: promise that Spark will only call {{iterator()}} once, so implementors can use a hacky {{IteratorIterable}} that returns the same {{Iterator}}; or make a series of methods accepting a {{FlatMapFunction2}}, etc., with the desired signature, and deprecate the existing ones. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()
[ https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3266: - Target Version/s: 2+ (was: 1.1.1, 1.2.0) Assignee: Sean Owen JavaDoubleRDD doesn't contain max() --- Key: SPARK-3266 URL: https://issues.apache.org/jira/browse/SPARK-3266 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.1, 1.0.2, 1.1.0, 1.2.0 Reporter: Amey Chaugule Assignee: Sean Owen Attachments: spark-repro-3266.tar.gz While I can compile my code, I see: Caused by: java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double; When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I don't notice max() although it is clearly listed in the documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318041#comment-14318041 ] Sean Owen commented on SPARK-5770: -- Can you be more specific about where you think the code path fails to copy the new file? Your PR does not touch the copying code but disables overwrite entirely, which is not OK. Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar content and upload it again. We can see that the jar file on local disk has been updated, but the classloader still loads the old one. The executor log has no error or exception pointing to this. I used spark-shell to test this, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3365) Failure to save Lists to Parquet
[ https://issues.apache.org/jira/browse/SPARK-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317768#comment-14317768 ] Yi Tian commented on SPARK-3365: The reason is Spark generated wrong schema for type {{List}} in {{ScalaReflection.scala}} for example: the generated schema for type {{Seq\[String\]}} is: {code} {name:x,type:{type:array,elementType:string,containsNull:true},nullable:true,metadata:{}} {code} the generated schema for type {{List\[String\]}} is: {code} {name:x,type:{type:struct,fields:[]},nullable:true,metadata:{}} {code} The related code is [here|https://github.com/apache/spark/blob/500dc2b4b3136029457e708859fe27da93b1f9e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L110] The order of resolution is: # UserCustomType # Option\[\_\] # Product # Array\[Byte\] # Array\[\_\] # Seq\[\_\] # Map\[\_, _\] # String # Timestamp # java.sql.Date # BigDecimal # java.math.BigDecimal # Decimal # java.lang.Integer # ... I think the {{List}} type should belong to {{Seq\[\_\]}} pattern, so we should move {{Product}} behind {{Seq\[\_\]}}. May I open a PR for this issue? Failure to save Lists to Parquet Key: SPARK-3365 URL: https://issues.apache.org/jira/browse/SPARK-3365 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Reproduction, same works if type is Seq. 
(props to [~chrisgrier] for finding this) {code} scala case class Test(x: List[String]) defined class Test scala sparkContext.parallelize(Test(List()) :: Nil).saveAsParquetFile(bug) 23:09:51.807 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99) at parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:92) at parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
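The proposed reordering is easy to motivate with a loose match-ordering analogy in shell (the real fix is reordering cases in ScalaReflection.scala, where {{List}} matches both {{Product}} and {{Seq\[\_\]}}): in both Scala pattern matching and a shell case, the first matching branch wins, so a general pattern placed too early shadows the more specific one below it.

```shell
# Hypothetical classifier: the "buggy" ordering tests the product-like
# pattern first, so a List is misclassified; the "fixed" ordering tests
# the sequence patterns first, mirroring the proposed move of Product
# behind Seq.
classify_buggy() {
  case $1 in
    List*|Tuple*) echo product ;;   # too early: swallows List
    Seq*)         echo seq ;;
  esac
}

classify_fixed() {
  case $1 in
    Seq*|List*)   echo seq ;;       # collection patterns first
    Tuple*)       echo product ;;
  esac
}

echo "buggy: List[String] -> $(classify_buggy 'List[String]')"
echo "fixed: List[String] -> $(classify_fixed 'List[String]')"
```

The misclassified branch is what produces the empty {{struct}} schema (and the downstream divide-by-zero in the Parquet writer) described above.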
[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5763: Description: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinions about this? was: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinion about this? Sort-based Groupby and Join to resolve skewed data -- Key: SPARK-5763 URL: https://issues.apache.org/jira/browse/SPARK-5763 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinions about this? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5763) Sort-based Groupby and Join to resolve skewed data
[ https://issues.apache.org/jira/browse/SPARK-5763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-5763: Description: In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinion about this? was:In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. Sort-based Groupby and Join to resolve skewed data -- Key: SPARK-5763 URL: https://issues.apache.org/jira/browse/SPARK-5763 Project: Spark Issue Type: Improvement Reporter: Lianhui Wang In SPARK-4644, it provide a way to resolve skewed data. But when we has more keys that are skewed, I think that the way in SPARK-4644 is inappropriate. So we can use sort-merge to resolve skewed-groupby and skewed-join.because SPARK-2926 implement merge-sort, we can implement sort-merge for skewed based on SPARK-2926. And i have implemented sort-merge-groupby and it is very well for skewed data in my test.Later i will implement sort-merge-join to resolve skewed-join. [~rxin] [~sandyr] [~andrewor14] how about your opinion about this? 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5508) [hive context] Unable to query array once saved as parquet
[ https://issues.apache.org/jira/browse/SPARK-5508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301349#comment-14301349 ] Ayoub Benali edited comment on SPARK-5508 at 2/12/15 9:53 AM: -- I narrowed down the problem: the issue seems to come from the {{insert into table persisted_table select * from tmp_table}} command. Using Scala code, saving the SchemaRDD as a Parquet file, reloading it, saving it in the Hive metastore, and querying the array-typed column works just fine. So when I do the insertion from tmp_table to persisted_table using HiveContext, the data in the array-typed column seems to be inserted into Parquet in the wrong way, which breaks the query afterwards. I tried the Spark SQL CLI and the insertIntoTable method to do the insertion as well, but both led to the same issue when querying the table. was (Author: ayoub): I narrowed down the problem: the issue seems to come from the {{insert into table persisted_table select * from tmp_table}} command. Using Scala code, saving the SchemaRDD as a Parquet file, reloading it, saving it in the Hive metastore, and querying the array-typed column works just fine. So when I do the insertion from tmp_table to persisted_table using HiveContext, the data in the array-typed column seems to be inserted into Parquet in the wrong way, which breaks the query afterwards. I tried the Spark SQL CLI to do the insertion as well, but it led to the same issue when querying the table. 
[hive context] Unable to query array once saved as parquet -- Key: SPARK-5508 URL: https://issues.apache.org/jira/browse/SPARK-5508 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Environment: mesos, cdh Reporter: Ayoub Benali Labels: hivecontext, parquet When the table is saved as Parquet, we cannot query a field which is an array of structs, as shown below:
{noformat}
scala> val data1 = """{
     | "timestamp": 1422435598,
     | "data_array": [
     | {
     | "field1": 1,
     | "field2": 2
     | }
     | ]
     | }"""
scala> val data2 = """{
     | "timestamp": 1422435598,
     | "data_array": [
     | {
     | "field1": 3,
     | "field2": 4
     | }
     | ]
     | }"""
scala> val jsonRDD = sc.makeRDD(data1 :: data2 :: Nil)
scala> val rdd = hiveContext.jsonRDD(jsonRDD)
scala> rdd.printSchema
root
 |-- data_array: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field1: integer (nullable = true)
 |    |    |-- field2: integer (nullable = true)
 |-- timestamp: integer (nullable = true)
scala> rdd.registerTempTable("tmp_table")
scala> hiveContext.sql("select data.field1 from tmp_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
res3: Array[org.apache.spark.sql.Row] = Array([1], [3])
scala> hiveContext.sql("SET hive.exec.dynamic.partition = true")
scala> hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
scala> hiveContext.sql("set parquet.compression=GZIP")
scala> hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
scala> hiveContext.sql("create external table if not exists persisted_table(data_array ARRAY<STRUCT<field1: INT, field2: INT>>, timestamp INT) STORED AS PARQUET Location 'hdfs:///test_table'")
scala> hiveContext.sql("insert into table persisted_table select * from tmp_table").collect
scala> hiveContext.sql("select data.field1 from persisted_table LATERAL VIEW explode(data_array) nestedStuff AS data").collect
parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file hdfs://*/test_table/part-1 at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at
[jira] [Updated] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map
[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5739: - Component/s: MLlib Priority: Minor (was: Major) Size exceeds Integer.MAX_VALUE in File Map -- Key: SPARK-5739 URL: https://issues.apache.org/jira/browse/SPARK-5739 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1 Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has 128GB RAM and 24 cores. The data is just 40GB, and there are 48 parallel tasks on a node. Reporter: DjvuLee Priority: Minor I ran the k-means algorithm on randomly generated data, but this problem occurred after some iterations. I tried several times, and the problem is reproducible. Because the data is randomly generated, I wonder whether this is a bug. Or, if random data can lead to a scenario where the size is bigger than Integer.MAX_VALUE, can we check the size before using the file map? 2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.util.SizeEstimator - Failed to check whether UseCompressedOops is set; assuming yes [error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850) at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105) at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140) at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747) at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598) at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869) at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79) at 
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:68) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36) at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29) at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809) at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270) at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143) at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126) at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338) at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348) at KMeansDataGenerator$.main(kmeans.scala:105) at KMeansDataGenerator.main(kmeans.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55) at java.lang.reflect.Method.invoke(Method.java:619) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
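The reporter's suggestion above (check the size before memory-mapping) comes from the fact that Java's `FileChannel.map` rejects regions larger than `Integer.MAX_VALUE` (2^31 - 1 bytes), which is exactly the exception in the stack trace. A hedged sketch of a pre-check that splits an oversized block into sub-2GB slices; `chunk_ranges` is a hypothetical helper, not Spark code:

```python
# Java's FileChannel.map cannot map a region larger than Integer.MAX_VALUE
# (2**31 - 1 bytes). Instead of failing mid-job, a caller could split the
# block into slices that each fit under the limit.
INT_MAX = 2**31 - 1

def chunk_ranges(size_bytes, limit=INT_MAX):
    """Yield (offset, length) pairs covering size_bytes, each <= limit."""
    offset = 0
    while offset < size_bytes:
        length = min(limit, size_bytes - offset)
        yield (offset, length)
        offset += length
```

For a 5GB block this yields three mappable slices instead of one illegal 5GB mapping.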
[jira] [Commented] (SPARK-5766) Slow RowMatrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317961#comment-14317961 ] Sean Owen commented on SPARK-5766: -- Given that RowMatrix is a row-by-row representation, it would have to be vector-matrix multiplication I think. In principle I think you can leverage native code here; the only question is whether it overcomes the overhead of the call for typical inputs, but it's possible. Slow RowMatrix multiplication - Key: SPARK-5766 URL: https://issues.apache.org/jira/browse/SPARK-5766 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Amaru Cuba Gyllensten Priority: Minor Labels: matrix Looking at the source code for RowMatrix multiplication by a local matrix, it seems like it is going through all column vectors of the matrix, doing a pairwise dot product on each column. It seems like this could be sped up by using gemm, performing full matrix-matrix multiplication on the local data (or gemv, for vector-matrix multiplication), as is done in BlockMatrix or Matrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
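The proposed change is behavior-preserving: multiplying column by column and performing one full matrix-matrix product give the same result, so the only difference is how many calls reach native BLAS. A pure-Python sketch of that equivalence (illustrative only; the real speedup comes from replacing the per-column path with a single native gemm call):

```python
def matvec(rows, col):
    """Dot each row of `rows` with the vector `col` (one per-column pass)."""
    return [sum(r * c for r, c in zip(row, col)) for row in rows]

def columnwise(rows, local):
    """Per-column products, as the issue says RowMatrix currently does."""
    ncols = len(local[0])
    cols = [[local[i][j] for i in range(len(local))] for j in range(ncols)]
    out_cols = [matvec(rows, col) for col in cols]
    return [list(t) for t in zip(*out_cols)]  # transpose back to row-major

def gemm(rows, local):
    """One full matrix-matrix multiply (what BLAS gemm would do natively)."""
    ncols = len(local[0])
    return [[sum(row[k] * local[k][j] for k in range(len(local)))
             for j in range(ncols)] for row in rows]
```

Both paths produce identical output, so switching to gemm only changes where the work happens, not the answer.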
[jira] [Updated] (SPARK-5644) Delete tmp dir when sc is stop
[ https://issues.apache.org/jira/browse/SPARK-5644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5644: - Assignee: Weizhong Delete tmp dir when sc is stop -- Key: SPARK-5644 URL: https://issues.apache.org/jira/browse/SPARK-5644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Weizhong Assignee: Weizhong Priority: Minor Fix For: 1.4.0 We run the driver as a long-lived service that never exits. In this service process we create a SparkContext, run a job, and then stop the context. Because we only call sc.stop() and never exit the service process, the tmp dirs created by HttpFileServer and SparkEnv are not deleted after the SparkContext is stopped. This leads to too many tmp dirs accumulating when we create many SparkContexts to run jobs in this service process. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
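The fix direction, tracking temp directories and deleting them when the context is stopped, can be sketched generically; the `Service` class and method names here are illustrative, not Spark's actual code:

```python
import shutil
import tempfile

class Service:
    """Illustrative long-lived process: each context start registers a
    temp dir, and stop() removes them, so repeated start/stop cycles in
    the same process do not accumulate directories."""

    def __init__(self):
        self._tmp_dirs = []

    def start_context(self):
        # Stand-in for the dirs HttpFileServer/SparkEnv would create.
        d = tempfile.mkdtemp(prefix="spark-")
        self._tmp_dirs.append(d)
        return d

    def stop(self):
        # Clean up everything this context created, even though the
        # surrounding process keeps running.
        for d in self._tmp_dirs:
            shutil.rmtree(d, ignore_errors=True)
        self._tmp_dirs.clear()
```

The key point is that cleanup is tied to stop(), not to process exit.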
[jira] [Commented] (SPARK-5436) Validate GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318021#comment-14318021 ] Manoj Kumar commented on SPARK-5436: Hi, I would like to give this a go. [~ChrisT] are you still working on this? Otherwise I would love to carry this forward. Validate GradientBoostedTrees during training - Key: SPARK-5436 URL: https://issues.apache.org/jira/browse/SPARK-5436 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley For Gradient Boosting, it would be valuable to compute test error on a separate validation set during training. That way, training could stop early based on the test error (or some other metric specified by the user). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
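The validation-based early stopping the issue asks for follows a standard pattern; this generic sketch (not MLlib's API — `boost_with_validation` and `errors` are hypothetical) stops boosting once validation error fails to improve by a tolerance:

```python
def boost_with_validation(errors, tol, max_iters):
    """Illustrative early stopping for boosting. `errors(i)` returns the
    validation error after adding the (i+1)-th tree. Stop when the
    improvement over the best error so far falls below `tol`."""
    best = float("inf")
    for i in range(max_iters):
        err = errors(i)
        if best - err < tol:
            return i  # stop early: no meaningful improvement on validation
        best = err
    return max_iters
```

The same loop supports any user-specified metric, as the issue suggests, by changing what `errors` computes.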
[jira] [Commented] (SPARK-3365) Failure to save Lists to Parquet
[ https://issues.apache.org/jira/browse/SPARK-3365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317949#comment-14317949 ] Cheng Lian commented on SPARK-3365: --- Hey [~tianyi], please open a PR for this. However, I'd suggest adding a {{List[_]}} clause before {{Product}}, rather than moving the latter. Thanks! Failure to save Lists to Parquet Key: SPARK-3365 URL: https://issues.apache.org/jira/browse/SPARK-3365 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Reproduction, same works if type is Seq. (props to [~chrisgrier] for finding this) {code} scala> case class Test(x: List[String]) defined class Test scala> sparkContext.parallelize(Test(List()) :: Nil).saveAsParquetFile("bug") 23:09:51.807 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.ArithmeticException: / by zero at parquet.hadoop.InternalParquetRecordWriter.initStore(InternalParquetRecordWriter.java:99) at parquet.hadoop.InternalParquetRecordWriter.init(InternalParquetRecordWriter.java:92) at parquet.hadoop.ParquetRecordWriter.init(ParquetRecordWriter.java:64) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:282) at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252) at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:300) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:318) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318088#comment-14318088 ] Al M commented on SPARK-5768: - So when it says *Memory Used* 3.2GB / 20GB, it actually means we are using 3.2GB of memory for caching out of a total of 20GB available for caching? Calling the column 'Storage Memory' would be clearer to me. If changing the column heading is not an option, then a tooltip explaining that it refers to memory used for storage would help. I'd find it pretty useful to have another column that shows my total memory usage. Right now I can only see this by running 'free' or 'top' on every machine, or by looking at the Yarn UI. Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
[ https://issues.apache.org/jira/browse/SPARK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5768: - Component/s: (was: YARN) Web UI Issue Type: Improvement (was: Bug) It sounds like you are looking at the memory available for caching, which is ~0.54 (0.6*0.9) of the total. It's correct in that sense, but this is a common misconception. I suggest this ticket track a small UI change to make this clearer. What do you think would be clearer? Spark UI Shows incorrect memory under Yarn -- Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0, 1.2.1 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
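Sean's ~0.54 figure is the product of the Spark 1.x defaults spark.storage.memoryFraction (0.6) and spark.storage.safetyFraction (0.9). A quick check against the reporter's numbers, treating '40g' as the executor heap (a ballpark sketch only — the exact UI value also depends on JVM overhead):

```python
def storage_memory_gb(executor_memory_gb,
                      memory_fraction=0.6, safety_fraction=0.9):
    """Approximate memory the 1.x UI reports as available for caching:
    heap * spark.storage.memoryFraction * spark.storage.safetyFraction."""
    return executor_memory_gb * memory_fraction * safety_fraction

# 40g executor heap -> roughly the ~20GB the reporter sees in the UI
approx = storage_memory_gb(40)
```

So a 40g executor legitimately shows about half that as "Memory Used" capacity, which is the misconception the ticket tracks.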
[jira] [Created] (SPARK-5769) Set params in constructor and setParams() in Python ML pipeline API
Xiangrui Meng created SPARK-5769: Summary: Set params in constructor and setParams() in Python ML pipeline API Key: SPARK-5769 URL: https://issues.apache.org/jira/browse/SPARK-5769 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng As discussed in the design doc of SPARK-4586, we want to make Python users happy (no setters/getters) while keeping a low maintenance cost by forcing keyword arguments in the constructor and in setParams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
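The convention described above, forcing keyword arguments in both the constructor and setParams, can be enforced directly with Python's keyword-only parameters. A generic sketch (hypothetical `Estimator` class and param names, not the actual pipeline API):

```python
class Estimator:
    """Illustrative: params are settable only via keyword arguments, both
    in the constructor and in setParams, so positional mixups are
    impossible and no per-param setters/getters are needed."""

    def __init__(self, *, maxIter=100, regParam=0.0):
        # The bare `*` makes every parameter keyword-only.
        self.setParams(maxIter=maxIter, regParam=regParam)

    def setParams(self, *, maxIter=None, regParam=None):
        if maxIter is not None:
            self.maxIter = maxIter
        if regParam is not None:
            self.regParam = regParam
        return self  # allow chaining
```

Calling `Estimator(10)` raises a TypeError, which is exactly the "forcing" behavior the design doc wants.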
[jira] [Commented] (SPARK-5769) Set params in constructor and setParams() in Python ML pipeline API
[ https://issues.apache.org/jira/browse/SPARK-5769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318005#comment-14318005 ] Apache Spark commented on SPARK-5769: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4564 Set params in constructor and setParams() in Python ML pipeline API --- Key: SPARK-5769 URL: https://issues.apache.org/jira/browse/SPARK-5769 Project: Spark Issue Type: New Feature Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng As discussed in the design doc of SPARK-4586, we want to make Python users happy (no setters/getters) while keeping a low maintenance cost by forcing keyword arguments in the constructor and in setParams. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
meiyoula created SPARK-5770: --- Summary: Use addJar() to upload a new jar file to executor, it can't be added to classloader Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar's content and upload it again. We can see that the jar file on the local disk has been updated, but the classloader still loads the old one. The executor log shows no error or exception pointing to this. I used spark-shell to test it, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5770) Use addJar() to upload a new jar file to executor, it can't be added to classloader
[ https://issues.apache.org/jira/browse/SPARK-5770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318027#comment-14318027 ] Apache Spark commented on SPARK-5770: - User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/4565 Use addJar() to upload a new jar file to executor, it can't be added to classloader --- Key: SPARK-5770 URL: https://issues.apache.org/jira/browse/SPARK-5770 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula First use addJar() to upload a jar to the executor, then change the jar's content and upload it again. We can see that the jar file on the local disk has been updated, but the classloader still loads the old one. The executor log shows no error or exception pointing to this. I used spark-shell to test it, with spark.files.overwrite set to true. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5768) Spark UI Shows incorrect memory under Yarn
Al M created SPARK-5768: --- Summary: Spark UI Shows incorrect memory under Yarn Key: SPARK-5768 URL: https://issues.apache.org/jira/browse/SPARK-5768 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.1, 1.2.0 Environment: Centos 6 Reporter: Al M Priority: Trivial I am running Spark on Yarn with 2 executors. The executors are running on separate physical machines. I have spark.executor.memory set to '40g'. This is because I want to have 40g of memory used on each machine. I have one executor per machine. When I run my application I see from 'top' that both my executors are using the full 40g of memory I allocated to them. The 'Executors' tab in the Spark UI shows something different. It shows the memory used as a total of 20GB per executor e.g. x / 20.3GB. This makes it look like I only have 20GB available per executor when really I have 40GB available. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result
[ https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317976#comment-14317976 ] Apache Spark commented on SPARK-4553: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4563 query for parquet table with string fields in spark sql hive get binary result -- Key: SPARK-4553 URL: https://issues.apache.org/jira/browse/SPARK-4553 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: wangfei Assignee: Cheng Lian Priority: Blocker Run: create table test_parquet(key int, value string) stored as parquet; insert into table test_parquet select * from src; select * from test_parquet; The result looks like the following (raw byte arrays instead of strings): ... 282 [B@38fda3b 138 [B@1407a24 238 [B@12de6fb 419 [B@6c97695 15 [B@4885067 118 [B@156a8d3 72 [B@65d20dd 90 [B@4c18906 307 [B@60b24cc 19 [B@59cf51b 435 [B@39fdf37 10 [B@4f799d7 277 [B@3950951 273 [B@596bf4b 306 [B@3e91557 224 [B@3781d61 309 [B@2d0d128 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5767) Migrate Parquet data source to the write support of data source API
[ https://issues.apache.org/jira/browse/SPARK-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317977#comment-14317977 ] Apache Spark commented on SPARK-5767: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/4563 Migrate Parquet data source to the write support of data source API --- Key: SPARK-5767 URL: https://issues.apache.org/jira/browse/SPARK-5767 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Migrate to the newly introduced data source write support API (SPARK-5658). Add support for overwriting and appending to existing tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5766) Slow RowMatrix multiplication
[ https://issues.apache.org/jira/browse/SPARK-5766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318013#comment-14318013 ] Amaru Cuba Gyllensten commented on SPARK-5766: -- Yeah, I noticed it when multiplying a 10,000 by 2000 IndexedRowMatrix with its transpose (represented as a local matrix), and doing some reductions on the rows. Running on my local machine, the multiplication in spark took about 7 times longer than an implementation where the left hand matrix was chunked and each chunk (consisting of ~1000 rows) was multiplied with gemm (or similar). This might be an unfair comparison, as it kinda requires the rows to be stored locally as dense matrices. (A use case which might be covered by the upcoming BlockMatrix?) Slow RowMatrix multiplication - Key: SPARK-5766 URL: https://issues.apache.org/jira/browse/SPARK-5766 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Amaru Cuba Gyllensten Priority: Minor Labels: matrix Looking at the source code for RowMatrix multiplication by a local matrix, it seems like it is going through all column vectors of the matrix, doing a pairwise dot product on each column. It seems like this could be sped up by using gemm, performing full matrix-matrix multiplication on the local data (or gemv, for vector-matrix multiplication), as is done in BlockMatrix or Matrix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5786) Documentation of Narrow Dependencies
Imran Rashid created SPARK-5786: --- Summary: Documentation of Narrow Dependencies Key: SPARK-5786 URL: https://issues.apache.org/jira/browse/SPARK-5786 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Imran Rashid Narrow dependencies can really improve job performance by skipping shuffles entirely. However, aside from being mentioned in some early papers and during some meetups, they aren't explained (or even mentioned) in the docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5759) ExecutorRunnable should catch YarnException while NMClient start container
[ https://issues.apache.org/jira/browse/SPARK-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5759: - Affects Version/s: 1.2.0 ExecutorRunnable should catch YarnException while NMClient start container -- Key: SPARK-5759 URL: https://issues.apache.org/jira/browse/SPARK-5759 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Lianhui Wang Sometimes, for various reasons, an exception occurs while NMClient starts a container. For example, if spark_shuffle is not configured on some machines, it throws: java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist. Because YarnAllocator uses a ThreadPoolExecutor to start containers, we cannot tell which container or hostname threw the exception. I think we should catch YarnException in ExecutorRunnable when starting the container. Then, when an exception occurs, we know the container id and hostname of the failed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
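The proposed fix amounts to catching the exception at the point where the container's identity is known and re-raising it with that context attached, instead of letting it surface anonymously from a thread pool. A generic Python sketch (hypothetical names; the real fix would catch YarnException inside ExecutorRunnable):

```python
def start_container(container_id, hostname, launch):
    """Illustrative: wrap the launch call so any failure carries the
    container id and hostname, rather than surfacing anonymously from
    a worker thread in a pool."""
    try:
        launch()
    except Exception as exc:
        raise RuntimeError(
            f"Failed to start container {container_id} on {hostname}: {exc}"
        ) from exc
```

With this shape, the log line for a failure immediately identifies which container on which host misfired.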
[jira] [Created] (SPARK-5787) Protect JVM from some not-important exceptions
Davies Liu created SPARK-5787: - Summary: Protect JVM from some not-important exceptions Key: SPARK-5787 URL: https://issues.apache.org/jira/browse/SPARK-5787 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Critical Any uncaught exception will shut down the executor JVM, so we should catch all exceptions that do not really hurt the executor (i.e., the executor is still functional). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5765: - Assignee: Venkata Ramana G word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.2 Reporter: Venkata Ramana G Assignee: Venkata Ramana G Word split problem in the Spark directory path in the scripts run-example and compute-classpath.sh. This was introduced by the fix for SPARK-4504. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2774) Set preferred locations for reduce tasks
[ https://issues.apache.org/jira/browse/SPARK-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319056#comment-14319056 ] Apache Spark commented on SPARK-2774: - User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/4576 Set preferred locations for reduce tasks Key: SPARK-2774 URL: https://issues.apache.org/jira/browse/SPARK-2774 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Currently we do not set preferred locations for reduce tasks in Spark. This patch proposes setting preferred locations based on the map output sizes and locations tracked by the MapOutputTracker. This is useful in two conditions 1. When you have a small job in a large cluster it can be useful to co-locate map and reduce tasks to avoid going over the network 2. If there is a lot of data skew in the map stage outputs, then it is beneficial to place the reducer close to the largest output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
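The policy described above, placing reducers close to the largest map outputs, boils down to ranking hosts by how many bytes of a reduce partition's input they hold. A hedged sketch of that selection logic (hypothetical function, not the MapOutputTracker API):

```python
def preferred_locations(bytes_by_host, top_n=2):
    """Illustrative: given map-output sizes per host for one reduce
    partition, prefer the hosts holding the most data, so skewed
    partitions are scheduled next to their largest input."""
    ranked = sorted(bytes_by_host.items(), key=lambda kv: kv[1], reverse=True)
    return [host for host, _ in ranked[:top_n]]
```

This naturally covers both conditions in the issue: in a small job all the data sits on a few hosts, and under skew the single largest output dominates the ranking.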
[jira] [Closed] (SPARK-5760) StandaloneRestClient/Server error behavior is incorrect
[ https://issues.apache.org/jira/browse/SPARK-5760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5760. Resolution: Fixed Fix Version/s: 1.3.0 StandaloneRestClient/Server error behavior is incorrect --- Key: SPARK-5760 URL: https://issues.apache.org/jira/browse/SPARK-5760 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.3.0 There are three main known issues: (1) Server would always send the JSON to the servlet's output stream. However, if the response code is not 200, the client reads from the error stream instead. The server must write to the correct stream depending on the response code. (2) If the server returns an empty response (no JSON), then both output and error streams are null at the client, leading to NPEs. This happens if the server throws an internal exception that it cannot recover from. (3) The default error handling servlet did not match the URL cases correctly, because there are empty strings in the list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
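Issue (1) above is the usual HTTP-client rule: a 2xx response is read from the normal stream, anything else from the error stream (mirroring Java's HttpURLConnection behavior). A minimal sketch of the selection logic, not the actual StandaloneRestClient code:

```python
def pick_stream(status_code, output_stream, error_stream):
    """Illustrative: the server must write JSON to whichever stream the
    client will read, which is determined by the response code."""
    if 200 <= status_code < 300:
        return output_stream
    return error_stream
```

Issue (2) then follows: if the server writes nothing at all, both streams are empty and the client must guard against that before parsing.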
[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319182#comment-14319182 ] Sean Owen commented on SPARK-5726: -- You can ignore this comment, but I wonder if it would be even more immediately recognized if called ElementwiseProduct or something like that, as that's all the Hadamard product is right? Hadamard Vector Product Transformer --- Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla Assignee: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
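As the comment notes, the Hadamard product is just the element-wise product of two equal-length vectors. A minimal sketch of such a transformer (generic Python, not the proposed MLlib code):

```python
def elementwise_product(weights, vector):
    """Illustrative Hadamard (element-wise) product transformer: scale
    each component of `vector` by the matching weight."""
    if len(weights) != len(vector):
        raise ValueError("weight and input vector lengths must match")
    return [w * x for w, x in zip(weights, vector)]
```

A zero weight masks a component and a weight of one passes it through, which is the component-wise scaling use case from the original thread.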
[jira] [Commented] (SPARK-3570) Shuffle write time does not include time to open shuffle files
[ https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318960#comment-14318960 ] Kay Ousterhout commented on SPARK-3570: --- There are a bunch of times when files are opened and written to that are not currently logged, so I did some investigation of this to figure out when the time may be significant and is therefore worth logging. I did this on a 5-machine cluster using ext3 (which exacerbates disk access issues, making it easy to see when times may be long) running query 3a of the big data benchmark (which struggles with disk access because of the many shuffle files). Here's what I found: SortShuffleWriter.write, call to shuffleBlockManager.getDataFile: this just opens 1 file, and typically takes about 100us, so not worth adding logging SortShuffleWriter.write, call to shuffleBlockManager.getIndexFile: this writes a single index file and typically took about 0.1ms (as high as 1ms). Also doesn't seem worth logging. ExternalSorter.spillToPartitionFiles, creating the disk writers for each partition: because this creates one file for each partition, the time to create all of the files adds up: this took 75-100ms ExternalSorter.writePartitionedFile, copying the data from the partitioned files to a single file: because this reads and writes all of the shuffle data, it can be long; ~13ms on the workload I looked at. ExternalSorter.writePartitionedFile, time to call blockManager.getDiskWriter on line 748: getDiskWriter *CAN* take a long time because of the call to file.length(), which may hit disk. However, in this case, each call takes 20us or less (and this is likely noisy -- getting too small to measure reliably). To totally speculate, I'd guess that because this is called many times on the same file, as opposed to different files, and the file is actively being written to, the length is cached in memory by the OS. 
To summarize, this all leads to the intuitive conclusion that we only need to log when we're writing lots of data (e.g., when copying all of the shuffle data to a single file) or when we're opening a lot of files. Shuffle write time does not include time to open shuffle files -- Key: SPARK-3570 URL: https://issues.apache.org/jira/browse/SPARK-3570 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.2, 1.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 3a_1410957857_0_job_log_waterfall.pdf Currently, the reported shuffle write time does not include time to open the shuffle files. This time can be very significant when the disk is highly utilized and many shuffle files exist on the machine (I'm not sure how severe this is in 1.0 onward -- since shuffle files are automatically deleted, this may be less of an issue because there are fewer old files sitting around). In experiments I did, in extreme cases, adding the time to open files can increase the shuffle write time from 5ms (of a 2 second task) to 1 second. We should fix this for better performance debugging. Thanks [~shivaram] for helping to diagnose this problem. cc [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5785) Pyspark does not support narrow dependencies
Imran Rashid created SPARK-5785: --- Summary: Pyspark does not support narrow dependencies Key: SPARK-5785 URL: https://issues.apache.org/jira/browse/SPARK-5785 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Imran Rashid Joins (cogroups, etc.) are always considered to have wide dependencies in PySpark; they are never narrow. This can cause unnecessary shuffles. E.g., this simple job should shuffle rddA and rddB once each, but it will also do a third shuffle of the unioned data: {code} rddA = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) rddB = sc.parallelize(range(100)).map(lambda x: (x,x)).partitionBy(64) joined = rddA.join(rddB) joined.count() rddA._partitionFunc == rddB._partitionFunc True {code} (Or the docs should somewhere explain that this feature is missing from spark.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5788) Capture exceptions in Python write thread
[ https://issues.apache.org/jira/browse/SPARK-5788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319135#comment-14319135 ] Apache Spark commented on SPARK-5788: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4577 Capture exceptions in Python write thread -- Key: SPARK-5788 URL: https://issues.apache.org/jira/browse/SPARK-5788 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.0, 1.2.1 Reporter: Davies Liu Priority: Blocker An exception in the Python writer thread will shut down the executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
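The fix amounts to wrapping the writer thread's body so exceptions are recorded and handled by the owner, rather than escaping the thread and taking the executor down with them. A generic Python sketch of that capture pattern (illustrative names, not the actual PySpark code):

```python
import threading

class WriterThread(threading.Thread):
    """Illustrative: catch any exception raised in run() and store it so
    the owning code can inspect and handle it, instead of letting the
    failure propagate out of the thread uncontrolled."""

    def __init__(self, target):
        super().__init__(daemon=True)
        self._target_fn = target
        self.exception = None  # populated if the target raises

    def run(self):
        try:
            self._target_fn()
        except Exception as exc:  # deliberate broad catch: record, don't die
            self.exception = exc
```

After join(), the owner checks `thread.exception` and decides how to fail the task, keeping the surrounding process alive.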
[jira] [Updated] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5765: - Affects Version/s: 1.3.0 1.2.1 word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Assignee: Venkata Ramana G Fix For: 1.3.0, 1.1.2, 1.2.2 Word-split problem in the Spark directory path in the scripts run-example and compute-classpath.sh. This was introduced by the defect fix for SPARK-4504. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5765: - Affects Version/s: (was: 1.2.2) (was: 1.3.0) word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Sub-task Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Assignee: Venkata Ramana G Fix For: 1.3.0, 1.1.2, 1.2.2 Word-split problem in the Spark directory path in the scripts run-example and compute-classpath.sh. This was introduced by the defect fix for SPARK-4504. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5762. Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Target Version/s: 1.3.0, 1.2.2 (was: 1.3.0) Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.3.0, 1.2.2 For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
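The bug above is about which work gets timed. A minimal Python sketch (an analogy for the bypass-merge-sort path, not Spark's actual Scala code) makes the point: the final concatenation is also shuffle write work, so its duration must be added to the write-time metric alongside the per-partition writes:

```python
# Hypothetical sketch: write one file per partition, then concatenate them
# into a final file -- and include BOTH phases in the measured write time.
import os
import tempfile
import time

def write_partitions_and_concat(partitions):
    tmp = tempfile.mkdtemp()
    write_time = 0.0
    part_files = []
    for i, data in enumerate(partitions):
        start = time.perf_counter()
        path = os.path.join(tmp, "part-%d" % i)
        with open(path, "wb") as f:
            f.write(data)
        write_time += time.perf_counter() - start
        part_files.append(path)
    # The concatenation below is also shuffle write work: time it too.
    # Omitting this timer is precisely the bug described in SPARK-5762.
    start = time.perf_counter()
    final = os.path.join(tmp, "final")
    with open(final, "wb") as out:
        for path in part_files:
            with open(path, "rb") as f:
                out.write(f.read())
    write_time += time.perf_counter() - start
    return final, write_time

final_file, total_write_time = write_partitions_and_concat([b"aa", b"bb", b"cc"])
```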
[jira] [Closed] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5780. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 (was: 1.4.0) The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu Fix For: 1.3.0 There is a bunch of logging coming from the driver and worker; it's noisy and scary, and there are a lot of exceptions in it, so people are confused about whether the tests are failing or not. We should mute the logging during tests and only show it if any test failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5780: - Affects Version/s: (was: 1.4.0) The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu Fix For: 1.3.0 There is a bunch of logging coming from the driver and worker; it's noisy and scary, and there are a lot of exceptions in it, so people are confused about whether the tests are failing or not. We should mute the logging during tests and only show it if any test failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
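The behavior the ticket asks for — mute logs during tests, surface them only on failure — can be sketched with the standard `logging` module. This is a hypothetical illustration of the pattern, not the code from the Spark PR:

```python
# Hypothetical sketch: buffer a logger's output while a test runs, discard
# it if the test passes, and return it for display if the test fails.
import io
import logging

def run_quietly(test_func, logger):
    """Run test_func with logging captured; dump logs only on failure."""
    buf = io.StringIO()
    handler = logging.StreamHandler(buf)
    old_handlers = logger.handlers
    logger.handlers = [handler]
    try:
        test_func()
        return True, ""  # passed: discard the captured noise
    except AssertionError:
        return False, buf.getvalue()  # failed: surface the logs
    finally:
        logger.handlers = old_handlers

log = logging.getLogger("demo")
log.setLevel(logging.INFO)

def passing_test():
    log.info("noisy but harmless")

def failing_test():
    log.info("useful context")
    assert False

ok1, out1 = run_quietly(passing_test, log)
ok2, out2 = run_quietly(failing_test, log)
```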
[jira] [Resolved] (SPARK-1192) Around 30 parameters in Spark are used but undocumented and some are having confusing name
[ https://issues.apache.org/jira/browse/SPARK-1192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1192. -- Resolution: Won't Fix PR was withdrawn; this probably deserves a rethink if it were reconsidered anyway, so let's resolve. Around 30 parameters in Spark are used but undocumented and some are having confusing name -- Key: SPARK-1192 URL: https://issues.apache.org/jira/browse/SPARK-1192 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.0 Reporter: Nan Zhu Assignee: Nan Zhu I grepped the code in the core component and found that around 30 parameters in the implementation are actually used but undocumented. By reading the source code, I found that some of them are actually very useful for the user. I suggest making a complete document on the parameters. Also, some parameters have confusing names: spark.shuffle.copier.threads - this parameter controls how many threads are used when starting a Netty-based shuffle service, but we cannot get this information from the name; spark.shuffle.sender.port - a similar problem to the above: when you use a Netty-based shuffle receiver, you also have to set up a Netty-based sender... this parameter sets the port used by the Netty sender, but the name does not convey this information -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5690) Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion
[ https://issues.apache.org/jira/browse/SPARK-5690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5690. Resolution: Fixed Fix Version/s: 1.3.0 Target Version/s: 1.3.0 Flaky test: org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion - Key: SPARK-5690 URL: https://issues.apache.org/jira/browse/SPARK-5690 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical Labels: flaky-test Fix For: 1.3.0 https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/1647/testReport/junit/org.apache.spark.deploy.rest/StandaloneRestSubmitSuite/simple_submit_until_completion/ {code} org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.simple submit until completion Failing for the past 1 build (Since Failed#1647 ) Took 30 sec. Error Message Driver driver-20150209035158- did not finish within 30 seconds. Stacktrace sbt.ForkMain$ForkError: Driver driver-20150209035158- did not finish within 30 seconds. 
at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555) at org.scalatest.Assertions$class.fail(Assertions.scala:1328) at org.scalatest.FunSuite.fail(FunSuite.scala:1555) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$apache$spark$deploy$rest$StandaloneRestSubmitSuite$$waitUntilFinished(StandaloneRestSubmitSuite.scala:152) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply$mcV$sp(StandaloneRestSubmitSuite.scala:57) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite$$anonfun$1.apply(StandaloneRestSubmitSuite.scala:52) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) at org.scalatest.Suite$class.withFixture(Suite.scala:1122) at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(StandaloneRestSubmitSuite.scala:41) at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255) at org.apache.spark.deploy.rest.StandaloneRestSubmitSuite.runTest(StandaloneRestSubmitSuite.scala:41) at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) at scala.collection.immutable.List.foreach(List.scala:318) at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) at org.scalatest.Suite$class.run(Suite.scala:1424) at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) at org.scalatest.SuperEngine.runImpl(Engine.scala:545) at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) at
[jira] [Closed] (SPARK-5761) Revamp StandaloneRestProtocolSuite
[ https://issues.apache.org/jira/browse/SPARK-5761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5761. Resolution: Fixed Fix Version/s: 1.3.0 Revamp StandaloneRestProtocolSuite -- Key: SPARK-5761 URL: https://issues.apache.org/jira/browse/SPARK-5761 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.3.0 It currently runs an end-to-end test, which is both slow and reported as flaky here: SPARK-5690. We should make it test the individual components more closely and make it more like a unit test suite instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4897) Python 3 support
[ https://issues.apache.org/jira/browse/SPARK-4897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318935#comment-14318935 ] Ryan Ovas commented on SPARK-4897: -- I'm interested in using Spark in my startup, but everything we do is in Python 3.4 which makes adopting Spark difficult for me as well. I was surprised and disappointed (since I will have trouble using it myself) to see that there is no Python 3.x support when (as [~ianozsvald] suggested) the community as a whole is moving towards Python 3.4. Python 3 support Key: SPARK-4897 URL: https://issues.apache.org/jira/browse/SPARK-4897 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Josh Rosen Priority: Minor It would be nice to have Python 3 support in PySpark, provided that we can do it in a way that maintains backwards-compatibility with Python 2.6. I started looking into porting this; my WIP work can be found at https://github.com/JoshRosen/spark/compare/python3 I was able to use the [futurize|http://python-future.org/futurize.html#forwards-conversion-stage1] tool to handle the basic conversion of things like {{print}} statements, etc. and had to manually fix up a few imports for packages that moved / were renamed, but the major blocker that I hit was {{cloudpickle}}: {code} [joshrosen python (python3)]$ PYSPARK_PYTHON=python3 ../bin/pyspark Python 3.4.2 (default, Oct 19 2014, 17:52:17) [GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.51)] on darwin Type help, copyright, credits or license for more information. 
Traceback (most recent call last): File /Users/joshrosen/Documents/Spark/python/pyspark/shell.py, line 28, in module import pyspark File /Users/joshrosen/Documents/spark/python/pyspark/__init__.py, line 41, in module from pyspark.context import SparkContext File /Users/joshrosen/Documents/spark/python/pyspark/context.py, line 26, in module from pyspark import accumulators File /Users/joshrosen/Documents/spark/python/pyspark/accumulators.py, line 97, in module from pyspark.cloudpickle import CloudPickler File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 120, in module class CloudPickler(pickle.Pickler): File /Users/joshrosen/Documents/spark/python/pyspark/cloudpickle.py, line 122, in CloudPickler dispatch = pickle.Pickler.dispatch.copy() AttributeError: type object '_pickle.Pickler' has no attribute 'dispatch' {code} This code looks like it will be difficult to port to Python 3, so this might be a good reason to switch to [Dill|https://github.com/uqfoundation/dill] for Python serialization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
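The `AttributeError` above has a narrow cause worth illustrating: in Python 3 the name `pickle.Pickler` is usually the C-accelerated `_pickle.Pickler`, which has no class-level `dispatch` table, while the pure-Python `pickle._Pickler` still does. One possible workaround (a hedged sketch, not PySpark's actual fix; `SketchPickler` is a hypothetical name) is to subclass the pure-Python implementation:

```python
# Hypothetical sketch: the pure-Python Pickler keeps the per-type dispatch
# dict that cloudpickle-style code copies, so subclass it instead of the
# C-accelerated pickle.Pickler.
import io
import pickle

class SketchPickler(pickle._Pickler):
    # Copying the dispatch table works against the pure-Python class,
    # where the copy() in the traceback above fails on _pickle.Pickler.
    dispatch = pickle._Pickler.dispatch.copy()

buf = io.BytesIO()
SketchPickler(buf).dump([1, 2])
roundtrip = pickle.loads(buf.getvalue())
```

The trade-off is speed: `pickle._Pickler` is considerably slower than the C implementation, which is one reason the ticket also floats switching to Dill.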
[jira] [Commented] (SPARK-5784) Add StatsDSink to MetricsSystem
[ https://issues.apache.org/jira/browse/SPARK-5784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14318993#comment-14318993 ] Apache Spark commented on SPARK-5784: - User 'ryan-williams' has created a pull request for this issue: https://github.com/apache/spark/pull/4574 Add StatsDSink to MetricsSystem --- Key: SPARK-5784 URL: https://issues.apache.org/jira/browse/SPARK-5784 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Ryan Williams Priority: Minor [StatsD|https://github.com/etsy/statsd/] is a common wrapper for Graphite; it would be useful to support sending metrics to StatsD in addition to [the existing Graphite support|https://github.com/apache/spark/blob/6a1be026cf37e4c8bf39133dfb4a73f7caedcc26/core/src/main/scala/org/apache/spark/metrics/sink/GraphiteSink.scala]. [readytalk/metrics-statsd|https://github.com/readytalk/metrics-statsd] is a StatsD adapter for the [dropwizard/metrics|https://github.com/dropwizard/metrics] library that Spark uses. The Maven repository at http://dl.bintray.com/readytalk/maven/ serves {{metrics-statsd}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5726: - Assignee: Octavian Geagla Hadamard Vector Product Transformer --- Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla Assignee: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5726) Hadamard Vector Product Transformer
[ https://issues.apache.org/jira/browse/SPARK-5726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319163#comment-14319163 ] Xiangrui Meng commented on SPARK-5726: -- This is a nice feature. I like the name `HadamardProduct` better than `HadamardScaler`. The former is more explicit, though it always reminds me of the Hadamard transform. Could you submit a PR? Hadamard Vector Product Transformer --- Key: SPARK-5726 URL: https://issues.apache.org/jira/browse/SPARK-5726 Project: Spark Issue Type: Improvement Components: ML, MLlib Reporter: Octavian Geagla I originally posted my idea here: http://apache-spark-developers-list.1001551.n3.nabble.com/Any-interest-in-weighting-VectorTransformer-which-does-component-wise-scaling-td10265.html A draft of this feature is implemented, documented, and tested already. Code is on a branch on my fork here: https://github.com/ogeagla/spark/compare/spark-mllib-weighting I'm curious if there is any interest in this feature, in which case I'd appreciate some feedback. One thing that might be useful is an example/test case using the transformer within the ML pipeline, since there are not any examples which use Vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
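The transformer's core operation is simple enough to sketch in a few lines. The names below are illustrative only, not the final MLlib API: component-wise (Hadamard) multiplication of each input vector by a fixed scaling vector.

```python
# Hypothetical sketch of the proposed transformer's core operation:
# multiply every input vector element-wise by a fixed scaling vector.

def hadamard_transform(scaling_vec, vectors):
    """Scale every vector component-wise by scaling_vec."""
    return [
        [x * w for x, w in zip(v, scaling_vec)]
        for v in vectors
    ]

scaled = hadamard_transform(
    [0.0, 1.0, 2.0],
    [[1.0, 1.0, 1.0], [3.0, 2.0, 1.0]],
)
```

Unlike a `StandardScaler`-style transformer, the weights here are supplied by the user rather than fit from data, which is why the naming debate (`HadamardProduct` vs. `HadamardScaler`) matters.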
[jira] [Updated] (SPARK-5782) Python Worker / Pyspark Daemon Memory Issue
[ https://issues.apache.org/jira/browse/SPARK-5782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Khaitman updated SPARK-5782: - Description: I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker memory limit and a few gigs of executor memory. Another related issue is that the individual Python workers are not supposed to exceed far beyond 512MB; otherwise they're supposed to spill to disk. Some of our Python workers are somehow reaching 2GB each (which, when multiplied by the number of cores per executor * the number of joins occurring in some cases, causes the Out-of-Memory killer to step up to its unfortunate job! :( ) I originally thought _next_limit in shuffle.py had an issue, though I initially misread it. The logic looks good there :) My current suspicion is that the 512MB limit is not being checked somewhere. I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! was: I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. 
This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker memory limit and a few gigs of executor memory. Another related issue is that the individual Python workers are not supposed to exceed far beyond 512MB; otherwise they're supposed to spill to disk. I came across this bit of code in shuffle.py which *may* have something to do with allowing some of our Python workers to somehow reach 2GB each (which, when multiplied by the number of cores per executor * the number of joins occurring in some cases, causes the Out-of-Memory killer to step up to its unfortunate job! :( )
{code}
def _next_limit(self):
    """
    Return the next memory limit. If the memory is not released
    after spilling, it will dump the data only when the used memory
    starts to increase.
    """
    return max(self.memory_limit, get_used_memory() * 1.05)
{code}
I've only just started looking into the code, and would definitely love to contribute towards Spark, though I figured it might be quicker to resolve if someone already owns the code! Python Worker / Pyspark Daemon Memory Issue --- Key: SPARK-5782 URL: https://issues.apache.org/jira/browse/SPARK-5782 Project: Spark Issue Type: Bug Components: PySpark, Shuffle Affects Versions: 1.3.0, 1.2.1, 1.2.2 Environment: CentOS 7, Spark Standalone Reporter: Mark Khaitman I'm including the Shuffle component on this, as a brief scan through the code (which I'm not 100% familiar with just yet) shows a large amount of memory handling in it: It appears that any type of join between two RDDs spawns twice as many pyspark.daemon workers compared to the default 1 task - 1 core configuration in our environment. 
This can become problematic in cases where you build up a tree of RDD joins, since the pyspark.daemons do not cease to exist until the top-level join is completed (or so it seems)... This can lead to memory exhaustion by a single framework, even though it is set to have a 512MB Python worker memory limit and a few gigs of executor memory. Another related issue is that the individual Python workers are not supposed to exceed far beyond 512MB; otherwise they're supposed to spill to disk. Some of our Python workers are somehow reaching 2GB each (which, when multiplied by the number of cores per executor * the number of joins occurring in some cases, causes the Out-of-Memory killer to step up to its unfortunate job! :( ) I originally thought _next_limit in shuffle.py had an issue, though I initially misread it. The logic looks good
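The reporter's suspicion — that the memory limit is not being checked somewhere — corresponds to a well-known pattern. The sketch below is a hypothetical illustration, not PySpark's shuffle.py (`SpillingMerger` and the fake `used_memory_func` are stand-ins): compare process memory against the limit on every insert and spill once it is exceeded, so a worker cannot grow unbounded toward 2GB.

```python
# Hypothetical sketch: a merger that checks a memory limit on every put()
# and spills (here: just clears and counts) once the limit is exceeded.

class SpillingMerger:
    def __init__(self, memory_limit_mb, used_memory_func):
        self.memory_limit = memory_limit_mb
        self.get_used_memory = used_memory_func
        self.data = {}
        self.spills = 0

    def put(self, key, value):
        self.data[key] = value
        if self.get_used_memory() > self.memory_limit:
            self._spill()

    def _spill(self):
        # Real code would write self.data to disk here; we only count.
        self.data.clear()
        self.spills += 1

# Simulated memory usage: a 500MB baseline plus 10MB per buffered entry,
# against a 512MB limit -- so every second put() triggers a spill.
merger = SpillingMerger(512, lambda: 500 + len(merger.data) * 10)
for i in range(10):
    merger.put(i, i)
```

The bug the ticket describes is what happens when a code path skips this check: buffered data keeps growing and the OS OOM killer, not the spill logic, ends up enforcing the limit.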
[jira] [Updated] (SPARK-5783) Include filename, line number in eventlog-parsing error message
[ https://issues.apache.org/jira/browse/SPARK-5783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5783: - Affects Version/s: (was: 1.2.1) 1.0.0 Include filename, line number in eventlog-parsing error message --- Key: SPARK-5783 URL: https://issues.apache.org/jira/browse/SPARK-5783 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Ryan Williams Priority: Minor While investigating why some recent applications were not showing up in my History Server UI, I found error message blocks like this in the history server logs: {code} 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Exception in parsing Spark event log. java.lang.ClassCastException: org.json4s.JsonAST$JNothing$ cannot be cast to org.json4s.JsonAST$JObject at org.apache.spark.util.JsonProtocol$.mapFromJson(JsonProtocol.scala:814) at org.apache.spark.util.JsonProtocol$.executorInfoFromJson(JsonProtocol.scala:805) ... at org.apache.spark.deploy.history.FsHistoryProvider$$anon$1.run(FsHistoryProvider.scala:84) 15/02/12 18:51:55 ERROR scheduler.ReplayListenerBus: Malformed line: {Event:SparkListenerExecutorAdded,Timestamp:1422897479154,Executor ID:12,Executor Info:{Host:demeter-csmaz11-1.demeter.hpc.mssm.edu,Total Cores:4}} {code} Turns out certain files had some malformed lines due to having been generated by a forked Spark with some WIP event-log functionality. It would be nice if the first line specified the file the error was found in, and the last line specified the line number. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
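The requested error reporting is straightforward to sketch. This is a plain-Python analogy, not the Scala `ReplayListenerBus` (`replay_event_log` is a hypothetical name): parse line by line and, on failure, raise an error naming the file and the 1-based line number.

```python
# Hypothetical sketch: include filename and line number when a line of an
# event log fails to parse, instead of only dumping the malformed line.
import json

def replay_event_log(path, lines):
    events = []
    for lineno, line in enumerate(lines, start=1):
        try:
            events.append(json.loads(line))
        except ValueError as e:
            raise ValueError(
                "Malformed line %d in %s: %r" % (lineno, path, line)) from e
    return events

err = None
try:
    replay_event_log("app-1234.log", ['{"Event": "ok"}', "{not json"])
except ValueError as e:
    err = str(e)
```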
[jira] [Commented] (SPARK-5746) INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table.
[ https://issues.apache.org/jira/browse/SPARK-5746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319014#comment-14319014 ] Yin Huai commented on SPARK-5746: - Here are the places where we need to take care of overwrite: CreateMetastoreDataSourceAsSelect, CreatableRelationProvider.createRelation, and InsertableRelation.insert. INSERT OVERWRITE throws FileNotFoundException when the source and destination point to the same table. -- Key: SPARK-5746 URL: https://issues.apache.org/jira/browse/SPARK-5746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Blocker With the newly introduced write support of the data source API, {{JSONRelation}} and {{ParquetRelation2}} both suffer from this bug. The root cause is that we removed the source table before insertion ([here|https://github.com/apache/spark/blob/1ac099e3e00ddb01af8e6e3a84c70f8363f04b5c/sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala#L112-L121]). The correct solution is to first insert into a temporary folder, and then overwrite the source table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
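The proposed fix — write to a temporary location first, then replace the destination — is a classic filesystem pattern. The sketch below is a hedged analogy (plain files, not the actual JSONRelation/ParquetRelation2 code; `insert_overwrite` is a hypothetical name) showing why the source can be read safely even when it is the same path being overwritten:

```python
# Hypothetical sketch: INSERT OVERWRITE where source == destination.
# Read the source first, write the result to a temp file in the same
# directory, then atomically replace the destination -- never delete the
# source before the new data is fully written.
import os
import tempfile

def insert_overwrite(dest_path, transform):
    # Read the "source" (which may be dest_path itself) before touching it.
    with open(dest_path) as f:
        source_rows = f.read().splitlines()
    new_rows = transform(source_rows)
    # Write to a temp file in the same directory, then atomically replace.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path))
    with os.fdopen(fd, "w") as f:
        f.write("\n".join(new_rows))
    os.replace(tmp_path, dest_path)

tmp_dir = tempfile.mkdtemp()
table = os.path.join(tmp_dir, "table.txt")
with open(table, "w") as f:
    f.write("a\nb")
insert_overwrite(table, lambda rows: [r.upper() for r in rows])
```

Deleting the destination before writing (the buggy order) would make the read fail with a FileNotFoundError, which is exactly the symptom in the ticket.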
[jira] [Commented] (SPARK-5735) Replace uses of EasyMock with Mockito
[ https://issues.apache.org/jira/browse/SPARK-5735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319140#comment-14319140 ] Apache Spark commented on SPARK-5735: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/4578 Replace uses of EasyMock with Mockito - Key: SPARK-5735 URL: https://issues.apache.org/jira/browse/SPARK-5735 Project: Spark Issue Type: Improvement Components: Tests Reporter: Patrick Wendell Assignee: Josh Rosen There are a few reasons we should drop EasyMock. First, we should have a single mocking framework in our tests in general to keep things consistent. Second, EasyMock has caused us some dependency pain in our tests due to objenesis. We aren't totally sure but suspect such conflicts might be causing non deterministic test failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5759) ExecutorRunnable should catch YarnException while NMClient start container
[ https://issues.apache.org/jira/browse/SPARK-5759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5759. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Lianhui Wang Target Version/s: 1.3.0 ExecutorRunnable should catch YarnException while NMClient start container -- Key: SPARK-5759 URL: https://issues.apache.org/jira/browse/SPARK-5759 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Lianhui Wang Assignee: Lianhui Wang Fix For: 1.3.0 Sometimes, for various reasons, an exception is thrown while NMClient starts a container. For example, if we do not configure spark_shuffle on some machines, it will throw an exception: java.lang.Error: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist. Because YarnAllocator uses a ThreadPoolExecutor to start containers, we cannot find which container or hostname threw the exception. I think we should catch YarnException in ExecutorRunnable when starting the container. If there are exceptions, we can then know the container id or hostname of the failed container. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-5335. -- Resolution: Fixed Fix Version/s: 1.2.2 1.3.0 Issue resolved by pull request 4122 [https://github.com/apache/spark/pull/4122] Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor Fix For: 1.3.0, 1.2.2 When I try to remove security groups using the --delete-groups option of the script, it fails because in a VPC one should remove security groups by id, not by name as is done now.
{code}
$ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript
Are you sure you want to destroy the cluster SparkByScript?
The following instances will be terminated:
Searching for existing cluster SparkByScript...
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster SparkByScript (y/N): y
Terminating master...
Terminating slaves...
Deleting security groups (this will take some time)...
Waiting for cluster to enter 'terminated' state.
Cluster is now in 'terminated' state. Waited 0 seconds.
Attempt 1
Deleting rules in security group SparkByScript-slaves
Deleting rules in security group SparkByScript-master
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response>
Failed to delete security group SparkByScript-slaves
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response>
Failed to delete security group SparkByScript-master
Attempt 2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
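The core of the fix is a small lookup change: in a VPC, resolve each group's name to its GroupId and delete by id. The sketch below shows only that resolution logic with fake data (`group_ids_to_delete` is a hypothetical helper; the real spark-ec2 script would fetch the groups and issue the delete calls via boto):

```python
# Hypothetical sketch: EC2 rejects name-based deletion for VPC security
# groups, so map the cluster's group names to their ids and delete by id.

def group_ids_to_delete(groups, cluster_name):
    """Pick the ids of the cluster's master/slaves groups by name."""
    wanted = {cluster_name + "-master", cluster_name + "-slaves"}
    return [g["id"] for g in groups if g["name"] in wanted]

# Fake listing, shaped like the names in the error log above.
groups = [
    {"name": "SparkByScript-master", "id": "sg-11111111"},
    {"name": "SparkByScript-slaves", "id": "sg-22222222"},
    {"name": "default", "id": "sg-33333333"},
]
ids = group_ids_to_delete(groups, "SparkByScript")
```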
[jira] [Updated] (SPARK-5335) Destroying cluster in VPC with --delete-groups fails to remove security groups
[ https://issues.apache.org/jira/browse/SPARK-5335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5335: - Assignee: Vladimir Grigor Destroying cluster in VPC with --delete-groups fails to remove security groups Key: SPARK-5335 URL: https://issues.apache.org/jira/browse/SPARK-5335 Project: Spark Issue Type: Bug Components: EC2 Reporter: Vladimir Grigor Assignee: Vladimir Grigor Fix For: 1.3.0, 1.2.2 When I try to remove security groups using the --delete-groups option of the script, it fails: in a VPC, security groups must be deleted by id, whereas the script currently deletes them by name.
{code}
$ ./spark-ec2 -k key20141114 -i ~/key.pem --region=eu-west-1 --delete-groups destroy SparkByScript
Are you sure you want to destroy the cluster SparkByScript?
The following instances will be terminated:
Searching for existing cluster SparkByScript...
ALL DATA ON ALL NODES WILL BE LOST!!
Destroy cluster SparkByScript (y/N): y
Terminating master...
Terminating slaves...
Deleting security groups (this will take some time)...
Waiting for cluster to enter 'terminated' state.
Cluster is now in 'terminated' state. Waited 0 seconds.
Attempt 1
Deleting rules in security group SparkByScript-slaves
Deleting rules in security group SparkByScript-master
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-slaves' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>60313fac-5d47-48dd-a8bf-e9832948c0a6</RequestID></Response>
Failed to delete security group SparkByScript-slaves
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?><Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid value 'SparkByScript-master' for groupName. You may not reference Amazon VPC security groups by name. Please use the corresponding id for this operation.</Message></Error></Errors><RequestID>74ff8431-c0c1-4052-9ecb-c0adfa7eeeac</RequestID></Response>
Failed to delete security group SparkByScript-master
Attempt 2
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
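A minimal sketch of the fix the reporter describes: resolve names to ids first and delete by id, since VPC security groups cannot be referenced by name. This is illustrative Python against the boto EC2 API; the function and variable names are hypothetical, not spark-ec2's actual code.

```python
def delete_security_groups_by_id(conn, group_names):
    """Delete security groups by id (required in a VPC), not by name.

    `conn` is assumed to be a boto EC2 connection. Names and structure
    here are illustrative, not spark-ec2's actual implementation.
    """
    # Resolve names to ids first; VPC groups cannot be referenced by name.
    groups = [g for g in conn.get_all_security_groups()
              if g.name in group_names]
    deleted = []
    for group in groups:
        # boto's delete_security_group accepts group_id for VPC groups.
        if conn.delete_security_group(group_id=group.id):
            deleted.append(group.id)
    return deleted
```

The same call with `name=` is what triggers the InvalidParameterValue error shown above.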
[jira] [Updated] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5762: - Target Version/s: 1.3.0 (was: 1.3.0, 1.2.2) Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.3.0 For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Updated] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5762: - Fix Version/s: (was: 1.2.2) Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Fix For: 1.3.0 For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Commented] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability
[ https://issues.apache.org/jira/browse/SPARK-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319316#comment-14319316 ] Brennon York commented on SPARK-5790: - FWIW this issue is a blocker for [SPARK-4600|https://issues.apache.org/jira/browse/SPARK-4600], which I'm working on, since `diff` relies on `zipPartitions`, which causes this. If someone could assign this to me, I'll continue working on this issue. VertexRDD's won't zip properly for `diff` capability Key: SPARK-5790 URL: https://issues.apache.org/jira/browse/SPARK-5790 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York For VertexRDDs with differing partition sizes, one cannot run operations like `diff`, as it will throw an IllegalArgumentException. The code below provides an example:
{code}
import org.apache.spark.graphx._
import org.apache.spark.rdd._

val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => (id, id.toInt+1)))
setA.collect.foreach(println(_))
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)))
setB.collect.foreach(println(_))
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2)))
setA.diff(setC).collect // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
{code}
[jira] [Updated] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability
[ https://issues.apache.org/jira/browse/SPARK-5790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5790: - Assignee: Brennon York VertexRDD's won't zip properly for `diff` capability Key: SPARK-5790 URL: https://issues.apache.org/jira/browse/SPARK-5790 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York Assignee: Brennon York For VertexRDDs with differing partition sizes, one cannot run operations like `diff`, as it will throw an IllegalArgumentException. The code below provides an example:
{code}
import org.apache.spark.graphx._
import org.apache.spark.rdd._

val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => (id, id.toInt+1)))
setA.collect.foreach(println(_))
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)))
setB.collect.foreach(println(_))
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2)))
setA.diff(setC).collect // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
{code}
[jira] [Created] (SPARK-5790) VertexRDD's won't zip properly for `diff` capability
Brennon York created SPARK-5790: --- Summary: VertexRDD's won't zip properly for `diff` capability Key: SPARK-5790 URL: https://issues.apache.org/jira/browse/SPARK-5790 Project: Spark Issue Type: Bug Components: GraphX Reporter: Brennon York For VertexRDDs with differing partition sizes, one cannot run operations like `diff`, as it will throw an IllegalArgumentException. The code below provides an example:
{code:scala}
import org.apache.spark.graphx._
import org.apache.spark.rdd._

val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 3L).map(id => (id, id.toInt+1)))
setA.collect.foreach(println(_))
val setB: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)))
setB.collect.foreach(println(_))
val diff = setA.diff(setB)
diff.collect.foreach(println(_))
val setC: VertexRDD[Int] = VertexRDD(sc.parallelize(2L until 4L).map(id => (id, id.toInt+2)) ++ sc.parallelize(6L until 8L).map(id => (id, id.toInt+2)))
setA.diff(setC).collect // java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
{code}
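The failure above is generic to zipping: `zipPartitions` requires both RDDs to have the same number of partitions before any elements are paired. A minimal sketch of that precondition, in plain Python standing in for the Scala above purely to illustrate the check that throws (not Spark's actual implementation):

```python
def zip_partitions(parts_a, parts_b):
    """Mimic Spark's zipPartitions precondition: partition counts must match.

    parts_a / parts_b are lists of partitions (lists of elements). This is
    an illustration of the check only, not Spark's code.
    """
    if len(parts_a) != len(parts_b):
        # Spark raises IllegalArgumentException with essentially this message.
        raise ValueError("Can't zip RDDs with unequal numbers of partitions")
    return [list(zip(pa, pb)) for pa, pb in zip(parts_a, parts_b)]
```

In the repro, `setC` is built from a union of two parallelized ranges, so its partition count differs from `setA`'s, and the check fires before `diff` can compare any vertices.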
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319400#comment-14319400 ] Cheng Hao commented on SPARK-5791: -- Can you also attach the performance comparison result for this query? [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operations
[jira] [Updated] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Zhou updated SPARK-5791: --- Description: Spark SQL shows poor performance when multiple tables do join operation (was: Spark SQL shows poor performance when multiple tables do join operation compared with Hive on MapReduce.) [Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operation
[jira] [Resolved] (SPARK-3299) [SQL] Public API in SQLContext to list tables
[ https://issues.apache.org/jira/browse/SPARK-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3299. - Resolution: Fixed Fix Version/s: 1.3.0 [SQL] Public API in SQLContext to list tables - Key: SPARK-3299 URL: https://issues.apache.org/jira/browse/SPARK-3299 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.0.2 Reporter: Evan Chan Assignee: Bill Bejeck Priority: Minor Labels: newbie Fix For: 1.3.0 There is no public API in SQLContext to list the current tables. This would be pretty helpful.
[jira] [Updated] (SPARK-3168) The ServletContextHandler of webui lacks a SessionManager
[ https://issues.apache.org/jira/browse/SPARK-3168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3168: - Component/s: (was: Spark Core) Web UI Priority: Minor (was: Major) Issue Type: Improvement (was: Bug) I'm wondering if this is still live. I agree that it shouldn't be on by default, and then, the details of what this enables matter less. They do incur overhead in cookies and memory, etc. The ServletContextHandler of webui lacks a SessionManager - Key: SPARK-3168 URL: https://issues.apache.org/jira/browse/SPARK-3168 Project: Spark Issue Type: Improvement Components: Web UI Environment: CAS Reporter: meiyoula Priority: Minor When I use CAS to implement single sign-on for the web UI, an exception occurs:
{code}
WARN [qtp1076146544-24] / org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:561)
java.lang.IllegalStateException: No SessionManager
at org.eclipse.jetty.server.Request.getSession(Request.java:1269)
at org.eclipse.jetty.server.Request.getSession(Request.java:1248)
at org.jasig.cas.client.validation.AbstractTicketValidationFilter.doFilter(AbstractTicketValidationFilter.java:178)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
at org.jasig.cas.client.authentication.AuthenticationFilter.doFilter(AuthenticationFilter.java:116)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
at org.jasig.cas.client.session.SingleSignOutFilter.doFilter(SingleSignOutFilter.java:76)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1467)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:499)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:370)
at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494)
at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971)
at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:744)
{code}
[jira] [Created] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
Yi Zhou created SPARK-5791: -- Summary: [Spark SQL] show poor performance when multiple table do join operation Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operation compared with Hive on MapReduce.
[jira] [Commented] (SPARK-5791) [Spark SQL] show poor performance when multiple table do join operation
[ https://issues.apache.org/jira/browse/SPARK-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14319364#comment-14319364 ] Yi Zhou commented on SPARK-5791: For example:
{code:sql}
SELECT *
FROM inventory inv
JOIN ( SELECT i_item_id, i_item_sk
       FROM item
       WHERE i_current_price > 0.98 AND i_current_price < 1.5 ) items
  ON inv.inv_item_sk = items.i_item_sk
JOIN warehouse w ON inv.inv_warehouse_sk = w.w_warehouse_sk
JOIN date_dim d ON inv.inv_date_sk = d.d_date_sk
WHERE datediff(d_date, '2001-05-08') >= -30 AND datediff(d_date, '2001-05-08') <= 30;
{code}
[Spark SQL] show poor performance when multiple table do join operation --- Key: SPARK-5791 URL: https://issues.apache.org/jira/browse/SPARK-5791 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Yi Zhou Spark SQL shows poor performance when multiple tables do join operation compared with Hive on MapReduce.
[jira] [Closed] (SPARK-5764) Delete the cache and lock file after executor fetching the jar
[ https://issues.apache.org/jira/browse/SPARK-5764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula closed SPARK-5764. --- Resolution: Not a Problem Delete the cache and lock file after executor fetching the jar -- Key: SPARK-5764 URL: https://issues.apache.org/jira/browse/SPARK-5764 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: meiyoula Every time an executor fetches a jar from the HTTP server, a lock file and a cache file are created locally. After the fetch, these two files are useless, and when the jar package is big the cache file is also big, which wastes disk space.
[jira] [Created] (SPARK-5765) word split problem in run-example and compute-classpath
Venkata Ramana G created SPARK-5765: --- Summary: word split problem in run-example and compute-classpath Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.2.1, 1.3.0, 1.1.2 Reporter: Venkata Ramana G Word split problem in the Spark directory path in the run-example and compute-classpath.sh scripts. This was introduced by the fix for SPARK-4504.
[jira] [Created] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
Kay Ousterhout created SPARK-5762: - Summary: Shuffle write time is incorrect for sort-based shuffle Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows
[ https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317850#comment-14317850 ] Sean Owen commented on SPARK-5754: -- We just resolved http://issues.apache.org/jira/browse/SPARK-4267, which is kind of the flip side of this problem. Now, all arguments are quoted since they may contain spaces, and spaces would break the command. Some quoting seems to be needed at some level to handle this case. I wonder why the single quote is problematic on Windows; does it have to be a double quote? Maybe the logic can be improved to quote only args with a space in them, but that still leaves the question of how to correctly quote args with spaces on Windows. Spark AM not launching on Windows - Key: SPARK-5754 URL: https://issues.apache.org/jira/browse/SPARK-5754 Project: Spark Issue Type: Bug Components: Windows, YARN Affects Versions: 1.1.1, 1.2.0 Environment: Windows Server 2012, Hadoop 2.4.1. Reporter: Inigo I'm trying to run Spark Pi on a YARN cluster running on Windows, and the AM container fails to start. The problem seems to be in the generation of the YARN command, which adds single quotes (') around some of the Java options. In particular, the code adding those quotes is the escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not like the quotes for these options.
Here is an example of the command that the container tries to execute:
{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp '-Dspark.yarn.secondary.jars=' '-Dspark.app.name=org.apache.spark.examples.SparkPi' '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}
Once I transform it into:
{code}
@call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp -Dspark.yarn.secondary.jars= -Dspark.app.name=org.apache.spark.examples.SparkPi -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster --class 'org.apache.spark.examples.SparkPi' --jar 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar' --executor-memory 1024 --executor-cores 1 --num-executors 2
{code}
everything seems to start. How should I deal with this? Create a separate function like escapeForShell for Windows and call it whenever I detect this is Windows? Or should I add some sanity check on the YARN side? I checked a little, and there seem to be people who are able to run Spark on YARN on Windows, so it might be something else. I didn't find anything related on JIRA either.
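The behavior the reporter and Sean Owen describe can be sketched as a platform-aware quoting rule: cmd.exe does not strip single quotes, so on Windows only arguments that actually contain whitespace should be wrapped, and in double quotes. This is a hypothetical helper in Python for illustration, not the actual escapeForShell in YarnSparkHadoopUtil (which is Scala):

```python
def escape_for_shell(arg, windows=False):
    """Sketch of platform-aware argument quoting (hypothetical helper).

    On Windows, cmd.exe treats single quotes as literal characters, so
    '-Dspark.driver.port=21390' (quotes included) is passed to java as-is
    and java reports "Could not find or load main class". Only whitespace
    forces quoting there, and it must use double quotes.
    """
    if windows:
        if any(c.isspace() for c in arg):
            # Double quotes are the only quoting cmd.exe understands.
            return '"' + arg.replace('"', '\\"') + '"'
        return arg  # bare token: quoting would become part of the value
    # POSIX-style behavior: single-quote everything, escaping embedded quotes.
    return "'" + arg.replace("'", "'\\''") + "'"
```

With this rule, `-Dspark.driver.port=21390` stays unquoted on Windows while `kill %p` becomes `"kill %p"`, matching the cases Inigo tested above.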
[jira] [Commented] (SPARK-5739) Size exceeds Integer.MAX_VALUE in File Map
[ https://issues.apache.org/jira/browse/SPARK-5739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317782#comment-14317782 ] Sean Owen commented on SPARK-5739: -- What would that really do, though, except change one error into another? I think the current exception is quite clear. Size exceeds Integer.MAX_VALUE in File Map -- Key: SPARK-5739 URL: https://issues.apache.org/jira/browse/SPARK-5739 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.1.1 Environment: Spark 1.1.1 on a cluster with 12 nodes. Every node has 128GB RAM and 24 cores. The data is just 40GB, and there are 48 parallel tasks on a node. Reporter: DjvuLee I ran the k-means algorithm on randomly generated data, but this problem occurred after some iterations. I tried several times, and the problem is reproducible. Because the data is randomly generated, I suspect there is a bug. Or, if random data can lead to a scenario where the size is bigger than Integer.MAX_VALUE, can we check the size before using the file map?
{code}
2015-02-11 00:39:36,057 [sparkDriver-akka.actor.default-dispatcher-15] WARN org.apache.spark.util.SizeEstimator - Failed to check whether UseCompressedOops is set; assuming yes
[error] (run-main-0) java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:850)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:105)
at org.apache.spark.storage.DiskStore.putIterator(DiskStore.scala:86)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:140)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:105)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:747)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:598)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:869)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:79)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:270)
at org.apache.spark.mllib.clustering.KMeans.runBreeze(KMeans.scala:143)
at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:126)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:338)
at org.apache.spark.mllib.clustering.KMeans$.train(KMeans.scala:348)
at KMeansDataGenerator$.main(kmeans.scala:105)
at KMeansDataGenerator.main(kmeans.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:94)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:619)
{code}
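The underlying limit is that `FileChannel.map` returns a buffer indexed by a Java `int`, so a single mapped block cannot exceed 2^31 - 1 bytes (about 2 GB). The pre-check the reporter asks about could look like the sketch below; this is illustrative Python with hypothetical names, not Spark's actual DiskStore code:

```python
MAX_MAPPED_BYTES = 2**31 - 1  # Java int limit for a single mapped buffer

def check_mappable(block_size_bytes):
    """Fail with a clearer message before trying to memory-map an
    oversized block. Illustrative sketch only.
    """
    if block_size_bytes > MAX_MAPPED_BYTES:
        raise ValueError(
            "Block of %d bytes exceeds the ~2 GB limit of a single mapped "
            "buffer; use more partitions so each block is smaller."
            % block_size_bytes)
    return block_size_bytes
```

As Sean Owen notes above, this mainly trades one error for another; the practical workaround is increasing the number of partitions so no single block crosses the limit.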
[jira] [Commented] (SPARK-5762) Shuffle write time is incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-5762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317799#comment-14317799 ] Apache Spark commented on SPARK-5762: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/4559 Shuffle write time is incorrect for sort-based shuffle -- Key: SPARK-5762 URL: https://issues.apache.org/jira/browse/SPARK-5762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout For the sort-based shuffle, when bypassing merge sort, one file is written for each partition, and then a final file is written that concatenates all of the existing files together. The time to write this final file is not included in the shuffle write time.
[jira] [Commented] (SPARK-5765) word split problem in run-example and compute-classpath
[ https://issues.apache.org/jira/browse/SPARK-5765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14317882#comment-14317882 ] Apache Spark commented on SPARK-5765: - User 'gvramana' has created a pull request for this issue: https://github.com/apache/spark/pull/4561 word split problem in run-example and compute-classpath --- Key: SPARK-5765 URL: https://issues.apache.org/jira/browse/SPARK-5765 Project: Spark Issue Type: Bug Components: Examples Affects Versions: 1.3.0, 1.1.2, 1.2.1 Reporter: Venkata Ramana G Word split problem in the Spark directory path in the run-example and compute-classpath.sh scripts. This was introduced by the fix for SPARK-4504.