[jira] [Resolved] (SPARK-7310) SparkSubmit does not escape & for java options and ^ won't work

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7310.
--
Resolution: Not A Problem

OK sounds like the particular escaping issue is no longer a problem as far as 
we can tell.

 SparkSubmit does not escape & for java options and ^ won't work
 

 Key: SPARK-7310
 URL: https://issues.apache.org/jira/browse/SPARK-7310
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.3.1, 1.4.0
Reporter: Yitong Zhou
Priority: Minor

 I can create the error when doing something like:
 {code}
 LIBJARS= /jars.../
 bin/spark-submit \
  --driver-java-options "-Djob.url=http://www.foo.bar?query=a&b" \
  --class com.example.Class \
  --master yarn-cluster \
  --num-executors 3 \
  --executor-cores 1 \
  --queue default \
  --driver-memory 1g \
  --executor-memory 1g \
  --jars $LIBJARS \
  ../a.jar \
  -inputPath /user/yizhou/CED-scoring/input \
  -outputPath /user/yizhou
 {code}
 Notice that if I remove the & in the --driver-java-options value, then the 
 submit will succeed. A typical error message looks like this:
 {code}
 org.apache.hadoop.util.Shell$ExitCodeException: Usage: java [-options] class 
 [args...]
(to execute a class)
or  java [-options] -jar jarfile [args...]
(to execute a jar file)
 where options include:
 -d32          use a 32-bit data model if available
 -d64          use a 64-bit data model if available
 -server       to select the "server" VM
               The default VM is server,
               because you are running on a server-class machine.
 -cp <class search path of directories and zip/jar files>
 -classpath <class search path of directories and zip/jar files>
               A : separated list of directories, JAR archives,
               and ZIP archives to search for class files.
 -Dname=value
   set a system property
 -verbose:[class|gc|jni]
   enable verbose output
 -version  print product version and exit
 -version:value
   require the specified version to run
 -showversion  print product version and continue
 -jre-restrict-search | -no-jre-restrict-search
   include/exclude user private JREs in the version search
 -? -help  print this help message
 -X            print help on non-standard options
 -ea[:packagename...|:classname]
 -enableassertions[:packagename...|:classname]
   enable assertions with specified granularity
 -da[:packagename...|:classname]
 -disableassertions[:packagename...|:classname]
   disable assertions with specified granularity
 -esa | -enablesystemassertions
   enable system assertions
 -dsa | -disablesystemassertions
   disable system assertions
 -agentlib:libname[=options]
   load native agent library libname, e.g. -agentlib:hprof
   see also, -agentlib:jdwp=help and -agentlib:hprof=help
 -agentpath:pathname[=options]
   load native agent library by full pathname
 -javaagent:jarpath[=options]
   load Java programming language agent, see 
 java.lang.instrument
 -splash:imagepath
   show splash screen with specified image
 See http://www.oracle.com/technetwork/java/javase/documentation/index.html 
 for more details.
   at org.apache.hadoop.util.Shell.runCommand(Shell.java:505)
   at org.apache.hadoop.util.Shell.run(Shell.java:418)
   at 
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650)
   at 
 org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:279)
   at 
 org.apache.hadoop.yarn.server.nodemanager.PepperdataContainerExecutor.launchContainer(PepperdataContainerExecutor.java:130)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283)
   at 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79)
   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Closed] (SPARK-7151) Correlation methods for DataFrame

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-7151.
--
  Resolution: Duplicate
   Fix Version/s: 1.4.0
Assignee: Burak Yavuz
Target Version/s: 1.4.0

 Correlation methods for DataFrame
 -

 Key: SPARK-7151
 URL: https://issues.apache.org/jira/browse/SPARK-7151
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SQL
Reporter: Joseph K. Bradley
Assignee: Burak Yavuz
Priority: Minor
  Labels: dataframe
 Fix For: 1.4.0


 We should support computing correlations between columns in DataFrames with a 
 simple API.
 This could be a DataFrame feature:
 {code}
 myDataFrame.corr("col1", "col2")
 // or
 myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
 {code}
 Or it could be an MLlib feature:
 {code}
 Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
 // or
 Statistics.corr(myDataFrame, "col1", "col2")
 {code}
 (The first Statistics.corr option is more flexible, but it could cause 
 trouble if a user tries to pass in 2 unzippable DataFrame columns.)
 Note: R follows the latter setup. I'm OK with either.
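 For context, a minimal sketch (not from the issue) of the existing RDD-based 
 MLlib API that such a DataFrame method would wrap; the DataFrame {{df}} and its 
 column names are assumptions:
 {code}
 import org.apache.spark.mllib.stat.Statistics
 import org.apache.spark.rdd.RDD

 // Pull two numeric columns out of a hypothetical DataFrame df.
 val x: RDD[Double] = df.select("col1").rdd.map(_.getDouble(0))
 val y: RDD[Double] = df.select("col2").rdd.map(_.getDouble(0))

 // Existing MLlib entry point; "pearson" and "spearman" are the supported methods.
 val r: Double = Statistics.corr(x, y, "pearson")
 {code}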






[jira] [Updated] (SPARK-1762) Add functionality to pin RDDs in cache

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1762:
-
Target Version/s:   (was: 1.2.0)

 Add functionality to pin RDDs in cache
 --

 Key: SPARK-1762
 URL: https://issues.apache.org/jira/browse/SPARK-1762
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or

 Right now, all RDDs are created equal, and there is no mechanism to identify 
 a certain RDD to be more important than the rest. This is a problem if the 
 RDD fraction is small, because just caching a few RDDs can evict more 
 important ones.
 A side effect of this feature is that we can now more safely allocate a 
 smaller spark.storage.memoryFraction if we know how large our important RDDs 
 are, without having to worry about them being evicted. This allows us to use 
 more memory for shuffles, for instance, and avoid disk spills.






[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)

2015-05-06 Thread Chen Song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530065#comment-14530065
 ] 

Chen Song commented on SPARK-7394:
--

Ok, I'll work on this after finishing my daily job.

 Add Pandas style cast (astype)
 --

 Key: SPARK-7394
 URL: https://issues.apache.org/jira/browse/SPARK-7394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter
 Fix For: 1.4.0


 Basically alias astype == cast in Column for Python (and Python only).
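 For reference, a minimal Scala sketch of the existing Column.cast that the 
 proposed Python-only astype would alias; {{df}} and the column name are 
 assumptions:
 {code}
 import org.apache.spark.sql.functions.col

 // Column.cast already exists on the Scala side; astype would be a Python-only
 // synonym for the same operation on a hypothetical DataFrame df.
 val casted = df.select(col("age").cast("double").as("age_double"))
 {code}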






[jira] [Created] (SPARK-7395) some suggestion about SimpleApp in quick-start.html

2015-05-06 Thread zhengbing li (JIRA)
zhengbing li created SPARK-7395:
---

 Summary: some suggestion about SimpleApp in quick-start.html
 Key: SPARK-7395
 URL: https://issues.apache.org/jira/browse/SPARK-7395
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.1
 Environment: none
Reporter: zhengbing li
 Fix For: 1.4.0


Based on the SimpleApp code guide at 
https://spark.apache.org/docs/latest/quick-start.html, I could not run the 
SimpleApp code until I changed {{val conf = new 
SparkConf().setAppName("Simple Application")}} to {{val conf = new 
SparkConf().setAppName("Simple Application").setMaster("local")}}.
So the document might be clarified for beginners.
The error of scala example is as follows:
15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0
Exception in thread "main" org.apache.spark.SparkException: A master URL must 
be set in your configuration
at org.apache.spark.SparkContext.init(SparkContext.scala:206)
at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12)
at com.huawei.openspark.TestSpark.main(TestSpark.scala)
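
For reference, a minimal sketch of the change the reporter describes (the surrounding lines and the local[*] master are illustrative; the quick start instead expects the master to be supplied via spark-submit --master):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hard-coding a master makes the example runnable directly from an IDE or with
// "sbt run"; when the jar is launched through spark-submit, --master can supply
// it instead and setMaster can be omitted.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[*]")
val sc = new SparkContext(conf)
{code}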






[jira] [Resolved] (SPARK-6940) PySpark CrossValidator

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6940.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5926
[https://github.com/apache/spark/pull/5926]

 PySpark CrossValidator
 --

 Key: SPARK-6940
 URL: https://issues.apache.org/jira/browse/SPARK-6940
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 1.3.0
Reporter: Omede Firouz
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.4.0


 PySpark doesn't currently have wrappers for any of the ML.Tuning classes: 
 CrossValidator, CrossValidatorModel, ParamGridBuilder






[jira] [Assigned] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7397:
---

Assignee: Apache Spark

 Add missing input information report back to ReceiverInputDStream due to 
 SPARK-7139
 ---

 Key: SPARK-7397
 URL: https://issues.apache.org/jira/browse/SPARK-7397
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Saisai Shao
Assignee: Apache Spark

 The input information report is missing due to the refactoring of 
 ReceiverInputDStream in SPARK-7139.






[jira] [Commented] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139

2015-05-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530218#comment-14530218
 ] 

Apache Spark commented on SPARK-7397:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/5937

 Add missing input information report back to ReceiverInputDStream due to 
 SPARK-7139
 ---

 Key: SPARK-7397
 URL: https://issues.apache.org/jira/browse/SPARK-7397
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Saisai Shao

 The input information report is missing due to the refactoring of 
 ReceiverInputDStream in SPARK-7139.






[jira] [Assigned] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7397:
---

Assignee: (was: Apache Spark)

 Add missing input information report back to ReceiverInputDStream due to 
 SPARK-7139
 ---

 Key: SPARK-7397
 URL: https://issues.apache.org/jira/browse/SPARK-7397
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Saisai Shao

 The input information report is missing due to the refactoring of 
 ReceiverInputDStream in SPARK-7139.






[jira] [Assigned] (SPARK-6812) filter() on DataFrame does not work as expected

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6812:
---

Assignee: Apache Spark  (was: Sun Rui)

 filter() on DataFrame does not work as expected
 ---

 Key: SPARK-6812
 URL: https://issues.apache.org/jira/browse/SPARK-6812
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Blocker

 {code}
  filter(df, df$age > 21)
  Error in filter(df, df$age > 21) :
    no method for coercing this S4 class to a vector
 {code}






[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected

2015-05-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530268#comment-14530268
 ] 

Apache Spark commented on SPARK-6812:
-

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/5938

 filter() on DataFrame does not work as expected
 ---

 Key: SPARK-6812
 URL: https://issues.apache.org/jira/browse/SPARK-6812
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Davies Liu
Assignee: Sun Rui
Priority: Blocker

 {code}
  filter(df, df$age > 21)
  Error in filter(df, df$age > 21) :
    no method for coercing this S4 class to a vector
 {code}






[jira] [Created] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139

2015-05-06 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-7397:
--

 Summary: Add missing input information report back to 
ReceiverInputDStream due to SPARK-7139
 Key: SPARK-7397
 URL: https://issues.apache.org/jira/browse/SPARK-7397
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.4.0
Reporter: Saisai Shao


The input information report is missing due to the refactoring of 
ReceiverInputDStream in SPARK-7139.






[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError

2015-05-06 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530224#comment-14530224
 ] 

Iulian Dragos commented on SPARK-5281:
--

Here's my workaround from [this Stack Overflow 
question|https://stackoverflow.com/questions/29796928/whats-the-most-efficient-way-to-filter-a-dataframe]:

- find your launch configuration and go to Classpath
- remove Scala Library and Scala Compiler from the Bootstrap entries
- add (as external jars) scala-reflect, scala-library and scala-compiler to 
user entries

Make sure to add the right version (2.10.4 at this point).
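
For sbt users hitting the same MissingRequirementError, a rough counterpart (an 
assumption on my part, not from this thread) is to pin scala-reflect to the 
project's Scala version and fork a separate JVM for run:

{code}
// build.sbt (hypothetical): keep scala-reflect at the project's Scala version
// and run the application in a forked JVM with the full classpath.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "1.2.0",
  "org.scala-lang" % "scala-reflect" % "2.10.4"
)

fork in run := true
{code}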



 Registering table on RDD is giving MissingRequirementError
 --

 Key: SPARK-5281
 URL: https://issues.apache.org/jira/browse/SPARK-5281
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.1
Reporter: sarsol
Priority: Critical

 Application crashes on this line, {{rdd.registerTempTable("temp")}}, in the 1.2 
 version when using sbt or the Eclipse Scala IDE.
 Stacktrace:
 {code}
 Exception in thread "main" scala.reflect.internal.MissingRequirementError: 
 class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with 
 primordial classloader with boot classpath 
 [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program
  Files\Java\jre7\lib\resources.jar;C:\Program 
 Files\Java\jre7\lib\rt.jar;C:\Program 
 Files\Java\jre7\lib\sunrsasign.jar;C:\Program 
 Files\Java\jre7\lib\jsse.jar;C:\Program 
 Files\Java\jre7\lib\jce.jar;C:\Program 
 Files\Java\jre7\lib\charsets.jar;C:\Program 
 Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found.
   at 
 scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
   at 
 scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
   at 
 scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119)
   at 
 scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115)
   at 
 scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
   at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335)
   at scala.reflect.api.Universe.typeOf(Universe.scala:59)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94)
   at 
 org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33)
   at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111)
   at 
 com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43)
   at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
   at 
 scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.App$$anonfun$main$1.apply(App.scala:71)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
   at scala.App$class.main(App.scala:71)
 {code}






[jira] [Updated] (SPARK-7386) Spark application level metrics application.$AppName.$number.cores doesn't reset on Standalone Master deployment

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7386:
-
Component/s: Spark Core
   Priority: Minor  (was: Major)

Please set component.
I'm not familiar with this bit, but I recall a similar conversation about a 
cores metric where some metrics were intended to reflect the amount requested 
while the job was running. Is that the intent of this one?

 Spark application level metrics application.$AppName.$number.cores doesn't 
 reset on Standalone Master deployment
 

 Key: SPARK-7386
 URL: https://issues.apache.org/jira/browse/SPARK-7386
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Bharat Venkat
Priority: Minor

 Spark publishes a metric called application.$AppName.$number.cores that 
 monitors the number of cores assigned to an application. However, there is a 
 bug as of the 1.3 standalone deployment where this metric doesn't go down to 0 
 after the application ends.
 It looks like the standalone Master holds onto the old state and continues to 
 publish a stale metric.






[jira] [Resolved] (SPARK-7374) Error message when launching: find: 'version' : No such file or directory

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7374.
--
Resolution: Duplicate

Please search JIRA first and review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Error message when launching: find: 'version' : No such file or directory
 ---

 Key: SPARK-7374
 URL: https://issues.apache.org/jira/browse/SPARK-7374
 Project: Spark
  Issue Type: Bug
  Components: Deploy, PySpark, Spark Shell
Affects Versions: 1.3.1
Reporter: Stijn Geuens

 When I launch spark-shell (or pyspark), I get the following messages:
 find: 'version' : No such file or directory
 else was unexpected at this time.
 How is it possible that this error keeps occurring (with different versions 
 of Spark)? How can I resolve this issue?
 Thanks in advance,
 Stijn






[jira] [Resolved] (SPARK-7369) Spark Python 1.3.1 Mllib dataframe random forest problem

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7369.
--
Resolution: Invalid

Have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

This sounds, on its face, like a py4j issue. These kinds of things can be 
reopened if there is more specific evidence it's Spark-related.

 Spark Python 1.3.1 Mllib dataframe random forest problem
 

 Key: SPARK-7369
 URL: https://issues.apache.org/jira/browse/SPARK-7369
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.3.1
Reporter: Lisbeth Ron
  Labels: hadoop

 I'm working with DataFrames to train a random forest with MLlib,
 and I get this error:
   File 
 "/opt/mapr/spark/spark-1.3.1-bin-mapr4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql.
 Can somebody help me?






[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

{code}
between: added in 1.4
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}

*math*

{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg  - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}

*collection functions*

{code}
sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean
{code}

*date functions*

{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String
{code}


*conditional functions*

{code}
if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T
{code}


*string functions*

{code}
ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array<string>): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map<string,string>
trim(string A): string
unbase64(string str): binary
upper(string A) ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string
{code}


*Misc*

{code}
hash(a1[, a2…]): int
{code}


*text*

{code}
context_ngrams(array<array<string>>, array<string>, int K, int pf): 
array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>
{code}


*UDAF*

{code}
var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array<double>
percentile_approx: array<double>
histogram_numeric: array<struct {'x','y'}>
collect_set  — we have hashset
collect_list 
ntile
{code}







  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

{code}
-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}

*math*

{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg  - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}

*collection functions*

{code}
sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean
{code}

*date functions*

{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string

[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-05-06 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530029#comment-14530029
 ] 

Guoqiang Li commented on SPARK-5556:


[FastLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553]: 
Gibbs sampling. The computational complexity is O(n_dk), where n_dk is the number of 
unique topics in document d. I recommend it for short text.
[LightLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L763]: 
Metropolis-Hastings sampling. The computational complexity is O(1) (it depends on 
the partition strategy and uses more memory).


 Latent Dirichlet Allocation (LDA) using Gibbs sampler 
 --

 Key: SPARK-5556
 URL: https://issues.apache.org/jira/browse/SPARK-5556
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Pedro Rodriguez
 Attachments: LDA_test.xlsx, spark-summit.pptx









[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

{code}
-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}

*math*

{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg  - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}

*collection functions*

{code}
sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean
{code}

*date functions*

{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String
{code}


*conditional functions*

{code}
if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T
{code}


*string functions*

{code}
ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array<string>): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map<string,string>
trim(string A): string
unbase64(string str): binary
upper(string A) ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string
{code}


*Misc*

{code}
hash(a1[, a2…]): int
{code}


*text*

{code}
context_ngrams(array<array<string>>, array<string>, int K, int pf): 
array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>
{code}


*UDAF*

{code}
var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array<double>
percentile_approx: array<double>
histogram_numeric: array<struct {'x','y'}>
collect_set  — we have hashset
collect_list 
ntile
{code}







  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg  - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string 

[jira] [Updated] (SPARK-4488) Add control over map-side aggregation

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4488:
-
Target Version/s:   (was: 1.1.1, 1.2.0)

 Add control over map-side aggregation
 -

 Key: SPARK-4488
 URL: https://issues.apache.org/jira/browse/SPARK-4488
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.1.0
Reporter: uncleGen
Priority: Minor








[jira] [Updated] (SPARK-3095) [PySpark] Speed up RDD.count()

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3095:
-
Target Version/s:   (was: 1.2.0)

 [PySpark] Speed up RDD.count()
 --

 Key: SPARK-3095
 URL: https://issues.apache.org/jira/browse/SPARK-3095
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Minor

 RDD.count() can fall back to RDD._jrdd.count() when the RDD is not a 
 PipelineRDD.
 If the JavaRDD is serialized in batch mode, it's possible to skip the 
 deserialization of chunks (except the last one), because they all have the 
 same number of elements in them. There are some special cases where the chunks 
 are re-ordered, so this will not work there.






[jira] [Updated] (SPARK-3492) Clean up Yarn integration code

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3492:
-
Target Version/s:   (was: 1.2.0)

 Clean up Yarn integration code
 --

 Key: SPARK-3492
 URL: https://issues.apache.org/jira/browse/SPARK-3492
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 This is the parent umbrella for cleaning up the Yarn integration code in 
 general. This is a broad effort, and each individual cleanup should be opened 
 as a sub-issue against this one.






[jira] [Updated] (SPARK-3913) Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3913:
-
Target Version/s:   (was: 1.2.0)

 Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn 
 Application Listener and killApplication() API
 --

 Key: SPARK-3913
 URL: https://issues.apache.org/jira/browse/SPARK-3913
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Chester

 When working with Spark in Yarn deployment mode, we have a few issues:
 1) We don't know the Yarn max capacity (memory and cores) before we specify the 
 number of executors and the memory for the Spark driver and executors. If we 
 set a big number, the job can potentially exceed the limit and get killed.
    It would be better to let the application know the Yarn resource capacity 
 ahead of time so the Spark config can be adjusted dynamically.
 2) Once the job has started, we would like some feedback from the Yarn 
 application. Currently, the Spark client basically blocks the call and returns 
 when the job is finished, failed, or killed.
 If the job runs for a few hours, we have no idea how far it has gone: the 
 progress, resource usage, tracking URL, etc.
 3) Once the job is started, you basically can't stop it. The Yarn Client API 
 stop doesn't work in most cases in our experience. But the Yarn API that does 
 work is killApplication(appId).
    So we need to expose this killApplication() API to the Spark Yarn client as 
 well.

 I will create one pull request and try to address these problems.
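 For reference, a minimal sketch of the underlying Hadoop YarnClient call that 
 item 3 asks Spark's Yarn client to expose; the application id is illustrative:
 {code}
 import org.apache.hadoop.yarn.api.records.ApplicationId
 import org.apache.hadoop.yarn.client.api.YarnClient
 import org.apache.hadoop.yarn.conf.YarnConfiguration

 // Plain Hadoop API, not a Spark API: create a client and kill an application.
 val yarnClient = YarnClient.createYarnClient()
 yarnClient.init(new YarnConfiguration())
 yarnClient.start()

 val appId = ApplicationId.newInstance(1430900000000L, 42) // hypothetical app
 yarnClient.killApplication(appId)
 yarnClient.stop()
 {code}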
  






[jira] [Updated] (SPARK-2838) performance tests for feature transformations

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2838:
-
Target Version/s:   (was: 1.2.0)

 performance tests for feature transformations
 -

 Key: SPARK-2838
 URL: https://issues.apache.org/jira/browse/SPARK-2838
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Minor

 1. TF-IDF
 2. StandardScaler
 3. Normalizer
 4. Word2Vec






[jira] [Updated] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2653:
-
Target Version/s:   (was: 1.2.0)

 Heap size should be the sum of driver.memory and executor.memory in local mode
 --

 Key: SPARK-2653
 URL: https://issues.apache.org/jira/browse/SPARK-2653
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Davies Liu
Priority: Minor
   Original Estimate: 1h
  Remaining Estimate: 1h

 In local mode, the driver and executor run in the same JVM, so the heap size 
 of JVM should be the sum of spark.driver.memory and spark.executor.memory.






[jira] [Updated] (SPARK-3685) Spark's local dir should accept only local paths

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3685:
-
Target Version/s:   (was: 1.2.0)

 Spark's local dir should accept only local paths
 

 Key: SPARK-3685
 URL: https://issues.apache.org/jira/browse/SPARK-3685
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it 
 will try to do is create a folder called hdfs: and put tmp inside it. 
 This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead 
 of Hadoop's file system to parse this path. We also need to resolve the path 
 appropriately.
 This may not have an urgent use case, but it fails silently and does what is 
 least expected.






[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1239:
-
Target Version/s:   (was: 1.2.0)

 Don't fetch all map output statuses at each reducer during shuffles
 ---

 Key: SPARK-1239
 URL: https://issues.apache.org/jira/browse/SPARK-1239
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Patrick Wendell

 Instead we should modify the way we fetch map output statuses to take both a 
 mapper and a reducer - or we should just piggyback the statuses on each task. 






[jira] [Updated] (SPARK-3075) Expose a way for users to parse event logs

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3075:
-
Target Version/s:   (was: 1.2.0)

 Expose a way for users to parse event logs
 --

 Key: SPARK-3075
 URL: https://issues.apache.org/jira/browse/SPARK-3075
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Or

 Both ReplayListenerBus and util.JsonProtocol are private[spark], so if the user 
 wants to parse the event logs themselves for analytics, they will have to 
 write their own JSON deserializers (or do some crazy reflection to access 
 these methods). We should expose an easy way for them to do this.
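 A minimal sketch of what "write their own JSON deserializers" means today, 
 assuming json4s and a hypothetical event log path; only the top-level Event 
 field is read:
 {code}
 import scala.io.Source
 import org.json4s._
 import org.json4s.jackson.JsonMethods.parse

 implicit val formats = DefaultFormats

 // Event logs are newline-delimited JSON, one event per line.
 val events = Source.fromFile("/tmp/spark-events/app-20150506-0001").getLines()
   .map(line => parse(line))
   .map(json => (json \ "Event").extract[String])
   .toList

 // Count how many of each event type appeared.
 events.groupBy(identity).mapValues(_.size).foreach(println)
 {code}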






[jira] [Updated] (SPARK-2774) Set preferred locations for reduce tasks

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2774:
-
Target Version/s:   (was: 1.2.0)

 Set preferred locations for reduce tasks
 

 Key: SPARK-2774
 URL: https://issues.apache.org/jira/browse/SPARK-2774
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman

 Currently we do not set preferred locations for reduce tasks in Spark. This 
 patch proposes setting preferred locations based on the map output sizes and 
 locations tracked by the MapOutputTracker. This is useful in two situations:
 1. When you have a small job in a large cluster it can be useful to co-locate 
 map and reduce tasks to avoid going over the network
 2. If there is a lot of data skew in the map stage outputs, then it is 
 beneficial to place the reducer close to the largest output.






[jira] [Updated] (SPARK-4609) Job can not finish if there is one bad slave in clusters

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4609:
-
Target Version/s:   (was: 1.3.0)

 Job can not finish if there is one bad slave in clusters
 

 Key: SPARK-4609
 URL: https://issues.apache.org/jira/browse/SPARK-4609
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu

 If there is one bad machine in the cluster, its executors will keep dying (for 
 example, because the disk is out of space). A task may be scheduled to this 
 machine multiple times, and then the job will fail after several failures of 
 that one task.
 {code}
 14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 
 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 
 lost)
 14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 
 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 
 lost)
 14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 
 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 
 lost)
 14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 
 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
 14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, 
 spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 
 lost)
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in 
 stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 
 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure 
 (executor 63 lost)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
   at scala.Option.foreach(Option.scala:236)
   at 
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413)
   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
   at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 {code}
 A task should not be scheduled to the same machine more than once. Also, if a 
 machine fails with executors lost, it should be put on a blacklist for some 
 time before being tried again.
 cc [~kayousterhout] [~matei]






[jira] [Updated] (SPARK-2992) The transforms formerly known as non-lazy

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2992:
-
Target Version/s:   (was: 1.2.0)

 The transforms formerly known as non-lazy
 -

 Key: SPARK-2992
 URL: https://issues.apache.org/jira/browse/SPARK-2992
 Project: Spark
  Issue Type: Umbrella
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Erik Erlandson

 An umbrella for a grab-bag of tickets involving lazy implementations of 
 transforms formerly thought to be non-lazy.






[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3376:
-
Target Version/s:   (was: 1.3.0)

 Memory-based shuffle strategy to reduce overhead of disk I/O
 

 Key: SPARK-3376
 URL: https://issues.apache.org/jira/browse/SPARK-3376
 Project: Spark
  Issue Type: New Feature
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: uncleGen
  Labels: performance

 I think a memory-based shuffle can reduce some of the overhead of disk I/O. I 
 just want to know whether there is any plan to do something about it, or any 
 suggestions about it. Based on the work in SPARK-2044, it is feasible to have 
 several implementations of shuffle.

 Currently, there are two implementations of the shuffle manager, i.e. SORT and 
 HASH. Both of them use the disk in some stages. For example, on the map 
 side, all the intermediate data is written into temporary files. On the 
 reduce side, Spark sometimes uses an external sort. In any case, disk I/O 
 brings some performance loss. Maybe we can provide a pure-memory shuffle 
 manager. In this shuffle manager, intermediate data would only go through 
 memory. In some scenarios, it can improve performance. Experimentally, I 
 implemented an in-memory shuffle manager on top of SPARK-2044. 
 1. Following is my testing result (some heavy shuffle operations):
 | data size (Byte) | partitions | resources |
 | 5131859218 | 2000 | 50 executors / 4 cores / 4GB |
 | settings | operation1 | operation2 |
 | shuffle spill & lz4 | repartition+flatMap+groupByKey | repartition+groupByKey |
 | memory | 38s | 16s |
 | sort | 45s | 28s |
 | hash | 46s | 28s |
 | no shuffle spill & lz4 | | |
 | memory | 16s | 16s |
 | | | |
 | shuffle spill & lzf | | |
 | memory | 28s | 27s |
 | sort | 29s | 29s |
 | hash | 41s | 30s |
 | no shuffle spill & lzf | | |
 | memory | 15s | 16s |
 In my implementation, I simply reused the BlockManager on the map side and 
 set spark.shuffle.spill to false on the reduce side. All the intermediate 
 data is cached in the memory store. Just as Reynold Xin has pointed out, our 
 disk-based shuffle manager has achieved good performance. With parameter 
 tuning, the disk-based shuffle manager will obtain performance similar to the 
 memory-based shuffle manager. However, I will continue my work and improve 
 it. And as an alternative tuning option, in-memory shuffle is a good choice. 
 Future work includes, but is not limited to:
 - memory usage management in InMemory Shuffle mode
 - data management when intermediate data can not fit in memory
 Test code:
 {code: borderStyle=solid}
 val conf = new SparkConf().setAppName("InMemoryShuffleTest")
 val sc = new SparkContext(conf)
 val dataPath = args(0)
 val partitions = args(1).toInt
 val rdd1 = sc.textFile(dataPath).cache()
 rdd1.count()
 val startTime = System.currentTimeMillis()
 val rdd2 = rdd1.repartition(partitions)
   .flatMap(_.split(",")).map(s => (s, s))
   .groupBy(e => e._1)
 rdd2.count()
 val endTime = System.currentTimeMillis()
 println("time: " + (endTime - startTime) / 1000)
 {code}
 2. Following is a Spark Sort Benchmark (in Spark 1.1.1). There is no tuning 
 for disk shuffle.
 2.1. Test the influence of memory size per core
 precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks).
 | memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
 | 9GB  | 79.652849s | 60.102337s | failed      | -32.7%  | -       |
 | 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17%  | +47.8%  |
 | 15GB | 33.537199s | 40.140621s | 48.088158s  | +16.47% | +30.26% |
 | 18GB | 30.930927s | 43.392401s | 49.830276s  | +28.7%  | +37.93% |
 2.2. Test the influence of partition number
 18GB / 15 cores per executor
 | partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
 | 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% |

[jira] [Updated] (SPARK-3348) Support user-defined SparkListeners properly

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3348:
-
Target Version/s:   (was: 1.2.0)

 Support user-defined SparkListeners properly
 

 Key: SPARK-3348
 URL: https://issues.apache.org/jira/browse/SPARK-3348
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Andrew Or

 Because of the current initialization ordering, user-defined SparkListeners 
 do not receive certain events that are posted before application code is run. 
 We need to expose a constructor that allows the given SparkListeners to 
 receive all events.
 There has been interest in this for a while, but I have searched through the 
 JIRA history and have not found a related issue.
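 A minimal sketch of the kind of usage in question, assuming the spark.extraListeners 
 configuration key as the registration mechanism (that key was added in later releases; 
 the listener class itself is illustrative only):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}

 // Hypothetical user-defined listener.
 class StartupLoggingListener extends SparkListener {
   override def onApplicationStart(start: SparkListenerApplicationStart): Unit = {
     println(s"application started: ${start.appName}")
   }
 }

 object ListenerExample {
   def main(args: Array[String]): Unit = {
     // Registering through configuration, rather than after the context is up,
     // is one way to let the listener see events posted during initialization.
     val conf = new SparkConf()
       .setAppName("ListenerExample")
       .set("spark.extraListeners", classOf[StartupLoggingListener].getName)
     val sc = new SparkContext(conf)
     // ... application code ...
     sc.stop()
   }
 }
 {code}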



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4784) Model.fittingParamMap should store all Params

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4784:
-
Target Version/s:   (was: 1.3.0)

 Model.fittingParamMap should store all Params
 -

 Key: SPARK-4784
 URL: https://issues.apache.org/jira/browse/SPARK-4784
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 spark.ml's Model class should store all parameters in the fittingParamMap, 
 not just the ones which were explicitly set.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3916) recognize appended data in textFileStream()

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3916:
-
Target Version/s:   (was: 1.2.0)

 recognize appended data in textFileStream()
 ---

 Key: SPARK-3916
 URL: https://issues.apache.org/jira/browse/SPARK-3916
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Reporter: Davies Liu
Assignee: Davies Liu

 Right now, we only find new data in new files; data written to old 
 files (processed in the last batch) will not be processed.
 In order to support this, we need partialRDD(), which is an RDD for part of 
 a file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3051) Support looking-up named accumulators in a registry

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3051:
-
Target Version/s:   (was: 1.2.0)

 Support looking-up named accumulators in a registry
 ---

 Key: SPARK-3051
 URL: https://issues.apache.org/jira/browse/SPARK-3051
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Neil Ferguson

 This is a proposed enhancement to Spark based on the following mailing list 
 discussion: 
 http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html.
 This proposal builds on SPARK-2380 (Support displaying accumulator values in 
 the web UI) to allow named accumulables to be looked-up in a registry, as 
 opposed to having to be passed to every method that needs to access them.
 The use case was described well by [~shivaram], as follows:
 Let's say you have two functions you use 
 in a map call and want to measure how much time each of them takes. For 
 example, if you have a code block like the one below and you want to 
 measure how much time f1 takes as a fraction of the task. 
 {noformat}
 a.map { l => 
val f = f1(l) 
... some work here ... 
 } 
 {noformat}
 It would be really cool if we could do something like 
 {noformat}
 a.map { l => 
    val start = System.nanoTime 
    val f = f1(l) 
    TaskMetrics.get("f1-time").add(System.nanoTime - start) 
 } 
 {noformat}
 SPARK-2380 provides a partial solution to this problem -- however the 
 accumulables would still need to be passed to every function that needs them, 
 which I think would be cumbersome in any application of reasonable complexity.
 The proposal, as suggested by [~pwendell], is to have a registry of 
 accumulables, that can be looked-up by name. 
 Regarding the implementation details, I'd propose that we broadcast a 
 serialized version of all named accumulables in the DAGScheduler (similar to 
 what SPARK-2521 does for Tasks). These can then be deserialized in the 
 Executor. 
 Accumulables are already stored in thread-local variables in the Accumulators 
 object, so exposing these in the registry should be simply a matter of 
 wrapping this object, and keying the accumulables by name (they are currently 
 keyed by ID).
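 A minimal sketch of the lookup-by-name idea, assuming the registry is plain driver-side 
 bookkeeping; none of these names exist in Spark, and executor-side visibility would still 
 require the broadcast step described above.
 {code}
 import scala.collection.concurrent.TrieMap
 import org.apache.spark.{Accumulator, SparkContext}

 // Hypothetical registry keyed by name instead of ID.
 object AccumulatorRegistry {
   private val byName = TrieMap.empty[String, Accumulator[Long]]

   def register(acc: Accumulator[Long], name: String): Unit = byName.put(name, acc)

   def get(name: String): Accumulator[Long] =
     byName.getOrElse(name, sys.error(s"no accumulable registered as '$name'"))
 }

 // Driver-side usage:
 //   val f1Time = sc.accumulator(0L, "f1-time")
 //   AccumulatorRegistry.register(f1Time, "f1-time")
 //   ...later, anywhere on the driver...
 //   AccumulatorRegistry.get("f1-time") += 42L
 {code}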



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7394:
---
Fix Version/s: (was: 1.4.0)

 Add Pandas style cast (astype)
 --

 Key: SPARK-7394
 URL: https://issues.apache.org/jira/browse/SPARK-7394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter

 Basically alias astype == cast in Column for Python (and Python only).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7372) Multiclass SVM - One vs All wrapper

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7372.
--
Resolution: Won't Fix

This should be a question on user@ I think. It would be better to build this once 
than to specialize it several times. 

If there were some different, special way to handle multiclass in SVM that took 
advantage of how SVMs worked, then it might make sense to support some 
SVM-specific implementation. (For example, you certainly don't need one-vs-all 
to do multiclass in decision trees.) But I don't believe there is for SVM.

 Multiclass SVM - One vs All wrapper
 ---

 Key: SPARK-7372
 URL: https://issues.apache.org/jira/browse/SPARK-7372
 Project: Spark
  Issue Type: Question
  Components: MLlib
Reporter: Renat Bekbolatov
Priority: Trivial

 I was wondering if we want to have some support for multiclass SVM in 
 MLlib, for example, through a simple wrapper over binary SVM classifiers with 
 OVA.
 There is already WIP for ML pipeline generalization: SPARK-7015, 
 Multiclass to Binary Reduction.
 However, if users prefer to just have a basic OVA version that runs against 
 SVMWithSGD, they might be able to use it.
 Here is a code sketch: 
 https://github.com/Bekbolatov/spark/commit/463d73323d5f08669d5ae85dc9791b036637c966
 Maybe this could live in a 3rd party utility library (outside Spark MLlib).
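 For illustration, a rough one-vs-all sketch over the existing SVMWithSGD API (this is 
 not the linked implementation, just the idea): train one binary model per class, then 
 predict the class whose model yields the largest raw margin.
 {code}
 import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.rdd.RDD

 object OneVsAllSketch {
   def trainOneVsAll(data: RDD[LabeledPoint], numClasses: Int, iters: Int): Array[SVMModel] =
     (0 until numClasses).map { c =>
       // Relabel: class c becomes the positive class, everything else negative.
       val binary = data.map(p => LabeledPoint(if (p.label == c) 1.0 else 0.0, p.features))
       val model = SVMWithSGD.train(binary, iters)
       model.clearThreshold()  // return raw margins instead of 0/1 predictions
       model
     }.toArray

   def predict(models: Array[SVMModel], features: Vector): Int =
     models.indices.maxBy(c => models(c).predict(features))
 }
 {code}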



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)

2015-05-06 Thread Chen Song (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530066#comment-14530066
 ] 

Chen Song commented on SPARK-7394:
--

Ok, I'll work on this after finishing my daily job.

 Add Pandas style cast (astype)
 --

 Key: SPARK-7394
 URL: https://issues.apache.org/jira/browse/SPARK-7394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter
 Fix For: 1.4.0


 Basically alias astype == cast in Column for Python (and Python only).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7396) Update Kafka example to use new API of Kafka 0.8.2

2015-05-06 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-7396:
--

 Summary: Update Kafka example to use new API of Kafka 0.8.2
 Key: SPARK-7396
 URL: https://issues.apache.org/jira/browse/SPARK-7396
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Reporter: Saisai Shao


Due to the upgrade of Kafka, the current KafkaWordCountProducer throws the exception 
below, so we need to update the code accordingly.

{code}
Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to 
send messages after 3 tries.
at 
kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
at kafka.producer.Producer.send(Producer.scala:77)
at 
org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
at 
org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7395) some suggestion about SimpleApp in quick-start.html

2015-05-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-7395:
---
Affects Version/s: (was: 1.3.1)
   1.4.0

 some suggestion about SimpleApp in quick-start.html
 ---

 Key: SPARK-7395
 URL: https://issues.apache.org/jira/browse/SPARK-7395
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.1
 Environment: none
Reporter: zhengbing li
   Original Estimate: 12h
  Remaining Estimate: 12h

 Based on the code guide for SimpleApp in 
 https://spark.apache.org/docs/latest/quick-start.html, I could not run the 
 SimpleApp code until I modified val conf = new SparkConf().setAppName("Simple 
 Application") to val conf = new SparkConf().setAppName("Simple 
 Application").setMaster("local").
 So the document might be modified for beginners.
 The error of scala example is as follows:
 15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0
 Exception in thread "main" org.apache.spark.SparkException: A master URL must 
 be set in your configuration
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:206)
   at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12)
   at com.huawei.openspark.TestSpark.main(TestSpark.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7395) some suggestion about SimpleApp in quick-start.html

2015-05-06 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-7395:
---
Affects Version/s: (was: 1.4.0)
   1.3.1

 some suggestion about SimpleApp in quick-start.html
 ---

 Key: SPARK-7395
 URL: https://issues.apache.org/jira/browse/SPARK-7395
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.1
 Environment: none
Reporter: zhengbing li
   Original Estimate: 12h
  Remaining Estimate: 12h

 Based on the code guide for SimpleApp in 
 https://spark.apache.org/docs/latest/quick-start.html, I could not run the 
 SimpleApp code until I modified val conf = new SparkConf().setAppName("Simple 
 Application") to val conf = new SparkConf().setAppName("Simple 
 Application").setMaster("local").
 So the document might be modified for beginners.
 The error of scala example is as follows:
 15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0
 Exception in thread "main" org.apache.spark.SparkException: A master URL must 
 be set in your configuration
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:206)
   at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12)
   at com.huawei.openspark.TestSpark.main(TestSpark.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6824) Fill the docs for DataFrame API in SparkR

2015-05-06 Thread Qian Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529997#comment-14529997
 ] 

Qian Huang commented on SPARK-6824:
---

Started working on this issue.

 Fill the docs for DataFrame API in SparkR
 -

 Key: SPARK-6824
 URL: https://issues.apache.org/jira/browse/SPARK-6824
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Shivaram Venkataraman
Priority: Blocker

 Some of the DataFrame functions in SparkR do not have complete roxygen docs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7150) SQLContext.range()

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7150:
---
Summary: SQLContext.range()  (was: Facilitate random column generation for 
DataFrames)

 SQLContext.range()
 --

 Key: SPARK-7150
 URL: https://issues.apache.org/jira/browse/SPARK-7150
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SQL
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 It would be handy to have easy ways to construct random columns for 
 DataFrames.  Proposed API:
 {code}
 class SQLContext {
   // Return a DataFrame with a single column named "id" that has consecutive 
 values from 0 to n.
   def range(n: Long): DataFrame
   def range(n: Long, numPartitions: Int): DataFrame
 }
 {code}
 Usage:
 {code}
 // uniform distribution
 ctx.range(1000).select(rand())
 // normal distribution
 ctx.range(1000).select(randn())
 {code}
 We should add a RangeIterator that supports long start/stop positions, and 
 then use it to create an RDD as the basis for this DataFrame.
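 A rough sketch (not Spark's actual implementation) of how such a range could be 
 generated lazily per partition, with a hand-rolled iterator standing in for the 
 proposed RangeIterator; the DataFrame would then be built on top of this RDD.
 {code}
 import org.apache.spark.SparkContext
 import org.apache.spark.rdd.RDD

 object RangeSketch {
   def rangeRDD(sc: SparkContext, n: Long, numPartitions: Int): RDD[Long] =
     sc.parallelize(0 until numPartitions, numPartitions).flatMap { p =>
       val start = p * n / numPartitions        // this partition's first value
       val end = (p + 1) * n / numPartitions    // exclusive upper bound
       new Iterator[Long] {                     // minimal iterator with Long bounds
         private var i = start
         def hasNext: Boolean = i < end
         def next(): Long = { val v = i; i += 1; v }
       }
     }
 }
 {code}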



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg  - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String


*conditional functions*

if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T


*string functions*

ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array<string>): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map<string, string>
trim(string A): string
unbase64(string str): binary
upper(string A) ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string


*Misc*

hash(a1[, a2…]): int


*text*

context_ngrams(array<array<string>>, array<string>, int K, int pf): array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>


*UDAF*

var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array<double>
percentile_approx: array<double>
histogram_numeric: array<struct {'x','y'}>
collect_set  — we have hashset
collect_list 
ntile








  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
toDeg  - toDegrees
toRad - toRadians
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String



[jira] [Updated] (SPARK-3454) Expose JSON representation of data shown in WebUI

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3454:
-
Target Version/s:   (was: 1.2.0)

 Expose JSON representation of data shown in WebUI
 -

 Key: SPARK-3454
 URL: https://issues.apache.org/jira/browse/SPARK-3454
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Assignee: Imran Rashid
 Fix For: 1.4.0

 Attachments: sparkmonitoringjsondesign.pdf


 If the WebUI supported extracting its data as JSON, it would be helpful for users 
 who want to analyse stage / task / executor information.
 Fortunately, WebUI has a renderJson method, so we can implement that method in 
 each subclass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3166) Custom serialisers can't be shipped in application jars

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3166:
-
Target Version/s:   (was: 1.2.0)

 Custom serialisers can't be shipped in application jars
 ---

 Key: SPARK-3166
 URL: https://issues.apache.org/jira/browse/SPARK-3166
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Graham Dennis

 Spark cannot currently use a custom serialiser that is shipped with the 
 application jar. Trying to do this causes a java.lang.ClassNotFoundException 
 when trying to instantiate the custom serialiser in the Executor processes. 
 This occurs because Spark attempts to instantiate the custom serialiser 
 before the application jar has been shipped to the Executor process. A 
 reproduction of the problem is available here: 
 https://github.com/GrahamDennis/spark-custom-serialiser
 I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches 
 as of August 21, 2014.  This issue is related to SPARK-2878, and my fix for 
 that issue (https://github.com/apache/spark/pull/1890) also solves this.  My 
 pull request was not merged because it adds the user jar to the Executor 
 processes' class path at launch time.  Such a significant change was thought 
 by [~rxin] to require more QA, and should be considered for inclusion in 1.2 
 at the earliest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3134) Update block locations asynchronously in TorrentBroadcast

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3134:
-
Target Version/s:   (was: 1.2.0)

 Update block locations asynchronously in TorrentBroadcast
 -

 Key: SPARK-3134
 URL: https://issues.apache.org/jira/browse/SPARK-3134
 Project: Spark
  Issue Type: Sub-task
  Components: Block Manager
Reporter: Reynold Xin

 Once the TorrentBroadcast gets the data blocks, it needs to tell the master 
 the new location. We should make the location update non-blocking to reduce 
 the roundtrips needed to launch tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3684) Can't configure local dirs in Yarn mode

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3684:
-
Target Version/s:   (was: 1.2.0)

 Can't configure local dirs in Yarn mode
 ---

 Key: SPARK-3684
 URL: https://issues.apache.org/jira/browse/SPARK-3684
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: Andrew Or

 We can't set SPARK_LOCAL_DIRS or spark.local.dirs because they're not picked 
 up in Yarn mode. However, we can't set YARN_LOCAL_DIRS or LOCAL_DIRS either 
 because these are overridden by Yarn.
 I'm trying to set these through SPARK_YARN_USER_ENV. I'm aware that the 
 default behavior is for Spark to use Yarn's local dirs, but right now there's 
 no way to change it even if the user wants to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3631) Add docs for checkpoint usage

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3631:
-
Target Version/s:   (was: 1.2.0)

 Add docs for checkpoint usage
 -

 Key: SPARK-3631
 URL: https://issues.apache.org/jira/browse/SPARK-3631
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Andrew Ash
Assignee: Andrew Ash

 We should include general documentation on using checkpoints.  Right now the 
 docs only cover checkpoints in the Spark Streaming use case which is slightly 
 different from Core.
 Some content to consider for inclusion from [~brkyvz]:
 {quote}
 If you set the checkpointing directory, however, the intermediate state of the 
 RDDs will be saved in HDFS, and the lineage will pick up from there.
 You won't need to keep the shuffle data before the checkpointed state, 
 therefore those can be safely removed (will be removed automatically).
 However, checkpoint must be called explicitly as in 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291
  ,just setting the directory will not be enough.
 {quote}
 {quote}
 Yes, writing to HDFS is more expensive, but I feel it is still a small price 
 to pay when compared to having a Disk Space Full error three hours in
 and having to start from scratch.
 The main goal of checkpointing is to truncate the lineage. Clearing up 
 shuffle writes comes as a bonus to checkpointing; it is not the main goal. The
 subtlety here is that .checkpoint() is just like .cache(). Until you call an 
 action, nothing happens. Therefore, if you're going to do 1000 maps in a
 row and you don't want to checkpoint in the meantime until a shuffle happens, 
 you will still get a StackOverflowError, because the lineage is too long.
 I went through some of the code for checkpointing. As far as I can tell, it 
 materializes the data in HDFS, and resets all its dependencies, so you start
 a fresh lineage. My understanding would be that checkpointing still should be 
 done every N operations to reset the lineage. However, an action must be
 performed before the lineage grows too long.
 {quote}
 A good place to put this information would be at 
 https://spark.apache.org/docs/latest/programming-guide.html
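 A minimal sketch of the usage pattern described above, with an illustrative HDFS 
 path: set the checkpoint directory, call checkpoint() explicitly, then run an 
 action so the data is actually materialized and the lineage truncated.
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setAppName("CheckpointExample"))
 sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // setting the directory alone is not enough

 val rdd = sc.parallelize(1 to 1000000).map(_ * 2)
 rdd.checkpoint()   // like cache(): marks the RDD, no work happens yet
 rdd.count()        // the action writes the data to HDFS and truncates the lineage
 {code}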



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3514) Provide a utility function for returning the hosts (and number) of live executors

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3514:
-
Target Version/s:   (was: 1.2.0)

 Provide a utility function for returning the hosts (and number) of live 
 executors
 -

 Key: SPARK-3514
 URL: https://issues.apache.org/jira/browse/SPARK-3514
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Minor

 It would be nice to tell user applications how many executors they have 
 currently running in their application. Also, we could give them the host 
 names on which the executors are running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1832) Executor UI improvement suggestions

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1832:
-
Target Version/s:   (was: 1.2.0)

 Executor UI improvement suggestions
 ---

 Key: SPARK-1832
 URL: https://issues.apache.org/jira/browse/SPARK-1832
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Thomas Graves

 I received some suggestions from a user for the /executors UI page to make it 
 more helpful. This gets more important when you have a really large number of 
 executors.
  Fill some of the cells with color in order to make it easier to absorb 
 the info, e.g.
 RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
 the red)
 GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
 number)
 Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
 on the log(# completed))
 - if dark blue then write the value in white (same for the RED and GREEN above)
 Maybe mark the MASTER task somehow
  
 Report the TOTALS in each column (do this at the TOP so there is no need to scroll 
 to the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3630:
-
Target Version/s:   (was: 1.1.1, 1.2.0)

 Identify cause of Kryo+Snappy PARSING_ERROR
 ---

 Key: SPARK-3630
 URL: https://issues.apache.org/jira/browse/SPARK-3630
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 1.1.0, 1.2.0
Reporter: Andrew Ash
Assignee: Josh Rosen

 A recent GraphX commit caused non-deterministic exceptions in unit tests so 
 it was reverted (see SPARK-3400).
 Separately, [~aash] observed the same exception stacktrace in an 
 application-specific Kryo registrator:
 {noformat}
 com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to 
 uncompress the chunk: PARSING_ERROR(2)
 com.esotericsoftware.kryo.io.Input.fill(Input.java:142) 
 com.esotericsoftware.kryo.io.Input.require(Input.java:169) 
 com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) 
 com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) 
 com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
  
 com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
  
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) 
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
  
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
  
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 ...
 {noformat}
 This ticket is to identify the cause of the exception in the GraphX commit so 
 the faulty commit can be fixed and merged back into master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3385) Improve shuffle performance

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3385:
-
Target Version/s:   (was: 1.3.0)

 Improve shuffle performance
 ---

 Key: SPARK-3385
 URL: https://issues.apache.org/jira/browse/SPARK-3385
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Reporter: Reynold Xin

 Just a ticket to track various efforts related to improving shuffle in Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4106) Shuffle write and spill to disk metrics are incorrect

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4106:
-
Target Version/s:   (was: 1.2.0)

 Shuffle write and spill to disk metrics are incorrect
 -

 Key: SPARK-4106
 URL: https://issues.apache.org/jira/browse/SPARK-4106
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Priority: Critical

 I have encountered a job which has some disk spilled (memory) but the disk 
 spilled (disk) is 0, as well as the shuffle write. If I switch to hash based 
 shuffle, where there happens to be no disk spilling, then the shuffle write 
 is correct.
 I can get more info on a workload to repro this situation, but perhaps that 
 state of events is sufficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4356) Test Scala 2.11 on Jenkins

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-4356:
-
Target Version/s:   (was: 1.2.0)

 Test Scala 2.11 on Jenkins
 --

 Key: SPARK-4356
 URL: https://issues.apache.org/jira/browse/SPARK-4356
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Critical

 We need to make some modifications to the test harness so that we can test 
 Scala 2.11 in Maven regularly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3115) Improve task broadcast latency for small tasks

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3115:
-
Target Version/s:   (was: 1.2.0)

 Improve task broadcast latency for small tasks
 --

 Key: SPARK-3115
 URL: https://issues.apache.org/jira/browse/SPARK-3115
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Shivaram Venkataraman
Assignee: Reynold Xin

 Broadcasting the task information helps reduce the amount of data transferred 
 for large tasks. However we've seen that this adds more latency for small 
 tasks. It'll be great to profile and fix this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1823:
-
Target Version/s:   (was: 1.2.0)

 ExternalAppendOnlyMap can still OOM if one key is very large
 

 Key: SPARK-1823
 URL: https://issues.apache.org/jira/browse/SPARK-1823
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Andrew Or

 If the values for one key do not collectively fit into memory, then the map 
 will still OOM when you merge the spilled contents back in.
 This is a problem especially for PySpark, since we hash the keys (Python 
 objects) before a shuffle, and there are only so many integers out there in 
 the world, so there could potentially be many collisions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3374) Spark on Yarn config cleanup

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3374:
-
Target Version/s:   (was: 1.2.0)

 Spark on Yarn config cleanup
 

 Key: SPARK-3374
 URL: https://issues.apache.org/jira/browse/SPARK-3374
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 1.1.0
Reporter: Thomas Graves

 The configs in yarn have gotten scattered and inconsistent between cluster 
 and client modes and supporting backwards compatibility.  We should try to 
 clean this up, move things to common places and support configs across both 
 cluster and client modes where we want to make them public.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-2365:
-
Target Version/s:   (was: 1.2.0)

 Add IndexedRDD, an efficient updatable key-value store
 --

 Key: SPARK-2365
 URL: https://issues.apache.org/jira/browse/SPARK-2365
 Project: Spark
  Issue Type: New Feature
  Components: GraphX, Spark Core
Reporter: Ankur Dave
Assignee: Ankur Dave
 Attachments: 2014-07-07-IndexedRDD-design-review.pdf


 RDDs currently provide a bulk-updatable, iterator-based interface. This 
 imposes minimal requirements on the storage layer, which only needs to 
 support sequential access, enabling on-disk and serialized storage.
 However, many applications would benefit from a richer interface. Efficient 
 support for point lookups would enable serving data out of RDDs, but it 
 currently requires iterating over an entire partition to find the desired 
 element. Point updates similarly require copying an entire iterator. Joins 
 are also expensive, requiring a shuffle and local hash joins.
 To address these problems, we propose IndexedRDD, an efficient key-value 
 store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key 
 uniqueness and pre-indexing the entries for efficient joins and point 
 lookups, updates, and deletions.
 It would be implemented by (1) hash-partitioning the entries by key, (2) 
 maintaining a hash index within each partition, and (3) using purely 
 functional (immutable and efficiently updatable) data structures to enable 
 efficient modifications and deletions.
 GraphX would be the first user of IndexedRDD, since it currently implements a 
 limited form of this functionality in VertexRDD. We envision a variety of 
 other uses for IndexedRDD, including streaming updates to RDDs, direct 
 serving from RDDs, and as an execution strategy for Spark SQL.
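 For context, a small sketch of today's behaviour rather than of IndexedRDD itself: 
 with a known partitioner, lookup() only visits one partition, but it still scans that 
 partition linearly; the proposed per-partition hash index would turn the within-partition 
 step into a constant-time probe.
 {code}
 import org.apache.spark.{HashPartitioner, SparkContext}

 object PointLookupSketch {
   def demo(sc: SparkContext): Seq[String] = {
     val kv = sc.parallelize((1L to 100000L).map(k => (k, s"value-$k")))
                .partitionBy(new HashPartitioner(16))
                .cache()
     kv.lookup(42L)   // routed to a single partition via the partitioner
   }
 }
 {code}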



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3441:
-
Target Version/s:   (was: 1.2.0)

 Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style 
 shuffle
 ---

 Key: SPARK-3441
 URL: https://issues.apache.org/jira/browse/SPARK-3441
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Patrick Wendell
Assignee: Sandy Ryza

 I think it would be good to say something like this in the doc for 
 repartitionAndSortWithinPartitions, and maybe also in the doc for groupBy:
 {code}
 This can be used to enact a Hadoop Style shuffle along with a call to 
 mapPartitions, e.g.:
rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
 {code}
 It might also be nice to add a version that doesn't take a partitioner and/or 
 to mention this in the groupBy javadoc. I guess it depends a bit whether we 
 consider this to be an API we want people to use more widely or whether we 
 just consider it a narrow stable API mostly for Hive-on-Spark. If we want 
 people to consider this API when porting workloads from Hadoop, then it might 
 be worth documenting better.
 What do you think [~rxin] and [~matei]?
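 For what it's worth, a small self-contained illustration of that pattern (the data 
 and the partition count are arbitrary):
 {code}
 import org.apache.spark.{HashPartitioner, SparkContext}

 object HadoopStyleShuffleSketch {
   def run(sc: SparkContext): Unit = {
     val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("a", 3), ("c", 5)))
     val part = new HashPartitioner(2)

     pairs
       .repartitionAndSortWithinPartitions(part)  // shuffle; keys arrive sorted per partition
       .mapPartitions { iter =>
         // Like a Hadoop reduce phase: consume each partition's records in key order.
         iter.map { case (k, v) => s"$k -> $v" }
       }
       .collect()
       .foreach(println)
   }
 }
 {code}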



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3629) Improvements to YARN doc

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3629:
-
Target Version/s:   (was: 1.1.1, 1.2.0)

 Improvements to YARN doc
 

 Key: SPARK-3629
 URL: https://issues.apache.org/jira/browse/SPARK-3629
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, YARN
Reporter: Matei Zaharia
  Labels: starter

 Right now this doc starts off with a big list of config options, and only 
 then tells you how to submit an app. It would be better to put that part and 
 the packaging part first, and the config options only at the end.
 In addition, the doc mentions yarn-cluster vs yarn-client as separate 
 masters, which is inconsistent with the help output from spark-submit (which 
 says to always use yarn).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3982) receiverStream in Python API

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3982:
-
Target Version/s:   (was: 1.2.0)

 receiverStream in Python API
 

 Key: SPARK-3982
 URL: https://issues.apache.org/jira/browse/SPARK-3982
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, Streaming
Reporter: Davies Liu
Assignee: Davies Liu

 receiverStream() is used to extend the input sources of streaming; it would be 
 very useful to have it in the Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7285) Audit missing Hive functions

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7285:
---
Description: 
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
toDeg  - toDegrees
toRad - toRadians
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String


*conditional functions*

if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T


*string functions*

ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array<string>): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string

parse_url(string urlString, string partToExtract [, string keyToExtract]): 
string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): 
string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map<string, string>
trim(string A): string
unbase64(string str): binary
upper(string A) ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string


*Misc*

hash(a1[, a2…]): int


*text*

context_ngrams(array<array<string>>, array<string>, int K, int pf): array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>


*UDAF*

var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array<double>
percentile_approx: array<double>
histogram_numeric: array<struct {'x','y'}>
collect_set  — we have hashset
collect_list 
ntile








  was:
Create a list of functions that is on this page but not in SQL/DataFrame.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

Here's the list of missing stuff:

*basic*

between
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT

*math*

round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
toDeg  - toDegrees
toRad - toRadians
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)

*collection functions*

sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean

*date functions*

from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date); int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String



[jira] [Resolved] (SPARK-4784) Model.fittingParamMap should store all Params

2015-05-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-4784.
--
Resolution: Not A Problem

[SPARK-5956] removed fittingParamMap, so this is no longer an issue

 Model.fittingParamMap should store all Params
 -

 Key: SPARK-4784
 URL: https://issues.apache.org/jira/browse/SPARK-4784
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Minor

 spark.ml's Model class should store all parameters in the fittingParamMap, 
 not just the ones which were explicitly set.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7394:
---
Description: 
Basically alias astype == cast in Column for Python (and Python only).


  was:Basically alias astype == cast in Column for Python.


 Add Pandas style cast (astype)
 --

 Key: SPARK-7394
 URL: https://issues.apache.org/jira/browse/SPARK-7394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter
 Fix For: 1.4.0


 Basically alias astype == cast in Column for Python (and Python only).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)

2015-05-06 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530063#comment-14530063
 ] 

Reynold Xin commented on SPARK-7394:


cc [~smacat] would you like to do this one as well?

 Add Pandas style cast (astype)
 --

 Key: SPARK-7394
 URL: https://issues.apache.org/jira/browse/SPARK-7394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter
 Fix For: 1.4.0


 Basically alias astype == cast in Column for Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7395) some suggestion about SimpleApp in quick-start.html

2015-05-06 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7395.
--
  Resolution: Not A Problem
   Fix Version/s: (was: 1.4.0)
Target Version/s:   (was: 1.3.1)

Please review 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first

Don't set target or fix version.

You set a master when you run the example with spark-submit, not in the code.

 some suggestion about SimpleApp in quick-start.html
 ---

 Key: SPARK-7395
 URL: https://issues.apache.org/jira/browse/SPARK-7395
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.3.1
 Environment: none
Reporter: zhengbing li
   Original Estimate: 12h
  Remaining Estimate: 12h

 Based on the code guide for SimpleApp in 
 https://spark.apache.org/docs/latest/quick-start.html, I could not run the 
 SimpleApp code until I modified val conf = new SparkConf().setAppName("Simple 
 Application") to val conf = new SparkConf().setAppName("Simple 
 Application").setMaster("local").
 So the document might be modified for beginners.
 The error of scala example is as follows:
 15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0
 Exception in thread "main" org.apache.spark.SparkException: A master URL must 
 be set in your configuration
   at org.apache.spark.SparkContext.<init>(SparkContext.scala:206)
   at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12)
   at com.huawei.openspark.TestSpark.main(TestSpark.scala)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7381) Python API for Transformers

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7381:
-
Assignee: Burak Yavuz

 Python API for Transformers
 ---

 Key: SPARK-7381
 URL: https://issues.apache.org/jira/browse/SPARK-7381
 Project: Spark
  Issue Type: Umbrella
  Components: ML, PySpark
Reporter: Burak Yavuz
Assignee: Burak Yavuz





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7383) Python API for ml.feature

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7383:
-
Assignee: Burak Yavuz

 Python API for ml.feature
 -

 Key: SPARK-7383
 URL: https://issues.apache.org/jira/browse/SPARK-7383
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Burak Yavuz
Assignee: Burak Yavuz





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7382) Python API for ml.classification

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7382:
-
Assignee: Burak Yavuz

 Python API for ml.classification
 

 Key: SPARK-7382
 URL: https://issues.apache.org/jira/browse/SPARK-7382
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Burak Yavuz
Assignee: Burak Yavuz





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7383) Python API for ml.feature

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7383:
-
Priority: Major  (was: Blocker)

 Python API for ml.feature
 -

 Key: SPARK-7383
 URL: https://issues.apache.org/jira/browse/SPARK-7383
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Burak Yavuz





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7388) Python Api for Param[Array[T]]

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7388:
-
Assignee: Burak Yavuz

 Python Api for Param[Array[T]]
 --

 Key: SPARK-7388
 URL: https://issues.apache.org/jira/browse/SPARK-7388
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Burak Yavuz
Assignee: Burak Yavuz

 Python can't set Array[T] type params, because py4j casts a list to an 
 ArrayList. Instead of Param[Array[T]], we will have an ArrayParam[T] which can 
 take a Seq[T].






[jira] [Updated] (SPARK-7381) Python API for Transformers

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7381:
-
Priority: Major  (was: Blocker)

 Python API for Transformers
 ---

 Key: SPARK-7381
 URL: https://issues.apache.org/jira/browse/SPARK-7381
 Project: Spark
  Issue Type: Umbrella
  Components: ML, PySpark
Reporter: Burak Yavuz








[jira] [Updated] (SPARK-7382) Python API for ml.classification

2015-05-06 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7382:
-
Priority: Major  (was: Blocker)

 Python API for ml.classification
 

 Key: SPARK-7382
 URL: https://issues.apache.org/jira/browse/SPARK-7382
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Burak Yavuz








[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected

2015-05-06 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530106#comment-14530106
 ] 

Sun Rui commented on SPARK-6812:


According to the R manual 
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html): if a 
function .First is found on the search path, it is executed as .First(). 
Finally, function .First.sys() in the base package is run, which calls require 
to attach the default packages specified by options("defaultPackages").

In .First() in profile/shell.R, we load the SparkR package. This means the 
SparkR package is loaded before the default packages. If a default package 
exports a function with the same name, it masks the one from SparkR. This is 
why filter() in SparkR is masked by filter() in stats, which is usually in the 
default package list.

We need to make sure SparkR is loaded after default packages. The solution is 
to append SparkR to default packages, instead of loading SparkR in .First().


 filter() on DataFrame does not work as expected
 ---

 Key: SPARK-6812
 URL: https://issues.apache.org/jira/browse/SPARK-6812
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Davies Liu
Assignee: Sun Rui
Priority: Blocker

 {code}
  filter(df, df$age > 21)
 Error in filter(df, df$age > 21) :
   no method for coercing this S4 class to a vector
 {code}






[jira] [Assigned] (SPARK-7396) Update Kafka example to use new API of Kafka 0.8.2

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7396:
---

Assignee: Apache Spark

 Update Kafka example to use new API of Kafka 0.8.2
 --

 Key: SPARK-7396
 URL: https://issues.apache.org/jira/browse/SPARK-7396
 Project: Spark
  Issue Type: Bug
  Components: Examples, Streaming
Affects Versions: 1.4.0
Reporter: Saisai Shao
Assignee: Apache Spark

 Due to the upgrade of Kafka, the current KafkaWordCountProducer will throw the 
 exception below, so we need to update the code accordingly.
 {code}
 Exception in thread "main" kafka.common.FailedToSendMessageException: Failed 
 to send messages after 3 tries.
   at 
 kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
   at kafka.producer.Producer.send(Producer.scala:77)
   at 
 org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
   at 
 org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
   at 
 org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}






[jira] [Created] (SPARK-7393) How to improve Spark SQL performance?

2015-05-06 Thread Liang Lee (JIRA)
Liang Lee created SPARK-7393:


 Summary: How to improve Spark SQL performance?
 Key: SPARK-7393
 URL: https://issues.apache.org/jira/browse/SPARK-7393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang Lee









[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7394:
---
Labels: starter  (was: )

 Add Pandas style cast (astype)
 --

 Key: SPARK-7394
 URL: https://issues.apache.org/jira/browse/SPARK-7394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter
 Fix For: 1.4.0


 Basically alias astype == cast in Column.






[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)

2015-05-06 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-7394:
---
Description: Basically alias astype == cast in Column for Python.  (was: 
Basically alias astype == cast in Column.)

 Add Pandas style cast (astype)
 --

 Key: SPARK-7394
 URL: https://issues.apache.org/jira/browse/SPARK-7394
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: starter
 Fix For: 1.4.0


 Basically alias astype == cast in Column for Python.
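 A minimal sketch of the intended PySpark usage (illustration only; assumes the modern SparkSession entry point, with astype simply delegating to cast):
 {code}
 from pyspark.sql import SparkSession

 spark = SparkSession.builder.getOrCreate()
 df = spark.createDataFrame([(1, "30")], ["id", "age"])

 df.select(df.age.cast("double")).show()    # existing cast
 df.select(df.age.astype("double")).show()  # Pandas-style alias added here
 {code}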






[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework

2015-05-06 Thread Bharath Ravi Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530117#comment-14530117
 ] 

Bharath Ravi Kumar commented on SPARK-6284:
---

[~tnachen] This issue makes Spark on Mesos a non-starter in a multi-tenant 
Mesos production setup. Is there any possibility that this will be back-ported 
into a minor release (1.3.x or 1.2.x)?
On a related note, is any further assistance required for testing this change? 
I could help with that. Thanks.

 Support framework authentication and role in Mesos framework
 

 Key: SPARK-6284
 URL: https://issues.apache.org/jira/browse/SPARK-6284
 Project: Spark
  Issue Type: Improvement
  Components: Mesos
Reporter: Timothy Chen

 Support framework authentication and role in both Coarse grain and fine grain 
 mode.






[jira] [Comment Edited] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-06 Thread Hrishikesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530438#comment-14530438
 ] 

Hrishikesh edited comment on SPARK-6258 at 5/6/15 12:34 PM:


[~yanboliang], I got stuck at one stage. You can work on it.


was (Author: hrishikesh91):
[~yanboliang], you can start working on it.

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Reopened] (SPARK-3454) Expose JSON representation of data shown in WebUI

2015-05-06 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reopened SPARK-3454:
-

need to fix some issues w/ test files in pr ...

 Expose JSON representation of data shown in WebUI
 -

 Key: SPARK-3454
 URL: https://issues.apache.org/jira/browse/SPARK-3454
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Assignee: Imran Rashid
 Fix For: 1.4.0

 Attachments: sparkmonitoringjsondesign.pdf


 If the WebUI supported extracting its data as JSON, it would be helpful for users 
 who want to analyse stage / task / executor information.
 Fortunately, WebUI has a renderJson method, so we can implement the method in 
 each subclass.






[jira] [Comment Edited] (SPARK-7116) Intermediate RDD cached but never unpersisted

2015-05-06 Thread Dennis Proppe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530390#comment-14530390
 ] 

Dennis Proppe edited comment on SPARK-7116 at 5/6/15 11:47 AM:
---

*I second the importance of fixing this.*

For our production workloads, this is a serious issue that blocks working with 
large data:

1) It creates non-removable RDDs cached in memory
2) They can't even be removed by deleting the RDD in question in the code
3) The RDDs are up to 10 times bigger than they would be if cached via df.cache()

The behaviour can be reproduced easily in the PySpark Shell:

---
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

slen = udf(lambda s: len(s), IntegerType())
rows = [["a", "hello"], ["b", "goodbye"], ["c", "whatever"]] * 1000
rd = sc.parallelize(rows)
head = ['order', 'word']

df = rd.toDF(head)
df.cache().count()

bigfile = df.withColumn("word_length", slen(df.word))
bigfile.count()
---

If you now compare the size of the cached df with the automagically appeared 
cached bigfile, you see that bigfile uses about 8X the storage of df, although 
it only has one more column.


was (Author: dproppe):
*I second the importance of fixing this.*

For our production workloads, this is really a large issue and blocks the use 
of large data:

1) It creates non-removeable RDDs cached in memory
2) They can't even be removed by deleting the RDD in question in the code
3) The RDDs are up to 10 times bigger that they would be by calling df.cache()

The behaviour can be reproduced easily in the PySpark Shell:


from pyspark import SQLContext
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

sqlcon = SQLContext(sc)
slen = udf(lambda s: len(s), IntegerType())

rows = [["a", "hello"], ["b", "goodbye"], ["c", "whatever"]] * 1000
rd = sc.parallelize(rows)
head = ['order', 'word']
df = rd.toDF(head)
df.cache().count()

bigfile = df.withColumn("word_length", slen(df.word))
bigfile.count()

If you now compare the size of the cached df versus the automagically appeared 
cached bigfile, you see that bigfile uses about 8X the storage of df, although 
it only has 1 column more.

 Intermediate RDD cached but never unpersisted
 -

 Key: SPARK-7116
 URL: https://issues.apache.org/jira/browse/SPARK-7116
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.3.1
Reporter: Kalle Jepsen

 In 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala#L233
  an intermediate RDD is cached, but never unpersisted. It shows up in the 
 'Storage' section of the Web UI, but cannot be removed. There's already a 
 comment in the source, suggesting to 'clean up'. If that cleanup is more 
 involved than simply calling `unpersist`, it probably exceeds my current 
 Scala skills.
 Why that is a problem:
 I'm adding a constant column to a DataFrame of about 20M records resulting 
 from an inner join with {{df.withColumn(colname, ud_func())}} , where 
 {{ud_func}} is simply a wrapped {{lambda: 1}}. Before and after applying the 
 UDF the DataFrame takes up ~430MB in the cache. The cached intermediate RDD 
 however takes up ~10GB(!) of storage, and I know of no way to uncache it.






[jira] [Assigned] (SPARK-5913) Python API for ChiSqSelector

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5913:
---

Assignee: Apache Spark  (was: Yanbo Liang)

 Python API for ChiSqSelector
 

 Key: SPARK-5913
 URL: https://issues.apache.org/jira/browse/SPARK-5913
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Apache Spark
Priority: Minor

 Add a Python API for mllib.feature.ChiSqSelector






[jira] [Assigned] (SPARK-5913) Python API for ChiSqSelector

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5913:
---

Assignee: Yanbo Liang  (was: Apache Spark)

 Python API for ChiSqSelector
 

 Key: SPARK-5913
 URL: https://issues.apache.org/jira/browse/SPARK-5913
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Yanbo Liang
Priority: Minor

 Add a Python API for mllib.feature.ChiSqSelector






[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-06 Thread Hrishikesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530438#comment-14530438
 ] 

Hrishikesh commented on SPARK-6258:
---

[~yanboliang], you can start working on it.

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530434#comment-14530434
 ] 

Yanbo Liang commented on SPARK-6258:


[~hrishikesh] Are you still working on this issue? If you are not working on it, I 
can take it. [~josephkb]

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Comment Edited] (SPARK-6258) Python MLlib API missing items: Clustering

2015-05-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530434#comment-14530434
 ] 

Yanbo Liang edited comment on SPARK-6258 at 5/6/15 12:26 PM:
-

[~hrishikesh] Are you still working on this issue? If you are not working on 
it, I can take it. [~josephkb]


was (Author: yanboliang):
[~hrishikesh] Are you still work on this issue? If you are not working on it, I 
can take it. [~josephkb]

 Python MLlib API missing items: Clustering
 --

 Key: SPARK-6258
 URL: https://issues.apache.org/jira/browse/SPARK-6258
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 This JIRA lists items missing in the Python API for this sub-package of MLlib.
 This list may be incomplete, so please check again when sending a PR to add 
 these features to the Python API.
 Also, please check for major disparities between documentation; some parts of 
 the Python API are less well-documented than their Scala counterparts.  Some 
 items may be listed in the umbrella JIRA linked to this task.
 KMeans
 * setEpsilon
 * setInitializationSteps
 KMeansModel
 * computeCost
 * k
 GaussianMixture
 * setInitialModel
 GaussianMixtureModel
 * k
 Completely missing items which should be fixed in separate JIRAs (which have 
 been created and linked to the umbrella JIRA)
 * LDA
 * PowerIterationClustering
 * StreamingKMeans






[jira] [Updated] (SPARK-7369) Spark Python 1.3.1 Mllib dataframe random forest problem

2015-05-06 Thread Lisbeth Ron (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lisbeth Ron updated SPARK-7369:
---
Attachment: random_forest_dataframe_spark_30042015.py

Hi Sean,

I still have problems with Python Spark; here are the errors and also the
code that I'm using.

Thanks

Lisbeth



15/05/06 13:14:24 INFO ContextCleaner: Cleaned broadcast 1
15/05/06 13:14:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory
on node001.ca-innovation.fr:47882 (size: 11.0 KB, free: 8.3 GB)
15/05/06 13:14:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory
on node006.ca-innovation.fr:50830 (size: 11.0 KB, free: 8.3 GB)
15/05/06 13:14:25 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 5,
node001.ca-innovation.fr): java.lang.NullPointerException
at
org.apache.spark.api.python.SerDeUtil$$anonfun$toJavaArray$1.apply(SerDeUtil.scala:106)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:123)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:114)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
at
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:421)
at
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)

15/05/06 13:14:25 INFO TaskSetManager: Starting task 0.1 in stage 3.0 (TID
7, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:25 INFO TaskSetManager: Lost task 1.0 in stage 3.0 (TID 6)
on executor node006.ca-innovation.fr: java.lang.NullPointerException (null)
[duplicate 1]
15/05/06 13:14:25 INFO TaskSetManager: Starting task 1.1 in stage 3.0 (TID
8, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:26 INFO TaskSetManager: Lost task 0.1 in stage 3.0 (TID 7)
on executor node001.ca-innovation.fr: java.lang.NullPointerException (null)
[duplicate 2]
15/05/06 13:14:26 INFO TaskSetManager: Starting task 0.2 in stage 3.0 (TID
9, node006.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:26 INFO TaskSetManager: Lost task 1.1 in stage 3.0 (TID 8)
on executor node001.ca-innovation.fr: java.lang.NullPointerException (null)
[duplicate 3]
15/05/06 13:14:26 INFO TaskSetManager: Starting task 1.2 in stage 3.0 (TID
10, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:27 INFO TaskSetManager: Lost task 0.2 in stage 3.0 (TID 9)
on executor node006.ca-innovation.fr: java.lang.NullPointerException (null)
[duplicate 4]
15/05/06 13:14:27 INFO TaskSetManager: Starting task 0.3 in stage 3.0 (TID
11, node006.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:27 INFO TaskSetManager: Lost task 1.2 in stage 3.0 (TID 10)
on executor node001.ca-innovation.fr: java.lang.NullPointerException (null)
[duplicate 5]
15/05/06 13:14:27 INFO TaskSetManager: Starting task 1.3 in stage 3.0 (TID
12, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:28 INFO TaskSetManager: Lost task 0.3 in stage 3.0 (TID 11)
on executor node006.ca-innovation.fr: java.lang.NullPointerException (null)
[duplicate 6]
15/05/06 13:14:28 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times;
aborting job
15/05/06 13:14:28 INFO TaskSchedulerImpl: Cancelling stage 3
15/05/06 13:14:28 INFO TaskSchedulerImpl: Stage 3 was cancelled
15/05/06 13:14:28 INFO DAGScheduler: Stage 3 (count at
/mapr/MapR-Cluster/casarisk/data/POCGRO/Codes/Spark_python/RF_Python_Spark_30042015/random_forest_dataframe_spark_30042015.py:79)
failed in 4.025 s
15/05/06 13:14:28 INFO DAGScheduler: Job 3 failed: count at
/mapr/MapR-Cluster/casarisk/data/POCGRO/Codes/Spark_python/RF_Python_Spark_30042015/random_forest_dataframe_spark_30042015.py:79,
took 4.052326 s
Traceback (most recent call last):
  File
/mapr/MapR-Cluster/casarisk/data/POCGRO/Codes/Spark_python/RF_Python_Spark_30042015/random_forest_dataframe_spark_30042015.py,
line 79, in <module>
print trainingData.count()
  File /opt/mapr/spark/spark-1.3.1-bin-mapr4/python/pyspark/rdd.py, line
932, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File /opt/mapr/spark/spark-1.3.1-bin-mapr4/python/pyspark/rdd.py, line
923, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File /opt/mapr/spark/spark-1.3.1-bin-mapr4/python/pyspark/rdd.py, 

[jira] [Updated] (SPARK-7322) Add DataFrame DSL for window function support

2015-05-06 Thread Olivier Girardot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Girardot updated SPARK-7322:

Labels: DataFrame  (was: )

 Add DataFrame DSL for window function support
 -

 Key: SPARK-7322
 URL: https://issues.apache.org/jira/browse/SPARK-7322
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
  Labels: DataFrame

 Here's a proposal for supporting window functions in the DataFrame DSL:
 1. Add an over function to Column:
 {code}
 class Column {
   ...
   def over(): WindowFunctionSpec
   ...
 }
 {code}
 2. WindowFunctionSpec:
 {code}
 // By default frame = full partition
 class WindowFunctionSpec {
   def partitionBy(cols: Column*): WindowFunctionSpec
   def orderBy(cols: Column*): WindowFunctionSpec
   // restrict frame beginning from current row - n position
   def rowsPreceding(n: Int): WindowFunctionSpec
   // restrict frame ending at current row + n position
   def rowsFollowing(n: Int): WindowFunctionSpec
   def rangePreceding(n: Int): WindowFunctionSpec
   def rangeFollowing(n: Int): WindowFunctionSpec
 }
 {code}
 Here's an example to use it:
 {code}
 df.select(
   df.store,
   df.date,
   df.sales,
   avg(df.sales).over.partitionBy(df.store)
                     .orderBy(df.store)
                     .rowsFollowing(0)  // this means from unbounded preceding to current row
 )
 {code}
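 For reference, a rough sketch of how such a window spec reads in the PySpark DataFrame window API that shipped in later releases (pyspark.sql.Window); an illustration only, not part of the proposal above:
 {code}
 from pyspark.sql import SparkSession, functions as F
 from pyspark.sql.window import Window

 spark = SparkSession.builder.getOrCreate()
 df = spark.createDataFrame(
     [("s1", "2015-05-01", 10.0), ("s1", "2015-05-02", 20.0)],
     ["store", "date", "sales"])

 # frame from unbounded preceding to the current row, per store, ordered by date
 w = (Window.partitionBy("store")
            .orderBy("date")
            .rowsBetween(Window.unboundedPreceding, Window.currentRow))

 df.select("store", "date", "sales",
           F.avg("sales").over(w).alias("running_avg")).show()
 {code}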






[jira] [Commented] (SPARK-3454) Expose JSON representation of data shown in WebUI

2015-05-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530562#comment-14530562
 ] 

Apache Spark commented on SPARK-3454:
-

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/5940

 Expose JSON representation of data shown in WebUI
 -

 Key: SPARK-3454
 URL: https://issues.apache.org/jira/browse/SPARK-3454
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Assignee: Imran Rashid
 Fix For: 1.4.0

 Attachments: sparkmonitoringjsondesign.pdf


 If the WebUI supported extracting its data as JSON, it would be helpful for users 
 who want to analyse stage / task / executor information.
 Fortunately, WebUI has a renderJson method, so we can implement the method in 
 each subclass.






[jira] [Created] (SPARK-7398) Add back-pressure to Spark Streaming

2015-05-06 Thread JIRA
François Garillot created SPARK-7398:


 Summary: Add back-pressure to Spark Streaming
 Key: SPARK-7398
 URL: https://issues.apache.org/jira/browse/SPARK-7398
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.3.1
Reporter: François Garillot


Spark Streaming has trouble dealing with situations where 
 batch processing time > batch interval
i.e. the input data arrives faster than Spark is able to remove it from the 
queue.

If this throughput is sustained for long enough, it leads to an unstable 
situation where the memory of the Receiver's Executor overflows. This issue aims 
at transmitting a back-pressure signal back to the data ingestion side, to help 
it cope with that high throughput in a backwards-compatible way.

The design doc can be found here:
https://docs.google.com/document/d/15m8N717n4hwFya_pJHUbiv5s50Mq1SZZzL9ftOhms4E/edit?usp=sharing
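For context, a minimal sketch of the rate-control settings involved: spark.streaming.receiver.maxRate is the static cap that already exists, while the back-pressure switch below only exists in releases that ship the mechanism proposed here.
{code}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.streaming.receiver.maxRate", "10000")       # static cap, records/sec per receiver
        .set("spark.streaming.backpressure.enabled", "true"))   # dynamic back-pressure (later releases)
{code}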






[jira] [Assigned] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6093:
---

Assignee: Apache Spark

 Add RegressionMetrics in PySpark/MLlib
 --

 Key: SPARK-6093
 URL: https://issues.apache.org/jira/browse/SPARK-6093
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Apache Spark








[jira] [Commented] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib

2015-05-06 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530581#comment-14530581
 ] 

Apache Spark commented on SPARK-6093:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/5941

 Add RegressionMetrics in PySpark/MLlib
 --

 Key: SPARK-6093
 URL: https://issues.apache.org/jira/browse/SPARK-6093
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng








[jira] [Commented] (SPARK-7116) Intermediate RDD cached but never unpersisted

2015-05-06 Thread Kalle Jepsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530580#comment-14530580
 ] 

Kalle Jepsen commented on SPARK-7116:
-

[~marmbrus] Do you remember why that {{cache()}} was necessary? I've boldly 
commented it out and the only thing that seems to have changed is that 
everything runs a lot faster...

 Intermediate RDD cached but never unpersisted
 -

 Key: SPARK-7116
 URL: https://issues.apache.org/jira/browse/SPARK-7116
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.3.1
Reporter: Kalle Jepsen

 In 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala#L233
  an intermediate RDD is cached, but never unpersisted. It shows up in the 
 'Storage' section of the Web UI, but cannot be removed. There's already a 
 comment in the source, suggesting to 'clean up'. If that cleanup is more 
 involved than simply calling `unpersist`, it probably exceeds my current 
 Scala skills.
 Why that is a problem:
 I'm adding a constant column to a DataFrame of about 20M records resulting 
 from an inner join with {{df.withColumn(colname, ud_func())}} , where 
 {{ud_func}} is simply a wrapped {{lambda: 1}}. Before and after applying the 
 UDF the DataFrame takes up ~430MB in the cache. The cached intermediate RDD 
 however takes up ~10GB(!) of storage, and I know of no way to uncache it.






[jira] [Assigned] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6093:
---

Assignee: (was: Apache Spark)

 Add RegressionMetrics in PySpark/MLlib
 --

 Key: SPARK-6093
 URL: https://issues.apache.org/jira/browse/SPARK-6093
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng








[jira] [Commented] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib

2015-05-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530586#comment-14530586
 ] 

Yanbo Liang commented on SPARK-6093:


[~mengxr] Could you assign this to me?

 Add RegressionMetrics in PySpark/MLlib
 --

 Key: SPARK-6093
 URL: https://issues.apache.org/jira/browse/SPARK-6093
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Reporter: Xiangrui Meng








[jira] [Assigned] (SPARK-7035) Drop __getattr__ on pyspark.sql.DataFrame

2015-05-06 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7035:
---

Assignee: (was: Apache Spark)

 Drop __getattr__ on pyspark.sql.DataFrame
 -

 Key: SPARK-7035
 URL: https://issues.apache.org/jira/browse/SPARK-7035
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Kalle Jepsen

 I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed.
 There is no point in having the possibility to address the DataFrame's columns 
 as {{df.column}}, other than the questionable goal of pleasing R developers. 
 And it seems R people can use Spark from their native API in the future.
 I see the following problems with {{\_\_getattr\_\_}} for column selection:
 * It's un-pythonic: There should only be one obvious way to solve a problem, 
 and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} 
 method, which in my opinion is by far superior and a lot more intuitive.
 * It leads to confusing Exceptions. When we mistype a method-name the 
 {{AttributeError}} will say 'No such column ... '.
 * And most importantly: we cannot load DataFrames that have columns with the 
 same name as any attribute on the DataFrame-object. Imagine having a 
 DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} 
 will be ambiguous and lead to broken code (see the sketch below).
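 A minimal sketch of that ambiguity (column names are hypothetical; the modern SparkSession entry point is assumed):
 {code}
 from pyspark.sql import SparkSession

 spark = SparkSession.builder.getOrCreate()
 df = spark.createDataFrame([(1, 2)], ["id", "cache"])

 print(df.cache)     # resolves to the DataFrame.cache method; __getattr__ never fires
 print(df["cache"])  # __getitem__ is unambiguous and always returns the Column
 {code}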






[jira] [Commented] (SPARK-7393) How to improve Spark SQL performance?

2015-05-06 Thread Dennis Proppe (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530636#comment-14530636
 ] 

Dennis Proppe commented on SPARK-7393:
--

Hi, Liang Lee,

without more information (HDD or SSD?), it is quite hard to reproduce this. Do 
you cache the DF before querying it? Otherwise, you'd be reading the data from 
the parquet files on disk every time you query it. In that case, 3 s for a select 
on a table of 61 million rows sounds impressive.

In our org, we found that working along the lines of:

df = sqlContext.load("file.parquet")
df.cache().count()
selection = df.where("foo = bar")

works really fast and may deliver good response times in your case. 

 How to improve Spark SQL performance?
 -

 Key: SPARK-7393
 URL: https://issues.apache.org/jira/browse/SPARK-7393
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang Lee

 We want to use Spark SQL in our project, but we found that Spark SQL 
 performance is not as good as we expected. The details are as follows:
  1. We save the data as parquet files on HDFS.
  2. We just select one or several rows from the parquet file using Spark SQL.
  3. When the total record number is 61 million, it needs about 3 seconds to 
 get the result, which is unacceptably long for our scenario.
  4. When the total record number is 2 million, it needs about 93 ms to get the 
 result, which is still a little long for us.
  5. The query statement is like: SELECT * FROM DBA WHERE COLA=? AND COLB=?, 
 and the table is not complex: it has fewer than 10 columns and the content of 
 each column is less than 100 bytes.
  6. Does anyone know how to improve the performance or have other ideas?
  7. Can Spark SQL support microsecond-level response? 





