[jira] [Resolved] (SPARK-7310) SparkSubmit does not escape for java options and ^ won't work
[ https://issues.apache.org/jira/browse/SPARK-7310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7310. -- Resolution: Not A Problem OK, sounds like the particular escaping issue is no longer a problem as far as we can tell. SparkSubmit does not escape for java options and ^ won't work Key: SPARK-7310 URL: https://issues.apache.org/jira/browse/SPARK-7310 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.1, 1.4.0 Reporter: Yitong Zhou Priority: Minor I can create the error when doing something like:
{code}
LIBJARS=/jars.../
bin/spark-submit \
--driver-java-options -Djob.url=http://www.foo.bar?query=ab; \
--class com.example.Class \
--master yarn-cluster \
--num-executors 3 \
--executor-cores 1 \
--queue default \
--driver-memory 1g \
--executor-memory 1g \
--jars $LIBJARS \
../a.jar \
-inputPath /user/yizhou/CED-scoring/input \
-outputPath /user/yizhou
{code}
Notice that if I remove the special character in the --driver-java-options value, then the submit will succeed. A typical error message looks like this:
{code}
org.apache.hadoop.util.Shell$ExitCodeException: Usage: java [-options] class [args...] (to execute a class) or java [-options] -jar jarfile [args...] (to execute a jar file) where options include: -d32 use a 32-bit data model if available -d64 use a 64-bit data model if available -server to select the server VM The default VM is server, because you are running on a server-class machine. -cp class search path of directories and zip/jar files -classpath class search path of directories and zip/jar files A : separated list of directories, JAR archives, and ZIP archives to search for class files. -Dname=value set a system property -verbose:[class|gc|jni] enable verbose output -version print product version and exit -version:value require the specified version to run -showversion print product version and continue -jre-restrict-search | -no-jre-restrict-search include/exclude user private JREs in the version search -? -help print this help message -X print help on non-standard options -ea[:packagename...|:classname] -enableassertions[:packagename...|:classname] enable assertions with specified granularity -da[:packagename...|:classname] -disableassertions[:packagename...|:classname] disable assertions with specified granularity -esa | -enablesystemassertions enable system assertions -dsa | -disablesystemassertions disable system assertions -agentlib:libname[=options] load native agent library libname, e.g. -agentlib:hprof see also, -agentlib:jdwp=help and -agentlib:hprof=help -agentpath:pathname[=options] load native agent library by full pathname -javaagent:jarpath[=options] load Java programming language agent, see java.lang.instrument -splash:imagepath show splash screen with specified image See http://www.oracle.com/technetwork/java/javase/documentation/index.html for more details.
at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:279) at org.apache.hadoop.yarn.server.nodemanager.PepperdataContainerExecutor.launchContainer(PepperdataContainerExecutor.java:130) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:283) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:79) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7151) Correlation methods for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7151?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-7151. -- Resolution: Duplicate Fix Version/s: 1.4.0 Assignee: Burak Yavuz Target Version/s: 1.4.0 Correlation methods for DataFrame - Key: SPARK-7151 URL: https://issues.apache.org/jira/browse/SPARK-7151 Project: Spark Issue Type: Sub-task Components: ML, SQL Reporter: Joseph K. Bradley Assignee: Burak Yavuz Priority: Minor Labels: dataframe Fix For: 1.4.0 We should support computing correlations between columns in DataFrames with a simple API. This could be a DataFrame feature:
{code}
myDataFrame.corr("col1", "col2")
// or
myDataFrame.corr("col1", "col2", "pearson") // specify correlation type
{code}
Or it could be an MLlib feature:
{code}
Statistics.corr(myDataFrame("col1"), myDataFrame("col2"))
// or
Statistics.corr(myDataFrame, "col1", "col2")
{code}
(The first Statistics.corr option is more flexible, but it could cause trouble if a user tries to pass in 2 unzippable DataFrame columns.) Note: R follows the latter setup. I'm OK with either. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
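For reference, the API that eventually shipped in Spark 1.4 hangs correlation off DataFrameStatFunctions rather than either of the shapes sketched above. A minimal usage sketch, assuming an existing SparkContext named sc; the column names and data are illustrative:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy DataFrame with two numeric columns.
val df = sc.parallelize(Seq((1.0, 2.0), (2.0, 4.1), (3.0, 6.2))).toDF("col1", "col2")

// Pearson is the default (and initially the only) correlation method.
val r: Double = df.stat.corr("col1", "col2", "pearson")
{code}
Putting it under df.stat also sidesteps the unzippable-columns concern above, since both columns must come from the same DataFrame.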
[jira] [Updated] (SPARK-1762) Add functionality to pin RDDs in cache
[ https://issues.apache.org/jira/browse/SPARK-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1762: - Target Version/s: (was: 1.2.0) Add functionality to pin RDDs in cache -- Key: SPARK-1762 URL: https://issues.apache.org/jira/browse/SPARK-1762 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Right now, all RDDs are created equal, and there is no mechanism to identify a certain RDD to be more important than the rest. This is a problem if the RDD fraction is small, because just caching a few RDDs can evict more important ones. A side effect of this feature is that we can now more safely allocate a smaller spark.storage.memoryFraction if we know how large our important RDDs are, without having to worry about them being evicted. This allows us to use more memory for shuffles, for instance, and avoid disk spills. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
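Until a pinning mechanism exists, the closest user-level approximation is choosing storage levels and unpersisting the less important RDDs by hand. A rough sketch, assuming a SparkContext named sc; the paths and names are illustrative:
{code}
import org.apache.spark.storage.StorageLevel

// An RDD we care about: spill to disk rather than lose it under memory pressure.
val important = sc.textFile("hdfs:///data/important").persist(StorageLevel.MEMORY_AND_DISK)
// A cheap-to-recompute RDD: memory only.
val scratch = sc.textFile("hdfs:///data/scratch").persist(StorageLevel.MEMORY_ONLY)

// ... run jobs ...

// Explicitly free the less important RDD so it cannot evict `important`.
scratch.unpersist(blocking = true)
{code}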
[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530065#comment-14530065 ] Chen Song commented on SPARK-7394: -- Ok, I'll work on this after finishing my daily job. Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7395) some suggestion about SimpleApp in quick-start.html
zhengbing li created SPARK-7395: --- Summary: some suggestion about SimpleApp in quick-start.html Key: SPARK-7395 URL: https://issues.apache.org/jira/browse/SPARK-7395 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.1 Environment: none Reporter: zhengbing li Fix For: 1.4.0 Based on the SimpleApp code guide in https://spark.apache.org/docs/latest/quick-start.html, I could not run the SimpleApp code until I modified val conf = new SparkConf().setAppName("Simple Application") to val conf = new SparkConf().setAppName("Simple Application").setMaster("local"). So the document should be amended for beginners. The error from the Scala example is as follows: 15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0 Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration at org.apache.spark.SparkContext.init(SparkContext.scala:206) at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12) at com.huawei.openspark.TestSpark.main(TestSpark.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
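For completeness, a minimal SimpleApp that runs as the reporter describes; setMaster("local[*]") is hardcoded only to make it runnable outside spark-submit, which is the trade-off the quick-start text should call out (normally the master is passed via spark-submit --master):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("Simple Application")
      .setMaster("local[*]") // remove this when submitting with --master
    val sc = new SparkContext(conf)
    val logData = sc.textFile("README.md").cache()
    val numAs = logData.filter(_.contains("a")).count()
    println(s"Lines with a: $numAs")
    sc.stop()
  }
}
{code}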
[jira] [Resolved] (SPARK-6940) PySpark CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-6940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6940. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5926 [https://github.com/apache/spark/pull/5926] PySpark CrossValidator -- Key: SPARK-6940 URL: https://issues.apache.org/jira/browse/SPARK-6940 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 1.3.0 Reporter: Omede Firouz Assignee: Xiangrui Meng Priority: Critical Fix For: 1.4.0 PySpark doesn't currently have wrappers for any of the ML.Tuning classes: CrossValidator, CrossValidatorModel, ParamGridBuilder -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139
[ https://issues.apache.org/jira/browse/SPARK-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7397: --- Assignee: Apache Spark Add missing input information report back to ReceiverInputDStream due to SPARK-7139 --- Key: SPARK-7397 URL: https://issues.apache.org/jira/browse/SPARK-7397 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Saisai Shao Assignee: Apache Spark Input information report is missing due to refactor work of ReceiverInputDStream in SPARK-7139. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139
[ https://issues.apache.org/jira/browse/SPARK-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530218#comment-14530218 ] Apache Spark commented on SPARK-7397: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/5937 Add missing input information report back to ReceiverInputDStream due to SPARK-7139 --- Key: SPARK-7397 URL: https://issues.apache.org/jira/browse/SPARK-7397 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Saisai Shao Input information report is missing due to refactor work of ReceiverInputDStream in SPARK-7139. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139
[ https://issues.apache.org/jira/browse/SPARK-7397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7397: --- Assignee: (was: Apache Spark) Add missing input information report back to ReceiverInputDStream due to SPARK-7139 --- Key: SPARK-7397 URL: https://issues.apache.org/jira/browse/SPARK-7397 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Saisai Shao Input information report is missing due to refactor work of ReceiverInputDStream in SPARK-7139. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6812: --- Assignee: Apache Spark (was: Sun Rui) filter() on DataFrame does not work as expected --- Key: SPARK-6812 URL: https://issues.apache.org/jira/browse/SPARK-6812 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker {code} filter(df, df$age > 21) Error in filter(df, df$age > 21) : no method for coercing this S4 class to a vector {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530268#comment-14530268 ] Apache Spark commented on SPARK-6812: - User 'sun-rui' has created a pull request for this issue: https://github.com/apache/spark/pull/5938 filter() on DataFrame does not work as expected --- Key: SPARK-6812 URL: https://issues.apache.org/jira/browse/SPARK-6812 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu Assignee: Sun Rui Priority: Blocker {code} filter(df, df$age > 21) Error in filter(df, df$age > 21) : no method for coercing this S4 class to a vector {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7397) Add missing input information report back to ReceiverInputDStream due to SPARK-7139
Saisai Shao created SPARK-7397: -- Summary: Add missing input information report back to ReceiverInputDStream due to SPARK-7139 Key: SPARK-7397 URL: https://issues.apache.org/jira/browse/SPARK-7397 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.4.0 Reporter: Saisai Shao Input information report is missing due to refactor work of ReceiverInputDStream in SPARK-7139. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5281) Registering table on RDD is giving MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-5281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530224#comment-14530224 ] Iulian Dragos commented on SPARK-5281: -- Here's my workaround from [this stack overflow question|https://stackoverflow.com/questions/29796928/whats-the-most-efficient-way-to-filter-a-dataframe]:
- find your launch configuration and go to Classpath
- remove Scala Library and Scala Compiler from the Bootstrap entries
- add (as external jars) scala-reflect, scala-library and scala-compiler to the user entries
Make sure to add the right version (2.10.4 at this point). Registering table on RDD is giving MissingRequirementError -- Key: SPARK-5281 URL: https://issues.apache.org/jira/browse/SPARK-5281 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.1 Reporter: sarsol Priority: Critical Application crashes on this line {{rdd.registerTempTable("temp")}} in the 1.2 version when using sbt or the Eclipse Scala IDE. Stacktrace: {code} Exception in thread "main" scala.reflect.internal.MissingRequirementError: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial classloader with boot classpath [C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-library.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-reflect.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-actor.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-swing.jar;C:\sar\scala\scala-ide\eclipse\plugins\org.scala-ide.scala210.jars_4.0.0.201407240952\target\jars\scala-compiler.jar;C:\Program Files\Java\jre7\lib\resources.jar;C:\Program Files\Java\jre7\lib\rt.jar;C:\Program Files\Java\jre7\lib\sunrsasign.jar;C:\Program Files\Java\jre7\lib\jsse.jar;C:\Program Files\Java\jre7\lib\jce.jar;C:\Program Files\Java\jre7\lib\charsets.jar;C:\Program Files\Java\jre7\lib\jfr.jar;C:\Program Files\Java\jre7\classes] not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48) at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) at org.apache.spark.sql.catalyst.ScalaReflection$$typecreator1$1.apply(ScalaReflection.scala:115) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at scala.reflect.api.TypeTags$class.typeOf(TypeTags.scala:335) at scala.reflect.api.Universe.typeOf(Universe.scala:59) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:115) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:100) at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:33) at org.apache.spark.sql.catalyst.ScalaReflection$class.attributesFor(ScalaReflection.scala:94) at org.apache.spark.sql.catalyst.ScalaReflection$.attributesFor(ScalaReflection.scala:33) at org.apache.spark.sql.SQLContext.createSchemaRDD(SQLContext.scala:111) at com.sar.spark.dq.poc.SparkPOC$delayedInit$body.apply(SparkPOC.scala:43) at scala.Function0$class.apply$mcV$sp(Function0.scala:40) at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.App$$anonfun$main$1.apply(App.scala:71) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.App$class.main(App.scala:71) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
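For context, a minimal Spark 1.2-style program that exercises the failing reflection path (the names here are hypothetical; with the classpath fixed as above it runs, and with the broken Scala IDE bootstrap entries it dies in ScalaReflection as in the stack trace):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object Repro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Repro").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // Spark 1.2: implicit RDD -> SchemaRDD conversion

    val people = sc.parallelize(Seq(Person("a", 30)))
    // Triggers ScalaReflection.schemaFor, which needs scala-reflect resolvable at runtime.
    people.registerTempTable("temp")
  }
}
{code}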
[jira] [Updated] (SPARK-7386) Spark application level metrics application.$AppName.$number.cores doesn't reset on Standalone Master deployment
[ https://issues.apache.org/jira/browse/SPARK-7386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7386: - Component/s: Spark Core Priority: Minor (was: Major) Please set the component. I'm not familiar with this bit, but I recall a similar conversation about a cores metric where some metrics were intended to reflect the amount requested while the job was running. Is that the intent of this one? Spark application level metrics application.$AppName.$number.cores doesn't reset on Standalone Master deployment Key: SPARK-7386 URL: https://issues.apache.org/jira/browse/SPARK-7386 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Bharat Venkat Priority: Minor Spark publishes a metric called application.$AppName.$number.cores, which monitors the number of cores assigned to an application. However, as of 1.3 on standalone deployments there is a bug where this metric doesn't go down to 0 after the application ends. It looks like the standalone master holds onto the old state and continues to publish a stale metric. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7374) Error message when launching: find: 'version' : No such file or directory
[ https://issues.apache.org/jira/browse/SPARK-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7374. -- Resolution: Duplicate Please search JIRA first and review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Error message when launching: find: 'version' : No such file or directory --- Key: SPARK-7374 URL: https://issues.apache.org/jira/browse/SPARK-7374 Project: Spark Issue Type: Bug Components: Deploy, PySpark, Spark Shell Affects Versions: 1.3.1 Reporter: Stijn Geuens When I launch spark-shell (or pyspark), I get the following message: find: 'version' : No such file or directory else was unexpected at this time. How is it possible that this error keeps occurring (with different versions of Spark)? How can I resolve this issue? Thanks in advance, Stijn -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7369) Spark Python 1.3.1 Mllib dataframe random forest problem
[ https://issues.apache.org/jira/browse/SPARK-7369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7369. -- Resolution: Invalid Have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark This sounds, on its face, like a py4j issue. These kinds of things can be reopened if there is more specific evidence it's Spark-related. Spark Python 1.3.1 Mllib dataframe random forest problem Key: SPARK-7369 URL: https://issues.apache.org/jira/browse/SPARK-7369 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.3.1 Reporter: Lisbeth Ron Labels: hadoop I'm working with Dataframes to train a random forest with mllib and I have this error File /opt/mapr/spark/spark-1.3.1-bin-mapr4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o58.sql. somebody can help me...??? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7285) Audit missing Hive functions
[ https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7285: --- Description: Create a list of functions that is on this page but not in SQL/DataFrame. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff:
*basic*
{code}
between: added in 1.4
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}
*math*
{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}
*collection functions*
{code}
sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean
{code}
*date functions*
{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date): int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String
{code}
*conditional functions*
{code}
if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T
{code}
*string functions*
{code}
ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array<string>): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string
parse_url(string urlString, string partToExtract [, string keyToExtract]): string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map<string,string>
trim(string A): string
unbase64(string str): binary
upper(string A), ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string
{code}
*Misc*
{code}
hash(a1[, a2…]): int
{code}
*text*
{code}
context_ngrams(array<array<string>>, array<string>, int K, int pf): array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>
{code}
*UDAF*
{code}
var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array<double>
percentile_approx: array<double>
histogram_numeric: array<struct {'x','y'}>
collect_set — we have hashset
collect_list
ntile
{code}
was: Create a list of functions that is on this page but not in SQL/DataFrame. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff:
*basic*
{code}
-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}
*math*
{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}
*collection functions*
{code}
sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean
{code}
*date functions*
{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date): int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
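Until the built-ins land, a missing scalar function can usually be bridged with a UDF. A hedged sketch for pmod, one of the gaps listed above, assuming an existing SQLContext named sqlContext (the eventual built-in would of course replace this):
{code}
// Stand-in for Hive's pmod(a, n): a positive modulus.
sqlContext.udf.register("pmod", (a: Int, n: Int) => ((a % n) + n) % n)

sqlContext.sql("SELECT pmod(-7, 3)").show() // 2, matching Hive's pmod
{code}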
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530029#comment-14530029 ] Guoqiang Li commented on SPARK-5556: [FastLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553]: Gibbs sampling; the computational complexity is O(n_dk), where n_dk is the number of unique topics in document d. I recommend it for short text. [LightLDA|https://github.com/witgo/zen/blob/1c0f6c63a0b67569aeefba3f767acf1ac93c7a7c/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L763]: Metropolis-Hastings sampling; the computational complexity is O(1) (it depends on the partition strategy and takes up more memory). Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7285) Audit missing Hive functions
[ https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7285: --- Description: Create a list of functions that is on this page but not in SQL/DataFrame. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff:
*basic*
{code}
-between-
bitwise operation
bitwiseAND
bitwiseOR
bitwiseXOR
bitwiseNOT
{code}
*math*
{code}
round(DOUBLE a)
round(DOUBLE a, INT d) Returns a rounded to d decimal places.
log2
sqrt(string column name)
bin
hex(long), hex(string), hex(binary)
unhex(string) - binary
conv
pmod
factorial
-toDeg - toDegrees-
-toRad - toRadians-
e()
pi()
shiftleft(int or long)
shiftright(int or long)
shiftrightunsigned(int or long)
{code}
*collection functions*
{code}
sort_array(array)
size(map, array)
map_values(map<k,v>): array<v>
map_keys(map<k,v>): array<k>
array_contains(array<t>, value): boolean
{code}
*date functions*
{code}
from_unixtime(long, string): string
unix_timestamp(): long
unix_timestamp(date): long
year(date): int
month(date): int
day(date): int
dayofmonth(date): int
hour(timestamp): int
minute(timestamp): int
second(timestamp): int
weekofyear(date): int
date_add(date, int)
date_sub(date, int)
from_utc_timestamp(timestamp, string timezone): timestamp
current_date(): date
current_timestamp(): timestamp
add_months(string start_date, int num_months): string
last_day(string date): string
next_day(string start_date, string day_of_week): string
trunc(string date[, string format]): string
months_between(date1, date2): double
date_format(date/timestamp/string ts, string fmt): String
{code}
*conditional functions*
{code}
if(boolean testCondition, T valueTrue, T valueFalseOrNull): T
nvl(T value, T default_value): T
greatest(T v1, T v2, …): T
least(T v1, T v2, …): T
{code}
*string functions*
{code}
ascii(string str): int
base64(binary): string
concat(string|binary A, string|binary B…): string | binary
concat_ws(string SEP, string A, string B…): string
concat_ws(string SEP, array<string>): string
decode(binary bin, string charset): string
encode(string src, string charset): binary
find_in_set(string str, string strList): int
format_number(number x, int d): string
length(string): int
instr(string str, string substr): int
locate(string substr, string str[, int pos]): int
lower(string), lcase(string)
lpad(string str, int len, string pad): string
ltrim(string): string
parse_url(string urlString, string partToExtract [, string keyToExtract]): string
printf(String format, Obj... args): string
regexp_extract(string subject, string pattern, int index): string
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): string
repeat(string str, int n): string
reverse(string A): string
rpad(string str, int len, string pad): string
space(int n): string
split(string str, string pat): array
str_to_map(text[, delimiter1, delimiter2]): map<string,string>
trim(string A): string
unbase64(string str): binary
upper(string A), ucase(string A): string
levenshtein(string A, string B): int
soundex(string A): string
{code}
*Misc*
{code}
hash(a1[, a2…]): int
{code}
*text*
{code}
context_ngrams(array<array<string>>, array<string>, int K, int pf): array<struct<string,double>>
ngrams(array<array<string>>, int N, int K, int pf): array<struct<string,double>>
sentences(string str, string lang, string locale): array<array<string>>
{code}
*UDAF*
{code}
var_samp
stddev_pop
stddev_samp
covar_pop
covar_samp
corr
percentile: array<double>
percentile_approx: array<double>
histogram_numeric: array<struct {'x','y'}>
collect_set — we have hashset
collect_list
ntile
{code}
was: Create a list of functions that is on this page but not in SQL/DataFrame. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff: *basic* -between- bitwise operation bitwiseAND bitwiseOR bitwiseXOR bitwiseNOT *math* round(DOUBLE a) round(DOUBLE a, INT d) Returns a rounded to d decimal places. log2 sqrt(string column name) bin hex(long), hex(string), hex(binary) unhex(string) - binary conv pmod factorial -toDeg - toDegrees- -toRad - toRadians- e() pi() shiftleft(int or long) shiftright(int or long) shiftrightunsigned(int or long) *collection functions* sort_array(array) size(map, array) map_values(map<k,v>): array<v> map_keys(map<k,v>): array<k> array_contains(array<t>, value): boolean *date functions* from_unixtime(long, string): string unix_timestamp(): long unix_timestamp(date): long year(date): int month(date): int day(date): int dayofmonth(date): int hour(timestamp): int minute(timestamp): int second(timestamp): int weekofyear(date): int date_add(date, int) date_sub(date, int) from_utc_timestamp(timestamp, string timezone): timestamp current_date(): date current_timestamp(): timestamp add_months(string start_date, int num_months): string last_day(string date): string next_day(string start_date, string day_of_week): string trunc(string
[jira] [Updated] (SPARK-4488) Add control over map-side aggregation
[ https://issues.apache.org/jira/browse/SPARK-4488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4488: - Target Version/s: (was: 1.1.1, 1.2.0) Add control over map-side aggregation - Key: SPARK-4488 URL: https://issues.apache.org/jira/browse/SPARK-4488 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.1.0 Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3095) [PySpark] Speed up RDD.count()
[ https://issues.apache.org/jira/browse/SPARK-3095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3095: - Target Version/s: (was: 1.2.0) [PySpark] Speed up RDD.count() -- Key: SPARK-3095 URL: https://issues.apache.org/jira/browse/SPARK-3095 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu Priority: Minor RDD.count() can fall back to RDD._jrdd.count() when the RDD is not a PipelineRDD. If the JavaRDD is serialized in batch mode, it's possible to skip the deserialization of chunks (except the last one), because they all have the same number of elements in them. There are some special cases where the chunks are re-ordered, so this will not work there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3492) Clean up Yarn integration code
[ https://issues.apache.org/jira/browse/SPARK-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3492: - Target Version/s: (was: 1.2.0) Clean up Yarn integration code -- Key: SPARK-3492 URL: https://issues.apache.org/jira/browse/SPARK-3492 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical This is the parent umbrella for cleaning up the Yarn integration code in general. This is a broad effort and each individual cleanup should opened as a sub-issue against this one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3913) Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API
[ https://issues.apache.org/jira/browse/SPARK-3913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3913: - Target Version/s: (was: 1.2.0) Spark Yarn Client API change to expose Yarn Resource Capacity, Yarn Application Listener and killApplication() API -- Key: SPARK-3913 URL: https://issues.apache.org/jira/browse/SPARK-3913 Project: Spark Issue Type: Improvement Components: YARN Reporter: Chester When working with Spark in Yarn deployment mode, we have the following issues: 1) We don't know the Yarn maximum capacity (memory and cores) before we specify the number of executors and the memory for the Spark driver and executors. If we set a big number, the job can potentially exceed the limit and get killed. It would be better to let the application know the Yarn resource capacity ahead of time so the Spark config can be adjusted dynamically. 2) Once the job has started, we would like to have some feedback from the Yarn application. Currently, the Spark client basically blocks the call and returns when the job is finished, failed, or killed. If the job runs for a few hours, we have no idea how far it has gone: the progress, resource usage, tracking URL, etc. 3) Once the job is started, you basically can't stop it. The Yarn client API's stop doesn't work in most cases in our experience. But the Yarn API that does work is killApplication(appId). So we need to expose this killApplication() API to the Spark Yarn client as well. I will create one pull request and try to address these problems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2838) performance tests for feature transformations
[ https://issues.apache.org/jira/browse/SPARK-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2838: - Target Version/s: (was: 1.2.0) performance tests for feature transformations - Key: SPARK-2838 URL: https://issues.apache.org/jira/browse/SPARK-2838 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Priority: Minor 1. TF-IDF 2. StandardScaler 3. Normalizer 4. Word2Vec -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2653) Heap size should be the sum of driver.memory and executor.memory in local mode
[ https://issues.apache.org/jira/browse/SPARK-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2653: - Target Version/s: (was: 1.2.0) Heap size should be the sum of driver.memory and executor.memory in local mode -- Key: SPARK-2653 URL: https://issues.apache.org/jira/browse/SPARK-2653 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Davies Liu Priority: Minor Original Estimate: 1h Remaining Estimate: 1h In local mode, the driver and executor run in the same JVM, so the heap size of JVM should be the sum of spark.driver.memory and spark.executor.memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3685: - Target Version/s: (was: 1.2.0) Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
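A minimal sketch of the kind of validation being asked for: parse each entry as a URI and fail fast on non-local schemes instead of silently treating hdfs: as a directory name. This is illustrative only, not the actual Utils code:
{code}
import java.net.URI

def checkLocalDir(entry: String): String = {
  // A bare path like /tmp/spark has no scheme; treat that as "file".
  val scheme = Option(new URI(entry).getScheme).getOrElse("file")
  require(scheme == "file",
    s"spark.local.dir must be a local path, got scheme '$scheme' in: $entry")
  entry
}

checkLocalDir("/tmp/spark")    // ok
checkLocalDir("hdfs:/tmp/foo") // throws IllegalArgumentException
{code}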
[jira] [Updated] (SPARK-1239) Don't fetch all map output statuses at each reducer during shuffles
[ https://issues.apache.org/jira/browse/SPARK-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1239: - Target Version/s: (was: 1.2.0) Don't fetch all map output statuses at each reducer during shuffles --- Key: SPARK-1239 URL: https://issues.apache.org/jira/browse/SPARK-1239 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Patrick Wendell Instead we should modify the way we fetch map output statuses to take both a mapper and a reducer - or we should just piggyback the statuses on each task. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3075) Expose a way for users to parse event logs
[ https://issues.apache.org/jira/browse/SPARK-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3075: - Target Version/s: (was: 1.2.0) Expose a way for users to parse event logs -- Key: SPARK-3075 URL: https://issues.apache.org/jira/browse/SPARK-3075 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.2 Reporter: Andrew Or Both ReplayListenerBus and util.JsonProtocol are private[spark], so if the user wants to parse the event logs themselves for analytics, they will have to write their own JSON deserializers (or do some crazy reflection to access these methods). We should expose an easy way for them to do this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
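As a stopgap, the event log is one self-contained JSON object per line, so it can be scanned without touching the private JsonProtocol at all. A rough sketch that counts job-end events with plain string matching; the file path is hypothetical and the field name is assumed from the on-disk format:
{code}
import scala.io.Source

val lines = Source.fromFile("/tmp/spark-events/app-1234").getLines()
// Each line is one JSON event; match on the Event field textually.
val jobEnds = lines.count(_.contains("\"Event\":\"SparkListenerJobEnd\""))
println(s"jobs finished: $jobEnds")
{code}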
[jira] [Updated] (SPARK-2774) Set preferred locations for reduce tasks
[ https://issues.apache.org/jira/browse/SPARK-2774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2774: - Target Version/s: (was: 1.2.0) Set preferred locations for reduce tasks Key: SPARK-2774 URL: https://issues.apache.org/jira/browse/SPARK-2774 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Assignee: Shivaram Venkataraman Currently we do not set preferred locations for reduce tasks in Spark. This patch proposes setting preferred locations based on the map output sizes and locations tracked by the MapOutputTracker. This is useful in two conditions 1. When you have a small job in a large cluster it can be useful to co-locate map and reduce tasks to avoid going over the network 2. If there is a lot of data skew in the map stage outputs, then it is beneficial to place the reducer close to the largest output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4609) Job can not finish if there is one bad slave in clusters
[ https://issues.apache.org/jira/browse/SPARK-4609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4609: - Target Version/s: (was: 1.3.0) Job can not finish if there is one bad slave in clusters Key: SPARK-4609 URL: https://issues.apache.org/jira/browse/SPARK-4609 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu If there is one bad machine in the cluster, its executors will keep dying (for example, because the disk is out of space), a task may be scheduled onto this machine multiple times, and then the job will fail after several failures of that one task.
{code}
14/11/26 00:34:57 INFO TaskSetManager: Starting task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
14/11/26 00:34:57 WARN TaskSetManager: Lost task 39.0 in stage 3.0 (TID 1255, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 60 lost)
14/11/26 00:35:02 INFO TaskSetManager: Starting task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
14/11/26 00:35:03 WARN TaskSetManager: Lost task 39.1 in stage 3.0 (TID 1256, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 61 lost)
14/11/26 00:35:08 INFO TaskSetManager: Starting task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
14/11/26 00:35:08 WARN TaskSetManager: Lost task 39.2 in stage 3.0 (TID 1257, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 62 lost)
14/11/26 00:35:13 INFO TaskSetManager: Starting task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal, PROCESS_LOCAL, 5119 bytes)
14/11/26 00:35:14 WARN TaskSetManager: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 39 in stage 3.0 failed 4 times, most recent failure: Lost task 39.3 in stage 3.0 (TID 1258, spark-worker-028.c.lofty-inn-754.internal): ExecutorLostFailure (executor 63 lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1207)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1196)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1195)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1195)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1413)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1368)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}
A task should not be scheduled onto the same machine more than once. Also, if a machine fails with executor lost, it should be put on a blacklist for some time, then tried again. cc [~kayousterhout] [~matei] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
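Until blacklisting exists, the only readily available dial is the per-task retry budget. A hedged mitigation sketch; it tolerates more failures before the job aborts, but does not stop the scheduler from picking the bad machine again:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("retry-tolerant")
  .set("spark.task.maxFailures", "8") // default is 4
val sc = new SparkContext(conf)
{code}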
[jira] [Updated] (SPARK-2992) The transforms formerly known as non-lazy
[ https://issues.apache.org/jira/browse/SPARK-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2992: - Target Version/s: (was: 1.2.0) The transforms formerly known as non-lazy - Key: SPARK-2992 URL: https://issues.apache.org/jira/browse/SPARK-2992 Project: Spark Issue Type: Umbrella Components: Spark Core Affects Versions: 1.1.0 Reporter: Erik Erlandson An umbrella for a grab-bag of tickets involving lazy implementations of transforms formerly thought to be non-lazy. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3376) Memory-based shuffle strategy to reduce overhead of disk I/O
[ https://issues.apache.org/jira/browse/SPARK-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3376: - Target Version/s: (was: 1.3.0) Memory-based shuffle strategy to reduce overhead of disk I/O Key: SPARK-3376 URL: https://issues.apache.org/jira/browse/SPARK-3376 Project: Spark Issue Type: New Feature Components: Shuffle Affects Versions: 1.1.0 Reporter: uncleGen Labels: performance I think a memory-based shuffle can reduce some overhead of disk I/O. I just want to know whether there is any plan to do something about it, or any suggestions. Based on the work in SPARK-2044, it is feasible to have several implementations of shuffle. Currently, there are two implementations of shuffle manager, i.e. SORT and HASH. Both of them will use disk in some stages. For example, on the map side, all the intermediate data will be written into temporary files. On the reduce side, Spark will use external sort sometimes. In any case, disk I/O will bring some performance loss. Maybe we can provide a pure-memory shuffle manager. In this shuffle manager, intermediate data will only go through memory. In some scenarios, it can improve performance. Experimentally, I implemented an in-memory shuffle manager upon SPARK-2044. 1. Following are my test results (some heavy shuffle operations):
| data size (Byte) | partitions | resources |
| 5131859218 | 2000 | 50 executors / 4 cores / 4GB |
| settings | operation1 | operation2 |
| shuffle spill lz4 | repartition+flatMap+groupByKey | repartition + groupByKey |
| memory | 38s | 16s |
| sort | 45s | 28s |
| hash | 46s | 28s |
| no shuffle spill lz4 | | |
| memory | 16s | 16s |
| | | |
| shuffle spill lzf | | |
| memory | 28s | 27s |
| sort | 29s | 29s |
| hash | 41s | 30s |
| no shuffle spill lzf | | |
| memory | 15s | 16s |
In my implementation, I simply reused the BlockManager on the map side and set spark.shuffle.spill to false on the reduce side. All the intermediate data is cached in the memory store. Just as Reynold Xin has pointed out, our disk-based shuffle manager has achieved good performance. With parameter tuning, the disk-based shuffle manager will obtain similar performance to the memory-based shuffle manager. However, I will continue my work and improve it. And as an alternative tuning option, in-memory shuffle is a good choice. Future work includes, but is not limited to: - memory usage management in InMemory Shuffle mode - data management when intermediate data can not fit in memory Test code:
{code: borderStyle=solid}
val conf = new SparkConf().setAppName("InMemoryShuffleTest")
val sc = new SparkContext(conf)
val dataPath = args(0)
val partitions = args(1).toInt

val rdd1 = sc.textFile(dataPath).cache()
rdd1.count()
val startTime = System.currentTimeMillis()
val rdd2 = rdd1.repartition(partitions)
  .flatMap(_.split(",")).map(s => (s, s))
  .groupBy(e => e._1)
rdd2.count()
val endTime = System.currentTimeMillis()
println("time: " + (endTime - startTime) / 1000)
{code}
2. Following is a Spark sort benchmark (in Spark 1.1.1). There is no tuning for disk shuffle. 2.1. Test the influence of memory size per core. Precondition: 100GB (SORT benchmark), 100 executors / 15 cores, 1491 partitions (input file blocks).
| memory size per executor | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 9GB | 79.652849s | 60.102337s | failed | -32.7% | - |
| 12GB | 54.821924s | 51.654897s | 109.167068s | -3.17% | +47.8% |
| 15GB | 33.537199s | 40.140621s | 48.088158s | +16.47% | +30.26% |
| 18GB | 30.930927s | 43.392401s | 49.830276s | +28.7% | +37.93% |
2.2. Test the influence of partition number. 18GB / 15 cores per executor.
| partitions | inmemory shuffle (no shuffle spill) | sort shuffle | hash shuffle | improvement (vs. sort) | improvement (vs. hash) |
| 1000 | 92.675436s | 85.193158s | 71.106323s | -8.78% |
[jira] [Updated] (SPARK-3348) Support user-defined SparkListeners properly
[ https://issues.apache.org/jira/browse/SPARK-3348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3348: - Target Version/s: (was: 1.2.0) Support user-defined SparkListeners properly Key: SPARK-3348 URL: https://issues.apache.org/jira/browse/SPARK-3348 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Because of the current initialization ordering, user-defined SparkListeners do not receive certain events that are posted before application code is run. We need to expose a constructor that allows the given SparkListeners to receive all events. There has been interest in this for a while, but I have searched through the JIRA history and have not found a related issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
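For reference, Spark 1.3 later added a spark.extraListeners configuration (SPARK-5411) that constructs listeners during SparkContext initialization, so they see the early events this issue describes. A sketch of a listener usable that way; the class name is illustrative:
{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}

// Needs a zero-arg constructor so it can be instantiated from config.
class StartLogger extends SparkListener {
  override def onApplicationStart(e: SparkListenerApplicationStart): Unit =
    println(s"app started: ${e.appName}")
}
{code}
It would be registered with --conf spark.extraListeners=com.example.StartLogger (package assumed), attaching it before the application-start event is posted.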
[jira] [Updated] (SPARK-4784) Model.fittingParamMap should store all Params
[ https://issues.apache.org/jira/browse/SPARK-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4784: - Target Version/s: (was: 1.3.0) Model.fittingParamMap should store all Params - Key: SPARK-4784 URL: https://issues.apache.org/jira/browse/SPARK-4784 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3916) recognize appended data in textFileStream()
[ https://issues.apache.org/jira/browse/SPARK-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3916: - Target Version/s: (was: 1.2.0) recognize appended data in textFileStream() --- Key: SPARK-3916 URL: https://issues.apache.org/jira/browse/SPARK-3916 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Davies Liu Assignee: Davies Liu Right now we only find new data in new files; data appended to old files (processed in the last batch) will not be processed. In order to support this, we need partialRDD(), which is an RDD for part of a file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3051) Support looking-up named accumulators in a registry
[ https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3051: - Target Version/s: (was: 1.2.0) Support looking-up named accumulators in a registry --- Key: SPARK-3051 URL: https://issues.apache.org/jira/browse/SPARK-3051 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Neil Ferguson This is a proposed enhancement to Spark based on the following mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html. This proposal builds on SPARK-2380 (Support displaying accumulator values in the web UI) to allow named accumulables to be looked up in a registry, as opposed to having to be passed to every method that needs to access them. The use case was described well by [~shivaram], as follows: Let's say you have two functions you use in a map call and want to measure how much time each of them takes. For example, if you have a code block like the one below and you want to measure how much time f1 takes as a fraction of the task.
{noformat}
a.map { l =>
  val f = f1(l)
  ... some work here ...
}
{noformat}
It would be really cool if we could do something like
{noformat}
a.map { l =>
  val start = System.nanoTime
  val f = f1(l)
  TaskMetrics.get("f1-time").add(System.nanoTime - start)
}
{noformat}
SPARK-2380 provides a partial solution to this problem -- however the accumulables would still need to be passed to every function that needs them, which I think would be cumbersome in any application of reasonable complexity. The proposal, as suggested by [~pwendell], is to have a registry of accumulables that can be looked up by name. Regarding the implementation details, I'd propose that we broadcast a serialized version of all named accumulables in the DAGScheduler (similar to what SPARK-2521 does for Tasks). These can then be deserialized in the Executor. Accumulables are already stored in thread-local variables in the Accumulators object, so exposing these in the registry should be simply a matter of wrapping this object, and keying the accumulables by name (they are currently keyed by ID). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
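For contrast, the status quo the proposal wants to improve on: a named accumulator (visible in the web UI per SPARK-2380) still has to be captured by the closure explicitly. A sketch reusing a, f1, and sc from the example above:
{code}
// Named accumulator: shows up in the web UI, but must be in scope at the call site.
val f1Time = sc.accumulator(0L, "f1-time")

val out = a.map { l =>
  val start = System.nanoTime
  val f = f1(l)
  f1Time += System.nanoTime - start
  f
}
{code}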
[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7394: --- Fix Version/s: (was: 1.4.0) Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7372) Multiclass SVM - One vs All wrapper
[ https://issues.apache.org/jira/browse/SPARK-7372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7372. -- Resolution: Won't Fix This should be a question on user@ I think. It would be better to build this once than to specialize it several times. If there were some different, special way to handle multiclass in SVM that took advantage of how SVMs work, then it might make sense to support an SVM-specific implementation. (For example, you certainly don't need one-vs-all to do multiclass in decision trees.) But I don't believe there is for SVM. Multiclass SVM - One vs All wrapper --- Key: SPARK-7372 URL: https://issues.apache.org/jira/browse/SPARK-7372 Project: Spark Issue Type: Question Components: MLlib Reporter: Renat Bekbolatov Priority: Trivial I was wondering if we want to have some support for multiclass SVM in MLlib, for example, through a simple wrapper over binary SVM classifiers with OVA (one-vs-all). There is already WIP for ML pipeline generalization: SPARK-7015, Multiclass to Binary Reduction. However, if users prefer to just have a basic OVA version that runs against SVMWithSGD, they might be able to use it. Here is a code sketch: https://github.com/Bekbolatov/spark/commit/463d73323d5f08669d5ae85dc9791b036637c966 Maybe this could live in a 3rd-party utility library (outside Spark MLlib). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
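For reference, the one-vs-all idea over SVMWithSGD fits in a few lines. This is a rough sketch, not the implementation linked above, and it assumes class labels are 0.0 through numClasses - 1:
{code}
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train one binary SVM per class, relabeling the target class as 1.0.
def trainOneVsAll(data: RDD[LabeledPoint],
                  numClasses: Int,
                  numIterations: Int): Array[SVMModel] =
  (0 until numClasses).map { c =>
    val binary = data.map(p =>
      LabeledPoint(if (p.label == c) 1.0 else 0.0, p.features))
    // clearThreshold() keeps raw margins, so scores stay comparable across models.
    SVMWithSGD.train(binary, numIterations).clearThreshold()
  }.toArray

// Predict the class whose model produces the largest margin.
def predictOneVsAll(models: Array[SVMModel], features: Vector): Double =
  models.zipWithIndex.maxBy(_._1.predict(features))._2.toDouble
{code}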
[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530066#comment-14530066 ] Chen Song commented on SPARK-7394: -- Ok, I'll work on this after finishing my daily job. Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7396) Update Kafka example to use new API of Kafka 0.8.2
Saisai Shao created SPARK-7396: -- Summary: Update Kafka example to use new API of Kafka 0.8.2 Key: SPARK-7396 URL: https://issues.apache.org/jira/browse/SPARK-7396 Project: Spark Issue Type: Bug Components: Examples, Streaming Reporter: Saisai Shao Due to the Kafka upgrade, the current KafkaWordCountProducer will throw the exception below; we need to update the code accordingly.
{code}
Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
 at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
 at kafka.producer.Producer.send(Producer.scala:77)
 at org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
 at org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
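For what the update might look like: Kafka 0.8.2 ships a new producer under org.apache.kafka.clients.producer. A minimal sketch, with the broker address and topic as placeholders:
{code}
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Configuration for the new producer API introduced in Kafka 0.8.2.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("wordcount", "hello world"))
producer.close()
{code}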
[jira] [Updated] (SPARK-7395) some suggestion about SimpleApp in quick-start.html
[ https://issues.apache.org/jira/browse/SPARK-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7395: --- Affects Version/s: (was: 1.3.1) 1.4.0 some suggestion about SimpleApp in quick-start.html --- Key: SPARK-7395 URL: https://issues.apache.org/jira/browse/SPARK-7395 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.1 Environment: none Reporter: zhengbing li Original Estimate: 12h Remaining Estimate: 12h Based on the SimpleApp code guide in https://spark.apache.org/docs/latest/quick-start.html, I could not run the SimpleApp code until I modified val conf = new SparkConf().setAppName("Simple Application") to val conf = new SparkConf().setAppName("Simple Application").setMaster("local"). So the document might need to be modified for beginners. The error from the Scala example is as follows: 15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0 Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration at org.apache.spark.SparkContext.init(SparkContext.scala:206) at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12) at com.huawei.openspark.TestSpark.main(TestSpark.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7395) some suggestion about SimpleApp in quick-start.html
[ https://issues.apache.org/jira/browse/SPARK-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7395: --- Affects Version/s: (was: 1.4.0) 1.3.1 some suggestion about SimpleApp in quick-start.html --- Key: SPARK-7395 URL: https://issues.apache.org/jira/browse/SPARK-7395 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.1 Environment: none Reporter: zhengbing li Original Estimate: 12h Remaining Estimate: 12h Based on the SimpleApp code guide in https://spark.apache.org/docs/latest/quick-start.html, I could not run the SimpleApp code until I modified val conf = new SparkConf().setAppName("Simple Application") to val conf = new SparkConf().setAppName("Simple Application").setMaster("local"). So the document might need to be modified for beginners. The error from the Scala example is as follows: 15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0 Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration at org.apache.spark.SparkContext.init(SparkContext.scala:206) at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12) at com.huawei.openspark.TestSpark.main(TestSpark.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6824) Fill the docs for DataFrame API in SparkR
[ https://issues.apache.org/jira/browse/SPARK-6824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14529997#comment-14529997 ] Qian Huang commented on SPARK-6824: --- start working on this issue Fill the docs for DataFrame API in SparkR - Key: SPARK-6824 URL: https://issues.apache.org/jira/browse/SPARK-6824 Project: Spark Issue Type: Sub-task Components: SparkR Reporter: Shivaram Venkataraman Priority: Blocker Some of the DataFrame functions in SparkR do not have complete roxygen docs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7150) SQLContext.range()
[ https://issues.apache.org/jira/browse/SPARK-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7150: --- Summary: SQLContext.range() (was: Facilitate random column generation for DataFrames) SQLContext.range() -- Key: SPARK-7150 URL: https://issues.apache.org/jira/browse/SPARK-7150 Project: Spark Issue Type: Sub-task Components: ML, SQL Reporter: Joseph K. Bradley Priority: Minor Labels: starter It would be handy to have easy ways to construct random columns for DataFrames. Proposed API:
{code}
class SQLContext {
  // Returns a DataFrame with a single column named "id" containing consecutive values from 0 to n.
  def range(n: Long): DataFrame
  def range(n: Long, numPartitions: Int): DataFrame
}
{code}
Usage:
{code}
// uniform distribution
ctx.range(1000).select(rand())

// normal distribution
ctx.range(1000).select(randn())
{code}
We should add a RangeIterator that supports long start/stop positions, and then use it to create an RDD as the basis for this DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
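A minimal sketch of such an iterator (the name RangeIterator follows the proposal; start is inclusive, end exclusive):
{code}
// Iterates over [start, end) without materializing the range in memory.
class RangeIterator(start: Long, end: Long) extends Iterator[Long] {
  private var current = start
  override def hasNext: Boolean = current < end
  override def next(): Long = {
    val value = current
    current += 1
    value
  }
}

// Partition i of numPartitions would then cover one contiguous slice, e.g.
// new RangeIterator(i * n / numPartitions, (i + 1) * n / numPartitions).
{code}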
[jira] [Updated] (SPARK-7285) Audit missing Hive functions
[ https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7285: --- Description: Create a list of functions that is on this page but not in SQL/DataFrame. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff: *basic* -between- bitwise operation bitwiseAND bitwiseOR bitwiseXOR bitwiseNOT *math* round(DOUBLE a) round(DOUBLE a, INT d) Returns a rounded to d decimal places. log2 sqrt(string column name) bin hex(long), hex(string), hex(binary) unhex(string) - binary conv pmod factorial -toDeg - toDegrees- -toRad - toRadians- e() pi() shiftleft(int or long) shiftright(int or long) shiftrightunsigned(int or long) *collection functions* sort_array(array) size(map, array) map_values(mapk,v): arrayv map_keys(mapk,v):arrayk array_contains(arrayt, value): boolean *date functions* from_unixtime(long, string): string unix_timestamp(): long unix_timestamp(date): long year(date): int month(date): int day(date): int dayofmonth(date); int hour(timestamp): int minute(timestamp): int second(timestamp): int weekofyear(date): int date_add(date, int) date_sub(date, int) from_utc_timestamp(timestamp, string timezone): timestamp current_date(): date current_timestamp(): timestamp add_months(string start_date, int num_months): string last_day(string date): string next_day(string start_date, string day_of_week): string trunc(string date[, string format]): string months_between(date1, date2): double date_format(date/timestamp/string ts, string fmt): String *conditional functions* if(boolean testCondition, T valueTrue, T valueFalseOrNull): T nvl(T value, T default_value): T greatest(T v1, T v2, …): T least(T v1, T v2, …): T *string functions* ascii(string str): int base64(binary): string concat(string|binary A, string|binary B…): string | binary concat_ws(string SEP, string A, string B…): string concat_ws(string SEP, arraystring): string decode(binary bin, string charset): string encode(string src, string charset): binary find_in_set(string str, string strList): int format_number(number x, int d): string length(string): int instr(string str, string substr): int locate(string substr, string str[, int pos]): int lower(string), lcase(string) lpad(string str, int len, string pad): string ltrim(string): string parse_url(string urlString, string partToExtract [, string keyToExtract]): string printf(String format, Obj... 
args): string regexp_extract(string subject, string pattern, int index): string regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): string repeat(string str, int n): string reverse(string A): string rpad(string str, int len, string pad): string space(int n): string split(string str, string pat): array str_to_map(text[, delimiter1, delimiter2]): mapstring, string trim(string A): string unbase64(string str): binary upper(string A) ucase(string A): string levenshtein(string A, string B: int soundex(string A): string *Misc* hash(a1[, a2…]): int *text* context_ngrams(arrayarraystring, arraystring, int K, int pf): arraystructstring,double ngrams(arrayarraystring, int N, int K, int pf): arraystructstring,double sentences(string str, string lang, string locale): arrayarraystring *UDAF* var_samp stddev_pop stddev_samp covar_pop covar_samp corr percentile: arraydouble percentile_approx: arraydouble histogram_numeric: arraystruct {'x','y'} collect_set — we have hashset collect_list ntile was: Create a list of functions that is on this page but not in SQL/DataFrame. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff: *basic* -between- bitwise operation bitwiseAND bitwiseOR bitwiseXOR bitwiseNOT *math* round(DOUBLE a) round(DOUBLE a, INT d) Returns a rounded to d decimal places. log2 sqrt(string column name) bin hex(long), hex(string), hex(binary) unhex(string) - binary conv pmod factorial toDeg - toDegrees toRad - toRadians e() pi() shiftleft(int or long) shiftright(int or long) shiftrightunsigned(int or long) *collection functions* sort_array(array) size(map, array) map_values(mapk,v): arrayv map_keys(mapk,v):arrayk array_contains(arrayt, value): boolean *date functions* from_unixtime(long, string): string unix_timestamp(): long unix_timestamp(date): long year(date): int month(date): int day(date): int dayofmonth(date); int hour(timestamp): int minute(timestamp): int second(timestamp): int weekofyear(date): int date_add(date, int) date_sub(date, int) from_utc_timestamp(timestamp, string timezone): timestamp current_date(): date current_timestamp(): timestamp add_months(string start_date, int num_months): string last_day(string date): string next_day(string start_date, string day_of_week): string trunc(string date[, string format]): string months_between(date1, date2): double date_format(date/timestamp/string ts, string fmt): String
[jira] [Updated] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3454: - Target Version/s: (was: 1.2.0) Expose JSON representation of data shown in WebUI - Key: SPARK-3454 URL: https://issues.apache.org/jira/browse/SPARK-3454 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Imran Rashid Fix For: 1.4.0 Attachments: sparkmonitoringjsondesign.pdf If the WebUI supported extracting data as JSON, it would be helpful for users who want to analyse stage / task / executor information. Fortunately, the WebUI has a renderJson method, so we can implement the method in each subclass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3166) Custom serialisers can't be shipped in application jars
[ https://issues.apache.org/jira/browse/SPARK-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3166: - Target Version/s: (was: 1.2.0) Custom serialisers can't be shipped in application jars --- Key: SPARK-3166 URL: https://issues.apache.org/jira/browse/SPARK-3166 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Reporter: Graham Dennis Spark cannot currently use a custom serialiser that is shipped with the application jar. Trying to do this causes a java.lang.ClassNotFoundException when trying to instantiate the custom serialiser in the Executor processes. This occurs because Spark attempts to instantiate the custom serialiser before the application jar has been shipped to the Executor process. A reproduction of the problem is available here: https://github.com/GrahamDennis/spark-custom-serialiser I've verified this problem in Spark 1.0.2, and Spark master and 1.1 branches as of August 21, 2014. This issue is related to SPARK-2878, and my fix for that issue (https://github.com/apache/spark/pull/1890) also solves this. My pull request was not merged because it adds the user jar to the Executor processes' class path at launch time. Such a significant change was thought by [~rxin] to require more QA, and should be considered for inclusion in 1.2 at the earliest. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3134) Update block locations asynchronously in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3134: - Target Version/s: (was: 1.2.0) Update block locations asynchronously in TorrentBroadcast - Key: SPARK-3134 URL: https://issues.apache.org/jira/browse/SPARK-3134 Project: Spark Issue Type: Sub-task Components: Block Manager Reporter: Reynold Xin Once the TorrentBroadcast gets the data blocks, it needs to tell the master the new location. We should make the location update non-blocking to reduce the roundtrips needed to launch tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3684) Can't configure local dirs in Yarn mode
[ https://issues.apache.org/jira/browse/SPARK-3684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3684: - Target Version/s: (was: 1.2.0) Can't configure local dirs in Yarn mode --- Key: SPARK-3684 URL: https://issues.apache.org/jira/browse/SPARK-3684 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Andrew Or We can't set SPARK_LOCAL_DIRS or spark.local.dirs because they're not picked up in Yarn mode. However, we can't set YARN_LOCAL_DIRS or LOCAL_DIRS either because these are overridden by Yarn. I'm trying to set these through SPARK_YARN_USER_ENV. I'm aware that the default behavior is for Spark to use Yarn's local dirs, but right now there's no way to change it even if the user wants to. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3631: - Target Version/s: (was: 1.2.0) Add docs for checkpoint usage - Key: SPARK-3631 URL: https://issues.apache.org/jira/browse/SPARK-3631 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.1.0 Reporter: Andrew Ash Assignee: Andrew Ash We should include general documentation on using checkpoints. Right now the docs only cover checkpoints in the Spark Streaming use case, which is slightly different from Core. Some content to consider for inclusion from [~brkyvz]: {quote} If you set the checkpointing directory however, the intermediate state of the RDDs will be saved in HDFS, and the lineage will pick up from there. You won't need to keep the shuffle data before the checkpointed state; those can therefore be safely removed (will be removed automatically). However, checkpoint must be called explicitly as in https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 , just setting the directory will not be enough. {quote} {quote} Yes, writing to HDFS is more expensive, but I feel it is still a small price to pay when compared to having a Disk Space Full error three hours in and having to start from scratch. The main goal of checkpointing is to truncate the lineage. Clearing up shuffle writes comes as a bonus to checkpointing; it is not the main goal. The subtlety here is that .checkpoint() is just like .cache(). Until you call an action, nothing happens. Therefore, if you're going to do 1000 maps in a row and you don't want to checkpoint in the meantime until a shuffle happens, you will still get a StackOverflowError, because the lineage is too long. I went through some of the code for checkpointing. As far as I can tell, it materializes the data in HDFS and resets all its dependencies, so you start a fresh lineage. My understanding would be that checkpointing should still be done every N operations to reset the lineage. However, an action must be performed before the lineage grows too long. {quote} A good place to put this information would be at https://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
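A minimal usage sketch for core RDD checkpointing, with placeholder HDFS paths, illustrating the points quoted above:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("checkpoint-demo"))

// Directory where checkpointed RDDs are materialized.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val data = sc.textFile("hdfs:///data/input").map(_.length).cache()
data.checkpoint() // like cache(): nothing happens yet
data.count()      // the action materializes the RDD and truncates its lineage
{code}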
[jira] [Updated] (SPARK-3514) Provide a utility function for returning the hosts (and number) of live executors
[ https://issues.apache.org/jira/browse/SPARK-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3514: - Target Version/s: (was: 1.2.0) Provide a utility function for returning the hosts (and number) of live executors - Key: SPARK-3514 URL: https://issues.apache.org/jira/browse/SPARK-3514 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Priority: Minor It would be nice to tell user applications how many executors they have currently running in their application. Also, we could give them the host names on which the executors are running. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
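Until such a utility exists, one rough approximation is SparkContext.getExecutorMemoryStatus, whose keys are the host:port strings of the live block managers (note the driver's own block manager is included in the count):
{code}
// Hosts of live block managers; subtract one for the driver itself.
val hosts = sc.getExecutorMemoryStatus.keys.map(_.split(":").head).toSeq

println(s"~${hosts.size - 1} executors on hosts: ${hosts.distinct.mkString(", ")}")
{code}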
[jira] [Updated] (SPARK-1832) Executor UI improvement suggestions
[ https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1832: - Target Version/s: (was: 1.2.0) Executor UI improvement suggestions --- Key: SPARK-1832 URL: https://issues.apache.org/jira/browse/SPARK-1832 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.0.0 Reporter: Thomas Graves I received some suggestions from a user for the /executors UI page to make it more helpful. This gets more important when you have a really large number of executors. Fill some of the cells with color in order to make it easier to absorb the info, e.g.: RED if Failed Tasks is greater than 0 (maybe the more failed, the more intense the red); GREEN if Active Tasks is greater than 0 (maybe the more intense the larger the number). Possibly color-code COMPLETE TASKS using various shades of blue (e.g., based on the log of # completed); if dark blue, then write the value in white (same for the RED and GREEN above). Maybe mark the MASTER task somehow. Report the TOTALS in each column (do this at the TOP so there is no need to scroll to the bottom, or print both at top and bottom). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3630: - Target Version/s: (was: 1.1.1, 1.2.0) Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen A recent GraphX commit caused non-deterministic exceptions in unit tests so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator:
{noformat}
com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2)
 com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
 com.esotericsoftware.kryo.io.Input.require(Input.java:169)
 com.esotericsoftware.kryo.io.Input.readInt(Input.java:325)
 com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624)
 com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127)
 com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
 com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
 com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
 ...
{noformat}
This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3385) Improve shuffle performance
[ https://issues.apache.org/jira/browse/SPARK-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3385: - Target Version/s: (was: 1.3.0) Improve shuffle performance --- Key: SPARK-3385 URL: https://issues.apache.org/jira/browse/SPARK-3385 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Reynold Xin Just a ticket to track various efforts related to improving shuffle in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4106) Shuffle write and spill to disk metrics are incorrect
[ https://issues.apache.org/jira/browse/SPARK-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4106: - Target Version/s: (was: 1.2.0) Shuffle write and spill to disk metrics are incorrect - Key: SPARK-4106 URL: https://issues.apache.org/jira/browse/SPARK-4106 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Priority: Critical I have encountered a job which shows some disk spill (memory) while the disk spill (disk) is 0, as is the shuffle write. If I switch to hash-based shuffle, where there happens to be no disk spilling, then the shuffle write is correct. I can get more info on a workload to repro this situation, but perhaps that description of events is sufficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4356) Test Scala 2.11 on Jenkins
[ https://issues.apache.org/jira/browse/SPARK-4356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-4356: - Target Version/s: (was: 1.2.0) Test Scala 2.11 on Jenkins -- Key: SPARK-4356 URL: https://issues.apache.org/jira/browse/SPARK-4356 Project: Spark Issue Type: Improvement Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical We need to make some modifications to the test harness so that we can test Scala 2.11 in Maven regularly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3115) Improve task broadcast latency for small tasks
[ https://issues.apache.org/jira/browse/SPARK-3115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3115: - Target Version/s: (was: 1.2.0) Improve task broadcast latency for small tasks -- Key: SPARK-3115 URL: https://issues.apache.org/jira/browse/SPARK-3115 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Shivaram Venkataraman Assignee: Reynold Xin Broadcasting the task information helps reduce the amount of data transferred for large tasks. However we've seen that this adds more latency for small tasks. It'll be great to profile and fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1823) ExternalAppendOnlyMap can still OOM if one key is very large
[ https://issues.apache.org/jira/browse/SPARK-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1823: - Target Version/s: (was: 1.2.0) ExternalAppendOnlyMap can still OOM if one key is very large Key: SPARK-1823 URL: https://issues.apache.org/jira/browse/SPARK-1823 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Andrew Or If the values for one key do not collectively fit into memory, then the map will still OOM when you merge the spilled contents back in. This is a problem especially for PySpark, since we hash the keys (Python objects) before a shuffle, and there are only so many integers out there in the world, so there could potentially be many collisions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3374) Spark on Yarn config cleanup
[ https://issues.apache.org/jira/browse/SPARK-3374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3374: - Target Version/s: (was: 1.2.0) Spark on Yarn config cleanup Key: SPARK-3374 URL: https://issues.apache.org/jira/browse/SPARK-3374 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.1.0 Reporter: Thomas Graves The configs in yarn have gotten scattered and inconsistent between cluster and client modes and supporting backwards compatibility. We should try to clean this up, move things to common places and support configs across both cluster and client modes where we want to make them public. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2365) Add IndexedRDD, an efficient updatable key-value store
[ https://issues.apache.org/jira/browse/SPARK-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-2365: - Target Version/s: (was: 1.2.0) Add IndexedRDD, an efficient updatable key-value store -- Key: SPARK-2365 URL: https://issues.apache.org/jira/browse/SPARK-2365 Project: Spark Issue Type: New Feature Components: GraphX, Spark Core Reporter: Ankur Dave Assignee: Ankur Dave Attachments: 2014-07-07-IndexedRDD-design-review.pdf RDDs currently provide a bulk-updatable, iterator-based interface. This imposes minimal requirements on the storage layer, which only needs to support sequential access, enabling on-disk and serialized storage. However, many applications would benefit from a richer interface. Efficient support for point lookups would enable serving data out of RDDs, but it currently requires iterating over an entire partition to find the desired element. Point updates similarly require copying an entire iterator. Joins are also expensive, requiring a shuffle and local hash joins. To address these problems, we propose IndexedRDD, an efficient key-value store built on RDDs. IndexedRDD would extend RDD[(Long, V)] by enforcing key uniqueness and pre-indexing the entries for efficient joins and point lookups, updates, and deletions. It would be implemented by (1) hash-partitioning the entries by key, (2) maintaining a hash index within each partition, and (3) using purely functional (immutable and efficiently updatable) data structures to enable efficient modifications and deletions. GraphX would be the first user of IndexedRDD, since it currently implements a limited form of this functionality in VertexRDD. We envision a variety of other uses for IndexedRDD, including streaming updates to RDDs, direct serving from RDDs, and as an execution strategy for Spark SQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
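The existing pair-RDD API already hints at the payoff: once an RDD is hash-partitioned by key, lookup() scans only the single partition the key hashes to. IndexedRDD would add the per-partition hash index and efficient functional updates on top of this. A small illustration:
{code}
import org.apache.spark.HashPartitioner

// Pre-partitioning by key is what makes point lookups cheap.
val kv = sc.parallelize(0L until 1000000L)
  .map(i => (i, i * 2))
  .partitionBy(new HashPartitioner(64))
  .cache()

kv.lookup(42L) // Seq(84): touches one partition instead of all 64
{code}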
[jira] [Updated] (SPARK-3441) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle
[ https://issues.apache.org/jira/browse/SPARK-3441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3441: - Target Version/s: (was: 1.2.0) Explain in docs that repartitionAndSortWithinPartitions enacts Hadoop style shuffle --- Key: SPARK-3441 URL: https://issues.apache.org/jira/browse/SPARK-3441 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Patrick Wendell Assignee: Sandy Ryza I think it would be good to say something like this in the doc for repartitionAndSortWithinPartitions and maybe also in the doc for groupBy:
{code}
This can be used to enact a Hadoop-style shuffle along with a call to mapPartitions, e.g.:
rdd.repartitionAndSortWithinPartitions(part).mapPartitions(...)
{code}
It might also be nice to add a version that doesn't take a partitioner and/or to mention this in the groupBy javadoc. I guess it depends a bit on whether we consider this to be an API we want people to use more widely or whether we just consider it a narrow stable API mostly for Hive-on-Spark. If we want people to consider this API when porting workloads from Hadoop, then it might be worth documenting better. What do you think [~rxin] and [~matei]? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
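A sketch of that Hadoop-style pattern: after repartitionAndSortWithinPartitions, runs of equal keys are contiguous within each partition, so mapPartitions can stream over them the way a Hadoop reducer streams over grouped values:
{code}
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

val summed = pairs
  .repartitionAndSortWithinPartitions(new HashPartitioner(8))
  .mapPartitions { it =>
    new Iterator[(String, Int)] {
      private val buffered = it.buffered
      def hasNext: Boolean = buffered.hasNext
      def next(): (String, Int) = {
        val (key, first) = buffered.next()
        var sum = first
        // Consume the contiguous run of values sharing this key.
        while (buffered.hasNext && buffered.head._1 == key) sum += buffered.next()._2
        (key, sum)
      }
    }
  }

summed.collect() // Array((a,4), (b,2)), grouped without a full groupByKey
{code}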
[jira] [Updated] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3629: - Target Version/s: (was: 1.1.1, 1.2.0) Improvements to YARN doc Key: SPARK-3629 URL: https://issues.apache.org/jira/browse/SPARK-3629 Project: Spark Issue Type: Documentation Components: Documentation, YARN Reporter: Matei Zaharia Labels: starter Right now this doc starts off with a big list of config options, and only then tells you how to submit an app. It would be better to put that part and the packaging part first, and the config options only at the end. In addition, the doc mentions yarn-cluster vs yarn-client as separate masters, which is inconsistent with the help output from spark-submit (which says to always use yarn). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3982) receiverStream in Python API
[ https://issues.apache.org/jira/browse/SPARK-3982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3982: - Target Version/s: (was: 1.2.0) receiverStream in Python API Key: SPARK-3982 URL: https://issues.apache.org/jira/browse/SPARK-3982 Project: Spark Issue Type: New Feature Components: PySpark, Streaming Reporter: Davies Liu Assignee: Davies Liu receiverStream() is used to extend the input sources of streaming; it would be very useful to have it in the Python API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7285) Audit missing Hive functions
[ https://issues.apache.org/jira/browse/SPARK-7285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7285: --- Description: Create a list of functions that is on this page but not in SQL/DataFrame. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff: *basic* -between- bitwise operation bitwiseAND bitwiseOR bitwiseXOR bitwiseNOT *math* round(DOUBLE a) round(DOUBLE a, INT d) Returns a rounded to d decimal places. log2 sqrt(string column name) bin hex(long), hex(string), hex(binary) unhex(string) - binary conv pmod factorial toDeg - toDegrees toRad - toRadians e() pi() shiftleft(int or long) shiftright(int or long) shiftrightunsigned(int or long) *collection functions* sort_array(array) size(map, array) map_values(mapk,v): arrayv map_keys(mapk,v):arrayk array_contains(arrayt, value): boolean *date functions* from_unixtime(long, string): string unix_timestamp(): long unix_timestamp(date): long year(date): int month(date): int day(date): int dayofmonth(date); int hour(timestamp): int minute(timestamp): int second(timestamp): int weekofyear(date): int date_add(date, int) date_sub(date, int) from_utc_timestamp(timestamp, string timezone): timestamp current_date(): date current_timestamp(): timestamp add_months(string start_date, int num_months): string last_day(string date): string next_day(string start_date, string day_of_week): string trunc(string date[, string format]): string months_between(date1, date2): double date_format(date/timestamp/string ts, string fmt): String *conditional functions* if(boolean testCondition, T valueTrue, T valueFalseOrNull): T nvl(T value, T default_value): T greatest(T v1, T v2, …): T least(T v1, T v2, …): T *string functions* ascii(string str): int base64(binary): string concat(string|binary A, string|binary B…): string | binary concat_ws(string SEP, string A, string B…): string concat_ws(string SEP, arraystring): string decode(binary bin, string charset): string encode(string src, string charset): binary find_in_set(string str, string strList): int format_number(number x, int d): string length(string): int instr(string str, string substr): int locate(string substr, string str[, int pos]): int lower(string), lcase(string) lpad(string str, int len, string pad): string ltrim(string): string parse_url(string urlString, string partToExtract [, string keyToExtract]): string printf(String format, Obj... args): string regexp_extract(string subject, string pattern, int index): string regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT): string repeat(string str, int n): string reverse(string A): string rpad(string str, int len, string pad): string space(int n): string split(string str, string pat): array str_to_map(text[, delimiter1, delimiter2]): mapstring, string trim(string A): string unbase64(string str): binary upper(string A) ucase(string A): string levenshtein(string A, string B: int soundex(string A): string *Misc* hash(a1[, a2…]): int *text* context_ngrams(arrayarraystring, arraystring, int K, int pf): arraystructstring,double ngrams(arrayarraystring, int N, int K, int pf): arraystructstring,double sentences(string str, string lang, string locale): arrayarraystring *UDAF* var_samp stddev_pop stddev_samp covar_pop covar_samp corr percentile: arraydouble percentile_approx: arraydouble histogram_numeric: arraystruct {'x','y'} collect_set — we have hashset collect_list ntile was: Create a list of functions that is on this page but not in SQL/DataFrame. 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF Here's the list of missing stuff: *basic* between bitwise operation bitwiseAND bitwiseOR bitwiseXOR bitwiseNOT *math* round(DOUBLE a) round(DOUBLE a, INT d) Returns a rounded to d decimal places. log2 sqrt(string column name) bin hex(long), hex(string), hex(binary) unhex(string) - binary conv pmod factorial toDeg - toDegrees toRad - toRadians e() pi() shiftleft(int or long) shiftright(int or long) shiftrightunsigned(int or long) *collection functions* sort_array(array) size(map, array) map_values(mapk,v): arrayv map_keys(mapk,v):arrayk array_contains(arrayt, value): boolean *date functions* from_unixtime(long, string): string unix_timestamp(): long unix_timestamp(date): long year(date): int month(date): int day(date): int dayofmonth(date); int hour(timestamp): int minute(timestamp): int second(timestamp): int weekofyear(date): int date_add(date, int) date_sub(date, int) from_utc_timestamp(timestamp, string timezone): timestamp current_date(): date current_timestamp(): timestamp add_months(string start_date, int num_months): string last_day(string date): string next_day(string start_date, string day_of_week): string trunc(string date[, string format]): string months_between(date1, date2): double date_format(date/timestamp/string ts, string fmt): String
[jira] [Resolved] (SPARK-4784) Model.fittingParamMap should store all Params
[ https://issues.apache.org/jira/browse/SPARK-4784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-4784. -- Resolution: Not A Problem [SPARK-5956] removed fittingParamMap, so this is no longer an issue Model.fittingParamMap should store all Params - Key: SPARK-4784 URL: https://issues.apache.org/jira/browse/SPARK-4784 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Priority: Minor spark.ml's Model class should store all parameters in the fittingParamMap, not just the ones which were explicitly set. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7394: --- Description: Basically alias astype == cast in Column for Python (and Python only). was:Basically alias astype == cast in Column for Python. Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column for Python (and Python only). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530063#comment-14530063 ] Reynold Xin commented on SPARK-7394: cc [~smacat] would you like to do this one as well? Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7395) some suggestion about SimpleApp in quick-start.html
[ https://issues.apache.org/jira/browse/SPARK-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7395. -- Resolution: Not A Problem Fix Version/s: (was: 1.4.0) Target Version/s: (was: 1.3.1) Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first. Don't set target or fix version. You set a master when you run the example with spark-submit, not in the code. some suggestion about SimpleApp in quick-start.html --- Key: SPARK-7395 URL: https://issues.apache.org/jira/browse/SPARK-7395 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.3.1 Environment: none Reporter: zhengbing li Original Estimate: 12h Remaining Estimate: 12h Based on the SimpleApp code guide in https://spark.apache.org/docs/latest/quick-start.html, I could not run the SimpleApp code until I modified val conf = new SparkConf().setAppName("Simple Application") to val conf = new SparkConf().setAppName("Simple Application").setMaster("local"). So the document might need to be modified for beginners. The error from the Scala example is as follows: 15/05/06 15:05:48 INFO SparkContext: Running Spark version 1.3.0 Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration at org.apache.spark.SparkContext.init(SparkContext.scala:206) at com.huawei.openspark.TestSpark$.main(TestSpark.scala:12) at com.huawei.openspark.TestSpark.main(TestSpark.scala) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7381) Python API for Transformers
[ https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7381: - Assignee: Burak Yavuz Python API for Transformers --- Key: SPARK-7381 URL: https://issues.apache.org/jira/browse/SPARK-7381 Project: Spark Issue Type: Umbrella Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7383) Python API for ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7383: - Assignee: Burak Yavuz Python API for ml.feature - Key: SPARK-7383 URL: https://issues.apache.org/jira/browse/SPARK-7383 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7382) Python API for ml.classification
[ https://issues.apache.org/jira/browse/SPARK-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7382: - Assignee: Burak Yavuz Python API for ml.classification Key: SPARK-7382 URL: https://issues.apache.org/jira/browse/SPARK-7382 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz Assignee: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7383) Python API for ml.feature
[ https://issues.apache.org/jira/browse/SPARK-7383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7383: - Priority: Major (was: Blocker) Python API for ml.feature - Key: SPARK-7383 URL: https://issues.apache.org/jira/browse/SPARK-7383 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7388) Python Api for Param[Array[T]]
[ https://issues.apache.org/jira/browse/SPARK-7388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7388: - Assignee: Burak Yavuz Python Api for Param[Array[T]] -- Key: SPARK-7388 URL: https://issues.apache.org/jira/browse/SPARK-7388 Project: Spark Issue Type: Sub-task Components: ML Reporter: Burak Yavuz Assignee: Burak Yavuz Python can't set Array[T]-typed params, because py4j casts a Python list to an ArrayList. Instead of Param[Array[T]], we will have an ArrayParam[T] which can take a Seq[T]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
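A standalone sketch of the idea, illustrative only and not the actual ml.param API: py4j marshals a Python list as a java.util.List, which converts cleanly to a Seq but not to an Array[T]:
{code}
import scala.collection.JavaConverters._

// Illustrative stand-in for the proposed ArrayParam (not Spark's Param class).
class ArrayParam[T](val name: String) {
  private var value: Seq[T] = Seq.empty

  def set(v: Seq[T]): this.type = { value = v; this }

  // Overload for py4j callers: a Python list arrives as a java.util.List.
  def set(v: java.util.List[T]): this.type = set(v.asScala.toSeq)

  def get: Seq[T] = value
}
{code}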
[jira] [Updated] (SPARK-7381) Python API for Transformers
[ https://issues.apache.org/jira/browse/SPARK-7381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7381: - Priority: Major (was: Blocker) Python API for Transformers --- Key: SPARK-7381 URL: https://issues.apache.org/jira/browse/SPARK-7381 Project: Spark Issue Type: Umbrella Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7382) Python API for ml.classification
[ https://issues.apache.org/jira/browse/SPARK-7382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7382: - Priority: Major (was: Blocker) Python API for ml.classification Key: SPARK-7382 URL: https://issues.apache.org/jira/browse/SPARK-7382 Project: Spark Issue Type: Sub-task Components: ML, PySpark Reporter: Burak Yavuz -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6812) filter() on DataFrame does not work as expected
[ https://issues.apache.org/jira/browse/SPARK-6812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530106#comment-14530106 ] Sun Rui commented on SPARK-6812: According to the R manual (https://stat.ethz.ch/R-manual/R-devel/library/base/html/Startup.html), if a function .First is found on the search path, it is executed as .First(). Finally, the function .First.sys() in the base package is run; this calls require() to attach the default packages specified by options(defaultPackages). In .First() in profile/shell.R, we load the SparkR package. This means the SparkR package is loaded before the default packages, so if the default packages contain the same names, they overwrite those in SparkR. This is why filter() in SparkR is masked by filter() in stats, which is usually in the default package list. We need to make sure SparkR is loaded after the default packages. The solution is to append SparkR to the default packages instead of loading SparkR in .First(). filter() on DataFrame does not work as expected --- Key: SPARK-6812 URL: https://issues.apache.org/jira/browse/SPARK-6812 Project: Spark Issue Type: Bug Components: SparkR Reporter: Davies Liu Assignee: Sun Rui Priority: Blocker {code} filter(df, df$age > 21) Error in filter(df, df$age > 21) : no method for coercing this S4 class to a vector {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7396) Update Kafka example to use new API of Kafka 0.8.2
[ https://issues.apache.org/jira/browse/SPARK-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7396: --- Assignee: Apache Spark Update Kafka example to use new API of Kafka 0.8.2 -- Key: SPARK-7396 URL: https://issues.apache.org/jira/browse/SPARK-7396 Project: Spark Issue Type: Bug Components: Examples, Streaming Affects Versions: 1.4.0 Reporter: Saisai Shao Assignee: Apache Spark Due to the Kafka upgrade, the current KafkaWordCountProducer will throw the exception below; we need to update the code accordingly.
{code}
Exception in thread "main" kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.
 at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:90)
 at kafka.producer.Producer.send(Producer.scala:77)
 at org.apache.spark.examples.streaming.KafkaWordCountProducer$.main(KafkaWordCount.scala:96)
 at org.apache.spark.examples.streaming.KafkaWordCountProducer.main(KafkaWordCount.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:623)
 at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169)
 at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7393) How to improve Spark SQL performance?
Liang Lee created SPARK-7393: Summary: How to improve Spark SQL performance? Key: SPARK-7393 URL: https://issues.apache.org/jira/browse/SPARK-7393 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang Lee -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7394: --- Labels: starter (was: ) Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7394) Add Pandas style cast (astype)
[ https://issues.apache.org/jira/browse/SPARK-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7394: --- Description: Basically alias astype == cast in Column for Python. (was: Basically alias astype == cast in Column.) Add Pandas style cast (astype) -- Key: SPARK-7394 URL: https://issues.apache.org/jira/browse/SPARK-7394 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Fix For: 1.4.0 Basically alias astype == cast in Column for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6284) Support framework authentication and role in Mesos framework
[ https://issues.apache.org/jira/browse/SPARK-6284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530117#comment-14530117 ] Bharath Ravi Kumar commented on SPARK-6284: --- [~tnachen] This issue makes Spark on Mesos a non-starter in a multi-tenant mesos production setup. Any possibility that this will be back-ported into a minor release (1.3.x or 1.2.x) ? On a related note, is any further assistance required for testing this change? I could help with that. Thanks. Support framework authentication and role in Mesos framework Key: SPARK-6284 URL: https://issues.apache.org/jira/browse/SPARK-6284 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Support framework authentication and role in both Coarse grain and fine grain mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530438#comment-14530438 ] Hrishikesh edited comment on SPARK-6258 at 5/6/15 12:34 PM: [~yanboliang], I got stuck at one stage. You can work on it. was (Author: hrishikesh91): [~yanboliang], you can start working on it. Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
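For concreteness, a hypothetical sketch of what the requested KMeans parity could look like in the PySpark shell, assuming the keyword and accessor names simply mirror the Scala setters (none of these exist in the 1.3 Python API yet):
{code}
# Hypothetical: the epsilon/initializationSteps keywords and the k/computeCost
# accessors are assumed to mirror the Scala API; sc is the shell's SparkContext.
from pyspark.mllib.clustering import KMeans

data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(data, k=2,
                     initializationSteps=5,  # proposed: setInitializationSteps
                     epsilon=1e-4)           # proposed: setEpsilon
print(model.k)                  # proposed: KMeansModel.k
print(model.computeCost(data))  # proposed: KMeansModel.computeCost
{code}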
[jira] [Reopened] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid reopened SPARK-3454: - Need to fix some issues w/ test files in the PR ... Expose JSON representation of data shown in WebUI - Key: SPARK-3454 URL: https://issues.apache.org/jira/browse/SPARK-3454 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Imran Rashid Fix For: 1.4.0 Attachments: sparkmonitoringjsondesign.pdf If the WebUI supported extracting its data as JSON, it would help users who want to analyse stage / task / executor information. Fortunately, the WebUI has a renderJson method, so we can implement the method in each subclass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7116) Intermediate RDD cached but never unpersisted
[ https://issues.apache.org/jira/browse/SPARK-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530390#comment-14530390 ] Dennis Proppe edited comment on SPARK-7116 at 5/6/15 11:47 AM: --- *I second the importance of fixing this.* For our production workloads, this is really a large issue and blocks the use of large data: 1) It creates non-removable RDDs cached in memory 2) They can't even be removed by deleting the RDD in question in the code 3) The RDDs are up to 10 times bigger than they would be by calling df.cache() The behaviour can be reproduced easily in the PySpark shell:
{code}
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

slen = udf(lambda s: len(s), IntegerType())
rows = [["a", "hello"], ["b", "goodbye"], ["c", "whatever"]] * 1000
rd = sc.parallelize(rows)
head = ['order', 'word']
df = rd.toDF(head)
df.cache().count()
bigfile = df.withColumn("word_length", slen(df.word))
bigfile.count()
{code}
If you now compare the size of the cached df with that of the automagically appearing cached bigfile, you see that bigfile uses about 8X the storage of df, although it only has one more column. was (Author: dproppe): *I second the importance of fixing this.* For our production workloads, this is really a large issue and blocks the use of large data: 1) It creates non-removeable RDDs cached in memory 2) They can't even be removed by deleting the RDD in question in the code 3) The RDDs are up to 10 times bigger that they would be by calling df.cache() The behaviour can be reproduced easily in the PySpark Shell: from pyspark import SQLContext from pyspark.sql.types import IntegerType from pyspark.sql.functions import udf sqlcon = SQLContext(sc) slen = udf(lambda s: len(s), IntegerType()) rows = [[a, hello], [b, goodbye],[c, whatever]] * 1000 rd = sc.parallelize(rows) head = ['order','word'] df = rd.toDF(head) df.cache().count() ` bigfile = df.withColumn(word_length, slen(df.word)) bigfile.count() ` If you now compare the size of the cached df versus the automagically appeared cached bigfile, you see that bigfile uses about 8X the storage of df, although it only has 1 column more. Intermediate RDD cached but never unpersisted - Key: SPARK-7116 URL: https://issues.apache.org/jira/browse/SPARK-7116 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.3.1 Reporter: Kalle Jepsen In https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala#L233 an intermediate RDD is cached, but never unpersisted. It shows up in the 'Storage' section of the Web UI, but cannot be removed. There's already a comment in the source, suggesting to 'clean up'. If that cleanup is more involved than simply calling `unpersist`, it probably exceeds my current Scala skills. Why that is a problem: I'm adding a constant column to a DataFrame of about 20M records resulting from an inner join with {{df.withColumn(colname, ud_func())}}, where {{ud_func}} is simply a wrapped {{lambda: 1}}. Before and after applying the UDF the DataFrame takes up ~430MB in the cache. The cached intermediate RDD however takes up ~10GB(!) of storage, and I know of no way to uncache it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5913) Python API for ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5913: --- Assignee: Apache Spark (was: Yanbo Liang) Python API for ChiSqSelector Key: SPARK-5913 URL: https://issues.apache.org/jira/browse/SPARK-5913 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Minor Add a Python API for mllib.feature.ChiSqSelector -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5913) Python API for ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-5913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5913: --- Assignee: Yanbo Liang (was: Apache Spark) Python API for ChiSqSelector Key: SPARK-5913 URL: https://issues.apache.org/jira/browse/SPARK-5913 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Assignee: Yanbo Liang Priority: Minor Add a Python API for mllib.feature.ChiSqSelector -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
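As a rough usage sketch, a Python wrapper mirroring Scala's mllib.feature.ChiSqSelector might look like this (the module path and signatures are assumptions based on the Scala side; sc is the shell's SparkContext):
{code}
# Hypothetical usage; names assumed from the Scala mllib.feature.ChiSqSelector.
from pyspark.mllib.feature import ChiSqSelector
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

points = sc.parallelize([
    LabeledPoint(0.0, Vectors.dense([0.0, 1.0, 2.0])),
    LabeledPoint(1.0, Vectors.dense([8.0, 7.0, 6.0])),
])
selector = ChiSqSelector(numTopFeatures=1)  # keep the single best feature
model = selector.fit(points)                # chi-squared test of features vs. labels
reduced = model.transform(points.map(lambda p: p.features))
{code}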
[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530438#comment-14530438 ] Hrishikesh commented on SPARK-6258: --- [~yanboliang], you can start working on it. Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530434#comment-14530434 ] Yanbo Liang commented on SPARK-6258: [~hrishikesh] Are you still work on this issue? If you are not working on it, I can take it. [~josephkb] Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530434#comment-14530434 ] Yanbo Liang edited comment on SPARK-6258 at 5/6/15 12:26 PM: - [~hrishikesh] Are you still working on this issue? If you are not working on it, I can take it. [~josephkb] was (Author: yanboliang): [~hrishikesh] Are you still work on this issue? If you are not working on it, I can take it. [~josephkb] Python MLlib API missing items: Clustering -- Key: SPARK-6258 URL: https://issues.apache.org/jira/browse/SPARK-6258 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Joseph K. Bradley This JIRA lists items missing in the Python API for this sub-package of MLlib. This list may be incomplete, so please check again when sending a PR to add these features to the Python API. Also, please check for major disparities between documentation; some parts of the Python API are less well-documented than their Scala counterparts. Some items may be listed in the umbrella JIRA linked to this task. KMeans * setEpsilon * setInitializationSteps KMeansModel * computeCost * k GaussianMixture * setInitialModel GaussianMixtureModel * k Completely missing items which should be fixed in separate JIRAs (which have been created and linked to the umbrella JIRA) * LDA * PowerIterationClustering * StreamingKMeans -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7369) Spark Python 1.3.1 Mllib dataframe random forest problem
[ https://issues.apache.org/jira/browse/SPARK-7369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisbeth Ron updated SPARK-7369: --- Attachment: random_forest_dataframe_spark_30042015.py Hi Sean, I still have problems with Python Spark. Here are the errors and also the code that I'm using. Thanks, Lisbeth
{code}
15/05/06 13:14:24 INFO ContextCleaner: Cleaned broadcast 1
15/05/06 13:14:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on node001.ca-innovation.fr:47882 (size: 11.0 KB, free: 8.3 GB)
15/05/06 13:14:24 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on node006.ca-innovation.fr:50830 (size: 11.0 KB, free: 8.3 GB)
15/05/06 13:14:25 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 5, node001.ca-innovation.fr): java.lang.NullPointerException
at org.apache.spark.api.python.SerDeUtil$$anonfun$toJavaArray$1.apply(SerDeUtil.scala:106)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:123)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.next(SerDeUtil.scala:114)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:114)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:421)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:243)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:205)
15/05/06 13:14:25 INFO TaskSetManager: Starting task 0.1 in stage 3.0 (TID 7, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:25 INFO TaskSetManager: Lost task 1.0 in stage 3.0 (TID 6) on executor node006.ca-innovation.fr: java.lang.NullPointerException (null) [duplicate 1]
15/05/06 13:14:25 INFO TaskSetManager: Starting task 1.1 in stage 3.0 (TID 8, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:26 INFO TaskSetManager: Lost task 0.1 in stage 3.0 (TID 7) on executor node001.ca-innovation.fr: java.lang.NullPointerException (null) [duplicate 2]
15/05/06 13:14:26 INFO TaskSetManager: Starting task 0.2 in stage 3.0 (TID 9, node006.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:26 INFO TaskSetManager: Lost task 1.1 in stage 3.0 (TID 8) on executor node001.ca-innovation.fr: java.lang.NullPointerException (null) [duplicate 3]
15/05/06 13:14:26 INFO TaskSetManager: Starting task 1.2 in stage 3.0 (TID 10, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:27 INFO TaskSetManager: Lost task 0.2 in stage 3.0 (TID 9) on executor node006.ca-innovation.fr: java.lang.NullPointerException (null) [duplicate 4]
15/05/06 13:14:27 INFO TaskSetManager: Starting task 0.3 in stage 3.0 (TID 11, node006.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:27 INFO TaskSetManager: Lost task 1.2 in stage 3.0 (TID 10) on executor node001.ca-innovation.fr: java.lang.NullPointerException (null) [duplicate 5]
15/05/06 13:14:27 INFO TaskSetManager: Starting task 1.3 in stage 3.0 (TID 12, node001.ca-innovation.fr, NODE_LOCAL, 1409 bytes)
15/05/06 13:14:28 INFO TaskSetManager: Lost task 0.3 in stage 3.0 (TID 11) on executor node006.ca-innovation.fr: java.lang.NullPointerException (null) [duplicate 6]
15/05/06 13:14:28 ERROR TaskSetManager: Task 0 in stage 3.0 failed 4 times; aborting job
15/05/06 13:14:28 INFO TaskSchedulerImpl: Cancelling stage 3
15/05/06 13:14:28 INFO TaskSchedulerImpl: Stage 3 was cancelled
15/05/06 13:14:28 INFO DAGScheduler: Stage 3 (count at /mapr/MapR-Cluster/casarisk/data/POCGRO/Codes/Spark_python/RF_Python_Spark_30042015/random_forest_dataframe_spark_30042015.py:79) failed in 4.025 s
15/05/06 13:14:28 INFO DAGScheduler: Job 3 failed: count at /mapr/MapR-Cluster/casarisk/data/POCGRO/Codes/Spark_python/RF_Python_Spark_30042015/random_forest_dataframe_spark_30042015.py:79, took 4.052326 s
Traceback (most recent call last):
  File "/mapr/MapR-Cluster/casarisk/data/POCGRO/Codes/Spark_python/RF_Python_Spark_30042015/random_forest_dataframe_spark_30042015.py", line 79, in <module>
    print trainingData.count()
  File "/opt/mapr/spark/spark-1.3.1-bin-mapr4/python/pyspark/rdd.py", line 932, in count
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/opt/mapr/spark/spark-1.3.1-bin-mapr4/python/pyspark/rdd.py", line 923, in sum
    return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/opt/mapr/spark/spark-1.3.1-bin-mapr4/python/pyspark/rdd.py",
{code}
[jira] [Updated] (SPARK-7322) Add DataFrame DSL for window function support
[ https://issues.apache.org/jira/browse/SPARK-7322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Olivier Girardot updated SPARK-7322: Labels: DataFrame (was: ) Add DataFrame DSL for window function support - Key: SPARK-7322 URL: https://issues.apache.org/jira/browse/SPARK-7322 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: DataFrame Here's a proposal for supporting window functions in the DataFrame DSL: 1. Add an over function to Column:
{code}
class Column {
  ...
  def over(): WindowFunctionSpec
  ...
}
{code}
2. WindowFunctionSpec:
{code}
// By default frame = full partition
class WindowFunctionSpec {
  def partitionBy(cols: Column*): WindowFunctionSpec
  def orderBy(cols: Column*): WindowFunctionSpec

  // restrict frame beginning from current row - n position
  def rowsPreceding(n: Int): WindowFunctionSpec
  // restrict frame ending at current row + n position
  def rowsFollowing(n: Int): WindowFunctionSpec

  def rangePreceding(n: Int): WindowFunctionSpec
  def rangeFollowing(n: Int): WindowFunctionSpec
}
{code}
Here's an example of using it:
{code}
df.select(
  df.store,
  df.date,
  df.sales,
  avg(df.sales).over.partitionBy(df.store)
                    .orderBy(df.store)
                    .rowsFollowing(0) // this means from unbounded preceding to current row
)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530562#comment-14530562 ] Apache Spark commented on SPARK-3454: - User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/5940 Expose JSON representation of data shown in WebUI - Key: SPARK-3454 URL: https://issues.apache.org/jira/browse/SPARK-3454 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.1.0 Reporter: Kousuke Saruta Assignee: Imran Rashid Fix For: 1.4.0 Attachments: sparkmonitoringjsondesign.pdf If the WebUI supported extracting its data as JSON, it would help users who want to analyse stage / task / executor information. Fortunately, the WebUI has a renderJson method, so we can implement the method in each subclass. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7398) Add back-pressure to Spark Streaming
François Garillot created SPARK-7398: Summary: Add back-pressure to Spark Streaming Key: SPARK-7398 URL: https://issues.apache.org/jira/browse/SPARK-7398 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.3.1 Reporter: François Garillot Spark Streaming has trouble dealing with situations where batch processing time > batch interval, meaning a throughput of input data higher than Spark's ability to remove data from the queue. If this throughput is sustained for long enough, it leads to an unstable situation where the memory of the Receiver's Executor is overflowed. This issue aims at transmitting a back-pressure signal back to the data ingestion side to help deal with that high throughput, in a backwards-compatible way. The design doc can be found here: https://docs.google.com/document/d/15m8N717n4hwFya_pJHUbiv5s50Mq1SZZzL9ftOhms4E/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
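Until such a signal exists, the usual mitigation is a static cap on ingestion, which back-pressure would effectively make adaptive. A sketch using the existing static rate limit conf (the number is made up):
{code}
# Static rate limiting: the blunt workaround that a back-pressure signal
# would replace with dynamic feedback. Caps each receiver's intake.
from pyspark import SparkConf

conf = (SparkConf()
        .setAppName("ingest")
        .set("spark.streaming.receiver.maxRate", "10000"))  # records/sec per receiver
{code}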
[jira] [Assigned] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6093: --- Assignee: Apache Spark Add RegressionMetrics in PySpark/MLlib -- Key: SPARK-6093 URL: https://issues.apache.org/jira/browse/SPARK-6093 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530581#comment-14530581 ] Apache Spark commented on SPARK-6093: - User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/5941 Add RegressionMetrics in PySpark/MLlib -- Key: SPARK-6093 URL: https://issues.apache.org/jira/browse/SPARK-6093 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
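A hypothetical usage sketch of the requested wrapper, with the module path and metric names assumed from Scala's mllib.evaluation.RegressionMetrics (sc is the shell's SparkContext):
{code}
# Hypothetical usage; names assumed from Scala's RegressionMetrics.
from pyspark.mllib.evaluation import RegressionMetrics

# RDD of (prediction, observation) pairs
predictionAndObservations = sc.parallelize(
    [(2.5, 3.0), (0.0, -0.5), (2.0, 2.0), (8.0, 7.0)])
metrics = RegressionMetrics(predictionAndObservations)
print(metrics.meanSquaredError)
print(metrics.rootMeanSquaredError)
print(metrics.r2)
{code}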
[jira] [Commented] (SPARK-7116) Intermediate RDD cached but never unpersisted
[ https://issues.apache.org/jira/browse/SPARK-7116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530580#comment-14530580 ] Kalle Jepsen commented on SPARK-7116: - [~marmbrus] Do you remember why that {{cache()}} was necessary? I've boldly commented it out and the only thing that seems to have changed is that everything runs a lot faster... Intermediate RDD cached but never unpersisted - Key: SPARK-7116 URL: https://issues.apache.org/jira/browse/SPARK-7116 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.3.1 Reporter: Kalle Jepsen In https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/pythonUdfs.scala#L233 an intermediate RDD is cached, but never unpersisted. It shows up in the 'Storage' section of the Web UI, but cannot be removed. There's already a comment in the source, suggesting to 'clean up'. If that cleanup is more involved than simply calling `unpersist`, it probably exceeds my current Scala skills. Why that is a problem: I'm adding a constant column to a DataFrame of about 20M records resulting from an inner join with {{df.withColumn(colname, ud_func())}} , where {{ud_func}} is simply a wrapped {{lambda: 1}}. Before and after applying the UDF the DataFrame takes up ~430MB in the cache. The cached intermediate RDD however takes up ~10GB(!) of storage, and I know of no way to uncache it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6093: --- Assignee: (was: Apache Spark) Add RegressionMetrics in PySpark/MLlib -- Key: SPARK-6093 URL: https://issues.apache.org/jira/browse/SPARK-6093 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6093) Add RegressionMetrics in PySpark/MLlib
[ https://issues.apache.org/jira/browse/SPARK-6093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530586#comment-14530586 ] Yanbo Liang commented on SPARK-6093: [~mengxr] Could you assign this to me? Add RegressionMetrics in PySpark/MLlib -- Key: SPARK-6093 URL: https://issues.apache.org/jira/browse/SPARK-6093 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7035) Drop __getattr__ on pyspark.sql.DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7035: --- Assignee: (was: Apache Spark) Drop __getattr__ on pyspark.sql.DataFrame - Key: SPARK-7035 URL: https://issues.apache.org/jira/browse/SPARK-7035 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.4.0 Reporter: Kalle Jepsen I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed. There is no point in having the possibility to address the DataFrames columns as {{df.column}}, other than the questionable goal to please R developers. And it seems R people can use Spark from their native API in the future. I see the following problems with {{\_\_getattr\_\_}} for column selection: * It's un-pythonic: There should only be one obvious way to solve a problem, and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} method, which in my opinion is by far superior and a lot more intuitive. * It leads to confusing Exceptions. When we mistype a method-name the {{AttributeError}} will say 'No such column ... '. * And most importantly: we cannot load DataFrames that have columns with the same name as any attribute on the DataFrame-object. Imagine having a DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} will be ambiguous and lead to broken code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
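The last point is the strongest one; a tiny PySpark-shell illustration of the clash (column names chosen deliberately to collide with DataFrame methods; sqlContext is the shell's SQLContext):
{code}
# Columns named like DataFrame methods make df.<name> ambiguous: normal
# attribute lookup wins, so the method shadows the column.
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(cache=1, filter=2)])
df.cache      # resolves to the DataFrame method, not the 'cache' column
df['cache']   # __getitem__ is unambiguous: always the column
df['filter']  # likewise
{code}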
[jira] [Commented] (SPARK-7393) How to improve Spark SQL performance?
[ https://issues.apache.org/jira/browse/SPARK-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14530636#comment-14530636 ] Dennis Proppe commented on SPARK-7393: -- Hi, Liang Lee, without more information (HDD or SSD?), it is quite hard to reproduce this. Do you cache the DF before querying it? Otherwise, you'd be pulling the data from the parquet files on disk every time you query it. In that case, 3 seconds for a select on a table of 61 million rows sounds impressive. In our org, we found that working along the lines of:
{code}
df = sqlContext.load("file.parquet")
df.cache().count()
selection = df.where("foo = bar")
{code}
works really fast and may deliver good response times in your case. How to improve Spark SQL performance? - Key: SPARK-7393 URL: https://issues.apache.org/jira/browse/SPARK-7393 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang Lee We want to use Spark SQL in our project, but we found that the Spark SQL performance is not as good as we expected. The details are as follows: 1. We save data as a parquet file on HDFS. 2. We just select one or several rows from the parquet file using Spark SQL. 3. When the total record number is 61 million, it needs about 3 seconds to get the result, which is unacceptably long for our scenario. 4. When the total record number is 2 million, it needs about 93 ms to get the result, which is still a little long for us. 5. The query statement is like: SELECT * FROM DBA WHERE COLA=? AND COLB=? And the table is not complex, with fewer than 10 columns, and the content of each column is less than 100 bytes. 6. Does anyone know how to improve the performance or have some other ideas? 7. Can Spark SQL support microsecond-level response? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org