[jira] [Resolved] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6113. -- Resolution: Fixed Issue resolved by pull request 5626 [https://github.com/apache/spark/pull/5626] Stabilize DecisionTree and ensembles APIs - Key: SPARK-6113 URL: https://issues.apache.org/jira/browse/SPARK-6113 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical Fix For: 1.4.0 *Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design. *Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details. [Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7108) spark.local.dir is no longer honored in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512649#comment-14512649 ] Marcelo Vanzin commented on SPARK-7108: --- The way I read the documentation, {{spark.local.dir}} should only ever work on the driver, never on executors, since as the documentation says, that is managed by the cluster manager (regardless of whether you set SPARK_LOCAL_DIRS for your app or not - as the doc says, the cluster manager sets that!). If {{spark.local.dir}} is not working for the driver, then there's a potential bug here. spark.local.dir is no longer honored in Standalone mode --- Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
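To make the behaviour under discussion concrete, here is a minimal, hypothetical PySpark sketch (the path is made up for illustration): the application sets {{spark.local.dir}}, which at best affects the driver's scratch space, while in Standalone mode each executor uses the {{SPARK_LOCAL_DIRS}} value set by its worker, not the application configuration.
{code}
from pyspark import SparkConf, SparkContext

# spark.local.dir set here only influences the driver; standalone executors
# inherit SPARK_LOCAL_DIRS from the worker that launched them.
conf = (SparkConf()
        .setAppName("local-dir-example")
        .set("spark.local.dir", "/mnt/driver-scratch"))  # hypothetical path
sc = SparkContext(conf=conf)
print(conf.get("spark.local.dir"))  # what the driver was asked to use
{code}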
[jira] [Updated] (SPARK-7111) Add a tracker to track the direct (receiver-less) streams
[ https://issues.apache.org/jira/browse/SPARK-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7111: - Component/s: Streaming Add a tracker to track the direct (receiver-less) streams - Key: SPARK-7111 URL: https://issues.apache.org/jira/browse/SPARK-7111 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Saisai Shao Currently, for receiver-based input streams, Spark Streaming offers ReceiverTracker and ReceivedBlockTracker to track the status of receivers as well as block information. This status and block information can also be retrieved through StreamingListener and exposed to users. But for direct (receiver-less) input streams, Spark Streaming currently lacks such a mechanism to track the registered direct streams, and it also lacks a way to track how many records they have processed. This issue proposes a mechanism to track registered direct streams and to expose their processing statistics through BatchInfo and StreamingListener. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7112) Add a tracker to track the direct streams
[ https://issues.apache.org/jira/browse/SPARK-7112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7112: - Component/s: Streaming Add a tracker to track the direct streams - Key: SPARK-7112 URL: https://issues.apache.org/jira/browse/SPARK-7112 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Saisai Shao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI
[ https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7113: - Component/s: Streaming Add the direct stream related information to the streaming listener and web UI -- Key: SPARK-7113 URL: https://issues.apache.org/jira/browse/SPARK-7113 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Saisai Shao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7107) Add parameter for zookeeper.znode.parent to hbase_inputformat.py
[ https://issues.apache.org/jira/browse/SPARK-7107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7107: - Component/s: PySpark Add parameter for zookeeper.znode.parent to hbase_inputformat.py Key: SPARK-7107 URL: https://issues.apache.org/jira/browse/SPARK-7107 Project: Spark Issue Type: Bug Components: PySpark Reporter: Ted Yu Assignee: Ted Yu Priority: Minor [~yeshavora] first reported encountering the following exception running hbase_inputformat.py :
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.RuntimeException: java.lang.NullPointerException
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
  at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:313)
  at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:288)
  at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
{code}
It turned out that the HBase cluster has a custom znode parent:
{code}
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>
{code}
hbase_inputformat.py should support specifying a custom znode parent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
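One way the example could accept a custom znode parent (a hedged sketch, not the actual patch) is to pass it through the Hadoop configuration dictionary that hbase_inputformat.py already hands to {{newAPIHadoopRDD}}; the host and table names below are placeholders, and the converter classes are the ones bundled with the Spark examples.
{code}
# Hedged sketch; assumes `sc` from a pyspark shell and placeholder host/table names.
conf = {
    "hbase.zookeeper.quorum": "zk-host",
    "hbase.mapreduce.inputtable": "test_table",
    "zookeeper.znode.parent": "/hbase-unsecure",  # the non-default parent from this report
}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
print(hbase_rdd.count())
{code}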
[jira] [Updated] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5722: - Assignee: Don Drake Infer_schema_type incorrect for Integers in pyspark --- Key: SPARK-5722 URL: https://issues.apache.org/jira/browse/SPARK-5722 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Don Drake Assignee: Don Drake Fix For: 1.2.2 The Integers datatype in Python does not match what a Scala/Java integer is defined as. This causes inference of data types and schemas to fail when data is larger than 2^32 and it is inferred incorrectly as an Integer. Since the range of valid Python integers is wider than Java Integers, this causes problems when inferring Integer vs. Long datatypes. This will cause problems when attempting to save a SchemaRDD as Parquet or JSON. Here's an example:
{code}
sqlCtx = SQLContext(sc)
from pyspark.sql import Row
rdd = sc.parallelize([Row(f1='a', f2=100)])
srdd = sqlCtx.inferSchema(rdd)
srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}
That number is a LongType in Java, but an Integer in python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred. More tests:
{code}
from pyspark.sql import _infer_type
# OK
print _infer_type(1)
IntegerType
# OK
print _infer_type(2**31-1)
IntegerType
# WRONG
print _infer_type(2**31)
IntegerType
# WRONG
print _infer_type(2**61)
IntegerType
# OK
print _infer_type(2**71)
LongType
{code}
Java Primitive Types defined: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html Python Built-in Types: https://docs.python.org/2/library/stdtypes.html#typesnumeric -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
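The report boils down to a range check when mapping Python ints to Catalyst types. The following is a standalone sketch of that check (plain Python, not the actual pyspark.sql implementation): a value is only a safe IntegerType if it fits in the JVM's signed 32-bit range.
{code}
# Standalone sketch, not pyspark internals: choose Integer vs. Long for a Python int.
JVM_INT_MIN, JVM_INT_MAX = -(2**31), 2**31 - 1

def infer_integral_type(value):
    if JVM_INT_MIN <= value <= JVM_INT_MAX:
        return "IntegerType"
    return "LongType"

assert infer_integral_type(2**31 - 1) == "IntegerType"
assert infer_integral_type(2**31) == "LongType"   # wrongly IntegerType in the report above
assert infer_integral_type(2**61) == "LongType"
{code}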
[jira] [Updated] (SPARK-5651) Support 'create db.table' in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5651: - Assignee: Yadong Qi Support 'create db.table' in HiveContext Key: SPARK-5651 URL: https://issues.apache.org/jira/browse/SPARK-5651 Project: Spark Issue Type: Bug Components: SQL Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.4.0 Spark currently only supports ```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext. This patch adds support for ```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5712) Semicolon at end of a comment line
[ https://issues.apache.org/jira/browse/SPARK-5712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5712: - Assignee: Adrian Wang Semicolon at end of a comment line -- Key: SPARK-5712 URL: https://issues.apache.org/jira/browse/SPARK-5712 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.4.0 HIVE-3348 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5756) Analyzer should not throw scala.NotImplementedError for illegitimate sql
[ https://issues.apache.org/jira/browse/SPARK-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5756: - Assignee: Fei Wang Analyzer should not throw scala.NotImplementedError for illegitimate sql Key: SPARK-5756 URL: https://issues.apache.org/jira/browse/SPARK-5756 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Fei Wang Assignee: Fei Wang ```SELECT CAST(x AS STRING) FROM src``` throws a NotImplementedError: CliDriver: scala.NotImplementedError: an implementation is missing at scala.Predef$.$qmark$qmark$qmark(Predef.scala:252) at org.apache.spark.sql.catalyst.expressions.PrettyAttribute.dataType(namedExpressions.scala:221) at org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:30) at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:30) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:68) at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:56) at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:56) at org.apache.spark.sql.catalyst.expressions.NamedExpression.typeSuffix(namedExpressions.scala:62) at org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:124) at org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:78) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83) at scala.collection.immutable.Stream.map(Stream.scala:376) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:81) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:204) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:79) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5758) Use LongType as the default type for integers in JSON schema inference.
[ https://issues.apache.org/jira/browse/SPARK-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5758: - Assignee: Yin Huai Use LongType as the default type for integers in JSON schema inference. --- Key: SPARK-5758 URL: https://issues.apache.org/jira/browse/SPARK-5758 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 Per discussion in https://github.com/apache/spark/pull/4521, we will use LongType as the default data type for integer values in JSON schema inference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
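A small, hypothetical PySpark illustration of why the wider default helps (it assumes {{sc}} and a {{SQLContext}} named {{sqlContext}} from a pyspark shell): a JSON column whose first values fit in 32 bits can later hold values that do not, so inferring LongType up front avoids overflow.
{code}
# Hypothetical illustration; assumes `sc` and `sqlContext` are already defined.
records = ['{"id": 1}', '{"id": 5000000000}']   # the second value exceeds the 32-bit range
df = sqlContext.jsonRDD(sc.parallelize(records))
df.printSchema()   # with LongType as the default, `id` accommodates both rows
{code}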
[jira] [Updated] (SPARK-5789) Throw a better error message if JsonRDD.parseJson encounters unrecoverable parsing errors.
[ https://issues.apache.org/jira/browse/SPARK-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5789: - Assignee: Yin Huai Throw a better error message if JsonRDD.parseJson encounters unrecoverable parsing errors. -- Key: SPARK-5789 URL: https://issues.apache.org/jira/browse/SPARK-5789 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 For example {code} sqlContext.jsonRDD(sc.parallelize("a:1}" :: Nil)) {code} will throw {code} scala.MatchError: a (of class java.lang.String) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/02/12 15:08:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 26) in 10 ms on localhost (7/8) 15/02/12 15:08:55 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 4.0 (TID 33, localhost): scala.MatchError: a (of class java.lang.String) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5498) [SPARK-SQL]when the partition schema does not match table schema,it throws java.lang.ClassCastException and so on
[ https://issues.apache.org/jira/browse/SPARK-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5498: - Assignee: jeanlyn [SPARK-SQL]when the partition schema does not match table schema,it throws java.lang.ClassCastException and so on - Key: SPARK-5498 URL: https://issues.apache.org/jira/browse/SPARK-5498 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn Assignee: jeanlyn Fix For: 1.4.0 When the partition schema does not match the table schema, it will throw an exception while the task is running. For example, if we modify the type of a column from int to bigint with the sql *ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT*, and then query the partition data that was stored before the change, we get the exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 (TID 30, BJHC-HADOOP-HERA-16950.jeanlyn.local): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:322) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$10.next(Iterator.scala:312) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at
[jira] [Updated] (SPARK-5404) Statistic of Logical Plan is too aggresive
[ https://issues.apache.org/jira/browse/SPARK-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5404: - Assignee: Cheng Hao Statistic of Logical Plan is too aggresive -- Key: SPARK-5404 URL: https://issues.apache.org/jira/browse/SPARK-5404 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.4.0 The statistics of a logical plan are quite helpful when doing optimizations like join reordering; however, the default algorithm is too aggressive, which can easily lead to a totally wrong join order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5277: - Assignee: Max Seiden SparkSqlSerializer does not register user specified KryoRegistrators - Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Max Seiden Assignee: Max Seiden Fix For: 1.4.0 Although the SparkSqlSerializer class extends the KryoSerializer in core, its overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behaviors depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: the Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user-provided serializers / registrators, while serialization during exchange does not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-911: Assignee: Aaron Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Aaron Fix For: 1.4.0 If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example: I sort my dataset by time and then want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6779) Move shared params to param.shared and use code gen
[ https://issues.apache.org/jira/browse/SPARK-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6779: - Assignee: Xiangrui Meng Move shared params to param.shared and use code gen --- Key: SPARK-6779 URL: https://issues.apache.org/jira/browse/SPARK-6779 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng The boilerplate code should be automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6779) Move shared params to param.shared and use code gen
[ https://issues.apache.org/jira/browse/SPARK-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6779. -- Resolution: Fixed Fix Version/s: 1.4.0 Move shared params to param.shared and use code gen --- Key: SPARK-6779 URL: https://issues.apache.org/jira/browse/SPARK-6779 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 The boilerplate code should be automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5935) Accept MapType in the schema provided to a JSON dataset.
[ https://issues.apache.org/jira/browse/SPARK-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5935: - Assignee: Yin Huai Accept MapType in the schema provided to a JSON dataset. Key: SPARK-5935 URL: https://issues.apache.org/jira/browse/SPARK-5935 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5908) Hive udtf with single alias should be resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5908: - Assignee: Liang-Chi Hsieh Hive udtf with single alias should be resolved correctly Key: SPARK-5908 URL: https://issues.apache.org/jira/browse/SPARK-5908 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.4.0 ResolveUdtfsAlias in hiveUdfs only considers HiveGenericUdtf with multiple aliases. When only a single alias is used with HiveGenericUdtf, the alias does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type
[ https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5911: - Assignee: Yin Huai Make Column.cast(to: String) support fixed precision and scale decimal type --- Key: SPARK-5911 URL: https://issues.apache.org/jira/browse/SPARK-5911 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.1, 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7107) Add parameter for zookeeper.znode.parent to hbase_inputformat.py
[ https://issues.apache.org/jira/browse/SPARK-7107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7107: --- Component/s: Examples Add parameter for zookeeper.znode.parent to hbase_inputformat.py Key: SPARK-7107 URL: https://issues.apache.org/jira/browse/SPARK-7107 Project: Spark Issue Type: Bug Components: Examples, PySpark Reporter: Ted Yu Assignee: Ted Yu Priority: Minor [~yeshavora] first reported encountering the following exception running hbase_inputformat.py :
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.RuntimeException: java.lang.NullPointerException
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
  at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:313)
  at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:288)
  at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
{code}
It turned out that the HBase cluster has a custom znode parent:
{code}
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>
{code}
hbase_inputformat.py should support specifying a custom znode parent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command
[ https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5926: - Assignee: Yanbo Liang [SQL] DataFrame.explain() return false result for DDL command - Key: SPARK-5926 URL: https://issues.apache.org/jira/browse/SPARK-5926 Project: Spark Issue Type: Bug Components: SQL Reporter: Yanbo Liang Assignee: Yanbo Liang Fix For: 1.3.0 This bug is easy to reproduce: the following two queries should print out the same explain result, but they do not. sql("create table tb as select * from src where key < 490").explain(true) sql("explain extended create table tb as select * from src where key < 490") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5909) Add a clearCache command to Spark SQL's cache manager
[ https://issues.apache.org/jira/browse/SPARK-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5909: - Assignee: Yin Huai Add a clearCache command to Spark SQL's cache manager - Key: SPARK-5909 URL: https://issues.apache.org/jira/browse/SPARK-5909 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.4.0 This command will clear all cached data from the in-memory cache, which will be useful when users want to quickly clear the cache, or as a workaround for cases like SPARK-5881. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
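A short usage sketch of the proposed command (hedged; it assumes a {{SQLContext}} named {{sqlContext}} and a registered table named {{people}}, both placeholders):
{code}
sqlContext.cacheTable("people")                          # pin one table in the in-memory cache
sqlContext.sql("SELECT COUNT(*) FROM people").collect()  # populates the cache
sqlContext.clearCache()                                  # the command described here: drop all cached data at once
{code}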
[jira] [Updated] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5203: - Assignee: guowei union with different decimal type report error -- Key: SPARK-5203 URL: https://issues.apache.org/jira/browse/SPARK-5203 Project: Spark Issue Type: Bug Components: SQL Reporter: guowei Assignee: guowei Fix For: 1.4.0 A test case like this: {code:sql} create table test (a decimal(10,1)); select a from test union all select a*2 from test; {code} Exception thrown: {noformat} 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all select a*2 from test] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: 'Project [*] 'Subquery _u1 'Union Project [a#1] MetastoreRelation default, test, None Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), DecimalType())), DecimalType(21,1)) AS _c0#0] MetastoreRelation default, test, None at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) at 
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5068: - Assignee: dongxu When the path not found in the hdfs,we can't get the result --- Key: SPARK-5068 URL: https://issues.apache.org/jira/browse/SPARK-5068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn Assignee: dongxu Fix For: 1.4.0 When the partition path is found in the metastore but not found in HDFS, it will cause some problems, as follows: {noformat} hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) {noformat} {noformat} hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 {noformat} When I run the sql {noformat} select * from partition_test limit 10 {noformat} in *hive*, I got no problem, but when I run it in *spark-sql* I get the following error: {noformat} Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-7108) spark.local.dir is no longer honored in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7108: --- Summary: spark.local.dir is no longer honored in Standalone mode (was: Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting) spark.local.dir is no longer honored in Standalone mode --- Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
Sean Owen created SPARK-7145: Summary: commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago, the version of Lang 2.x that accidentally comes in via Hadoop can change with the Hadoop version, so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5852) Fail to convert a newly created empty metastore parquet table to a data source parquet table.
[ https://issues.apache.org/jira/browse/SPARK-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5852: - Assignee: Yin Huai Fail to convert a newly created empty metastore parquet table to a data source parquet table. - Key: SPARK-5852 URL: https://issues.apache.org/jira/browse/SPARK-5852 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 To reproduce the exception, try
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("create table test stored as parquet as select * from jt")
{code}
ParquetConversions tries to convert the write path to the data source API write path. But, the following exception was thrown. {code} java.lang.UnsupportedOperationException: empty.reduceLeft at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167) at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:47) at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68) at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195) at scala.collection.AbstractTraversable.reduce(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetRelation2$.readSchema(newParquet.scala:633) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:349) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:290) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:354) at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToParquetRelation(HiveMetastoreCatalog.scala:218) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:440) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:439) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:47) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:439) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:416) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:917) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:917) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:918) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:919) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:919) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:924) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:922) at
[jira] [Updated] (SPARK-5862) Only transformUp the given plan once in HiveMetastoreCatalog
[ https://issues.apache.org/jira/browse/SPARK-5862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5862: - Assignee: Liang-Chi Hsieh Only transformUp the given plan once in HiveMetastoreCatalog Key: SPARK-5862 URL: https://issues.apache.org/jira/browse/SPARK-5862 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 Currently, ParquetConversions in HiveMetastoreCatalog will transformUp the given plan multiple times if there are many Metastore Parquet tables. Since the transformUp operation is recursive, it would be better to perform it only once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7000) spark.ml Prediction abstractions should reside in ml.prediction subpackage
[ https://issues.apache.org/jira/browse/SPARK-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7000. -- Resolution: Not A Problem spark.ml Prediction abstractions should reside in ml.prediction subpackage -- Key: SPARK-7000 URL: https://issues.apache.org/jira/browse/SPARK-7000 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.1 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor spark.ml prediction abstractions are currently not gathered; they are in both ml.impl and ml.tree. Instead, they should be gathered into ml.prediction. This will become more important as more abstractions, such as ensembles, are added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7000) spark.ml Prediction abstractions should reside in ml.prediction subpackage
[ https://issues.apache.org/jira/browse/SPARK-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512828#comment-14512828 ] Joseph K. Bradley commented on SPARK-7000: -- Decision after conferring with [~mengxr]: The structure will be: * ml.Predictor (Predictor will be made public and moved to the ml package in [SPARK-5995]) * ml.trees.* (Tree abstractions shared by ml.classification and ml.regression) * ml.ensembles.* (Generic ensemble abstractions shared by ml.classification and ml.regression) Since we can't think of more shared abstractions currently, we'll aim for a flatter directory structure. Closing this JIRA spark.ml Prediction abstractions should reside in ml.prediction subpackage -- Key: SPARK-7000 URL: https://issues.apache.org/jira/browse/SPARK-7000 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.1 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor spark.ml prediction abstractions are currently not gathered; they are in both ml.impl and ml.tree. Instead, they should be gathered into ml.prediction. This will become more important as more abstractions, such as ensembles, are added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6989) Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various arcane compiler errors
[ https://issues.apache.org/jira/browse/SPARK-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6989. -- Resolution: Cannot Reproduce Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various arcane compiler errors -- Key: SPARK-6989 URL: https://issues.apache.org/jira/browse/SPARK-6989 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0 Environment: Java 1.8.0_40 on Ubuntu 14.04.1 Reporter: Michael Allman Assignee: Prashant Sharma Attachments: spark_repl_2.11_errors.txt When starting the Spark 1.3 spark-shell compiled for Scala 2.11, I get a random assortment of compiler errors. I will attach a transcript. One thing I've noticed is that they seem to be less frequent when I increase the driver heap size to 5 GB or so. By comparison, the Spark 1.1 spark-shell on Scala 2.10 has been rock solid with a 512 MB heap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7124) Add functions to check for file and directory existence
[ https://issues.apache.org/jira/browse/SPARK-7124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7124. -- Resolution: Not A Problem Add functions to check for file and directory existence --- Key: SPARK-7124 URL: https://issues.apache.org/jira/browse/SPARK-7124 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Sam Steingold How do I check that a file or directory exists? For a file, I was told to do {{sc.textFile().first()}}, which seems wrong: # it initiates unnecessary i/o which could be huge (what if the file is binary and has no newlines?) # it fails for 0-length files (e.g., we write 0-length {{_SUCCESS}} files in directories after they have been successfully written) It appears that Spark needs bona fide {{isFile}} and {{isDirectory}} methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
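Until such helpers exist, a common workaround is to go through the Hadoop FileSystem API instead of reading any data. A hedged PySpark sketch via the Py4J gateway (the path is a placeholder, and {{sc._jvm}} / {{sc._jsc}} are internal handles):
{code}
# Hedged workaround sketch; assumes `sc` is an existing SparkContext.
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
path = sc._jvm.org.apache.hadoop.fs.Path("hdfs:///data/output/_SUCCESS")
print(fs.exists(path))       # works for files and directories
print(fs.isFile(path))       # no data is read, so 0-length files are fine
print(fs.isDirectory(path))
{code}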
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile("s3://bucket/prefix")}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize("abcd")}} {{rdd.saveAsTextFile("s3://bucket/prefix")}} This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile("s3://bucket/prefix")}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) was: Using {{saveAsTextFile("s3://bucket/prefix")}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize("abcd")}} {{rdd.saveAsTextFile("s3://bucket/prefix")}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile("s3://bucket/prefix")}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile("s3://bucket/prefix")}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize("abcd")}} {{rdd.saveAsTextFile("s3://bucket/prefix")}} This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile("s3://bucket/prefix")}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-6752: I had to revert this because it caused test failures with the Hadoop 1.0 build. To reproduce them use: build/sbt -Dhadoop.version=1.0.4 -Pkinesis-asl -Phive -Phive-thriftserver -Phive-0.12.0 streaming/test:compile The errors are: {code} [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1740: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1746: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1752: error: cannot find symbol [error] Assert.assertTrue(old context not recovered, newContextCreated.isFalse()); [error] ^ [error] symbol: method isFalse() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1768: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1773: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1778: error: cannot find symbol [error] Assert.assertTrue(old context not recovered, newContextCreated.isFalse()); [error] ^ [error] symbol: method isFalse() [error] location: variable newContextCreated of type MutableBoolean [error] 6 errors [error] (streaming/test:compile) javac returned nonzero exit code [error] Total time: 94 s, completed Apr 25, 2015 10:30:20 AM pwendell @ admins-mbp : ~/Documents/spark (detached HEAD|REBASE 9/11) {code} Allow StreamingContext to be recreated from checkpoint and existing SparkContext Key: SPARK-6752 URL: https://issues.apache.org/jira/browse/SPARK-6752 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.1, 1.2.1, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 Currently if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevent StreamingContext to be recreated from checkpoints in managed environments where SparkContext is precreated. Proposed solution: Introduce the following methods on StreamingContext 1. {{new StreamingContext(checkpointDirectory, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext 2. 
{{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext and Hadoop conf to read the checkpoint 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext => StreamingContext)}} - If the checkpoint file exists, then recreate the StreamingContext using the provided SparkContext (that is, call 1.), else create the StreamingContext using the provided createFunction. Also, the corresponding Java and Python APIs have to be added as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
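A sketch of how option 3 might look from user code, assuming the proposed signature lands as described above; the checkpoint directory, batch interval, and the pre-created {{sc}} are placeholders, not values from this issue:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"  // placeholder

// If checkpoint data exists, recover the StreamingContext with the pre-created
// SparkContext; otherwise build a fresh one with the supplied function.
val ssc = StreamingContext.getOrCreate(
  checkpointDir,
  sc,  // pre-created SparkContext from the managed environment
  (sc: SparkContext) => {
    val newSsc = new StreamingContext(sc, Seconds(1))
    // ... set up DStreams here ...
    newSsc.checkpoint(checkpointDir)
    newSsc
  })
{code}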
[jira] [Commented] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
[ https://issues.apache.org/jira/browse/SPARK-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512719#comment-14512719 ] Apache Spark commented on SPARK-7145: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/5703 commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency -- Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago the version of Lang 2.x that accidentally comes in via Hadoop can change with Hadoop version and so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
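The change is mostly mechanical, since the 2.x and 3.x classes typically differ only in package name. {{StringUtils.isBlank}} below is just an illustrative call, not a specific call site named by this issue:
{code}
// Accidental usage: Commons Lang 2.x, pulled in transitively via Hadoop.
val blankOld = org.apache.commons.lang.StringUtils.isBlank("  ")

// Intended usage: Commons Lang3, the dependency Spark actually declares.
val blankNew = org.apache.commons.lang3.StringUtils.isBlank("  ")
{code}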
[jira] [Assigned] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
[ https://issues.apache.org/jira/browse/SPARK-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7145: --- Assignee: Apache Spark (was: Sean Owen) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency -- Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Apache Spark Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago the version of Lang 2.x that accidentally comes in via Hadoop can change with Hadoop version and so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
[ https://issues.apache.org/jira/browse/SPARK-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7145: --- Assignee: Sean Owen (was: Apache Spark) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency -- Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago the version of Lang 2.x that accidentally comes in via Hadoop can change with Hadoop version and so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512620#comment-14512620 ] Sean Owen commented on SPARK-6752: -- It's an issue with the version of Commons Lang (2.x). Just use {{AtomicBoolean}} here and avoid the library altogether Allow StreamingContext to be recreated from checkpoint and existing SparkContext Key: SPARK-6752 URL: https://issues.apache.org/jira/browse/SPARK-6752 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.1, 1.2.1, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 Currently if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevent StreamingContext to be recreated from checkpoints in managed environments where SparkContext is precreated. Proposed solution: Introduce the following methods on StreamingContext 1. {{new StreamingContext(checkpointDirectory, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext 2. {{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext and hadoop conf to read the checkpoint 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext = StreamingContext)}} - If checkpoint file exists, then recreate StreamingContext using the provided SparkContext (that is, call 1.), else create StreamingContext using the provided createFunction Also, the corresponding Java and Python API has to be added as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
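The suggested replacement is small; a Scala sketch of the JDK equivalent (the affected test is Java, but the same class applies there):
{code}
import java.util.concurrent.atomic.AtomicBoolean

// AtomicBoolean ships with the JDK, so the test no longer depends on whichever
// Commons Lang version happens to be on the classpath.
val newContextCreated = new AtomicBoolean(false)
assert(!newContextCreated.get())  // instead of newContextCreated.isFalse()

newContextCreated.set(true)
assert(newContextCreated.get())   // instead of newContextCreated.isTrue()
{code}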
[jira] [Commented] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512765#comment-14512765 ] Joseph K. Bradley commented on SPARK-7143: -- Do you have some references to recent papers and current use cases in industry, especially ones showing BM25 is much better than TF-IDF? It will be good to figure out whether it is clearly better than TF-IDF, or if it is best in specialized cases (and would then be better as a Spark package). Also, can you please comment on which variant you're implementing? The Wikipedia page makes it sound like some corrections are necessary for the basic BM25 in order to make it more practical. Thanks! Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallel. This issue is proposed to add it into Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
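For context on the variants question, the plain Okapi BM25 scoring function (as given on the referenced Wikipedia page) for a query Q = q1..qn and a document D is:
{code}
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\,\bigl(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\bigr)}
{code}
Here f(q_i, D) is the term frequency of q_i in D, |D| the document length, avgdl the average document length in the corpus, and k_1 (typically 1.2 to 2.0) and b (typically 0.75) are free parameters; variants such as BM25+ adjust this base form, which is presumably what the "corrections" above refer to.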
[jira] [Updated] (SPARK-5817) UDTF column names didn't set properly
[ https://issues.apache.org/jira/browse/SPARK-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5817: - Assignee: Cheng Hao UDTF column names didn't set properly -- Key: SPARK-5817 URL: https://issues.apache.org/jira/browse/SPARK-5817 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.4.0 {code} createQueryTest(Specify the udtf output, select d from (select explode(array(1,1)) d from src limit 1) t) {code} It throws exception like: {panel} org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5824) CTAS should set null format in hive-0.13.1
[ https://issues.apache.org/jira/browse/SPARK-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5824: - Assignee: Adrian Wang CTAS should set null format in hive-0.13.1 -- Key: SPARK-5824 URL: https://issues.apache.org/jira/browse/SPARK-5824 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5833) Adds REFRESH TABLE command to refresh external data sources tables
[ https://issues.apache.org/jira/browse/SPARK-5833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5833: - Assignee: Cheng Lian Adds REFRESH TABLE command to refresh external data sources tables -- Key: SPARK-5833 URL: https://issues.apache.org/jira/browse/SPARK-5833 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 This command can be used to refresh (possibly cached) metadata stored in external data source tables. For example, for JSON tables, it forces schema inference; for Parquet tables, it forces schema merging and partition discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
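Usage is a single SQL statement against an existing external data source table; the table name {{logs}} below is just an example, and an existing {{sqlContext}} is assumed:
{code}
// After new files have been written underneath the table's location,
// refresh the (possibly cached) metadata before querying again.
sqlContext.sql("REFRESH TABLE logs")
{code}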
[jira] [Updated] (SPARK-5840) HiveContext cannot be serialized due to tuple extraction
[ https://issues.apache.org/jira/browse/SPARK-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5840: - Assignee: Reynold Xin HiveContext cannot be serialized due to tuple extraction Key: SPARK-5840 URL: https://issues.apache.org/jira/browse/SPARK-5840 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 See the following mailing list question: http://apache-spark-developers-list.1001551.n3.nabble.com/ The use of tuple extraction for (hiveconf, sessionState) creates a non-transient tuple field. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
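A schematic Scala sketch of the underlying problem; the class and member names are illustrative stand-ins, not the actual HiveContext code:
{code}
class SessionStateExample {
  // One pattern-matching val producing both values at once ...
  val (hiveconf, sessionState) = {
    val conf  = new Object  // stand-in for the real HiveConf
    val state = new Object  // stand-in for the real SessionState
    (conf, state)
  }
  // ... is compiled into a hidden field holding the whole Tuple2 plus
  // accessors for hiveconf and sessionState. That hidden tuple field is
  // not transient, so serializing the enclosing object drags both values
  // (and anything they reference) along with it.
}
{code}
The straightforward fix is to define the two values separately so each one can be managed, and marked {{@transient}}, on its own.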
[jira] [Updated] (SPARK-5794) add jar should return 0
[ https://issues.apache.org/jira/browse/SPARK-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5794: - Assignee: Adrian Wang add jar should return 0 --- Key: SPARK-5794 URL: https://issues.apache.org/jira/browse/SPARK-5794 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5780: - Assignee: Davies Liu The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.3.0 There is a lot of logging coming from the driver and worker; it is noisy and alarming, and full of exceptions, so people are confused about whether the tests are failing or not. The logging should be muted during tests and only shown if a test fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5626) Spurious test failures due to NullPointerException in EasyMock test code
[ https://issues.apache.org/jira/browse/SPARK-5626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5626: - Assignee: Josh Rosen Spurious test failures due to NullPointerException in EasyMock test code Key: SPARK-5626 URL: https://issues.apache.org/jira/browse/SPARK-5626 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: flaky-test Attachments: consoleText.txt I've seen a few cases where a test failure will trigger a cascade of spurious failures when instantiating test suites that use EasyMock. Here's a sample symptom: {code} [info] CacheManagerSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.CacheManagerSuite *** ABORTED *** (137 milliseconds) [info] java.lang.NullPointerException: [info] at org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52) [info] at org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90) [info] at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) [info] at org.objenesis.ObjenesisHelper.newInstance(ObjenesisHelper.java:43) [info] at org.easymock.internal.ObjenesisClassInstantiator.newInstance(ObjenesisClassInstantiator.java:26) [info] at org.easymock.internal.ClassProxyFactory.createProxy(ClassProxyFactory.java:219) [info] at org.easymock.internal.MocksControl.createMock(MocksControl.java:59) [info] at org.easymock.EasyMock.createMock(EasyMock.java:103) [info] at org.scalatest.mock.EasyMockSugar$class.mock(EasyMockSugar.scala:267) [info] at org.apache.spark.CacheManagerSuite.mock(CacheManagerSuite.scala:28) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply$mcV$sp(CacheManagerSuite.scala:40) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195) [info] at org.apache.spark.CacheManagerSuite.runTest(CacheManagerSuite.scala:28) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) [info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.apache.spark.CacheManagerSuite.org$scalatest$BeforeAndAfter$$super$run(CacheManagerSuite.scala:28) [info] at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) [info] at org.apache.spark.CacheManagerSuite.run(CacheManagerSuite.scala:28) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:294) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:284) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} This is from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26852/consoleFull. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Resolved] (SPARK-7092) Update spark scala version to 2.11.6
[ https://issues.apache.org/jira/browse/SPARK-7092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7092. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5662 [https://github.com/apache/spark/pull/5662] Update spark scala version to 2.11.6 Key: SPARK-7092 URL: https://issues.apache.org/jira/browse/SPARK-7092 Project: Spark Issue Type: Improvement Components: Spark Core, Spark Shell Affects Versions: 1.4.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512794#comment-14512794 ] Eric O. LEBIGOT (EOL) commented on SPARK-7141: -- Ah, thanks: it's good to know where the issue comes from. As far as I understand, it does not look like it's going to be fixed any soon (minor issue, and Amazon deprecating s3://…). I will try s3n:// instead and see if this works and changes anything. saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}} This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512843#comment-14512843 ] Debasish Das commented on SPARK-5992: - Has anyone compared the Algebird LSH with the Spark minhash linked above? Unless Algebird is slow (which I found to be the case for the TopK monoid), should we use it the same way HLL is being used in Spark Streaming? Is it OK to add Algebird to MLlib? Locality Sensitive Hashing (LSH) for MLlib -- Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7146) Should ML sharedParams be a public API?
Joseph K. Bradley created SPARK-7146: Summary: Should ML sharedParams be a public API? Key: SPARK-7146 URL: https://issues.apache.org/jira/browse/SPARK-7146 Project: Spark Issue Type: Brainstorming Components: ML Reporter: Joseph K. Bradley Discussion: Should the Param traits in sharedParams.scala be private? Pros: * Users have to be careful since parameters can have different meanings for different algorithms. Cons: * Sharing the Param traits helps to encourage standardized Param names and documentation. * If the shared Params are public, then implementations could test for the traits. We probably do not want users to do that. Currently, the shared params are public but marked as DeveloperApi. Proposal: Either (a) make the shared params private to encourage users to write specialized documentation and value checks for parameters, or (b) design a better way to encourage overriding documentation and parameter value checks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
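For readers outside the ML internals, a purely schematic sketch of the pattern under discussion; the {{Param}}/{{Params}} definitions here are simplified stand-ins, not the actual {{org.apache.spark.ml.param}} classes:
{code}
// Minimal stand-ins for the real Param machinery.
class Param[T](val name: String, val doc: String)
trait Params

// A shared trait fixes the param name and documentation once, so every
// algorithm that mixes it in exposes "inputCol" consistently ...
trait HasInputCol extends Params {
  val inputCol: Param[String] = new Param[String]("inputCol", "input column name")
}

// ... which is also why callers could test for the trait if it stays public:
// model.isInstanceOf[HasInputCol]
{code}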
[jira] [Created] (SPARK-7147) Enforce Params.validate by making it abstract
Joseph K. Bradley created SPARK-7147: Summary: Enforce Params.validate by making it abstract Key: SPARK-7147 URL: https://issues.apache.org/jira/browse/SPARK-7147 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley We should make Params.validate abstract to force developers to implement it. We have been ignoring it so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
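The change is small in code terms; a schematic sketch (the trait names are illustrative, not Spark's actual code):
{code}
// Today: a no-op default, easy for implementations to forget about.
trait ParamsWithDefault {
  def validate(): Unit = {}
}

// Proposed: no default body, so every concrete implementation is forced
// to state explicitly how its parameter values are checked.
trait ParamsAbstract {
  def validate(): Unit
}
{code}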
[jira] [Created] (SPARK-7148) Configure Parquet block size (row group size) for ML model import/export
Joseph K. Bradley created SPARK-7148: Summary: Configure Parquet block size (row group size) for ML model import/export Key: SPARK-7148 URL: https://issues.apache.org/jira/browse/SPARK-7148 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.3.1, 1.3.0, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor It would be nice if we could configure the Parquet buffer size when using Parquet format for ML model import/export. Currently, for some models (trees and ensembles), the schema has 13+ columns. With a default buffer size of 128MB (I think), that puts the allocated buffer way over the default memory made available by run-example. Because of this problem, users have to use spark-submit and explicitly use a larger amount of memory in order to run some ML examples. Is there a simple way to specify {{parquet.block.size}}? I'm not familiar with this part of SparkSQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
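One way this is commonly attempted is through the Hadoop configuration carried by the SparkContext ({{sc}} assumed below); whether the SQL Parquet writer honors it on this code path is exactly the open question here, so treat this as a sketch rather than a confirmed answer:
{code}
// Ask Parquet for 64 MB row groups instead of the ~128 MB default.
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
{code}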
[jira] [Commented] (SPARK-5023) In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages)
[ https://issues.apache.org/jira/browse/SPARK-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512869#comment-14512869 ] Eran Medan commented on SPARK-5023: --- I don't think this is a duplicate: the information shows correctly in the live view; the incorrect numbers are in the history / event view. I had no lost partitions and no failures, but still, something that took seconds or minutes in the live view shows as milliseconds or seconds in the history view. I'll debug it and see if I can figure out the root cause. No error messages. In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages) Key: SPARK-5023 URL: https://issues.apache.org/jira/browse/SPARK-5023 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.1.1, 1.2.0 Environment: Amazon EC2 AMI r3.2xlarge, cluster of 20 to 50 nodes, running the ec2 provided scripts to create. Reporter: Eran Medan I'm running a long process using Spark + Graph and things look good on the 4040 job status UI, but when the job is done and I go to the history, the total job duration is much, much smaller than the total of its stages. The way I set logs up is this: val homeDir = sys.props("user.home") val logsPath = new File(homeDir, "sparkEventLogs") val conf = new SparkConf().setAppName(...) conf.set("spark.eventLog.enabled", "true").set("spark.eventLog.dir", logsPath.getCanonicalPath) For example, job ID X - duration 0.2 s, but when I click the job and look at its stages, the sum of their durations is more than 15 minutes! (Before the job was over, in the 4040 job status, the job duration was correct; it is only incorrect when it's done and I go to the logs.) I hope I didn't misconfigure something, because I was very surprised no one reported it yet (I searched, but perhaps I missed it). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
Eric O. LEBIGOT (EOL) created SPARK-7141: Summary: saveAsTextFile() on S3 first creates empty prefix Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved (and read from, maybe) in the intended location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). was: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved (and read from, maybe) in the intended location. saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many `block_` files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) was: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many `block_` files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) was: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many `block_` files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7108) Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512280#comment-14512280 ] Patrick Wendell edited comment on SPARK-7108 at 4/25/15 6:01 AM: - Hey I think [~joshrosen] actually worded this a bit confusingly. The issue is that even if SPARK_LOCAL_DIRS is not set at all by the user, the setting of spark.local.dir is not used from the application. This regresses from earlier versions of spark which (as the documentation implies) would respect spark.local.dir if set. was (Author: pwendell): Hey I think [~joshrosen] actually miswrote this. The issue is that even if SPARK_LOCAL_DIRS is not set at all, the setting of spark.local.dir is not used from the driver. Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting - Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7108) Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512280#comment-14512280 ] Patrick Wendell edited comment on SPARK-7108 at 4/25/15 6:02 AM: - Hey I think [~joshrosen] actually worded this a bit confusingly. The issue is that even if SPARK_LOCAL_DIRS is not set at all by the user, the setting of spark.local.dir is not used from the application. This regresses from earlier versions of spark which (as the documentation implies) would respect spark.local.dir if set. This isn't great because Spark will silently start using a different local directory when upgraded. In our case it caused us to run out of disk space because /tmp was used instead of a directory we'd explicitly set. was (Author: pwendell): Hey I think [~joshrosen] actually worded this a bit confusingly. The issue is that even if SPARK_LOCAL_DIRS is not set at all by the user, the setting of spark.local.dir is not used from the application. This regresses from earlier versions of spark which (as the documentation implies) would respect spark.local.dir if set. Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting - Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7118) Add coalesce Spark SQL function to PySpark API
[ https://issues.apache.org/jira/browse/SPARK-7118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512320#comment-14512320 ] Apache Spark commented on SPARK-7118: - User 'ogirardot' has created a pull request for this issue: https://github.com/apache/spark/pull/5698 Add coalesce Spark SQL function to PySpark API -- Key: SPARK-7118 URL: https://issues.apache.org/jira/browse/SPARK-7118 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.3.1 Reporter: Olivier Girardot Priority: Minor The *org.apache.spark.sql.functions.coalesce* function is not available from the PySpark SQL API. Let's add it. Olivier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
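For reference, the Scala side already exists and is what the PySpark wrapper would delegate to; an existing DataFrame {{df}} and the column names below are assumptions for illustration:
{code}
import org.apache.spark.sql.functions.coalesce

// Returns, per row, the first non-null value among the given columns.
val withFallback = df.select(coalesce(df("primary_email"), df("backup_email")))
{code}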
[jira] [Assigned] (SPARK-5891) Add Binarizer
[ https://issues.apache.org/jira/browse/SPARK-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5891: --- Assignee: Apache Spark Add Binarizer - Key: SPARK-5891 URL: https://issues.apache.org/jira/browse/SPARK-5891 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark `Binarizer` takes a column of continuous features and outputs a column with binary features, where nonzeros (or values below a threshold) become 1 in the output. {code} val binarizer = new Binarizer() .setInputCol("numVisits") .setOutputCol("visited") {code} The output column should be marked as binary. We need to discuss whether we should process multiple columns or a vector column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6333) saveAsObjectFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6333: --- Assignee: (was: Apache Spark) saveAsObjectFile support for compression codec -- Key: SPARK-6333 URL: https://issues.apache.org/jira/browse/SPARK-6333 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Deenar Toraskar Priority: Minor saveAsObjectFile current does not support a compression codec. This story is about adding saveAsObjectFile (path, codec) support into spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5891) Add Binarizer
[ https://issues.apache.org/jira/browse/SPARK-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512361#comment-14512361 ] Apache Spark commented on SPARK-5891: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5699 Add Binarizer - Key: SPARK-5891 URL: https://issues.apache.org/jira/browse/SPARK-5891 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `Binarizer` takes a column of continuous features and outputs a column with binary features, where nonzeros (or values below a threshold) become 1 in the output. {code} val binarizer = new Binarizer() .setInputCol("numVisits") .setOutputCol("visited") {code} The output column should be marked as binary. We need to discuss whether we should process multiple columns or a vector column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5891) Add Binarizer
[ https://issues.apache.org/jira/browse/SPARK-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5891: --- Assignee: (was: Apache Spark) Add Binarizer - Key: SPARK-5891 URL: https://issues.apache.org/jira/browse/SPARK-5891 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `Binarizer` takes a column of continuous features and outputs a column with binary features, where nonzeros (or values below a threshold) become 1 in the output. {code} val binarizer = new Binarizer() .setInputCol("numVisits") .setOutputCol("visited") {code} The output column should be marked as binary. We need to discuss whether we should process multiple columns or a vector column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-2750: Attachment: exception on yarn when https enabled.txt Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Tao Wang Labels: https, ssl, webui Fix For: 1.0.3 Attachments: exception on yarn when https enabled.txt Original Estimate: 96h Remaining Estimate: 96h Now I try to add https support for web ui using Jetty ssl integration.Below is the plan: 1.Web UI include Master UI, Worker UI, HistoryServer UI and Spark Ui. User can switch between https and http by configure spark.http.policy in JVM property for each process, while choose http by default. 2.Web port of Master and worker would be decided in order of launch arguments, JVM property, System Env and default port. 3.Below is some other configuration items: spark.ssl.server.keystore.location The file or URL of the SSL Key store spark.ssl.server.keystore.password The password for the key store spark.ssl.server.keystore.keypassword The password (if any) for the specific key within the key store spark.ssl.server.keystore.type The type of the key store (default JKS) spark.client.https.need-auth True if SSL needs client authentication spark.ssl.server.truststore.location The file name or URL of the trust store location spark.ssl.server.truststore.password The password for the trust store spark.ssl.server.truststore.type The type of the trust store (default JKS) Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
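If the proposal is adopted as written, enabling HTTPS would presumably amount to JVM properties along these lines for each affected process; the values are placeholders built from the property names in the description, not a tested configuration:
{code}
-Dspark.http.policy=https
-Dspark.ssl.server.keystore.location=/path/to/keystore.jks
-Dspark.ssl.server.keystore.password=********
-Dspark.ssl.server.keystore.type=JKS
-Dspark.ssl.server.truststore.location=/path/to/truststore.jks
-Dspark.ssl.server.truststore.password=********
{code}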
[jira] [Assigned] (SPARK-6333) saveAsObjectFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6333: --- Assignee: Apache Spark saveAsObjectFile support for compression codec -- Key: SPARK-6333 URL: https://issues.apache.org/jira/browse/SPARK-6333 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Deenar Toraskar Assignee: Apache Spark Priority: Minor saveAsObjectFile current does not support a compression codec. This story is about adding saveAsObjectFile (path, codec) support into spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
Yash Datta created SPARK-7142: - Summary: Minor enhancement to BooleanSimplification Optimizer rule Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
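A self-contained Scala sketch of the three rewrites over a toy expression type; this is illustrative only, since the real rule lives in Catalyst's BooleanSimplification and operates on Catalyst expressions:
{code}
sealed trait Expr
case class Var(name: String) extends Expr
case class Not(e: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

def simplify(e: Expr): Expr = e match {
  case And(a, Or(Not(b), c)) if a == b => And(simplify(a), simplify(c)) // A and (not(A) or B)  =>  A and B
  case Not(And(a, b)) => Or(simplify(Not(a)), simplify(Not(b)))         // not(A and B)  =>  not(A) or not(B)
  case Not(Or(a, b))  => And(simplify(Not(a)), simplify(Not(b)))        // not(A or B)   =>  not(A) and not(B)
  case Not(x)         => Not(simplify(x))
  case And(a, b)      => And(simplify(a), simplify(b))
  case Or(a, b)       => Or(simplify(a), simplify(b))
  case leaf           => leaf
}

// simplify(And(Var("A"), Or(Not(Var("A")), Var("B"))))  ==  And(Var("A"), Var("B"))
{code}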
[jira] [Assigned] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7142: --- Assignee: Apache Spark Minor enhancement to BooleanSimplification Optimizer rule - Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Assignee: Apache Spark Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7142: --- Assignee: (was: Apache Spark) Minor enhancement to BooleanSimplification Optimizer rule - Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512487#comment-14512487 ] Apache Spark commented on SPARK-7142: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/5700 Minor enhancement to BooleanSimplification Optimizer rule - Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512573#comment-14512573 ] Liang-Chi Hsieh commented on SPARK-7141: The double slash issue is caused by the Jets3tFileSystemStore implementation in Hadoop. You can refer to [HADOOP-11444|https://issues.apache.org/jira/browse/HADOOP-11444] and [the discussion on spark-user|https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAE50=drwWG=eMDM=lsuf-puzopxfnj-+7k3vx_m5mmjfal2...@mail.gmail.com%3E]. saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512575#comment-14512575 ] Harsh Gupta commented on SPARK-6980: [~imranr] [~bryanc] Hi. I tried a simple producer/consumer actors example, setting setTimeOut very low, and was able to see the exception. I am not clear on how the util methods in SparkConf would get a NamedDuration, although the wrapper approach sounds fine. Will do some more tweaks and post here (although I won't be very active this week since I need to get my primary laptop fixed). Akka timeout exceptions indicate which conf controls them - Key: SPARK-6980 URL: https://issues.apache.org/jira/browse/SPARK-6980 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Imran Rashid Assignee: Harsh Gupta Priority: Minor Labels: starter Attachments: Spark-6980-Test.scala If you hit one of the akka timeouts, you just get an exception like {code} java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] {code} The exception doesn't indicate how to change the timeout, though there is usually (always?) a corresponding setting in {{SparkConf}}. It would be nice if the exception included the relevant setting. I think this should be pretty easy to do -- we just need to create something like a {{NamedTimeout}}. It would have its own {{await}} method that catches the akka timeout and throws its own exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
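A minimal sketch of the wrapper idea being discussed; the names, the conf key in the usage comment, and the exact exception type are assumptions, not the eventual Spark implementation:
{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.FiniteDuration

// Carry the name of the controlling setting together with the duration, so the
// timeout error can point the user at the right configuration key.
case class NamedTimeout(confKey: String, duration: FiniteDuration) {
  def awaitResult[T](awaitable: Awaitable[T]): T =
    try {
      Await.result(awaitable, duration)
    } catch {
      case _: TimeoutException =>
        throw new TimeoutException(
          s"Futures timed out after [$duration]; this timeout is controlled by $confKey")
    }
}

// Usage sketch:
//   val timeout = NamedTimeout("spark.akka.askTimeout", 30.seconds)
//   timeout.awaitResult(someFuture)
{code}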
[jira] [Created] (SPARK-7143) Add BM25 Estimator
Liang-Chi Hsieh created SPARK-7143: -- Summary: Add BM25 Estimator Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
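For reference, the per-term BM25 contribution with the common defaults k1 = 1.2 and b = 0.75 looks roughly like the sketch below; this is a standalone illustration, and the object, function, and parameter names are assumptions rather than the proposed Estimator API. The document score is the sum of this quantity over the query terms.
{code}
import scala.math.log

object BM25Sketch {
  // Okapi BM25 contribution of a single query term to a document's score.
  // tf: term frequency in the document, docLen: document length in tokens,
  // avgDocLen: average document length in the corpus,
  // docCount: total number of documents, docFreq: documents containing the term.
  def termScore(tf: Double, docLen: Double, avgDocLen: Double,
                docCount: Long, docFreq: Long,
                k1: Double = 1.2, b: Double = 0.75): Double = {
    val idf = log((docCount - docFreq + 0.5) / (docFreq + 0.5) + 1.0)
    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * docLen / avgDocLen))
  }
}
{code}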
[jira] [Assigned] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7143: --- Assignee: Apache Spark Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh Assignee: Apache Spark [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7143: --- Assignee: (was: Apache Spark) Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512578#comment-14512578 ] Apache Spark commented on SPARK-7143: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5701 Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7144) SPARK-6784
Yin Huai created SPARK-7144: --- Summary: SPARK-6784 Key: SPARK-7144 URL: https://issues.apache.org/jira/browse/SPARK-7144 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7144) SPARK-6784
[ https://issues.apache.org/jira/browse/SPARK-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7144. - Resolution: Invalid oops... SPARK-6784 -- Key: SPARK-7144 URL: https://issues.apache.org/jira/browse/SPARK-7144 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5456) Decimal Type comparison issue
[ https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5456: Priority: Blocker (was: Major) Target Version/s: 1.4.0 Decimal Type comparison issue - Key: SPARK-5456 URL: https://issues.apache.org/jira/browse/SPARK-5456 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Kuldeep Priority: Blocker Not quite able to figure this out, but here is a JUnit test to reproduce it, in JavaAPISuite.java {code:title=DecimalBug.java} @Test public void decimalQueryTest() { List<Row> decimalTable = new ArrayList<Row>(); decimalTable.add(RowFactory.create(new BigDecimal(1), new BigDecimal(2))); decimalTable.add(RowFactory.create(new BigDecimal(3), new BigDecimal(4))); JavaRDD<Row> rows = sc.parallelize(decimalTable); List<StructField> fields = new ArrayList<StructField>(7); fields.add(DataTypes.createStructField("a", DataTypes.createDecimalType(), true)); fields.add(DataTypes.createStructField("b", DataTypes.createDecimalType(), true)); sqlContext.applySchema(rows.rdd(), DataTypes.createStructType(fields)).registerTempTable("foo"); Assert.assertEquals(sqlContext.sql("select * from foo where a > 0").collectAsList(), decimalTable); } {code} Fails with java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org