[jira] [Resolved] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6113. -- Resolution: Fixed Issue resolved by pull request 5626 [https://github.com/apache/spark/pull/5626] Stabilize DecisionTree and ensembles APIs - Key: SPARK-6113 URL: https://issues.apache.org/jira/browse/SPARK-6113 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical Fix For: 1.4.0 *Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design. *Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details. [Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7108) spark.local.dir is no longer honored in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512649#comment-14512649 ] Marcelo Vanzin commented on SPARK-7108: --- The way I read the documentation, {{spark.local.dir}} should only ever work on the driver, never on executors, since as the documentation says, that is managed by the cluster manager (regardless of whether you set SPARK_LOCAL_DIRS for your app or not - as the doc says, the cluster manager sets that!). If {{spark.local.dir}} is not working for the driver, then there's a potential bug here. spark.local.dir is no longer honored in Standalone mode --- Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
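To make the behaviour under discussion concrete, here is a minimal, hypothetical PySpark sketch (the path is made up for illustration): the application sets {{spark.local.dir}}, which at best affects the driver's scratch space, while in Standalone mode each executor uses the {{SPARK_LOCAL_DIRS}} value set by its worker, not the application configuration.
{code}
from pyspark import SparkConf, SparkContext

# spark.local.dir set here only influences the driver; standalone executors
# inherit SPARK_LOCAL_DIRS from the worker that launched them.
conf = (SparkConf()
        .setAppName("local-dir-example")
        .set("spark.local.dir", "/mnt/driver-scratch"))  # hypothetical path
sc = SparkContext(conf=conf)
print(conf.get("spark.local.dir"))  # what the driver was asked to use
{code}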
[jira] [Updated] (SPARK-7111) Add a tracker to track the direct (receiver-less) streams
[ https://issues.apache.org/jira/browse/SPARK-7111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7111: - Component/s: Streaming Add a tracker to track the direct (receiver-less) streams - Key: SPARK-7111 URL: https://issues.apache.org/jira/browse/SPARK-7111 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Saisai Shao Currently, for receiver-based input streams, Spark Streaming offers ReceiverTracker and ReceivedBlockTracker to track the status of receivers as well as block information. This status and block information can also be retrieved through StreamingListener and exposed to users. But for direct (receiver-less) input streams, Spark Streaming currently lacks such a mechanism to track the registered direct streams, and it also lacks a way to track how many records they have processed. This issue proposes a mechanism to track registered direct streams and to expose their processing statistics through BatchInfo and StreamingListener. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7112) Add a tracker to track the direct streams
[ https://issues.apache.org/jira/browse/SPARK-7112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7112: - Component/s: Streaming Add a tracker to track the direct streams - Key: SPARK-7112 URL: https://issues.apache.org/jira/browse/SPARK-7112 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Saisai Shao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7113) Add the direct stream related information to the streaming listener and web UI
[ https://issues.apache.org/jira/browse/SPARK-7113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7113: - Component/s: Streaming Add the direct stream related information to the streaming listener and web UI -- Key: SPARK-7113 URL: https://issues.apache.org/jira/browse/SPARK-7113 Project: Spark Issue Type: Sub-task Components: Streaming Reporter: Saisai Shao -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7107) Add parameter for zookeeper.znode.parent to hbase_inputformat.py
[ https://issues.apache.org/jira/browse/SPARK-7107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7107: - Component/s: PySpark Add parameter for zookeeper.znode.parent to hbase_inputformat.py Key: SPARK-7107 URL: https://issues.apache.org/jira/browse/SPARK-7107 Project: Spark Issue Type: Bug Components: PySpark Reporter: Ted Yu Assignee: Ted Yu Priority: Minor [~yeshavora] first reported encountering the following exception running hbase_inputformat.py :
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.RuntimeException: java.lang.NullPointerException
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
  at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:313)
  at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:288)
  at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
{code}
It turned out that the HBase cluster has a custom znode parent:
{code}
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>
{code}
hbase_inputformat.py should support specifying a custom znode parent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
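One way the example could accept a custom znode parent (a hedged sketch, not the actual patch) is to pass it through the Hadoop configuration dictionary that hbase_inputformat.py already hands to {{newAPIHadoopRDD}}; the host and table names below are placeholders, and the converter classes are the ones bundled with the Spark examples.
{code}
# Hedged sketch; assumes `sc` from a pyspark shell and placeholder host/table names.
conf = {
    "hbase.zookeeper.quorum": "zk-host",
    "hbase.mapreduce.inputtable": "test_table",
    "zookeeper.znode.parent": "/hbase-unsecure",  # the non-default parent from this report
}
hbase_rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters.ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters.HBaseResultToStringConverter",
    conf=conf)
print(hbase_rdd.count())
{code}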
[jira] [Updated] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5722: - Assignee: Don Drake Infer_schema_type incorrect for Integers in pyspark --- Key: SPARK-5722 URL: https://issues.apache.org/jira/browse/SPARK-5722 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0 Reporter: Don Drake Assignee: Don Drake Fix For: 1.2.2 The Integers datatype in Python does not match what a Scala/Java integer is defined as. This causes inference of data types and schemas to fail when data is larger than 2^32 and it is inferred incorrectly as an Integer. Since the range of valid Python integers is wider than Java Integers, this causes problems when inferring Integer vs. Long datatypes. This will cause problems when attempting to save a SchemaRDD as Parquet or JSON. Here's an example:
{code}
sqlCtx = SQLContext(sc)
from pyspark.sql import Row
rdd = sc.parallelize([Row(f1='a', f2=100)])
srdd = sqlCtx.inferSchema(rdd)
srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}
That number is a LongType in Java, but an Integer in python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred. More tests:
{code}
from pyspark.sql import _infer_type
# OK
print _infer_type(1)
IntegerType
# OK
print _infer_type(2**31-1)
IntegerType
# WRONG
print _infer_type(2**31)
IntegerType
# WRONG
print _infer_type(2**61)
IntegerType
# OK
print _infer_type(2**71)
LongType
{code}
Java Primitive Types defined: http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html Python Built-in Types: https://docs.python.org/2/library/stdtypes.html#typesnumeric -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
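The report boils down to a range check when mapping Python ints to Catalyst types. The following is a standalone sketch of that check (plain Python, not the actual pyspark.sql implementation): a value is only a safe IntegerType if it fits in the JVM's signed 32-bit range.
{code}
# Standalone sketch, not pyspark internals: choose Integer vs. Long for a Python int.
JVM_INT_MIN, JVM_INT_MAX = -(2**31), 2**31 - 1

def infer_integral_type(value):
    if JVM_INT_MIN <= value <= JVM_INT_MAX:
        return "IntegerType"
    return "LongType"

assert infer_integral_type(2**31 - 1) == "IntegerType"
assert infer_integral_type(2**31) == "LongType"   # wrongly IntegerType in the report above
assert infer_integral_type(2**61) == "LongType"
{code}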
[jira] [Updated] (SPARK-5651) Support 'create db.table' in HiveContext
[ https://issues.apache.org/jira/browse/SPARK-5651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5651: - Assignee: Yadong Qi Support 'create db.table' in HiveContext Key: SPARK-5651 URL: https://issues.apache.org/jira/browse/SPARK-5651 Project: Spark Issue Type: Bug Components: SQL Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.4.0 Spark currently only supports ```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext. This patch adds support for ```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5712) Semicolon at end of a comment line
[ https://issues.apache.org/jira/browse/SPARK-5712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5712: - Assignee: Adrian Wang Semicolon at end of a comment line -- Key: SPARK-5712 URL: https://issues.apache.org/jira/browse/SPARK-5712 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.4.0 HIVE-3348 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5756) Analyzer should not throw scala.NotImplementedError for illegitimate sql
[ https://issues.apache.org/jira/browse/SPARK-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5756: - Assignee: Fei Wang Analyzer should not throw scala.NotImplementedError for illegitimate sql Key: SPARK-5756 URL: https://issues.apache.org/jira/browse/SPARK-5756 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Fei Wang Assignee: Fei Wang ```SELECT CAST(x AS STRING) FROM src``` throws a NotImplementedError: CliDriver: scala.NotImplementedError: an implementation is missing at scala.Predef$.$qmark$qmark$qmark(Predef.scala:252) at org.apache.spark.sql.catalyst.expressions.PrettyAttribute.dataType(namedExpressions.scala:221) at org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:30) at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:30) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68) at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:68) at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80) at scala.collection.immutable.List.exists(List.scala:84) at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:68) at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:56) at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:56) at org.apache.spark.sql.catalyst.expressions.NamedExpression.typeSuffix(namedExpressions.scala:62) at org.apache.spark.sql.catalyst.expressions.Alias.toString(namedExpressions.scala:124) at org.apache.spark.sql.catalyst.expressions.Expression.prettyString(Expression.scala:78) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1$$anonfun$7.apply(Analyzer.scala:83) at scala.collection.immutable.Stream.map(Stream.scala:376) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:81) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:204) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:79) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5758) Use LongType as the default type for integers in JSON schema inference.
[ https://issues.apache.org/jira/browse/SPARK-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5758: - Assignee: Yin Huai Use LongType as the default type for integers in JSON schema inference. --- Key: SPARK-5758 URL: https://issues.apache.org/jira/browse/SPARK-5758 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 Per discussion in https://github.com/apache/spark/pull/4521, we will use LongType as the default data type for integer values in JSON schema inference. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
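A small, hypothetical PySpark illustration of why the wider default helps (it assumes {{sc}} and a {{SQLContext}} named {{sqlContext}} from a pyspark shell): a JSON column whose first values fit in 32 bits can later hold values that do not, so inferring LongType up front avoids overflow.
{code}
# Hypothetical illustration; assumes `sc` and `sqlContext` are already defined.
records = ['{"id": 1}', '{"id": 5000000000}']   # the second value exceeds the 32-bit range
df = sqlContext.jsonRDD(sc.parallelize(records))
df.printSchema()   # with LongType as the default, `id` accommodates both rows
{code}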
[jira] [Updated] (SPARK-5789) Throw a better error message if JsonRDD.parseJson encounters unrecoverable parsing errors.
[ https://issues.apache.org/jira/browse/SPARK-5789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5789: - Assignee: Yin Huai Throw a better error message if JsonRDD.parseJson encounters unrecoverable parsing errors. -- Key: SPARK-5789 URL: https://issues.apache.org/jira/browse/SPARK-5789 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 For example {code} sqlContext.jsonRDD(sc.parallelize("a:1}" :: Nil)) {code} will throw {code} scala.MatchError: a (of class java.lang.String) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 15/02/12 15:08:55 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 4.0 (TID 26) in 10 ms on localhost (7/8) 15/02/12 15:08:55 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 4.0 (TID 33, localhost): scala.MatchError: a (of class java.lang.String) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:302) at org.apache.spark.sql.json.JsonRDD$$anonfun$parseJson$1$$anonfun$apply$2.apply(JsonRDD.scala:300) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:879) at org.apache.spark.rdd.RDD$$anonfun$19.apply(RDD.scala:878) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.SparkContext$$anonfun$30.apply(SparkContext.scala:1516) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:64) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5498) [SPARK-SQL]when the partition schema does not match table schema,it throws java.lang.ClassCastException and so on
[ https://issues.apache.org/jira/browse/SPARK-5498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5498: - Assignee: jeanlyn [SPARK-SQL]when the partition schema does not match table schema,it throws java.lang.ClassCastException and so on - Key: SPARK-5498 URL: https://issues.apache.org/jira/browse/SPARK-5498 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn Assignee: jeanlyn Fix For: 1.4.0 When the partition schema does not match the table schema, it will throw an exception while the task is running. For example, if we modify the type of a column from int to bigint with the sql *ALTER TABLE table_with_partition CHANGE COLUMN key key BIGINT*, and then query the partition data that was stored before the change, we get the exception: {noformat} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 27.0 failed 4 times, most recent failure: Lost task 0.3 in stage 27.0 (TID 30, BJHC-HADOOP-HERA-16950.jeanlyn.local): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setInt(SpecificMutableRow.scala:241) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$13$$anonfun$apply$4.apply(TableReader.scala:286) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:322) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$fillObject$1.apply(TableReader.scala:314) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$$anon$10.next(Iterator.scala:312) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:141) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) Driver stacktrace: at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696) at scala.Option.foreach(Option.scala:236) at
[jira] [Updated] (SPARK-5404) Statistic of Logical Plan is too aggresive
[ https://issues.apache.org/jira/browse/SPARK-5404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5404: - Assignee: Cheng Hao Statistic of Logical Plan is too aggresive -- Key: SPARK-5404 URL: https://issues.apache.org/jira/browse/SPARK-5404 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.4.0 The statistics of a logical plan are quite helpful when doing optimizations like join reordering; however, the default algorithm is too aggressive, which can easily lead to a totally wrong join order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5277) SparkSqlSerializer does not register user specified KryoRegistrators
[ https://issues.apache.org/jira/browse/SPARK-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5277: - Assignee: Max Seiden SparkSqlSerializer does not register user specified KryoRegistrators - Key: SPARK-5277 URL: https://issues.apache.org/jira/browse/SPARK-5277 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1, 1.3.0 Reporter: Max Seiden Assignee: Max Seiden Fix For: 1.4.0 Although the SparkSqlSerializer class extends the KryoSerializer in core, its overridden newKryo() does not call super.newKryo(). This results in inconsistent serializer behaviors depending on whether a KryoSerializer instance or a SparkSqlSerializer instance is used. This may also be related to the TODO in KryoResourcePool, which uses KryoSerializer instead of SparkSqlSerializer due to yet-to-be-investigated test failures. An example of the divergence in behavior: the Exchange operator creates a new SparkSqlSerializer instance (with an empty conf; another issue) when it is constructed, whereas the GENERIC ColumnType pulls a KryoSerializer out of the resource pool (see above). The result is that the serialized in-memory columns are created using the user-provided serializers / registrators, while serialization during exchange does not use them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-911) Support map pruning on sorted (K, V) RDD's
[ https://issues.apache.org/jira/browse/SPARK-911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-911: Assignee: Aaron Support map pruning on sorted (K, V) RDD's -- Key: SPARK-911 URL: https://issues.apache.org/jira/browse/SPARK-911 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Patrick Wendell Assignee: Aaron Fix For: 1.4.0 If someone has sorted a (K, V) rdd, we should offer them a way to filter a range of the partitions that employs map pruning. This would be simple using a small range index within the rdd itself. A good example: I sort my dataset by time and then want to serve queries that are restricted to a certain time range. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6779) Move shared params to param.shared and use code gen
[ https://issues.apache.org/jira/browse/SPARK-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6779: - Assignee: Xiangrui Meng Move shared params to param.shared and use code gen --- Key: SPARK-6779 URL: https://issues.apache.org/jira/browse/SPARK-6779 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng The boilerplate code should be automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6779) Move shared params to param.shared and use code gen
[ https://issues.apache.org/jira/browse/SPARK-6779?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6779. -- Resolution: Fixed Fix Version/s: 1.4.0 Move shared params to param.shared and use code gen --- Key: SPARK-6779 URL: https://issues.apache.org/jira/browse/SPARK-6779 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 The boilerplate code should be automatically generated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5935) Accept MapType in the schema provided to a JSON dataset.
[ https://issues.apache.org/jira/browse/SPARK-5935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5935: - Assignee: Yin Huai Accept MapType in the schema provided to a JSON dataset. Key: SPARK-5935 URL: https://issues.apache.org/jira/browse/SPARK-5935 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5908) Hive udtf with single alias should be resolved correctly
[ https://issues.apache.org/jira/browse/SPARK-5908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5908: - Assignee: Liang-Chi Hsieh Hive udtf with single alias should be resolved correctly Key: SPARK-5908 URL: https://issues.apache.org/jira/browse/SPARK-5908 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Fix For: 1.4.0 ResolveUdtfsAlias in hiveUdfs only considers HiveGenericUdtf with multiple aliases. When only a single alias is used with HiveGenericUdtf, the alias does not work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5911) Make Column.cast(to: String) support fixed precision and scale decimal type
[ https://issues.apache.org/jira/browse/SPARK-5911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5911: - Assignee: Yin Huai Make Column.cast(to: String) support fixed precision and scale decimal type --- Key: SPARK-5911 URL: https://issues.apache.org/jira/browse/SPARK-5911 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.3.1, 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7107) Add parameter for zookeeper.znode.parent to hbase_inputformat.py
[ https://issues.apache.org/jira/browse/SPARK-7107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-7107: --- Component/s: Examples Add parameter for zookeeper.znode.parent to hbase_inputformat.py Key: SPARK-7107 URL: https://issues.apache.org/jira/browse/SPARK-7107 Project: Spark Issue Type: Bug Components: Examples, PySpark Reporter: Ted Yu Assignee: Ted Yu Priority: Minor [~yeshavora] first reported encountering the following exception running hbase_inputformat.py :
{code}
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.RuntimeException: java.lang.NullPointerException
  at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
  at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:313)
  at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:288)
  at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
{code}
It turned out that the HBase cluster has a custom znode parent:
{code}
<property>
  <name>zookeeper.znode.parent</name>
  <value>/hbase-unsecure</value>
</property>
{code}
hbase_inputformat.py should support specifying a custom znode parent. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5926) [SQL] DataFrame.explain() return false result for DDL command
[ https://issues.apache.org/jira/browse/SPARK-5926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5926: - Assignee: Yanbo Liang [SQL] DataFrame.explain() return false result for DDL command - Key: SPARK-5926 URL: https://issues.apache.org/jira/browse/SPARK-5926 Project: Spark Issue Type: Bug Components: SQL Reporter: Yanbo Liang Assignee: Yanbo Liang Fix For: 1.3.0 This bug is easy to reproduce: the following two queries should print out the same explain result, but they do not. sql("create table tb as select * from src where key < 490").explain(true) sql("explain extended create table tb as select * from src where key < 490") -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5909) Add a clearCache command to Spark SQL's cache manager
[ https://issues.apache.org/jira/browse/SPARK-5909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5909: - Assignee: Yin Huai Add a clearCache command to Spark SQL's cache manager - Key: SPARK-5909 URL: https://issues.apache.org/jira/browse/SPARK-5909 Project: Spark Issue Type: Task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.4.0 This command will clear all cached data from the in-memory cache, which will be useful when users want to quickly clear the cache, or as a workaround for cases like SPARK-5881. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
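A short usage sketch of the proposed command (hedged; it assumes a {{SQLContext}} named {{sqlContext}} and a registered table named {{people}}, both placeholders):
{code}
sqlContext.cacheTable("people")                          # pin one table in the in-memory cache
sqlContext.sql("SELECT COUNT(*) FROM people").collect()  # populates the cache
sqlContext.clearCache()                                  # the command described here: drop all cached data at once
{code}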
[jira] [Updated] (SPARK-5203) union with different decimal type report error
[ https://issues.apache.org/jira/browse/SPARK-5203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5203: - Assignee: guowei union with different decimal type report error -- Key: SPARK-5203 URL: https://issues.apache.org/jira/browse/SPARK-5203 Project: Spark Issue Type: Bug Components: SQL Reporter: guowei Assignee: guowei Fix For: 1.4.0 A test case like this: {code:sql} create table test (a decimal(10,1)); select a from test union all select a*2 from test; {code} Exception thrown: {noformat} 15/01/12 16:28:54 ERROR SparkSQLDriver: Failed in [select a from test union all select a*2 from test] org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved attributes: *, tree: 'Project [*] 'Subquery _u1 'Union Project [a#1] MetastoreRelation default, test, None Project [CAST((CAST(a#2, DecimalType()) * CAST(CAST(2, DecimalType(10,0)), DecimalType())), DecimalType(21,1)) AS _c0#0] MetastoreRelation default, test, None at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:85) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$$anonfun$1.applyOrElse(Analyzer.scala:83) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:83) at org.apache.spark.sql.catalyst.analysis.Analyzer$CheckResolution$.apply(Analyzer.scala:81) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:34) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:410) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:411) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:412) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:417) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:415) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:421) at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:421) at org.apache.spark.sql.hive.HiveContext$QueryExecution.stringResult(HiveContext.scala:369) at org.apache.spark.sql.hive.thriftserver.AbstractSparkSQLDriver.run(AbstractSparkSQLDriver.scala:58) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:275) at 
org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:211) at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5068: - Assignee: dongxu When the path not found in the hdfs,we can't get the result --- Key: SPARK-5068 URL: https://issues.apache.org/jira/browse/SPARK-5068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn Assignee: dongxu Fix For: 1.4.0 When the partition path is found in the metastore but not found in HDFS, it will cause some problems, as follows: {noformat} hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) {noformat} {noformat} hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 {noformat} When I run the sql {noformat} select * from partition_test limit 10 {noformat} in *hive*, I got no problem, but when I run it in *spark-sql* I get the following error: {noformat} Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (SPARK-7108) spark.local.dir is no longer honored in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-7108: --- Summary: spark.local.dir is no longer honored in Standalone mode (was: Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting) spark.local.dir is no longer honored in Standalone mode --- Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
Sean Owen created SPARK-7145: Summary: commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago, the version of Lang 2.x that accidentally comes in via Hadoop can change with the Hadoop version, so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5852) Fail to convert a newly created empty metastore parquet table to a data source parquet table.
[ https://issues.apache.org/jira/browse/SPARK-5852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5852: - Assignee: Yin Huai Fail to convert a newly created empty metastore parquet table to a data source parquet table. - Key: SPARK-5852 URL: https://issues.apache.org/jira/browse/SPARK-5852 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Fix For: 1.3.0 To reproduce the exception, try
{code}
val rdd = sc.parallelize((1 to 10).map(i => s"""{"a":$i, "b":"str${i}"}"""))
sqlContext.jsonRDD(rdd).registerTempTable("jt")
sqlContext.sql("create table test stored as parquet as select * from jt")
{code}
ParquetConversions tries to convert the write path to the data source API write path. But, the following exception was thrown. {code} java.lang.UnsupportedOperationException: empty.reduceLeft at scala.collection.TraversableOnce$class.reduceLeft(TraversableOnce.scala:167) at scala.collection.mutable.ArrayBuffer.scala$collection$IndexedSeqOptimized$$super$reduceLeft(ArrayBuffer.scala:47) at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68) at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.reduce(TraversableOnce.scala:195) at scala.collection.AbstractTraversable.reduce(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetRelation2$.readSchema(newParquet.scala:633) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:349) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$8.apply(newParquet.scala:290) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:290) at org.apache.spark.sql.parquet.ParquetRelation2.init(newParquet.scala:354) at org.apache.spark.sql.hive.HiveMetastoreCatalog.org$apache$spark$sql$hive$HiveMetastoreCatalog$$convertToParquetRelation(HiveMetastoreCatalog.scala:218) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:440) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$$anonfun$apply$4.apply(HiveMetastoreCatalog.scala:439) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60) at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:47) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:439) at org.apache.spark.sql.hive.HiveMetastoreCatalog$ParquetConversions$.apply(HiveMetastoreCatalog.scala:416) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59) at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) at scala.collection.immutable.List.foldLeft(List.scala:84) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59) at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51) at scala.collection.immutable.List.foreach(List.scala:318) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:917) at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:917) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData$lzycompute(SQLContext.scala:918) at org.apache.spark.sql.SQLContext$QueryExecution.withCachedData(SQLContext.scala:918) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan$lzycompute(SQLContext.scala:919) at org.apache.spark.sql.SQLContext$QueryExecution.optimizedPlan(SQLContext.scala:919) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:924) at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:922) at
[jira] [Updated] (SPARK-5862) Only transformUp the given plan once in HiveMetastoreCatalog
[ https://issues.apache.org/jira/browse/SPARK-5862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5862: - Assignee: Liang-Chi Hsieh Only transformUp the given plan once in HiveMetastoreCatalog Key: SPARK-5862 URL: https://issues.apache.org/jira/browse/SPARK-5862 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Assignee: Liang-Chi Hsieh Priority: Minor Fix For: 1.3.0 Currently, ParquetConversions in HiveMetastoreCatalog will transformUp the given plan multiple times if there are many Metastore Parquet tables. Since the transformUp operation is recursive, it would be better to perform it only once. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7000) spark.ml Prediction abstractions should reside in ml.prediction subpackage
[ https://issues.apache.org/jira/browse/SPARK-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-7000. -- Resolution: Not A Problem spark.ml Prediction abstractions should reside in ml.prediction subpackage -- Key: SPARK-7000 URL: https://issues.apache.org/jira/browse/SPARK-7000 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.1 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor spark.ml prediction abstractions are currently not gathered; they are in both ml.impl and ml.tree. Instead, they should be gathered into ml.prediction. This will become more important as more abstractions, such as ensembles, are added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7000) spark.ml Prediction abstractions should reside in ml.prediction subpackage
[ https://issues.apache.org/jira/browse/SPARK-7000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512828#comment-14512828 ] Joseph K. Bradley commented on SPARK-7000: -- Decision after conferring with [~mengxr]: The structure will be: * ml.Predictor (Predictor will be made public and moved to the ml package in [SPARK-5995]) * ml.trees.* (Tree abstractions shared by ml.classification and ml.regression) * ml.ensembles.* (Generic ensemble abstractions shared by ml.classification and ml.regression) Since we can't think of more shared abstractions currently, we'll aim for a flatter directory structure. Closing this JIRA spark.ml Prediction abstractions should reside in ml.prediction subpackage -- Key: SPARK-7000 URL: https://issues.apache.org/jira/browse/SPARK-7000 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.3.1 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Minor spark.ml prediction abstractions are currently not gathered; they are in both ml.impl and ml.tree. Instead, they should be gathered into ml.prediction. This will become more important as more abstractions, such as ensembles, are added. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6989) Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various arcane compiler errors
[ https://issues.apache.org/jira/browse/SPARK-6989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6989. -- Resolution: Cannot Reproduce Spark 1.3 REPL for Scala 2.11 (2.11.2) fails to start, emitting various arcane compiler errors -- Key: SPARK-6989 URL: https://issues.apache.org/jira/browse/SPARK-6989 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.3.0 Environment: Java 1.8.0_40 on Ubuntu 14.04.1 Reporter: Michael Allman Assignee: Prashant Sharma Attachments: spark_repl_2.11_errors.txt When starting the Spark 1.3 spark-shell compiled for Scala 2.11, I get a random assortment of compiler errors. I will attach a transcript. One thing I've noticed is that they seem to be less frequent when I increase the driver heap size to 5 GB or so. By comparison, the Spark 1.1 spark-shell on Scala 2.10 has been rock solid with a 512 MB heap. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7124) Add functions to check for file and directory existence
[ https://issues.apache.org/jira/browse/SPARK-7124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7124. -- Resolution: Not A Problem Add functions to check for file and directory existence --- Key: SPARK-7124 URL: https://issues.apache.org/jira/browse/SPARK-7124 Project: Spark Issue Type: Improvement Components: Input/Output Reporter: Sam Steingold How do I check that a file or directory exists? For a file, I was told to do {{sc.textFile().first()}}, which seems wrong: # it initiates unnecessary i/o which could be huge (what if the file is binary and has no newlines?) # it fails for 0-length files (e.g., we write 0-length {{_SUCCESS}} files in directories after they have been successfully written) It appears that Spark needs bona fide {{isFile}} and {{isDirectory}} methods. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
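Until such helpers exist, a common workaround is to go through the Hadoop FileSystem API instead of reading any data. A hedged PySpark sketch via the Py4J gateway (the path is a placeholder, and {{sc._jvm}} / {{sc._jsc}} are internal handles):
{code}
# Hedged workaround sketch; assumes `sc` is an existing SparkContext.
hadoop_conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(hadoop_conf)
path = sc._jvm.org.apache.hadoop.fs.Path("hdfs:///data/output/_SUCCESS")
print(fs.exists(path))       # works for files and directories
print(fs.isFile(path))       # no data is read, so 0-length files are fine
print(fs.isDirectory(path))
{code}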
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile("s3://bucket/prefix")}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize("abcd")}} {{rdd.saveAsTextFile("s3://bucket/prefix")}} This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile("s3://bucket/prefix")}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) was: Using {{saveAsTextFile("s3://bucket/prefix")}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize("abcd")}} {{rdd.saveAsTextFile("s3://bucket/prefix")}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile("s3://bucket/prefix")}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile("s3://bucket/prefix")}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize("abcd")}} {{rdd.saveAsTextFile("s3://bucket/prefix")}} This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile("s3://bucket/prefix")}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell reopened SPARK-6752: I had to revert this because it caused test failures with the Hadoop 1.0 build. To reproduce them use: build/sbt -Dhadoop.version=1.0.4 -Pkinesis-asl -Phive -Phive-thriftserver -Phive-0.12.0 streaming/test:compile The errors are: {code} [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1740: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1746: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1752: error: cannot find symbol [error] Assert.assertTrue(old context not recovered, newContextCreated.isFalse()); [error] ^ [error] symbol: method isFalse() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1768: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1773: error: cannot find symbol [error] Assert.assertTrue(new context not created, newContextCreated.isTrue()); [error] ^ [error] symbol: method isTrue() [error] location: variable newContextCreated of type MutableBoolean [error] /Users/pwendell/Documents/spark/streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java:1778: error: cannot find symbol [error] Assert.assertTrue(old context not recovered, newContextCreated.isFalse()); [error] ^ [error] symbol: method isFalse() [error] location: variable newContextCreated of type MutableBoolean [error] 6 errors [error] (streaming/test:compile) javac returned nonzero exit code [error] Total time: 94 s, completed Apr 25, 2015 10:30:20 AM pwendell @ admins-mbp : ~/Documents/spark (detached HEAD|REBASE 9/11) {code} Allow StreamingContext to be recreated from checkpoint and existing SparkContext Key: SPARK-6752 URL: https://issues.apache.org/jira/browse/SPARK-6752 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.1, 1.2.1, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 Currently if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevent StreamingContext to be recreated from checkpoints in managed environments where SparkContext is precreated. Proposed solution: Introduce the following methods on StreamingContext 1. {{new StreamingContext(checkpointDirectory, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext 2. 
{{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext and Hadoop conf to read the checkpoint 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext => StreamingContext)}} - If the checkpoint file exists, then recreate the StreamingContext using the provided SparkContext (that is, call 1.), else create the StreamingContext using the provided createFunction. Also, the corresponding Java and Python APIs have to be added as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
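A sketch of how option 3 might look from user code, assuming the proposed signature lands as described above; the checkpoint directory, batch interval, and the pre-created {{sc}} are placeholders, not values from this issue:
{code}
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app"  // placeholder

// If checkpoint data exists, recover the StreamingContext with the pre-created
// SparkContext; otherwise build a fresh one with the supplied function.
val ssc = StreamingContext.getOrCreate(
  checkpointDir,
  sc,  // pre-created SparkContext from the managed environment
  (sc: SparkContext) => {
    val newSsc = new StreamingContext(sc, Seconds(1))
    // ... set up DStreams here ...
    newSsc.checkpoint(checkpointDir)
    newSsc
  })
{code}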
[jira] [Commented] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
[ https://issues.apache.org/jira/browse/SPARK-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512719#comment-14512719 ] Apache Spark commented on SPARK-7145: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/5703 commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency -- Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago the version of Lang 2.x that accidentally comes in via Hadoop can change with Hadoop version and so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
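The change is mostly mechanical, since the 2.x and 3.x classes typically differ only in package name. {{StringUtils.isBlank}} below is just an illustrative call, not a specific call site named by this issue:
{code}
// Accidental usage: Commons Lang 2.x, pulled in transitively via Hadoop.
val blankOld = org.apache.commons.lang.StringUtils.isBlank("  ")

// Intended usage: Commons Lang3, the dependency Spark actually declares.
val blankNew = org.apache.commons.lang3.StringUtils.isBlank("  ")
{code}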
[jira] [Assigned] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
[ https://issues.apache.org/jira/browse/SPARK-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7145: --- Assignee: Apache Spark (was: Sean Owen) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency -- Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Apache Spark Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago the version of Lang 2.x that accidentally comes in via Hadoop can change with Hadoop version and so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7145) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
[ https://issues.apache.org/jira/browse/SPARK-7145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7145: --- Assignee: Sean Owen (was: Apache Spark) commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency -- Key: SPARK-7145 URL: https://issues.apache.org/jira/browse/SPARK-7145 Project: Spark Issue Type: Improvement Components: Spark Core, SQL, Streaming Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor Spark depends only on Commons Lang3 (3.x). However there are several accidental usages of Commons Lang (2.x) in the codebase. As we saw a few days ago the version of Lang 2.x that accidentally comes in via Hadoop can change with Hadoop version and so the accidental usage is more than a purely theoretical problem. It's easy to change the usages to 3.x counterparts. Also, there are just a few uses of Commons IO in the code which can be replaced with uses of Guava, removing another used but undeclared dependency. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6752) Allow StreamingContext to be recreated from checkpoint and existing SparkContext
[ https://issues.apache.org/jira/browse/SPARK-6752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512620#comment-14512620 ] Sean Owen commented on SPARK-6752: -- It's an issue with the version of Commons Lang (2.x). Just use {{AtomicBoolean}} here and avoid the library altogether Allow StreamingContext to be recreated from checkpoint and existing SparkContext Key: SPARK-6752 URL: https://issues.apache.org/jira/browse/SPARK-6752 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.1, 1.2.1, 1.3.1 Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical Fix For: 1.4.0 Currently if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevent StreamingContext to be recreated from checkpoints in managed environments where SparkContext is precreated. Proposed solution: Introduce the following methods on StreamingContext 1. {{new StreamingContext(checkpointDirectory, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext 2. {{new StreamingContext(checkpointDirectory, hadoopConf, sparkContext)}} - Recreate StreamingContext from checkpoint using the provided SparkContext and hadoop conf to read the checkpoint 3. {{StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext = StreamingContext)}} - If checkpoint file exists, then recreate StreamingContext using the provided SparkContext (that is, call 1.), else create StreamingContext using the provided createFunction Also, the corresponding Java and Python API has to be added as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
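The suggested replacement is small; a Scala sketch of the JDK equivalent (the affected test is Java, but the same class applies there):
{code}
import java.util.concurrent.atomic.AtomicBoolean

// AtomicBoolean ships with the JDK, so the test no longer depends on whichever
// Commons Lang version happens to be on the classpath.
val newContextCreated = new AtomicBoolean(false)
assert(!newContextCreated.get())  // instead of newContextCreated.isFalse()

newContextCreated.set(true)
assert(newContextCreated.get())   // instead of newContextCreated.isTrue()
{code}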
[jira] [Commented] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512765#comment-14512765 ] Joseph K. Bradley commented on SPARK-7143: -- Do you have some references to recent papers and current use cases in industry, especially ones showing BM25 is much better than TF-IDF? It will be good to figure out whether it is clearly better than TF-IDF, or if it is best in specialized cases (and would then be better as a Spark package). Also, can you please comment on which variant you're implementing? The Wikipedia page makes it sound like some corrections are necessary for the basic BM25 in order to make it more practical. Thanks! Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallel. This issue is proposed to add it into Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
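For context on the variants question, the plain Okapi BM25 scoring function (as given on the referenced Wikipedia page) for a query Q = q1..qn and a document D is:
{code}
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\,\bigl(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\bigr)}
{code}
Here f(q_i, D) is the term frequency of q_i in D, |D| the document length, avgdl the average document length in the corpus, and k_1 (typically 1.2 to 2.0) and b (typically 0.75) are free parameters; variants such as BM25+ adjust this base form, which is presumably what the "corrections" above refer to.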
[jira] [Updated] (SPARK-5817) UDTF column names didn't set properly
[ https://issues.apache.org/jira/browse/SPARK-5817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5817: - Assignee: Cheng Hao UDTF column names didn't set properly -- Key: SPARK-5817 URL: https://issues.apache.org/jira/browse/SPARK-5817 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.4.0 {code} createQueryTest(Specify the udtf output, select d from (select explode(array(1,1)) d from src limit 1) t) {code} It throws exception like: {panel} org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7 at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5824) CTAS should set null format in hive-0.13.1
[ https://issues.apache.org/jira/browse/SPARK-5824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5824: - Assignee: Adrian Wang CTAS should set null format in hive-0.13.1 -- Key: SPARK-5824 URL: https://issues.apache.org/jira/browse/SPARK-5824 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5833) Adds REFRESH TABLE command to refresh external data sources tables
[ https://issues.apache.org/jira/browse/SPARK-5833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5833: - Assignee: Cheng Lian Adds REFRESH TABLE command to refresh external data sources tables -- Key: SPARK-5833 URL: https://issues.apache.org/jira/browse/SPARK-5833 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Reporter: Cheng Lian Assignee: Cheng Lian Fix For: 1.3.0 This command can be used to refresh (possibly cached) metadata stored in external data source tables. For example, for JSON tables, it forces schema inference; for Parquet tables, it forces schema merging and partition discovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
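Usage is a single SQL statement against an existing external data source table; the table name {{logs}} below is just an example, and an existing {{sqlContext}} is assumed:
{code}
// After new files have been written underneath the table's location,
// refresh the (possibly cached) metadata before querying again.
sqlContext.sql("REFRESH TABLE logs")
{code}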
[jira] [Updated] (SPARK-5840) HiveContext cannot be serialized due to tuple extraction
[ https://issues.apache.org/jira/browse/SPARK-5840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5840: - Assignee: Reynold Xin HiveContext cannot be serialized due to tuple extraction Key: SPARK-5840 URL: https://issues.apache.org/jira/browse/SPARK-5840 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.2.1 Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.3.0 See the following mailing list question: http://apache-spark-developers-list.1001551.n3.nabble.com/ The use of tuple extraction for (hiveconf, sessionState) creates a non-transient tuple field. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
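A schematic Scala sketch of the underlying problem; the class and member names are illustrative stand-ins, not the actual HiveContext code:
{code}
class SessionStateExample {
  // One pattern-matching val producing both values at once ...
  val (hiveconf, sessionState) = {
    val conf  = new Object  // stand-in for the real HiveConf
    val state = new Object  // stand-in for the real SessionState
    (conf, state)
  }
  // ... is compiled into a hidden field holding the whole Tuple2 plus
  // accessors for hiveconf and sessionState. That hidden tuple field is
  // not transient, so serializing the enclosing object drags both values
  // (and anything they reference) along with it.
}
{code}
The straightforward fix is to define the two values separately so each one can be managed, and marked {{@transient}}, on its own.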
[jira] [Updated] (SPARK-5794) add jar should return 0
[ https://issues.apache.org/jira/browse/SPARK-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5794: - Assignee: Adrian Wang add jar should return 0 --- Key: SPARK-5794 URL: https://issues.apache.org/jira/browse/SPARK-5794 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5780) The loggings of Python unittests are noisy and scaring in
[ https://issues.apache.org/jira/browse/SPARK-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5780: - Assignee: Davies Liu The loggings of Python unittests are noisy and scaring in -- Key: SPARK-5780 URL: https://issues.apache.org/jira/browse/SPARK-5780 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.3.0 There is a lot of logging coming from the driver and worker; it is noisy and alarming, and full of exceptions, so people are confused about whether the tests are failing or not. The logging should be muted during tests and only shown if a test fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5626) Spurious test failures due to NullPointerException in EasyMock test code
[ https://issues.apache.org/jira/browse/SPARK-5626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-5626: - Assignee: Josh Rosen Spurious test failures due to NullPointerException in EasyMock test code Key: SPARK-5626 URL: https://issues.apache.org/jira/browse/SPARK-5626 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: flaky-test Attachments: consoleText.txt I've seen a few cases where a test failure will trigger a cascade of spurious failures when instantiating test suites that use EasyMock. Here's a sample symptom: {code} [info] CacheManagerSuite: [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.CacheManagerSuite *** ABORTED *** (137 milliseconds) [info] java.lang.NullPointerException: [info] at org.objenesis.strategy.StdInstantiatorStrategy.newInstantiatorOf(StdInstantiatorStrategy.java:52) [info] at org.objenesis.ObjenesisBase.getInstantiatorOf(ObjenesisBase.java:90) [info] at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73) [info] at org.objenesis.ObjenesisHelper.newInstance(ObjenesisHelper.java:43) [info] at org.easymock.internal.ObjenesisClassInstantiator.newInstance(ObjenesisClassInstantiator.java:26) [info] at org.easymock.internal.ClassProxyFactory.createProxy(ClassProxyFactory.java:219) [info] at org.easymock.internal.MocksControl.createMock(MocksControl.java:59) [info] at org.easymock.EasyMock.createMock(EasyMock.java:103) [info] at org.scalatest.mock.EasyMockSugar$class.mock(EasyMockSugar.scala:267) [info] at org.apache.spark.CacheManagerSuite.mock(CacheManagerSuite.scala:28) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply$mcV$sp(CacheManagerSuite.scala:40) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at org.apache.spark.CacheManagerSuite$$anonfun$1.apply(CacheManagerSuite.scala:38) [info] at org.scalatest.BeforeAndAfter$class.runTest(BeforeAndAfter.scala:195) [info] at org.apache.spark.CacheManagerSuite.runTest(CacheManagerSuite.scala:28) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:318) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.scala:1424) [info] at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:212) [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:545) [info] at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:212) [info] at org.apache.spark.CacheManagerSuite.org$scalatest$BeforeAndAfter$$super$run(CacheManagerSuite.scala:28) [info] at 
org.scalatest.BeforeAndAfter$class.run(BeforeAndAfter.scala:241) [info] at org.apache.spark.CacheManagerSuite.run(CacheManagerSuite.scala:28) [info] at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462) [info] at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:294) [info] at sbt.ForkMain$Run$2.call(ForkMain.java:284) [info] at java.util.concurrent.FutureTask.run(FutureTask.java:262) [info] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] at java.lang.Thread.run(Thread.java:745) {code} This is from https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26852/consoleFull. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Resolved] (SPARK-7092) Update spark scala version to 2.11.6
[ https://issues.apache.org/jira/browse/SPARK-7092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7092. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5662 [https://github.com/apache/spark/pull/5662] Update spark scala version to 2.11.6 Key: SPARK-7092 URL: https://issues.apache.org/jira/browse/SPARK-7092 Project: Spark Issue Type: Improvement Components: Spark Core, Spark Shell Affects Versions: 1.4.0 Reporter: Prashant Sharma Assignee: Prashant Sharma Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512794#comment-14512794 ] Eric O. LEBIGOT (EOL) commented on SPARK-7141: -- Ah, thanks: it's good to know where the issue comes from. As far as I understand, it does not look like it's going to be fixed any soon (minor issue, and Amazon deprecating s3://…). I will try s3n:// instead and see if this works and changes anything. saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}} This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib
[ https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512843#comment-14512843 ] Debasish Das commented on SPARK-5992: - Has anyone compared the Algebird LSH with the Spark minhash linked above? Unless Algebird is slow (which I found to be the case for the TopK monoid), should we use it the same way HLL is being used in Spark Streaming? Is it OK to add Algebird to MLlib? Locality Sensitive Hashing (LSH) for MLlib -- Key: SPARK-5992 URL: https://issues.apache.org/jira/browse/SPARK-5992 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Locality Sensitive Hashing (LSH) would be very useful for ML. It would be great to discuss some possible algorithms here, choose an API, and make a PR for an initial algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7146) Should ML sharedParams be a public API?
Joseph K. Bradley created SPARK-7146: Summary: Should ML sharedParams be a public API? Key: SPARK-7146 URL: https://issues.apache.org/jira/browse/SPARK-7146 Project: Spark Issue Type: Brainstorming Components: ML Reporter: Joseph K. Bradley Discussion: Should the Param traits in sharedParams.scala be private? Pros: * Users have to be careful since parameters can have different meanings for different algorithms. Cons: * Sharing the Param traits helps to encourage standardized Param names and documentation. * If the shared Params are public, then implementations could test for the traits. We probably do not want users to do that. Currently, the shared params are public but marked as DeveloperApi. Proposal: Either (a) make the shared params private to encourage users to write specialized documentation and value checks for parameters, or (b) design a better way to encourage overriding documentation and parameter value checks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
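For readers outside the ML internals, a purely schematic sketch of the pattern under discussion; the {{Param}}/{{Params}} definitions here are simplified stand-ins, not the actual {{org.apache.spark.ml.param}} classes:
{code}
// Minimal stand-ins for the real Param machinery.
class Param[T](val name: String, val doc: String)
trait Params

// A shared trait fixes the param name and documentation once, so every
// algorithm that mixes it in exposes "inputCol" consistently ...
trait HasInputCol extends Params {
  val inputCol: Param[String] = new Param[String]("inputCol", "input column name")
}

// ... which is also why callers could test for the trait if it stays public:
// model.isInstanceOf[HasInputCol]
{code}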
[jira] [Created] (SPARK-7147) Enforce Params.validate by making it abstract
Joseph K. Bradley created SPARK-7147: Summary: Enforce Params.validate by making it abstract Key: SPARK-7147 URL: https://issues.apache.org/jira/browse/SPARK-7147 Project: Spark Issue Type: Sub-task Components: ML Reporter: Joseph K. Bradley We should make Params.validate abstract to force developers to implement it. We have been ignoring it so far. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
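The change is small in code terms; a schematic sketch (the trait names are illustrative, not Spark's actual code):
{code}
// Today: a no-op default, easy for implementations to forget about.
trait ParamsWithDefault {
  def validate(): Unit = {}
}

// Proposed: no default body, so every concrete implementation is forced
// to state explicitly how its parameter values are checked.
trait ParamsAbstract {
  def validate(): Unit
}
{code}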
[jira] [Created] (SPARK-7148) Configure Parquet block size (row group size) for ML model import/export
Joseph K. Bradley created SPARK-7148: Summary: Configure Parquet block size (row group size) for ML model import/export Key: SPARK-7148 URL: https://issues.apache.org/jira/browse/SPARK-7148 Project: Spark Issue Type: Improvement Components: MLlib, SQL Affects Versions: 1.3.1, 1.3.0, 1.4.0 Reporter: Joseph K. Bradley Priority: Minor It would be nice if we could configure the Parquet buffer size when using Parquet format for ML model import/export. Currently, for some models (trees and ensembles), the schema has 13+ columns. With a default buffer size of 128MB (I think), that puts the allocated buffer way over the default memory made available by run-example. Because of this problem, users have to use spark-submit and explicitly use a larger amount of memory in order to run some ML examples. Is there a simple way to specify {{parquet.block.size}}? I'm not familiar with this part of SparkSQL. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
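One way this is commonly attempted is through the Hadoop configuration carried by the SparkContext ({{sc}} assumed below); whether the SQL Parquet writer honors it on this code path is exactly the open question here, so treat this as a sketch rather than a confirmed answer:
{code}
// Ask Parquet for 64 MB row groups instead of the ~128 MB default.
sc.hadoopConfiguration.setInt("parquet.block.size", 64 * 1024 * 1024)
{code}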
[jira] [Commented] (SPARK-5023) In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages)
[ https://issues.apache.org/jira/browse/SPARK-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512869#comment-14512869 ] Eran Medan commented on SPARK-5023: --- I don't think this is a duplicate: the information shows correctly in the live view; the incorrect numbers are in the history / event view. I had no lost partitions and no failures, but still, something that took seconds or minutes in the live view shows as milliseconds or seconds in the history view. I'll debug it and see if I can figure out the root cause. No error messages. In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages) Key: SPARK-5023 URL: https://issues.apache.org/jira/browse/SPARK-5023 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.1.1, 1.2.0 Environment: Amazon EC2 AMI r3.2xlarge, cluster of 20 to 50 nodes, running the ec2 provided scripts to create. Reporter: Eran Medan I'm running a long process using Spark + Graph and things look good on the 4040 job status UI, but when the job is done and I go to the history, the total job duration is much, much smaller than the total of its stages. The way I set logs up is this: val homeDir = sys.props("user.home") val logsPath = new File(homeDir, "sparkEventLogs") val conf = new SparkConf().setAppName(...) conf.set("spark.eventLog.enabled", "true").set("spark.eventLog.dir", logsPath.getCanonicalPath) For example, job ID X - duration 0.2 s, but when I click the job and look at its stages, the sum of their durations is more than 15 minutes! (Before the job was over, in the 4040 job status, the job duration was correct; it is only incorrect when it's done and I go to the logs.) I hope I didn't misconfigure something, because I was very surprised no one reported it yet (I searched, but perhaps I missed it). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
Eric O. LEBIGOT (EOL) created SPARK-7141: Summary: saveAsTextFile() on S3 first creates empty prefix Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved (and read from, maybe) in the intended location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). was: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved (and read from, maybe) in the intended location. saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many `block_` files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) was: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many `block_` files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric O. LEBIGOT (EOL) updated SPARK-7141: - Description: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) was: Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many `block_` files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7108) Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512280#comment-14512280 ] Patrick Wendell edited comment on SPARK-7108 at 4/25/15 6:01 AM: - Hey I think [~joshrosen] actually worded this a bit confusingly. The issue is that even if SPARK_LOCAL_DIRS is not set at all by the user, the setting of spark.local.dir is not used from the application. This regresses from earlier versions of spark which (as the documentation implies) would respect spark.local.dir if set. was (Author: pwendell): Hey I think [~joshrosen] actually miswrote this. The issue is that even if SPARK_LOCAL_DIRS is not set at all, the setting of spark.local.dir is not used from the driver. Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting - Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-7108) Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting
[ https://issues.apache.org/jira/browse/SPARK-7108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512280#comment-14512280 ] Patrick Wendell edited comment on SPARK-7108 at 4/25/15 6:02 AM: - Hey I think [~joshrosen] actually worded this a bit confusingly. The issue is that even if SPARK_LOCAL_DIRS is not set at all by the user, the setting of spark.local.dir is not used from the application. This regresses from earlier versions of spark which (as the documentation implies) would respect spark.local.dir if set. This isn't great because Spark will silently start using a different local directory when upgraded. In our case it caused us to run out of disk space because /tmp was used instead of a directory we'd explicitly set. was (Author: pwendell): Hey I think [~joshrosen] actually worded this a bit confusingly. The issue is that even if SPARK_LOCAL_DIRS is not set at all by the user, the setting of spark.local.dir is not used from the application. This regresses from earlier versions of spark which (as the documentation implies) would respect spark.local.dir if set. Setting spark.local.dir in driver no longer overrides the standalone worker's local directory setting - Key: SPARK-7108 URL: https://issues.apache.org/jira/browse/SPARK-7108 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1, 1.3.0 Reporter: Josh Rosen Priority: Critical Prior to SPARK-4834, configuring spark.local.dir in the driver would affect the local directories created on the executor. After this patch, executors will always ignore this setting in favor of directories read from {{SPARK_LOCAL_DIRS}}, which is set by the standalone worker based on the worker's own configuration and not the application configuration. This change impacts users who configured {{spark.local.dir}} only in their driver and not via their cluster's {{spark-defaults.conf}} or {{spark-env.sh}} files. This is an atypical use-case, since the available local directories / disks are a property of the cluster and not the application, which probably explains why this issue has not been reported previously. The correct fix might be comment + documentation improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7118) Add coalesce Spark SQL function to PySpark API
[ https://issues.apache.org/jira/browse/SPARK-7118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512320#comment-14512320 ] Apache Spark commented on SPARK-7118: - User 'ogirardot' has created a pull request for this issue: https://github.com/apache/spark/pull/5698 Add coalesce Spark SQL function to PySpark API -- Key: SPARK-7118 URL: https://issues.apache.org/jira/browse/SPARK-7118 Project: Spark Issue Type: Improvement Components: PySpark, SQL Affects Versions: 1.3.1 Reporter: Olivier Girardot Priority: Minor The *org.apache.spark.sql.functions.coalesce* function is not available from the PySpark SQL API. Let's add it. Olivier. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
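For reference, the Scala side already exists and is what the PySpark wrapper would delegate to; an existing DataFrame {{df}} and the column names below are assumptions for illustration:
{code}
import org.apache.spark.sql.functions.coalesce

// Returns, per row, the first non-null value among the given columns.
val withFallback = df.select(coalesce(df("primary_email"), df("backup_email")))
{code}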
[jira] [Assigned] (SPARK-5891) Add Binarizer
[ https://issues.apache.org/jira/browse/SPARK-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5891: --- Assignee: Apache Spark Add Binarizer - Key: SPARK-5891 URL: https://issues.apache.org/jira/browse/SPARK-5891 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark `Binarizer` takes a column of continuous features and outputs a column with binary features, where nonzeros (or values below a threshold) become 1 in the output. {code} val binarizer = new Binarizer() .setInputCol("numVisits") .setOutputCol("visited") {code} The output column should be marked as binary. We need to discuss whether we should process multiple columns or a vector column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6333) saveAsObjectFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6333: --- Assignee: (was: Apache Spark) saveAsObjectFile support for compression codec -- Key: SPARK-6333 URL: https://issues.apache.org/jira/browse/SPARK-6333 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Deenar Toraskar Priority: Minor saveAsObjectFile current does not support a compression codec. This story is about adding saveAsObjectFile (path, codec) support into spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5891) Add Binarizer
[ https://issues.apache.org/jira/browse/SPARK-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512361#comment-14512361 ] Apache Spark commented on SPARK-5891: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5699 Add Binarizer - Key: SPARK-5891 URL: https://issues.apache.org/jira/browse/SPARK-5891 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `Binarizer` takes a column of continuous features and outputs a column with binary features, where nonzeros (or values below a threshold) become 1 in the output. {code} val binarizer = new Binarizer() .setInputCol("numVisits") .setOutputCol("visited") {code} The output column should be marked as binary. We need to discuss whether we should process multiple columns or a vector column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5891) Add Binarizer
[ https://issues.apache.org/jira/browse/SPARK-5891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5891: --- Assignee: (was: Apache Spark) Add Binarizer - Key: SPARK-5891 URL: https://issues.apache.org/jira/browse/SPARK-5891 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng `Binarizer` takes a column of continuous features and outputs a column with binary features, where nonzeros (or values below a threshold) become 1 in the output. {code} val binarizer = new Binarizer() .setInputCol("numVisits") .setOutputCol("visited") {code} The output column should be marked as binary. We need to discuss whether we should process multiple columns or a vector column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Wang updated SPARK-2750: Attachment: exception on yarn when https enabled.txt Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: Tao Wang Labels: https, ssl, webui Fix For: 1.0.3 Attachments: exception on yarn when https enabled.txt Original Estimate: 96h Remaining Estimate: 96h Now I try to add https support for web ui using Jetty ssl integration.Below is the plan: 1.Web UI include Master UI, Worker UI, HistoryServer UI and Spark Ui. User can switch between https and http by configure spark.http.policy in JVM property for each process, while choose http by default. 2.Web port of Master and worker would be decided in order of launch arguments, JVM property, System Env and default port. 3.Below is some other configuration items: spark.ssl.server.keystore.location The file or URL of the SSL Key store spark.ssl.server.keystore.password The password for the key store spark.ssl.server.keystore.keypassword The password (if any) for the specific key within the key store spark.ssl.server.keystore.type The type of the key store (default JKS) spark.client.https.need-auth True if SSL needs client authentication spark.ssl.server.truststore.location The file name or URL of the trust store location spark.ssl.server.truststore.password The password for the trust store spark.ssl.server.truststore.type The type of the trust store (default JKS) Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
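If the proposal is adopted as written, enabling HTTPS would presumably amount to JVM properties along these lines for each affected process; the values are placeholders built from the property names in the description, not a tested configuration:
{code}
-Dspark.http.policy=https
-Dspark.ssl.server.keystore.location=/path/to/keystore.jks
-Dspark.ssl.server.keystore.password=********
-Dspark.ssl.server.keystore.type=JKS
-Dspark.ssl.server.truststore.location=/path/to/truststore.jks
-Dspark.ssl.server.truststore.password=********
{code}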
[jira] [Assigned] (SPARK-6333) saveAsObjectFile support for compression codec
[ https://issues.apache.org/jira/browse/SPARK-6333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6333: --- Assignee: Apache Spark saveAsObjectFile support for compression codec -- Key: SPARK-6333 URL: https://issues.apache.org/jira/browse/SPARK-6333 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.1 Reporter: Deenar Toraskar Assignee: Apache Spark Priority: Minor saveAsObjectFile current does not support a compression codec. This story is about adding saveAsObjectFile (path, codec) support into spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
Yash Datta created SPARK-7142: - Summary: Minor enhancement to BooleanSimplification Optimizer rule Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
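A self-contained Scala sketch of the three rewrites over a toy expression type; this is illustrative only, since the real rule lives in Catalyst's BooleanSimplification and operates on Catalyst expressions:
{code}
sealed trait Expr
case class Var(name: String) extends Expr
case class Not(e: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

def simplify(e: Expr): Expr = e match {
  case And(a, Or(Not(b), c)) if a == b => And(simplify(a), simplify(c)) // A and (not(A) or B)  =>  A and B
  case Not(And(a, b)) => Or(simplify(Not(a)), simplify(Not(b)))         // not(A and B)  =>  not(A) or not(B)
  case Not(Or(a, b))  => And(simplify(Not(a)), simplify(Not(b)))        // not(A or B)   =>  not(A) and not(B)
  case Not(x)         => Not(simplify(x))
  case And(a, b)      => And(simplify(a), simplify(b))
  case Or(a, b)       => Or(simplify(a), simplify(b))
  case leaf           => leaf
}

// simplify(And(Var("A"), Or(Not(Var("A")), Var("B"))))  ==  And(Var("A"), Var("B"))
{code}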
[jira] [Assigned] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7142: --- Assignee: Apache Spark Minor enhancement to BooleanSimplification Optimizer rule - Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Assignee: Apache Spark Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7142: --- Assignee: (was: Apache Spark) Minor enhancement to BooleanSimplification Optimizer rule - Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7142) Minor enhancement to BooleanSimplification Optimizer rule
[ https://issues.apache.org/jira/browse/SPARK-7142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512487#comment-14512487 ] Apache Spark commented on SPARK-7142: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/5700 Minor enhancement to BooleanSimplification Optimizer rule - Key: SPARK-7142 URL: https://issues.apache.org/jira/browse/SPARK-7142 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yash Datta Priority: Minor Add simplification using these rules : A and (not(A) or B) = A and B not(A and B) = not(A) or not(B) not(A or B) = not(A) and not(B) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7141) saveAsTextFile() on S3 first creates empty prefix
[ https://issues.apache.org/jira/browse/SPARK-7141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512573#comment-14512573 ] Liang-Chi Hsieh commented on SPARK-7141: The double slash issue is caused by the Jets3tFileSystemStore implementation in Hadoop. You can refer to [HADOOP-11444|https://issues.apache.org/jira/browse/HADOOP-11444] and [the discussion on spark-user|https://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAE50=drwWG=eMDM=lsuf-puzopxfnj-+7k3vx_m5mmjfal2...@mail.gmail.com%3E]. saveAsTextFile() on S3 first creates empty prefix - Key: SPARK-7141 URL: https://issues.apache.org/jira/browse/SPARK-7141 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.3.1 Environment: OS X 10.10 Reporter: Eric O. LEBIGOT (EOL) Using {{saveAsTextFile(s3://bucket/prefix}} actually adds an empty prefix, i.e. it writes to {{s3://bucket//prefix}} (note the double slash). Example code (in a {{pyspark}} shell): {{rdd = sc.parallelize(abcd)}} {{rdd.saveAsTextFile(s3://bucket/prefix)}) This is quite annoying, as the files cannot be saved in the intended location (they can be read, though, with the original path: {{sc.textFile(s3://bucket/prefix}}, but the AWS console does not show them in the right place). Also, many {{block_*}} files are created directly in the bucket: shouldn't they be deleted? (This may be a separate issue, but maybe it is a path issue as well.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6980) Akka timeout exceptions indicate which conf controls them
[ https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512575#comment-14512575 ] Harsh Gupta commented on SPARK-6980: [~imranr] [~bryanc] Hi. I tried a simple producer/consumer actors example, setting setTimeOut very low, and was able to see the exception. I am not clear on how the util methods in SparkConf would get a NamedDuration, although the wrapper approach sounds fine. Will do some more tweaks and post here (although I won't be very active this week since I need to get my primary laptop fixed). Akka timeout exceptions indicate which conf controls them - Key: SPARK-6980 URL: https://issues.apache.org/jira/browse/SPARK-6980 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Imran Rashid Assignee: Harsh Gupta Priority: Minor Labels: starter Attachments: Spark-6980-Test.scala If you hit one of the akka timeouts, you just get an exception like {code} java.util.concurrent.TimeoutException: Futures timed out after [30 seconds] {code} The exception doesn't indicate how to change the timeout, though there is usually (always?) a corresponding setting in {{SparkConf}}. It would be nice if the exception included the relevant setting. I think this should be pretty easy to do -- we just need to create something like a {{NamedTimeout}}. It would have its own {{await}} method that catches the akka timeout and throws its own exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
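A minimal sketch of the wrapper idea being discussed; the names, the conf key in the usage comment, and the exact exception type are assumptions, not the eventual Spark implementation:
{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.FiniteDuration

// Carry the name of the controlling setting together with the duration, so the
// timeout error can point the user at the right configuration key.
case class NamedTimeout(confKey: String, duration: FiniteDuration) {
  def awaitResult[T](awaitable: Awaitable[T]): T =
    try {
      Await.result(awaitable, duration)
    } catch {
      case _: TimeoutException =>
        throw new TimeoutException(
          s"Futures timed out after [$duration]; this timeout is controlled by $confKey")
    }
}

// Usage sketch:
//   val timeout = NamedTimeout("spark.akka.askTimeout", 30.seconds)
//   timeout.awaitResult(someFuture)
{code}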
[jira] [Created] (SPARK-7143) Add BM25 Estimator
Liang-Chi Hsieh created SPARK-7143: -- Summary: Add BM25 Estimator Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
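For reference, the per-term BM25 contribution with the common defaults k1 = 1.2 and b = 0.75 looks roughly like the sketch below; this is a standalone illustration, and the object, function, and parameter names are assumptions rather than the proposed Estimator API. The document score is the sum of this quantity over the query terms.
{code}
import scala.math.log

object BM25Sketch {
  // Okapi BM25 contribution of a single query term to a document's score.
  // tf: term frequency in the document, docLen: document length in tokens,
  // avgDocLen: average document length in the corpus,
  // docCount: total number of documents, docFreq: documents containing the term.
  def termScore(tf: Double, docLen: Double, avgDocLen: Double,
                docCount: Long, docFreq: Long,
                k1: Double = 1.2, b: Double = 0.75): Double = {
    val idf = log((docCount - docFreq + 0.5) / (docFreq + 0.5) + 1.0)
    idf * tf * (k1 + 1.0) / (tf + k1 * (1.0 - b + b * docLen / avgDocLen))
  }
}
{code}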
[jira] [Assigned] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7143: --- Assignee: Apache Spark Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh Assignee: Apache Spark [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7143: --- Assignee: (was: Apache Spark) Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7143) Add BM25 Estimator
[ https://issues.apache.org/jira/browse/SPARK-7143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14512578#comment-14512578 ] Apache Spark commented on SPARK-7143: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5701 Add BM25 Estimator -- Key: SPARK-7143 URL: https://issues.apache.org/jira/browse/SPARK-7143 Project: Spark Issue Type: New Feature Components: ML Reporter: Liang-Chi Hsieh [BM25|http://en.wikipedia.org/wiki/Okapi_BM25] is a retrieval function used to rank documents. It is commonly used in IR tasks and can be parallelized. This issue proposes adding it to Spark ML. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7144) SPARK-6784
Yin Huai created SPARK-7144: --- Summary: SPARK-6784 Key: SPARK-7144 URL: https://issues.apache.org/jira/browse/SPARK-7144 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7144) SPARK-6784
[ https://issues.apache.org/jira/browse/SPARK-7144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-7144. - Resolution: Invalid oops... SPARK-6784 -- Key: SPARK-7144 URL: https://issues.apache.org/jira/browse/SPARK-7144 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5456) Decimal Type comparison issue
[ https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-5456: Priority: Blocker (was: Major) Target Version/s: 1.4.0 Decimal Type comparison issue - Key: SPARK-5456 URL: https://issues.apache.org/jira/browse/SPARK-5456 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Kuldeep Priority: Blocker Not quite able to figure this out, but here is a JUnit test to reproduce it, in JavaAPISuite.java {code:title=DecimalBug.java} @Test public void decimalQueryTest() { List<Row> decimalTable = new ArrayList<Row>(); decimalTable.add(RowFactory.create(new BigDecimal(1), new BigDecimal(2))); decimalTable.add(RowFactory.create(new BigDecimal(3), new BigDecimal(4))); JavaRDD<Row> rows = sc.parallelize(decimalTable); List<StructField> fields = new ArrayList<StructField>(7); fields.add(DataTypes.createStructField("a", DataTypes.createDecimalType(), true)); fields.add(DataTypes.createStructField("b", DataTypes.createDecimalType(), true)); sqlContext.applySchema(rows.rdd(), DataTypes.createStructType(fields)).registerTempTable("foo"); Assert.assertEquals(sqlContext.sql("select * from foo where a > 0").collectAsList(), decimalTable); } {code} Fails with java.lang.ClassCastException: java.math.BigDecimal cannot be cast to org.apache.spark.sql.types.Decimal -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org