[jira] [Resolved] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-11500.

   Resolution: Fixed
Fix Version/s: 1.7.0

Issue resolved by pull request 9517
[https://github.com/apache/spark/pull/9517]

> Not deterministic order of columns when using merging schemas.
> --
>
> Key: SPARK-11500
> URL: https://issues.apache.org/jira/browse/SPARK-11500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 1.7.0
>
>
> When executing
> {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne,
> pathTwo).printSchema()}}
> the order of columns is not deterministic; they may show up in a different
> order from run to run.
> This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which
> {{ParquetRelation}} extends, as you know). When
> {{FileStatusCache.listLeafFiles()}} is called, it returns a
> {{Set[FileStatus]}}, which loses the ordering of the original
> {{Array[FileStatus]}}.
> So, after retrieving the list of leaf files including {{_metadata}} and
> {{_common_metadata}}, {{ParquetRelation.mergeSchemasInParallel()}} merges
> (separately and if necessary) the {{Set}}s of {{_metadata}},
> {{_common_metadata}} and part-files, and this ends up producing a different
> column order: the merged schema leads with the columns of whichever file
> happened to come first, even when the other files do not have them.
> I think this can be resolved by using {{LinkedHashSet}}.
> Put simply: if file A has fields 1,2,3 and file B has fields 3,4,5, we
> cannot ensure which columns show up first, since the order is not
> deterministic.
> 1. Read the file list (A and B).
> 2. The order (A and B, or B and A) is not deterministic, as described above.
> 3. The retrieved schemas of (A and B, or B and A) are merged with
> {{reduceOption}} (which perhaps should be {{reduceLeftOption}} or
> {{reduceRightOption}}).
> 4. The output columns would be 1,2,3,4,5 for A and B, or 3,4,5,1,2 for B and A.
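
A quick standalone illustration of the ordering difference described above (the file names are made up; this is not the actual {{listLeafFiles()}} code):
{code}
import scala.collection.mutable

// A plain HashSet does not preserve the order in which leaf files were added,
// so schema merging may start from a different file than the one listed first.
val hashed = mutable.HashSet[String]()
Seq("partA.parquet", "partB.parquet", "_metadata").foreach(hashed += _)
println(hashed.mkString(", "))   // iteration order is hash-based, not insertion order

// A LinkedHashSet keeps insertion order, so merging always starts from the
// same file's columns.
val linked = mutable.LinkedHashSet[String]()
Seq("partA.parquet", "partB.parquet", "_metadata").foreach(linked += _)
println(linked.mkString(", "))   // partA.parquet, partB.parquet, _metadata
{code}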



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6726) Model export/import for spark.ml: LogisticRegression

2015-11-11 Thread Earthson Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000192#comment-15000192
 ] 

Earthson Lu commented on SPARK-6726:


Is the API ready for subtasks? I can do some work:)

> Model export/import for spark.ml: LogisticRegression
> 
>
> Key: SPARK-6726
> URL: https://issues.apache.org/jira/browse/SPARK-6726
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11583) Make MapStatus use less memory usage

2015-11-11 Thread Kent Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000187#comment-15000187
 ] 

Kent Yao edited comment on SPARK-11583 at 11/11/15 10:21 AM:
-

[~imranr]
[~dlemire]
1. Test cases

1.1 Sparse case: for each task, 10 blocks contain data, the others don't.

  sc.makeRDD(1 to 40950, 4095).groupBy(x=>x).top(5)

1.2 Dense case: for each task, most blocks contain data, few don't.

  1.2.1 Full:
  sc.makeRDD(1 to 16769025, 4095).groupBy(x=>x).top(5)

  1.2.2 Very dense (about 95 empty blocks):
  sc.makeRDD(1 to 1638, 4095).groupBy(x=>x).top(5)

1.3 Test tool:

  jmap -dump:format=b,file=heap.bin

1.4 Test branches: branch-1.5, master

2. Memory usage

2.1 RoaringBitmap--sparse

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 34,135,920
-----------------------------------------------------------------------------------------------
My explanation: 4095 * short[4095-10] = 4095 * 16 * 4085 / 8 ≈ 34,135,920 bytes

2.2.1 RoaringBitmap--full

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 360,360
-----------------------------------------------------------------------------------------------
My explanation: RoaringBitmap(0)

2.2.2 RoaringBitmap--very dense

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 1,441,440
-----------------------------------------------------------------------------------------------
My explanation: 4095 * short[95] = 4095 * 16 * 95 / 8 = 778,050 bytes (+ overhead = 1,441,440)

2.3 BitSet--sparse

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 2,391,480
-----------------------------------------------------------------------------------------------
My explanation: 4095 * 4095 = 16,769,025 bits (+ overhead = 2,391,480 bytes)

2.4 BitSet--full

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 2,391,480
-----------------------------------------------------------------------------------------------
My explanation: same as above

3. Conclusion

Memory usage:

RoaringBitmap--full < RoaringBitmap--very dense < BitSet--full = BitSet--sparse < RoaringBitmap--sparse
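
As a rough sanity check of the figures above, the raw payload sizes can be recomputed in the REPL (a sketch that only counts the short/bit payload and ignores JVM object overhead):
{code}
// Raw payload for 4,095 HighlyCompressedMapStatus instances.
val statuses = 4095L

// Sparse case: each status tracks 4095 - 10 block ids as 2-byte shorts.
println(statuses * (4095 - 10) * 2)   // 33,456,150, close to the >= 34,135,920 retained heap

// Very dense case: each status tracks ~95 block ids as 2-byte shorts.
println(statuses * 95 * 2)            // 778,050, matching the figure above

// BitSet cases: one bit per block, 4,095 bits per status.
println(statuses * 4095 / 8)          // 2,096,128, close to the >= 2,391,480 retained heap
{code}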
  








[jira] [Commented] (SPARK-10113) Support for unsigned Parquet logical types

2015-11-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000203#comment-15000203
 ] 

Cheng Lian commented on SPARK-10113:


I think emitting a clear error message is more reasonable since Spark SQL only 
handles signed integral types and can't hold the whole value ranges of those 
unsigned integral types.

> Support for unsigned Parquet logical types
> --
>
> Key: SPARK-10113
> URL: https://issues.apache.org/jira/browse/SPARK-10113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Jordan Thomas
>
> Add support for unsigned Parquet logical types UINT_16, UINT_32 and UINT_64.
> {code}
> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 (UINT_64);
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.illegalType$1(CatalystSchemaConverter.scala:130)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.convertPrimitiveField(CatalystSchemaConverter.scala:169)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:115)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:97)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:94)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.org$apache$spark$sql$parquet$CatalystSchemaConverter$$convert(CatalystSchemaConverter.scala:94)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter$$anonfun$convertGroupField$1.apply(CatalystSchemaConverter.scala:200)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter$$anonfun$convertGroupField$1.apply(CatalystSchemaConverter.scala:200)
>   at scala.Option.fold(Option.scala:158)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.convertGroupField(CatalystSchemaConverter.scala:200)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.convertField(CatalystSchemaConverter.scala:116)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:97)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter$$anonfun$2.apply(CatalystSchemaConverter.scala:94)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.org$apache$spark$sql$parquet$CatalystSchemaConverter$$convert(CatalystSchemaConverter.scala:94)
>   at 
> org.apache.spark.sql.parquet.CatalystSchemaConverter.convert(CatalystSchemaConverter.scala:91)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation$$anonfun$readSchemaFromFooter$2.apply(ParquetRelation.scala:734)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation$$anonfun$readSchemaFromFooter$2.apply(ParquetRelation.scala:734)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation$.readSchemaFromFooter(ParquetRelation.scala:734)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation$$anonfun$28$$anonfun$apply$8.apply(ParquetRelation.scala:714)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation$$anonfun$28$$anonfun$apply$8.apply(ParquetRelation.scala:713)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at 

[jira] [Updated] (SPARK-11651) LinearRegressionSummary should support get residuals by type

2015-11-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11651:

Description: 
LinearRegressionSummary should support getting residuals by type, like R's glm:
{code}
residuals(object, type = c("deviance", "pearson", "working",
   "response", "partial"), ...)
{code}
I think we should add residualsByType in LinearRegressionSummary for this 
issue, and expose the API for SparkR in another JIRA.

  was:
LinearRegressionSummary should support getting residuals by type, like R's glm.
{code}
residuals(object, type = c("deviance", "pearson", "working",
   "response", "partial"), ...)
{code}
I think we should add residualsByType in LinearRegressionSummary for this 
issue, and expose the API for SparkR in another JIRA.


> LinearRegressionSummary should support get residuals by type
> 
>
> Key: SPARK-11651
> URL: https://issues.apache.org/jira/browse/SPARK-11651
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> LinearRegressionSummary should support getting residuals by type, like R's glm:
> {code}
> residuals(object, type = c("deviance", "pearson", "working",
>"response", "partial"), ...)
> {code}
> I think we should add residualsByType in LinearRegressionSummary for this 
> issue, and expose the API for SparkR in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11651) LinearRegressionSummary should support get residuals by type

2015-11-11 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11651:
---

 Summary: LinearRegressionSummary should support get residuals by 
type
 Key: SPARK-11651
 URL: https://issues.apache.org/jira/browse/SPARK-11651
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang


LinearRegressionSummary should support getting residuals by type, like R's glm.
{code}
residuals(object, type = c("deviance", "pearson", "working",
   "response", "partial"), ...)
{code}
I think we should add residualsByType in LinearRegressionSummary for this 
issue, and expose the API for SparkR in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11652) Remote code execution with InvokerTransformer

2015-11-11 Thread Daniel Darabos (JIRA)
Daniel Darabos created SPARK-11652:
--

 Summary: Remote code execution with InvokerTransformer
 Key: SPARK-11652
 URL: https://issues.apache.org/jira/browse/SPARK-11652
 Project: Spark
  Issue Type: Bug
Reporter: Daniel Darabos
Priority: Minor


There is a remote code execution vulnerability in the Apache Commons 
collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
that can be exploited simply by causing malicious data to be deserialized using 
Java serialization.

As Spark is used in security-conscious environments I think it's worth taking a 
closer look at how the vulnerability affects Spark. What are the points where 
Spark deserializes external data? Which are affected by using Kryo instead of 
Java serialization? What mitigation strategies are available?

If the issue is serious enough but mitigation is possible, it may be useful to 
post about it on the mailing list or blog.

Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6789) Model export/import for spark.ml: ALS

2015-11-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000179#comment-15000179
 ] 

Yanbo Liang commented on SPARK-6789:


I will work on it. 

> Model export/import for spark.ml: ALS
> -
>
> Key: SPARK-6789
> URL: https://issues.apache.org/jira/browse/SPARK-6789
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11594) Cannot create UDAF in REPL

2015-11-11 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-11594.
---
Resolution: Not A Problem

> Cannot create UDAF in REPL
> --
>
> Key: SPARK-11594
> URL: https://issues.apache.org/jira/browse/SPARK-11594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
> Environment: Latest Spark Master
> JVM 1.8.0_66-b17
>Reporter: Herman van Hovell
>Priority: Minor
>
> If you try to define a UDAF in the REPL, an internal error is thrown by 
> Java. The following code, for example:
> {noformat}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{DataType, LongType, StructType}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> class LongProductSum extends UserDefinedAggregateFunction {
>   def inputSchema: StructType = new StructType()
> .add("a", LongType)
> .add("b", LongType)
>   def bufferSchema: StructType = new StructType()
> .add("product", LongType)
>   def dataType: DataType = LongType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!(input.isNullAt(0) || input.isNullAt(1))) {
>   buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1)
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
>   }
>   def evaluate(buffer: Row): Any =
> buffer.getLong(0)
> }
> sqlContext.udf.register("longProductSum", new LongProductSum)
> val data2 = Seq[(Integer, Integer, Integer)](
>   (1, 10, -10),
>   (null, -60, 60),
>   (1, 30, -30),
>   (1, 30, 30),
>   (2, 1, 1),
>   (3, null, null)).toDF("key", "value1", "value2")
> data2.registerTempTable("agg2")
> val q = sqlContext.sql("""
> |SELECT
> |  key,
> |  count(distinct value1, value2),
> |  longProductSum(distinct value1, value2)
> |FROM agg2
> |GROUP BY key
> """.stripMargin)
> q.show
> {noformat}
> Will throw the following error:
> {noformat}
> java.lang.InternalError: Malformed class name
>   at java.lang.Class.getSimpleName(Class.java:1330)
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2092)
>   at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1419)
>   at 

[jira] [Commented] (SPARK-11089) Add an option for thrift-server to share a single session across all connections

2015-11-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000195#comment-15000195
 ] 

Cheng Lian commented on SPARK-11089:


OK, I'm taking this.

> Add an option for thrift-server to share a single session across all 
> connections
> ---
>
> Key: SPARK-11089
> URL: https://issues.apache.org/jira/browse/SPARK-11089
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Cheng Lian
>
> In 1.6, we improved session support in the JDBC server by separating temporary
> tables and UDFs. In some cases, users may still want to share temporary
> tables or UDFs across different applications.
> We should have an option or config to support that (use the original
> SQLContext instead of calling newSession when it's set to true).
> cc [~marmbrus]
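
A rough sketch of how such a flag could be honored when a connection is opened (the config key and the surrounding code are illustrative assumptions, not the actual patch):
{code}
import org.apache.spark.sql.SQLContext

// Hypothetical config key, for illustration only.
val singleSessionKey = "spark.sql.hive.thriftServer.singleSession"

// Either hand out the shared context (so all connections see the same
// temporary tables and UDFs) or fork an isolated session per connection.
def contextForConnection(shared: SQLContext): SQLContext = {
  if (shared.getConf(singleSessionKey, "false").toBoolean) shared
  else shared.newSession()
}
{code}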



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark hive jdbc client cannot get table from metadata store

2015-11-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000215#comment-15000215
 ] 

Cheng Lian commented on SPARK-9686:
---

[~navis] [~bugg_tb] [~pin_zhang] May I ask whether you were all using an embedded or 
local metastore? Namely, is {{hive.metastore.uris}} configured to be empty?

> Spark hive jdbc client cannot get table from metadata store
> ---
>
> Key: SPARK-9686
> URL: https://issues.apache.org/jira/browse/SPARK-9686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0, 1.5.1
>Reporter: pin_zhang
>Assignee: Cheng Lian
> Attachments: SPARK-9686.1.patch.txt
>
>
> 1. Start start-thriftserver.sh
> 2. Connect with beeline
> 3. Create a table
> 4. Show tables; the newly created table is returned
> 5. Run:
>   Class.forName("org.apache.hive.jdbc.HiveDriver");
>   String URL = "jdbc:hive2://localhost:1/default";
>   Properties info = new Properties();
>   Connection conn = DriverManager.getConnection(URL, info);
>   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
>       null, null, null);
> Problem:
>   No tables are returned by this API, which worked in Spark 1.3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000112#comment-15000112
 ] 

Hyukjin Kwon commented on SPARK-10978:
--

I am sorry to add more comments here, but are you sure that it still pushes
down filters?

I have tested this many times (including end-to-end tests, function unit tests,
and debugging with an IDE to inspect the filter objects), but it looks like it
does not, although maybe I am missing something.

I separately ran the test code in Spark, and none of the related tests throw
exceptions even if the filters are not pushed down or pushdown is disabled.

Would you like to try this code under {{ParquetFilterSuite}}?

{code:scala}
  test("actual filter push down") {
import testImplicits._
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempPath { dir =>
val path = s"${dir.getCanonicalPath}/part=1"
(1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
val df = sqlContext.read.parquet(path).filter("a = 2")

// This is the source RDD without Spark-side filtering.
val childRDD = 
df.queryExecution.executedPlan.asInstanceOf[Filter].child.execute()

// The result should be single row.
assert(childRDD.count == 1)
  }
}
  }
{code}

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an
> underlying datasource. This is done solely as an optimization, since the
> predicate will be reapplied on the Spark side as well. This allows for
> bloom-filter-like operations but ends up doing a redundant scan for those
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries
> which reference non-existent columns in order to provide ancillary
> functionality. In our case we allow a Solr query to be passed in via a
> non-existent solr_query column. Since this column is not returned, nothing
> passes when Spark applies its own filter on "solr_query".
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
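
A rough sketch of the idea in the quote above (standalone, not tied to any released API; the trait and class names are illustrative only):
{code}
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Sketch of the proposed hook: a source reports which of the pushed-down
// filters it could NOT evaluate accurately, and Spark only re-applies those.
trait HandlesFiltersAccurately {
  // Default: hand every filter back, so existing behavior is unchanged.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
}

// A Solr-like source that evaluates equality filters itself (including ones
// on a virtual "solr_query" column), so Spark can skip re-checking them.
class SolrLikeSource extends HandlesFiltersAccurately {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}
{code}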



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000112#comment-15000112
 ] 

Hyukjin Kwon edited comment on SPARK-10978 at 11/11/15 8:29 AM:


I am sorry to add more comments here, but are you sure that it still pushes
down filters?

I have tested this many times (including end-to-end tests, function unit tests,
and debugging with an IDE to inspect the filter objects), but it looks like it
does not, although maybe I am missing something.

I separately ran the test code in Spark, and none of the related tests throw
exceptions even if the filters are not pushed down or pushdown is disabled.

Would you like to try this code under {{ParquetFilterSuite}}?

{code}
  test("actual filter push down") {
import testImplicits._
withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempPath { dir =>
val path = s"${dir.getCanonicalPath}/part=1"
(1 to 3).map(i => (i, i.toString)).toDF("a", "b").write.parquet(path)
val df = sqlContext.read.parquet(path).filter("a = 2")

// This is the source RDD without Spark-side filtering.
val childRDD = 
df.queryExecution.executedPlan.asInstanceOf[Filter].child.execute()

// The result should be single row.
assert(childRDD.count == 1)
  }
}
  }
{code}



> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an
> underlying datasource. This is done solely as an optimization, since the
> predicate will be reapplied on the Spark side as well. This allows for
> bloom-filter-like operations but ends up doing a redundant scan for those
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries
> which reference non-existent columns in order to provide ancillary
> functionality. In our case we allow a Solr query to be passed in via a
> non-existent solr_query column. Since this column is not returned, nothing
> passes when Spark applies its own filter on "solr_query".
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11651) LinearRegressionSummary should support get residuals by type

2015-11-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11651:

Description: 
LinearRegressionSummary should support getting residuals by type, like R's glm:
{code}
residuals(object, type = c("deviance", "pearson", "working",
   "response", "partial"), ...)
{code}
We only add residualsByType in ml.LinearRegressionSummary for this issue, and 
expose the API for SparkR in another JIRA.

  was:
LinearRegressionSummary should support getting residuals by type, like R's glm:
{code}
residuals(object, type = c("deviance", "pearson", "working",
   "response", "partial"), ...)
{code}
I think we should add residualsByType in LinearRegressionSummary for this 
issue, and expose the API for SparkR in another JIRA.


> LinearRegressionSummary should support get residuals by type
> 
>
> Key: SPARK-11651
> URL: https://issues.apache.org/jira/browse/SPARK-11651
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> LinearRegressionSummary should support getting residuals by type, like R's glm:
> {code}
> residuals(object, type = c("deviance", "pearson", "working",
>"response", "partial"), ...)
> {code}
> We only add residualsByType in ml.LinearRegressionSummary for this issue, and 
> expose the API for SparkR in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11651) LinearRegressionSummary should support get residuals by type

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000163#comment-15000163
 ] 

Apache Spark commented on SPARK-11651:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9626

> LinearRegressionSummary should support get residuals by type
> 
>
> Key: SPARK-11651
> URL: https://issues.apache.org/jira/browse/SPARK-11651
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> LinearRegressionSummary should support getting residuals by type, like R's glm:
> {code}
> residuals(object, type = c("deviance", "pearson", "working",
>"response", "partial"), ...)
> {code}
> We only add residualsByType in ml.LinearRegressionSummary for this issue, and 
> expose the API for SparkR in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11651) LinearRegressionSummary should support get residuals by type

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11651:


Assignee: Apache Spark

> LinearRegressionSummary should support get residuals by type
> 
>
> Key: SPARK-11651
> URL: https://issues.apache.org/jira/browse/SPARK-11651
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> LinearRegressionSummary should support getting residuals by type, like R's glm:
> {code}
> residuals(object, type = c("deviance", "pearson", "working",
>"response", "partial"), ...)
> {code}
> We only add residualsByType in ml.LinearRegressionSummary for this issue, and 
> expose the API for SparkR in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11651) LinearRegressionSummary should support get residuals by type

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11651:


Assignee: (was: Apache Spark)

> LinearRegressionSummary should support get residuals by type
> 
>
> Key: SPARK-11651
> URL: https://issues.apache.org/jira/browse/SPARK-11651
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>
> LinearRegressionSummary should support getting residuals by type, like R's glm:
> {code}
> residuals(object, type = c("deviance", "pearson", "working",
>"response", "partial"), ...)
> {code}
> We only add residualsByType in ml.LinearRegressionSummary for this issue, and 
> expose the API for SparkR in another JIRA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9866) VersionsSuite is unnecessarily slow in Jenkins

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9866:
---

Assignee: (was: Apache Spark)

> VersionsSuite is unnecessarily slow in Jenkins
> --
>
> Key: SPARK-9866
> URL: https://issues.apache.org/jira/browse/SPARK-9866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Josh Rosen
>
> The VersionsSuite Hive test is unreasonably slow in Jenkins; downloading the 
> Hive JARs and their transitive dependencies from Maven adds at least 8 
> minutes to the total build time.
> In order to cut down on build time, I think that we should make the cache 
> directory configurable via an environment variable and should configure the 
> Jenkins scripts to set this variable to point to a location outside of the 
> Jenkins workspace which is re-used across builds.
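
A minimal sketch of the kind of change described above (the environment variable name and the fallback location are illustrative assumptions, not the actual patch):
{code}
import java.io.File

// Resolve the download cache for VersionsSuite from an environment variable so
// Jenkins can point it outside the workspace and re-use it across builds;
// otherwise fall back to a per-build temporary directory.
// SPARK_VERSIONS_SUITE_IVY_PATH is a hypothetical name used for illustration.
val hiveVersionsCacheDir: File =
  sys.env.get("SPARK_VERSIONS_SUITE_IVY_PATH")
    .map(new File(_))
    .getOrElse(new File(sys.props("java.io.tmpdir"), "hive-versions-ivy-cache"))
{code}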



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9866) VersionsSuite is unnecessarily slow in Jenkins

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9866:
---

Assignee: Apache Spark

> VersionsSuite is unnecessarily slow in Jenkins
> --
>
> Key: SPARK-9866
> URL: https://issues.apache.org/jira/browse/SPARK-9866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The VersionsSuite Hive test is unreasonably slow in Jenkins; downloading the 
> Hive JARs and their transitive dependencies from Maven adds at least 8 
> minutes to the total build time.
> In order to cut down on build time, I think that we should make the cache 
> directory configurable via an environment variable and should configure the 
> Jenkins scripts to set this variable to point to a location outside of the 
> Jenkins workspace which is re-used across builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9866) VersionsSuite is unnecessarily slow in Jenkins

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000119#comment-15000119
 ] 

Apache Spark commented on SPARK-9866:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9624

> VersionsSuite is unnecessarily slow in Jenkins
> --
>
> Key: SPARK-9866
> URL: https://issues.apache.org/jira/browse/SPARK-9866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Josh Rosen
>
> The VersionsSuite Hive test is unreasonably slow in Jenkins; downloading the 
> Hive JARs and their transitive dependencies from Maven adds at least 8 
> minutes to the total build time.
> In order to cut down on build time, I think that we should make the cache 
> directory configurable via an environment variable and should configure the 
> Jenkins scripts to set this variable to point to a location outside of the 
> Jenkins workspace which is re-used across builds.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11500) Not deterministic order of columns when using merging schemas.

2015-11-11 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11500:
---
Fix Version/s: 1.6.0

> Not deterministic order of columns when using merging schemas.
> --
>
> Key: SPARK-11500
> URL: https://issues.apache.org/jira/browse/SPARK-11500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
> Fix For: 1.6.0, 1.7.0
>
>
> When executing
> {{sqlContext.read.option("mergeSchema", "true").parquet(pathOne,
> pathTwo).printSchema()}}
> the order of columns is not deterministic; they may show up in a different
> order from run to run.
> This is because of {{FileStatusCache}} in {{HadoopFsRelation}} (which
> {{ParquetRelation}} extends, as you know). When
> {{FileStatusCache.listLeafFiles()}} is called, it returns a
> {{Set[FileStatus]}}, which loses the ordering of the original
> {{Array[FileStatus]}}.
> So, after retrieving the list of leaf files including {{_metadata}} and
> {{_common_metadata}}, {{ParquetRelation.mergeSchemasInParallel()}} merges
> (separately and if necessary) the {{Set}}s of {{_metadata}},
> {{_common_metadata}} and part-files, and this ends up producing a different
> column order: the merged schema leads with the columns of whichever file
> happened to come first, even when the other files do not have them.
> I think this can be resolved by using {{LinkedHashSet}}.
> Put simply: if file A has fields 1,2,3 and file B has fields 3,4,5, we
> cannot ensure which columns show up first, since the order is not
> deterministic.
> 1. Read the file list (A and B).
> 2. The order (A and B, or B and A) is not deterministic, as described above.
> 3. The retrieved schemas of (A and B, or B and A) are merged with
> {{reduceOption}} (which perhaps should be {{reduceLeftOption}} or
> {{reduceRightOption}}).
> 4. The output columns would be 1,2,3,4,5 for A and B, or 3,4,5,1,2 for B and A.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11583) Make MapStatus use less memory usage

2015-11-11 Thread Kent Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000187#comment-15000187
 ] 

Kent Yao edited comment on SPARK-11583 at 11/11/15 10:20 AM:
-

@lemire [~imranr]

1. Test cases

1.1 Sparse case: for each task, 10 blocks contain data, the others don't.

  sc.makeRDD(1 to 40950, 4095).groupBy(x=>x).top(5)

1.2 Dense case: for each task, most blocks contain data, few don't.

  1.2.1 Full:
  sc.makeRDD(1 to 16769025, 4095).groupBy(x=>x).top(5)

  1.2.2 Very dense (about 95 empty blocks):
  sc.makeRDD(1 to 1638, 4095).groupBy(x=>x).top(5)

1.3 Test tool:

  jmap -dump:format=b,file=heap.bin

1.4 Test branches: branch-1.5, master

2. Memory usage

2.1 RoaringBitmap--sparse

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 34,135,920
-----------------------------------------------------------------------------------------------
My explanation: 4095 * short[4095-10] = 4095 * 16 * 4085 / 8 ≈ 34,135,920 bytes

2.2.1 RoaringBitmap--full

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 360,360
-----------------------------------------------------------------------------------------------
My explanation: RoaringBitmap(0)

2.2.2 RoaringBitmap--very dense

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 1,441,440
-----------------------------------------------------------------------------------------------
My explanation: 4095 * short[95] = 4095 * 16 * 95 / 8 = 778,050 bytes (+ overhead = 1,441,440)

2.3 BitSet--sparse

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 2,391,480
-----------------------------------------------------------------------------------------------
My explanation: 4095 * 4095 = 16,769,025 bits (+ overhead = 2,391,480 bytes)

2.4 BitSet--full

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 2,391,480
-----------------------------------------------------------------------------------------------
My explanation: same as above

3. Conclusion

Memory usage:

RoaringBitmap--full < RoaringBitmap--very dense < BitSet--full = BitSet--sparse < RoaringBitmap--sparse
  








[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-11 Thread Kent Yao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000187#comment-15000187
 ] 

Kent Yao commented on SPARK-11583:
--

@lemire [~imranr]

1. Test cases

1.1 Sparse case: for each task, 10 blocks contain data, the others don't.

  sc.makeRDD(1 to 40950, 4095).groupBy(x=>x).top(5)

1.2 Dense case: for each task, most blocks contain data, few don't.

  1.2.1 Full:
  sc.makeRDD(1 to 16769025, 4095).groupBy(x=>x).top(5)

  1.2.2 Very dense (about 95 empty blocks):
  sc.makeRDD(1 to 1638, 4095).groupBy(x=>x).top(5)

1.3 Test tool:

  jmap -dump:format=b,file=heap.bin

1.4 Test branches: branch-1.5, master

2. Memory usage

2.1 RoaringBitmap--sparse

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 34,135,920
-----------------------------------------------------------------------------------------------
My explanation: 4095 * short[4095-10] = 4095 * 16 * 4085 / 8 ≈ 34,135,920 bytes

2.2.1 RoaringBitmap--full

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 360,360
-----------------------------------------------------------------------------------------------
My explanation: RoaringBitmap(0)

2.2.2 RoaringBitmap--very dense

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 1,441,440
-----------------------------------------------------------------------------------------------
My explanation: 4095 * short[95] = 4095 * 16 * 95 / 8 = 778,050 bytes (+ overhead = 1,441,440)

2.3 BitSet--sparse

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 2,391,480
-----------------------------------------------------------------------------------------------
My explanation: 4095 * 4095 = 16,769,025 bits (+ overhead = 2,391,480 bytes)

2.4 BitSet--full

Class Name                                            | Objects | Shallow Heap | Retained Heap
-----------------------------------------------------------------------------------------------
org.apache.spark.scheduler.HighlyCompressedMapStatus  |   4,095 |      131,040 | >= 2,391,480
-----------------------------------------------------------------------------------------------
My explanation: same as above

3. Conclusion

Memory usage:

RoaringBitmap--full < RoaringBitmap--very dense < BitSet--full = BitSet--sparse < RoaringBitmap--sparse
  






> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap.
> For a Spark job that contains quite a lot of tasks, 20% seems a drop in the
> ocean.
> Essentially, BitSet uses long[]. For example, a BitSet[200k] = long[3125].
> So we can use a HashSet[Int] to store reduceIds (when non-empty blocks are
> dense, use the reduceIds of empty blocks; when sparse, use the non-empty ones).
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size <
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size <
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> Sparse case, 299/300 are empty:
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> Dense case, no block is empty:
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)
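
A rough sketch of the selection rule described above (the per-entry cost for HashSet[Int] and the returned labels are illustrative assumptions, not the actual proposal):
{code}
// Pick whichever representation is expected to be smaller: a HashSet[Int] of
// the rarer kind of block id, or a BitSet with one bit per block.
def pickRepresentation(totalBlocks: Int, numEmptyBlocks: Int): String = {
  val numNonEmptyBlocks = totalBlocks - numEmptyBlocks
  val bitSetBytes = totalBlocks / 8                        // one bit per block
  val hashSetBytes =
    math.min(numEmptyBlocks, numNonEmptyBlocks) * 40       // assumed ~40 bytes per Int entry
  if (hashSetBytes < bitSetBytes) {
    if (numEmptyBlocks <= numNonEmptyBlocks) "HashSet of empty block ids"
    else "HashSet of non-empty block ids"
  } else {
    "BitSet over all blocks"
  }
}
{code}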



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5968) Parquet warning in spark-shell

2015-11-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000200#comment-15000200
 ] 

Cheng Lian commented on SPARK-5968:
---

It had once been fixed via a quite hacky trick. Unfortunately it came back
again after upgrading to parquet-mr 1.7.0, and there doesn't seem to be a
reliable way to override the log settings because of PARQUET-369, which
prevents users and other libraries from redirecting the Parquet JUL logger via
SLF4J. It's fixed in the most recent parquet-format master, but I'm afraid we
have to wait for another parquet-format and parquet-mr release to fix this
issue completely.

> Parquet warning in spark-shell
> --
>
> Key: SPARK-5968
> URL: https://issues.apache.org/jira/browse/SPARK-5968
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.3.0
>
>
> This may happen in the case of schema evolution, namely appending new Parquet
> data with a different but compatible schema to existing Parquet files:
> {code}
> 15/02/23 23:29:24 WARN ParquetOutputCommitter: could not write summary file 
> for rankings
> parquet.io.ParquetEncodingException: 
> file:/Users/matei/workspace/apache-spark/rankings/part-r-1.parquet 
> invalid: all the files must be contained in the root rankings
> at parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:422)
> at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:398)
> at 
> parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:51)
> {code}
> The reason is that the Spark SQL schemas stored in the Parquet key-value
> metadata differ. Parquet doesn't know how to "merge" these opaque
> user-defined metadata entries, and just throws an exception and gives up
> writing summary files. Since the Parquet data source in Spark 1.3.0 supports
> schema merging, this is harmless, but it is kind of scary for the user. We
> should try to suppress it through the logger.
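
A minimal sketch of the logger-based suppression suggested above (the logger name is taken from the class in the stack trace; as the comment earlier in this thread notes, this is not fully reliable with parquet-mr 1.7.0 because of PARQUET-369):
{code}
import java.util.logging.{Level, Logger}

// Raise the JUL level for Parquet's output committer so the harmless
// "could not write summary file" warning is hidden from users. Whether this
// setting sticks depends on how parquet-mr configures JUL (see PARQUET-369).
Logger.getLogger("parquet.hadoop.ParquetOutputCommitter").setLevel(Level.SEVERE)
{code}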



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11650) "AkkaUtilsSuite.remote fetch ssl on - untrusted server" test is very slow

2015-11-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11650:
--

 Summary: "AkkaUtilsSuite.remote fetch ssl on - untrusted server" 
test is very slow
 Key: SPARK-11650
 URL: https://issues.apache.org/jira/browse/SPARK-11650
 Project: Spark
  Issue Type: Sub-task
  Components: Tests
Reporter: Josh Rosen


"AkkaUtilsSuite.remote fetch ssl on - untrusted server" test is very slow; it 
always takes 2 minutes to run, probably because a timeout is set to 2 minutes:

https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/3976/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=spark-test/testReport/org.apache.spark.util/AkkaUtilsSuite/remote_fetch_ssl_on___untrusted_server/history/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11594) Cannot create UDAF in REPL

2015-11-11 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000151#comment-15000151
 ] 

Herman van Hovell commented on SPARK-11594:
---

Moving to Scala 2.10.5 fixed this.

> Cannot create UDAF in REPL
> --
>
> Key: SPARK-11594
> URL: https://issues.apache.org/jira/browse/SPARK-11594
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
> Environment: Latest Spark Master
> JVM 1.8.0_66-b17
>Reporter: Herman van Hovell
>Priority: Minor
>
> If you try to define a UDAF in the REPL, an internal error is thrown by 
> Java. The following code, for example:
> {noformat}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{DataType, LongType, StructType}
> import org.apache.spark.sql.expressions.{MutableAggregationBuffer, 
> UserDefinedAggregateFunction}
> class LongProductSum extends UserDefinedAggregateFunction {
>   def inputSchema: StructType = new StructType()
> .add("a", LongType)
> .add("b", LongType)
>   def bufferSchema: StructType = new StructType()
> .add("product", LongType)
>   def dataType: DataType = LongType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
>   }
>   def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
> if (!(input.isNullAt(0) || input.isNullAt(1))) {
>   buffer(0) = buffer.getLong(0) + input.getLong(0) * input.getLong(1)
> }
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
>   }
>   def evaluate(buffer: Row): Any =
> buffer.getLong(0)
> }
> sqlContext.udf.register("longProductSum", new LongProductSum)
> val data2 = Seq[(Integer, Integer, Integer)](
>   (1, 10, -10),
>   (null, -60, 60),
>   (1, 30, -30),
>   (1, 30, 30),
>   (2, 1, 1),
>   (3, null, null)).toDF("key", "value1", "value2")
> data2.registerTempTable("agg2")
> val q = sqlContext.sql("""
> |SELECT
> |  key,
> |  count(distinct value1, value2),
> |  longProductSum(distinct value1, value2)
> |FROM agg2
> |GROUP BY key
> """.stripMargin)
> q.show
> {noformat}
> Will throw the following error:
> {noformat}
> java.lang.InternalError: Malformed class name
>   at java.lang.Class.getSimpleName(Class.java:1330)
>   at 
> org.apache.spark.sql.execution.aggregate.ScalaUDAF.toString(udaf.scala:455)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:211)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$$anonfun$9.apply(SparkStrategies.scala:209)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$Aggregation$.apply(SparkStrategies.scala:209)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
>   at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:445)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:51)
>   at 
> org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:49)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:56)
>   at 
> org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:56)
>   at org.apache.spark.sql.DataFrame.withCallback(DataFrame.scala:2092)
>   at 

[jira] [Commented] (SPARK-11633) HiveContext throws TreeNode Exception : Failed to Copy Node

2015-11-11 Thread Saurabh Santhosh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000328#comment-15000328
 ] 

Saurabh Santhosh commented on SPARK-11633:
--

{code:title=Stacktrace|borderStyle=solid}

org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
LogicalRDD [F1#2,F2#3], MapPartitionsRDD[2] at createDataFrame at 
SparkClientTest.java:79

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:346)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:96)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:64)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$3.applyOrElse(Analyzer.scala:333)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$3.applyOrElse(Analyzer.scala:332)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:329)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:283)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenUp(TreeNode.scala:329)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:283)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)

[jira] [Assigned] (SPARK-11654) add reduce to GroupedDataset

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11654:


Assignee: (was: Apache Spark)

> add reduce to GroupedDataset
> 
>
> Key: SPARK-11654
> URL: https://issues.apache.org/jira/browse/SPARK-11654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11654) add reduce to GroupedDataset

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000382#comment-15000382
 ] 

Apache Spark commented on SPARK-11654:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9628

> add reduce to GroupedDataset
> 
>
> Key: SPARK-11654
> URL: https://issues.apache.org/jira/browse/SPARK-11654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11654) add reduce to GroupedDataset

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11654:


Assignee: Apache Spark

> add reduce to GroupedDataset
> 
>
> Key: SPARK-11654
> URL: https://issues.apache.org/jira/browse/SPARK-11654
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11655:
---
Description: 
We've been combatting an orphaned process issue on AMPLab Jenkins since October 
and I finally was able to dig in and figure out what's going on.

After some sleuthing and working around OS limits and JDK bugs, I was able to 
get the full launch commands for the hanging orphaned processes. It looks like 
they're all running spark-submit:

{code}
org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
 -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
{code}

Based on the output of some Ganglia graphs, I was able to figure out that these 
leaks started around October 9:

 !screenshot-1.png|thumbnail! 

This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
merged, which added LauncherBackendSuite. The launch arguments used in this 
suite seem to line up with the arguments that I observe in the hanging 
processes' {{jps}} output: 
https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46

Interestingly, Jenkins doesn't show test timing or output for this suite! I 
think that what might be happening is that we have a mixed Scala/Java package, 
so maybe the two test runner XML files aren't being merged properly: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/

Whenever I try running this suite locally, it looks like it ends up creating a 
zombie SparkSubmit process! I think that what's happening is that the 
launcher's {{handle.kill()}} call ends up destroying the bash {{spark-submit}} 
subprocess such that its child process (a JVM) leaks.

I think that we'll have to do something similar to what we do in PySpark when 
launching a child JVM from a Python / Bash process: connect it to a socket or 
stream such that it can detect its parent's death and clean up after itself 
appropriately.

/cc [~shaneknapp] and [~vanzin].
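
As a rough illustration of that approach (this is not Spark's actual launcher code; the object name and exit code are made up), the child JVM could watch a stream tied to its parent and shut itself down when that stream closes:

{code}
// Hypothetical sketch: a daemon thread in the child JVM blocks on System.in,
// which is wired to the parent process; read() returning -1 (or an IOException)
// means the parent is gone, so the child exits instead of becoming an orphan.
object ParentWatchdog {
  def start(): Unit = {
    val watcher = new Thread(new Runnable {
      override def run(): Unit = {
        try {
          while (System.in.read() != -1) { /* discard keep-alive bytes */ }
        } catch {
          case _: java.io.IOException => // treat stream errors as parent death
        }
        System.exit(1)
      }
    })
    watcher.setDaemon(true)
    watcher.start()
  }
}
{code}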

  was:
We've been combatting an orphaned process issue on AMPLab Jenkins since October 
and I finally was able to dig in and figure out what's going on.

After some sleuthing and working around OS limits and JDK bugs, I was able to 
get the full launch commands for the hanging orphaned processes. It looks like 
they're all running spark-submit:

{code}
org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
 -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
{code}

Based on the output of some Ganglia graphs, I was able to figure out that these 
leaks started around October 9.

This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
merged, which added LauncherBackendSuite. The launch arguments used in this 
suite seem to line up with the arguments that I observe in the hanging 
processes' {{jps}} output: 

[jira] [Issue Comment Deleted] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10978:
-
Comment: was deleted

(was: Thanks the for the test! I think there is a bug.)

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned when Spark 
> does a filter on "solr_query", nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}
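
For illustration only, a relation implementing the proposed hook might look roughly like the sketch below; the relation name and the assumption that only equality filters are handled exactly are invented for the example:

{code}
// Hypothetical sketch of a source that reports which filters it evaluates
// exactly, so Spark can skip re-applying them. Not actual Spark code.
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}

abstract class ExactPushdownRelation extends BaseRelation with PrunedFilteredScan {
  // Assume (for this sketch) that only equality predicates are pushed down
  // with exact semantics by the underlying store.
  private def handledExactly(f: Filter): Boolean = f.isInstanceOf[EqualTo]

  // The proposed method: return only the filters Spark still has to re-check.
  def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(handledExactly)
}
{code}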



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11659) Codegen sporadically fails with same input character

2015-11-11 Thread Catalin Alexandru Zamfir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Catalin Alexandru Zamfir updated SPARK-11659:
-
Description: 
We pretty much have a default installation of Spark 1.5.1. Some of our jobs 
sporadically fail with the below exception for the same "input character" (we 
don't have @ in our inputs as we check the types that we filter from the data, 
but jobs still fail) and when we re-run the same job with the same input, the 
same job passes without any failures. I believe it's a bug in code-gen but I 
can't debug this on a production cluster. One thing to note is that this has a 
higher chance of occurring when multiple jobs are run in parallel to one 
another (eg. 4 jobs at a time started on the same second using a scheduler and 
sharing the same context). However, I have no reproduce rule. For example, from 
32 jobs scheduled in batches of 4 jobs per batch, 1 of the jobs in one of the 
batches may fail with the below error and with a different job, randomly. I 
don't know an idea on how to approach this situation to produce better 
information so maybe you can advise us.

{noformat}
Job aborted due to stage failure: Task 50 in stage 4.0 failed 4 times, most 
recent failure: Lost task 50.3 in stage 4.0 (TID 894, 10.136.64.112): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: 
Invalid character input "@" (character code 64)

public SpecificOrdering 
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificOrdering(expr);
}

class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
  
  private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
  
  
  
  public 
SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
expressions = expr;

  }
  
  @Override
  public int compare(InternalRow a, InternalRow b) {
InternalRow i = null;  // Holds current row being evaluated.

i = a;
boolean isNullA2;
long primitiveA3;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullA2 = isNull0;
  primitiveA3 = primitive1;
}
i = b;
boolean isNullB4;
long primitiveB5;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullB4 = isNull0;
  primitiveB5 = primitive1;
}
if (isNullA2 && isNullB4) {
  // Nothing
} else if (isNullA2) {
  return -1;
} else if (isNullB4) {
  return 1;
} else {
  int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA8;
long primitiveA9;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullA8 = isNull6;
  primitiveA9 = primitive7;
}
i = b;
boolean isNullB10;
long primitiveB11;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullB10 = isNull6;
  primitiveB11 = primitive7;
}
if (isNullA8 && isNullB10) {
  // Nothing
} else if (isNullA8) {
  return -1;
} else if (isNullB10) {
  return 1;
} else {
  int comp = (primitiveA9 > primitiveB11 ? 1 : primitiveA9 < primitiveB11 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA14;
long primitiveA15;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullA14 = isNull12;
  primitiveA15 = primitive13;
}
i = b;
boolean isNullB16;
long primitiveB17;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullB16 = isNull12;
  primitiveB17 = primitive13;
}
if (isNullA14 && isNullB16) {
  // Nothing
} else if (isNullA14) {
  return -1;
} else if (isNullB16) {
  return 1;
} else {
  int comp = (primitiveA15 > primitiveB17 ? 1 : primitiveA15 < primitiveB17 
? -1 : 0);
  if (comp != 0) {
return comp;
  }
}

return 0;
  }
}

at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 

[jira] [Comment Edited] (SPARK-11553) row.getInt(i) if row[i]=null returns 0

2015-11-11 Thread Bartlomiej Alberski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000616#comment-15000616
 ] 

Bartlomiej Alberski edited comment on SPARK-11553 at 11/11/15 5:10 PM:
---

Ok, I think I know what the problem is. It can be reproduced with Scala 
2.11.6 and the DataFrame API.

If you are using the DataFrame API from Scala and you try to get an 
Int|Long|Boolean etc. - a value that extends AnyVal - you will receive the 
"zero value" specific to the given type (0 for Long and Int, false for Boolean 
etc.), while the API suggests that an NPE will be raised.

Example modified in order to illustrate the problem (from 
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes):
{code}
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
val res = df.map(x => x.getLong(x.fieldIndex("name"))).collect()
println(res.mkString(","))
{code}

The problem comes from the implementation of the getInt|Float|Boolean|... methods: 
{code}
getInt(i: Int): Int = getAs[Int](i)
getAs[T](i: Int): T = get(i).asInstanceOf[T]
{code}

null.asInstanceOf[Long] returns 0 (because Long cannot be null - it extends 
AnyVal)

Example invocations from the Scala REPL:
{code}
scala> null.asInstanceOf[Int]
res0: Int = 0

scala> null.asInstanceOf[Long]
res1: Long = 0

scala> null.asInstanceOf[Short]
res2: Short = 0

scala> null.asInstanceOf[Boolean]
res3: Boolean = false

scala> null.asInstanceOf[Double]
res4: Double = 0.0

scala> null.asInstanceOf[Float]
res5: Float = 0.0
{code}

I will be more than happy to prepare PR solving this issue.
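
Until the accessors behave as documented, a workaround is to check for null explicitly before reading a primitive; a minimal sketch (using the "age" field of the people.json example, which does contain a null) could look like this:

{code}
// Hypothetical workaround sketch, not part of the proposed fix: go through
// Option so a null cell becomes None instead of a silent 0.
import org.apache.spark.sql.Row

def getLongOption(row: Row, field: String): Option[Long] = {
  val i = row.fieldIndex(field)
  if (row.isNullAt(i)) None else Some(row.getLong(i))
}

// e.g. df.map(r => getLongOption(r, "age")).collect()
{code}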


was (Author: alberskib):
Ok, I think I know what the problem is. It can be reproduced with Scala 
2.11.6 and the DataFrame API.

If you are using the DataFrame API from Scala and you try to get an 
Int|Long|Boolean etc. - a value that extends AnyVal - you will receive the 
"zero value" specific to the given type (0 for Long and Int, false for Boolean 
etc.), while the API suggests that an NPE will be raised.

Example modified in order to illustrate the problem (from 
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes):
{code}
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
val res = df.map(x => x.getLong(x.fieldIndex("name"))).collect()
println(res.mkString(","))
{code}

The problem comes from the implementation of the getInt|Float|Boolean|... methods: 
{code}
getInt(i: Int): Int = getAs[Int](i)
getAs[T](i: Int): T = get(i).asInstanceOf[T]
{code}

null.asInstanceOf[Long] returns 0 (because Long cannot be null because it 
extends AnyVal)

Example invocations from the Scala REPL:
{code}
scala> null.asInstanceOf[Int]
res0: Int = 0

scala> null.asInstanceOf[Long]
res1: Long = 0

scala> null.asInstanceOf[Short]
res2: Short = 0

scala> null.asInstanceOf[Boolean]
res3: Boolean = false

scala> null.asInstanceOf[Double]
res4: Double = 0.0

scala> null.asInstanceOf[Float]
res5: Float = 0.0
{code}

I will be more than happy to prepare PR solving this issue.

> row.getInt(i) if row[i]=null returns 0
> --
>
> Key: SPARK-11553
> URL: https://issues.apache.org/jira/browse/SPARK-11553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tofigh
>Priority: Minor
>
> row.getInt|Float|Double on a Spark SQL Row returns 0 if row[index] is null (even 
> though, according to the documentation, they should throw a NullPointerException).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11626) ml.feature.Word2Vec.transform() should not recompute word-vector map each time

2015-11-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11626.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9592
[https://github.com/apache/spark/pull/9592]

> ml.feature.Word2Vec.transform() should not recompute word-vector map each time
> --
>
> Key: SPARK-11626
> URL: https://issues.apache.org/jira/browse/SPARK-11626
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: q79969786
>Assignee: q79969786
> Fix For: 1.6.0
>
>
> org.apache.spark.ml.feature.Word2Vec.transform() is very slow. We should not 
> read the broadcast for every sentence.
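
The shape of the fix would presumably be to materialize the word-to-vector map once, before the per-row UDF is defined; a rough sketch (the map contents and UDF are illustrative stand-ins, not the actual patch) is:

{code}
// Hypothetical sketch: build the lookup map once, outside the UDF, so each row
// only pays for cheap map lookups instead of rebuilding/deserializing the map.
import org.apache.spark.sql.functions.udf

// stand-in for the model's word -> vector map, materialized a single time
val vectorMap: Map[String, Array[Float]] = Map("spark" -> Array(0.1f, 0.2f))

val countKnownWords = udf { sentence: Seq[String] =>
  sentence.count(vectorMap.contains)   // per-row work is just lookups
}
{code}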



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-11 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000763#comment-15000763
 ] 

kevin yu commented on SPARK-11657:
--

Hello Virgil: Can you try toDF().show(), then do toDF().take(2)?

Thanks
Kevin

> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11659) Codegen sporadically fails with same input character

2015-11-11 Thread Catalin Alexandru Zamfir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Catalin Alexandru Zamfir updated SPARK-11659:
-
Description: 
We pretty much have a default installation of Spark 1.5.1. Some of our jobs 
sporadically fail with the below exception for the same "input character" (we 
don't have @ in our inputs, but jobs still fail), and when we re-run the same 
job with the same input, all jobs pass without any failures. I believe it's a 
bug in code-gen but I can't debug this on a production cluster (and it's almost 
impossible to reproduce).

{noformat}
Job aborted due to stage failure: Task 50 in stage 4.0 failed 4 times, most 
recent failure: Lost task 50.3 in stage 4.0 (TID 894, 10.136.64.112): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: 
Invalid character input "@" (character code 64)

public SpecificOrdering 
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificOrdering(expr);
}

class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
  
  private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
  
  
  
  public 
SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
expressions = expr;

  }
  
  @Override
  public int compare(InternalRow a, InternalRow b) {
InternalRow i = null;  // Holds current row being evaluated.

i = a;
boolean isNullA2;
long primitiveA3;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullA2 = isNull0;
  primitiveA3 = primitive1;
}
i = b;
boolean isNullB4;
long primitiveB5;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullB4 = isNull0;
  primitiveB5 = primitive1;
}
if (isNullA2 && isNullB4) {
  // Nothing
} else if (isNullA2) {
  return -1;
} else if (isNullB4) {
  return 1;
} else {
  int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA8;
long primitiveA9;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullA8 = isNull6;
  primitiveA9 = primitive7;
}
i = b;
boolean isNullB10;
long primitiveB11;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullB10 = isNull6;
  primitiveB11 = primitive7;
}
if (isNullA8 && isNullB10) {
  // Nothing
} else if (isNullA8) {
  return -1;
} else if (isNullB10) {
  return 1;
} else {
  int comp = (primitiveA9 > primitiveB11 ? 1 : primitiveA9 < primitiveB11 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA14;
long primitiveA15;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullA14 = isNull12;
  primitiveA15 = primitive13;
}
i = b;
boolean isNullB16;
long primitiveB17;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullB16 = isNull12;
  primitiveB17 = primitive13;
}
if (isNullA14 && isNullB16) {
  // Nothing
} else if (isNullA14) {
  return -1;
} else if (isNullB16) {
  return 1;
} else {
  int comp = (primitiveA15 > primitiveB17 ? 1 : primitiveA15 < primitiveB17 
? -1 : 0);
  if (comp != 0) {
return comp;
  }
}

return 0;
  }
}

at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at 

[jira] [Updated] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-11-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11601:
--
Description: 
Generate a list of binary incompatible changes using MiMa and create new JIRAs 
for issues found. Filter out false positives as needed.

If you want to take this task, ping [~mengxr] for advice since he did it for 
1.5.

  was:
Generate a list of binary incompatible changes using MiMa.  Filter out false 
positives as needed.

If you want to take this task, ping [~mengxr] for advice since he did it for 
1.5.


> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Tim Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11646) WholeTextFileRDD should return Text rather than String

2015-11-11 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11646.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> WholeTextFileRDD should return Text rather than String
> --
>
> Key: SPARK-11646
> URL: https://issues.apache.org/jira/browse/SPARK-11646
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>
> If it returns Text, we can reuse this in Spark SQL to provide a WholeTextFile 
> data source and directly convert the Text into UTF8String without extra 
> string decoding and encoding.
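
A hedged sketch of the conversion the description has in mind (the method name is illustrative) would wrap the Text's backing bytes directly:

{code}
// Hypothetical sketch: wrap the Text's bytes into a UTF8String without first
// materializing a java.lang.String.
import org.apache.hadoop.io.Text
import org.apache.spark.unsafe.types.UTF8String

def toUTF8String(text: Text): UTF8String =
  UTF8String.fromBytes(text.getBytes, 0, text.getLength)
{code}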



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11566) Refactoring GaussianMixtureModel.gaussians in Python

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11566:
--
Assignee: Yu Ishikawa

> Refactoring GaussianMixtureModel.gaussians in Python
> 
>
> Key: SPARK-11566
> URL: https://issues.apache.org/jira/browse/SPARK-11566
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.5.1
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>Priority: Trivial
> Fix For: 1.6.0
>
>
> We could also implement {{GaussianMixtureModelWrapper.gaussians}} in Scala 
> with {{SerDe.dumps}}, instead of returning Java {{Object}}. So, it would be a 
> little simpler and more efficient.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11656) support typed aggregate in project list

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11656:
--
Assignee: Wenchen Fan

> support typed aggregate in project list
> ---
>
> Key: SPARK-11656
> URL: https://issues.apache.org/jira/browse/SPARK-11656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000854#comment-15000854
 ] 

Yin Huai commented on SPARK-10978:
--

I opened https://issues.apache.org/jira/browse/SPARK-11661.

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned when Spark 
> does a filter on "solr_query", nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11643:
--
Target Version/s:   (was: 1.5.0, 1.5.1)

[~csa...@progress.com] this can't target 1.5.0 / 1.5.1 -- they're released. 
Generally you don't set target.

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>Reporter: Chip Sands
>
> Inserting a date with a leading zero inserts a null value, for example '0001-12-10'.
> This worked until 1.5/1.5.1
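
A minimal probe from the SQL side, assuming (and this is only an assumption) that the plain string-to-date cast exercises the same parsing path as the failing inserts:

{code}
// Hypothetical probe, not taken from the report: see what Spark makes of the
// literal date with a leading-zero year.
sqlContext.sql("select cast('0001-12-10' as date) as d").collect().foreach(println)
{code}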



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11659) Codegen sporadically fails with same input character

2015-11-11 Thread Catalin Alexandru Zamfir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Catalin Alexandru Zamfir updated SPARK-11659:
-
Description: 
We pretty much have a default installation of Spark 1.5.1. Some of our jobs 
sporadically fail with the below exception for the same "input character" (we 
don't have @ in our inputs, but jobs still fail), and when we re-run the same 
job with the same input, all jobs pass without any failures. I believe it's a 
bug in code-gen but I can't debug this on a production cluster (and it's almost 
impossible to reproduce).

{noformat}
Job aborted due to stage failure: Task 50 in stage 4.0 failed 4 times, most 
recent failure: Lost task 50.3 in stage 4.0 (TID 894, 10.136.64.112): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: 
Invalid character input "@" (character code 64)

public SpecificOrdering 
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificOrdering(expr);
}

class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
  
  private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
  
  
  
  public 
SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
expressions = expr;

  }
  
  @Override
  public int compare(InternalRow a, InternalRow b) {
InternalRow i = null;  // Holds current row being evaluated.

i = a;
boolean isNullA2;
long primitiveA3;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullA2 = isNull0;
  primitiveA3 = primitive1;
}
i = b;
boolean isNullB4;
long primitiveB5;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullB4 = isNull0;
  primitiveB5 = primitive1;
}
if (isNullA2 && isNullB4) {
  // Nothing
} else if (isNullA2) {
  return -1;
} else if (isNullB4) {
  return 1;
} else {
  int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA8;
long primitiveA9;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullA8 = isNull6;
  primitiveA9 = primitive7;
}
i = b;
boolean isNullB10;
long primitiveB11;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullB10 = isNull6;
  primitiveB11 = primitive7;
}
if (isNullA8 && isNullB10) {
  // Nothing
} else if (isNullA8) {
  return -1;
} else if (isNullB10) {
  return 1;
} else {
  int comp = (primitiveA9 > primitiveB11 ? 1 : primitiveA9 < primitiveB11 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA14;
long primitiveA15;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullA14 = isNull12;
  primitiveA15 = primitive13;
}
i = b;
boolean isNullB16;
long primitiveB17;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullB16 = isNull12;
  primitiveB17 = primitive13;
}
if (isNullA14 && isNullB16) {
  // Nothing
} else if (isNullA14) {
  return -1;
} else if (isNullB16) {
  return 1;
} else {
  int comp = (primitiveA15 > primitiveB17 ? 1 : primitiveA15 < primitiveB17 
? -1 : 0);
  if (comp != 0) {
return comp;
  }
}

return 0;
  }
}

at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
org.spark-project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at 
org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
at 

[jira] [Commented] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-11 Thread Virgil Palanciuc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000840#comment-15000840
 ] 

Virgil Palanciuc commented on SPARK-11657:
--

On the simple example:
{code}
> val df = sc.makeRDD(Seq("20206E25479B5C67992015A9")).toDF()
df: org.apache.spark.sql.DataFrame = [_1: string]
> df.show()
++
|  _1|
++
|479B5C67992015A9...|
++

> df.take(2)
res1: Array[org.apache.spark.sql.Row] = Array([479B5C67992015A9]) 
{code}

also:
{code}
== Parsed Logical Plan ==
Limit 21
 LogicalRDD [_1#1], MapPartitionsRDD[5] at stringRddToDataFrameHolder at 
:21

== Analyzed Logical Plan ==
_1: string
Limit 21
 LogicalRDD [_1#1], MapPartitionsRDD[5] at stringRddToDataFrameHolder at 
:21

== Optimized Logical Plan ==
Limit 21
 LogicalRDD [_1#1], MapPartitionsRDD[5] at stringRddToDataFrameHolder at 
:21

== Physical Plan ==
Limit 21
 Scan PhysicalRDD[_1#1]

Code Generation: true
{code}


> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11660) Spark Thrift GetResultSetMetadata describes a VARCHAR as a STRING

2015-11-11 Thread Chip Sands (JIRA)
Chip Sands created SPARK-11660:
--

 Summary: Spark Thrift GetResultSetMetadata describes a VARCHAR as 
a STRING
 Key: SPARK-11660
 URL: https://issues.apache.org/jira/browse/SPARK-11660
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0, 1.4.0
Reporter: Chip Sands


In the Spark SQL Thrift interface, the GetResultSetMetadata reply packet that 
describes the result set metadata reports a column that is defined as a 
VARCHAR in the database as the native type STRING. Data still returns correctly 
in the Thrift string type, but ODBC/JDBC is not able to correctly describe the 
data type being returned or its defined maximum length.

FYI Hive returns it correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11567) Add Python API for corr aggregate function

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11567:
--
Assignee: Felix Cheung

> Add Python API for corr aggregate function
> --
>
> Key: SPARK-11567
> URL: https://issues.apache.org/jira/browse/SPARK-11567
> Project: Spark
>  Issue Type: Task
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11202) Unsupported dataType

2015-11-11 Thread F Jimenez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000665#comment-15000665
 ] 

F Jimenez commented on SPARK-11202:
---

The problem seems to stem from calling a deprecated method. In Spark 1.5.1 I 
see the following code:

{code}
  def convertFromString(string: String): Seq[Attribute] = {
    Try(DataType.fromJson(string)).getOrElse(DataType.fromCaseClassString(string)) match {
      case s: StructType => s.toAttributes
      case other => sys.error(s"Can convert $string to row")
    }
  }
{code}

The method {{DataType.fromCaseClassString}} is marked as deprecated. Switching 
to {{DataType.fromJson}} instead solved my instance of the same problem.
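
A rough sketch of the suggested change (not the actual patch) would drop the deprecated fallback entirely:

{code}
// Hypothetical sketch of the commenter's suggestion: parse the schema as JSON
// only, without falling back to the deprecated case-class string parser.
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.types.{DataType, StructType}

def convertFromString(string: String): Seq[Attribute] =
  DataType.fromJson(string) match {
    case s: StructType => s.toAttributes
    case other => sys.error(s"Can not convert $string to row")
  }
{code}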


> Unsupported dataType
> 
>
> Key: SPARK-11202
> URL: https://issues.apache.org/jira/browse/SPARK-11202
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: whc
>
> I read data from oracle and save as parquet ,then get the following error:
> java.lang.IllegalArgumentException: Unsupported dataType: 
> {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]},
>  [1.1] failure: `TimestampType' expected but `{' found
> {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]}
> ^
> at 
> org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(DataType.scala:245)
> at 
> org.apache.spark.sql.types.DataType$.fromCaseClassString(DataType.scala:102)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromString(ParquetTypesConverter.scala:62)
> at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:51)
> at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
> at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> I checked the types, but there is no Timestamp or Date type in Oracle.
> My Oracle table is like this:
> create table DW_DOMAIN
> (
> domain_id   NUMBER,
>   cityid  NUMBER,
>   domain_type NUMBER,
>   domain_name VARCHAR2(80)
> )
> and my code is like this:
> Map options = new HashMap();
>   options.put("url", url);
>   options.put("driver", driver);
>   options.put("user", user);
>   options.put("password", password);
>   options.put("dbtable", "(select DOMAIN_NAME,DOMAIN_ID from 
> dw_domain ) t");
> DataFrame df = this.sqlContext.read().format("jdbc").options(options )
>   .load();
> df.write().mode(SaveMode.Append)
>   
> .parquet("hdfs://cluster1:8020/database/count_domain/");
> if add "to_char(DOMAIN_ID)",that can get correct result. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000739#comment-15000739
 ] 

Marcelo Vanzin commented on SPARK-11655:


I'll take a look at the code.

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000821#comment-15000821
 ] 

Yin Huai commented on SPARK-10978:
--

Thanks for the test! I think there is a bug.

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned when Spark 
> does a filter on "solr_query", nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-11-11 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000823#comment-15000823
 ] 

Yin Huai commented on SPARK-10978:
--

Thanks for the test! I think there is a bug.

> Allow PrunedFilterScan to eliminate predicates from further evaluation
> --
>
> Key: SPARK-10978
> URL: https://issues.apache.org/jira/browse/SPARK-10978
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Russell Alexander Spitzer
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> Currently PrunedFilterScan allows implementors to push down predicates to an 
> underlying datasource. This is done solely as an optimization as the 
> predicate will be reapplied on the Spark side as well. This allows for 
> bloom-filter like operations but ends up doing a redundant scan for those 
> sources which can do accurate pushdowns.
> In addition, it makes it difficult for underlying sources to accept queries 
> which reference non-existent columns in order to provide ancillary 
> functionality. In our case we allow a Solr query to be passed in via a 
> non-existent solr_query column. Since this column is not returned when Spark 
> does a filter on "solr_query", nothing passes. 
> Suggestion on the ML from [~marmbrus] 
> {quote}
> We have to try and maintain binary compatibility here, so probably the 
> easiest thing to do here would be to add a method to the class.  Perhaps 
> something like:
> def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters
> By default, this could return all filters so behavior would remain the same, 
> but specific implementations could override it.  There is still a chance that 
> this would conflict with existing methods, but hopefully that would not be a 
> problem in practice.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11659) Codegen sporadically fails with same input character

2015-11-11 Thread Catalin Alexandru Zamfir (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Catalin Alexandru Zamfir updated SPARK-11659:
-
Description: 
We pretty much have a default installation of Spark 1.5.1. Some of our jobs 
sporadically fail with the below exception for the same "input character" (we 
don't have @ in our inputs as we check the types that we filter from the data, 
but jobs still fail) and when we re-run the same job with the same input, the 
same job passes without any failures. I believe it's a bug in code-gen but I 
can't debug this on a production cluster. One thing to note is that this has a 
higher chance of occurring when multiple jobs are run in parallel to one 
another (e.g. 4 jobs at a time started on the same second using a scheduler and 
sharing the same context). However, I have no reliable way to reproduce it. For 
example, out of 32 jobs scheduled in batches of 4 jobs per batch, 1 of the jobs in 
one of the batches may fail with the below error, and with a different job each 
time, randomly. I don't know how to approach this situation to produce better 
information, so maybe you can advise us.

{noformat}
Job aborted due to stage failure: Task 50 in stage 4.0 failed 4 times, most 
recent failure: Lost task 50.3 in stage 4.0 (TID 894, 10.136.64.112): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: 
Invalid character input "@" (character code 64)

public SpecificOrdering 
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificOrdering(expr);
}

class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
  
  private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
  
  
  
  public 
SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
expressions = expr;

  }
  
  @Override
  public int compare(InternalRow a, InternalRow b) {
InternalRow i = null;  // Holds current row being evaluated.

i = a;
boolean isNullA2;
long primitiveA3;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullA2 = isNull0;
  primitiveA3 = primitive1;
}
i = b;
boolean isNullB4;
long primitiveB5;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullB4 = isNull0;
  primitiveB5 = primitive1;
}
if (isNullA2 && isNullB4) {
  // Nothing
} else if (isNullA2) {
  return -1;
} else if (isNullB4) {
  return 1;
} else {
  int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA8;
long primitiveA9;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullA8 = isNull6;
  primitiveA9 = primitive7;
}
i = b;
boolean isNullB10;
long primitiveB11;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullB10 = isNull6;
  primitiveB11 = primitive7;
}
if (isNullA8 && isNullB10) {
  // Nothing
} else if (isNullA8) {
  return -1;
} else if (isNullB10) {
  return 1;
} else {
  int comp = (primitiveA9 > primitiveB11 ? 1 : primitiveA9 < primitiveB11 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA14;
long primitiveA15;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullA14 = isNull12;
  primitiveA15 = primitive13;
}
i = b;
boolean isNullB16;
long primitiveB17;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullB16 = isNull12;
  primitiveB17 = primitive13;
}
if (isNullA14 && isNullB16) {
  // Nothing
} else if (isNullA14) {
  return -1;
} else if (isNullB16) {
  return 1;
} else {
  int comp = (primitiveA15 > primitiveB17 ? 1 : primitiveA15 < primitiveB17 
? -1 : 0);
  if (comp != 0) {
return comp;
  }
}

return 0;
  }
}

at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 

[jira] [Resolved] (SPARK-11481) orderBy with multiple columns in WindowSpec does not work properly

2015-11-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-11481.

   Resolution: Fixed
 Assignee: Davies Liu
Fix Version/s: 1.6.0
   1.5.2

Re-open this if it's different.

> orderBy with multiple columns in WindowSpec does not work properly
> --
>
> Key: SPARK-11481
> URL: https://issues.apache.org/jira/browse/SPARK-11481
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: All
>Reporter: Jose Antonio
>Assignee: Davies Liu
>  Labels: DataFrame, sparkSQL
> Fix For: 1.5.2, 1.6.0
>
>
> When using multiple columns in the orderBy of a WindowSpec, the order by seems 
> to work only for the first column.
> A possible workaround is to sort the DataFrame first and then apply the 
> window spec over the sorted DataFrame,
> e.g. 
> THIS DOES NOT WORK:
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version', 
> func.sum(df.group_counter).over(window_sum))
> THIS WORKS WELL:
> df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version', 
> func.sum(df.group_counter).over(window_sum))
> Also, can anybody confirm that this is a true workaround?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11481) orderBy with multiple columns in WindowSpec does not work properly

2015-11-11 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000826#comment-15000826
 ] 

Davies Liu commented on SPARK-11481:


I think this is related to https://issues.apache.org/jira/browse/SPARK-11009, 
which was fixed in 1.5.2 and 1.6. 
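
For anyone verifying on 1.5.2+, here is a minimal Scala sketch of the same multi-column window spec; {{df}} and the column names are taken from the report above and are assumptions, not tested code:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running total ordered by several columns within each partition.
// df is assumed to be an existing DataFrame with the columns from the report.
val windowSum = Window
  .partitionBy("user_unique_id")
  .orderBy("creation_date", "mib_id", "day")
  .rowsBetween(Long.MinValue, 0)

val withVersion = df.withColumn("user_version", sum(df("group_counter")).over(windowSum))
{code}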

> orderBy with multiple columns in WindowSpec does not work properly
> --
>
> Key: SPARK-11481
> URL: https://issues.apache.org/jira/browse/SPARK-11481
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
> Environment: All
>Reporter: Jose Antonio
>  Labels: DataFrame, sparkSQL
>
> When using multiple columns in the orderBy of a WindowSpec, the order by seems 
> to work only for the first column.
> A possible workaround is to sort the DataFrame first and then apply the 
> window spec over the sorted DataFrame,
> e.g. 
> THIS DOES NOT WORK:
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version', 
> func.sum(df.group_counter).over(window_sum))
> THIS WORKS WELL:
> df = df.sort('user_unique_id', 'creation_date', 'mib_id', 'day')
> window_sum = Window.partitionBy('user_unique_id').orderBy('creation_date', 
> 'mib_id', 'day').rowsBetween(-sys.maxsize, 0)
> df = df.withColumn('user_version', 
> func.sum(df.group_counter).over(window_sum))
> Also, can anybody confirm that this is a true workaround?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11656) support typed aggregate in project list

2015-11-11 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11656.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9630
[https://github.com/apache/spark/pull/9630]

> support typed aggregate in project list
> ---
>
> Key: SPARK-11656
> URL: https://issues.apache.org/jira/browse/SPARK-11656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-11 Thread Xin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Wu updated SPARK-11657:
---
Comment: was deleted

(was: I tried this sample data as local file mode. and it seems working to me. 
Have you tried it this way?

{code}
scala> val data = sqlContext.read.parquet("/root/sample")
[Stage 0:>  (0 + 8) / 
8]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, 
clusterData: array, dpid: int]

scala> data.take(1)
15/11/11 08:26:29 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
res0: Array[org.apache.spark.sql.Row] = 
Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])

scala> data.collect()
15/11/11 08:26:53 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
res1: Array[org.apache.spark.sql.Row] = 
Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])

scala> data.show(false)
15/11/11 08:26:57 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+-----------+------------------------+----------------------------------------+----+
|clusterSize|clusterName             |clusterData                             |dpid|
+-----------+------------------------+----------------------------------------+----+
|1          |6A01CACD56169A947F000101|[77512098164594606101815510825479776971]|813 |
+-----------+------------------------+----------------------------------------+----+

{code})

> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11553) row.getInt(i) if row[i]=null returns 0

2015-11-11 Thread Bartlomiej Alberski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000616#comment-15000616
 ] 

Bartlomiej Alberski commented on SPARK-11553:
-

OK, I think I know what the problem is. It can be reproduced with Scala 
2.11.6 and the DataFrame API.

If you are using the DataFrame API from Scala and you try to get an 
Int|Long|Boolean etc. - a value that extends AnyVal - you will receive the "zero 
value" specific to the given type (0 for Long and Int, false for Boolean, etc.), 
while the API suggests that an NPE will be raised.

Example modified in order to illustrate the problem (from 
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes):
{code}
val sc: SparkContext // An existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()
val res = df.map(x => x.getLong(x.fieldIndex("name"))).collect()
println(res.mkString(","))
{code}

The problem comes from the implementation of the getInt|getFloat|getBoolean|... methods: 
{code}
getInt(i: Int): Int = getAs[Int](i)
getAs[T](i: Int): T = get(i).asInstanceOf[T]
{code}

null.asInstanceOf[Long] returns 0 (because Long cannot be null, since it 
extends AnyVal).

Example invocations from the Scala REPL:
{code}
scala> null.asInstanceOf[Int]
res0: Int = 0

scala> null.asInstanceOf[Long]
res1: Long = 0

scala> null.asInstanceOf[Short]
res2: Short = 0

scala> null.asInstanceOf[Boolean]
res3: Boolean = false

scala> null.asInstanceOf[Double]
res4: Double = 0.0

scala> null.asInstanceOf[Float]
res5: Float = 0.0
{code}

I will be more than happy to prepare a PR solving this issue.
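
In the meantime, callers can guard against the silent zero values themselves; a minimal sketch (the helper name below is mine, not part of any proposed fix):

{code}
import org.apache.spark.sql.Row

// Returns None when the column is null instead of silently yielding 0 / false.
def getLongOption(row: Row, i: Int): Option[Long] =
  if (row.isNullAt(i)) None else Some(row.getLong(i))

// Usage: df.map(r => getLongOption(r, r.fieldIndex("age")).getOrElse(-1L))
{code}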

> row.getInt(i) if row[i]=null returns 0
> --
>
> Key: SPARK-11553
> URL: https://issues.apache.org/jira/browse/SPARK-11553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Tofigh
>Priority: Minor
>
> row.getInt|Float|Double in SPARK RDD return 0 if row[index] is null. (Even 
> according to the document they should throw nullException error)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11659) Codegen sporadically fails with same input character

2015-11-11 Thread Catalin Alexandru Zamfir (JIRA)
Catalin Alexandru Zamfir created SPARK-11659:


 Summary: Codegen sporadically fails with same input character
 Key: SPARK-11659
 URL: https://issues.apache.org/jira/browse/SPARK-11659
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.5.1
 Environment: Default, Linux (Jessie)
Reporter: Catalin Alexandru Zamfir


We pretty much have a default installation of Spark 1.5.1. Some of our jobs 
sporadically fail with the below exception for the same "input character" (we 
don't have @ in our inputs, but jobs still fail), and when we re-run the same 
job with the same input, all jobs pass without any failures. I believe it's a 
bug in code-gen but I can't debug this on a production cluster (and it's almost 
impossible to reproduce it).

{{Job aborted due to stage failure: Task 50 in stage 4.0 failed 4 times, most 
recent failure: Lost task 50.3 in stage 4.0 (TID 894, 10.136.64.112): 
java.util.concurrent.ExecutionException: java.lang.Exception: failed to 
compile: org.codehaus.commons.compiler.CompileException: Line 15, Column 9: 
Invalid character input "@" (character code 64)

public SpecificOrdering 
generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
  return new SpecificOrdering(expr);
}

class SpecificOrdering extends 
org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
  
  private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
  
  
  
  public 
SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
expressions = expr;

  }
  
  @Override
  public int compare(InternalRow a, InternalRow b) {
InternalRow i = null;  // Holds current row being evaluated.

i = a;
boolean isNullA2;
long primitiveA3;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullA2 = isNull0;
  primitiveA3 = primitive1;
}
i = b;
boolean isNullB4;
long primitiveB5;
{
  /* input[0, LongType] */
  
  boolean isNull0 = i.isNullAt(0);
  long primitive1 = isNull0 ? -1L : (i.getLong(0));
  
  isNullB4 = isNull0;
  primitiveB5 = primitive1;
}
if (isNullA2 && isNullB4) {
  // Nothing
} else if (isNullA2) {
  return -1;
} else if (isNullB4) {
  return 1;
} else {
  int comp = (primitiveA3 > primitiveB5 ? 1 : primitiveA3 < primitiveB5 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA8;
long primitiveA9;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullA8 = isNull6;
  primitiveA9 = primitive7;
}
i = b;
boolean isNullB10;
long primitiveB11;
{
  /* input[1, LongType] */
  
  boolean isNull6 = i.isNullAt(1);
  long primitive7 = isNull6 ? -1L : (i.getLong(1));
  
  isNullB10 = isNull6;
  primitiveB11 = primitive7;
}
if (isNullA8 && isNullB10) {
  // Nothing
} else if (isNullA8) {
  return -1;
} else if (isNullB10) {
  return 1;
} else {
  int comp = (primitiveA9 > primitiveB11 ? 1 : primitiveA9 < primitiveB11 ? 
-1 : 0);
  if (comp != 0) {
return comp;
  }
}


i = a;
boolean isNullA14;
long primitiveA15;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullA14 = isNull12;
  primitiveA15 = primitive13;
}
i = b;
boolean isNullB16;
long primitiveB17;
{
  /* input[2, LongType] */
  
  boolean isNull12 = i.isNullAt(2);
  long primitive13 = isNull12 ? -1L : (i.getLong(2));
  
  isNullB16 = isNull12;
  primitiveB17 = primitive13;
}
if (isNullA14 && isNullB16) {
  // Nothing
} else if (isNullA14) {
  return -1;
} else if (isNullB16) {
  return 1;
} else {
  int comp = (primitiveA15 > primitiveB17 ? 1 : primitiveA15 < primitiveB17 
? -1 : 0);
  if (comp != 0) {
return comp;
  }
}

return 0;
  }
}

at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
org.spark-project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
org.spark-project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
org.spark-project.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
org.spark-project.guava.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 

[jira] [Updated] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-11-11 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11601:
--
Assignee: Tim Hunter

> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Tim Hunter
>
> Generate a list of binary incompatible changes using MiMa.  Filter out false 
> positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000856#comment-15000856
 ] 

Apache Spark commented on SPARK-11655:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9633

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].
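
As a rough illustration of the socket/stream idea described above, the child JVM can simply watch its stdin for EOF and exit when the parent goes away; this is only a sketch, not the approach taken in the linked PR:

{code}
// Runs inside the child JVM. If the parent process dies, the pipe feeding our
// stdin is closed, read() returns -1, and we shut ourselves down.
val parentWatcher = new Thread(new Runnable {
  override def run(): Unit = {
    while (System.in.read() != -1) { /* ignore input, just wait for EOF */ }
    System.exit(1)
  }
})
parentWatcher.setDaemon(true)
parentWatcher.start()
{code}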



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11637) Alias do not work with udf with * parameter

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11637:
--
Component/s: SQL

> Alias do not work with udf with * parameter
> ---
>
> Key: SPARK-11637
> URL: https://issues.apache.org/jira/browse/SPARK-11637
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.5.2
>Reporter: Pierre Borckmans
>
> In Spark < 1.5.0, this used to work :
> {code:java|title=Spark <1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> res2: org.apache.spark.sql.DataFrame = [x: int]
> {code}
> From Spark 1.5.0+, it fails:
> {code:java|title=Spark>=1.5.0|borderStyle=solid}
> scala> sqlContext.sql("select hash(*) as x from T")
> org.apache.spark.sql.AnalysisException: unresolved operator 'Project 
> ['hash(*) AS x#1];
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
> at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
> ...
> {code}
> This is not specific to the `hash` udf. It also applies to user defined 
> functions.
> The `*` seems to be the issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11655:


Assignee: Apache Spark

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11652) Remote code execution with InvokerTransformer

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11652:
--
Component/s: Spark Core

> Remote code execution with InvokerTransformer
> -
>
> Key: SPARK-11652
> URL: https://issues.apache.org/jira/browse/SPARK-11652
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Daniel Darabos
>Priority: Minor
>
> There is a remote code execution vulnerability in the Apache Commons 
> collections library (https://issues.apache.org/jira/browse/COLLECTIONS-580) 
> that can be exploited simply by causing malicious data to be deserialized 
> using Java serialization.
> As Spark is used in security-conscious environments I think it's worth taking 
> a closer look at how the vulnerability affects Spark. What are the points 
> where Spark deserializes external data? Which are affected by using Kryo 
> instead of Java serialization? What mitigation strategies are available?
> If the issue is serious enough but mitigation is possible, it may be useful 
> to post about it on the mailing list or blog.
> Thanks!
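
For reference, switching user data to Kryo is a one-line configuration change; a minimal sketch is below, though whether Kryo covers every deserialization path in Spark is exactly the open question raised here:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-serialization-example")
  // Use Kryo instead of Java serialization for shuffled and cached data.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)
{code}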



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11643) inserting date with leading zero inserts null example '0001-12-10'

2015-11-11 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11643:
--
Component/s: SQL

> inserting date with leading zero inserts null example '0001-12-10'
> --
>
> Key: SPARK-11643
> URL: https://issues.apache.org/jira/browse/SPARK-11643
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Chip Sands
>
> inserting date with leading zero inserts null value, example '0001-12-10'.
> This worked until 1.5/1.5.1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9866) VersionsSuite is unnecessarily slow in Jenkins

2015-11-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-9866:
-

Assignee: Josh Rosen

> VersionsSuite is unnecessarily slow in Jenkins
> --
>
> Key: SPARK-9866
> URL: https://issues.apache.org/jira/browse/SPARK-9866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> The VersionsSuite Hive test is unreasonably slow in Jenkins; downloading the 
> Hive JARs and their transitive dependencies from Maven adds at least 8 
> minutes to the total build time.
> In order to cut down on build time, I think that we should make the cache 
> directory configurable via an environment variable and should configure the 
> Jenkins scripts to set this variable to point to a location outside of the 
> Jenkins workspace which is re-used across builds.
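
A minimal sketch of reading such a cache location from the environment, with a fallback outside the workspace; the variable name here is an assumption, not necessarily the one that will be used:

{code}
import java.io.File

// Prefer a cache directory configured via the environment so Jenkins can
// point all builds at a shared location outside the workspace.
val hiveIvyCacheDir: File = sys.env.get("SPARK_VERSIONS_SUITE_IVY_DIR")
  .map(new File(_))
  .getOrElse(new File(sys.props("java.io.tmpdir"), "hive-ivy-cache"))
{code}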



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11658) simplify documentation for PySpark combineByKey

2015-11-11 Thread chris snow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000565#comment-15000565
 ] 

chris snow commented on SPARK-11658:


Pull request - https://github.com/apache/spark/pull/9631

> simplify documentation for PySpark combineByKey
> ---
>
> Key: SPARK-11658
> URL: https://issues.apache.org/jira/browse/SPARK-11658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> The current documentation for combineByKey looks like this:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def f(x): return x
> >>> def add(a, b): return a + str(b)
> >>> sorted(x.combineByKey(str, add, add).collect())
> [('a', '11'), ('b', '1')]
> """
> {code}
> I think it could be simplified to:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def add(a, b): return a + str(b)
> >>> x.combineByKey(str, add, add).collect()
> [('a', '11'), ('b', '1')]
> """
> {code}
> I'll shortly add a patch for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11658) simplify documentation for PySpark combineByKey

2015-11-11 Thread chris snow (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

chris snow updated SPARK-11658:
---
Comment: was deleted

(was: Pull request - https://github.com/apache/spark/pull/9631)

> simplify documentation for PySpark combineByKey
> ---
>
> Key: SPARK-11658
> URL: https://issues.apache.org/jira/browse/SPARK-11658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> The current documentation for combineByKey looks like this:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def f(x): return x
> >>> def add(a, b): return a + str(b)
> >>> sorted(x.combineByKey(str, add, add).collect())
> [('a', '11'), ('b', '1')]
> """
> {code}
> I think it could be simplified to:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def add(a, b): return a + str(b)
> >>> x.combineByKey(str, add, add).collect()
> [('a', '11'), ('b', '1')]
> """
> {code}
> I'll shortly add a patch for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-11 Thread Virgil Palanciuc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000599#comment-15000599
 ] 

Virgil Palanciuc commented on SPARK-11657:
--

I can't reproduce this on standalone; I can reproduce it on spark-shell over 
YARN (EMR) with something as simple as this:

{code}
scala> sc.makeRDD(Seq("20206E25479B5C67992015A9")).toDF().take(1)
res1: Array[org.apache.spark.sql.Row] = Array([479B5C67992015A9]) 
{code}


> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-11 Thread Xin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000607#comment-15000607
 ] 

Xin Wu commented on SPARK-11657:


I tried this sample data in local file mode, and it seems to be working for me. Have 
you tried it this way?

{code}
scala> val data = sqlContext.read.parquet("/root/sample")
[Stage 0:>  (0 + 8) / 
8]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, 
clusterData: array, dpid: int]

scala> data.take(1)
15/11/11 08:26:29 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
res0: Array[org.apache.spark.sql.Row] = 
Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])

scala> data.collect()
15/11/11 08:26:53 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
res1: Array[org.apache.spark.sql.Row] = 
Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])

scala> data.show(false)
15/11/11 08:26:57 WARN ParquetRecordReader: Can not initialize counter due to 
context is not a instance of TaskInputOutputContext, but is 
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
+-----------+------------------------+----------------------------------------+----+
|clusterSize|clusterName             |clusterData                             |dpid|
+-----------+------------------------+----------------------------------------+----+
|1          |6A01CACD56169A947F000101|[77512098164594606101815510825479776971]|813 |
+-----------+------------------------+----------------------------------------+----+

{code}

> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11657) Bad data read using dataframes

2015-11-11 Thread Virgil Palanciuc (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Virgil Palanciuc updated SPARK-11657:
-
Attachment: sample.tgz

Sample directory; use it to reproduce the problem

> Bad data read using dataframes
> --
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've included the "hdfs:///sample" directory here:
> https://www.dropbox.com/s/su0flfn49rrc7jz/sample.tgz?dl=0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11657) Bad data read using dataframes

2015-11-11 Thread Virgil Palanciuc (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Virgil Palanciuc updated SPARK-11657:
-
Description: 
I get strange behaviour when reading parquet data:

{code}
scala> val data = sqlContext.read.parquet("hdfs:///sample")
data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, 
clusterData: array, dpid: int]
scala> data.take(1)/// this returns garbage
res0: Array[org.apache.spark.sql.Row] = 
Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
 
scala> data.collect()/// this works
res1: Array[org.apache.spark.sql.Row] = 
Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
{code}

I've attached the "hdfs:///sample" directory to this bug report

  was:
I get strange behaviour when reading parquet data:

{code}
scala> val data = sqlContext.read.parquet("hdfs:///sample")
data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, 
clusterData: array, dpid: int]
scala> data.take(1)/// this returns garbage
res0: Array[org.apache.spark.sql.Row] = 
Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
 
scala> data.collect()/// this works
res1: Array[org.apache.spark.sql.Row] = 
Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
{code}

I've included the "hdfs:///sample" directory here:

https://www.dropbox.com/s/su0flfn49rrc7jz/sample.tgz?dl=0


> Bad data read using dataframes
> --
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11658) simplify documentation for PySpark combineByKey

2015-11-11 Thread chris snow (JIRA)
chris snow created SPARK-11658:
--

 Summary: simplify documentation for PySpark combineByKey
 Key: SPARK-11658
 URL: https://issues.apache.org/jira/browse/SPARK-11658
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Affects Versions: 1.5.1
Reporter: chris snow
Priority: Minor


The current documentation for combineByKey looks like this:

{code}
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> def f(x): return x
>>> def add(a, b): return a + str(b)
>>> sorted(x.combineByKey(str, add, add).collect())
[('a', '11'), ('b', '1')]
"""
{code}

I think it could be simplified to:

{code}
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> def add(a, b): return a + str(b)
>>> x.combineByKey(str, add, add).collect()
[('a', '11'), ('b', '1')]
"""
{code}

I'll shortly add a patch for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11653) Would be very useful if spark-daemon.sh supported foreground operations

2015-11-11 Thread Adrian Bridgett (JIRA)
Adrian Bridgett created SPARK-11653:
---

 Summary: Would be very useful if spark-daemon.sh supported 
foreground operations
 Key: SPARK-11653
 URL: https://issues.apache.org/jira/browse/SPARK-11653
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.5.1
Reporter: Adrian Bridgett
Priority: Minor


See also SPARK-5964.  I've raised a new ticket for a few reasons:
- alternative more minimal patch (I've not submitted a PR as I thought any 
discussion would be better here): 
https://github.com/abridgett/spark/tree/feature/add_spark_daemon_foreground
- couldn't re-open the original ticket and didn't know if a comment would 
suffice

There are quite a few use cases for this, e.g. when running under Mesos the 
shuffle service needs to be running and trying to do this under Marathon is a 
bit painful without this feature.  

I see that there was a plan for these scripts to be rewritten in Python, 
however until then this would be very useful (and can feed into the rewrite as 
well).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11653) Would be very useful if spark-daemon.sh supported foreground operations

2015-11-11 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000390#comment-15000390
 ] 

Adrian Bridgett commented on SPARK-11653:
-

Thanks Sean, I took a slightly different approach - basically just using an 
environment variable. I didn't add any argument-handling support as it's a bit 
of an advanced case (so treating it just like SPARK_NICENESS or SPARK_LOG_DIR).

The discussion on SPARK-5964 seemed to move into a cleanup of the code (which 
is fine, but perhaps should be a separate task).  I'm happy to do that if 
desired - accepting multiple long arguments, printing errors to stderr, etc.

> Would be very useful if spark-daemon.sh supported foreground operations
> ---
>
> Key: SPARK-11653
> URL: https://issues.apache.org/jira/browse/SPARK-11653
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.1
>Reporter: Adrian Bridgett
>Priority: Minor
>
> See also SPARK-5964.  I've raised a new ticket for a few reasons:
> - alternative more minimal patch (I've not submitted a PR as I thought any 
> discussion would be better here): 
> https://github.com/abridgett/spark/tree/feature/add_spark_daemon_foreground
> - couldn't re-open the original ticket and didn't know if a comment would 
> suffice
> There are quite a few use cases for this, e.g. when running under Mesos the 
> shuffle service needs to be running and trying to do this under Marathon is a 
> bit painful without this feature.  
> I see that there was a plan for these scripts to be rewritten in Python, 
> however until then this would be very useful (and can feed into the rewrite 
> as well).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000424#comment-15000424
 ] 

shane knapp commented on SPARK-11655:
-

actually, this started back in mid-may, but the impact definitely ramped up in 
october.   if you look at the attached graph, you'll see the overall process 
count ramp up then.  i was out of the office for most of august, and you can 
see the effect of me not being around to keep an eye on things.  then in 
october, things got REALLY bad (see second attached) and i started killing off 
the hanging processes daily, as evidenced by the sawtooth pattern.

this affects all spark-test builds, btw.

that being said, i like the socket/stream solution...  nice find!

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11154) make specification spark.yarn.executor.memoryOverhead consistent with typical JVM options

2015-11-11 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000426#comment-15000426
 ] 

Thomas Graves commented on SPARK-11154:
---

It seems unnecessary to me to add new configs just to support this.  I see this 
causing confusion to users.

Personally I would rather see us either add support for k/m/g suffixes, defaulting 
to megabytes when none is specified (although this doesn't match other Spark 
settings), or - since there is discussion going on about Spark 2.0 - just change 
the existing configs there to support k/m/g, etc.
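
For illustration, parsing such suffixes is a small helper; the sketch below is not Spark's actual implementation, just the shape of what is being discussed:

{code}
// Parses "512m", "2g", "1024k" or a bare number (treated as MB) into megabytes.
def memoryStringToMb(str: String): Long = {
  val s = str.trim.toLowerCase
  if (s.endsWith("g")) s.dropRight(1).toLong * 1024
  else if (s.endsWith("m")) s.dropRight(1).toLong
  else if (s.endsWith("k")) s.dropRight(1).toLong / 1024
  else s.toLong // no suffix: keep today's behaviour and treat it as megabytes
}
{code}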

> make specification spark.yarn.executor.memoryOverhead consistent with 
> typical JVM options
> --
>
> Key: SPARK-11154
> URL: https://issues.apache.org/jira/browse/SPARK-11154
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> spark.yarn.executor.memoryOverhead is currently specified in megabytes by 
> default, but it would be nice to allow users to specify the size as though it 
> were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended 
> to the end to explicitly specify megabytes or gigabytes.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11657) Bad data read using dataframes

2015-11-11 Thread Virgil Palanciuc (JIRA)
Virgil Palanciuc created SPARK-11657:


 Summary: Bad data read using dataframes
 Key: SPARK-11657
 URL: https://issues.apache.org/jira/browse/SPARK-11657
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.5.1, 1.5.2
 Environment: EMR (yarn)
Reporter: Virgil Palanciuc
Priority: Critical


I get strange behaviour when reading parquet data:

{code}
scala> val data = sqlContext.read.parquet("hdfs:///sample")
data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: string, 
clusterData: array, dpid: int]
scala> data.take(1)/// this returns garbage
res0: Array[org.apache.spark.sql.Row] = 
Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
 
scala> data.collect()/// this works
res1: Array[org.apache.spark.sql.Row] = 
Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
{code}

I've included the "hdfs:///sample" directory here:

https://www.dropbox.com/s/su0flfn49rrc7jz/sample.tgz?dl=0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11658) simplify documentation for PySpark combineByKey

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11658:


Assignee: Apache Spark

> simplify documentation for PySpark combineByKey
> ---
>
> Key: SPARK-11658
> URL: https://issues.apache.org/jira/browse/SPARK-11658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Assignee: Apache Spark
>Priority: Minor
>
> The current documentation for combineByKey looks like this:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def f(x): return x
> >>> def add(a, b): return a + str(b)
> >>> sorted(x.combineByKey(str, add, add).collect())
> [('a', '11'), ('b', '1')]
> """
> {code}
> I think it could be simplified to:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def add(a, b): return a + str(b)
> >>> x.combineByKey(str, add, add).collect()
> [('a', '11'), ('b', '1')]
> """
> {code}
> I'll shortly add a patch for this.
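
For readers puzzled by what the three arguments do, here is a small 
self-contained Scala sketch (an RDD API equivalent of the doctest above, added 
for illustration only): {{str}} plays the role of createCombiner, and {{add}} 
is used as both mergeValue and mergeCombiners.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object CombineByKeyExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combineByKey").setMaster("local[2]"))
    val x = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))

    val combined = x.combineByKey(
      (v: Int) => v.toString,                    // createCombiner: first value seen per key per partition
      (acc: String, v: Int) => acc + v.toString, // mergeValue: fold further values into the combiner
      (a: String, b: String) => a + b            // mergeCombiners: merge combiners across partitions
    )

    combined.collect().foreach(println)          // (a,11) and (b,1), in no particular order
    sc.stop()
  }
}
{code}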



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11658) simplify documentation for PySpark combineByKey

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11658:


Assignee: (was: Apache Spark)

> simplify documentation for PySpark combineByKey
> ---
>
> Key: SPARK-11658
> URL: https://issues.apache.org/jira/browse/SPARK-11658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> The current documentation for combineByKey looks like this:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def f(x): return x
> >>> def add(a, b): return a + str(b)
> >>> sorted(x.combineByKey(str, add, add).collect())
> [('a', '11'), ('b', '1')]
> """
> {code}
> I think it could be simplified to:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def add(a, b): return a + str(b)
> >>> x.combineByKey(str, add, add).collect()
> [('a', '11'), ('b', '1')]
> """
> {code}
> I'll shortly add a patch for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11658) simplify documentation for PySpark combineByKey

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000563#comment-15000563
 ] 

Apache Spark commented on SPARK-11658:
--

User 'snowch' has created a pull request for this issue:
https://github.com/apache/spark/pull/9631

> simplify documentation for PySpark combineByKey
> ---
>
> Key: SPARK-11658
> URL: https://issues.apache.org/jira/browse/SPARK-11658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 1.5.1
>Reporter: chris snow
>Priority: Minor
>
> The current documentation for combineByKey looks like this:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def f(x): return x
> >>> def add(a, b): return a + str(b)
> >>> sorted(x.combineByKey(str, add, add).collect())
> [('a', '11'), ('b', '1')]
> """
> {code}
> I think it could be simplified to:
> {code}
> >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
> >>> def add(a, b): return a + str(b)
> >>> x.combineByKey(str, add, add).collect()
> [('a', '11'), ('b', '1')]
> """
> {code}
> I'll shortly add a patch for this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-11-11 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000526#comment-15000526
 ] 

Ferdinand Xu commented on SPARK-5682:
-

Thank you for your question. The key is generated by a key generator 
instantiated with the specified keygen algorithm; that part of the work is in 
the method CryptoConf#initSparkShuffleCredentials. More detailed information is 
available in the PR (https://github.com/apache/spark/pull/8880). For the IV 
part, we are using Chimera (https://github.com/intel-hadoop/chimera) as an 
external library in the latest PR 
(https://github.com/intel-hadoop/chimera/blob/master/src/main/java/com/intel/chimera/JceAesCtrCryptoCodec.java#L70
 and 
https://github.com/intel-hadoop/chimera/blob/master/src/main/java/com/intel/chimera/OpensslAesCtrCryptoCodec.java#L81).
 You can also dig into the code to see how the IV is calculated from the 
counter and the initial 
IV (https://github.com/intel-hadoop/chimera/blob/master/src/main/java/com/intel/chimera/AesCtrCryptoCodec.java#L42).
 The initial IV is generated by a secure random source.
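
For readers who want to see that pattern without pulling in Chimera, here is a 
minimal, self-contained JCE sketch (an illustration only, not Spark's or 
Chimera's actual code): the key comes from a KeyGenerator, the initial IV from 
a secure random source, and AES runs in CTR mode.

{code}
import java.security.SecureRandom
import javax.crypto.{Cipher, KeyGenerator}
import javax.crypto.spec.IvParameterSpec

object AesCtrSketch {
  def main(args: Array[String]): Unit = {
    // Key: produced by a KeyGenerator for the configured algorithm / key length.
    val keyGen = KeyGenerator.getInstance("AES")
    keyGen.init(128)
    val key = keyGen.generateKey()

    // Initial IV: 16 random bytes; CTR mode derives per-block counter values from it.
    val iv = new Array[Byte](16)
    new SecureRandom().nextBytes(iv)

    val encryptor = Cipher.getInstance("AES/CTR/NoPadding")
    encryptor.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))
    val ciphertext = encryptor.doFinal("shuffle block bytes".getBytes("UTF-8"))

    // Decryption uses the same key and the same initial IV.
    val decryptor = Cipher.getInstance("AES/CTR/NoPadding")
    decryptor.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv))
    println(new String(decryptor.doFinal(ciphertext), "UTF-8"))
  }
}
{code}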

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data path 
> safer. This feature is necessary in Spark. AES is a specification for the 
> encryption of electronic data; there are five common modes of operation, and 
> CTR is one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; they are also 
> used in Hadoop's encrypted shuffle. JceAesCtrCryptoCodec uses the encryption 
> algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the ones 
> OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11657) Bad Dataframe data read from parquet

2015-11-11 Thread Virgil Palanciuc (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Virgil Palanciuc updated SPARK-11657:
-
Summary: Bad Dataframe data read from parquet  (was: Bad data read using 
dataframes)

> Bad Dataframe data read from parquet
> 
>
> Key: SPARK-11657
> URL: https://issues.apache.org/jira/browse/SPARK-11657
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.5.1, 1.5.2
> Environment: EMR (yarn)
>Reporter: Virgil Palanciuc
>Priority: Critical
> Attachments: sample.tgz
>
>
> I get strange behaviour when reading parquet data:
> {code}
> scala> val data = sqlContext.read.parquet("hdfs:///sample")
> data: org.apache.spark.sql.DataFrame = [clusterSize: int, clusterName: 
> string, clusterData: array, dpid: int]
> scala> data.take(1)/// this returns garbage
> res0: Array[org.apache.spark.sql.Row] = 
> Array([1,56169A947F000101,WrappedArray(164594606101815510825479776971),813])
>  
> scala> data.collect()/// this works
> res1: Array[org.apache.spark.sql.Row] = 
> Array([1,6A01CACD56169A947F000101,WrappedArray(77512098164594606101815510825479776971),813])
> {code}
> I've attached the "hdfs:///sample" directory to this bug report



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10954) Parquet version in the "created_by" metadata field of Parquet files written by Spark 1.5 and 1.6 is wrong

2015-11-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000222#comment-15000222
 ] 

Cheng Lian commented on SPARK-10954:


Figured out the reason why {{created_by}} is wrong in Spark while investigating 
another issue.

It turned out that parquet-mr 1.7.0 still looks for 
{{META-INF/maven/com.twitter/parquet-column/pom.properties}} on the classpath 
of the jar from which class {{o.a.parquet.Version}} is loaded to obtain version 
information. This is a bug, but it doesn't affect normal Parquet users, since 
that properties file doesn't exist there (com.twitter is the old group ID; 
parquet-mr 1.7.0 moved to org.apache.parquet). However, Spark 1.5 included 
parquet-hadoop-bundle 1.6.0 to fix a Hive compatibility issue 
([PR #7867|https://github.com/apache/spark/pull/7867]), and that dependency 
happens to contain the missing properties file. Because Spark bundles all 
dependencies into an uber assembly jar, Parquet reads the wrong version 
information and writes it into the generated Parquet files.

Haven't figured out a workaround for this issue though.
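
One way to confirm this diagnosis on a given deployment (a debugging sketch, 
not a workaround) is to ask the classloader where that properties file is 
coming from:

{code}
object ParquetVersionProbe {
  def main(args: Array[String]): Unit = {
    val path = "META-INF/maven/com.twitter/parquet-column/pom.properties"
    // If this resolves into the Spark assembly (via the bundled parquet-hadoop-bundle 1.6.0),
    // parquet-mr 1.7.0 will read "1.6.0" from it and stamp that into created_by.
    val url = Option(getClass.getClassLoader.getResource(path))
    println(url.map(_.toString).getOrElse(s"$path not found on the classpath"))
  }
}
{code}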

> Parquet version in the "created_by" metadata field of Parquet files written 
> by Spark 1.5 and 1.6 is wrong
> -
>
> Key: SPARK-10954
> URL: https://issues.apache.org/jira/browse/SPARK-10954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 1.6.0
>Reporter: Cheng Lian
>Assignee: Gayathri Murali
>Priority: Minor
>
> We've upgraded to parquet-mr 1.7.0 in Spark 1.5, but the {{created_by}} field 
> still says 1.6.0. This issue can be reproduced by generating any Parquet file 
> with Spark 1.5, and then check the metadata with {{parquet-meta}} CLI tool:
> {noformat}
> $ parquet-meta /tmp/parquet/dec
> file:
> file:/tmp/parquet/dec/part-r-0-f210e968-1be5-40bc-bcbc-007f935e6dc7.gz.parquet
> creator: parquet-mr version 1.6.0
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"dec","type":"decimal(20,2)","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> -
> dec: OPTIONAL FIXED_LEN_BYTE_ARRAY O:DECIMAL R:0 D:1
> row group 1: RC:10 TS:140 OFFSET:4
> -
> dec:  FIXED_LEN_BYTE_ARRAY GZIP DO:0 FPO:4 SZ:99/140/1.41 VC:10 
> ENC:PLAIN,BIT_PACKED,RLE
> {noformat}
> Note that this field is written by parquet-mr rather than Spark. However, 
> writing Parquet files using parquet-mr 1.7.0 directly without Spark 1.5 only 
> shows {{parquet-mr}} without any version number. Files written by parquet-mr 
> 1.8.1 without Spark look fine though.
> Currently this isn't a big issue. But parquet-mr 1.8 checks for this field to 
> workaround PARQUET-251.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11656) support typed aggregate in project list

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11656:


Assignee: Apache Spark

> support typed aggregate in project list
> ---
>
> Key: SPARK-11656
> URL: https://issues.apache.org/jira/browse/SPARK-11656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11656) support typed aggregate in project list

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000421#comment-15000421
 ] 

Apache Spark commented on SPARK-11656:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9630

> support typed aggregate in project list
> ---
>
> Key: SPARK-11656
> URL: https://issues.apache.org/jira/browse/SPARK-11656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11656) support typed aggregate in project list

2015-11-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11656:


Assignee: (was: Apache Spark)

> support typed aggregate in project list
> ---
>
> Key: SPARK-11656
> URL: https://issues.apache.org/jira/browse/SPARK-11656
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11154) make specification spark.yarn.executor.memoryOverhead consistent with typical JVM options

2015-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000441#comment-15000441
 ] 

Sean Owen commented on SPARK-11154:
---

I think we'd have to make new properties to maintain compatibility. However I 
agree it's confusing. I think it's therefore not worth fixing in 1.x. At best, 
target this for 2.x.

> make specification spark.yarn.executor.memoryOverhead consistent with 
> typical JVM options
> --
>
> Key: SPARK-11154
> URL: https://issues.apache.org/jira/browse/SPARK-11154
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> spark.yarn.executor.memoryOverhead is currently specified in megabytes by 
> default, but it would be nice to allow users to specify the size as though it 
> were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended 
> to the end to explicitly specify megabytes or gigabytes.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-11655:
--

 Summary: SparkLauncherBackendSuite leaks child processes
 Key: SPARK-11655
 URL: https://issues.apache.org/jira/browse/SPARK-11655
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 1.6.0
Reporter: Josh Rosen
Priority: Blocker
 Attachments: screenshot-1.png

We've been combatting an orphaned process issue on AMPLab Jenkins since October 
and I finally was able to dig in and figure out what's going on.

After some sleuthing and working around OS limits and JDK bugs, I was able to 
get the full launch commands for the hanging orphaned processes. It looks like 
they're all running spark-submit:

{code}
org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
 -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
{code}

Based on the output of some Ganglia graphs, I was able to figure out that these 
leaks started around October 9.

This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
merged, which added LauncherBackendSuite. The launch arguments used in this 
suite seem to line up with the arguments that I observe in the hanging 
processes' {{jps}} output: 
https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46

Interestingly, Jenkins doesn't show test timing or output for this suite! I 
think that what might be happening is that we have a mixed Scala/Java package, 
so maybe the two test runner XML files aren't being merged properly: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/

Whenever I try running this suite locally, it looks like it ends up creating a 
zombie SparkSubmit process! I think that what's happening is that the 
launcher's {{handle.kill()}} call ends up destroying the bash {{spark-submit}} 
subprocess such that its child process (a JVM) leaks.

I think that we'll have to do something similar to what we do in PySpark when 
launching a child JVM from a Python / Bash process: connect it to a socket or 
stream such that it can detect its parent's death and clean up after itself 
appropriately.

/cc [~shaneknapp] and [~vanzin].
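
As a rough sketch of the "watch a stream to detect parent death" idea 
(hypothetical names; the actual launcher protocol may differ): the parent keeps 
the child's stdin pipe open, and the child exits as soon as that stream reaches 
EOF.

{code}
object ParentWatchdog {
  // Called from the child JVM's main(): exits the process when stdin is closed,
  // which happens when the parent dies or deliberately closes the pipe.
  def start(): Unit = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        try {
          while (System.in.read() != -1) { /* ignore anything the parent writes */ }
        } catch {
          case _: java.io.IOException => // treat a broken pipe the same as EOF
        }
        System.exit(1) // parent is gone: shut down instead of becoming a zombie
      }
    })
    t.setDaemon(true)
    t.setName("parent-process-watchdog")
    t.start()
  }
}
{code}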



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11655:
---
Attachment: screenshot-1.png

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: screenshot-1.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11655:
---
Description: 
We've been combatting an orphaned process issue on AMPLab Jenkins since October 
and I finally was able to dig in and figure out what's going on.

After some sleuthing and working around OS limits and JDK bugs, I was able to 
get the full launch commands for the hanging orphaned processes. It looks like 
they're all running spark-submit:

{code}
org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
 -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
{code}

Based on the output of some Ganglia graphs, I was able to figure out that these 
leaks started around October 9.

 !screenshot-1.png|thumbnail! 

This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
merged, which added LauncherBackendSuite. The launch arguments used in this 
suite seem to line up with the arguments that I observe in the hanging 
processes' {{jps}} output: 
https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46

Interestingly, Jenkins doesn't show test timing or output for this suite! I 
think that what might be happening is that we have a mixed Scala/Java package, 
so maybe the two test runner XML files aren't being merged properly: 
https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/

Whenever I try running this suite locally, it looks like it ends up creating a 
zombie SparkSubmit process! I think that what's happening is that the 
launcher's {{handle.kill()}} call ends up destroying the bash {{spark-submit}} 
subprocess such that its child process (a JVM) leaks.

I think that we'll have to do something similar to what we do in PySpark when 
launching a child JVM from a Python / Bash process: connect it to a socket or 
stream such that it can detect its parent's death and clean up after itself 
appropriately.

/cc [~shaneknapp] and [~vanzin].

  was:
We've been combatting an orphaned process issue on AMPLab Jenkins since October 
and I finally was able to dig in and figure out what's going on.

After some sleuthing and working around OS limits and JDK bugs, I was able to 
get the full launch commands for the hanging orphaned processes. It looks like 
they're all running spark-submit:

{code}
org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
 -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
{code}

Based on the output of some Ganglia graphs, I was able to figure out that these 
leaks started around October 9:

 !screenshot-1.png|thumbnail! 

This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
merged, which added LauncherBackendSuite. The launch arguments used in this 
suite seem to line up with the arguments that I observe in the hanging 

[jira] [Updated] (SPARK-11655) SparkLauncherBackendSuite leaks child processes

2015-11-11 Thread shane knapp (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shane knapp updated SPARK-11655:

Attachment: year_or_doom.png
month_of_doom.png

> SparkLauncherBackendSuite leaks child processes
> ---
>
> Key: SPARK-11655
> URL: https://issues.apache.org/jira/browse/SPARK-11655
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Priority: Blocker
> Attachments: month_of_doom.png, screenshot-1.png, year_or_doom.png
>
>
> We've been combatting an orphaned process issue on AMPLab Jenkins since 
> October and I finally was able to dig in and figure out what's going on.
> After some sleuthing and working around OS limits and JDK bugs, I was able to 
> get the full launch commands for the hanging orphaned processes. It looks 
> like they're all running spark-submit:
> {code}
> org.apache.spark.deploy.SparkSubmit --master local-cluster[1,1,1024] --conf 
> spark.driver.extraClassPath=/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/test-classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/core/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/launcher/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/common/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/network/shuffle/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/unsafe/target/scala-2.10/classes:/home/jenkins/workspace/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE/hadoop2.0/label/spark-test/tags/target/scala-2.10/
>  -Xms1g -Xmx1g -Dtest.appender=console -XX:MaxPermSize=256m
> {code}
> Based on the output of some Ganglia graphs, I was able to figure out that 
> these leaks started around October 9.
>  !screenshot-1.png|thumbnail! 
> This roughly lines up with when https://github.com/apache/spark/pull/7052 was 
> merged, which added LauncherBackendSuite. The launch arguments used in this 
> suite seem to line up with the arguments that I observe in the hanging 
> processes' {{jps}} output: 
> https://github.com/apache/spark/blame/1bc41125ee6306e627be212969854f639969c440/core/src/test/scala/org/apache/spark/launcher/LauncherBackendSuite.scala#L46
> Interestingly, Jenkins doesn't show test timing or output for this suite! I 
> think that what might be happening is that we have a mixed Scala/Java 
> package, so maybe the two test runner XML files aren't being merged properly: 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.5-SBT/746/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/testReport/org.apache.spark.launcher/
> Whenever I try running this suite locally, it looks like it ends up creating 
> a zombie SparkSubmit process! I think that what's happening is that the 
> launcher's {{handle.kill()}} call ends up destroying the bash 
> {{spark-submit}} subprocess such that its child process (a JVM) leaks.
> I think that we'll have to do something similar to what we do in PySpark when 
> launching a child JVM from a Python / Bash process: connect it to a socket or 
> stream such that it can detect its parent's death and clean up after itself 
> appropriately.
> /cc [~shaneknapp] and [~vanzin].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11654) add reduce to GroupedDataset

2015-11-11 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11654:
---

 Summary: add reduce to GroupedDataset
 Key: SPARK-11654
 URL: https://issues.apache.org/jira/browse/SPARK-11654
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11154) make specification spark.yarn.executor.memoryOverhead consistent with typical JVM options

2015-11-11 Thread Dustin Cote (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000419#comment-15000419
 ] 

Dustin Cote commented on SPARK-11154:
-

[~Kitard] I think the naming convention and strategy make sense. Someone with 
more familiarity with the code base should probably comment on the files that 
need to change, though.

> make specification spark.yarn.executor.memoryOverhead consistent with 
> typical JVM options
> --
>
> Key: SPARK-11154
> URL: https://issues.apache.org/jira/browse/SPARK-11154
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Submit
>Reporter: Dustin Cote
>Priority: Minor
>
> spark.yarn.executor.memoryOverhead is currently specified in megabytes by 
> default, but it would be nice to allow users to specify the size as though it 
> were a typical -Xmx option to a JVM where you can have 'm' and 'g' appended 
> to the end to explicitly specify megabytes or gigabytes.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10865) [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type

2015-11-11 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000404#comment-15000404
 ] 

Dominic Ricard commented on SPARK-10865:


Is it possible for this fix to be included in the 1.5.2 release? It is causing 
major issues with some BI tools that apply transformations and expect an INT 
from ceil().

Thanks
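
Until a fixed release is available, one possible workaround on the DataFrame 
side (a sketch, assuming an explicit cast is acceptable to the BI tool) is to 
cast the result back to BIGINT:

{code}
// spark-shell style: sqlContext and its implicits are provided by the shell
import org.apache.spark.sql.functions.{ceil, col}
import sqlContext.implicits._

val df = Seq(Tuple1(2642.12), Tuple1(10.5)).toDF("amount")

// ceil() currently yields a DOUBLE (e.g. 2643.0); casting restores the BIGINT
// that the ceil/ceiling definition promises (2643).
df.select(ceil(col("amount")).cast("bigint").as("amount_ceil")).show()
{code}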

> [Spark SQL] [UDF] the ceil/ceiling function got wrong return value type
> ---
>
> Key: SPARK-10865
> URL: https://issues.apache.org/jira/browse/SPARK-10865
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yi Zhou
>Assignee: Cheng Hao
> Fix For: 1.6.0
>
>
> As per the ceil/ceiling definition, it should have a BIGINT return value:
> - ceil(DOUBLE a), ceiling(DOUBLE a)
> - Returns the minimum BIGINT value that is equal to or greater than a.
> But the current Spark implementation returns the wrong value type, e.g.:
> select ceil(2642.12) from udf_test_web_sales limit 1;
> 2643.0
> The Hive implementation returns the expected type:
> hive> select ceil(2642.12) from udf_test_web_sales limit 1;
> OK
> 2643



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11656) support typed aggregate in project list

2015-11-11 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11656:
---

 Summary: support typed aggregate in project list
 Key: SPARK-11656
 URL: https://issues.apache.org/jira/browse/SPARK-11656
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11648) IllegalReferenceCountException in Spark workloads

2015-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000288#comment-15000288
 ] 

Sean Owen commented on SPARK-11648:
---

Duplicate of SPARK-11617 it seems

> IllegalReferenceCountException in Spark workloads
> -
>
> Key: SPARK-11648
> URL: https://issues.apache.org/jira/browse/SPARK-11648
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Nishkam Ravi
>
> This exception is thrown for multiple workloads. Can be reproduced with 
> WordCount/PageRank/TeraSort.
> -
> Stack trace:
> 15/11/10 01:11:31 WARN TaskSetManager: Lost task 6.0 in stage 1.0 (TID 459, 
> 10.20.78.15): io.netty.util.IllegalReferenceCountException: refCnt: 0
>   at 
> io.netty.buffer.AbstractByteBuf.ensureAccessible(AbstractByteBuf.java:1178)
>   at io.netty.buffer.AbstractByteBuf.checkIndex(AbstractByteBuf.java:1129)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:180)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:687)
>   at io.netty.buffer.CompositeByteBuf.getBytes(CompositeByteBuf.java:42)
>   at io.netty.buffer.SlicedByteBuf.getBytes(SlicedByteBuf.java:181)
>   at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:677)
>   at io.netty.buffer.ByteBufInputStream.read(ByteBufInputStream.java:120)
>   at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:360)
>   at com.ning.compress.lzf.ChunkDecoder.readHeader(ChunkDecoder.java:213)
>   at 
> com.ning.compress.lzf.impl.UnsafeChunkDecoder.decodeChunk(UnsafeChunkDecoder.java:49)
>   at 
> com.ning.compress.lzf.LZFInputStream.readyBuffer(LZFInputStream.java:363)
>   at com.ning.compress.lzf.LZFInputStream.read(LZFInputStream.java:193)
>   at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310)
>   at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2323)
>   at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2794)
>   at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:801)
>   at java.io.ObjectInputStream.(ObjectInputStream.java:299)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.(JavaSerializer.scala:64)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:123)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:64)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:60)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:152)
>   at 
> org.apache.spark.Aggregator.combineCombinersByKey(Aggregator.scala:58)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:83)
>   at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10371) Optimize sequential projections

2015-11-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000337#comment-15000337
 ] 

Apache Spark commented on SPARK-10371:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9627

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Nong Li
>Priority: Critical
> Fix For: 1.6.0
>
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce DataFrames like the following 
> columns: a, b, c, d, where a is from raw input, b = udf_b(a), c = udf_c(b), 
> and d = udf_d(c). Some UDFs could be expensive. However, if we materialize c 
> and d, udf_b, and udf_c are triggered twice, i.e., value c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.
> {code}
> val input = sqlContext.range(10)
> val output = input.withColumn("x", col("id") + 1).withColumn("y", col("x") * 
> 2)
> output.explain(true)
> == Parsed Logical Plan ==
> 'Project [*,('x * 2) AS y#254]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Analyzed Logical Plan ==
> id: bigint, x: bigint, y: bigint
> Project [id#252L,x#253L,(x#253L * cast(2 as bigint)) AS y#254L]
>  Project [id#252L,(id#252L + cast(1 as bigint)) AS x#253L]
>   LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Optimized Logical Plan ==
> Project [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS y#254L]
>  LogicalRDD [id#252L], MapPartitionsRDD[458] at range at :30
> == Physical Plan ==
> TungstenProject [id#252L,(id#252L + 1) AS x#253L,((id#252L + 1) * 2) AS 
> y#254L]
>  Scan PhysicalRDD[id#252L]
> Code Generation: true
> input: org.apache.spark.sql.DataFrame = [id: bigint]
> output: org.apache.spark.sql.DataFrame = [id: bigint, x: bigint, y: bigint]
> {code}
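
To make the recomputation concrete (a spark-shell sketch with a hypothetical 
"expensive" UDF, not part of the proposed optimizer change): once the 
projections are collapsed, both x and y re-evaluate the UDF, and persisting the 
intermediate DataFrame is one manual way to pay for it only once today.

{code}
// spark-shell style: sqlContext is provided by the shell
import org.apache.spark.sql.functions.{col, udf}

val expensive = udf { (id: Long) => Thread.sleep(10); id + 1 }   // stand-in for udf_b

val input  = sqlContext.range(10)
val withX  = input.withColumn("x", expensive(col("id")))
val output = withX.withColumn("y", col("x") * 2)   // after collapsing, expensive(id) appears in both x and y

// Manual workaround: materialize the intermediate once and reuse it.
withX.persist()
withX.count()
val reused = withX.withColumn("y", col("x") * 2)
{code}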



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11583) Make MapStatus use less memory usage

2015-11-11 Thread Daniel Lemire (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000339#comment-15000339
 ] 

Daniel Lemire commented on SPARK-11583:
---

[~Qin Yao] [~aimran50]

If you want to set up a documented benchmark with accompanying analysis, I'd be 
glad to help if my help is needed.

Ideally, a benchmark should be set up around interesting or representative 
workloads so that it is as relevant as possible to real applications.

No data structure, no matter which one, is always best in all possible 
scenarios. There are certainly cases where an uncompressed BitSet is the best 
choice, other cases where a hash set is best, and so forth.

What matters is what works for the applications at hand.
 

> Make MapStatus use less memory usage
> ---
>
> Key: SPARK-11583
> URL: https://issues.apache.org/jira/browse/SPARK-11583
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Reporter: Kent Yao
>
> In the resolved issue https://issues.apache.org/jira/browse/SPARK-11271, as I 
> said, using BitSet can save ≈20% memory usage compared to RoaringBitMap. 
> For a Spark job containing quite a lot of tasks, 20% seems a drop in the ocean.
> Essentially, BitSet uses long[]; for example, a BitSet over 200k blocks is a 
> long[3125].
> So we can use a HashSet[Int] to store reduceIds (when non-empty blocks are 
> dense, store the reduceIds of empty blocks; when sparse, store the non-empty 
> ones).
> For dense cases: if HashSet[Int](numNonEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingNoEmptyBlocks
> For sparse cases: if HashSet[Int](numEmptyBlocks).size < 
> BitSet[totalBlockNum], I use MapStatusTrackingEmptyBlocks
> Sparse case, 299/300 blocks are empty:
> sc.makeRDD(1 to 3, 3000).groupBy(x=>x).top(5)
> Dense case, no block is empty:
> sc.makeRDD(1 to 900, 3000).groupBy(x=>x).top(5)
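
As a back-of-the-envelope illustration of the selection heuristic sketched 
above (the per-entry byte cost is an assumption for illustration, not a 
measured Spark number):

{code}
object MapStatusSizing {
  // Assumed per-entry cost of a HashSet[Int]: boxed value plus hash-table overhead.
  val BytesPerHashSetEntry = 48L

  // A BitSet over all reduce partitions costs one bit per partition,
  // e.g. 200,000 blocks -> long[3125] -> ~25 KB.
  def bitSetBytes(totalBlocks: Int): Long = math.ceil(totalBlocks / 8.0).toLong

  def hashSetBytes(trackedBlocks: Int): Long = trackedBlocks * BytesPerHashSetEntry

  /** Track the smaller of {empty, non-empty} reduceIds in a HashSet if that beats a full BitSet. */
  def choose(totalBlocks: Int, nonEmptyBlocks: Int): String = {
    val tracked = math.min(nonEmptyBlocks, totalBlocks - nonEmptyBlocks)
    if (hashSetBytes(tracked) < bitSetBytes(totalBlocks)) "HashSet of reduceIds"
    else "BitSet over all blocks"
  }
}
{code}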



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6628) ClassCastException occurs when executing sql statement "insert into" on hbase table

2015-11-11 Thread Francesco Palmiotto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000345#comment-15000345
 ] 

Francesco Palmiotto commented on SPARK-6628:


Spark 1.5.1 is affected.

> ClassCastException occurs when executing sql statement "insert into" on hbase 
> table
> ---
>
> Key: SPARK-6628
> URL: https://issues.apache.org/jira/browse/SPARK-6628
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 1 in stage 3.0 failed 4 times, most recent failure: Lost task 1.3 in 
> stage 3.0 (TID 12, vm-17): java.lang.ClassCastException: 
> org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat cannot be cast to 
> org.apache.hadoop.hive.ql.io.HiveOutputFormat
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat$lzycompute(hiveWriterContainers.scala:72)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.outputFormat(hiveWriterContainers.scala:71)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.getOutputName(hiveWriterContainers.scala:91)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.initWriters(hiveWriterContainers.scala:115)
> at 
> org.apache.spark.sql.hive.SparkHiveWriterContainer.executorSideSetup(hiveWriterContainers.scala:84)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:112)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11653) Would be very useful if spark-daemon.sh supported foreground operations

2015-11-11 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15000373#comment-15000373
 ] 

Sean Owen commented on SPARK-11653:
---

To avoid duplication, it's better to reopen the original issue than to open a 
new one. Before we do that, my question is: does your patch address the 
comments on the last attempt? It went stale.

> Would be very useful if spark-daemon.sh supported foreground operations
> ---
>
> Key: SPARK-11653
> URL: https://issues.apache.org/jira/browse/SPARK-11653
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.5.1
>Reporter: Adrian Bridgett
>Priority: Minor
>
> See also SPARK-5964.  I've raised a new ticket for a few reasons:
> - alternative more minimal patch (I've not submitted a PR as I thought any 
> discussion would be better here): 
> https://github.com/abridgett/spark/tree/feature/add_spark_daemon_foreground
> - couldn't re-open the original ticket and didn't know if a comment would 
> suffice
> There are quite a few use cases for this, e.g. when running under Mesos the 
> shuffle service needs to be running and trying to do this under Marathon is a 
> bit painful without this feature.  
> I see that there was a plan for these scripts to be rewritten in Python, 
> however until then this would be very useful (and can feed into the rewrite 
> as well).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


