[GitHub] spark pull request: [SPARK-2288] Hide ShuffleBlockManager behind S...

2014-07-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1241#issuecomment-48569956
  
Actually it's still necessary once the PR gets merged. It basically 
compares all the public interfaces at the bytecode level against 1.0 and sees 
whether any public interfaces have changed. This check is in place to ensure 
API compatibility throughout the 1.x release cycle.

Unfortunately, all the methods that are labeled "private[package_name]" 
lose their private visibility at the bytecode level, so the compatibility 
checker is more conservative and we need to manually mark these interface 
changes.
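
For illustration only, a minimal, hypothetical sketch (names are made up) of 
why the checker cannot see the restriction: Scala enforces package-qualified 
access at compile time, not in the emitted class file, so the member below is 
public at the bytecode level and indistinguishable from a real public API.

    package org.apache.spark.example

    private[example] object Hidden {
      // Public in the emitted bytecode (javap lists it as public), even though
      // Scala code outside org.apache.spark.example cannot call it.
      private[example] def internalApi(x: Int): Int = x + 1
    }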


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14751335
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/csv/CsvRDD.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.csv
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{GenericMutableRow, 
AttributeReference, Row}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object CsvRDD extends Logging {
+
+  private[sql] def inferSchema(
+  csv: RDD[String],
+  delimiter: String,
+  quote: String,
--- End diff --

Is it ever possible for "quote" to be something that's longer than one character?




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-09 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14751361
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -130,6 +131,47 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, JsonRDD.inferSchema(json, samplingRatio))
 
   /**
+   * Loads a CSV file (according to RFC 4180) and returns the result as a 
[[SchemaRDD]].
+   *
+   * NOTE: If there are new line characters inside quoted fields this 
method may fail to
+   * parse correctly, because the two lines may be in different 
partitions. Use
+   * [[SQLContext#csvRDD]] to parse such files.
+   *
+   * @param path path to input file
+   * @param delimiter Optional delimiter (default is comma)
+   * @param quote Optional quote character or string (default is '"')
+   * @param header Optional flag to indicate first line of each file is 
the header
+   *   (default is false)
+   */
+  def csvFile(path: String,
+  delimiter: String = ",",
+  quote: String = "\"",
--- End diff --

Is it possible to use a combination of single quotes and double quotes in 
the same file?




[GitHub] spark pull request: Clean up SparkKMeans example's code

2014-07-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1352#issuecomment-48570213
  
Merging in master. Thanks!




[GitHub] spark pull request: SPARK-2115: Stage kill link is too close to st...

2014-07-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1350#issuecomment-48570288
  
Jenkins, test this please.




[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

2014-07-09 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1346#issuecomment-48570378
  
@yhuai I haven't looked at the changes yet, but can you make sure the end 
API is usable in Java?




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14752632
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/csv/CsvRDD.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.csv
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{GenericMutableRow, 
AttributeReference, Row}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object CsvRDD extends Logging {
+
+  private[sql] def inferSchema(
+  csv: RDD[String],
+  delimiter: String,
+  quote: String,
--- End diff --

What are examples of quotes that are more than one character long?




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14752678
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -130,6 +131,47 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, JsonRDD.inferSchema(json, samplingRatio))
 
   /**
+   * Loads a CSV file (according to RFC 4180) and returns the result as a 
[[SchemaRDD]].
+   *
+   * NOTE: If there are new line characters inside quoted fields this 
method may fail to
+   * parse correctly, because the two lines may be in different 
partitions. Use
+   * [[SQLContext#csvRDD]] to parse such files.
+   *
+   * @param path path to input file
+   * @param delimiter Optional delimiter (default is comma)
+   * @param quote Optional quote character or string (default is '"')
+   * @param header Optional flag to indicate first line of each file is 
the header
+   *   (default is false)
+   */
+  def csvFile(path: String,
+  delimiter: String = ",",
+  quote: String = "\"",
--- End diff --

OK, if the standard dictates that only double quotes are allowed, it is 
actually confusing to allow arbitrary quotes (it is one thing to allow single 
quotes since they might be common, but I wouldn't allow arbitrary-length 
quotes). This is adding unnecessary complexity to both the interface and the 
implementation.
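
As a hedged illustration only (not this PR's code; names are hypothetical), a 
tokenizer that takes a single quote character keeps both the signature and the 
parsing logic simple:

    def tokenizeLine(line: String, delimiter: Char = ',', quote: Char = '"'): Seq[String] = {
      val fields = scala.collection.mutable.ArrayBuffer(new StringBuilder)
      var inQuotes = false
      line.foreach {
        case `quote`                  => inQuotes = !inQuotes        // toggle quoted state
        case `delimiter` if !inQuotes => fields += new StringBuilder  // start a new field
        case c                        => fields.last.append(c)
      }
      fields.map(_.toString)
    }

For example, tokenizeLine("a,\"b,c\",d") yields Seq("a", "b,c", "d"). (Escaped 
quotes and other RFC 4180 details are intentionally left out of the sketch.)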




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14753118
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/csv/CsvRDD.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.csv
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{GenericMutableRow, 
AttributeReference, Row}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object CsvRDD extends Logging {
+
+  private[sql] def inferSchema(
+  csv: RDD[String],
+  delimiter: String,
+  quote: String,
--- End diff --

Do people actually use this?




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14753240
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -130,6 +131,47 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, JsonRDD.inferSchema(json, samplingRatio))
 
   /**
+   * Loads a CSV file (according to RFC 4180) and returns the result as a 
[[SchemaRDD]].
+   *
+   * NOTE: If there are new line characters inside quoted fields this 
method may fail to
+   * parse correctly, because the two lines may be in different 
partitions. Use
+   * [[SQLContext#csvRDD]] to parse such files.
+   *
+   * @param path path to input file
+   * @param delimiter Optional delimiter (default is comma)
+   * @param quote Optional quote character or string (default is '"')
+   * @param header Optional flag to indicate first line of each file is 
the header
+   *   (default is false)
+   */
+  def csvFile(path: String,
+  delimiter: String = ",",
+  quote: String = "\"",
--- End diff --

Actually, both Pandas and R only allow single-character quotes. I think it 
makes sense to allow changing the quote character, but allowing multi-character 
quotes is just plain confusing and also inconsistent with the standard and 
other tools.




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14753286
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -130,6 +131,47 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, JsonRDD.inferSchema(json, samplingRatio))
 
   /**
+   * Loads a CSV file (according to RFC 4180) and returns the result as a 
[[SchemaRDD]].
+   *
+   * NOTE: If there are new line characters inside quoted fields this 
method may fail to
+   * parse correctly, because the two lines may be in different 
partitions. Use
+   * [[SQLContext#csvRDD]] to parse such files.
+   *
+   * @param path path to input file
+   * @param delimiter Optional delimiter (default is comma)
+   * @param quote Optional quote character or string (default is '"')
+   * @param header Optional flag to indicate first line of each file is 
the header
+   *   (default is false)
+   */
+  def csvFile(path: String,
+  delimiter: String = ",",
+  quote: String = "\"",
+  header: Boolean = false): SchemaRDD = {
+val csv = sparkContext.textFile(path)
--- End diff --

textFile creates an RDD of lines. This actually doesn't work if a record 
contains new lines (inside quotes). 




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14753321
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -130,6 +131,47 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, JsonRDD.inferSchema(json, samplingRatio))
 
   /**
+   * Loads a CSV file (according to RFC 4180) and returns the result as a 
[[SchemaRDD]].
+   *
+   * NOTE: If there are new line characters inside quoted fields this 
method may fail to
+   * parse correctly, because the two lines may be in different 
partitions. Use
+   * [[SQLContext#csvRDD]] to parse such files.
+   *
+   * @param path path to input file
+   * @param delimiter Optional delimiter (default is comma)
+   * @param quote Optional quote character or string (default is '"')
+   * @param header Optional flag to indicate first line of each file is 
the header
+   *   (default is false)
+   */
+  def csvFile(path: String,
+  delimiter: String = ",",
+  quote: String = "\"",
+  header: Boolean = false): SchemaRDD = {
+val csv = sparkContext.textFile(path)
--- End diff --

To do this properly, we would either need a new input format that handles 
CSV line splits, or assemble the lines back from textFile.
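
As a hedged sketch only (not part of this PR), one way to sidestep the 
partition-boundary problem is to read each file whole with wholeTextFiles and 
split records with a quote-aware scan, at the cost of parallelism within a 
single file:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hedged sketch: keep newlines inside quoted fields within one record by
    // reading whole files, then split on newlines that fall outside quotes
    // (assumes RFC 4180 style double quotes).
    def csvRecords(sc: SparkContext, path: String): RDD[String] =
      sc.wholeTextFiles(path).flatMap { case (_, contents) =>
        val records = scala.collection.mutable.ArrayBuffer(new StringBuilder)
        var inQuotes = false
        contents.foreach {
          case '"'               => inQuotes = !inQuotes; records.last.append('"')
          case '\n' if !inQuotes => records += new StringBuilder
          case c                 => records.last.append(c)
        }
        records.iterator.map(_.toString).filter(_.nonEmpty)
      }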




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14753430
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/csv/CsvRDD.scala ---
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.csv
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.expressions.{GenericMutableRow, 
AttributeReference, Row}
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object CsvRDD extends Logging {
+
+  private[sql] def inferSchema(
+  csv: RDD[String],
+  delimiter: String,
+  quote: String,
+  useHeader: Boolean): LogicalPlan = {
+
+// Constructing schema
+// TODO: Infer types based on a sample and/or let user specify 
types/schema
+val firstLine = csv.first()
+// Assuming first row is representative and using it to determine 
number of fields
+val firstRow = new CsvTokenizer(Seq(firstLine).iterator, delimiter, 
quote).next()
+val header = if (useHeader) {
+  firstRow
+} else {
+  firstRow.zipWithIndex.map { case (value, index) => s"V$index" }
+}
+
+val schemaFields = header.map { fieldName =>
+  StructField(fieldName.asInstanceOf[String], StringType, nullable = 
true)
+}
+val schema = StructType(schemaFields)
+
+val parsedCSV = csv.mapPartitions { iter =>
+  // When using header, any input line that equals firstLine is 
assumed to be header
+  val csvIter = if (useHeader) {
+iter.filter(_ != firstLine)
+  } else {
+iter
+  }
+  val tokenIter = new CsvTokenizer(csvIter, delimiter, quote)
+  parseCSV(tokenIter, schema)
+}
+
+SparkLogicalPlan(ExistingRdd(asAttributes(schema), parsedCSV))
+  }
+
+  private def castToType(value: Any, dataType: DataType): Any = dataType 
match {
--- End diff --

We should consolidate all the type-related functions in @yhuai's PR on the 
public schema/type API. We have many similar code blocks like this in Spark 
SQL, and many of them are missing cases (e.g. this one misses Timestamp, 
Decimal, etc.).
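
As a hedged sketch (not the consolidated helper itself; the final shape should 
follow that PR), the missing cases would look roughly like this:

    import java.sql.Timestamp
    import org.apache.spark.sql.catalyst.types._

    // Hedged sketch of a fuller conversion; how the consolidated version handles
    // nulls, formats, and precision is up to the public type API work.
    def castToType(value: Any, dataType: DataType): Any = dataType match {
      case StringType    => value.toString
      case IntegerType   => value.toString.toInt
      case LongType      => value.toString.toLong
      case DoubleType    => value.toString.toDouble
      case BooleanType   => value.toString.toBoolean
      case DecimalType   => BigDecimal(value.toString)
      case TimestampType => Timestamp.valueOf(value.toString)
      case _             => value
    }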




[GitHub] spark pull request: [WIP] SPARK-2360: CSV import to SchemaRDDs

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1351#discussion_r14753602
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/csv/CsvTokenizer.scala ---
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.csv
+
+import scala.collection.mutable.ArrayBuffer
+
+/**
+ * Tokenizer based on RFC 4180 for comma separated values.
+ * It implements an iterator that returns each tokenized line as an 
Array[Any].
+ */
+private[sql] class CsvTokenizer(inputIter: Iterator[String],
--- End diff --

Can we use an existing tokenizer such as 
http://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/text/StrTokenizer.html
 ?

We already depend on commons-lang in Spark. 
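
For reference, a hedged, untested sketch of what reusing it might look like 
(whether StrTokenizer covers every RFC 4180 corner case this PR targets would 
still need checking):

    import org.apache.commons.lang3.text.StrTokenizer

    // Hedged sketch: commons-lang3's CSV-flavoured tokenizer with a configurable
    // delimiter and quote character.
    def tokenize(line: String, delimiter: Char = ',', quote: Char = '"'): Array[String] = {
      val tokenizer = StrTokenizer.getCSVInstance(line)
      tokenizer.setDelimiterChar(delimiter)
      tokenizer.setQuoteChar(quote)
      tokenizer.getTokenArray
    }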




[GitHub] spark pull request: SPARK-2115: Stage kill link is too close to st...

2014-07-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1350#issuecomment-48576867
  
Thanks - merging this in master.




[GitHub] spark pull request: [SPARK-2393][SQL] Prototype implementation of ...

2014-07-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1238#issuecomment-48577069
  
If this is no longer WIP, can you change the title to make it more 
descriptive? At the very least I wouldn't include the word "Prototype".




[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r14754430
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala 
---
@@ -93,47 +92,56 @@ abstract class DataType {
   }
 
   def isPrimitive: Boolean = false
+
+  def simpleString: String
 }
 
-case object NullType extends DataType
+case object NullType extends DataType {
+  def simpleString: String = "null"
+}
 
 trait PrimitiveType extends DataType {
   override def isPrimitive = true
 }
 
 abstract class NativeType extends DataType {
-  type JvmType
-  @transient val tag: TypeTag[JvmType]
-  val ordering: Ordering[JvmType]
+  private[sql] type JvmType
+  @transient private[sql] val tag: TypeTag[JvmType]
+  private[sql] val ordering: Ordering[JvmType]
 
-  @transient val classTag = {
+  @transient private[sql] val classTag = {
 val mirror = runtimeMirror(Utils.getSparkClassLoader)
 ClassTag[JvmType](mirror.runtimeClass(tag.tpe))
   }
 }
 
 case object StringType extends NativeType with PrimitiveType {
-  type JvmType = String
-  @transient lazy val tag = typeTag[JvmType]
-  val ordering = implicitly[Ordering[JvmType]]
+  private[sql] type JvmType = String
+  @transient private[sql] lazy val tag = typeTag[JvmType]
+  private[sql] val ordering = implicitly[Ordering[JvmType]]
+  def simpleString: String = "string"
 }
--- End diff --

While you're at it, add a blank line to separate each class.




[GitHub] spark pull request: [WIP][SPARK-2179][SQL] Public API for DataType...

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1346#discussion_r14754724
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala 
---
@@ -123,9 +123,12 @@ private[sql] trait SchemaRDDLike {
   def saveAsTable(tableName: String): Unit =
 sqlContext.executePlan(InsertIntoCreatedTable(None, tableName, 
logicalPlan)).toRdd
 
+  /** Returns the schema. */
+  def schema: StructType = queryExecution.analyzed.schema
+
   /** Returns the output schema in the tree format. */
-  def schemaString: String = queryExecution.analyzed.schemaString
+  def formattedSchemaString: String = schema.formattedSchemaString
--- End diff --

Do we have to change this API?




[GitHub] spark pull request: SPARK-2427: Fix Scala examples that use the wr...

2014-07-10 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1353#discussion_r14755685
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/SparkPageRank.scala ---
@@ -31,8 +31,12 @@ import org.apache.spark.{SparkConf, SparkContext}
  */
 object SparkPageRank {
   def main(args: Array[String]) {
+if (args.length < 1) {
+  System.err.println("Usage: SparkPageRank  ")
+  System.exit(1)
+}
 val sparkConf = new SparkConf().setAppName("PageRank")
-var iters = args(1).toInt
+val iters = if (args.length > 0) args(1).toInt else 100
--- End diff --

100 is pretty big. Can we set this to a small number, e.g. 10 or 20?




[GitHub] spark pull request: SPARK-2427: Fix Scala examples that use the wr...

2014-07-10 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1353#issuecomment-48640422
  
Jenkins, test this please.




[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1744#discussion_r15780210
  
--- Diff: dev/lint-python ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"
+SPARK_ROOT_DIR="$(dirname $SCRIPT_DIR)"
+PEP8_REPORT_PATH="$SPARK_ROOT_DIR/dev/pep8-report.txt"
+
+cd $SPARK_ROOT_DIR
+
+# Get pep8 at runtime so that we don't rely on it being installed on the 
build server.
+#+ See: https://github.com/apache/spark/pull/1744#issuecomment-50982162
+#+ TODOs:
+#+  - Dynamically determine latest release version of pep8 and use that.
+#+  - Download this from a more reliable source. (GitHub raw can be flaky, 
apparently. (?))
+PEP8_SCRIPT_PATH="$SPARK_ROOT_DIR/dev/pep8.py"

+PEP8_SCRIPT_REMOTE_PATH="https://raw.githubusercontent.com/jcrocholl/pep8/1.5.7/pep8.py";
+
+curl --silent -o "$PEP8_SCRIPT_PATH" "$PEP8_SCRIPT_REMOTE_PATH"
+curl_status=$?
+
+if [ $curl_status -ne 0 ]
+then
+echo "Failed to download pep8.py from 
\"$PEP8_SCRIPT_REMOTE_PATH\"."
+exit $curl_status
+fi
+
+
+# There is no need to write this output to a file
+#+ first, but we do so so that the check status can
+#+ be output before the report, like with the
+#+ scalastyle and RAT checks.
+python $PEP8_SCRIPT_PATH ./python --exclude="cloudpickle.py" \
+> "$PEP8_REPORT_PATH"
+pep8_status=${PIPESTATUS[0]} #$?
+
+if [ $pep8_status -ne 0 ]
+then
--- End diff --

here too





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1744#discussion_r15780205
  
--- Diff: dev/lint-python ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"
+SPARK_ROOT_DIR="$(dirname $SCRIPT_DIR)"
+PEP8_REPORT_PATH="$SPARK_ROOT_DIR/dev/pep8-report.txt"
+
+cd $SPARK_ROOT_DIR
+
+# Get pep8 at runtime so that we don't rely on it being installed on the 
build server.
+#+ See: https://github.com/apache/spark/pull/1744#issuecomment-50982162
+#+ TODOs:
+#+  - Dynamically determine latest release version of pep8 and use that.
+#+  - Download this from a more reliable source. (GitHub raw can be flaky, 
apparently. (?))
+PEP8_SCRIPT_PATH="$SPARK_ROOT_DIR/dev/pep8.py"

+PEP8_SCRIPT_REMOTE_PATH="https://raw.githubusercontent.com/jcrocholl/pep8/1.5.7/pep8.py";
+
+curl --silent -o "$PEP8_SCRIPT_PATH" "$PEP8_SCRIPT_REMOTE_PATH"
+curl_status=$?
+
+if [ $curl_status -ne 0 ]
--- End diff --

Can we follow the rest of the Spark bash style and put "then" on the same line 
as the "if"? We can also reduce the level of indentation there.





[GitHub] spark pull request: [SPARK-2323] Exception in accumulator update s...

2014-08-04 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1772

[SPARK-2323] Exception in accumulator update should not crash DAGScheduler 
& SparkContext



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark accumulator-dagscheduler

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1772.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1772


commit 360b46d2cb5cf64137d08f5089588083dfe43771
Author: Reynold Xin 
Date:   2014-08-04T22:54:48Z

[SPARK-2323] Exception in accumulator update should not crash DAGScheduler 
& SparkContext.







[GitHub] spark pull request: [SPARK-2323] Exception in accumulator update s...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1772#issuecomment-51132429
  
This doesn't actually weaken the semantics because previously it would just 
crash. Given the 1.1 timeline, it is best to use the current semantics, which 
is really "no guarantee" (Patrick's nicer way of putting it is "attempted at 
least once").








[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51132704
  
Haha, I don't know if it responds to such complex phrases. 





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51132720
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2323] Exception in accumulator update s...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1772#issuecomment-51134330
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2323] Exception in accumulator update s...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1772#issuecomment-51141863
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2315] Implement drop, dropRight and dro...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1254#issuecomment-51142915
  
I'm not sure what's happening. Maybe Jenkins is lazy today. We can retry 
tomorrow, and if it doesn't work, create a new PR.
 





[GitHub] spark pull request: [SPARK-2323] Exception in accumulator update s...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1772#issuecomment-51146388
  
Merging this in master. Thanks.





[GitHub] spark pull request: [SPARK-2323] Exception in accumulator update s...

2014-08-04 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1772#issuecomment-51146442
  
Actually I merged it in master, branch-1.0, and branch-1.1.





[GitHub] spark pull request: [SPARK-2856] Decrease initial buffer size for ...

2014-08-04 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1780

[SPARK-2856] Decrease initial buffer size for Kryo to 64KB.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark kryo-init-size

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1780.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1780


commit 551b935c3db56cb214ebbea6922cc5b6d37d229a
Author: Reynold Xin 
Date:   2014-08-05T05:52:05Z

[SPARK-2856] Decrease initial buffer size for Kryo to 64KB.







[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-08-04 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1648#discussion_r15796222
  
--- Diff: 
sql/hive/compatibility/src/test/scala/org/apache/spark/sql/hive/execution/HiveCompatibilitySuite.scala
 ---
@@ -38,6 +39,7 @@ class HiveCompatibilitySuite extends HiveQueryFileTest 
with BeforeAndAfter {
 
   override def beforeAll() {
 TestHive.cacheTables = true
+TestHive.set(SQLConf.SHUFFLE_PARTITIONS, "2")
--- End diff --

We should keep it at 2 to speed up tests ...





[GitHub] spark pull request: [SPARK-2503] Lower shuffle output buffer (spar...

2014-08-04 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1781

[SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 
32KB.


This can substantially reduce memory usage during shuffle.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark 
SPARK-2503-spark.shuffle.file.buffer.kb

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1781.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1781


commit 1b8f72b387c10d53b54ae3a2e7eb6b9b72f14d36
Author: Reynold Xin 
Date:   2014-08-05T06:06:02Z

[SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 
32KB.

This can substantially reduce memory usage during shuffle.







[GitHub] spark pull request: [SPARK-2856] Decrease initial buffer size for ...

2014-08-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1780#discussion_r15797592
  
--- Diff: 
core/src/main/scala/org/apache/spark/serializer/KryoSerializer.scala ---
@@ -47,7 +47,9 @@ class KryoSerializer(conf: SparkConf)
   with Logging
   with Serializable {
 
-  private val bufferSize = conf.getInt("spark.kryoserializer.buffer.mb", 
2) * 1024 * 1024
+  private val bufferSize =
+(conf.getDouble("spark.kryoserializer.buffer.mb", 0.064) * 1024 * 
1024).toInt
--- End diff --

It's pretty obvious, isn't it?





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51164849
  
Maybe it was started before that fix got merged. Let's run this again.





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51164855
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2856] Decrease initial buffer size for ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1780#issuecomment-51164898
  
Thanks. Merging in master. @andrewor14, if you feel strongly about it, I can 
push a commit to add a one-line comment.






[GitHub] spark pull request: [SPARK-2859] Update url of Kryo project in rel...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1782#issuecomment-51165308
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-2585] Remove special handling of Hadoop...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1648#issuecomment-51171274
  
Jenkins, retest this please.





[GitHub] spark pull request: Set Spark SQL Hive compatibility test shuffle ...

2014-08-05 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1784

Set Spark SQL Hive compatibility test shuffle partitions to 2.

This should improve test runtime because the majority of the test runtime is 
scheduling and task overhead.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark sparksql-test-parallelism

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1784.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1784


commit 6e5ec73a5c069ba65ab093fb718b00b00dac3d36
Author: Reynold Xin 
Date:   2014-08-05T09:16:43Z

Set Spark SQL Hive compatibility test shuffle partitions to 2.

This should improve test runtime because the majority of the test runtime is 
scheduling and task overhead.







[GitHub] spark pull request: SPARK-2869 - Potential leak of Jdbc Connection...

2014-08-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1792#discussion_r15840899
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
@@ -70,20 +70,30 @@ class JdbcRDD[T: ClassTag](
   override def compute(thePart: Partition, context: TaskContext) = new 
NextIterator[T] {
 context.addOnCompleteCallback{ () => closeIfNeeded() }
 val part = thePart.asInstanceOf[JdbcPartition]
-val conn = getConnection()
-val stmt = conn.prepareStatement(sql, ResultSet.TYPE_FORWARD_ONLY, 
ResultSet.CONCUR_READ_ONLY)
+var conn : Connection = _
+var stmt : PreparedStatement = _
+try {
--- End diff --

Actually, as I look at this again, I think the try is not necessary. 
Basically the first line in the constructor adds the "closeIfNeeded" callback 
to the list of on-complete callbacks. Then if any exception is thrown, even 
during the constructor, I think closeIfNeeded is called to close the 
connections.








[GitHub] spark pull request: SPARK-2869 - Fix tiny bug in JdbcRdd for closi...

2014-08-05 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1792#discussion_r15843358
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/JdbcRDD.scala ---
@@ -106,7 +106,7 @@ class JdbcRDD[T: ClassTag](
 case e: Exception => logWarning("Exception closing statement", e)
   }
   try {
-if (null != conn && ! stmt.isClosed()) conn.close()
+if (null != conn && ! conn.isClosed()) conn.close()
--- End diff --

While you are at it, do you mind adding curly braces for the close()? The 
Spark coding style requires curly braces even for one-liner ifs. It's an 
oversight that we didn't do this for the ifs in this file.
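
A hedged illustration of the brace style meant here (a standalone helper just 
to keep the snippet self-contained):

    import java.sql.Connection

    // Curly braces even for a one-liner if, per the Spark style guide.
    def closeConnection(conn: Connection): Unit = {
      if (null != conn && !conn.isClosed()) {
        conn.close()
      }
    }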






[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51263909
  
He's probably en route somewhere because he woke up early this morning to 
catch a flight...





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51263913
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51263924
  
Jenkins, add to whitelist.






[GitHub] spark pull request: [SPARK-2583] ConnectionManager error reporting

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1758#issuecomment-51270610
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51273157
  
I think Jenkins is sick ... we can keep trying ...





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51273180
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-2503] Lower shuffle output buffer (spar...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1781#issuecomment-51274284
  
Thanks. Merging in master & branch-1.1.





[GitHub] spark pull request: SPARK-2869 - Fix tiny bug in JdbcRdd for closi...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1792#issuecomment-51277474
  
Jenkins, test this please.





[GitHub] spark pull request: Tighten the visibility of various SQLConf meth...

2014-08-05 Thread rxin
GitHub user rxin opened a pull request:

https://github.com/apache/spark/pull/1794

Tighten the visibility of various SQLConf methods and renamed setter/getters



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rxin/spark sql-conf

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1794.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1794


commit 4b19d6ccd3544e02811361cf7cd06327a15335e5
Author: Reynold Xin 
Date:   2014-08-06T01:15:08Z

Tighten the visibility of various SQLConf methods and renamed 
setter/getters.







[GitHub] spark pull request: SPARK-2869 - Fix tiny bug in JdbcRdd for closi...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1792#issuecomment-51282959
  
Thanks. Merging this in master & branch-1.0 and branch-1.1.





[GitHub] spark pull request: [SPARK-2862] Use shorthand range notation to a...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1787#issuecomment-51283209
  
Jenkins, test this please.





[GitHub] spark pull request: [SPARK-2861] Fix Doc comment of histogram meth...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1786#issuecomment-51285191
  
Jenkins, test this please.





[GitHub] spark pull request: [SQL] Tighten the visibility of various SQLCon...

2014-08-05 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1794#issuecomment-51289764
  
@concretevitamin 





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51367875
  
This looks good to me. What do you think, @JoshRosen / @davies ?





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51373160
  
The line length is already 100 in this PR.





[GitHub] spark pull request: SPARK-2787: Make sort-based shuffle write file...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1799#discussion_r15897877
  
--- Diff: core/src/main/scala/org/apache/spark/SparkEnv.scala ---
@@ -246,8 +250,13 @@ object SparkEnv extends Logging {
   "."
 }
 
-val shuffleManager = instantiateClass[ShuffleManager](
-  "spark.shuffle.manager", 
"org.apache.spark.shuffle.hash.HashShuffleManager")
+// Let the user specify short names for shuffle managers
+val shortShuffleMgrNames = Map(
+  "HASH" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
+  "SORT" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
+val shuffleMgrName = conf.get("spark.shuffle.manager", "HASH")
--- End diff --

can we make this case insensitive?
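
A rough sketch of what a case-insensitive lookup could look like (illustrative
only, not necessarily the change that ends up in the PR):

    // Normalize the user-supplied value so "hash", "HASH" and "Hash" all
    // resolve to the same shuffle manager class.
    val shortShuffleMgrNames = Map(
      "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
      "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
    val shuffleMgrName = conf.get("spark.shuffle.manager", "hash")
    val shuffleMgrClass =
      shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)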





[GitHub] spark pull request: SPARK-2787: Make sort-based shuffle write file...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1799#discussion_r15898020
  
--- Diff: 
core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleWriter.scala ---
@@ -54,87 +55,36 @@ private[spark] class SortShuffleWriter[K, V, C](
 
   /** Write a bunch of records to this task's output */
   override def write(records: Iterator[_ <: Product2[K, V]]): Unit = {
-// Get an iterator with the elements for each partition ID
-val partitions: Iterator[(Int, Iterator[Product2[K, _]])] = {
-  if (dep.mapSideCombine) {
-if (!dep.aggregator.isDefined) {
-  throw new IllegalStateException("Aggregator is empty for 
map-side combine")
-}
-sorter = new ExternalSorter[K, V, C](
-  dep.aggregator, Some(dep.partitioner), dep.keyOrdering, 
dep.serializer)
-sorter.write(records)
-sorter.partitionedIterator
-  } else {
-// In this case we pass neither an aggregator nor an ordering to 
the sorter, because we
-// don't care whether the keys get sorted in each partition; that 
will be done on the
-// reduce side if the operation being run is sortByKey.
-sorter = new ExternalSorter[K, V, V](
-  None, Some(dep.partitioner), None, dep.serializer)
-sorter.write(records)
-sorter.partitionedIterator
+if (dep.mapSideCombine) {
+  if (!dep.aggregator.isDefined) {
+throw new IllegalStateException("Aggregator is empty for map-side 
combine")
   }
+  sorter = new ExternalSorter[K, V, C](
+dep.aggregator, Some(dep.partitioner), dep.keyOrdering, 
dep.serializer)
+  sorter.insertAll(records)
+} else {
+  // In this case we pass neither an aggregator nor an ordering to the 
sorter, because we
+  // don't care whether the keys get sorted in each partition; that 
will be done on the
+  // reduce side if the operation being run is sortByKey.
+  sorter = new ExternalSorter[K, V, V](
--- End diff --

I think this fits in one line without wrapping ...





[GitHub] spark pull request: SPARK-2787: Make sort-based shuffle write file...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1799#discussion_r15898462
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -640,9 +713,122 @@ private[spark] class ExternalSorter[K, V, C](
*/
   def iterator: Iterator[Product2[K, C]] = 
partitionedIterator.flatMap(pair => pair._2)
 
+  /**
+   * Write all the data added into this ExternalSorter into a file in the 
disk store, creating
+   * an .index file for it as well with the offsets of each partition. 
This is called by the
+   * SortShuffleWriter and can go through an efficient path of just 
concatenating binary files
+   * if we decided to avoid merge-sorting.
+   *
+   * @param blockId block ID to write to. The index file will be 
blockId.name + ".index".
+   * @param context a TaskContext for a running Spark task, for us to 
update shuffle metrics.
+   * @return array of lengths, in bytes, of each partition of the file 
(used by map output tracker)
+   */
+  def writePartitionedFile(blockId: BlockId, context: TaskContext): 
Array[Long] = {
+val outputFile = blockManager.diskBlockManager.getFile(blockId)
+
+// Track location of each range in the output file
+val offsets = new Array[Long](numPartitions + 1)
+val lengths = new Array[Long](numPartitions)
+
+// Statistics
+var totalBytes = 0L
+var totalTime = 0L
+
+if (bypassMergeSort && partitionWriters != null) {
--- End diff --

In the comment above partitionWriters, we say: "Array of file writers for
each partition, used if bypassMergeSort is true".

This implies that when bypassMergeSort is true, partitionWriters wouldn't
be null. Can you document the case when partitionWriters would be null?





[GitHub] spark pull request: SPARK-2787: Make sort-based shuffle write file...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1799#discussion_r15898826
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -640,9 +713,122 @@ private[spark] class ExternalSorter[K, V, C](
*/
   def iterator: Iterator[Product2[K, C]] = 
partitionedIterator.flatMap(pair => pair._2)
 
+  /**
+   * Write all the data added into this ExternalSorter into a file in the 
disk store, creating
+   * an .index file for it as well with the offsets of each partition. 
This is called by the
+   * SortShuffleWriter and can go through an efficient path of just 
concatenating binary files
+   * if we decided to avoid merge-sorting.
+   *
+   * @param blockId block ID to write to. The index file will be 
blockId.name + ".index".
+   * @param context a TaskContext for a running Spark task, for us to 
update shuffle metrics.
+   * @return array of lengths, in bytes, of each partition of the file 
(used by map output tracker)
+   */
+  def writePartitionedFile(blockId: BlockId, context: TaskContext): 
Array[Long] = {
+val outputFile = blockManager.diskBlockManager.getFile(blockId)
+
+// Track location of each range in the output file
+val offsets = new Array[Long](numPartitions + 1)
+val lengths = new Array[Long](numPartitions)
+
+// Statistics
+var totalBytes = 0L
+var totalTime = 0L
+
+if (bypassMergeSort && partitionWriters != null) {
+  // We decided to write separate files for each partition, so just 
concatenate them. To keep
+  // this simple we spill out the current in-memory collection so that 
everything is in files.
+  spillToPartitionFiles(if (aggregator.isDefined) map else buffer)
+  partitionWriters.foreach(_.commitAndClose())
+  var out: FileOutputStream = null
+  var in: FileInputStream = null
+  try {
+out = new FileOutputStream(outputFile)
+for (i <- 0 until numPartitions) {
+  val file = partitionWriters(i).fileSegment().file
--- End diff --

I find this part that uses fileSegments slightly convoluted. But we can 
deal with this later. 





[GitHub] spark pull request: SPARK-2787: Make sort-based shuffle write file...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1799#discussion_r15898996
  
--- Diff: 
core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala ---
@@ -120,6 +128,18 @@ private[spark] class ExternalSorter[K, V, C](
   // How much of the shared memory pool this collection has claimed
   private var myMemoryThreshold = 0L
 
+  // If there are fewer than spark.shuffle.sort.bypassMergeThreshold 
partitions and we don't need
+  // local aggregation and sorting, write numPartitions files directly and 
just concatenate them
+  // at the end. This avoids doing serialization and deserialization twice 
to merge together the
+  // spilled files, which would happen with the normal code path. The 
downside is having multiple
+  // files open at a time and thus more memory allocated to buffers.
+  private val bypassMergeThreshold = 
conf.getInt("spark.shuffle.sort.bypassMergeThreshold", 200)
+  private[collection] val bypassMergeSort =// 
private[collection] for unit tests
--- End diff --

You can use this in unit test: 
http://doc.scalatest.org/1.4.1/org/scalatest/PrivateMethodTester.html
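
A small sketch of the PrivateMethodTester pattern being suggested; the class
and member names below are made up purely for illustration:

    import org.scalatest.{FunSuite, PrivateMethodTester}

    // Hypothetical class with a private member we want to check in a test.
    class Widget {
      private def secretSize(): Int = 42
    }

    class WidgetSuite extends FunSuite with PrivateMethodTester {
      test("read a private member without widening its visibility") {
        val secretSize = PrivateMethod[Int]('secretSize)
        val size = (new Widget) invokePrivate secretSize()
        assert(size === 42)
      }
    }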





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51388197
  
Ok I'm merging this in master. Thanks, @nchammas.





[GitHub] spark pull request: [SPARK-2627] [PySpark] have the build enforce ...

2014-08-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1744#issuecomment-51388412
  
And branch-1.1 too.






[GitHub] spark pull request: [SPARK-2583] ConnectionManager error reporting

2014-08-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1758#issuecomment-51388657
  
It actually failed deterministically on Josh's laptop. It could be a bug
somewhere in the configuration or a race condition in Spark.






[GitHub] spark pull request: SPARK-2879 [BUILD] Use HTTPS to access Maven C...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1805#discussion_r15899974
  
--- Diff: pom.xml ---
@@ -143,11 +143,11 @@
 
   
 
-  maven-repo
+  central
--- End diff --

Any reason we call Apache Maven "central"? (The old name is confusing too.)





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15900503
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,13 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-if (partitions <= 1) {
+  private var valRB: Array[K] = Array()
--- End diff --

Can we perhaps make this thread safe? 
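
One generic pattern for this (a sketch only, not necessarily the approach the
PR should take) is a @volatile field plus double-checked initialization; the
computeBounds() helper below is hypothetical:

    @volatile private var rangeBounds: Array[K] = null

    private def getRangeBounds: Array[K] = {
      if (rangeBounds == null) {
        this.synchronized {
          if (rangeBounds == null) {
            rangeBounds = computeBounds()  // hypothetical: the sampling logic
          }
        }
      }
      rangeBounds
    }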





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15919352
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   }
 
   @throws(classOf[IOException])
-  private def readObject(in: ObjectInputStream) {
+  private def readObject(in: ObjectInputStream): Unit = this.synchronized {
+if (valRB != null) return
--- End diff --

Do we not want to deserialize valRB if it is not null? Are you worried 
rangeBounds might be called while the deserialization is happening? 





[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1689#discussion_r15919599
  
--- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala ---
@@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](
   private var ordering = implicitly[Ordering[K]]
 
   // An array of upper bounds for the first (partitions - 1) partitions
-  private var rangeBounds: Array[K] = {
-if (partitions <= 1) {
+  @volatile private var valRB: Array[K] = null
--- End diff --

Any idea on volatile's impact on read performance? rangeBounds is read 
multiple times in getPartition.
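
One common mitigation, sketched here against the surrounding class's
rangeBounds and ordering members, is to read the volatile field once per call
into a local:

    def getPartition(key: Any): Int = {
      val k = key.asInstanceOf[K]
      val bounds = rangeBounds  // a single volatile read per call
      var partition = 0
      while (partition < bounds.length && ordering.gt(k, bounds(partition))) {
        partition += 1
      }
      partition
    }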





[GitHub] spark pull request: [SPARK-2887] fix bug of countApproxDistinct() ...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1812#discussion_r15919681
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1004,7 +1004,7 @@ abstract class RDD[T: ClassTag](
   },
   (h1: HyperLogLogPlus, h2: HyperLogLogPlus) => {
 h1.addAll(h2)
-h2
+h1
--- End diff --

ah can't believe I missed that
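
For context, the combiner has to return h1 because addAll folds h2 into h1;
returning h2 would silently drop everything previously merged into h1. The
corrected combine function, as in the diff above:

    (h1: HyperLogLogPlus, h2: HyperLogLogPlus) => {
      h1.addAll(h2)  // merge h2's registers into h1
      h1             // hand back the accumulator that received the merge
    }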





[GitHub] spark pull request: Branch 1.1

2014-08-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1824#issuecomment-51434585
  
What is this pull request for? Do you mind closing it?





[GitHub] spark pull request: [SPARK-2894] spark-shell doesn't accept flags

2014-08-06 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1825#issuecomment-51434809
  
Jenkins, add to whitelist.





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919794
  
--- Diff: core/src/main/scala/org/apache/spark/HttpServer.scala ---
@@ -50,7 +50,7 @@ private[spark] class HttpServer(resourceBase: File, 
securityManager: SecurityMan
 if (server != null) {
   throw new ServerStateException("Server is already started")
 } else {
-  logInfo("Starting HTTP Server")
+  logDebug("Starting HTTP Server")
--- End diff --

This logging message is useless to me... maybe we can remove it





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919800
  
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -146,7 +146,7 @@ private[spark] class SecurityManager(sparkConf: 
SparkConf) extends Logging {
   setViewAcls(defaultAclUsers, sparkConf.get("spark.ui.view.acls", ""))
 
   private val secretKey = generateSecretKey()
-  logInfo("SecurityManager: authentication " + (if (authOn) "enabled" else 
"disabled") +
+  logDebug("SecurityManager: authentication " + (if (authOn) "enabled" 
else "disabled") +
--- End diff --

This would be very useful. I'd keep it.






[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919803
  
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -170,8 +170,8 @@ private[spark] class SecurityManager(sparkConf: 
SparkConf) extends Logging {
   }
 
   private[spark] def setViewAcls(defaultUsers: Seq[String], allowedUsers: 
String) {
-viewAcls = (defaultUsers ++ 
allowedUsers.split(',')).map(_.trim()).filter(!_.isEmpty).toSet 
-logInfo("Changing view acls to: " + viewAcls.mkString(","))
+viewAcls = (defaultUsers ++ 
allowedUsers.split(',')).map(_.trim()).filter(!_.isEmpty).toSet
+logDebug("Changing view acls to: " + viewAcls.mkString(","))
--- End diff --

this too





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919807
  
--- Diff: core/src/main/scala/org/apache/spark/SecurityManager.scala ---
@@ -179,8 +179,8 @@ private[spark] class SecurityManager(sparkConf: 
SparkConf) extends Logging {
   }
 
   private[spark] def setUIAcls(aclSetting: Boolean) { 
-uiAclsOn = aclSetting 
-logInfo("Changing acls enabled to: " + uiAclsOn)
+uiAclsOn = aclSetting
+logDebug("Changing acls enabled to: " + uiAclsOn)
--- End diff --

and this one





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919824
  
--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -810,7 +810,7 @@ class SparkContext(config: SparkConf) extends Logging {
 // Fetch the file locally in case a job is executed using 
DAGScheduler.runLocally().
 Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, 
env.securityManager)
 
-logInfo("Added file " + path + " at " + key + " with timestamp " + 
addedFiles(key))
+logDebug("Added file " + path + " at " + key + " with timestamp " + 
addedFiles(key))
--- End diff --

I'd keep all the ones in SparkContext in INFO. I think they are kind of 
important ... 





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919835
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -167,7 +167,7 @@ private[spark] class Executor(
   SparkEnv.set(env)
   Thread.currentThread.setContextClassLoader(replClassLoader)
   val ser = SparkEnv.get.closureSerializer.newInstance()
-  logInfo(s"Running $taskName (TID $taskId)")
+  logDebug(s"Running $taskName (TID $taskId)")
--- End diff --

Let's keep this on. This is only chatty if you run it on a single node; in
distributed mode, it is very important information to log.





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919838
  
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -167,7 +167,7 @@ private[spark] class Executor(
   SparkEnv.set(env)
   Thread.currentThread.setContextClassLoader(replClassLoader)
   val ser = SparkEnv.get.closureSerializer.newInstance()
-  logInfo(s"Running $taskName (TID $taskId)")
+  logDebug(s"Running $taskName (TID $taskId)")
--- End diff --

also the following two





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919852
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -188,7 +188,7 @@ class HadoopRDD[K, V](
 val iter = new NextIterator[(K, V)] {
 
   val split = theSplit.asInstanceOf[HadoopPartition]
-  logInfo("Input split: " + split.inputSplit)
+  logDebug("Input split: " + split.inputSplit)
--- End diff --

i'd leave this one as info too





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919867
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -859,7 +859,7 @@ class DAGScheduler(
   return
   }
 
-  logInfo("Submitting " + tasks.size + " missing tasks from " + stage 
+ " (" + stage.rdd + ")")
+  logDebug("Submitting " + tasks.size + " missing tasks from " + stage 
+ " (" + stage.rdd + ")")
--- End diff --

i'd leave this one as info too





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919860
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -535,12 +535,12 @@ class DAGScheduler(
* Cancel a job that is running or waiting in the queue.
*/
   def cancelJob(jobId: Int) {
-logInfo("Asked to cancel job " + jobId)
+logDebug("Asked to cancel job " + jobId)
 eventProcessActor ! JobCancelled(jobId)
   }
 
   def cancelJobGroup(groupId: String) {
-logInfo("Asked to cancel job group " + groupId)
+logDebug("Asked to cancel job group " + groupId)
--- End diff --

i'd leave this one as info too





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919859
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -535,12 +535,12 @@ class DAGScheduler(
* Cancel a job that is running or waiting in the queue.
*/
   def cancelJob(jobId: Int) {
-logInfo("Asked to cancel job " + jobId)
+logDebug("Asked to cancel job " + jobId)
--- End diff --

i'd leave this one as info too





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919873
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1086,7 +1086,7 @@ class DAGScheduler(
   handleJobCancellation(jobId, s"because Stage $stageId was 
cancelled")
 }
   case None =>
-logInfo("No active jobs to kill for Stage " + stageId)
+logDebug("No active jobs to kill for Stage " + stageId)
--- End diff --

i'd leave this one as info too





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919874
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala 
---
@@ -1117,7 +1117,7 @@ class DAGScheduler(
   failJobAndIndependentStages(job, s"Job aborted due to stage failure: 
$reason")
 }
 if (dependentJobs.isEmpty) {
-  logInfo("Ignoring failure of " + failedStage + " because all jobs 
depending on it are done")
+  logDebug("Ignoring failure of " + failedStage + " because all jobs 
depending on it are done")
--- End diff --

i'd leave this one as info too





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919880
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -431,7 +431,7 @@ private[spark] class TaskSetManager(
   // a good proxy to task serialization time.
   // val timeTaken = clock.getTime() - startTime
   val taskName = s"task ${info.id} in stage ${taskSet.id}"
-  logInfo("Starting %s (TID %d, %s, %s, %d bytes)".format(
+  logDebug("Starting %s (TID %d, %s, %s, %d bytes)".format(
--- End diff --

this is progress reporting, isn't it?





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919888
  
--- Diff: 
core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -503,7 +503,7 @@ private[spark] class TaskSetManager(
 isZombie = true
   }
 } else {
-  logInfo("Ignoring task-finished event for " + info.id + " in stage " 
+ taskSet.id +
+  logDebug("Ignoring task-finished event for " + info.id + " in stage 
" + taskSet.id +
--- End diff --

i would make this info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919916
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -908,7 +908,7 @@ private[spark] class BlockManager(
* Remove all blocks belonging to the given broadcast.
*/
   def removeBroadcast(broadcastId: Long, tellMaster: Boolean): Int = {
-logInfo(s"Removing broadcast $broadcastId")
+logDebug(s"Removing broadcast $broadcastId")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919915
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -898,7 +898,7 @@ private[spark] class BlockManager(
*/
   def removeRdd(rddId: Int): Int = {
 // TODO: Avoid a linear scan by creating another mapping of RDD.id to 
blocks.
-logInfo(s"Removing RDD $rddId")
+logDebug(s"Removing RDD $rddId")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919909
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -518,12 +518,12 @@ private[spark] class BlockManager(
   def get(blockId: BlockId): Option[BlockResult] = {
 val local = getLocal(blockId)
 if (local.isDefined) {
-  logInfo(s"Found block $blockId locally")
+  logDebug(s"Found block $blockId locally")
   return local
 }
 val remote = getRemote(blockId)
 if (remote.isDefined) {
-  logInfo(s"Found block $blockId remotely")
+  logDebug(s"Found block $blockId remotely")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919913
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -856,7 +856,7 @@ private[spark] class BlockManager(
 
 // Drop to disk, if storage level requires
 if (level.useDisk && !diskStore.contains(blockId)) {
-  logInfo(s"Writing block $blockId to disk")
+  logDebug(s"Writing block $blockId to disk")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919907
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -518,12 +518,12 @@ private[spark] class BlockManager(
   def get(blockId: BlockId): Option[BlockResult] = {
 val local = getLocal(blockId)
 if (local.isDefined) {
-  logInfo(s"Found block $blockId locally")
+  logDebug(s"Found block $blockId locally")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919922
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala ---
@@ -38,14 +38,14 @@ class BlockManagerMaster(var driverActor: ActorRef, 
conf: SparkConf) extends Log
   /** Remove a dead executor from the driver actor. This is only called on 
the driver side. */
   def removeExecutor(execId: String) {
 tell(RemoveExecutor(execId))
-logInfo("Removed " + execId + " successfully in removeExecutor")
+logDebug("Removed " + execId + " successfully in removeExecutor")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919918
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -920,7 +920,7 @@ private[spark] class BlockManager(
* Remove a block from both memory and disk.
*/
   def removeBlock(blockId: BlockId, tellMaster: Boolean = true): Unit = {
-logInfo(s"Removing block $blockId")
+logDebug(s"Removing block $blockId")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919910
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -837,7 +837,7 @@ private[spark] class BlockManager(
   blockId: BlockId,
   data: Either[Array[Any], ByteBuffer]): Option[BlockStatus] = {
 
-logInfo(s"Dropping block $blockId from memory")
+logDebug(s"Dropping block $blockId from memory")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919930
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala ---
@@ -38,14 +38,14 @@ class BlockManagerMaster(var driverActor: ActorRef, 
conf: SparkConf) extends Log
   /** Remove a dead executor from the driver actor. This is only called on 
the driver side. */
   def removeExecutor(execId: String) {
 tell(RemoveExecutor(execId))
-logInfo("Removed " + execId + " successfully in removeExecutor")
+logDebug("Removed " + execId + " successfully in removeExecutor")
   }
 
   /** Register the BlockManager's id with the driver. */
   def registerBlockManager(blockManagerId: BlockManagerId, maxMemSize: 
Long, slaveActor: ActorRef) {
-logInfo("Trying to register BlockManager")
+logDebug("Trying to register BlockManager")
 tell(RegisterBlockManager(blockManagerId, maxMemSize, slaveActor))
-logInfo("Registered BlockManager")
+logDebug("Registered BlockManager")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919933
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala ---
@@ -57,7 +57,7 @@ class BlockManagerMaster(var driverActor: ActorRef, conf: 
SparkConf) extends Log
   tachyonSize: Long): Boolean = {
 val res = askDriverWithReply[Boolean](
   UpdateBlockInfo(blockManagerId, blockId, storageLevel, memSize, 
diskSize, tachyonSize))
-logInfo("Updated info of block " + blockId)
+logDebug("Updated info of block " + blockId)
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919929
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala ---
@@ -38,14 +38,14 @@ class BlockManagerMaster(var driverActor: ActorRef, 
conf: SparkConf) extends Log
   /** Remove a dead executor from the driver actor. This is only called on 
the driver side. */
   def removeExecutor(execId: String) {
 tell(RemoveExecutor(execId))
-logInfo("Removed " + execId + " successfully in removeExecutor")
+logDebug("Removed " + execId + " successfully in removeExecutor")
   }
 
   /** Register the BlockManager's id with the driver. */
   def registerBlockManager(blockManagerId: BlockManagerId, maxMemSize: 
Long, slaveActor: ActorRef) {
-logInfo("Trying to register BlockManager")
+logDebug("Trying to register BlockManager")
--- End diff --

i'd leave this as info





[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/1776#discussion_r15919942
  
--- Diff: 
core/src/main/scala/org/apache/spark/storage/BlockManagerMasterActor.scala ---
@@ -69,7 +69,7 @@ class BlockManagerMasterActor(val isLocal: Boolean, conf: 
SparkConf, listenerBus
 
   def receive = {
 case RegisterBlockManager(blockManagerId, maxMemSize, slaveActor) =>
-  logInfo("received a register")
+  logDebug("received a register")
--- End diff --

i'd remove this and maybe just keep a general logDebug to log all the 
messages.
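
A sketch of that suggestion (the handleMessage helper below is hypothetical,
just to keep the example short):

    // One catch-all debug line instead of a per-message logInfo.
    def receive = {
      case msg =>
        logDebug(s"Received message: $msg")
        handleMessage(msg)
    }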




