[GitHub] spark pull request: [SPARK-1061] assumePartitioned
Github user rapen commented on the pull request:

    https://github.com/apache/spark/pull/4449#issuecomment-137697278

@danielhaviv, see tests in PR.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-9669][MESOS] Support PySpark on Mesos c...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8349#issuecomment-137709964

[Test build #41995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41995/console) for PR 8349 at commit [`231c810`](https://github.com/apache/spark/commit/231c810b51b31b928e402ceebad72ca19b4314e0).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8597#discussion_r38742310

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/SimpleTextRelation.scala ---

    @@ -112,11 +114,14 @@ class SimpleTextRelation(
         val fields = dataSchema.map(_.dataType)

         sparkContext.textFile(inputStatuses.map(_.getPath).mkString(",")).map { record =>
    -      Row(record.split(",").zip(fields).map { case (value, dataType) =>
    +      Row(record.split(",", -1).zip(fields).map { case (v, dataType) =>
    +        val value = if (v == "") null else v
             // `Cast`ed values are always of Catalyst types (i.e. UTF8String instead of String, etc.)
             val catalystValue = Cast(Literal(value), dataType).eval()
             // Here we're converting Catalyst values to Scala values to test `needsConversion`
    -        CatalystTypeConverters.convertToScala(catalystValue, dataType)
    +        val scalaV = CatalystTypeConverters.convertToScala(catalystValue, dataType)
    +
    +        scalaV

--- End diff --

Nit: Remove `scalaV`?
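For context, the `-1` limit passed to `split` is the key change in the diff above: by default the JVM's `String.split` discards trailing empty strings, so a record ending in empty fields would silently lose columns. A quick plain-Scala illustration (the sample record is made up):

```scala
// String.split with no limit discards trailing empty strings,
// so a delimited record with trailing empty fields loses columns.
val record = "a,,c,,"

val lossy = record.split(",")     // default limit 0: trailing empties dropped
val exact = record.split(",", -1) // limit -1: every field preserved

// lossy keeps only Array("a", "", "c"); exact keeps all five fields.
```

Note that the interior empty field is kept either way; only trailing empties are affected, which is exactly why a fixed-width row schema needs the `-1` limit.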
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8597#discussion_r38742485

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---

    @@ -100,6 +104,87 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils {
         }
       }

    +  test("test all data types") {
    +    withTempDir { file =>
    +      file.delete()
    +
    +      // Create the schema.
    +      val struct =
    +        StructType(
    +          StructField("f1", FloatType, true) ::
    +            StructField("f2", ArrayType(BooleanType), true) :: Nil)
    +      val dataTypes =
    +        Seq(
    +          StringType, BinaryType, NullType, BooleanType,
    +          ByteType, ShortType, IntegerType, LongType,
    +          FloatType, DoubleType, DecimalType(25, 5), DecimalType(6, 5),
    +          DateType, TimestampType,
    +          ArrayType(IntegerType), MapType(StringType, LongType), struct,
    +          new MyDenseVectorUDT())

--- End diff --

`CalendarIntervalType` is not covered here. Is that intentional?
[GitHub] spark pull request: SPARK-10445: Extend Maven version (enforcer)
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/8598#issuecomment-137684318

No, we require 3.3.3 explicitly to work around some problems with Maven 3.2. `build/mvn` downloads it for you though. Do you mind closing this PR?
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8600#issuecomment-137695938

Merged build started.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8600#issuecomment-137695988

[Test build #41997 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41997/consoleFull) for PR 8600 at commit [`8ff97ed`](https://github.com/apache/spark/commit/8ff97ede7250e032c88cf20a4e95f3e1e1cd416f).
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8600#issuecomment-137695913

Merged build triggered.
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8597#discussion_r38742967

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---

    @@ -100,6 +104,87 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils {
         }
       }

    +  test("test all data types") {
    +    withTempDir { file =>
    +      file.delete()
    +
    +      // Create the schema.
    +      val struct =
    +        StructType(
    +          StructField("f1", FloatType, true) ::
    +            StructField("f2", ArrayType(BooleanType), true) :: Nil)
    +      val dataTypes =
    +        Seq(
    +          StringType, BinaryType, NullType, BooleanType,
    +          ByteType, ShortType, IntegerType, LongType,
    +          FloatType, DoubleType, DecimalType(25, 5), DecimalType(6, 5),
    +          DateType, TimestampType,
    +          ArrayType(IntegerType), MapType(StringType, LongType), struct,
    +          new MyDenseVectorUDT())
    +      val fields = dataTypes.zipWithIndex.map { case (dataType, index) =>
    +        StructField(s"col$index", dataType, true)
    +      }
    +      val schema = StructType(fields)
    +
    +      // Create an RDD for the schema
    +      val rdd =
    +        sqlContext.sparkContext.parallelize((1 to 100), 10).flatMap { i =>
    +          val row1 = Row(
    +            s"str${i}: test save.",
    +            s"binary${i}: test save.".getBytes("UTF-8"),
    +            null,
    +            i % 2 == 0,
    +            i.toByte,
    +            i.toShort,
    +            i,
    +            Long.MaxValue - i.toLong,
    +            (i + 0.25).toFloat,
    +            (i + 0.75),
    +            BigDecimal(Long.MaxValue.toString + ".12345"),
    +            new java.math.BigDecimal(s"${i % 9 + 1}" + ".23456"),
    +            new Date(i),
    +            new Timestamp(i),
    +            (1 to i).toSeq,
    +            (0 to i).map(j => s"map_key_$j" -> (Long.MaxValue - j)).toMap,
    +            Row((i - 0.25).toFloat, Seq(true, false, null)),
    +            new MyDenseVector(Array(1.1, 2.1, 3.1)))
    +          val row2 = Row.fromSeq(Seq.fill(dataTypes.length)(null))
    +          row1 :: row2 :: Nil
    +        }
    +      val df = sqlContext.createDataFrame(rdd, schema)
    +
    +      // All columns that have supported data types of this source.
    +      val supportedColumns = schema.fields.filter { field =>
    +        supportsDataType(field.dataType)
    +      }.map { field =>
    +        field.name
    +      }

--- End diff --

Nit: Can be simplified a little bit:

```scala
val supportedColumns = schema.collect {
  case StructField(name, dataType, _, _) if supportsDataType(dataType) => name
}
```
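The suggested `collect` call fuses the `filter` and the `map` into a single pass over the schema, using a pattern guard. A plain-Scala analogue of the same idiom (the `Field` case class and `supportsDataType` stub below are illustrative stand-ins, not Spark's actual types):

```scala
// Illustrative: filter + map collapsed into one `collect` with a pattern guard.
case class Field(name: String, dataType: String)

// Stand-in for the suite's supportsDataType check.
def supportsDataType(dt: String): Boolean = dt != "calendar_interval"

val schema = Seq(
  Field("a", "int"),
  Field("b", "calendar_interval"),
  Field("c", "string"))

// Keep only the names of fields whose type is supported, in one traversal:
val supportedColumns = schema.collect {
  case Field(name, dt) if supportsDataType(dt) => name
}
```

Fields whose type fails the guard simply fall outside the partial function's domain and are skipped, which is exactly the filter-then-map behavior of the original code.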
[GitHub] spark pull request: [SQL] SPARK-6981: Factor out SparkPlanner and ...
Github user evacchi commented on the pull request:

    https://github.com/apache/spark/pull/6356#issuecomment-137717489

@marmbrus I have brought this up to date. It might need fixes in order to merge cleanly, though.
[GitHub] spark pull request: [SPARK-10199][SPARK-10200][SPARK-10201][SPARK-...
Github user vinodkc commented on the pull request:

    https://github.com/apache/spark/pull/8507#issuecomment-137686626

@feynmanliang, I've removed case classes used for schema inference.
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8599#issuecomment-137708683

[Test build #41996 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41996/console) for PR 8599 at commit [`452cfb5`](https://github.com/apache/spark/commit/452cfb5259e2942364aeede944cccaeda7d19a24).

* This patch **passes all tests**.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8599#issuecomment-137708777

Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8599#issuecomment-137708779

Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41996/
[GitHub] spark pull request: [SPARK-10227] fatal warnings with sbt on Scala...
Github user skyluc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8433#discussion_r38743996

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---

    @@ -24,6 +24,7 @@ import java.util.{Collections, ArrayList => JArrayList, List => JList, Map => JMap}
     import scala.collection.JavaConverters._
     import scala.collection.mutable
     import scala.language.existentials
    +import scala.annotation.meta._

--- End diff --

No, it is a leftover from the previous changes.
[GitHub] spark pull request: [SPARK-10310][SQL]Using \t as the field delime...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8476#discussion_r38745091

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala ---

    @@ -429,7 +429,8 @@ class HiveQuerySuite extends HiveComparisonTest with BeforeAndAfter {
           |'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' FROM src;
         """.stripMargin.replaceAll(System.lineSeparator(), " "))

    -  test("transform with SerDe2") {
    +  // TODO: Only support serde which compatible with TextRecordReader at the moment.
    +  ignore("transform with SerDe2") {

--- End diff --

Why should this test case be ignored? The involved SQL query doesn't contain a `RECORDREADER` clause, so it should fall back to `TextRecordReader`, shouldn't it?
[GitHub] spark pull request: [SPARK-10227] fatal warnings with sbt on Scala...
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/8433#issuecomment-137707108

Aside from possibly removing those imports, yes, this looks good to me for master/1.6.
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8597#discussion_r38742692

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---

    @@ -100,6 +104,87 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils {
         }
       }

    +  test("test all data types") {
    +    withTempDir { file =>
    +      file.delete()
    +
    +      // Create the schema.
    +      val struct =
    +        StructType(
    +          StructField("f1", FloatType, true) ::
    +            StructField("f2", ArrayType(BooleanType), true) :: Nil)
    +      val dataTypes =
    +        Seq(
    +          StringType, BinaryType, NullType, BooleanType,
    +          ByteType, ShortType, IntegerType, LongType,
    +          FloatType, DoubleType, DecimalType(25, 5), DecimalType(6, 5),
    +          DateType, TimestampType,
    +          ArrayType(IntegerType), MapType(StringType, LongType), struct,
    +          new MyDenseVectorUDT())
    +      val fields = dataTypes.zipWithIndex.map { case (dataType, index) =>
    +        StructField(s"col$index", dataType, true)
    +      }
    +      val schema = StructType(fields)
    +
    +      // Create an RDD for the schema
    +      val rdd =
    +        sqlContext.sparkContext.parallelize((1 to 100), 10).flatMap { i =>
    +          val row1 = Row(
    +            s"str${i}: test save.",
    +            s"binary${i}: test save.".getBytes("UTF-8"),
    +            null,
    +            i % 2 == 0,
    +            i.toByte,
    +            i.toShort,
    +            i,
    +            Long.MaxValue - i.toLong,
    +            (i + 0.25).toFloat,
    +            (i + 0.75),

--- End diff --

Nit: `0.75D`, or add the `.toDouble` call to make it explicit.
[GitHub] spark pull request: [SPARK-10310][SQL]Using \t as the field delime...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8476#discussion_r38747088

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala ---

    @@ -328,23 +361,27 @@ case class HiveScriptIOSchema (
         (columns, columnTypes)
       }

    -  private def initSerDe(
    +  private def createTableProperties(
           serdeClassName: String,
           columns: Seq[String],
           columnTypes: Seq[DataType],
    -      serdeProps: Seq[(String, String)]): AbstractSerDe = {
    -    val serde = Utils.classForName(serdeClassName).newInstance.asInstanceOf[AbstractSerDe]
    -
    +      serdeProps: Seq[(String, String)]) = {
         val columnTypesNames = columnTypes.map(_.toTypeInfo.getTypeName()).mkString(",")
         var propsMap = serdeProps.toMap + (serdeConstants.LIST_COLUMNS -> columns.mkString(","))
         propsMap = propsMap + (serdeConstants.LIST_COLUMN_TYPES -> columnTypesNames)
    +    propsMap = propsMap + (serdeConstants.FIELD_DELIM -> "\t")

--- End diff --

Shouldn't we also specify the line delimiter here?
[GitHub] spark pull request: [SPARK-10199][SPARK-10200][SPARK-10201][SPARK-...
Github user vinodkc commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8507#discussion_r38733907

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala ---

    @@ -125,16 +126,19 @@ object KMeansModel extends Loader[KMeansModel] {
       def save(sc: SparkContext, model: KMeansModel, path: String): Unit = {
         val sqlContext = new SQLContext(sc)
    -    import sqlContext.implicits._
         val metadata = compact(render(
           ("class" -> thisClassName) ~ ("version" -> thisFormatVersion) ~ ("k" -> model.k)))
         sc.parallelize(Seq(metadata), 1).saveAsTextFile(Loader.metadataPath(path))
    -    val dataRDD = sc.parallelize(model.clusterCenters.zipWithIndex).map { case (point, id) =>
    -      Cluster(id, point)

--- End diff --

Removed the case classes except `NodeData`, `SplitData`, and `PredictData`; these classes simplify data extraction.
[GitHub] spark pull request: SPARK-10445: Extend Maven version (enforcer)
Github user jbonofre closed the pull request at:

    https://github.com/apache/spark/pull/8598
[GitHub] spark pull request: [SPARK-10310][SQL]Using \t as the field delime...
Github user liancheng commented on the pull request:

    https://github.com/apache/spark/pull/8476#issuecomment-137723720

@zhichao-li Could you please add a test case that explicitly checks the output format of a transformation query?
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/8537#issuecomment-137726639

[Test build #41998 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41998/consoleFull) for PR 8537 at commit [`8660d0e`](https://github.com/apache/spark/commit/8660d0e2a815b367cc9f34251926e315bc95f9c1).
[GitHub] spark pull request: SPARK-10445: Extend Maven version (enforcer)
Github user jbonofre commented on the pull request:

    https://github.com/apache/spark/pull/8598#issuecomment-137684872

Alright, weird, as it works with Maven 3.2.5 for me.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/8600

    [SPARK-10446][SQL] Support to specify join type when calling join with usingColumns

JIRA: https://issues.apache.org/jira/browse/SPARK-10446

Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 usingcolumns_df

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/8600.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #8600

----
commit 5ab4846852723d1c3505223e18c41dbf7bc40fa0
Author: Liang-Chi Hsieh
Date:   2015-09-04T09:43:35Z

    Support to specify join type when calling join with usingColumns.

commit 8ff97ede7250e032c88cf20a4e95f3e1e1cd416f
Author: Liang-Chi Hsieh
Date:   2015-09-04T10:06:35Z

    Add unit test.
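The API shape such a change typically takes is a new overload that accepts a join type string, with the existing two-argument method delegating to it with `"inner"` as the default. A plain-Scala sketch of that delegation pattern (the `DF` type and the string-returning bodies below are illustrative stand-ins, not Spark's actual implementation):

```scala
// Illustrative sketch of a delegating overload with a default join type.
// `DF` stands in for DataFrame; only the overload structure mirrors the proposal.
case class DF(name: String)

object JoinSketch {
  // Existing API: inner join only, now delegating to the new overload.
  def join(right: DF, usingColumns: Seq[String]): String =
    join(right, usingColumns, "inner")

  // Proposed API: the caller picks the join type explicitly.
  def join(right: DF, usingColumns: Seq[String], joinType: String): String =
    s"$joinType join with ${right.name} using (${usingColumns.mkString(", ")})"
}

// Callers of the old two-argument form keep their inner-join behavior,
// while new callers can request e.g. a left outer join:
val legacy = JoinSketch.join(DF("t2"), Seq("id"))
val outer  = JoinSketch.join(DF("t2"), Seq("id"), "left_outer")
```

Keeping the old signature as a thin delegate preserves source and binary compatibility while adding the new capability.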
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8597#discussion_r38742782

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---

    @@ -100,6 +104,87 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils {
         }
       }

    +  test("test all data types") {
    +    withTempDir { file =>
    +      file.delete()
    +
    +      // Create the schema.
    +      val struct =
    +        StructType(
    +          StructField("f1", FloatType, true) ::
    +            StructField("f2", ArrayType(BooleanType), true) :: Nil)
    +      val dataTypes =
    +        Seq(
    +          StringType, BinaryType, NullType, BooleanType,
    +          ByteType, ShortType, IntegerType, LongType,
    +          FloatType, DoubleType, DecimalType(25, 5), DecimalType(6, 5),
    +          DateType, TimestampType,
    +          ArrayType(IntegerType), MapType(StringType, LongType), struct,
    +          new MyDenseVectorUDT())
    +      val fields = dataTypes.zipWithIndex.map { case (dataType, index) =>
    +        StructField(s"col$index", dataType, true)
    +      }
    +      val schema = StructType(fields)
    +
    +      // Create an RDD for the schema
    +      val rdd =
    +        sqlContext.sparkContext.parallelize((1 to 100), 10).flatMap { i =>
    +          val row1 = Row(
    +            s"str${i}: test save.",
    +            s"binary${i}: test save.".getBytes("UTF-8"),
    +            null,
    +            i % 2 == 0,
    +            i.toByte,
    +            i.toShort,
    +            i,
    +            Long.MaxValue - i.toLong,
    +            (i + 0.25).toFloat,
    +            (i + 0.75),
    +            BigDecimal(Long.MaxValue.toString + ".12345"),
    +            new java.math.BigDecimal(s"${i % 9 + 1}" + ".23456"),
    +            new Date(i),
    +            new Timestamp(i),
    +            (1 to i).toSeq,
    +            (0 to i).map(j => s"map_key_$j" -> (Long.MaxValue - j)).toMap,
    +            Row((i - 0.25).toFloat, Seq(true, false, null)),
    +            new MyDenseVector(Array(1.1, 2.1, 3.1)))
    +          val row2 = Row.fromSeq(Seq.fill(dataTypes.length)(null))
    +          row1 :: row2 :: Nil
    +        }

--- End diff --

It seems that `RandomDataGenerator` would help here?
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/8600#issuecomment-137715359

Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41997/
[GitHub] spark pull request: [SPARK-10310][SQL]Using \t as the field delime...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8476#discussion_r38745795

--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala ---

    @@ -290,7 +300,9 @@ case class HiveScriptIOSchema (
         outputSerdeClass: Option[String],
         inputSerdeProps: Seq[(String, String)],
         outputSerdeProps: Seq[(String, String)],
    -    schemaLess: Boolean) extends ScriptInputOutputSchema with HiveInspectors {
    +    schemaLess: Boolean,
    +    recordWriter: String,
    +    recordReader: String) extends ScriptInputOutputSchema with HiveInspectors {

--- End diff --

Use `Option[String]` instead of `String` for `recordWriter` and `recordReader`, and don't use the empty string as their default value in other places of this PR.
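The pattern being suggested, modeling an unspecified reader/writer class name as `Option[String]` rather than an empty-string sentinel, can be sketched in plain Scala (the `IOSchemaSketch` case class and field names below are illustrative, not the PR's actual code):

```scala
// Illustrative: Option[String] makes "not specified" explicit and
// removes the need to compare against "" at every use site.
case class IOSchemaSketch(recordReader: Option[String], recordWriter: Option[String]) {
  // Fall back to a default class name only when no reader was specified.
  def readerOrDefault(default: String): String = recordReader.getOrElse(default)
}

val explicit    = IOSchemaSketch(Some("MyRecordReader"), None)
val unspecified = IOSchemaSketch(None, None)
```

With the sentinel approach, every consumer must remember the `""` convention; with `Option`, the type system forces callers to handle the absent case, e.g. via `getOrElse` as above.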
[GitHub] spark pull request: SPARK-10445: Extend Maven version (enforcer)
Github user jbonofre commented on the pull request:

    https://github.com/apache/spark/pull/8598#issuecomment-137692450

It makes sense, thanks!
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/8597#discussion_r38742367

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala ---

    @@ -100,6 +104,87 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils {
         }
       }

    +  test("test all data types") {
    +    withTempDir { file =>
    +      file.delete()

--- End diff --

You can use `withTempPath` here. It provides a temporary path without creating the directory, so you don't need the `delete()` call.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137715358 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137715324 [Test build #41997 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41997/console) for PR 8600 at commit [`8ff97ed`](https://github.com/apache/spark/commit/8ff97ede7250e032c88cf20a4e95f3e1e1cd416f). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class BlockFetchException(messages: String, throwable: Throwable)`
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8599#issuecomment-137684190 [Test build #41996 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41996/consoleFull) for PR 8599 at commit [`452cfb5`](https://github.com/apache/spark/commit/452cfb5259e2942364aeede944cccaeda7d19a24).
[GitHub] spark pull request: [SPARK-10227] fatal warnings with sbt on Scala...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8433#issuecomment-137698769 [Test build #1719 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1719/console) for PR 8433 at commit [`0408404`](https://github.com/apache/spark/commit/04084043276cba2b773b5895a0935278ccc611bd). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/8597#issuecomment-137714184 Generally looks good except for a few minor issues.
[GitHub] spark pull request: SPARK-10445: Extend Maven version (enforcer)
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/8598#issuecomment-137686197 Yeah it 99% works -- I recall that the problem was a little bit subtle, some problem with dependencies or artifacts, and maybe only affects a small number of use cases, but still worth avoiding.
[GitHub] spark pull request: [SPARK-9669][MESOS] Support PySpark on Mesos c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8349#issuecomment-137710055 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41995/ Test PASSed.
[GitHub] spark pull request: [SPARK-9669][MESOS] Support PySpark on Mesos c...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8349#issuecomment-137710050 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10310][SQL]Using \t as the field delime...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/8476#discussion_r38745663 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformation.scala --- @@ -58,6 +61,9 @@ case class ScriptTransformation( override def otherCopyArgs: Seq[HiveContext] = sc :: Nil + private val _broadcastedHiveConf = this. +sc.sparkContext.broadcast(new SerializableConfiguration(sc.hiveconf)) --- End diff -- You probably don't need broadcasting here. `SerializableConfiguration` already avoids reading XML files while deserializing `Configuration` instances.
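The point above rests on how the wrapper serializes. A hedged, self-contained sketch of the pattern behind `SerializableConfiguration`: custom `writeObject`/`readObject` write the wrapped state compactly and rebuild it on deserialization, so ordinary task serialization ships it fine and no broadcast is needed just to move it. `SerializableProps` here is an illustrative stand-in, with a plain `Map` in place of Hadoop's `Configuration`:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Wrapper whose Java-serialization hooks control exactly what goes on the
// wire: the key/value pairs, nothing else, and nothing is re-read from disk
// (e.g. XML config files) when the object is rebuilt.
class SerializableProps(@transient private var props: Map[String, String])
    extends Serializable {
  def value: Map[String, String] = props

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeInt(props.size)
    props.foreach { case (k, v) => out.writeUTF(k); out.writeUTF(v) }
  }

  private def readObject(in: ObjectInputStream): Unit = {
    val n = in.readInt()
    props = (0 until n).map(_ => in.readUTF() -> in.readUTF()).toMap
  }
}
```

With a wrapper like this, broadcasting only pays off when the same large object is shipped to many tasks repeatedly; for a one-off capture in a closure, plain serialization suffices.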
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8537#issuecomment-137725481 Merged build triggered.
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8537#issuecomment-137725498 Merged build started.
[GitHub] spark pull request: [SPARK-9652][CORE] Added method for Avro file ...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/7971#discussion_r38748233 --- Diff: core/src/test/scala/org/apache/spark/SparkFunSuite.scala --- @@ -44,5 +49,27 @@ private[spark] abstract class SparkFunSuite extends FunSuite with Logging { logInfo(s"\n\n= FINISHED $shortSuiteName: '$testName' =\n") } } + /** + * Generates a temporary path without creating the actual file/directory, then pass it to `f`. If + * a file/directory is created there by `f`, it will be delete after `f` returns. + * + * @todo Probably this method should be moved to a more general place + */ + protected def withTempPath(f: File => Unit): Unit = { +val path = Utils.createTempDir() +path.delete() +try f(path) finally Utils.deleteRecursively(path) + } + + /** + * Creates a temporary directory, which is then passed to `f` and will be deleted after `f` + * returns. + * + * @todo Probably this method should be moved to a more general place + */ + protected def withTempDir(f: File => Unit): Unit = { +val dir = Utils.createTempDir().getCanonicalFile +try f(dir) finally Utils.deleteRecursively(dir) + } --- End diff -- Please remove the `withTempDir` method defined in `OrcPartitionDiscoverySuite`. It's causing compilation error since an `override` is missing there.
[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137746771 [Test build #41999 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/consoleFull) for PR 8402 at commit [`cfcf4e6`](https://github.com/apache/spark/commit/cfcf4e667121b4225ce327f5f764b00677059865).
[GitHub] spark pull request: [SPARK-9652][CORE] Added method for Avro file ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/7971#issuecomment-137751981 [Test build #42000 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42000/consoleFull) for PR 7971 at commit [`bc8f2be`](https://github.com/apache/spark/commit/bc8f2beb80bd10f71eff1010e250cfccc99d9a8e).
[GitHub] spark pull request: [SPARK-9652][CORE] Added method for Avro file ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7971#issuecomment-137751364 Merged build triggered.
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user squito commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137745960 Sorry Andrew, I certainly didn't mean to imply that I knew your opinion on this particular patch -- as Sean said, I was just trying to point out the general situation. My point is just that new features add complexity and make it harder to fix bugs, and we have plenty of complexity and bugs now. I don't mean to block the patch; it's a cool feature. I'm just nervous about any change to the DAG scheduler (e.g., even my own proposal in https://github.com/apache/spark/pull/8427, which is 30 lines + 100 lines of tests, and most likely needs more testing still). Perhaps I err too much on the side of caution; I'm just providing a counterpoint. Thanks for adding the additional tests, Matei. Btw, there is an example test for skipped stages you should be able to copy more or less here: https://github.com/apache/spark/pull/8402
[GitHub] spark pull request: [SPARK-9652][CORE] Added method for Avro file ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/7971#issuecomment-137751393 Merged build started.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137752519 Merged build triggered.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137752570 Merged build started.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137754680 [Test build #42001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42001/consoleFull) for PR 8600 at commit [`efe069a`](https://github.com/apache/spark/commit/efe069aabfb3b06f2a9884153bb035022265652f).
[GitHub] spark pull request: [SPARK-9170][SQL] Use OrcStructInspector to be...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/7520#issuecomment-137728254 @viirya Oh, sorry. It would be nice if you ping me after you update your PR next time :)
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8537#issuecomment-137737040 [Test build #41998 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41998/console) for PR 8537 at commit [`8660d0e`](https://github.com/apache/spark/commit/8660d0e2a815b367cc9f34251926e315bc95f9c1). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class DefaultSource extends RelationProvider with DataSourceRegister ` * ` implicit class LibSVMReader(read: DataFrameReader) `
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8537#issuecomment-137737175 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10310][SQL]Using \t as the field delime...
Github user liancheng commented on the pull request: https://github.com/apache/spark/pull/8476#issuecomment-137726968 I'm not super familiar with the script transformation feature. If I understand this problem correctly, in prior versions we didn't support `RECORDREADER` or `RECORDWRITER` clauses and thus always fell back to `TextRecordReader` and `TextRecordWriter`; however, we didn't specify the line delimiter or field delimiter properly. Is that right? It seems that this PR not only tries to fix the delimiter issue, but also adds support for `RECORDREADER` and `RECORDWRITER` clauses, which I think could be moved into a separate PR to simplify this one.
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8537#issuecomment-137737181 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41998/ Test PASSed.
[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137745447 Merged build triggered.
[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137745469 Merged build started.
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/8599#discussion_r38762488 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -561,7 +561,7 @@ class Analyzer( } case sort @ Sort(sortOrder, global, aggregate: Aggregate) -if aggregate.resolved && !sort.resolved => +if aggregate.resolved => --- End diff -- We need to set up a stop condition for this rule, or something like `SELECT a, SUM(b) FROM t GROUP BY a ORDER BY a` will go through this rule again and again until it reaches the fixed point. How about changing the end of this rule to: ``` if (evaluatedOrderings == sortOrder) { sort } else { Project(aggregate.output, Sort(evaluatedOrderings, global, aggregate.copy(aggregateExpressions = originalAggExprs ++ needsPushDown))) } ```
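The termination concern above generalizes to any rule driven to a fixed point: the rule must eventually return its input unchanged, or the driver keeps matching forever. A toy, self-contained sketch of the idea (not Catalyst; `Expr`, `simplify`, and `toFixedPoint` are illustrative names):

```scala
sealed trait Expr
case class Lit(n: Int) extends Expr
case class Neg(e: Expr) extends Expr

object FixedPointDemo {
  // A rewrite rule with an explicit stop condition: when nothing matches,
  // return the input unchanged instead of producing a fresh equivalent tree.
  def simplify(e: Expr): Expr = e match {
    case Neg(Neg(inner)) => inner
    case other           => other
  }

  // Drive the rule to a fixed point; the equality check is what guarantees
  // termination, with an iteration cap as a safety net.
  def toFixedPoint(e: Expr, maxIters: Int = 100): Expr = {
    var cur = e
    var remaining = maxIters
    while (remaining > 0) {
      val next = simplify(cur)
      if (next == cur) return cur
      cur = next
      remaining -= 1
    }
    cur
  }
}
```

The suggested `if (evaluatedOrderings == sortOrder) sort else ...` guard plays exactly the role of the `next == cur` check here: a query whose ordering is already resolved passes through untouched on the next iteration.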
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/8599#discussion_r38762581 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -1519,6 +1519,19 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext { |ORDER BY sum(b) + 1 """.stripMargin), Row("4", 3) :: Row("1", 7) :: Row("3", 11) :: Row("2", 15) :: Nil) + +Seq("1" -> 3, "2" -> 7, "2" -> 8, "3" -> 5, "3" -> 6, "3" -> 2, "4" -> 1, "4" -> 2, + "4" -> 3, "4" -> 4).toDF("a", "b").registerTempTable("orderByData2") --- End diff -- Why add `orderByData2`? I thought `orderByData` can also reproduce this issue?
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8599#issuecomment-137778681 [Test build #42003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42003/consoleFull) for PR 8599 at commit [`e65f4db`](https://github.com/apache/spark/commit/e65f4dbf1fa00d786ba7ebf0f302861965d6d5cc).
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137779452 Merged build started.
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137779431 Merged build triggered.
[GitHub] spark pull request: [SPARK-10441][SQL] Save data correctly to json...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8597#discussion_r38769233 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala --- @@ -100,6 +104,87 @@ abstract class HadoopFsRelationTest extends QueryTest with SQLTestUtils { } } + test("test all data types") { +withTempDir { file => + file.delete() + + // Create the schema. + val struct = +StructType( + StructField("f1", FloatType, true) :: +StructField("f2", ArrayType(BooleanType), true) :: Nil) + val dataTypes = +Seq( + StringType, BinaryType, NullType, BooleanType, + ByteType, ShortType, IntegerType, LongType, + FloatType, DoubleType, DecimalType(25, 5), DecimalType(6, 5), + DateType, TimestampType, + ArrayType(IntegerType), MapType(StringType, LongType), struct, + new MyDenseVectorUDT()) --- End diff -- I do not think we can save it to any data source right now.
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137781689 Build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9926] [SPARK-10340] [SQL] Use S3 bulk l...
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/8512#issuecomment-137765159 @ewan-realitymine, yeah, that is @davies' point about `HadoopFsRelation`.
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765918 --- Diff: mllib/src/test/java/org/apache/spark/ml/source/JavaLibSVMRelationSuite.java --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source; + +import com.google.common.base.Charsets; +import com.google.common.io.Files; + +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.util.Utils; + +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.File; +import java.io.IOException; + +/** + * Test LibSVMRelation in Java. 
+ */ +public class JavaLibSVMRelationSuite { + private transient JavaSparkContext jsc; + private transient SQLContext jsql; + private transient DataFrame dataset; + + private File path; + + @Before + public void setUp() throws IOException { +jsc = new JavaSparkContext("local", "JavaLibSVMRelationSuite"); +jsql = new SQLContext(jsc); + +path = Utils.createTempDir(System.getProperty("java.io.tmpdir"), "datasource") + .getCanonicalFile(); +if (path.exists()) { + path.delete(); +} + +String s = "1 1:1.0 3:2.0 5:3.0\n0\n0 2:4.0 4:5.0 6:6.0"; +Files.write(s, path, Charsets.US_ASCII); + } + + @After + public void tearDown() { +jsc.stop(); +jsc = null; +path.delete(); + } + + @Test + public void verifyLibSVMDF() { +dataset = jsql.read().format("org.apache.spark.ml.source.libsvm").load(path.getPath()); --- End diff -- Add `option("vectorType", "dense")`?
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765954 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/LibSVMRelationSuite.scala --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.source + +import java.io.File + +import com.google.common.base.Charsets +import com.google.common.io.Files +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.source.libsvm._ +import org.apache.spark.mllib.linalg.{SparseVector, Vectors, DenseVector} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.util.Utils + +class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { + var path: String = _ + + override def beforeAll(): Unit = { +super.beforeAll() +val lines = + """ +|1 1:1.0 3:2.0 5:3.0 +|0 +|0 2:4.0 4:5.0 6:6.0 + """.stripMargin +val tempDir = Utils.createTempDir() +val file = new File(tempDir.getPath, "part-0") +Files.write(lines, file, Charsets.US_ASCII) +path = tempDir.toURI.toString + } + + test("select as sparse vector") { +val df = sqlContext.read.options(Map("numFeatures" -> "6")).libsvm(path) +assert(df.columns(0) == "label") +assert(df.columns(1) == "features") +val row1 = df.first() +assert(row1.getDouble(0) == 1.0) +assert(row1.getAs[SparseVector](1) == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 --- End diff -- This doesn't verify that the result is a sparse vector, because of runtime type erasure. We need ~~~scala val v = row1.getAs[SparseVector](1) assert(v == Vectors.sparse(...)) ~~~ to force the check.
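The erasure pitfall described above is general Scala behavior, not something specific to `Row`. A minimal self-contained sketch (toy `Vec` classes standing in for MLlib vectors, and `getAs` standing in for `Row.getAs`; all names here are hypothetical, not Spark APIs):

```scala
import scala.util.Try

object ErasureDemo {
  // Toy stand-ins for MLlib vectors: `==` is semantic (compares values
  // only), mirroring Vector.equals, so representation is not checked.
  class Vec(val values: Seq[Double]) {
    override def equals(o: Any): Boolean = o match {
      case v: Vec => v.values == values
      case _ => false
    }
    override def hashCode: Int = values.hashCode
  }
  class DenseVec(values: Seq[Double]) extends Vec(values)
  class SparseVec(values: Seq[Double]) extends Vec(values)

  // Stand-in for Row.getAs[T]: a cast to a bare type parameter is erased.
  def getAs[T](cell: Any): T = cell.asInstanceOf[T]

  val cell: Any = new DenseVec(Seq(1.0, 2.0)) // actually dense!

  // Compiles and passes: `==` only needs an Object receiver, so no
  // checkcast is emitted, and semantic equality hides the wrong type.
  def uncheckedComparison: Boolean =
    getAs[SparseVec](cell) == new SparseVec(Seq(1.0, 2.0))

  // Binding the result to an explicitly typed val forces a checkcast,
  // so the wrong runtime type fails fast with a ClassCastException.
  def checkedBinding: Try[SparseVec] =
    Try { val v: SparseVec = getAs[SparseVec](cell); v }
}
```

This is why assigning `row1.getAs[SparseVector](1)` to a typed `val` before the assertion turns a silently-passing comparison into a real type check.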
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765916 --- Diff: mllib/src/test/java/org/apache/spark/ml/source/JavaLibSVMRelationSuite.java --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source; + +import com.google.common.base.Charsets; +import com.google.common.io.Files; + +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.util.Utils; + +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.File; +import java.io.IOException; + +/** + * Test LibSVMRelation in Java. 
+ */ +public class JavaLibSVMRelationSuite { + private transient JavaSparkContext jsc; + private transient SQLContext jsql; + private transient DataFrame dataset; + + private File path; + + @Before + public void setUp() throws IOException { +jsc = new JavaSparkContext("local", "JavaLibSVMRelationSuite"); +jsql = new SQLContext(jsc); + +path = Utils.createTempDir(System.getProperty("java.io.tmpdir"), "datasource") + .getCanonicalFile(); --- End diff -- minor: calling `.getCanonicalFile` and checking `path.exists()` are not necessary
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-137789617 [Test build #42005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42005/consoleFull) for PR 5423 at commit [`7a348f5`](https://github.com/apache/spark/commit/7a348f553b6b747d76ceb7f4e51478f875df36b0).
[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/8588#discussion_r38772511 --- Diff: mllib/src/test/scala/org/apache/spark/ml/optim/WeightedLeastSquaresSuite.scala --- @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.optim + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.optim.WeightedLeastSquares.Instance +import org.apache.spark.mllib.linalg.Vectors +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.rdd.RDD + +class WeightedLeastSquaresSuite extends SparkFunSuite with MLlibTestSparkContext { + + private var instances: RDD[Instance] = _ + + override def beforeAll(): Unit = { +super.beforeAll() +/* + R code: + +A <- matrix(c(0, 1, 2, 3, 5, 7, 11, 13), 4, 2) +b <- c(17, 19, 23, 29) +w <- c(1, 2, 3, 4) + */ +instances = sc.parallelize(Seq( + Instance(1.0, Vectors.dense(0.0, 5.0).toSparse, 17.0), + Instance(2.0, Vectors.dense(1.0, 7.0), 19.0), + Instance(3.0, Vectors.dense(2.0, 11.0), 23.0), + Instance(4.0, Vectors.dense(3.0, 13.0), 29.0) +), 2) + } + + test("WLS against lm") { +/* + R code: + +df <- as.data.frame(cbind(A, b)) +for (formula in c(b ~ . -1, b ~ .)) { + model <- lm(formula, data=df, weights=w) + print(as.vector(coef(model))) +} + +[1] -3.727121 3.009983 +[1] 18.08 6.08 -0.60 + */ + +val expected = Seq( + Vectors.dense(0.0, -3.727121, 3.009983), + Vectors.dense(18.08, 6.08, -0.60)) + +var idx = 0 +for (fitIntercept <- Seq(false, true)) { + val wls = new WeightedLeastSquares( +fitIntercept, regParam = 0.0, standardizeFeatures = false, standardizeLabel = false) --- End diff -- Do we need `standardizeLabel`? I think without regularization, with/without standardization will not change the solution.
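The claim that standardization cannot change the unregularized solution can be checked numerically against the R coefficients quoted in the test. A self-contained sketch of the weighted normal equations (X^T W X) beta = (X^T W y), solved by Cramer's rule for the 2x2 no-intercept case; the `sigma` scale factors are arbitrary stand-ins for feature/label standard deviations:

```scala
object StandardizeInvariance {
  // Solve the 2x2 system a * beta = b by Cramer's rule.
  def solve2(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val det = a(0)(0) * a(1)(1) - a(0)(1) * a(1)(0)
    Array((b(0) * a(1)(1) - a(0)(1) * b(1)) / det,
          (a(0)(0) * b(1) - b(0) * a(1)(0)) / det)
  }

  // Weighted normal equations (X^T W X) beta = X^T W y for two features.
  def wls(xs: Array[Array[Double]], y: Array[Double], w: Array[Double]): Array[Double] = {
    val a = Array.ofDim[Double](2, 2)
    val b = Array.ofDim[Double](2)
    for (i <- xs.indices; j <- 0 until 2) {
      b(j) += w(i) * xs(i)(j) * y(i)
      for (k <- 0 until 2) a(j)(k) += w(i) * xs(i)(j) * xs(i)(k)
    }
    solve2(a, b)
  }

  // Data from the R snippet in the test (no-intercept case).
  val xs = Array(Array(0.0, 5.0), Array(1.0, 7.0), Array(2.0, 11.0), Array(3.0, 13.0))
  val y = Array(17.0, 19.0, 23.0, 29.0)
  val w = Array(1.0, 2.0, 3.0, 4.0)

  def raw: Array[Double] = wls(xs, y, w)

  // Divide each feature by an arbitrary sigma, fit, then unscale the
  // coefficients; with regParam = 0 this recovers the same solution.
  def standardized: Array[Double] = {
    val sigma = Array(2.5, 0.3)
    val scaled = xs.map(r => Array(r(0) / sigma(0), r(1) / sigma(1)))
    val beta = wls(scaled, y, w)
    Array(beta(0) / sigma(0), beta(1) / sigma(1))
  }

  // Scaling the label rescales the coefficients linearly, so multiplying
  // back also recovers the same solution when there is no regularization.
  def labelStandardized: Array[Double] = {
    val sigmaY = 4.0
    wls(xs, y.map(_ / sigmaY), w).map(_ * sigmaY)
  }
}
```

`raw` reproduces R's `[-3.727121, 3.009983]`, and both standardized fits agree with it to machine precision, supporting the comment that `standardizeFeatures`/`standardizeLabel` only matter once `regParam > 0`.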
[GitHub] spark pull request: [SPARK-10301] [SQL] Fixes schema merging for n...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/8509#discussion_r38760022 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala --- @@ -160,4 +101,168 @@ private[parquet] object CatalystReadSupport { val SPARK_ROW_REQUESTED_SCHEMA = "org.apache.spark.sql.parquet.row.requested_schema" val SPARK_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata" + + /** + * Tailors `parquetSchema` according to `catalystSchema` by removing column paths don't exist + * in `catalystSchema`, and adding those only exist in `catalystSchema`. + */ + def clipParquetSchema(parquetSchema: MessageType, catalystSchema: StructType): MessageType = { +val clippedParquetFields = clipParquetGroupFields(parquetSchema.asGroupType(), catalystSchema) +Types.buildMessage().addFields(clippedParquetFields: _*).named("root") + } + + private def clipParquetType(parquetType: Type, catalystType: DataType): Type = { +catalystType match { + case t: ArrayType if !isPrimitiveCatalystType(t.elementType) => +// Only clips array types with nested type as element type. +clipParquetListType(parquetType.asGroupType(), t.elementType) + + case t: MapType if !isPrimitiveCatalystType(t.valueType) => +// Only clips map types with nested type as value type. +clipParquetMapType(parquetType.asGroupType(), t.keyType, t.valueType) + + case t: StructType => +clipParquetGroup(parquetType.asGroupType(), t) + + case _ => +parquetType +} + } + + /** + * Whether a Catalyst [[DataType]] is primitive. Primitive [[DataType]] is not equivalent to + * [[AtomicType]]. For example, [[CalendarIntervalType]] is primitive, but it's not an + * [[AtomicType]]. + */ + private def isPrimitiveCatalystType(dataType: DataType): Boolean = { +dataType match { + case _: ArrayType | _: MapType | _: StructType => false + case _ => true +} + } + + /** + * Clips a Parquet [[GroupType]] which corresponds to a Catalyst [[ArrayType]]. 
The element type + * of the [[ArrayType]] should also be a nested type, namely an [[ArrayType]], a [[MapType]], or a + * [[StructType]]. + */ + private def clipParquetListType(parquetList: GroupType, elementType: DataType): Type = { +// Precondition of this method, should only be called for lists with nested element types. +assert(!isPrimitiveCatalystType(elementType)) + +// Unannotated repeated group should be interpreted as required list of required element, so +// list element type is just the group itself. Clip it. +if (parquetList.getOriginalType == null && parquetList.isRepetition(Repetition.REPEATED)) { + clipParquetType(parquetList, elementType) +} else { + assert( +parquetList.getOriginalType == OriginalType.LIST, +"Invalid Parquet schema. " + + "Original type of annotated Parquet lists must be LIST: " + + parquetList.toString) + + assert( +parquetList.getFieldCount == 1 && parquetList.getType(0).isRepetition(Repetition.REPEATED), +"Invalid Parquet schema. " + + "LIST-annotated group should only have exactly one repeated field: " + + parquetList) + + // Precondition of this method, should only be called for lists with nested element types. + assert(!parquetList.getType(0).isPrimitive) + + val repeatedGroup = parquetList.getType(0).asGroupType() + + // If the repeated field is a group with multiple fields, or the repeated field is a group + // with one field and is named either "array" or uses the LIST-annotated group's name with + // "_tuple" appended then the repeated type is the element type and elements are required. + // Build a new LIST-annotated group with clipped `repeatedGroup` as element type and the + // only field. 
+ if ( +repeatedGroup.getFieldCount > 1 || --- End diff -- This case corresponds to the 2nd rule of LIST backwards-compatibility rules defined here: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules
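The rule being referenced can be modeled in isolation. A hedged sketch (toy `Group` type and hypothetical names, not Parquet's API) of the predicate deciding whether the repeated group is itself the element type:

```scala
object ListCompat {
  // Minimal model of a Parquet repeated group: its name and the names of
  // its fields.
  final case class Group(name: String, fieldNames: Seq[String])

  // Second LIST backward-compatibility rule, as quoted in the review
  // comment: the repeated group is itself the element type (with required
  // elements) when it has multiple fields, or has a single field but is
  // named "array" or "<list name>_tuple".
  def repeatedGroupIsElementType(listName: String, repeated: Group): Boolean =
    repeated.fieldNames.size > 1 ||
      repeated.name == "array" ||
      repeated.name == listName + "_tuple"
}
```

In every other case the repeated group is just a wrapper and its single field is the element type, which is why `clipParquetListType` branches on exactly this condition before deciding what to clip.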
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137786977 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137786978 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42001/ Test PASSed.
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-137788331 Merged build started.
[GitHub] spark pull request: SPARK-1537 [WiP] Application Timeline Server i...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5423#issuecomment-137788310 Merged build triggered.
[GitHub] spark pull request: [SPARK-10301] [SQL] Fixes schema merging for n...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/8509#discussion_r38761117 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala --- @@ -941,4 +942,313 @@ class ParquetSchemaSuite extends ParquetSchemaTest { | optional fixed_len_byte_array(8) f1 (DECIMAL(18, 3)); |} """.stripMargin) + + private def testSchemaClipping( + testName: String, + parquetSchema: String, + catalystSchema: StructType, + expectedSchema: String): Unit = { +test(s"Clipping - $testName") { + val expected = MessageTypeParser.parseMessageType(expectedSchema) + val actual = CatalystReadSupport.clipParquetSchema( +MessageTypeParser.parseMessageType(parquetSchema), catalystSchema) + + try { +expected.checkContains(actual) +actual.checkContains(expected) + } catch { case cause: Throwable => +fail( + s"""Expected clipped schema: + |$expected + |Actual clipped schema: + |$actual + """.stripMargin, + cause) + } +} + } + + testSchemaClipping( +"simple nested struct", + +parquetSchema = + """message root { +| required group f0 { +|optional int32 f00; +|optional int32 f01; +| } +|} + """.stripMargin, + +catalystSchema = { + val f0Type = new StructType().add("f00", IntegerType, nullable = true) + new StructType() +.add("f0", f0Type, nullable = false) +.add("f1", IntegerType, nullable = true) +}, + +expectedSchema = + """message root { +| required group f0 { +|optional int32 f00; +| } +| optional int32 f1; +|} + """.stripMargin) + + testSchemaClipping( +"parquet-protobuf style array", + +parquetSchema = + """message root { +| required group f0 { +|repeated binary f00 (UTF8); +|repeated group f01 { +| optional int32 f010; +| optional double f011; +|} +| } +|} + """.stripMargin, + +catalystSchema = { + val f11Type = new StructType().add("f011", DoubleType, nullable = true) --- End diff -- Yes, thanks! And the variable name is wrong, should be `f01Type`. 
[GitHub] spark pull request: [SPARK-10437][SQ] Support aggregation expressi...
Github user viirya commented on a diff in the pull request: https://github.com/apache/spark/pull/8599#discussion_r3878 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -561,7 +561,7 @@ class Analyzer( } case sort @ Sort(sortOrder, global, aggregate: Aggregate) -if aggregate.resolved && !sort.resolved => +if aggregate.resolved => --- End diff -- Thanks. I've updated it.
[GitHub] spark pull request: [SPARK-9834][MLLIB] implement weighted least s...
Github user dbtsai commented on a diff in the pull request: https://github.com/apache/spark/pull/8588#discussion_r38772704 --- Diff: mllib/src/main/scala/org/apache/spark/ml/optim/WeightedLeastSquares.scala --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.optim + +import com.github.fommil.netlib.LAPACK.{getInstance => lapack} +import org.netlib.util.intW + +import org.apache.spark.Logging +import org.apache.spark.mllib.linalg._ +import org.apache.spark.mllib.linalg.distributed.RowMatrix +import org.apache.spark.rdd.RDD + +/** + * Model fitted by [[WeightedLeastSquares]]. + * @param coefficients model coefficients + * @param intercept model intercept + */ +private[ml] class WeightedLeastSquaresModel( --- End diff -- Will you merge this code into current `LinearRegression.scala`?
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765752 --- Diff: mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala --- @@ -0,0 +1,101 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.libsvm + +import com.google.common.base.Objects + +import org.apache.spark.Logging +import org.apache.spark.mllib.linalg.VectorUDT +import org.apache.spark.mllib.regression.LabeledPoint +import org.apache.spark.mllib.util.MLUtils +import org.apache.spark.rdd.RDD +import org.apache.spark.sql.types.{StructType, StructField, DoubleType} +import org.apache.spark.sql.{Row, SQLContext} +import org.apache.spark.sql.sources._ + +/** + * LibSVMRelation provides the DataFrame constructed from LibSVM format data. + * @param path File path of LibSVM format + * @param numFeatures The number of features + * @param vectorType The type of vector. 
It can be 'sparse' or 'dense' + * @param sqlContext The Spark SQLContext + */ +private[ml] class LibSVMRelation(val path: String, val numFeatures: Int, val vectorType: String) +(@transient val sqlContext: SQLContext) + extends BaseRelation with TableScan with Logging { + + override def schema: StructType = StructType( +StructField("label", DoubleType, nullable = false) :: + StructField("features", new VectorUDT(), nullable = false) :: Nil + ) + + override def buildScan(): RDD[Row] = { +val sc = sqlContext.sparkContext +val baseRdd = MLUtils.loadLibSVMFile(sc, path, numFeatures) + +val rowBuilders = Array( --- End diff -- Do we need `rowBuilders`? Since we don't have extra optimization, the line below should be sufficient. ~~~scala baseRdd.map(pt => Row(pt.label, pt.features)) ~~~
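For reference, the LibSVM text format consumed here is simple: each line is a label followed by 1-based `index:value` pairs, and a line may carry no features at all (like the bare `0` line in the test fixtures). A hypothetical standalone parser sketch, not the MLlib implementation:

```scala
object LibSVMParser {
  // Parses one LibSVM line, e.g. "1 1:1.0 3:2.0 5:3.0", into a label and
  // 0-based (index, value) pairs; a bare label like "0" yields no pairs.
  def parseLine(line: String): (Double, Seq[(Int, Double)]) = {
    val tokens = line.trim.split("\\s+").filter(_.nonEmpty)
    val label = tokens.head.toDouble
    val pairs = tokens.tail.toSeq.map { t =>
      val Array(i, v) = t.split(':')
      (i.toInt - 1, v.toDouble) // LibSVM indices are 1-based
    }
    (label, pairs)
  }
}
```

The 1-based-to-0-based index shift is the detail most easily gotten wrong when round-tripping this format through sparse vectors.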
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137781454 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137781457 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42004/ Test FAILed.
[GitHub] spark pull request: SPARK-6548 Adding stddev to DataFrame function...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6297#issuecomment-137794798 Merged build started.
[GitHub] spark pull request: SPARK-6548 Adding stddev to DataFrame function...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/6297#issuecomment-137794776 Merged build triggered.
[GitHub] spark pull request: [SPARK-10301] [SQL] Fixes schema merging for n...
Github user liancheng commented on a diff in the pull request: https://github.com/apache/spark/pull/8509#discussion_r38764680 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystReadSupport.scala --- @@ -160,4 +101,168 @@ private[parquet] object CatalystReadSupport { val SPARK_ROW_REQUESTED_SCHEMA = "org.apache.spark.sql.parquet.row.requested_schema" val SPARK_METADATA_KEY = "org.apache.spark.sql.parquet.row.metadata" + + /** + * Tailors `parquetSchema` according to `catalystSchema` by removing column paths don't exist + * in `catalystSchema`, and adding those only exist in `catalystSchema`. + */ + def clipParquetSchema(parquetSchema: MessageType, catalystSchema: StructType): MessageType = { +val clippedParquetFields = clipParquetGroupFields(parquetSchema.asGroupType(), catalystSchema) +Types.buildMessage().addFields(clippedParquetFields: _*).named("root") + } + + private def clipParquetType(parquetType: Type, catalystType: DataType): Type = { +catalystType match { + case t: ArrayType if !isPrimitiveCatalystType(t.elementType) => +// Only clips array types with nested type as element type. +clipParquetListType(parquetType.asGroupType(), t.elementType) + + case t: MapType if !isPrimitiveCatalystType(t.valueType) => +// Only clips map types with nested type as value type. +clipParquetMapType(parquetType.asGroupType(), t.keyType, t.valueType) + + case t: StructType => +clipParquetGroup(parquetType.asGroupType(), t) + + case _ => +parquetType +} + } + + /** + * Whether a Catalyst [[DataType]] is primitive. Primitive [[DataType]] is not equivalent to + * [[AtomicType]]. For example, [[CalendarIntervalType]] is primitive, but it's not an + * [[AtomicType]]. + */ + private def isPrimitiveCatalystType(dataType: DataType): Boolean = { +dataType match { + case _: ArrayType | _: MapType | _: StructType => false + case _ => true +} + } + + /** + * Clips a Parquet [[GroupType]] which corresponds to a Catalyst [[ArrayType]]. 
The element type + * of the [[ArrayType]] should also be a nested type, namely an [[ArrayType]], a [[MapType]], or a + * [[StructType]]. + */ + private def clipParquetListType(parquetList: GroupType, elementType: DataType): Type = { +// Precondition of this method, should only be called for lists with nested element types. +assert(!isPrimitiveCatalystType(elementType)) + +// Unannotated repeated group should be interpreted as required list of required element, so +// list element type is just the group itself. Clip it. +if (parquetList.getOriginalType == null && parquetList.isRepetition(Repetition.REPEATED)) { + clipParquetType(parquetList, elementType) +} else { + assert( +parquetList.getOriginalType == OriginalType.LIST, +"Invalid Parquet schema. " + + "Original type of annotated Parquet lists must be LIST: " + + parquetList.toString) + + assert( +parquetList.getFieldCount == 1 && parquetList.getType(0).isRepetition(Repetition.REPEATED), +"Invalid Parquet schema. " + + "LIST-annotated group should only have exactly one repeated field: " + + parquetList) + + // Precondition of this method, should only be called for lists with nested element types. + assert(!parquetList.getType(0).isPrimitive) + + val repeatedGroup = parquetList.getType(0).asGroupType() + + // If the repeated field is a group with multiple fields, or the repeated field is a group + // with one field and is named either "array" or uses the LIST-annotated group's name with + // "_tuple" appended then the repeated type is the element type and elements are required. + // Build a new LIST-annotated group with clipped `repeatedGroup` as element type and the + // only field. + if ( +repeatedGroup.getFieldCount > 1 || --- End diff -- Actually this method is a direct mapping of LIST backwards-compatibility rules defined in the link above. But list of primitive types is not handled in this method, since we only care about complex element type. 
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- -
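The condition quoted in the diff above (repeated group with more than one field, or named "array", or named with the LIST group's name plus "_tuple") can be sketched stand-alone. This is a toy illustration of the backwards-compatibility rule being discussed, not Parquet's actual API; `PGroup` is a hypothetical stand-in for Parquet's `GroupType`:

```scala
// Toy stand-in for Parquet's GroupType: just a name and its field names.
case class PGroup(name: String, fieldNames: List[String])

// Legacy 2-level lists: the repeated group itself is the element type
// (and elements are required). Standard 3-level lists: the repeated
// group's single field is the element type.
def repeatedGroupIsElementType(listGroupName: String, repeated: PGroup): Boolean =
  repeated.fieldNames.size > 1 ||
    repeated.name == "array" ||
    repeated.name == listGroupName + "_tuple"

// Legacy writers used a repeated group named "array" or "<list>_tuple".
assert(repeatedGroupIsElementType("my_list", PGroup("array", List("x"))))
assert(repeatedGroupIsElementType("my_list", PGroup("my_list_tuple", List("x"))))
assert(repeatedGroupIsElementType("my_list", PGroup("list", List("a", "b"))))
// Standard 3-level layout: a single-field repeated group named "list".
assert(!repeatedGroupIsElementType("my_list", PGroup("list", List("element"))))
```

In the real code, only the first branch ("repeated group is the element") requires clipping the repeated group itself; otherwise the group's single field is clipped.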
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765932 --- Diff: mllib/src/test/java/org/apache/spark/ml/source/JavaLibSVMRelationSuite.java --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source; + +import com.google.common.base.Charsets; +import com.google.common.io.Files; + +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.util.Utils; + +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.File; +import java.io.IOException; + +/** + * Test LibSVMRelation in Java. 
+ */ +public class JavaLibSVMRelationSuite { + private transient JavaSparkContext jsc; + private transient SQLContext jsql; + private transient DataFrame dataset; + + private File path; + + @Before + public void setUp() throws IOException { +jsc = new JavaSparkContext("local", "JavaLibSVMRelationSuite"); +jsql = new SQLContext(jsc); + +path = Utils.createTempDir(System.getProperty("java.io.tmpdir"), "datasource") + .getCanonicalFile(); +if (path.exists()) { + path.delete(); +} + +String s = "1 1:1.0 3:2.0 5:3.0\n0\n0 2:4.0 4:5.0 6:6.0"; +Files.write(s, path, Charsets.US_ASCII); + } + + @After + public void tearDown() { +jsc.stop(); +jsc = null; +path.delete(); + } + + @Test + public void verifyLibSVMDF() { +dataset = jsql.read().format("org.apache.spark.ml.source.libsvm").load(path.getPath()); +Assert.assertEquals("label", dataset.columns()[0]); +Assert.assertEquals("features", dataset.columns()[1]); +Row r = dataset.first(); +Assert.assertEquals(Double.valueOf(r.getDouble(0)), Double.valueOf(1.0)); --- End diff -- * `Double.valueOf(...)` is not necessary. * move `1.0` to the first position --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765957 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/LibSVMRelationSuite.scala --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.source + +import java.io.File + +import com.google.common.base.Charsets +import com.google.common.io.Files +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.source.libsvm._ +import org.apache.spark.mllib.linalg.{SparseVector, Vectors, DenseVector} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.util.Utils + +class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { + var path: String = _ + + override def beforeAll(): Unit = { +super.beforeAll() +val lines = + """ +|1 1:1.0 3:2.0 5:3.0 +|0 +|0 2:4.0 4:5.0 6:6.0 + """.stripMargin +val tempDir = Utils.createTempDir() +val file = new File(tempDir.getPath, "part-0") +Files.write(lines, file, Charsets.US_ASCII) +path = tempDir.toURI.toString + } + + test("select as sparse vector") { +val df = sqlContext.read.options(Map("numFeatures" -> "6")).libsvm(path) +assert(df.columns(0) == "label") +assert(df.columns(1) == "features") +val row1 = df.first() +assert(row1.getDouble(0) == 1.0) +assert(row1.getAs[SparseVector](1) == Vectors.sparse(6, Seq((0, 1.0), (2, 2.0), (4, 3.0 + } + + test("select as dense vector") { +val df = sqlContext.read.options(Map("numFeatures" -> "6", "featuresType" -> "dense")) + .libsvm(path) +assert(df.columns(0) == "label") +assert(df.columns(1) == "features") +assert(df.count() == 3) +val row1 = df.first() +assert(row1.getDouble(0) == 1.0) +assert(row1.getAs[DenseVector](1) == Vectors.dense(1.0, 0.0, 2.0, 0.0, 3.0, 0.0)) + } + + test("select without any option") { --- End diff -- Should add another test that sets `numFeatures` to a larger number and verify it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
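The `numFeatures` option discussed in the review above matters because libsvm records are sparse "index:value" pairs, and the declared feature count only pads the decoded vector with trailing zeros. A minimal sketch of that behavior (a toy parser, not Spark's `LibSVMRelation`), using the same sample lines as the test fixture:

```scala
// Parse one libsvm line into (label, dense feature array).
// libsvm feature indices are 1-based; absent indices stay 0.0.
def parseLibsvm(line: String, numFeatures: Int): (Double, Array[Double]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val v = Array.fill(numFeatures)(0.0)
  tokens.tail.foreach { t =>
    val parts = t.split(":")
    v(parts(0).toInt - 1) = parts(1).toDouble
  }
  (label, v)
}

assert(parseLibsvm("1 1:1.0 3:2.0 5:3.0", 6)._2
  .sameElements(Array(1.0, 0.0, 2.0, 0.0, 3.0, 0.0)))
// A larger numFeatures just widens the vector with zeros -- the case
// the review asks to cover with an extra test.
assert(parseLibsvm("1 1:1.0 3:2.0 5:3.0", 8)._2.length == 8)
// A bare label ("0") yields an all-zero feature vector.
assert(parseLibsvm("0", 6)._2.forall(_ == 0.0))
```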
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765946 --- Diff: mllib/src/test/scala/org/apache/spark/ml/source/LibSVMRelationSuite.scala --- @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.source + +import java.io.File + +import com.google.common.base.Charsets +import com.google.common.io.Files +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.source.libsvm._ +import org.apache.spark.mllib.linalg.{SparseVector, Vectors, DenseVector} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.util.Utils + +class LibSVMRelationSuite extends SparkFunSuite with MLlibTestSparkContext { + var path: String = _ + + override def beforeAll(): Unit = { +super.beforeAll() +val lines = + """ +|1 1:1.0 3:2.0 5:3.0 +|0 +|0 2:4.0 4:5.0 6:6.0 + """.stripMargin +val tempDir = Utils.createTempDir() +val file = new File(tempDir.getPath, "part-0") +Files.write(lines, file, Charsets.US_ASCII) +path = tempDir.toURI.toString + } + + test("select as sparse vector") { +val df = sqlContext.read.options(Map("numFeatures" -> "6")).libsvm(path) --- End diff -- We can remove `"numFeatures" -> 6` in one test. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765935 --- Diff: mllib/src/test/java/org/apache/spark/ml/source/JavaLibSVMRelationSuite.java --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source; + +import com.google.common.base.Charsets; +import com.google.common.io.Files; + +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.util.Utils; + +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.File; +import java.io.IOException; + +/** + * Test LibSVMRelation in Java. 
+ */ +public class JavaLibSVMRelationSuite { + private transient JavaSparkContext jsc; + private transient SQLContext jsql; + private transient DataFrame dataset; + + private File path; + + @Before + public void setUp() throws IOException { +jsc = new JavaSparkContext("local", "JavaLibSVMRelationSuite"); +jsql = new SQLContext(jsc); + +path = Utils.createTempDir(System.getProperty("java.io.tmpdir"), "datasource") + .getCanonicalFile(); +if (path.exists()) { + path.delete(); +} + +String s = "1 1:1.0 3:2.0 5:3.0\n0\n0 2:4.0 4:5.0 6:6.0"; +Files.write(s, path, Charsets.US_ASCII); + } + + @After + public void tearDown() { +jsc.stop(); +jsc = null; +path.delete(); + } + + @Test + public void verifyLibSVMDF() { +dataset = jsql.read().format("org.apache.spark.ml.source.libsvm").load(path.getPath()); +Assert.assertEquals("label", dataset.columns()[0]); +Assert.assertEquals("features", dataset.columns()[1]); +Row r = dataset.first(); +Assert.assertEquals(Double.valueOf(r.getDouble(0)), Double.valueOf(1.0)); +Assert.assertEquals(r.getAs(1), Vectors.dense(1.0, 0.0, 2.0, 0.0, 3.0, 0.0)); --- End diff -- We need to check the class name first or cast it to `DenseVector` directly: ~~~java DenseVector v = r.getAs(1) Assert.assertEquals(Vectors.dense(...), v) ~~~ If it is a sparse vector, the first line will throw an error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
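The point of the review above is that a value read back through a generic accessor like `Row.getAs` is effectively an unchecked cast, so requesting the concrete vector class up front surfaces a sparse-vs-dense mismatch immediately. A self-contained sketch with toy stand-ins for MLlib's vector classes (hypothetical names, not the real `org.apache.spark.mllib.linalg` API):

```scala
// Toy stand-ins for MLlib's DenseVector / SparseVector.
sealed trait Vec
final case class Dense(values: Array[Double]) extends Vec
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

// Like Row.getAs[T](i): the stored value is an Any, and the caller's
// requested type drives an unchecked cast.
def getAs[T](stored: Any): T = stored.asInstanceOf[T]

val stored: Any = Dense(Array(1.0, 0.0, 2.0, 0.0, 3.0, 0.0))
// Asking for the concrete class fails fast (ClassCastException) if the
// source actually produced the other vector kind.
val v: Dense = getAs[Dense](stored)
assert(v.values.sameElements(Array(1.0, 0.0, 2.0, 0.0, 3.0, 0.0)))
```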
[GitHub] spark pull request: [SPARK-10117][MLLIB] Implement SQL data source...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/8537#discussion_r38765912 --- Diff: mllib/src/test/java/org/apache/spark/ml/source/JavaLibSVMRelationSuite.java --- @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source; + +import com.google.common.base.Charsets; +import com.google.common.io.Files; + +import org.apache.spark.api.java.JavaSparkContext; +import org.apache.spark.mllib.linalg.Vectors; +import org.apache.spark.sql.DataFrame; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.SQLContext; +import org.apache.spark.util.Utils; + +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import java.io.File; +import java.io.IOException; --- End diff -- organize imports: java, scala, 3rd-party, spark. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-10437][SQL] Support aggregation expressi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8599#issuecomment-137778493 Merged build triggered.
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137778525 Build started.
[GitHub] spark pull request: [SPARK-10437][SQL] Support aggregation expressi...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8599#issuecomment-137778526 Merged build started.
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137778499 Build triggered.
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137780768 [Test build #42004 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42004/consoleFull) for PR 8180 at commit [`bb5190f`](https://github.com/apache/spark/commit/bb5190f0804f83fd178960bdf0ee857a65312859).
[GitHub] spark pull request: [SPARK-9851] Support submitting map stages ind...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8180#issuecomment-137781690 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42002/ Test FAILed.
[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137786063 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137786073 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/ Test FAILed.
[GitHub] spark pull request: [SPARK-10192] [core] simple test w/ failure in...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8402#issuecomment-137785853 [Test build #41999 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/41999/console) for PR 8402 at commit [`cfcf4e6`](https://github.com/apache/spark/commit/cfcf4e667121b4225ce327f5f764b00677059865). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10446][SQL] Support to specify join typ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/8600#issuecomment-137786844 [Test build #42001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42001/console) for PR 8600 at commit [`efe069a`](https://github.com/apache/spark/commit/efe069aabfb3b06f2a9884153bb035022265652f). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-10301] [SQL] Fixes schema merging for n...
Github user yhuai commented on a diff in the pull request: https://github.com/apache/spark/pull/8509#discussion_r38773863 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -229,4 +229,81 @@ class ParquetQuerySuite extends QueryTest with ParquetTest with SharedSQLContext } } } + + test("SPARK-10301 Clipping nested structs in requested schema") { --- End diff -- Can we list all the cases that are tested here? The cases covered should include: * the two struct types have the same fields * the two struct types have two completely different sets of fields * one struct type is a superset of the other * the two struct types share some common fields, but each also has fields that exist in only one file. I believe the ordering of fields also matters here. For example, for a struct in the global schema with fields `a, b, c, d`, having `a, d` in the local struct and having `a, b` in the struct field are two different cases.
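The test cases enumerated above can be made concrete with a toy model of schema clipping (field names only; `clip` here is an illustration, not Spark's `CatalystReadSupport.clipParquetSchema`): the clipped schema follows the requested (global) field order, and each field is tagged with whether the physical file actually contains it.

```scala
// Clip a file's physical field list against the requested (global) schema:
// requested order wins, and missing fields are flagged so the reader can
// fill them with nulls.
def clip(fileFields: List[String], requested: List[String]): List[(String, Boolean)] =
  requested.map(f => (f, fileFields.contains(f)))

// Same fields in both schemas.
assert(clip(List("a", "b"), List("a", "b")) == List(("a", true), ("b", true)))
// Two completely different sets of fields.
assert(clip(List("x", "y"), List("a", "b")) == List(("a", false), ("b", false)))
// The file schema is a superset of the requested schema.
assert(clip(List("a", "b", "c", "d"), List("a", "d")) == List(("a", true), ("d", true)))
// Partial overlap: requested order and missing fields both show up.
assert(clip(List("a", "d"), List("a", "b", "c", "d")) ==
  List(("a", true), ("b", false), ("c", false), ("d", true)))
```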