[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-19 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46529253
  
Ah yeah, this might not have been super clear, but it has at least been 
my assumption. We do want to make both SQL and GraphX non-alpha soon though, 
perhaps as early as 1.1. GraphX is closer, but SQL has such a narrow external 
API that I think it's good to lock it down.




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-18 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46436035
  
Hmmm, that doesn't precisely match my recollection or understanding.  
Certainly we discussed that alpha components aren't required to maintain a 
stable API, but I don't recall an explicit decision that changes to alpha 
components would routinely be merged back into maintenance releases.  I could 
be mistaken, and merging new alpha APIs into maintenance branches may be the 
right strategy, but this did take me a little by surprise.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46352124
  
 Merged build triggered. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46352148
  
Merged build started. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46360655
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15853/




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46360653
  
Merged build finished. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46363656
  
Jenkins, retest this please.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46363827
  
 Merged build triggered. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46363845
  
Merged build started. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46372514
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46372515
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15855/




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13891473
  
--- Diff: docs/sql-programming-guide.md ---
@@ -91,14 +91,33 @@ of its descendants.  To create a basic SQLContext, all you need is a SparkContext.
 
 {% highlight python %}
 from pyspark.sql import SQLContext
-sqlCtx = SQLContext(sc)
+sqlContext = SQLContext(sc)
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Running SQL on RDDs
+# Data Sources
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+Spark SQL supports operating on a variety of data sources through the SchemaRDD interface.
--- End diff --

best to put `<code> ... </code>` around SchemaRDD




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13891482
  
--- Diff: docs/sql-programming-guide.md ---
@@ -91,14 +91,33 @@ of its descendants.  To create a basic SQLContext, all you need is a SparkContext.
 
 {% highlight python %}
 from pyspark.sql import SQLContext
-sqlCtx = SQLContext(sc)
+sqlContext = SQLContext(sc)
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Running SQL on RDDs
+# Data Sources
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+Spark SQL supports operating on a variety of data sources through the SchemaRDD interface.
--- End diff --

and for Python/Java too




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13891733
  
--- Diff: docs/sql-programming-guide.md ---
@@ -297,50 +328,152 @@ JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >=
 <div data-lang="python" markdown="1">
 
 {% highlight python %}
+# sqlContext from the previous example is used in this example.
 
-peopleTable # The SchemaRDD from the previous example.
+schemaPeople # The SchemaRDD from the previous example.
 
 # SchemaRDDs can be saved as Parquet files, maintaining the schema information.
-peopleTable.saveAsParquetFile("people.parquet")
+schemaPeople.saveAsParquetFile("people.parquet")
 
 # Read in the Parquet file created above.  Parquet files are self-describing so the schema is preserved.
 # The result of loading a parquet file is also a SchemaRDD.
-parquetFile = sqlCtx.parquetFile("people.parquet")
+parquetFile = sqlContext.parquetFile("people.parquet")
 
 # Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerAsTable("parquetFile");
-teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+  print teenName
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
 
-**Language-Integrated queries are currently only supported in Scala.**
+<div data-lang="scala" markdown="1">
+Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD.
+This conversion can be done using one of two methods in a SQLContext:
 
-Spark SQL also supports a domain specific language for writing queries.  Once again,
-using the data from the above examples:
+* `jsonFile` - loads data from a directory of JSON files where each line of the files is a JSON object.
+* `jsonRdd` - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
 
 {% highlight scala %}
+// sc is an existing SparkContext.
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-import sqlContext._
-val people: RDD[Person] = ... // An RDD of case class objects, from the first example.
 
-// The following is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
-val teenagers = people.where('age >= 10).where('age <= 19).select('name)
+// A JSON dataset is pointed to by path.
+// The path can be either a single text file or a directory storing text files.
+val path = "examples/src/main/resources/people.json"
+// Create a SchemaRDD from the file(s) pointed to by path
+val people = sqlContext.jsonFile(path)
+
+// The inferred schema can be visualized using the printSchema() method.
+people.printSchema()
+// The schema of people is ...
--- End diff --

i'd remove this line




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13892635
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala 
---
@@ -123,4 +125,53 @@ abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanTy
   case other => Nil
 }.toSeq
   }
+
+  protected def generateSchemaTreeString(schema: Seq[Attribute]): String = {
+    val builder = new StringBuilder
+    builder.append("root\n")
+    val prefix = " |"
+    schema.foreach { attribute =>
+      val name = attribute.name
+      val dataType = attribute.dataType
+      dataType match {
+        case fields: StructType =>
+          builder.append(s"$prefix-- $name: $StructType\n")
+          generateSchemaTreeString(fields, s"$prefix|", builder)
+        case ArrayType(fields: StructType) =>
+          builder.append(s"$prefix-- $name: $ArrayType[$StructType]\n")
+          generateSchemaTreeString(fields, s"$prefix|", builder)
+        case ArrayType(elementType: DataType) =>
+          builder.append(s"$prefix-- $name: $ArrayType[$elementType]\n")
+        case _ => builder.append(s"$prefix-- $name: $dataType\n")
+      }
+    }
+
+    builder.toString()
+  }
+
+  protected def generateSchemaTreeString(
+      schema: StructType,
+      prefix: String,
+      builder: StringBuilder): StringBuilder = {
+    schema.fields.foreach {
+      case StructField(name, fields: StructType, _) =>
+        builder.append(s"$prefix-- $name: $StructType\n")
+        generateSchemaTreeString(fields, s"$prefix|", builder)
+      case StructField(name, ArrayType(fields: StructType), _) =>
+        builder.append(s"$prefix-- $name: $ArrayType[$StructType]\n")
+        generateSchemaTreeString(fields, s"$prefix|", builder)
+      case StructField(name, ArrayType(elementType: DataType), _) =>
+        builder.append(s"$prefix-- $name: $ArrayType[$elementType]\n")
+      case StructField(name, fieldType: DataType, _) =>
+        builder.append(s"$prefix-- $name: $fieldType\n")
+    }
+
+    builder
+  }
+
+  /** Returns the output schema in the tree format. */
+  def schemaTreeString: String = generateSchemaTreeString(output)
--- End diff --

maybe just schemaString




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13892709
  
--- Diff: sql/core/pom.xml ---
@@ -54,6 +61,11 @@
   <version>${parquet.version}</version>
 </dependency>
 <dependency>
+  <groupId>com.fasterxml.jackson.core</groupId>
+  <artifactId>jackson-core</artifactId>
+  <version>2.3.2</version>
--- End diff --

@pwendell I think in general sub-project pom files don't specify dependency 
versions. Can you verify?




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13892874
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -99,6 +97,37 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, parquet.ParquetRelation(path))
 
   /**
+   * Loads a JSON file (one object per line), returning the result as a 
[[SchemaRDD]].
--- End diff --

Maybe add a line explaining this goes through the data once to infer the 
schema ...
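
One possible wording (a sketch of the suggestion, not necessarily the text that was merged):

```scala
/**
 * Loads a JSON file (one object per line), returning the result as a [[SchemaRDD]].
 * Note that this goes through the entire dataset once to determine the schema.
 *
 * @group userf
 */
def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
```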




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13892881
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -99,6 +97,35 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, parquet.ParquetRelation(path))
 
   /**
+   * Loads a JSON file (one object per line), returning the result as a 
[[SchemaRDD]].
+   *
+   * @group userf
+   */
+  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
+
+  /**
+   * :: Experimental ::
+   */
--- End diff --

here too, although with sampling




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13893161
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala ---
@@ -342,13 +344,34 @@ class SchemaRDD(
   def toJavaSchemaRDD: JavaSchemaRDD = new JavaSchemaRDD(sqlContext, 
logicalPlan)
 
   private[sql] def javaToPython: JavaRDD[Array[Byte]] = {
--- End diff --

add some inline doc explaining this is used for the Python API.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13893257
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala 
---
@@ -0,0 +1,399 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object JsonRDD extends Logging {
+
+  private[sql] def inferSchema(
+  json: RDD[String],
+  samplingRatio: Double = 1.0): LogicalPlan = {
+require(samplingRatio > 0)
--- End diff --

add a more meaningful exception message, i.e.
```
require(samplingRatio > 0, s"samplingRatio ($samplingRatio) should be greater than 0")
```




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13893273
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala 
---
@@ -0,0 +1,399 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+private[sql] object JsonRDD extends Logging {
+
+  private[sql] def inferSchema(
+  json: RDD[String],
+  samplingRatio: Double = 1.0): LogicalPlan = {
+require(samplingRatio > 0)
+val schemaData = if (samplingRatio > 0.99) json else json.sample(false, samplingRatio, 1)
+
--- End diff --

probably no need to have a blank line for each statement ...




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46380653
  
This looks good to me overall. Only a few nitpicks. 

I think we should merge it once you've addressed the couple of comments I had.




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46383321
  
 Merged build triggered. 




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46383326
  
Merged build started. 




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46387956
  
Merged build finished. All automated tests passed.




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46387957
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15862/




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46389105
  
Thanks. I'm merging this in master & branch-1.0.




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/999




[GitHub] spark pull request: [SPARK-2060][SQL] Querying JSON Datasets with ...

2014-06-17 Thread markhamstra
Github user markhamstra commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46389597
  
Is that the basic strategy we are going to use with AlphaComponents -- 
merging new APIs at both the minor and maintenance levels?  I don't know that I 
have any objection to that, but I don't recall any discussion directly on 
point, and this is the first such addition that has been made to branch-1.0 
while I was paying attention.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46222354
  
Merged build started. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46222342
  
 Merged build triggered. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46222401
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15821/




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46222400
  
Merged build finished. 




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46222714
  
Have made a few changes:
* Removed the special SchemaRDD (JsonRDD) for JSON datasets. Now, when 
users call `jsonFile` and `jsonRDD`, a SchemaRDD is returned.
* Added Java and Python APIs. For the Python API, SchemaRDD.javaToPython 
can also handle StructType now.
* Updated the programming guide (see the usage sketch below).
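
A minimal usage sketch of the API described above (the path and the inline JSON record are hypothetical):

```scala
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Both entry points now return a plain SchemaRDD.
val fromFile = sqlContext.jsonFile("examples/src/main/resources/people.json")
val fromRdd = sqlContext.jsonRDD(sc.parallelize("""{"name":"Yin","age":31}""" :: Nil))

// Like any other SchemaRDD, the result can be registered as a table and queried.
fromRdd.registerAsTable("people")
sqlContext.sql("SELECT name FROM people WHERE age >= 13").collect().foreach(println)
```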





[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13826595
  
--- Diff: docs/sql-programming-guide.md ---
@@ -17,20 +17,20 @@ Spark.  At the core of this component is a new type of 
RDD,
 [Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) 
objects along with
 a schema that describes the data types of each column in the row.  A 
SchemaRDD is similar to a table
 in a traditional relational database.  A SchemaRDD can be created from an 
existing RDD, [Parquet](http://parquet.io)
-file, or by running HiveQL against data stored in [Apache 
Hive](http://hive.apache.org/).
+file, a JSON dataset, or by running HiveQL against data stored in [Apache 
Hive](http://hive.apache.org/).
 
 All of the examples on this page use sample data included in the Spark 
distribution and can be run in the `spark-shell`.
 
 </div>
 
 <div data-lang="java" markdown="1">
-Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to 
be executed using
+Spark SQL allows relational queries expressed in SQL or HiveQL to be 
executed using
--- End diff --

Why remove Scala?




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13826671
  
--- Diff: docs/sql-programming-guide.md ---
@@ -62,10 +62,10 @@ descendants.  To create a basic SQLContext, all you 
need is a SparkContext.
 
 {% highlight scala %}
 val sc: SparkContext // An existing SparkContext.
-val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
--- End diff --

I don't think it's generally good Scala style to shorten things 
unnecessarily (Context -> ctx).




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13827260
  
--- Diff: docs/sql-programming-guide.md ---
@@ -98,7 +98,9 @@ sqlCtx = SQLContext(sc)
 
 </div>
 
-## Running SQL on RDDs
+# Data Sources
+
+## RDDs
--- End diff --

These second-level headings don't appear to be formatted any differently 
than first-level ones.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13827230
  
--- Diff: docs/sql-programming-guide.md ---
@@ -98,7 +98,9 @@ sqlCtx = SQLContext(sc)
 
 </div>
 
-## Running SQL on RDDs
+# Data Sources
--- End diff --

We should have something here.  Maybe:
`Spark SQL supports operating on a variety of data sources through the 
SchemaRDD interface.  Once a dataset has been loaded, it can be registered as a 
table and even joined with data from other sources.`




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13827652
  
--- Diff: docs/sql-programming-guide.md ---
@@ -310,37 +325,190 @@ parquetFile = sqlCtx.parquetFile("people.parquet")
 # Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerAsTable("parquetFile");
 teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+  print teenName
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+Spark SQL supports querying JSON datasets. To query a JSON dataset, a SchemaRDD needs to be created for this JSON dataset. There are two ways to create a SchemaRDD for a JSON dataset:
--- End diff --

How about:

Spark SQL can automatically infer the schema of a JSON dataset and load it 
as a SchemaRDD.  This conversion can be done using one of two methods:
 - `jsonFile` - loads data from a directory of JSON files where each line 
of the files is a JSON object.
 - `jsonRdd` - loads data from an existing RDD where each element of the 
RDD is a string containing a JSON object.

We should probably also have a brief section on the rules of schema 
inference as they are non-obvious.

A future TODO: There is no reason that we need the JSON objects to be one 
per line, right? We can know when an object has not ended and could continue 
reading past a line break, right?
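
To make the non-obvious inference rules concrete, here is a sketch of what happens when records conflict on types (the widening behavior is inferred from the type-coercion code in this PR, so treat the commented output as an assumption):

```scala
// Two records that disagree on both fields: a is an integer vs a double,
// b is a struct vs a string.
val records = sc.parallelize(Seq(
  """{"a": 1, "b": {"c": "x"}}""",
  """{"a": 1.5, "b": "flat"}"""))

val inferred = sqlContext.jsonRDD(records)
inferred.printSchema()
// a should widen to a numeric type that can hold both values (e.g. DoubleType);
// b, conflicting between a struct and a primitive, is relaxed to StringType.
```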




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13827707
  
--- Diff: docs/sql-programming-guide.md ---
@@ -310,37 +325,190 @@ parquetFile = sqlCtx.parquetFile("people.parquet")
 # Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerAsTable("parquetFile");
 teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+  print teenName
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+Spark SQL supports querying JSON datasets. To query a JSON dataset, a SchemaRDD needs to be created for this JSON dataset. There are two ways to create a SchemaRDD for a JSON dataset:
 
-**Language-Integrated queries are currently only supported in Scala.**
+1. Creating the SchemaRDD from text files that store one JSON object per line.
+2. Creating the SchemaRDD from a RDD of strings (`RDD[String]`) that stores one JSON object.
 
-Spark SQL also supports a domain specific language for writing queries.  Once again,
-using the data from the above examples:
+The schema of a JSON dataset is automatically inferred when the SchemaRDD is created.
 
 {% highlight scala %}
-val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-import sqlContext._
-val people: RDD[Person] = ... // An RDD of case class objects, from the first example.
+// sc is an existing SparkContext.
+val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
+
+// A JSON dataset is pointed by path.
+// The path can be either a single text file or a directory storing text files.
+val path = "examples/src/main/resources/people.json"
+// Create a SchemaRDD from the file(s) pointed by path
+val people = sqlCtx.jsonFile(path)
+
+// Because the schema of a JSON dataset is automatically inferred, to write queries,
--- End diff --

The inferred schema can be visualized using the `printSchema()` method.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13827732
  
--- Diff: docs/sql-programming-guide.md ---
@@ -310,37 +325,190 @@ parquetFile = sqlCtx.parquetFile("people.parquet")
 # Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerAsTable("parquetFile");
 teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+  print teenName
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+Spark SQL supports querying JSON datasets. To query a JSON dataset, a SchemaRDD needs to be created for this JSON dataset. There are two ways to create a SchemaRDD for a JSON dataset:
 
-**Language-Integrated queries are currently only supported in Scala.**
+1. Creating the SchemaRDD from text files that store one JSON object per line.
+2. Creating the SchemaRDD from a RDD of strings (`RDD[String]`) that stores one JSON object.
 
-Spark SQL also supports a domain specific language for writing queries.  Once again,
-using the data from the above examples:
+The schema of a JSON dataset is automatically inferred when the SchemaRDD is created.
 
 {% highlight scala %}
-val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-import sqlContext._
-val people: RDD[Person] = ... // An RDD of case class objects, from the first example.
+// sc is an existing SparkContext.
+val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
+
+// A JSON dataset is pointed by path.
+// The path can be either a single text file or a directory storing text files.
+val path = "examples/src/main/resources/people.json"
+// Create a SchemaRDD from the file(s) pointed by path
+val people = sqlCtx.jsonFile(path)
+
+// Because the schema of a JSON dataset is automatically inferred, to write queries,
+// it is better to take a look at what is the schema.
+people.printSchema()
+// The schema of people is ...
+// root
+//  |-- age: IntegerType
+//  |-- name: StringType
+
+// Register this SchemaRDD as a table.
+people.registerAsTable("people")
 
-// The following is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
-val teenagers = people.where('age >= 10).where('age <= 19).select('name)
+// SQL statements can be run by using the sql methods provided by sqlCtx.
+val teenagers = sqlCtx.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
+
+// The results of SQL queries are SchemaRDDs and support all the normal RDD operations.
+// The columns of a row in the result can be accessed by ordinal.
+teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
--- End diff --

This is probably overkill as we have shown this multiple times in this 
guide.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13828272
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala
 ---
@@ -108,19 +118,18 @@ trait HiveTypeCoercion {
*
* Additionally, all types when UNION-ed with strings will be promoted 
to strings.
* Other string conversions are handled by PromoteStrings.
+   *
+   * Widening types might result in loss of precision in the following 
cases:
+   * - IntegerType to FloatType
+   * - LongType to FloatType
+   * - LongType to DoubleType
*/
   object WidenTypes extends Rule[LogicalPlan] {
-// See 
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types.
-// The conversion for integral and floating point types have a linear 
widening hierarchy:
-val numericPrecedence =
-  Seq(NullType, ByteType, ShortType, IntegerType, LongType, FloatType, 
DoubleType, DecimalType)
-// Boolean is only wider than Void
-val booleanPrecedence = Seq(NullType, BooleanType)
-val allPromotions: Seq[Seq[DataType]] = numericPrecedence :: 
booleanPrecedence :: Nil
 
 def findTightestCommonType(t1: DataType, t2: DataType): 
Option[DataType] = {
   // Try and find a promotion rule that contains both types in 
question.
-      val applicableConversion = allPromotions.find(p => p.contains(t1) && p.contains(t2))
+      val applicableConversion = HiveTypeCoercion.allPromotions.find(p => p.contains(t1) && p
--- End diff --

In general I'd prefer to break at a higher syntactic level (i.e., right 
after the `&&`, or even put the whole expression on the next line with only the `val 
applicableConversion =` on the line above).
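
For instance, the suggested break would look like this (names taken from the quoted diff):

```scala
val applicableConversion =
  HiveTypeCoercion.allPromotions.find(p => p.contains(t1) && p.contains(t2))
```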




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13828440
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizerTest.scala
 ---
@@ -17,39 +17,10 @@
 
 package org.apache.spark.sql.catalyst.optimizer
 
-import org.scalatest.FunSuite
-
-import org.apache.spark.sql.catalyst.expressions._
-import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
-import org.apache.spark.sql.catalyst.util._
+import org.apache.spark.sql.catalyst.plans.PlanTest
 
 /**
  * Provides helper methods for comparing plans produced by optimization 
rules with the expected
  * result
  */
-class OptimizerTest extends FunSuite {
-
-  /**
-   * Since attribute references are given globally unique ids during analysis,
-   * we must normalize them to check if two different queries are identical.
-   */
-  protected def normalizeExprIds(plan: LogicalPlan) = {
-    val minId = plan.flatMap(_.expressions.flatMap(_.references).map(_.exprId.id)).min
-    plan transformAllExpressions {
-      case a: AttributeReference =>
-        AttributeReference(a.name, a.dataType, a.nullable)(exprId = ExprId(a.exprId.id - minId))
-    }
-  }
-
-  /** Fails the test if the two plans do not match */
-  protected def comparePlans(plan1: LogicalPlan, plan2: LogicalPlan) {
-    val normalized1 = normalizeExprIds(plan1)
-    val normalized2 = normalizeExprIds(plan2)
-    if (normalized1 != normalized2)
-      fail(
-        s"""
-          |== FAIL: Plans do not match ===
-          |${sideBySide(normalized1.treeString, normalized2.treeString).mkString("\n")}
-        """.stripMargin)
-  }
-}
+class OptimizerTest extends PlanTest
--- End diff --

Is there any reason to keep this class?




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13828482
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/package.scala 
---
@@ -17,8 +17,55 @@
 
 package org.apache.spark.sql.catalyst
 
+import org.apache.spark.sql.catalyst.expressions.Attribute
+import org.apache.spark.sql.catalyst.types.{ArrayType, DataType, 
StructField, StructType}
+
 /**
 * A collection of common abstractions for query plans as well as
  * a base logical plan representation.
  */
-package object plans
+package object plans {
+
+  def generateSchemaTreeString(schema: Seq[Attribute]): String = {
--- End diff --

is this only used in QueryPlan?  If so, maybe it should just live there as 
a protected method.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13828571
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -99,6 +97,35 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, parquet.ParquetRelation(path))
 
   /**
+   * Loads a JSON file (one object per line), returning the result as a 
[[SchemaRDD]].
+   *
+   * @group userf
+   */
+  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
+
+  /**
+   * :: Experimental ::
+   */
--- End diff --

Needs the `@Experimental` annotation too.
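
A sketch of the annotated method (the signature and body come from this PR's diff; the import path is `org.apache.spark.annotation.Experimental`):

```scala
import org.apache.spark.annotation.Experimental

/**
 * :: Experimental ::
 * Loads a JSON file, using `samplingRatio` of the input to infer the schema.
 */
@Experimental
def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
  val json = sparkContext.textFile(path)
  jsonRDD(json, samplingRatio)
}
```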




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13828608
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala ---
@@ -99,6 +97,35 @@ class SQLContext(@transient val sparkContext: 
SparkContext)
 new SchemaRDD(this, parquet.ParquetRelation(path))
 
   /**
+   * Loads a JSON file (one object per line), returning the result as a 
[[SchemaRDD]].
+   *
+   * @group userf
+   */
+  def jsonFile(path: String): SchemaRDD = jsonFile(path, 1.0)
+
+  /**
+   * :: Experimental ::
+   */
+  def jsonFile(path: String, samplingRatio: Double): SchemaRDD = {
+val json = sparkContext.textFile(path)
+jsonRDD(json, samplingRatio)
+  }
+
+  /**
+   * Loads a RDD[String] storing JSON objects (one object per record), 
returning the result as a
+   * [[SchemaRDD]].
+   *
+   * @group userf
+   */
+  def jsonRDD(json: RDD[String]): SchemaRDD = jsonRDD(json, 1.0)
+
+  /**
+   * :: Experimental ::
+   */
+  def jsonRDD(json: RDD[String], samplingRatio: Double): SchemaRDD =
--- End diff --

Annotation too.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13828715
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala 
---
@@ -0,0 +1,402 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+@Experimental
--- End diff --

We can remove these annotations if the object is not in the public API.

Also, some developer-facing Scaladoc would be good.




[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13828804
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala 
---
@@ -0,0 +1,402 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import org.apache.spark.annotation.{DeveloperApi, Experimental}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.Logging
+
+@Experimental
+private[sql] object JsonRDD extends Logging {
+
+  @DeveloperApi
+  private[sql] def inferSchema(
+      json: RDD[String],
+      samplingRatio: Double = 1.0): LogicalPlan = {
+    require(samplingRatio > 0)
+    val schemaData = if (samplingRatio > 0.99) json else json.sample(false, samplingRatio, 1)
+
+    val allKeys = parseJson(schemaData).map(getAllKeysWithValueTypes).reduce(_ ++ _)
+
+    val baseSchema = createSchema(allKeys)
+
+    createLogicalPlan(json, baseSchema)
+  }
+
+  private def createLogicalPlan(
+      json: RDD[String],
+      baseSchema: StructType): LogicalPlan = {
+    val schema = nullTypeToStringType(baseSchema)
+
+    SparkLogicalPlan(ExistingRdd(asAttributes(schema), parseJson(json).map(asRow(_, schema))))
+  }
+
+  private def createSchema(allKeys: Set[(String, DataType)]): StructType = {
+    // Resolve type conflicts
+    val resolved = allKeys.groupBy {
+      case (key, dataType) => key
+    }.map {
+      // Now, keys and types are organized in the format of
+      // key -> Set(type1, type2, ...).
+      case (key, typeSet) => {
+        val fieldName = key.substring(1, key.length - 1).split("`.`").toSeq
+        val dataType = typeSet.map {
+          case (_, dataType) => dataType
+        }.reduce((type1: DataType, type2: DataType) => getCompatibleType(type1, type2))
+
+        (fieldName, dataType)
+      }
+    }
+
+    def makeStruct(values: Seq[Seq[String]], prefix: Seq[String]): StructType = {
+      val (topLevel, structLike) = values.partition(_.size == 1)
+      val topLevelFields = topLevel.filter {
+        name => resolved.get(prefix ++ name).get match {
+          case ArrayType(StructType(Nil)) => false
+          case ArrayType(_) => true
+          case struct: StructType => false
+          case _ => true
+        }
+      }.map {
+        a => StructField(a.head, resolved.get(prefix ++ a).get, nullable = true)
+      }
+
+      val structFields: Seq[StructField] = structLike.groupBy(_(0)).map {
+        case (name, fields) => {
+          val nestedFields = fields.map(_.tail)
+          val structType = makeStruct(nestedFields, prefix :+ name)
+          val dataType = resolved.get(prefix :+ name).get
+          dataType match {
+            case array: ArrayType => Some(StructField(name, ArrayType(structType), nullable = true))
+            case struct: StructType => Some(StructField(name, structType, nullable = true))
+            // dataType is StringType means that we have resolved type conflicts involving
+            // primitive types and complex types. So, the type of name has been relaxed to
+            // StringType. Also, this field should have already been put in topLevelFields.
+            case StringType => None
+          }
+        }
+      }.flatMap(field => field).toSeq
+
+      StructType(
+        (topLevelFields ++ structFields).sortBy {
+          case StructField(name, _, _) => name
+
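The reduce above collapses each key's set of observed types via
getCompatibleType. A standalone sketch of that idea, with a deliberately
simplified, hypothetical stand-in for the real widening rules (here any
conflict just relaxes to StringType):

    import org.apache.spark.sql.catalyst.types._

    // Simplified widening; the actual getCompatibleType is richer.
    def compatibleType(t1: DataType, t2: DataType): DataType =
      if (t1 == t2) t1 else StringType

    // Keys observed with several types across sampled records.
    val observed = Set[(String, DataType)](
      ("a", IntegerType), ("a", StringType), ("b", DoubleType))

    // Reduce each key's type set down to a single compatible type.
    val resolved: Map[String, DataType] = observed.groupBy(_._1).map {
      case (key, typeSet) => key -> typeSet.map(_._2).reduce(compatibleType)
    }
    // resolved == Map("a" -> StringType, "b" -> DoubleType)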

[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13829492
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala 
---
@@ -122,4 +122,10 @@ private[sql] trait SchemaRDDLike {
   @Experimental
   def saveAsTable(tableName: String): Unit =
 sqlContext.executePlan(InsertIntoCreatedTable(None, tableName, 
logicalPlan)).toRdd
+
+  /** Returns the output schema in the tree format. */
+  def getSchemaTreeString(): String = queryExecution.analyzed.getSchemaTreeString()
--- End diff --

No `get`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13829475
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala 
---
@@ -123,4 +124,10 @@ abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanTy
      case other => Nil
 }.toSeq
   }
+
+  /** Returns the output schema in the tree format. */
+  def getSchemaTreeString(): String = plans.generateSchemaTreeString(output)
--- End diff --

Don't use `get` for scala functions.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13829803
  
--- Diff: 
examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java ---
@@ -56,6 +57,8 @@ public static void main(String[] args) throws Exception {
 JavaSparkContext ctx = new JavaSparkContext(sparkConf);
 JavaSQLContext sqlCtx = new JavaSQLContext(ctx);
 
+
--- End diff --

extra line.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13830011
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/QueryPlan.scala 
---
@@ -123,4 +124,10 @@ abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanTy
      case other => Nil
 }.toSeq
   }
+
+  /** Returns the output schema in the tree format. */
+  def getSchemaTreeString(): String = plans.generateSchemaTreeString(output)
--- End diff --

and no () for purely functional methods.
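
In other words, the Scala convention being asked for, as a toy illustration
(the object and return value are made up for the example):

    object NamingStyle {
      // Discouraged: Java-style `get` prefix and empty parens on a pure method.
      def getSchemaTreeString(): String = schemaTreeString

      // Preferred: no `get`, no parentheses for a side-effect-free accessor.
      def schemaTreeString: String = "root\n |-- name: string (nullable = true)"
    }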


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13830083
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDDLike.scala 
---
@@ -122,4 +122,10 @@ private[sql] trait SchemaRDDLike {
   @Experimental
   def saveAsTable(tableName: String): Unit =
 sqlContext.executePlan(InsertIntoCreatedTable(None, tableName, 
logicalPlan)).toRdd
+
+  /** Returns the output schema in the tree format. */
+  def getSchemaTreeString(): String = queryExecution.analyzed.getSchemaTreeString()
--- End diff --

no ().


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13830433
  
--- Diff: docs/sql-programming-guide.md ---
@@ -62,10 +62,10 @@ descendants.  To create a basic SQLContext, all you 
need is a SparkContext.
 
 {% highlight scala %}
 val sc: SparkContext // An existing SparkContext.
-val sqlContext = new org.apache.spark.sql.SQLContext(sc)
+val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
--- End diff --

I see. I was trying to make those names consistent in the doc for these 
three languages. I will use `sqlContext`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13834977
  
--- Diff: docs/sql-programming-guide.md ---
@@ -310,37 +325,190 @@ parquetFile = sqlCtx.parquetFile("people.parquet")
 # Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerAsTable("parquetFile");
 teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+  print teenName
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+Spark SQL supports querying JSON datasets. To query a JSON dataset, a SchemaRDD needs to be created for this JSON dataset. There are two ways to create a SchemaRDD for a JSON dataset:
--- End diff --

How about we add the section on the rules for schema inference as part of 
the work on the type system?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46249486
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46249650
  
Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46250558
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46250551
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46255972
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15831/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46255971
  
Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13839460
  
--- Diff: docs/sql-programming-guide.md ---
@@ -91,14 +91,33 @@ of its decedents.  To create a basic SQLContext, all 
you need is a SparkContext.
 
 {% highlight python %}
 from pyspark.sql import SQLContext
-sqlCtx = SQLContext(sc)
+sqlContext = SQLContext(sc)
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Running SQL on RDDs
+# Data Sources
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+Spark SQL supports operating of a variety of data sources though the SchemaRDD interface.
--- End diff --

This should be "operating on a variety"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13839520
  
--- Diff: docs/sql-programming-guide.md ---
@@ -170,12 +191,11 @@ A schema can be applied to an existing RDD by calling 
`applySchema` and providin
 for the JavaBean.
 
 {% highlight java %}
-
-JavaSparkContext ctx = ...; // An existing JavaSparkContext.
-JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx)
+// sc is an existing JavaSparkContext.
+JavaSQLContext sqlContext = new org.apache.spark.sql.api.java.JavaSQLContext(sc)
 
 // Load a text file and convert each line to a JavaBean.
-JavaRDD<Person> people = ctx.textFile("examples/src/main/resources/people.txt").map(
+JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
--- End diff --

I think you have changes here that are accidental (this actually reverts 
some improvements/fixes to this doc)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13839554
  
--- Diff: docs/sql-programming-guide.md ---
@@ -64,8 +64,8 @@ descendants.  To create a basic SQLContext, all you need 
is a SparkContext.
 val sc: SparkContext // An existing SparkContext.
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 
-// Importing the SQL context gives access to all the public SQL functions 
and implicit conversions.
-import sqlContext._
+// createSchemaRDD is used to implicitly convert a RDD to a SchemaRDD.
--- End diff --

Instead of "a RDD" it should say "an RDD". There are several instances of 
that in this PR; it would be good to check them out.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13840123
  
--- Diff: docs/sql-programming-guide.md ---
@@ -170,12 +191,11 @@ A schema can be applied to an existing RDD by calling 
`applySchema` and providin
 for the JavaBean.
 
 {% highlight java %}
-
-JavaSparkContext ctx = ...; // An existing JavaSparkContext.
-JavaSQLContext sqlCtx = new 
org.apache.spark.sql.api.java.JavaSQLContext(ctx)
+// sc is an existing JavaSparkContext.
+JavaSQLContext sqlContext = new 
org.apache.spark.sql.api.java.JavaSQLContext(sc)
 
 // Load a text file and convert each line to a JavaBean.
-JavaRDDPerson people = 
ctx.textFile(examples/src/main/resources/people.txt).map(
+JavaRDDPerson people = 
sc.textFile(examples/src/main/resources/people.txt).map(
--- End diff --

actually sorry maybe this is intentional?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread pwendell
Github user pwendell commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13840206
  
--- Diff: docs/sql-programming-guide.md ---
@@ -297,50 +328,152 @@ JavaSchemaRDD teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >=
 <div data-lang="python" markdown="1">
 
 {% highlight python %}
+# sqlContext from the previous example is used in this example.
 
-peopleTable # The SchemaRDD from the previous example.
+schemaPeople # The SchemaRDD from the previous example.
 
 # SchemaRDDs can be saved as Parquet files, maintaining the schema information.
-peopleTable.saveAsParquetFile("people.parquet")
+schemaPeople.saveAsParquetFile("people.parquet")
 
 # Read in the Parquet file created above.  Parquet files are self-describing so the schema is preserved.
 # The result of loading a parquet file is also a SchemaRDD.
-parquetFile = sqlCtx.parquetFile("people.parquet")
+parquetFile = sqlContext.parquetFile("people.parquet")
 
 # Parquet files can also be registered as tables and then used in SQL statements.
 parquetFile.registerAsTable("parquetFile");
-teenagers = sqlCtx.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
-
+teenagers = sqlContext.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+teenNames = teenagers.map(lambda p: "Name: " + p.name)
+for teenName in teenNames.collect():
+  print teenName
 {% endhighlight %}
 
 </div>
 
 </div>
 
-## Writing Language-Integrated Relational Queries
+## JSON Datasets
+<div class="codetabs">
 
-**Language-Integrated queries are currently only supported in Scala.**
+<div data-lang="scala" markdown="1">
+Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD.
+This conversion can be done using one of two methods in a SQLContext:
 
-Spark SQL also supports a domain specific language for writing queries.  Once again,
-using the data from the above examples:
+* `jsonFile` - loads data from a directory of JSON files where each line of the files is a JSON object.
+* `jsonRdd` - loads data from an existing RDD where each element of the RDD is a string containing a JSON object.
 
 {% highlight scala %}
+// sc is an existing SparkContext.
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
-import sqlContext._
-val people: RDD[Person] = ... // An RDD of case class objects, from the first example.
 
-// The following is the same as 'SELECT name FROM people WHERE age >= 10 AND age <= 19'
-val teenagers = people.where('age >= 10).where('age <= 19).select('name)
+// A JSON dataset is pointed by path.
--- End diff --

"is pointed to by path"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread pwendell
Github user pwendell commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46260432
  
I just did a really light pass on the docs and public interfaces exposed. 
From that perspective, this looks good to me!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13840531
  
--- Diff: docs/sql-programming-guide.md ---
@@ -170,12 +191,11 @@ A schema can be applied to an existing RDD by calling 
`applySchema` and providin
 for the JavaBean.
 
 {% highlight java %}
-
-JavaSparkContext ctx = ...; // An existing JavaSparkContext.
-JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx)
+// sc is an existing JavaSparkContext.
+JavaSQLContext sqlContext = new org.apache.spark.sql.api.java.JavaSQLContext(sc)
 
 // Load a text file and convert each line to a JavaBean.
-JavaRDD<Person> people = ctx.textFile("examples/src/main/resources/people.txt").map(
+JavaRDD<Person> people = sc.textFile("examples/src/main/resources/people.txt").map(
--- End diff --

Yeah, I was trying to make variable names consistent across these three 
languages. If there is a convention for a specific language, I can revert 
my changes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46261318
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46261327
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46264603
  
Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46264604
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15837/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-15 Thread marmbrus
Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46111039
  
@mateiz, that's a good point, and actually there is only a single implicit 
conversion needed for all the non-DSL examples (from RDD -> SchemaRDD). 
Perhaps we could import that explicitly and move the other expression 
conversions to a `dsl` object.
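
A self-contained toy showing the shape of that split; the names and types
here are illustrative stand-ins, not Spark's actual API:

    class ToySQLContext {
      // The one conversion non-DSL examples need (stands in for RDD -> SchemaRDD).
      implicit def createSchemaRDD(rows: Seq[Map[String, Any]]): ToySchemaRDD =
        new ToySchemaRDD(rows)

      // Expression conversions for the DSL live behind their own import.
      object dsl {
        implicit class RichSymbol(s: Symbol) {
          def >=(v: Int): String = s"${s.name} >= $v"
        }
      }
    }

    class ToySchemaRDD(val rows: Seq[Map[String, Any]])

    object Demo extends App {
      val ctx = new ToySQLContext
      import ctx.createSchemaRDD          // enough for plain SQL examples
      val table: ToySchemaRDD = Seq(Map("age" -> 15))

      import ctx.dsl._                    // opted into only for DSL queries
      println('age >= 10)                 // prints: age >= 10
    }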


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-14 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46101559
  
Hey Yin, a few comments on the docs:
- You should mention JSON as a data source in the first paragraph of the 
Spark SQL doc (right now it only mentions Parquet and Hive)
- We may want to reorganize the doc to list all the external data sources 
in the same section. It's weird that Parquet and JSON are under "getting 
started", but then Hive is a separate section, and the DSL stuff is in-between. 
I'd actually put the DSL last and move the others into a new top-level 
section called "data sources". You can also leave a quick example of one data 
source in "Getting Started".
- There need to be Java and Python versions of the JSON example (I guess 
you're waiting to implement those APIs?)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-14 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46101589
  
Another note, for @marmbrus as well: I'd really try to minimize the use 
of top-level SQLContext methods being called without `context.` in front of 
them due to `import SQLContext._`. This is confusing for newcomers to Scala and 
makes the code harder to translate across languages. You can *maybe* use it for 
`sql()` and `hql()`, but even then I'd consider moving the implicit conversions 
somewhere else in a future version of the API (e.g. `import SQLContext._` to 
give you those). Let's discuss that more later; for now it would be nice 
to update the examples to avoid these for `jsonFile`, `registerRDDAsTable`, etc.
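
Concretely, that would mean qualifying the calls on the context, roughly 
like this (a sketch using the method names mentioned above; `sc` is an 
existing SparkContext):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // No `import sqlContext._`; every call is qualified.
    val people = sqlContext.jsonFile("examples/src/main/resources/people.json")
    sqlContext.registerRDDAsTable(people, "people")
    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")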


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46046033
  
Programming guide: http://yhuai.github.io/site/sql-programming-guide.html

Scala doc of SQLContext: 
http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext

Scala doc of JsonTable: 
http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.json.JsonTable


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46046107
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46046123
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46046141
  
Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46046144
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15768/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46047235
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46047249
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46047263
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15769/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46047262
  
Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46061901
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46061886
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46061920
  
Merged build finished. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-46061921
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15773/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-12 Thread marmbrus
Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13690133
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala 
---
@@ -0,0 +1,485 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.mutable.HashSet
+
+import org.apache.spark.sql.catalyst.analysis.Star
+import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference}
+import org.apache.spark.sql.catalyst.plans.logical.LeafNode
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.json.JsonTable.enforceCorrectType
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.TestSQLContext
+import org.apache.spark.sql.test.TestSQLContext._
+
+protected case class Schema(output: Seq[Attribute]) extends LeafNode
+
+class JsonSuite extends QueryTest {
+  import TestJsonData._
+  TestJsonData
+
+  test("Primitive field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(primitiveFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("bigInteger", DecimalType, true)() ::
+      AttributeReference("boolean", BooleanType, true)() ::
+      AttributeReference("double", DoubleType, true)() ::
+      AttributeReference("integer", IntegerType, true)() ::
+      AttributeReference("long", LongType, true)() ::
+      AttributeReference("null", StringType, true)() ::
+      AttributeReference("string", StringType, true)() :: Nil
+
+    comparePlans(Schema(expectedSchema), Schema(jsonSchemaRDD.logicalPlan.output))
+
+    jsonSchemaRDD.registerAsTable("jsonTable")
+
+    checkAnswer(
+      sql("select * from jsonTable"),
+      (BigDecimal("92233720368547758070"),
+      true,
+      1.7976931348623157E308,
+      10,
+      21474836470L,
+      null,
+      "this is a simple string.") :: Nil
+    )
+  }
+
+  test("Complex field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(complexFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("arrayOfArray1", ArrayType(ArrayType(StringType)), true)() ::
+      AttributeReference("arrayOfArray2", ArrayType(ArrayType(DoubleType)), true)() ::
+      AttributeReference("arrayOfBigInteger", ArrayType(DecimalType), true)() ::
+      AttributeReference("arrayOfBoolean", ArrayType(BooleanType), true)() ::
+      AttributeReference("arrayOfDouble", ArrayType(DoubleType), true)() ::
+      AttributeReference("arrayOfInteger", ArrayType(IntegerType), true)() ::
+      AttributeReference("arrayOfLong", ArrayType(LongType), true)() ::
+      AttributeReference("arrayOfNull", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfString", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfStruct", ArrayType(
+        StructType(StructField("field1", BooleanType, true) ::
+                   StructField("field2", StringType, true) :: Nil)), true)() ::
+      AttributeReference("struct", StructType(
+        StructField("field1", BooleanType, true) ::
+        StructField("field2", DecimalType, true) :: Nil), true)() ::
+      AttributeReference("structWithArrayFields", StructType(
+        StructField("field1", ArrayType(IntegerType), true) ::
+        StructField("field2", ArrayType(StringType), true) :: Nil), true)() :: Nil
+
+    comparePlans(Schema(expectedSchema), Schema(jsonSchemaRDD.logicalPlan.output))
+
+    jsonSchemaRDD.registerAsTable("jsonTable")
+
+    // Access elements of a primitive array.
+    checkAnswer(
+      sql("select arrayOfString[0], arrayOfString[1], arrayOfString[2] from jsonTable"),
+      ("str1", "str2", null) :: Nil
+    )
+
+    // Access an array of null values.
+    checkAnswer(
+      sql("select arrayOfNull from jsonTable"),
+      Seq(Seq(null, null, null, null))

[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-45814358
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-45814366
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-45814385
  

Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15694/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13601226
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonTable.scala 
---
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.SchemaRDD
+import org.apache.spark.sql.Logging
+import org.apache.spark.sql.catalyst.expressions.{Alias, 
AttributeReference, GetField}
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+import org.apache.spark.sql.catalyst.expressions.GetField
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.execution.SparkLogicalPlan
+import org.apache.spark.sql.catalyst.expressions.Alias
+import org.apache.spark.sql.catalyst.expressions.GetField
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.execution.SparkLogicalPlan
+import org.apache.spark.sql.catalyst.expressions.Alias
+import org.apache.spark.sql.catalyst.types.StructField
+import org.apache.spark.sql.catalyst.types.StructType
+import org.apache.spark.sql.catalyst.types.ArrayType
+import org.apache.spark.sql.catalyst.expressions.GetField
+import org.apache.spark.sql.catalyst.expressions.AttributeReference
+import org.apache.spark.sql.execution.SparkLogicalPlan
+import org.apache.spark.sql.catalyst.expressions.Alias
+
+sealed trait SchemaResolutionMode
+
+case object EAGER_SCHEMA_RESOLUTION extends SchemaResolutionMode
+case class EAGER_SCHEMA_RESOLUTION_WITH_SAMPLING(val fraction: Double) extends SchemaResolutionMode
+case object LAZY_SCHEMA_RESOLUTION extends SchemaResolutionMode
+
+/**
+ * :: Experimental ::
+ * Converts a JSON file to a SparkSQL logical query plan.  This 
implementation is only designed to
+ * work on JSON files that have mostly uniform schema.  The conversion 
suffers from the following
+ * limitation:
+ *  - The data is optionally sampled to determine all of the possible 
fields. Any fields that do
+ *not appear in this sample will not be included in the final output.
+ */
+@Experimental
+object JsonTable extends Serializable with Logging {
+  def inferSchema(
+      json: RDD[String], sampleSchema: Option[Double] = None): LogicalPlan = {
+    val schemaData = sampleSchema.map(json.sample(false, _, 1)).getOrElse(json)
+    val allKeys = parseJson(schemaData).map(getAllKeysWithValueTypes).reduce(_ ++ _)
+
+    // Resolve type conflicts
+    val resolved = allKeys.groupBy {
+      case (key, dataType) => key
+    }.map {
+      // Now, keys and types are organized in the format of
+      // key -> Set(type1, type2, ...).
+      case (key, typeSet) => {
+        val fieldName = key.substring(1, key.length - 1).split("`.`").toSeq
+        val dataType = typeSet.map {
+          case (_, dataType) => dataType
+        }.reduce((type1: DataType, type2: DataType) => getCompatibleType(type1, type2))
+
+        // Finally, we replace all NullType to StringType. We do not need to take care
+        // StructType because all fields with a StructType are represented by a placeholder
+        // StructType(Nil).
+        dataType match {
+          case NullType => (fieldName, StringType)
+          case ArrayType(NullType) => (fieldName, ArrayType(StringType))
+          case other => (fieldName, other)
+        }
+      }
+    }
+
+    def makeStruct(values: Seq[Seq[String]], prefix: Seq[String]): StructType = {
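
Given the signature quoted above, usage would look roughly like this (a
sketch against this revision of the API; `sc` is an existing SparkContext):

    import org.apache.spark.sql.json.JsonTable

    val json = sc.textFile("examples/src/main/resources/people.json")

    // Infer the schema from the full dataset...
    val plan = JsonTable.inferSchema(json)

    // ...or from a 10% sample when the data is large.
    val sampledPlan = JsonTable.inferSchema(json, sampleSchema = Some(0.1))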
 

[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13601819
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala 
---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.TestSQLContext._
+import org.apache.spark.sql.catalyst.expressions.{ExprId, AttributeReference, Attribute}
+import org.apache.spark.sql.catalyst.plans.generateSchemaTreeString
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.util._
+
+class JsonSuite extends QueryTest {
+  import TestJsonData._
+  TestJsonData
+
+  /**
+   * Since attribute references are given globally unique ids during analysis,
+   * we must normalize them to check if two different queries are identical.
+   */
+  protected def normalizeExprIds(attributes: Seq[Attribute]) = {
+    val minId = attributes.map(_.exprId.id).min
+    attributes.map {
+      case a: AttributeReference =>
+        AttributeReference(a.name, a.dataType, a.nullable)(exprId = ExprId(a.exprId.id - minId))
+    }
+  }
+
+  protected def checkSchema(expected: Seq[Attribute], actual: Seq[Attribute]): Unit = {
+    val normalizedExpected = normalizeExprIds(expected).toSeq
+    val normalizedActual = normalizeExprIds(actual).toSeq
+    if (normalizedExpected != normalizedActual) {
+      fail(
+        s"""
+          |=== FAIL: Schemas do not match ===
+          |${sideBySide(
+              s"== Expected Schema ==\n" +
+              generateSchemaTreeString(normalizedExpected),
+              s"==  Actual Schema  ==\n" +
+              generateSchemaTreeString(normalizedActual)).mkString("\n")}
+        """.stripMargin)
+    }
+  }
+
+  test("Primitive field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(primitiveFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("bigInteger", DecimalType, true)() ::
+      AttributeReference("boolean", BooleanType, true)() ::
+      AttributeReference("double", DoubleType, true)() ::
+      AttributeReference("integer", IntegerType, true)() ::
+      AttributeReference("long", LongType, true)() ::
+      AttributeReference("null", StringType, true)() ::
+      AttributeReference("string", StringType, true)() :: Nil
+
+    checkSchema(expectedSchema, jsonSchemaRDD.logicalPlan.output)
+
+    jsonSchemaRDD.registerAsTable("jsonTable")
+
+    checkAnswer(
+      sql("select * from jsonTable"),
+      (BigDecimal("92233720368547758070"),
+      true,
+      1.7976931348623157E308,
+      10,
+      21474836470L,
+      null,
+      "this is a simple string.") :: Nil
+    )
+  }
+
+  test("Complex field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(complexFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("arrayOfArray1", ArrayType(ArrayType(StringType)), true)() ::
+      AttributeReference("arrayOfArray2", ArrayType(ArrayType(DoubleType)), true)() ::
+      AttributeReference("arrayOfBigInteger", ArrayType(DecimalType), true)() ::
+      AttributeReference("arrayOfBoolean", ArrayType(BooleanType), true)() ::
+      AttributeReference("arrayOfDouble", ArrayType(DoubleType), true)() ::
+      AttributeReference("arrayOfInteger", ArrayType(IntegerType), true)() ::
+      AttributeReference("arrayOfLong", ArrayType(LongType), true)() ::
+      AttributeReference("arrayOfNull", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfString", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfStruct", ArrayType(
+        StructType(StructField("field1", BooleanType, true) ::
+                   StructField("field2", StringType, true) :: Nil)), true)() ::
+      AttributeReference("struct", StructType(
+        StructField("field1",

[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13602167
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala 
---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.TestSQLContext._
+import org.apache.spark.sql.catalyst.expressions.{ExprId, AttributeReference, Attribute}
+import org.apache.spark.sql.catalyst.plans.generateSchemaTreeString
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.util._
+
+class JsonSuite extends QueryTest {
+  import TestJsonData._
+  TestJsonData
+
+  /**
+   * Since attribute references are given globally unique ids during analysis,
+   * we must normalize them to check if two different queries are identical.
+   */
+  protected def normalizeExprIds(attributes: Seq[Attribute]) = {
+    val minId = attributes.map(_.exprId.id).min
+    attributes.map {
+      case a: AttributeReference =>
+        AttributeReference(a.name, a.dataType, a.nullable)(exprId = ExprId(a.exprId.id - minId))
+    }
+  }
+
+  protected def checkSchema(expected: Seq[Attribute], actual: Seq[Attribute]): Unit = {
+    val normalizedExpected = normalizeExprIds(expected).toSeq
+    val normalizedActual = normalizeExprIds(actual).toSeq
+    if (normalizedExpected != normalizedActual) {
+      fail(
+        s"""
+          |=== FAIL: Schemas do not match ===
+          |${sideBySide(
+              s"== Expected Schema ==\n" +
+              generateSchemaTreeString(normalizedExpected),
+              s"==  Actual Schema  ==\n" +
+              generateSchemaTreeString(normalizedActual)).mkString("\n")}
+        """.stripMargin)
+    }
+  }
+
+  test("Primitive field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(primitiveFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("bigInteger", DecimalType, true)() ::
+      AttributeReference("boolean", BooleanType, true)() ::
+      AttributeReference("double", DoubleType, true)() ::
+      AttributeReference("integer", IntegerType, true)() ::
+      AttributeReference("long", LongType, true)() ::
+      AttributeReference("null", StringType, true)() ::
+      AttributeReference("string", StringType, true)() :: Nil
+
+    checkSchema(expectedSchema, jsonSchemaRDD.logicalPlan.output)
+
+    jsonSchemaRDD.registerAsTable("jsonTable")
+
+    checkAnswer(
+      sql("select * from jsonTable"),
+      (BigDecimal("92233720368547758070"),
+      true,
+      1.7976931348623157E308,
+      10,
+      21474836470L,
+      null,
+      "this is a simple string.") :: Nil
+    )
+  }
+
+  test("Complex field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(complexFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("arrayOfArray1", ArrayType(ArrayType(StringType)), true)() ::
+      AttributeReference("arrayOfArray2", ArrayType(ArrayType(DoubleType)), true)() ::
+      AttributeReference("arrayOfBigInteger", ArrayType(DecimalType), true)() ::
+      AttributeReference("arrayOfBoolean", ArrayType(BooleanType), true)() ::
+      AttributeReference("arrayOfDouble", ArrayType(DoubleType), true)() ::
+      AttributeReference("arrayOfInteger", ArrayType(IntegerType), true)() ::
+      AttributeReference("arrayOfLong", ArrayType(LongType), true)() ::
+      AttributeReference("arrayOfNull", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfString", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfStruct", ArrayType(
+        StructType(StructField("field1", BooleanType, true) ::
+                   StructField("field2", StringType, true) :: Nil)), true)() ::
+      AttributeReference("struct", StructType(
+        StructField("field1",

[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13603527
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala 
---
@@ -0,0 +1,371 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import org.apache.spark.sql.QueryTest
+import org.apache.spark.sql.test.TestSQLContext._
+import org.apache.spark.sql.catalyst.expressions.{ExprId, AttributeReference, Attribute}
+import org.apache.spark.sql.catalyst.plans.generateSchemaTreeString
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.catalyst.util._
+
+class JsonSuite extends QueryTest {
+  import TestJsonData._
+  TestJsonData
+
+  /**
+   * Since attribute references are given globally unique ids during analysis,
+   * we must normalize them to check if two different queries are identical.
+   */
+  protected def normalizeExprIds(attributes: Seq[Attribute]) = {
+    val minId = attributes.map(_.exprId.id).min
+    attributes.map {
+      case a: AttributeReference =>
+        AttributeReference(a.name, a.dataType, a.nullable)(exprId = ExprId(a.exprId.id - minId))
+    }
+  }
+
+  protected def checkSchema(expected: Seq[Attribute], actual: Seq[Attribute]): Unit = {
+    val normalizedExpected = normalizeExprIds(expected).toSeq
+    val normalizedActual = normalizeExprIds(actual).toSeq
+    if (normalizedExpected != normalizedActual) {
+      fail(
+        s"""
+          |=== FAIL: Schemas do not match ===
+          |${sideBySide(
+              s"== Expected Schema ==\n" +
+                generateSchemaTreeString(normalizedExpected),
+              s"==  Actual Schema  ==\n" +
+                generateSchemaTreeString(normalizedActual)).mkString("\n")}
+        """.stripMargin)
+    }
+  }
+
+  test("Primitive field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(primitiveFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("bigInteger", DecimalType, true)() ::
+      AttributeReference("boolean", BooleanType, true)() ::
+      AttributeReference("double", DoubleType, true)() ::
+      AttributeReference("integer", IntegerType, true)() ::
+      AttributeReference("long", LongType, true)() ::
+      AttributeReference("null", StringType, true)() ::
+      AttributeReference("string", StringType, true)() :: Nil
+
+    checkSchema(expectedSchema, jsonSchemaRDD.logicalPlan.output)
+
+    jsonSchemaRDD.registerAsTable("jsonTable")
+
+    checkAnswer(
+      sql("select * from jsonTable"),
+      (BigDecimal("92233720368547758070"),
+      true,
+      1.7976931348623157E308,
+      10,
+      21474836470L,
+      null,
+      "this is a simple string.") :: Nil
+    )
+  }
+
+  test("Complex field and type inferring") {
+    val jsonSchemaRDD = jsonRDD(complexFieldAndType)
+
+    val expectedSchema =
+      AttributeReference("arrayOfArray1", ArrayType(ArrayType(StringType)), true)() ::
+      AttributeReference("arrayOfArray2", ArrayType(ArrayType(DoubleType)), true)() ::
+      AttributeReference("arrayOfBigInteger", ArrayType(DecimalType), true)() ::
+      AttributeReference("arrayOfBoolean", ArrayType(BooleanType), true)() ::
+      AttributeReference("arrayOfDouble", ArrayType(DoubleType), true)() ::
+      AttributeReference("arrayOfInteger", ArrayType(IntegerType), true)() ::
+      AttributeReference("arrayOfLong", ArrayType(LongType), true)() ::
+      AttributeReference("arrayOfNull", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfString", ArrayType(StringType), true)() ::
+      AttributeReference("arrayOfStruct", ArrayType(
+        StructType(StructField("field1", BooleanType, true) ::
+                   StructField("field2", StringType, true) :: Nil)), true)() ::
+      AttributeReference("struct", StructType(
+        StructField("field1",
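The tests above also show the intended end-to-end flow of the new API: build
an RDD of JSON strings, infer a schema with jsonRDD, register the result as a
table, and query it with SQL. A sketch of that flow, assuming the suite's
TestSQLContext; the sample records and table name here are made up, and only
jsonRDD, registerAsTable, and sql are taken from the diff itself:

    import org.apache.spark.sql.test.TestSQLContext._

    // Two hand-written JSON records standing in for a real dataset.
    val records = sparkContext.parallelize(
      """{"name": "alice", "age": 30}""" ::
      """{"name": "bob",   "age": 25}""" :: Nil)

    // Infer the schema and get a queryable SchemaRDD back.
    val people = jsonRDD(records)

    // Register it as a table, as the tests do, then query it with SQL.
    people.registerAsTable("people")
    sql("select name from people where age > 26").collect().foreach(println)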

[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-45637957
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread yhuai
Github user yhuai commented on a diff in the pull request:

https://github.com/apache/spark/pull/999#discussion_r13604518
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonTable.scala ---
@@ -0,0 +1,364 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.json
+
+import scala.collection.JavaConversions._
+import scala.math.BigDecimal
+
+import com.fasterxml.jackson.databind.ObjectMapper
+
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.plans.logical._
+import org.apache.spark.sql.catalyst.types._
+import org.apache.spark.sql.execution.{ExistingRdd, SparkLogicalPlan}
+import org.apache.spark.sql.{Logging, SchemaRDD}
+
+sealed trait SchemaResolutionMode
+
+case object EAGER_SCHEMA_RESOLUTION extends SchemaResolutionMode
+case class EAGER_SCHEMA_RESOLUTION_WITH_SAMPLING(val fraction: Double) extends SchemaResolutionMode
+case object LAZY_SCHEMA_RESOLUTION extends SchemaResolutionMode
+
+/**
+ * :: Experimental ::
+ * Converts a JSON file to a SparkSQL logical query plan.  This implementation is only designed to
+ * work on JSON files that have mostly uniform schema.  The conversion suffers from the following
+ * limitation:
+ *  - The data is optionally sampled to determine all of the possible fields. Any fields that do
+ *    not appear in this sample will not be included in the final output.
+ */
+@Experimental
+object JsonTable extends Serializable with Logging {
+  def inferSchema(
+      json: RDD[String], sampleSchema: Option[Double] = None): LogicalPlan = {
+    val schemaData = sampleSchema.map(json.sample(false, _, 1)).getOrElse(json)
+    val allKeys = parseJson(schemaData).map(getAllKeysWithValueTypes).reduce(_ ++ _)
+
+    // Resolve type conflicts
+    val resolved = allKeys.groupBy {
+      case (key, dataType) => key
+    }.map {
+      // Now, keys and types are organized in the format of
+      // key -> Set(type1, type2, ...).
+      case (key, typeSet) => {
+        val fieldName = key.substring(1, key.length - 1).split("`.`").toSeq
+        val dataType = typeSet.map {
+          case (_, dataType) => dataType
+        }.reduce((type1: DataType, type2: DataType) => getCompatibleType(type1, type2))
+
+        // Finally, we replace all NullTypes with StringType. We do not need to handle
+        // StructType because all fields with a StructType are represented by a placeholder
+        // StructType(Nil).
+        dataType match {
+          case NullType => (fieldName, StringType)
+          case ArrayType(NullType) => (fieldName, ArrayType(StringType))
+          case other => (fieldName, other)
+        }
+      }
+    }
+
+    def makeStruct(values: Seq[Seq[String]], prefix: Seq[String]): StructType = {
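The conflict-resolution step above groups the observed (key, type) pairs by
key, reduces each group with a compatibility function, and falls back to
StringType for columns that were always null. A standalone sketch of that
logic; the toy DataType hierarchy and widen below are illustrative stand-ins
for catalyst's types and getCompatibleType, not the real implementations:

    object ResolveSketch {
      sealed trait DataType
      case object NullType    extends DataType
      case object IntegerType extends DataType
      case object LongType    extends DataType
      case object DoubleType  extends DataType
      case object StringType  extends DataType

      // Widen two observed types to one that can represent both;
      // incompatible pairs fall back to StringType.
      def widen(a: DataType, b: DataType): DataType = (a, b) match {
        case (t1, t2) if t1 == t2 => t1
        case (NullType, t) => t
        case (t, NullType) => t
        case (IntegerType, LongType) | (LongType, IntegerType) => LongType
        case (IntegerType | LongType, DoubleType) | (DoubleType, IntegerType | LongType) => DoubleType
        case _ => StringType
      }

      def resolve(allKeys: Set[(String, DataType)]): Map[String, DataType] =
        allKeys.groupBy(_._1).map { case (key, pairs) =>
          val dataType = pairs.map(_._2).reduce(widen)
          // A column that only ever held nulls becomes StringType.
          key -> (if (dataType == NullType) StringType else dataType)
        }

      def main(args: Array[String]): Unit = {
        val observed: Set[(String, DataType)] = Set(
          ("a", IntegerType), ("a", LongType),  // widens to LongType
          ("b", NullType),                      // all-null: becomes StringType
          ("c", DoubleType), ("c", StringType)) // incompatible: StringType
        resolve(observed).foreach(println)
      }
    }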
 

[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-45637977
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: [Spark 2060][SQL] Querying JSON Datasets with ...

2014-06-10 Thread yhuai
Github user yhuai commented on the pull request:

https://github.com/apache/spark/pull/999#issuecomment-45642845
  
API doc for sql/core http://yhuai.github.io/spark-sql-core/api/#package


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

