git commit: [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling

2014-11-03 Thread marmbrus
Repository: spark
Updated Branches:
  refs/heads/branch-1.2 292da4ef2 -> cc5dc4247


[SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling

This patch tries to infer a schema for an RDD whose first row contains empty
values (None, [], {}). It scans the first 100 rows, merges the inferred types
into a single schema, and also merges the fields of StructTypes together. If
the schema still contains NullType, it emits a warning telling the user to
retry with sampling.

If a sampling ratio is provided, the schema is inferred from all rows of the sample.

Also adds a samplingRatio parameter to jsonFile() and jsonRDD().
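The "scan the first rows and merge types" strategy described above can be
sketched as follows (a minimal Python 3 sketch with simplified stand-ins for
Spark SQL's DataType classes and merge logic; `infer_type`, `merge_type`, and
`infer_schema` here are hypothetical helpers, not the patch's actual API):

```python
# Sketch: infer one type per field from the first N rows, merging as we go.
import warnings

NULLTYPE = "null"  # placeholder for an un-inferred (all-None) field

def infer_type(value):
    """Map a Python value to a simplified type name; None stays unknown."""
    if value is None:
        return NULLTYPE
    return type(value).__name__  # e.g. "int", "str"

def merge_type(a, b):
    """Merge two inferred types; NULLTYPE is absorbed by any real type."""
    if a == NULLTYPE:
        return b
    if b == NULLTYPE:
        return a
    if a != b:
        raise TypeError("incompatible types: %s and %s" % (a, b))
    return a

def infer_schema(rows, first=100):
    """Infer a field -> type mapping from the first `first` rows."""
    sample = rows[:first]
    fields = {k: NULLTYPE for k in sample[0]}
    for row in sample:
        for k, v in row.items():
            fields[k] = merge_type(fields[k], infer_type(v))
    if NULLTYPE in fields.values():
        # mirrors the patch's behavior of telling the user to sample
        warnings.warn("some fields are still untyped; try samplingRatio")
    return fields

rows = [{"a": None, "b": 1}, {"a": "x", "b": None}]
schema = infer_schema(rows)
# the None in each row is filled in by a later row's concrete type
```

Note how a field that is None in the first row is no longer fatal: its
NullType placeholder is replaced once any later row supplies a real value.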

Author: Davies Liu davies@gmail.com
Author: Davies Liu dav...@databricks.com

Closes #2716 from davies/infer and squashes the following commits:

e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
567dc60 [Davies Liu] update docs
9767b27 [Davies Liu] Merge branch 'master' into infer
e48d7fb [Davies Liu] fix tests
29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
540d1d5 [Davies Liu] merge fields for StructType
f93fd84 [Davies Liu] add more tests
3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by 
sampling the RDD

(cherry picked from commit 24544fbce05665ab4999a1fe5aac434d29cd912c)
Signed-off-by: Michael Armbrust mich...@databricks.com


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/cc5dc424
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/cc5dc424
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/cc5dc424

Branch: refs/heads/branch-1.2
Commit: cc5dc4247979dc001302f7af978801b789acdbfa
Parents: 292da4e
Author: Davies Liu davies@gmail.com
Authored: Mon Nov 3 13:17:09 2014 -0800
Committer: Michael Armbrust mich...@databricks.com
Committed: Mon Nov 3 13:17:25 2014 -0800

--
 python/pyspark/sql.py   | 196 ---
 python/pyspark/tests.py |  19 ++
 .../spark/sql/catalyst/types/dataTypes.scala|   2 +-
 3 files changed, 148 insertions(+), 69 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/cc5dc424/python/pyspark/sql.py
--
diff --git a/python/pyspark/sql.py b/python/pyspark/sql.py
index 98e41f8..675df08 100644
--- a/python/pyspark/sql.py
+++ b/python/pyspark/sql.py
@@ -109,6 +109,15 @@ class PrimitiveType(DataType):
         return self is other
 
 
+class NullType(PrimitiveType):
+
+    """Spark SQL NullType
+
+    The data type representing None, used for the types which has not
+    been inferred.
+    """
+
+
 class StringType(PrimitiveType):
 
     """Spark SQL StringType
@@ -331,7 +340,7 @@ class StructField(DataType):
 
 
 
-    def __init__(self, name, dataType, nullable, metadata=None):
+    def __init__(self, name, dataType, nullable=True, metadata=None):
         """Creates a StructField
         :param name: the name of this field.
         :param dataType: the data type of this field.
@@ -484,6 +493,7 @@ def _parse_datatype_json_value(json_value):
 
 # Mapping Python types to Spark SQL DataType
 _type_mappings = {
+    type(None): NullType,
     bool: BooleanType,
     int: IntegerType,
     long: LongType,
@@ -500,22 +510,22 @@ _type_mappings = {
 
 def _infer_type(obj):
     """Infer the DataType from obj"""
-    if obj is None:
-        raise ValueError("Can not infer type for None")
-
     dataType = _type_mappings.get(type(obj))
     if dataType is not None:
         return dataType()
 
     if isinstance(obj, dict):
-        if not obj:
-            raise ValueError("Can not infer type for empty dict")
-        key, value = obj.iteritems().next()
-        return MapType(_infer_type(key), _infer_type(value), True)
+        for key, value in obj.iteritems():
+            if key is not None and value is not None:
+                return MapType(_infer_type(key), _infer_type(value), True)
+        else:
+            return MapType(NullType(), NullType(), True)
     elif isinstance(obj, (list, array)):
-        if not obj:
-            raise ValueError(\"Can not infer type for empty list/array\")
-        return ArrayType(_infer_type(obj[0]), True)
+        for v in obj:
+            if v is not None:
+                return ArrayType(_infer_type(obj[0]), True)
+        else:
+            return ArrayType(NullType(), True)
     else:
         try:
             return _infer_schema(obj)
@@ -548,60 +558,93 @@ def _infer_schema(row):
     return StructType(fields)
 
 
-def _create_converter(obj, dataType):
+def _has_nulltype(dt):
+    """ Return whether there is NullType in `dt` or not """
+    if isinstance(dt, StructType):
+        return any(_has_nulltype(f.dataType) for f in dt.fields)
+    elif isinstance(dt, ArrayType):
+        return _has_nulltype((dt.elementType))
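The `_has_nulltype` walk shown in the diff recurses into nested struct and
array types to detect any field whose type could not be inferred. A minimal
Python 3 sketch of the same idea, using tiny stand-in classes for the Spark
SQL DataType hierarchy (the real classes live in python/pyspark/sql.py):

```python
# Stand-in type classes; just enough structure for the recursive check.
class NullType:
    pass

class IntegerType:
    pass

class ArrayType:
    def __init__(self, elementType):
        self.elementType = elementType

class StructField:
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType

class StructType:
    def __init__(self, fields):
        self.fields = fields

def has_nulltype(dt):
    """Return True if any nested element of `dt` is NullType."""
    if isinstance(dt, StructType):
        # a struct is "null-typed" if any of its fields is
        return any(has_nulltype(f.dataType) for f in dt.fields)
    if isinstance(dt, ArrayType):
        # recurse into the element type of arrays
        return has_nulltype(dt.elementType)
    return isinstance(dt, NullType)

schema = StructType([
    StructField("a", IntegerType()),
    StructField("b", ArrayType(NullType())),  # still un-inferred
])
# field "b" holds an un-inferred element, so the check reports True
```

This is the condition the patch uses to decide whether to warn the user that
more rows (or a samplingRatio) are needed to pin down the schema.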
