[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-08 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58446924
  
Thanks! I've merged this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-08 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/2563



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58234335
  
Could you rebase this to master?



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread liancheng

Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58291809
  
Finished rebasing.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58292425
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21437/consoleFull) for PR 2563 at commit [`fc92eb3`](https://github.com/apache/spark/commit/fc92eb3ad82c998d5f0ea4e94d730a6e90185d9e).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58292948
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/285/consoleFull) for PR 2563 at commit [`fc92eb3`](https://github.com/apache/spark/commit/fc92eb3ad82c998d5f0ea4e94d730a6e90185d9e).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58299048
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21437/
Test FAILed.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58299045
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21437/consoleFull) for PR 2563 at commit [`fc92eb3`](https://github.com/apache/spark/commit/fc92eb3ad82c998d5f0ea4e94d730a6e90185d9e).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58300431
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/285/consoleFull) for PR 2563 at commit [`fc92eb3`](https://github.com/apache/spark/commit/fc92eb3ad82c998d5f0ea4e94d730a6e90185d9e).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Params(inputFile: String = null, threshold: Double = 0.1)`
  * `class Word2VecModel(object):`
  * `class Word2Vec(object):`




[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-58307821
  
@marmbrus I think this is ready to go.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread liancheng

Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57897337
  
@davis Thanks for all the suggestions, really makes things a lot cleaner!



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57897375
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21289/consoleFull) for PR 2563 at commit [`54c46ce`](https://github.com/apache/spark/commit/54c46ce607c521df4bea390d3cac7d42a6f006f8).
 * This patch **does not** merge cleanly!



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57897538
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21291/consoleFull) for PR 2563 at commit [`785b683`](https://github.com/apache/spark/commit/785b6834e4f0ea24a3b5be4c55d675b8687b12c9).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57898946
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21291/consoleFull) for PR 2563 at commit [`785b683`](https://github.com/apache/spark/commit/785b6834e4f0ea24a3b5be4c55d675b8687b12c9).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57898948
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21291/
Test PASSed.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57899622
  
**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21289/consoleFull)** after a configured wait of `120m`.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57899624
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21289/
Test FAILed.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57900055
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/265/consoleFull) for PR 2563 at commit [`785b683`](https://github.com/apache/spark/commit/785b6834e4f0ea24a3b5be4c55d675b8687b12c9).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57922847
  
@liancheng You mentioned a different user; my ID is davies.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18432650
  
--- Diff: python/pyspark/sql.py ---
@@ -312,42 +358,30 @@ def __repr__(self):
         return ("StructType(List(%s))" %
                 ",".join(str(field) for field in self.fields))
 
+    def jsonValue(self):
+        return {"type": self.typeName(),
+                "fields": map(lambda f: f.jsonValue(), self.fields)}
--- End diff --

A list comprehension is preferred over map and lambda:
```
[f.jsonValue() for f in self.fields]
```
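
A minimal sketch of the two spellings side by side, assuming a hypothetical `Field` stand-in rather than the real pyspark classes. Both produce the same list; note that under Python 3 `map` returns a lazy iterator, which `json.dumps` cannot serialize without a `list(...)` wrapper, so the comprehension is also the more portable form.

```python
import json


class Field(object):
    """Hypothetical stand-in for a struct field; not the real pyspark class."""

    def __init__(self, name):
        self.name = name

    def jsonValue(self):
        return {"name": self.name}


fields = [Field("a"), Field("b")]

# map + lambda, as in the diff under review (wrapped in list() for Python 3).
with_map = list(map(lambda f: f.jsonValue(), fields))

# List comprehension, as suggested in the review comment.
with_comprehension = [f.jsonValue() for f in fields]

print(json.dumps({"type": "struct", "fields": with_comprehension}))
```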



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18432669
  
--- Diff: python/run-tests ---
@@ -60,56 +60,58 @@ fi
 echo Testing with Python version:
 $PYSPARK_PYTHON --version
 
-run_test pyspark/rdd.py
-run_test pyspark/context.py
-run_test pyspark/conf.py
 run_test pyspark/sql.py
-# These tests are included in the module-level docs, and so must
--- End diff --

You can set up the paths in your bashrc:
```
export SPARK_HOME=path_to_spark
export PYTHONPATH=${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip
```
Then you can run any PySpark job directly with python (or run a single test):
```
python python/pyspark/sql.py
```



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18432681
  
--- Diff: python/pyspark/sql.py ---
@@ -62,6 +63,18 @@ def __eq__(self, other):
     def __ne__(self, other):
         return not self.__eq__(other)
 
+    @classmethod
+    def typeName(cls):
+        return cls.__name__[:-4].lower()
+
+    def jsonValue(self):
+        return {"type": self.typeName()}
--- End diff --

If you'd like to use a single string for primitive types, that's still doable; just use a one-level dict for the others.

Either one is good to me.
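
To make the two options concrete, here is a rough sketch of the JSON each would produce for an array of integers. The helper names and the `elementType` key are assumptions made for illustration; they are not taken from the pull request.

```python
import json


def primitive_as_string(type_name):
    # Option 1: a primitive type serializes to its bare name.
    return type_name


def primitive_as_dict(type_name):
    # Option 2: a primitive type serializes to a one-level dict.
    return {"type": type_name}


def array_json(element_json, contains_null):
    # Complex types are one-level dicts under either option.
    # The "elementType" key is an assumption made for this sketch.
    return {"type": "array",
            "elementType": element_json,
            "containsNull": contains_null}


option_1 = array_json(primitive_as_string("integer"), True)
option_2 = array_json(primitive_as_dict("integer"), True)

print(json.dumps(option_1, sort_keys=True))
print(json.dumps(option_2, sort_keys=True))
```

With the first option a parser has to check whether a value is a string or a dict before dispatching; with the second every value carries a `type` key, at the cost of slightly more verbose output.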



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57923097
  
This looks good to me; you just forgot to roll back the changes in run-tests
after debugging.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread liancheng

Github user liancheng commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57923419
  
@davies Sorry for my carelessness... And thanks again for all the great
advice!



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57924412
  
[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21305/consoleFull) for PR 2563 at commit [`de18dea`](https://github.com/apache/spark/commit/de18dead6077327e8870841a6194894ba51b5b9f).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57925709
  
[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21305/consoleFull) for PR 2563 at commit [`de18dea`](https://github.com/apache/spark/commit/de18dead6077327e8870841a6194894ba51b5b9f).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57925711
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21305/
Test PASSed.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-04 Thread davies

Github user davies commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57926775
  
LGTM now, thanks!



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-03 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18383911
  
--- Diff: python/pyspark/sql.py ---
@@ -62,6 +67,17 @@ def __eq__(self, other):
     def __ne__(self, other):
         return not self.__eq__(other)
 
+    def simpleString(self):
+        return _get_simple_string(self.__class__)
--- End diff --

why not just put _get_simple_string here?



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-03 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18384024
  
--- Diff: python/pyspark/sql.py ---
@@ -312,42 +343,24 @@ def __repr__(self):
         return ("StructType(List(%s))" %
                 ",".join(str(field) for field in self.fields))
 
+    def jsonValue(self):
+        return {self.simpleString():
+                {'fields': map(lambda f: f.jsonValue(), self.fields)}}
 
-def _parse_datatype_list(datatype_list_string):
-    """Parses a list of comma separated data types."""
-    index = 0
-    datatype_list = []
-    start = 0
-    depth = 0
-    while index < len(datatype_list_string):
-        if depth == 0 and datatype_list_string[index] == ",":
-            datatype_string = datatype_list_string[start:index].strip()
-            datatype_list.append(_parse_datatype_string(datatype_string))
-            start = index + 1
-        elif datatype_list_string[index] == "(":
-            depth += 1
-        elif datatype_list_string[index] == ")":
-            depth -= 1
-
-        index += 1
-
-    # Handle the last data type
-    datatype_string = datatype_list_string[start:index].strip()
-    datatype_list.append(_parse_datatype_string(datatype_string))
-    return datatype_list
 
+_all_primitive_types = dict((_get_simple_string(v), v)
+                            for v in globals().itervalues()
--- End diff --

It would be better to call `v.simpleString()` here; maybe name it `typeName`?
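
A small sketch of what a `typeName` classmethod could look like, using the `cls.__name__[:-4].lower()` derivation that appears in the later revision of the diff in this thread. The `DataType` base class and the explicit tuple of classes are assumptions made for this example; the pull request itself scans `globals()`.

```python
class DataType(object):
    """Hypothetical base class for this sketch."""

    @classmethod
    def typeName(cls):
        # Drop the trailing "Type" and lowercase: "IntegerType" -> "integer".
        return cls.__name__[:-4].lower()


class IntegerType(DataType):
    pass


class StringType(DataType):
    pass


# Registry keyed by type name, built from an explicit tuple of classes
# instead of scanning globals().
primitive_types = {t.typeName(): t for t in (IntegerType, StringType)}

print(sorted(primitive_types))
```

Because `typeName` is a classmethod, the registry can be built from the classes themselves without instantiating them.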



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-03 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18384157
  
--- Diff: python/pyspark/sql.py ---
@@ -385,51 +398,35 @@ def _parse_datatype_string(datatype_string):
     >>> check_datatype(complex_maptype)
     True
     """
-    index = datatype_string.find("(")
-    if index == -1:
-        # It is a primitive type.
-        index = len(datatype_string)
-    type_or_field = datatype_string[:index]
-    rest_part = datatype_string[index + 1:len(datatype_string) - 1].strip()
-
-    if type_or_field in _all_primitive_types:
-        return _all_primitive_types[type_or_field]()
-
-    elif type_or_field == "ArrayType":
-        last_comma_index = rest_part.rfind(",")
-        containsNull = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            containsNull = False
-        elementType = _parse_datatype_string(
-            rest_part[:last_comma_index].strip())
-        return ArrayType(elementType, containsNull)
-
-    elif type_or_field == "MapType":
-        last_comma_index = rest_part.rfind(",")
-        valueContainsNull = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            valueContainsNull = False
-        keyType, valueType = _parse_datatype_list(
-            rest_part[:last_comma_index].strip())
-        return MapType(keyType, valueType, valueContainsNull)
-
-    elif type_or_field == "StructField":
-        first_comma_index = rest_part.find(",")
-        name = rest_part[:first_comma_index].strip()
-        last_comma_index = rest_part.rfind(",")
-        nullable = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            nullable = False
-        dataType = _parse_datatype_string(
-            rest_part[first_comma_index + 1:last_comma_index].strip())
-        return StructField(name, dataType, nullable)
-
-    elif type_or_field == "StructType":
-        # rest_part should be in the format like
-        # List(StructField(field1,IntegerType,false)).
-        field_list_string = rest_part[rest_part.find("(") + 1:-1]
-        fields = _parse_datatype_list(field_list_string)
+    return _parse_datatype_json_value(json.loads(json_string))
+
+
+def _parse_datatype_json_value(json_value):
+    if type(json_value) is unicode and json_value in _all_primitive_types.keys():
+        return _all_primitive_types[json_value]()
+    elif 'array' in json_value:
+        array_type = json_value['array']
+        element_type = _parse_datatype_json_value(array_type['type'])
+        contains_null = array_type['containsNull']
+        return ArrayType(element_type, contains_null)
--- End diff --

If the jsonValue has only one level, these lines can be written like this:
```
if json_value['type'] == 'array':
    return ArrayType(json_value['element'], json_value['containsNull'])
```
It would also be much easier to do it like this:

```
class ArrayType:
    @classmethod
    def load_from_json(cls, json):
        return ArrayType(json['element'], json['containsNull'])

types[json_value['type']].load_from_json(json_value)
```
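
A runnable sketch of that dispatch idea, under stated assumptions: the `TYPES` registry, the `parse_json_value` helper, and the minimal `IntegerType` and `ArrayType` classes are hypothetical stand-ins, not the pyspark implementation. Unlike the snippet above, it parses the nested element type recursively instead of passing the raw JSON through.

```python
class IntegerType(object):
    @classmethod
    def load_from_json(cls, json_value):
        # Primitive types carry no extra fields.
        return cls()


class ArrayType(object):
    def __init__(self, elementType, containsNull=True):
        self.elementType = elementType
        self.containsNull = containsNull

    @classmethod
    def load_from_json(cls, json_value):
        # Parse the nested element type before building the array type.
        element = parse_json_value(json_value["element"])
        return cls(element, json_value["containsNull"])


# Registry mapping the "type" tag to the class that knows how to load it.
TYPES = {"integer": IntegerType, "array": ArrayType}


def parse_json_value(json_value):
    # One lookup replaces the chain of if/elif branches.
    return TYPES[json_value["type"]].load_from_json(json_value)


parsed = parse_json_value({
    "type": "array",
    "element": {"type": "array",
                "element": {"type": "integer"},
                "containsNull": False},
    "containsNull": True,
})
print(type(parsed.elementType.elementType).__name__)
```

Adding a new type then means adding one class and one registry entry, with no change to the parsing function.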




[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-03 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18399479
  
--- Diff: python/pyspark/sql.py ---
@@ -312,42 +343,24 @@ def __repr__(self):
         return ("StructType(List(%s))" %
                 ",".join(str(field) for field in self.fields))
 
+    def jsonValue(self):
+        return {self.simpleString():
+                {'fields': map(lambda f: f.jsonValue(), self.fields)}}
 
-def _parse_datatype_list(datatype_list_string):
-    """Parses a list of comma separated data types."""
-    index = 0
-    datatype_list = []
-    start = 0
-    depth = 0
-    while index < len(datatype_list_string):
-        if depth == 0 and datatype_list_string[index] == ",":
-            datatype_string = datatype_list_string[start:index].strip()
-            datatype_list.append(_parse_datatype_string(datatype_string))
-            start = index + 1
-        elif datatype_list_string[index] == "(":
-            depth += 1
-        elif datatype_list_string[index] == ")":
-            depth -= 1
-
-        index += 1
-
-    # Handle the last data type
-    datatype_string = datatype_list_string[start:index].strip()
-    datatype_list.append(_parse_datatype_string(datatype_string))
-    return datatype_list
 
+_all_primitive_types = dict((_get_simple_string(v), v)
+                            for v in globals().itervalues()
--- End diff --

`simpleString` is the name chosen in the Scala code, but I agree that
`typeName` is better.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18324999
  
--- Diff: python/pyspark/sql.py ---
@@ -205,6 +234,16 @@ def __str__(self):
         return "ArrayType(%s,%s)" % (self.elementType,
                                      str(self.containsNull).lower())
 
+    simpleString = 'array'
+
+    def jsonValue(self):
+        return {
+            self.simpleString: {
+                'type': self.elementType.jsonValue(),
+                'containsNull': self.containsNull
+            }
+        }
--- End diff --

Any suggestions about indenting and wrapping a complex nested Python data
structure like this? I checked PEP8 while adding these lines, but didn't find
useful guidelines for this case.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18325514
  
--- Diff: python/pyspark/sql.py ---
@@ -205,6 +234,16 @@ def __str__(self):
         return "ArrayType(%s,%s)" % (self.elementType,
                                      str(self.containsNull).lower())
 
+    simpleString = 'array'
+
+    def jsonValue(self):
+        return {
+            self.simpleString: {
+                'type': self.elementType.jsonValue(),
+                'containsNull': self.containsNull
+            }
+        }
--- End diff --

I'd like this one:
```
{self.simpleString: {'type': self.elementType.jsonValue(),
                     'containsNull': self.containsNull}}
```
It would be better if it had only one layer:
```
{'type': self.simpleString,
 'element': self.elementType.jsonValue(),
 'containsNull': self.containsNull}
```

Personally, I prefer fewer lines, so that I can read more code on one screen.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18342875
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
@@ -19,71 +19,127 @@ package org.apache.spark.sql.catalyst.types
 
 import java.sql.Timestamp
 
-import scala.math.Numeric.{FloatAsIfIntegral, BigDecimalAsIfIntegral, DoubleAsIfIntegral}
+import scala.math.Numeric.{BigDecimalAsIfIntegral, DoubleAsIfIntegral, FloatAsIfIntegral}
 import scala.reflect.ClassTag
-import scala.reflect.runtime.universe.{typeTag, TypeTag, runtimeMirror}
+import scala.reflect.runtime.universe.{TypeTag, runtimeMirror, typeTag}
 import scala.util.parsing.combinator.RegexParsers
 
+import org.json4s.JsonAST.JValue
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
 import org.apache.spark.sql.catalyst.ScalaReflectionLock
 import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression}
 import org.apache.spark.util.Utils
 
-/**
- * Utility functions for working with DataTypes.
- */
-object DataType extends RegexParsers {
-  protected lazy val primitiveType: Parser[DataType] =
-    "StringType" ^^^ StringType |
-    "FloatType" ^^^ FloatType |
-    "IntegerType" ^^^ IntegerType |
-    "ByteType" ^^^ ByteType |
-    "ShortType" ^^^ ShortType |
-    "DoubleType" ^^^ DoubleType |
-    "LongType" ^^^ LongType |
-    "BinaryType" ^^^ BinaryType |
-    "BooleanType" ^^^ BooleanType |
-    "DecimalType" ^^^ DecimalType |
-    "TimestampType" ^^^ TimestampType
-
-  protected lazy val arrayType: Parser[DataType] =
-    "ArrayType" ~> "(" ~> dataType ~ "," ~ boolVal <~ ")" ^^ {
-      case tpe ~ _ ~ containsNull => ArrayType(tpe, containsNull)
-    }
 
-  protected lazy val mapType: Parser[DataType] =
-    "MapType" ~> "(" ~> dataType ~ "," ~ dataType ~ "," ~ boolVal <~ ")" ^^ {
-      case t1 ~ _ ~ t2 ~ _ ~ valueContainsNull => MapType(t1, t2, valueContainsNull)
-    }
+object DataType {
+  def fromJson(json: String): DataType = parseDataType(parse(json))
 
-  protected lazy val structField: Parser[StructField] =
-    ("StructField(" ~> "[a-zA-Z0-9_]*".r) ~ ("," ~> dataType) ~ ("," ~> boolVal <~ ")") ^^ {
-      case name ~ tpe ~ nullable  =>
-          StructField(name, tpe, nullable = nullable)
+  private object JSortedObject {
+    def unapplySeq(value: JValue): Option[List[(String, JValue)]] = value match {
+      case JObject(seq) => Some(seq.toList.sortBy(_._1))
+      case _ => None
     }
+  }
 
-  protected lazy val boolVal: Parser[Boolean] =
-    "true" ^^^ true |
-    "false" ^^^ false
+  private def parseDataType(asJValue: JValue): DataType = asJValue match {
+    case JString("boolean") => BooleanType
+    case JString("byte") => ByteType
+    case JString("short") => ShortType
+    case JString("integer") => IntegerType
+    case JString("long") => LongType
+    case JString("float") => FloatType
+    case JString("double") => DoubleType
+    case JString("decimal") => DecimalType
+    case JString("string") => StringType
+    case JString("binary") => BinaryType
+    case JString("timestamp") => TimestampType
+    case JString("null") => NullType
+    case JObject(List(("array", JSortedObject(
+        ("containsNull", JBool(n)), ("type", t: JValue))))) =>
+      ArrayType(parseDataType(t), n)
+    case JObject(List(("struct", JObject(List(("fields", JArray(fields))))))) =>
+      StructType(fields.map(parseStructField))
+    case JObject(List(("map", JSortedObject(
+        ("key", k: JValue), ("value", v: JValue), ("valueContainsNull", JBool(n)))))) =>
+      MapType(parseDataType(k), parseDataType(v), n)
+  }
 
-  protected lazy val structType: Parser[DataType] =
-    "StructType\\([A-zA-z]*\\(".r ~> repsep(structField, ",") <~ "))" ^^ {
-      case fields => new StructType(fields)
-    }
+  private def parseStructField(asJValue: JValue): StructField = asJValue match {
+    case JObject(Seq(("field", JSortedObject(
+        ("name", JString(name)),
+        ("nullable", JBool(nullable)),
+        ("type", dataType: JValue))))) =>
+      StructField(name, parseDataType(dataType), nullable)
+  }
 
-  protected lazy val dataType: Parser[DataType] =
-    arrayType |
-      mapType |
-      structType |
-      primitiveType
+  @deprecated("Use DataType.fromJson instead")
+  def fromCaseClassString(string: String): DataType = CaseClassStringParser(string)
 
   /**
-   * Parses a string representation of a DataType.
-   *
-   * TODO: Generate parser as pickler...
+   * Utility functions for working with DataTypes.
--- End diff --

Ah, this comment is a mistake. Instead

[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18342925
  
--- Diff: python/pyspark/sql.py ---
@@ -205,6 +234,16 @@ def __str__(self):
         return "ArrayType(%s,%s)" % (self.elementType,
                                      str(self.containsNull).lower())
 
+    simpleString = 'array'
+
+    def jsonValue(self):
+        return {
+            self.simpleString: {
+                'type': self.elementType.jsonValue(),
+                'containsNull': self.containsNull
+            }
+        }
--- End diff --

Thanks, I like this style :)



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18345824
  
--- Diff: python/pyspark/sql.py ---
@@ -385,50 +429,32 @@ def _parse_datatype_string(datatype_string):
     >>> check_datatype(complex_maptype)
     True
     """
-    index = datatype_string.find("(")
-    if index == -1:
-        # It is a primitive type.
-        index = len(datatype_string)
-    type_or_field = datatype_string[:index]
-    rest_part = datatype_string[index + 1:len(datatype_string) - 1].strip()
-
-    if type_or_field in _all_primitive_types:
-        return _all_primitive_types[type_or_field]()
-
-    elif type_or_field == "ArrayType":
-        last_comma_index = rest_part.rfind(",")
-        containsNull = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            containsNull = False
-        elementType = _parse_datatype_string(
-            rest_part[:last_comma_index].strip())
-        return ArrayType(elementType, containsNull)
-
-    elif type_or_field == "MapType":
-        last_comma_index = rest_part.rfind(",")
-        valueContainsNull = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            valueContainsNull = False
-        keyType, valueType = _parse_datatype_list(
-            rest_part[:last_comma_index].strip())
-        return MapType(keyType, valueType, valueContainsNull)
-
-    elif type_or_field == "StructField":
-        first_comma_index = rest_part.find(",")
-        name = rest_part[:first_comma_index].strip()
-        last_comma_index = rest_part.rfind(",")
-        nullable = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            nullable = False
-        dataType = _parse_datatype_string(
-            rest_part[first_comma_index + 1:last_comma_index].strip())
-        return StructField(name, dataType, nullable)
-
-    elif type_or_field == "StructType":
-        # rest_part should be in the format like
-        # List(StructField(field1,IntegerType,false)).
-        field_list_string = rest_part[rest_part.find("(") + 1:-1]
-        fields = _parse_datatype_list(field_list_string)
+    return _parse_datatype_json_value(json.loads(json_string))
+
+
+def _parse_datatype_json_value(json_value):
+    if json_value in _all_primitive_types.keys():
--- End diff --

Thanks for the unhashable hint. I'd like to make the resulting JSON string as
compact as possible, which is why all primitive types are serialized to a single
string. I'll add a type check here.
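
A minimal sketch of what such a type check could look like, assuming Python 3 (under Python 2 the check would use `basestring`). The registry contents and the placeholder return value for complex types are hypothetical and only illustrate the control flow.

```python
import json

# Hypothetical registry; in the PR it is built from the DataType classes.
_all_primitive_types = {'integer': int, 'string': str}


def parse_datatype_json_value(json_value):
    # Only strings are looked up as primitive type names, so an
    # unhashable dict never reaches the membership test.
    if isinstance(json_value, str):
        if json_value in _all_primitive_types:
            return _all_primitive_types[json_value]
        raise ValueError('unknown primitive type: %r' % json_value)
    # Complex types arrive as dicts; a real parser would recurse here.
    return ('complex', sorted(json_value.keys()))
```

With this shape, `json.loads('"integer"')` takes the primitive branch and a nested object falls through to the complex branch.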



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57648509
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21205/consoleFull)
 for   PR 2563 at commit 
[`5169238`](https://github.com/apache/spark/commit/51692385ea7c9cde75e37adab776e71d16e26ff3).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57648676
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21205/consoleFull)
 for   PR 2563 at commit 
[`5169238`](https://github.com/apache/spark/commit/51692385ea7c9cde75e37adab776e71d16e26ff3).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57648680
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21205/



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57722826
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21227/consoleFull)
 for   PR 2563 at commit 
[`81e28fb`](https://github.com/apache/spark/commit/81e28fbf89d65202d8b934a9d98c3c60fce2e2a2).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57732086
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21227/



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57732079
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21227/consoleFull)
 for   PR 2563 at commit 
[`81e28fb`](https://github.com/apache/spark/commit/81e28fbf89d65202d8b934a9d98c3c60fce2e2a2).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class GetPeers(blockManagerId: BlockManagerId) extends 
ToBlockManagerMaster`




[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-02 Thread liancheng

Github user liancheng commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18377127
  
--- Diff: python/pyspark/sql.py ---
@@ -62,6 +63,12 @@ def __eq__(self, other):
 def __ne__(self, other):
 return not self.__eq__(other)
 
+def jsonValue(self):
+return self.simpleString
--- End diff --

Thanks for this, it saved lots of boilerplate code! Removed all
`simpleString()` methods in subclasses.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread marmbrus

Github user marmbrus commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18310445
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala ---
@@ -19,71 +19,127 @@ package org.apache.spark.sql.catalyst.types
 
 import java.sql.Timestamp
 
-import scala.math.Numeric.{FloatAsIfIntegral, BigDecimalAsIfIntegral, DoubleAsIfIntegral}
+import scala.math.Numeric.{BigDecimalAsIfIntegral, DoubleAsIfIntegral, FloatAsIfIntegral}
 import scala.reflect.ClassTag
-import scala.reflect.runtime.universe.{typeTag, TypeTag, runtimeMirror}
+import scala.reflect.runtime.universe.{TypeTag, runtimeMirror, typeTag}
 import scala.util.parsing.combinator.RegexParsers
 
+import org.json4s.JsonAST.JValue
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
 import org.apache.spark.sql.catalyst.ScalaReflectionLock
 import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeReference, Expression}
 import org.apache.spark.util.Utils
 
-/**
- * Utility functions for working with DataTypes.
- */
-object DataType extends RegexParsers {
-  protected lazy val primitiveType: Parser[DataType] =
-    "StringType" ^^^ StringType |
-    "FloatType" ^^^ FloatType |
-    "IntegerType" ^^^ IntegerType |
-    "ByteType" ^^^ ByteType |
-    "ShortType" ^^^ ShortType |
-    "DoubleType" ^^^ DoubleType |
-    "LongType" ^^^ LongType |
-    "BinaryType" ^^^ BinaryType |
-    "BooleanType" ^^^ BooleanType |
-    "DecimalType" ^^^ DecimalType |
-    "TimestampType" ^^^ TimestampType
-
-  protected lazy val arrayType: Parser[DataType] =
-    "ArrayType" ~> "(" ~> dataType ~ "," ~ boolVal <~ ")" ^^ {
-      case tpe ~ _ ~ containsNull => ArrayType(tpe, containsNull)
-    }
 
-  protected lazy val mapType: Parser[DataType] =
-    "MapType" ~> "(" ~> dataType ~ "," ~ dataType ~ "," ~ boolVal <~ ")" ^^ {
-      case t1 ~ _ ~ t2 ~ _ ~ valueContainsNull => MapType(t1, t2, valueContainsNull)
-    }
+object DataType {
+  def fromJson(json: String): DataType = parseDataType(parse(json))
 
-  protected lazy val structField: Parser[StructField] =
-    ("StructField(" ~> "[a-zA-Z0-9_]*".r) ~ ("," ~> dataType) ~ ("," ~> boolVal <~ ")") ^^ {
-      case name ~ tpe ~ nullable  =>
-          StructField(name, tpe, nullable = nullable)
+  private object JSortedObject {
+    def unapplySeq(value: JValue): Option[List[(String, JValue)]] = value match {
+      case JObject(seq) => Some(seq.toList.sortBy(_._1))
+      case _ => None
     }
+  }
 
-  protected lazy val boolVal: Parser[Boolean] =
-    "true" ^^^ true |
-    "false" ^^^ false
+  private def parseDataType(asJValue: JValue): DataType = asJValue match {
+    case JString("boolean") => BooleanType
+    case JString("byte") => ByteType
+    case JString("short") => ShortType
+    case JString("integer") => IntegerType
+    case JString("long") => LongType
+    case JString("float") => FloatType
+    case JString("double") => DoubleType
+    case JString("decimal") => DecimalType
+    case JString("string") => StringType
+    case JString("binary") => BinaryType
+    case JString("timestamp") => TimestampType
+    case JString("null") => NullType
+    case JObject(List(("array", JSortedObject(
+        ("containsNull", JBool(n)), ("type", t: JValue))))) =>
+      ArrayType(parseDataType(t), n)
+    case JObject(List(("struct", JObject(List(("fields", JArray(fields))))))) =>
+      StructType(fields.map(parseStructField))
+    case JObject(List(("map", JSortedObject(
+        ("key", k: JValue), ("value", v: JValue), ("valueContainsNull", JBool(n)))))) =>
+      MapType(parseDataType(k), parseDataType(v), n)
+  }
 
-  protected lazy val structType: Parser[DataType] =
-    "StructType\\([A-zA-z]*\\(".r ~> repsep(structField, ",") <~ "))" ^^ {
-      case fields => new StructType(fields)
-    }
+  private def parseStructField(asJValue: JValue): StructField = asJValue match {
+    case JObject(Seq(("field", JSortedObject(
+        ("name", JString(name)),
+        ("nullable", JBool(nullable)),
+        ("type", dataType: JValue))))) =>
+      StructField(name, parseDataType(dataType), nullable)
+  }
 
-  protected lazy val dataType: Parser[DataType] =
-    arrayType |
-      mapType |
-      structType |
-      primitiveType
+  @deprecated("Use DataType.fromJson instead")
+  def fromCaseClassString(string: String): DataType = CaseClassStringParser(string)
 
   /**
-   * Parses a string representation of a DataType.
-   *
-   * TODO: Generate parser as pickler...
+   * Utility functions for working with DataTypes.
--- End diff --

I think this comment is in the wrong

[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread marmbrus

Github user marmbrus commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57544564
  
Minor comment; otherwise this LGTM.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18321283
  
--- Diff: python/pyspark/sql.py ---
@@ -62,6 +63,12 @@ def __eq__(self, other):
 def __ne__(self, other):
 return not self.__eq__(other)
 
+def jsonValue(self):
+return self.simpleString
--- End diff --

You can have a default implementation like this:

self.__class__.__name__[:-4].lower()
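
A small sketch of how that default could sit on a base class, so that subclasses need no per-class string. The names `type_name` and `json_value` are hypothetical here (the PR itself uses `simpleString` and `jsonValue`), and the sketch assumes every class name ends in `Type`.

```python
class DataType:
    """Hypothetical base class for the sketch."""

    @classmethod
    def type_name(cls):
        # Drop the trailing "Type" (4 characters) and lowercase the rest,
        # e.g. IntegerType -> "integer".
        return cls.__name__[:-4].lower()

    def json_value(self):
        # Primitive types serialize to their name by default.
        return self.type_name()


class IntegerType(DataType):
    pass


class StringType(DataType):
    pass
```

Complex types such as arrays would still override `json_value` to add their own fields.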



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18321352
  
--- Diff: python/pyspark/sql.py ---
@@ -205,6 +234,16 @@ def __str__(self):
         return "ArrayType(%s,%s)" % (self.elementType,
                                      str(self.containsNull).lower())
 
+    simpleString = 'array'
+
+    def jsonValue(self):
+        return {
+            self.simpleString: {
+                'type': self.elementType.jsonValue(),
+                'containsNull': self.containsNull
+            }
+        }
--- End diff --

This looks like JS style; it could fit in fewer lines.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread davies

Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2563#discussion_r18321520
  
--- Diff: python/pyspark/sql.py ---
@@ -385,50 +429,32 @@ def _parse_datatype_string(datatype_string):
     >>> check_datatype(complex_maptype)
     True
     """
-    index = datatype_string.find("(")
-    if index == -1:
-        # It is a primitive type.
-        index = len(datatype_string)
-    type_or_field = datatype_string[:index]
-    rest_part = datatype_string[index + 1:len(datatype_string) - 1].strip()
-
-    if type_or_field in _all_primitive_types:
-        return _all_primitive_types[type_or_field]()
-
-    elif type_or_field == "ArrayType":
-        last_comma_index = rest_part.rfind(",")
-        containsNull = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            containsNull = False
-        elementType = _parse_datatype_string(
-            rest_part[:last_comma_index].strip())
-        return ArrayType(elementType, containsNull)
-
-    elif type_or_field == "MapType":
-        last_comma_index = rest_part.rfind(",")
-        valueContainsNull = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            valueContainsNull = False
-        keyType, valueType = _parse_datatype_list(
-            rest_part[:last_comma_index].strip())
-        return MapType(keyType, valueType, valueContainsNull)
-
-    elif type_or_field == "StructField":
-        first_comma_index = rest_part.find(",")
-        name = rest_part[:first_comma_index].strip()
-        last_comma_index = rest_part.rfind(",")
-        nullable = True
-        if rest_part[last_comma_index + 1:].strip().lower() == "false":
-            nullable = False
-        dataType = _parse_datatype_string(
-            rest_part[first_comma_index + 1:last_comma_index].strip())
-        return StructField(name, dataType, nullable)
-
-    elif type_or_field == "StructType":
-        # rest_part should be in the format like
-        # List(StructField(field1,IntegerType,false)).
-        field_list_string = rest_part[rest_part.find("(") + 1:-1]
-        fields = _parse_datatype_list(field_list_string)
+    return _parse_datatype_json_value(json.loads(json_string))
+
+
+def _parse_datatype_json_value(json_value):
+    if json_value in _all_primitive_types.keys():
--- End diff --

If json_value is a dict such as {}, it is not hashable, so you cannot use 'in'
for it.

I would like to use the same type of json_value for all types, such as a dict
with a key called type:

```
{'type': 'int'}
```
Other types could have additional keys, depending on the type, such as:

```
{'type':'array', 'element': {'type':'int'}, 'null': True}
```

This way, it will be easier to do the type switch.
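
Both points can be illustrated with a short sketch, assuming Python 3 and a membership test against the dict itself. The helper names `is_primitive_name` and `describe` are hypothetical and exist only for this example.

```python
def is_primitive_name(json_value, primitive_types):
    """Return True only for hashable values that name a primitive type."""
    try:
        return json_value in primitive_types
    except TypeError:
        # A dict is unhashable, so the membership test itself raises.
        return False


def describe(json_value):
    """Type switch on the flat layout: every value is a dict with a 'type' key."""
    kind = json_value['type']
    if kind == 'array':
        return 'array<%s>' % describe(json_value['element'])
    return kind
```

Because every value has the same shape in the flat layout, the switch needs no check for whether it received a string or a dict.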



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-30 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57272666
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/210/consoleFull)
 for   PR 2563 at commit 
[`03da3ec`](https://github.com/apache/spark/commit/03da3ec870940bd6ff56e03450993da6125b40a4).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-30 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57279975
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/210/consoleFull)
 for   PR 2563 at commit 
[`03da3ec`](https://github.com/apache/spark/commit/03da3ec870940bd6ff56e03450993da6125b40a4).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-28 Thread liancheng

GitHub user liancheng opened a pull request:

https://github.com/apache/spark/pull/2563

[SPARK-3713][SQL] Uses JSON to serialize DataType objects

This PR uses JSON instead of `toString` to serialize `DataType`s. The
latter is not only hard to parse but also flaky in many cases.

Since we already write schema information to Parquet metadata in the old
style, we have to preserve the old `DataType` parser and ensure backward
compatibility. The old parser is now renamed to `CaseClassStringParser` and
moved into `object DataType`.
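
One way a reader of stored schema strings could honor that compatibility requirement is to try JSON first and fall back to the old-style parser. This is only a sketch in Python, not the code in this PR; `parse_schema_string` and both parser arguments are hypothetical names.

```python
import json


def parse_schema_string(schema_string, parse_json_value, parse_legacy_string):
    """Parse a stored schema string, preferring the JSON format.

    Both parser arguments are hypothetical callables supplied by the caller.
    """
    try:
        json_value = json.loads(schema_string)
    except ValueError:
        # Old-style strings such as "ArrayType(IntegerType,true)" are not
        # valid JSON, so hand them to the legacy parser instead.
        return parse_legacy_string(schema_string)
    return parse_json_value(json_value)
```

The fallback relies on old-style strings never being valid JSON, which holds for the case class `toString` form shown in this thread.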

@JoshRosen @davis Please help review PySpark related changes, thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/liancheng/spark datatype-to-json

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2563.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2563


commit dca9153d213a9a9603d7b327d78750af66021ed2
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-09-25T09:28:06Z

De/serializes DataType objects from/to JSON

commit 5f792df158128f6bf41a49e816a915150698a9d2
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-09-28T11:19:34Z

Adds PySpark support

commit 26c6563ab1f7bc9c063da44ecfcb31dff65a3bf1
Author: Cheng Lian lian.cs@gmail.com
Date:   2014-09-28T11:54:26Z

Adds compatibility test case for Parquet type conversion





[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-28 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57084987
  
  [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20938/consoleFull) for PR 2563 at commit [`26c6563`](https://github.com/apache/spark/commit/26c6563ab1f7bc9c063da44ecfcb31dff65a3bf1).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57084988
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20938/



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-28 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57087294
  
  [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20939/consoleFull) for PR 2563 at commit [`03da3ec`](https://github.com/apache/spark/commit/03da3ec870940bd6ff56e03450993da6125b40a4).
 * This patch merges cleanly.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-28 Thread SparkQA

Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57090930
  
**[Tests timed out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20939/consoleFull)** after a configured wait of `120m`.



[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-09-28 Thread AmplabJenkins

Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/2563#issuecomment-57090932
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20939/


