[GitHub] spark pull request: [SPARK-2887] fix bug of countApproxDistinct() ...

2014-08-06 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1812#issuecomment-51402466 cc @mateiz --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-06 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1791#issuecomment-51402751 @JoshRosen @mateiz Could you take a look at this? I hope that this can be in 1.1.

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-06 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1791#discussion_r15907204 --- Diff: python/pyspark/context.py --- @@ -727,6 +738,13 @@ def sparkUser(self): return self._jsc.sc().sparkUser

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-06 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1791#issuecomment-51405218 The difference is whether those unimplemented APIs should be in the API docs. I think we should have a complete set of APIs in Java or Python, and users can easily know

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-06 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1791#discussion_r15916433 --- Diff: python/pyspark/context.py --- @@ -727,6 +738,13 @@ def sparkUser(self): return self._jsc.sc().sparkUser

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-07 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1791#discussion_r15955661 --- Diff: python/pyspark/rdd.py --- @@ -737,6 +754,19 @@ def _collect_iterator_through_file(self, iterator): yield item

[GitHub] spark pull request: [SPARK-2898] [PySpark] fix bugs in deamon.py

2014-08-07 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1842 [SPARK-2898] [PySpark] fix bugs in deamon.py 1. do not use a signal handler for SIGCHLD, since it can easily cause deadlock 2. handle EINTR during accept() 3. pass errno into the JVM 4. handle EAGAIN
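Item 2 above (retrying accept() on EINTR) can be sketched in plain Python; the helper name and structure are illustrative only, not the actual daemon.py code:

```python
import errno
import socket


def accept_with_retry(server):
    """Accept a connection, retrying if interrupted by a signal (EINTR).

    Hypothetical helper illustrating the fix described above; the real
    daemon.py logic differs in detail.
    """
    while True:
        try:
            return server.accept()
        except socket.error as e:
            if e.errno == errno.EINTR:
                continue  # accept() was interrupted by a signal: retry
            raise  # any other error is real and should propagate
```

Without the retry, a signal arriving mid-accept() would surface as a spurious error instead of being transparently absorbed.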

[GitHub] spark pull request: [WIP] [SPARK-2655] Change logging level from I...

2014-08-07 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1776#issuecomment-51543727 Before moving on, maybe the first question is: is Spark 1.1 stable enough that we can trust it to work as expected? If yes in most cases, I think reducing the chatty

[GitHub] spark pull request: [SPARK-2790] [PySPark] fix zip with serializer...

2014-08-11 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1894 [SPARK-2790] [PySPark] fix zip with serializers which have different batch sizes. If two RDDs have different batch sizes in their serializers, zip() will try to re-serialize the one with the smaller batch

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-11 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/1791#discussion_r16072784 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -741,6 +741,23 @@ private[spark] object PythonRDD extends Logging

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-11 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1791#issuecomment-51826195 @JoshRosen @mateiz I have commented out those unimplemented APIs.

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-11 Thread davies
Github user davies closed the pull request at: https://github.com/apache/spark/pull/1791

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-11 Thread davies
GitHub user davies reopened a pull request: https://github.com/apache/spark/pull/1791 [SPARK-2871] [PySpark] Add missing API Try to bring all Java/Scala API to PySpark. You can merge this pull request into a Git repository by running: $ git pull https://github.com/davies/spark

[GitHub] spark pull request: [SPARK-2871] [PySpark] Add missing API

2014-08-11 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1791#issuecomment-51826337 closed by accident

[GitHub] spark pull request: [SPARK-2790] [PySPark] fix zip with serializer...

2014-08-11 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1894#issuecomment-51829030 The failure is not related to this PR; how can I re-test this?

[GitHub] spark pull request: [SPARK-705] [PySpark] improve performance of s...

2014-08-11 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1898 [SPARK-705] [PySpark] improve performance of sortByKey() 1. skip partitionBy() when numOfPartition is 1 2. use bisect_left (O(lg(N))) instead of a loop (O(N)) in rangePartitioner
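The bisect_left idea can be illustrated with a small sketch; `range_partition` and `bounds` are hypothetical names, not PySpark's actual internals:

```python
from bisect import bisect_left


def range_partition(key, bounds):
    """Map a key to a partition id using sorted upper bounds.

    Sketch of the O(lg(N)) lookup mentioned above. `bounds` holds the
    sorted split points between partitions; keys above the last bound
    fall into the final partition.
    """
    # bisect_left finds the leftmost position where key could be
    # inserted to keep bounds sorted, replacing a linear O(N) scan.
    return bisect_left(bounds, key)
```

For example, with bounds [10, 20, 30], key 15 lands in partition 1 and key 35 in partition 3.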

[GitHub] spark pull request: fix flaky tests

2014-08-12 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1910 fix flaky tests Python 2.6 does not handle floating-point error as well as 2.7+

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-12 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/1912 [SPARK-1065] [PySpark] improve supporting for large broadcast Passing large objects through py4j is very slow (and costs much memory), so pass broadcast objects via files (similar to parallelize

[GitHub] spark pull request: [SPARK-1065] [PySpark] improve supporting for ...

2014-08-12 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1912#issuecomment-52001814 The failed tests were not related to this PR.

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-10-07 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18557162 --- Diff: python/pyspark/streaming/tests.py --- @@ -0,0 +1,548 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-10-07 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18557210 --- Diff: python/pyspark/streaming/context.py --- @@ -0,0 +1,319 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-10-07 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-58282175 @tdas It looks like the tests are stable enough; they had 5 successes in a row.

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-10-07 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-58284518 I have created a JIRA to track the hacks for py4j: https://issues.apache.org/jira/browse/SPARK-3842

[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-07 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2563#issuecomment-58307821 @marmbrus I think this is ready to go.

[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2716 [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling This patch will try to infer the schema for an RDD which has empty values (None, [], {}) in the first row. It will try the first 100 rows
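The sampling idea can be sketched as follows; this is an illustrative approximation with made-up names, not the actual PySpark inference code:

```python
def infer_field_types(rows, sample=100):
    """Infer a Python type per field, skipping empty values.

    Toy sketch of the approach above: scan up to `sample` rows, and
    when a field holds an empty value (None, [], {}), defer to a later
    row that carries a concrete value.
    """
    types = {}
    for row in rows[:sample]:
        for name, value in row.items():
            # skip empty values; a later sampled row may be concrete
            if value is None or value == [] or value == {}:
                continue
            types.setdefault(name, type(value))
    return types
```

So a first row like {'a': None, 'b': 1} no longer forces inference to fail on field 'a' if a later row supplies a real value.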

[GitHub] spark pull request: [SPARK-3855][SQL] Preserve the result attribut...

2014-10-08 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2717#issuecomment-58436504 It's reproducible by this query:
```
SELECT strlen(a) FROM test WHERE strlen(a) > 1
```

[GitHub] spark pull request: [SPARK-3855][SQL] Preserve the result attribut...

2014-10-08 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2717#issuecomment-58438988 @marmbrus Could you add a test in python/pyspark/tests.py (SQLTests)?

[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-08 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-58440322 @nchammas This PR only fixes the problem of having empty values in the first few rows; it cannot handle different types for one field (like what json() had done

[GitHub] spark pull request: [SPARK-3772] Allow `ipython` to be used by Pys...

2014-10-08 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2651#issuecomment-58445558 LGTM, thanks!

[GitHub] spark pull request: [SPARK-3721] [PySpark] broadcast objects large...

2014-10-08 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2659#issuecomment-58445706 The code in the JIRA could be used to test this.

[GitHub] spark pull request: [SPARK-3868][PySpark] Hard to recognize which ...

2014-10-08 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2724#issuecomment-58456661 LGTM. Jenkins, test this please.

[GitHub] spark pull request: [SPARK-3855][SQL] Preserve the result attribut...

2014-10-09 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2717#discussion_r18687839 --- Diff: python/pyspark/tests.py --- @@ -679,6 +679,12 @@ def test_udf(self): [row] = self.sqlCtx.sql("SELECT twoArgs('test', 1)").collect

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-10-10 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-58691755 @giwa No, it's under testing/QA

[GitHub] spark pull request: [SPARK-3855][SQL] Preserve the result attribut...

2014-10-10 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2717#issuecomment-58710443 @marmbrus LGTM; just wondering why you do not use IntegerType as the returnType in the tests? (no change needed)

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-10-10 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-58739136 @tdas The failure looked weird; updater() takes exactly two arguments. Let's test it again.

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-10-10 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-58739234 @tdas it's my mistake, updateStateByKey() was used in another test; it's fixed now.

[GitHub] spark pull request: [Spark] RDD take() method: overestimate too mu...

2014-10-13 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2648#issuecomment-58925491 @yingjieMiao it looks good to me, waiting for other people.

[GitHub] spark pull request: [SPARK-3916] [Streaming] discover new appended...

2014-10-14 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2806 [SPARK-3916] [Streaming] discover new appended data for fileStream() In the case that new data is appended to existing files continuously, fileStream() should discover the newly appended data

[GitHub] spark pull request: [SPARK-3133] embed small object in broadcast t...

2014-10-14 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2681#issuecomment-59120483 Sometimes I hit this bug during PySpark testing: ``` Py4JJavaError: An error occurred while calling o55.collect. : org.apache.spark.SparkException: Job aborted due

[GitHub] spark pull request: [SPARK-3952] add Python examples in Streaming ...

2014-10-14 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2808 [SPARK-3952] add Python examples in Streaming Programming Guide Add Python examples to the Streaming Programming Guide. Also add a RecoverableNetworkWordCount example.

[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2819 [SPARK-3961] Python API for mllib.feature Added a complete Python API for mllib.feature: Normalizer, StandardScalerModel, StandardScaler, HashingTF, IDFModel, IDF

[GitHub] spark pull request: [SPARK-3961] Python API for mllib.feature

2014-10-15 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2819#discussion_r18932968 --- Diff: python/pyspark/mllib/feature.py --- @@ -95,90 +360,46 @@ class Word2Vec(object): sentence = a b * 100 + a c * 10 localDoc

[GitHub] spark pull request: [SPARK-3971] [MLLib] [PySpark] hotfix: Customi...

2014-10-16 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2830 [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode A customized pickler should be registered before unpickling, but in the executor there is no way to register

[GitHub] spark pull request: [SPARK-3971] [MLLib] [PySpark] hotfix: Customi...

2014-10-16 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2830#issuecomment-59430263 @mengxr @falaki it has passed all the tests; the last two commits are just refactoring. I think it's ready to merge.

[GitHub] spark pull request: [SPARK-3971] [MLLib] [PySpark] hotfix: Customi...

2014-10-16 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2830#issuecomment-59437296 @mengxr The failed case is not related.

[GitHub] spark pull request: [SPARK-3971] [MLLib] [PySpark] hotfix: Customi...

2014-10-16 Thread davies
Github user davies closed the pull request at: https://github.com/apache/spark/pull/2830

[GitHub] spark pull request: [SPARK-3982] [Streaming] [PySpark] Python API:...

2014-10-16 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2833 [SPARK-3982] [Streaming] [PySpark] Python API: receiverStream() This patch brings receiverStream() to the Python API; it can be used to create an input stream with any arbitrary user-implemented

[GitHub] spark pull request: [SPARK-3133] embed small object in broadcast t...

2014-10-16 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2681#issuecomment-59464471 This error did not happen in the tests of this PR; it happened in tests of our product, which has a similar pattern to streaming: the job was submitted via py4j

[GitHub] spark pull request: [SPARK-3952] [Streaming] [PySpark] add Python ...

2014-10-16 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2808#discussion_r19001885 --- Diff: docs/streaming-programming-guide.md --- @@ -398,6 +498,30 @@ JavaSparkContext sc = ... //existing JavaSparkContext JavaStreamingContext ssc

[GitHub] spark pull request: [SPARK-3952] [Streaming] [PySpark] add Python ...

2014-10-16 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2808#issuecomment-59469000 @JoshRosen I have addressed your comments, and also added code tabs for the design patterns section.

[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2838 [SPARK-3993] [PySpark] fix bug while reuse worker after take() After take(), there may be some garbage left in the socket, so the next task assigned to this worker will hang because of corrupted

[GitHub] spark pull request: [SPARK-3952] [Streaming] [PySpark] add Python ...

2014-10-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2808#issuecomment-59576834 @JoshRosen fixed.

[GitHub] spark pull request: [SPARK-3916] [Streaming] discover new appended...

2014-10-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2806#issuecomment-59578404 @tdas Could you help review this? The failed tests run stably locally; I'm investigating.

[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-17 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2743#discussion_r19051269 --- Diff: python/pyspark/conf.py --- @@ -57,6 +57,22 @@ __all__ = ['SparkConf'] +def _parse_memory(s): + +Parse a memory

[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-17 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2743#discussion_r19051286 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -63,9 +64,12 @@ private[spark] class PythonRDD( val localdir

[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2838#issuecomment-59596407 @aarondav Yes, before reusing workers, every Python task would fork a new Python worker.

[GitHub] spark pull request: [SPARK-3133] embed small object in broadcast t...

2014-10-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2681#issuecomment-59634598 I hope that we can have this in 1.1; some people see a regression in 1.1 because of TorrentBroadcast, and this patch will help in those cases.

[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2838#discussion_r19057007 --- Diff: python/pyspark/worker.py --- @@ -57,7 +57,7 @@ def main(infile, outfile): boot_time = time.time() split_index = read_int

[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2838#issuecomment-59635035 take() is not the only one that can introduce problems; a user could call mapPartitions() and read only part of the items in the infile. Not only re-use the worker

[GitHub] spark pull request: [SPARK-3952] [Streaming] [PySpark] add Python ...

2014-10-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2808#issuecomment-59635066 @JoshRosen updated the readme.

[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-19 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2844#discussion_r19063222 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -76,23 +87,20 @@ private[spark] class TorrentBroadcast[T: ClassTag

[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-19 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2844#discussion_r19063253 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -62,6 +59,20 @@ private[spark] class TorrentBroadcast[T: ClassTag

[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-19 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2844#discussion_r19063271 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -156,6 +158,7 @@ private[spark] class TorrentBroadcast[T: ClassTag

[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-19 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2844#discussion_r19063455 --- Diff: core/src/main/scala/org/apache/spark/broadcast/TorrentBroadcast.scala --- @@ -179,43 +183,29 @@ private[spark] class TorrentBroadcast[T: ClassTag

[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-19 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2743#discussion_r19066507 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -63,9 +64,12 @@ private[spark] class PythonRDD( val localdir

[GitHub] spark pull request: [SPARK-3958] TorrentBroadcast cleanup / debugg...

2014-10-20 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2844#issuecomment-59803868 LGTM now, thanks!

[GitHub] spark pull request: [SPARK-4023] [MLlib] [PySpark] convert rdd int...

2014-10-20 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2870 [SPARK-4023] [MLlib] [PySpark] convert rdd into RDD of Vector Convert the input rdd into an RDD of Vector. cc @mengxr

[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-20 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2743#issuecomment-59878121 @pwendell There are two PullRequestBuilder plugins; one works, and the other (called NewSparkPullRequestBuilder) is still failing.

[GitHub] spark pull request: [SPARK-3888] [PySpark] limit the memory used b...

2014-10-21 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2743#issuecomment-60036985 Hold this PR; we may not need it anymore.

[GitHub] spark pull request: [SPARK-4051] [SQL] [PySQL] Convert Row into di...

2014-10-22 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2896 [SPARK-4051] [SQL] [PySQL] Convert Row into dictionary Added a method to Row to turn a row into a dict:
```
>>> row = Row(a=1)
>>> row.asDict()
{'a': 1}
```
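The asDict() behavior can be sketched with a plain namedtuple; this is a toy approximation with made-up names, not pyspark.sql.Row's actual implementation:

```python
from collections import namedtuple


def make_row(**kwargs):
    """Build a toy Row from keyword arguments, sketching asDict().

    Fields are sorted by name, mirroring how keyword-based Rows order
    their fields, and an asDict() method zips names with values.
    """
    fields = sorted(kwargs)
    cls = namedtuple("Row", fields)

    def as_dict(self):
        # pair each field name with its value to build a plain dict
        return dict(zip(self._fields, self))

    cls.asDict = as_dict
    return cls(**kwargs)
```

Usage mirrors the snippet above: make_row(a=1).asDict() gives {'a': 1}, while attribute access (row.a) keeps working as with a normal Row.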

[GitHub] spark pull request: Clarify docstring for Pyspark's foreachPartiti...

2014-10-23 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2895#issuecomment-60292102 @JoshRosen It would be better if we could easily backport them.

[GitHub] spark pull request: [SPARK-3993] [PySpark] fix bug while reuse wor...

2014-10-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2838#discussion_r19304373 --- Diff: python/pyspark/worker.py --- @@ -131,6 +130,14 @@ def process(): for (aid, accum) in _accumulatorRegistry.items(): pickleSer

[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type

2014-10-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2901#discussion_r19312590 --- Diff: python/pyspark/sql.py --- @@ -1065,7 +1074,9 @@ def applySchema(self, rdd, schema): [Row(field1=1, field2=u'row1'),..., Row(field1=3

[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type

2014-10-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2901#discussion_r19313312 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala --- @@ -372,13 +372,20 @@ private[sql] object JsonRDD extends Logging

[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type

2014-10-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2901#discussion_r19313556 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/json/JsonRDD.scala --- @@ -372,13 +372,20 @@ private[sql] object JsonRDD extends Logging

[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type

2014-10-23 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2901#issuecomment-60321798 Thanks for fixing so many typos! It would be awesome to recognize all Date/Timestamp values in JsonRDD. If it's not easy to do in this PR, we could do

[GitHub] spark pull request: add a util method for changing the log level w...

2014-10-23 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2433#issuecomment-60322408 @holdenk Could you add some examples of what the logging levels should be? Also list all the valid names in the docstring. @tdas We could use this in the Streaming

[GitHub] spark pull request: [SPARK-3569][SQL] Add metadata field to Struct...

2014-10-23 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2701#discussion_r19314368 --- Diff: python/pyspark/sql.py --- @@ -305,12 +305,15 @@ class StructField(DataType): -def __init__(self, name, dataType

[GitHub] spark pull request: [SPARK-2652] [PySpark] donot use KyroSerialize...

2014-10-23 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2916 [SPARK-2652] [PySpark] donot use KyroSerializer as default serializer KryoSerializer cannot serialize a customized class unless it is registered explicitly, so using it as the default serializer in PySpark

[GitHub] spark pull request: simplify serializer, use AutoBatchedSerializer...

2014-10-23 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2920 simplify serializer, use AutoBatchedSerializer by default. This PR simplifies the serializer: always use a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.

[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type

2014-10-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2901#discussion_r19346201 --- Diff: python/pyspark/sql.py --- @@ -1084,10 +1096,11 @@ def applySchema(self, rdd, schema): ... StructField(null, DoubleType(), True

[GitHub] spark pull request: [SPARK-3988][SQL] add public API for date type

2014-10-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2901#discussion_r19353769 --- Diff: python/pyspark/sql.py --- @@ -1084,10 +1096,11 @@ def applySchema(self, rdd, schema): ... StructField(null, DoubleType(), True

[GitHub] spark pull request: [SPARK-3133] embed small object in broadcast t...

2014-10-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2681#issuecomment-60431415 @JoshRosen Could you look at this again? I have rebased it on your changes; hope this makes the tests more stable.

[GitHub] spark pull request: use broadcast for task only when task is large...

2014-10-24 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2933 use broadcast for task only when task is large enough Using broadcast for small tasks has no benefit and even causes some regressions (several extra RPCs); there are also some stability issues with broadcast, so we
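The size-based dispatch can be sketched as follows; the threshold value and function names are assumptions for illustration, not Spark's actual choices:

```python
def dispatch_task(task_bytes, broadcast_fn, direct_fn, threshold=100 * 1024):
    """Send a serialized task directly if small, via broadcast if large.

    Sketch of the idea above: for a large task, paying the broadcast
    cost once lets executors fetch it in parallel; for a small task,
    the extra RPCs of broadcast outweigh any benefit. The 100 KB
    threshold here is an arbitrary illustrative value.
    """
    if len(task_bytes) >= threshold:
        # large task: ship via broadcast so executors fetch chunks
        return broadcast_fn(task_bytes)
    # small task: send inline with the task description
    return direct_fn(task_bytes)
```

The callables stand in for whatever transport layer actually serializes and sends the task.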

[GitHub] spark pull request: [SPARK-3133] embed small object in broadcast t...

2014-10-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2681#issuecomment-60445861 PR #2933 will do similar things as this one; it also works for HttpBroadcast.

[GitHub] spark pull request: [SPARK-4080] Only throw IOException from [writ...

2014-10-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2932#issuecomment-60449095 Cool, LGTM!

[GitHub] spark pull request: [SPARK-4082] remove unnecessary broadcast for ...

2014-10-24 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2935 [SPARK-4082] remove unnecessary broadcast for conf We already broadcast the task (RDD and closure) itself, so small data used in an RDD or closure does not need to be broadcast explicitly any
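The point being made — once the whole task is shipped (and possibly broadcast) as one serialized unit, small data embedded in it rides along for free, so it needs no separate explicit broadcast — can be illustrated with plain pickling. This is an analogy only; `Task` and its fields are hypothetical, not Spark's API.

```python
import pickle

class Task:
    """A task that carries a small configuration object inline.

    Because the whole task is serialized and shipped as one payload,
    the embedded conf travels with it automatically; broadcasting the
    conf separately would be redundant.
    """
    def __init__(self, conf):
        self.conf = conf  # small data, embedded directly in the task

    def run(self, record):
        return (self.conf["prefix"], record)

# Serializing the task embeds the conf in the same payload.
payload = pickle.dumps(Task({"prefix": "job-1"}))
restored = pickle.loads(payload)
```

In real Spark the closure serializer (cloudpickle on the Python side) plays the role that `pickle` plays here.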

[GitHub] spark pull request: [SPARK-3133] embed small object in broadcast t...

2014-10-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2681#issuecomment-60452847 Even if we merge #2933, I would still like to have this, because people may use broadcast for small datasets (such as in MLlib), and this patch can improve those cases

[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

2014-10-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2933#discussion_r19369470 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -124,6 +123,10 @@ class DAGScheduler( /** If enabled, we may run

[GitHub] spark pull request: [SPARK-2585] remove unnecessary broadcast for ...

2014-10-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2935#issuecomment-60468274 But inside readFields(), it may call new Configuration(), so we still need to synchronize it here.

[GitHub] spark pull request: [SPARK-2585] remove unnecessary broadcast for ...

2014-10-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2935#discussion_r19373926 --- Diff: core/src/main/scala/org/apache/spark/SerializableWritable.scala --- @@ -38,8 +38,10 @@ class SerializableWritable[T : Writable](@transient var t: T

[GitHub] spark pull request: [SPARK-2585] remove unnecessary broadcast for ...

2014-10-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2935#issuecomment-60472321 @JoshRosen this PR is ready to review, thanks!

[GitHub] spark pull request: [SPARK-3594] [PySpark] [SQL] take more rows to...

2014-10-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2716#issuecomment-60473788 failed: ``` [info] - sorting without aggregation, with spill *** FAILED *** [info] java.io.FileNotFoundException: /tmp/spark-local-20141024230838-6b0e/07

[GitHub] spark pull request: [SPARK-4088] [PySpark] Python worker should ex...

2014-10-25 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2941 [SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM In the case of take() or an exception in Python, the Python worker may exit before the JVM has read all of the response; then the write

[GitHub] spark pull request: [SPARK-4088] [PySpark] Python worker should ex...

2014-10-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2941#issuecomment-60474604 The race is about which of the reader or writer threads notices first that the worker has exited. If the reader notices it first, there is no problem, but if the writer notices it first

[GitHub] spark pull request: [SPARK-4088] [PySpark] Python worker should ex...

2014-10-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2941#issuecomment-60475340 It's not easy to reproduce this failure, but it did fail in Jenkins: ``` == ERROR

[GitHub] spark pull request: [SPARK-4088] [PySpark] Python worker should ex...

2014-10-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2941#issuecomment-60475355 Also, I cannot reproduce this without daemon.py (simulating the behavior on Windows).

[GitHub] spark pull request: [SPARK-2585] remove unnecessary broadcast for ...

2014-10-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2935#issuecomment-60497056 Jenkins, test this please.

[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

2014-10-25 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2933#discussion_r19377630 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -124,6 +123,10 @@ class DAGScheduler( /** If enabled, we may run

[GitHub] spark pull request: [SPARK-4087] use broadcast for task only when ...

2014-10-25 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2933#discussion_r19377652 --- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala --- @@ -124,6 +123,10 @@ class DAGScheduler( /** If enabled, we may run
