Re: Mesos Spark Fine Grained Execution - CPU count

2016-12-24 Thread Davies Liu
Using 0 for spark.mesos.mesosExecutor.cores is better than dynamic allocation, but you have to pay a little more overhead for launching a task, which should be OK if the task is not trivial. Since the direct result (up to 1M by default) also goes through Mesos, it's better to tune it lower.
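A minimal configuration sketch of the tuning above, assuming the Spark 2.x property names spark.mesos.mesosExecutor.cores and spark.task.maxDirectResultSize; the 128 KB value is purely illustrative:
```
from pyspark import SparkConf, SparkContext

# Sketch only: 0 cores means fine-grained mode reserves no CPU per executor;
# lowering maxDirectResultSize sends results above the threshold through the
# block manager instead of the Mesos status-update path.
conf = (SparkConf()
        .set("spark.mesos.mesosExecutor.cores", "0")
        .set("spark.task.maxDirectResultSize", "131072"))  # 128 KB, illustrative

sc = SparkContext(conf=conf)
```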

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Davies Liu
+1 On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a > majority of at least 3+1 PMC votes are cast. > > [ ] +1 Release

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Davies Liu
+1 for Matei's point. On Thu, Oct 27, 2016 at 8:36 AM, Matei Zaharia wrote: > Just to comment on this, I'm generally against removing these types of > things unless they create a substantial burden on project contributors. It > doesn't sound like Python 2.6 and Java 7 do

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Davies Liu
+1 On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin wrote: > Greetings from Spark Summit Europe at Brussels. > > Please vote on releasing the following candidate as Apache Spark version > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if a > majority of

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Davies Liu
+1 (non-binding) On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley wrote: > +1 > > On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee wrote: >> >> +1 (non-binding) >> On Sun, Sep 25, 2016 at 23:20 Jeff Zhang wrote: >>> >>> +1 >>> >>> On

Re: Making BatchPythonEvaluation actually Batch

2016-03-31 Thread Davies Liu
@Justin, it's fixed by https://github.com/apache/spark/pull/12057 On Thu, Feb 11, 2016 at 11:26 AM, Davies Liu <dav...@databricks.com> wrote: > Had a quick look in your commit, I think that make sense, could you > send a PR for that, then we can review it. > > In order to s

Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-07 Thread Davies Liu
do you intend to say that the underlying state might > change, because of some state update APIs? > Or is it due to some other rationale? > > Regards, > Rishitesh Mishra, > SnappyData (http://www.snappydata.io/) > > https://in.linkedin.com/in/rishiteshmishra > > On Thu, Mar 3, 20

Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-02 Thread Davies Liu
about dataset size on disk vs. memory. > > -Matt Cheah > > On 3/2/16, 10:15 AM, "Davies Liu" <dav...@databricks.com> wrote: > >>UnsafeHashedRelation and HashedRelation could also be used in Executor >>(for non-broadcast hash join), then the UnsafeRow could co

Re: HashedRelation Memory Pressure on Broadcast Joins

2016-03-02 Thread Davies Liu
UnsafeHashedRelation and HashedRelation could also be used in the Executor (for non-broadcast hash join), then the UnsafeRow could come from UnsafeProjection, so we should copy the rows for safety. We could have a smarter copy() for UnsafeRow (avoid the copy if it's already copied), but I don't think
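A hedged sketch of that smarter copy() idea, written in Python purely for illustration (the actual code under discussion is Scala): track whether a row already owns its backing buffer and skip the copy when it does.
```
# Sketch only: remember whether a row was already copied once, and reuse it.
class UnsafeRowSketch:
    def __init__(self, data, owns_buffer=False):
        self.data = data
        self.owns_buffer = owns_buffer

    def copy(self):
        if self.owns_buffer:      # already owns its buffer; safe to reuse
            return self
        return UnsafeRowSketch(bytes(self.data), owns_buffer=True)
```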

Re: Making BatchPythonEvaluation actually Batch

2016-02-11 Thread Davies Liu
I had a quick look at your commit; I think that makes sense. Could you send a PR for that? Then we can review it. In order to support 2), we need to change the serialized Python function from `f(iter)` to `f(x)`, processing one row at a time (not a partition), then we can easily combine them together:
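A rough sketch, not the actual Spark internals, of why per-row functions `f(x)` compose more easily than per-partition functions `f(iter)`; the UDFs below are hypothetical, for illustration only:
```
# Two row-wise UDFs can be fused into one Python call per row,
# so a single round trip evaluates both.
def fuse(f, g):
    def fused(x):
        return g(f(x))
    return fused

strip = lambda s: s.strip()   # hypothetical UDF
upper = lambda s: s.upper()   # hypothetical UDF

evaluate = fuse(strip, upper)
print(evaluate("  hello "))   # 'HELLO'
```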

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
+1 On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas wrote: > +1 > > Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python > 2.6 is ancient history and the core Python developers stopped supporting it > in 2013. RHEL 5 is not a good enough reason

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Davies Liu
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661 On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers wrote: > i do not think so. > > does the python 2.7 need to be installed on all slaves? if so, we do not > have direct access to those. > > also, spark is easy for us

Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-03 Thread Davies Liu
Is this https://github.com/apache/spark/pull/10134 a valid fix? (still worse than 1.5) On Thu, Dec 3, 2015 at 8:45 AM, mkhaitman wrote: > I reported this in the 1.6 preview thread, but wouldn't mind if someone can > confirm that ctrl-c is not keyboard interrupting /

Re: pyspark with pypy not work for spark-1.5.1

2015-11-13 Thread Davies Liu
We already test CPython 2.6, CPython 3.4 and PyPy 2.5; it took more than 30 min to run (without parallelization), so I think that should be enough. PyPy 2.2 is too old, and we don't have enough resources to support it. On Fri, Nov 6, 2015 at 2:27 AM, Chang Ya-Hsuan wrote: > Hi I

Re: ShuffledHashJoin Possible Issue

2015-10-19 Thread Davies Liu
Can you reproduce it on master? I can't reproduce it with the following code:
>>> t2 = sqlContext.range(50).selectExpr("concat('A', id) as id")
>>> t1 = sqlContext.range(10).selectExpr("concat('A', id) as id")
>>> t1.join(t2).where(t1.id == t2.id).explain()
ShuffledHashJoin [id#21], [id#19],

Re: StructType has more rows, than corresponding Row has objects.

2015-10-05 Thread Davies Liu
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet? On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov wrote: > Hi, > > We're building our own framework on top of spark and we give users pretty > complex schema to work with. That requires from

Re: pyspark streaming DStream compute

2015-09-15 Thread Davies Liu
On Tue, Sep 15, 2015 at 1:46 PM, Renyi Xiong wrote: > Can anybody help understand why pyspark streaming uses py4j callback to > execute python code while pyspark batch uses worker.py? There are two kinds of callbacks in PySpark Streaming: 1) one operates on RDDs; it takes an

Re: Pyspark DataFrame TypeError

2015-09-08 Thread Davies Liu
I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3, they all work as expected:
```
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(1.0, Vectors.dense([1.0])), (0.0, Vectors.sparse(1, [], []))], ["label", "featuers"])
>>> df.show()
+-+-+
|label|

Re: [VOTE] Release Apache Spark 1.5.0 (RC3)

2015-09-03 Thread Davies Liu
+1, built 1.5 from source and ran TPC-DS locally and on clusters, ran performance benchmarks for aggregation and join at different scales; all worked well. On Thu, Sep 3, 2015 at 10:05 AM, Michael Armbrust wrote: > +1 Ran TPC-DS and ported several jobs over to 1.5 > > On

Re: PySpark on PyPi

2015-08-10 Thread Davies Liu
On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu dav...@databricks.com wrote: We could do that after 1.5 released, it will have same release cycle as Spark in the future. On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 (once again :) ) 2015-07-28 14:51

Re: PySpark on PyPi

2015-08-06 Thread Davies Liu
We could do that after 1.5 is released; it will have the same release cycle as Spark in the future. On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 (once again :) ) 2015-07-28 14:51 GMT+02:00 Justin Uang justin.u...@gmail.com: // ping do we have any

Re: PySpark GroupByKey implementation question

2015-07-15 Thread Davies Liu
The map-side combine is not that necessary, given that it cannot reduce the size of the shuffled data much (the key still needs to be serialized with each value), but it can reduce the number of key-value pairs and potentially reduce the number of later operations (repartition and groupBy). On
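A toy sketch of the trade-off being described: map-side combine for groupByKey cannot shrink the bytes shuffled much, but it does shrink the number of key-value records crossing the shuffle.
```
from collections import defaultdict

# Gather values into per-key lists within one partition before the shuffle.
def map_side_combine(partition):
    combined = defaultdict(list)
    for key, value in partition:
        combined[key].append(value)
    return list(combined.items())

print(map_side_combine([("a", 1), ("a", 2), ("b", 3), ("a", 4)]))
# [('a', [1, 2, 4]), ('b', [3])]
```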

Re: pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-12 Thread Davies Liu
05ac023dc8d9004a27c2f06ee875b0ff3743ccdd
Author: Davies Liu dav...@databricks.com
Date: Fri Jul 10 13:05:23 2015 -0700
[HOTFIX] fix flaky test in PySpark SQL

I looked at the test code, and it seems that precision in microseconds is lost somewhere in a round trip from Python to DataFrame. Can
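A minimal reproduction sketch of that round trip (`sqlContext` is assumed to come from a Spark 1.4-era PySpark shell); if microsecond precision is lost somewhere, the comparison prints False:
```
from datetime import datetime
from pyspark.sql import Row

# Timestamp with non-zero microseconds; round-trip it through a DataFrame.
ts = datetime(2015, 7, 10, 13, 5, 23, 123456)
df = sqlContext.createDataFrame([Row(t=ts)])
print(df.first().t == ts)
```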

Re: pyspark.sql.tests: is test_time_with_timezone a flaky test?

2015-07-12 Thread Davies Liu
Will be fixed by https://github.com/apache/spark/pull/7363 On Sun, Jul 12, 2015 at 7:45 PM, Davies Liu dav...@databricks.com wrote: Thanks for reporting this, I'm working on it. It turned out that it's a bug when running with Python 3.4; will send out a fix soon. On Sun, Jul 12, 2015 at 1:33

Re: [PySpark DataFrame] When a Row is not a Row

2015-07-12 Thread Davies Liu
We finally fixed this in 1.5 (the next release), see https://github.com/apache/spark/pull/7301 On Sat, Jul 11, 2015 at 10:32 PM, Jerry Lam chiling...@gmail.com wrote: Hi guys, I just hit the same problem. It is very confusing when Row is not the same Row type at runtime. The worst thing is that

Re: Python UDF performance at large scale

2015-06-25 Thread Davies Liu
of about 100 when I did the tests, because I was worried about deadlocks. Do you have any concerns regarding the batched synchronous version of communication between the Java and Python processes, and if not, should I file a ticket and starting writing it? On Wed, Jun 24, 2015 at 7:27 PM Davies Liu

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
of the synchronous blocking solution. On Tue, Jun 23, 2015 at 7:21 PM Davies Liu dav...@databricks.com wrote: Thanks for looking into it, I'd like the idea of having ForkingIterator. If we have unlimited buffer in it, then will not have the problem of deadlock, I think. The writing thread

Re: Python UDF performance at large scale

2015-06-24 Thread Davies Liu
.) Punya On Wed, Jun 24, 2015 at 2:26 AM Davies Liu dav...@databricks.com wrote: Fare points, I also like simpler solutions. The overhead of Python task could be a few of milliseconds, which means we also should eval them as batches (one Python task per batch). Decreasing the batch size for UDF

Re: Python UDF performance at large scale

2015-06-23 Thread Davies Liu
Thanks for looking into it; I like the idea of having a ForkingIterator. If we have an unlimited buffer in it, then we will not have the deadlock problem, I think. The writing thread will be blocked by the Python process, so there will not be many rows buffered (it could still be a reason to OOM). At least,
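A toy sketch, not Spark's actual internals, of the unlimited-buffer idea: a writer thread drains an unbounded queue into the Python process, so the producer never blocks on the consumer, at the cost of possible memory growth.
```
import threading
import queue

buffer = queue.Queue()            # unbounded buffer between producer and writer

def writer_thread(pipe):
    # Drain the buffer into the (hypothetical) pipe to the Python worker.
    while True:
        row = buffer.get()
        if row is None:           # sentinel: no more rows
            break
        pipe.write(row)

# Producer side (sketch): never blocks, however slowly `pipe` is consumed.
# threading.Thread(target=writer_thread, args=(some_pipe,)).start()
# for row in rows: buffer.put(row)
# buffer.put(None)
```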

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote: There is a module called 'types' in python 3: davies@localhost:~/work/spark$ python3 Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Davies Liu
...@gmail.com wrote: Thanks for clarifying! I don't understand python package and modules names that well, but I thought that the package namespacing would've helped, since you are in pyspark.sql.types. I guess not? On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote

Re: Tungsten's Vectorized Execution

2015-05-21 Thread Davies Liu
We have not started to prototype the vectorized one yet; it will be evaluated in 1.5 and may be targeted for 1.6. We're glad to hear feedback/suggestions/comments from your side! On Thu, May 21, 2015 at 9:37 AM, Yijie Shen henry.yijies...@gmail.com wrote: Hi all, I’ve seen the Blog of Project

Re: [SparkR] is toDF() necessary

2015-05-17 Thread Davies Liu
toDF() was first introduced in Scala and Python (because createDataFrame is too long) and is used in lots of places; I think it's useful. On Fri, May 8, 2015 at 11:03 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: Agree that toDF is not very useful. In fact it was removed from the
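A brief PySpark illustration of that convenience (`sc` and `sqlContext` assumed from a shell session): toDF is a shorter path from an RDD of tuples to a DataFrame than spelling out createDataFrame.
```
rdd = sc.parallelize([(1, "a"), (2, "b")])
df1 = rdd.toDF(["id", "value"])
df2 = sqlContext.createDataFrame(rdd, ["id", "value"])  # equivalent, but longer
```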

Re: [PySpark DataFrame] When a Row is not a Row

2015-05-12 Thread Davies Liu
(called `Row`). -- Davies Liu Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Tuesday, May 12, 2015 at 4:49 AM, Nicholas Chammas wrote: This is really strange. # Spark 1.3.1 print type(results) class

Re: Query regarding infering data types in pyspark

2015-04-15 Thread Davies Liu
that compares the dates. The query I am using is : df.filter(df.Datecol datetime.date(2015,1,1)).show() I do not want to use date as a string to compare them. Please suggest. On Tue, Apr 14, 2015 at 4:59 AM, Davies Liu dav...@databricks.com wrote: Hey Suraj, You should use date
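A hedged sketch of the suggested approach: compare a DateType column against a Python datetime.date directly in the filter. The column name "Datecol" comes from the question; the `>` operator is illustrative.
```
import datetime

# Filter on a date column without converting dates to strings.
df.filter(df.Datecol > datetime.date(2015, 1, 1)).show()
```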

Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
The CallbackServer is part of Py4J; it's only used in the driver, not in slaves or workers. On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao todd.gao.2013+sp...@gmail.com wrote: Hi all, I am reading the code of PySpark and its Streaming module. In PySpark Streaming, when the `compute` method of

Re: How to speed PySpark to match Scala/Java performance

2015-01-29 Thread Davies Liu
Hey, Without having Python as fast as Scala/Java, I think it's impossible to get similar performance in PySpark as in Scala/Java. Jython is also much slower than Scala/Java. With Jython, we can avoid the cost of managing multiple processes and RPC; we may still need to do the data conversion between

Re: Python to Java object conversion of numpy array

2015-01-13 Thread Davies Liu
On Monday 12 January 2015 11:35 PM, Davies Liu wrote: On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi, This is the code I am running. mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799])) membershipMatrix = callMLlibFunc(findPredict

Re: Use of MapConverter, ListConverter in python to java object conversion

2015-01-13 Thread Davies Liu
It's not necessary; I will create a PR to remove them. For larger dicts/lists/tuples, the pickle approach may need fewer RPC calls and give better performance. Davies On Tue, Jan 13, 2015 at 4:53 AM, Meethu Mathew meethu.mat...@flytxt.com wrote: Hi all, In the python object to java conversion done in
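A rough illustration of that point: pickling a whole collection produces one payload for one round trip, whereas converter-based conversion implies roughly one call per element. The size is illustrative only.
```
import pickle

big_list = list(range(100000))
payload = pickle.dumps(big_list, protocol=2)   # one blob -> one round trip
print(len(payload))
# versus converting element by element: one call per element (pseudo-code)
```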

Re: Python to Java object conversion of numpy array

2015-01-12 Thread Davies Liu
looks like? all the arguments of findPredict should be converted into java objects, so what should `mu` be converted to? Regards, Meethu On Monday 12 January 2015 11:46 AM, Davies Liu wrote: Could you post a piece of code here? On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat

Re: Python to Java object conversion of numpy array

2015-01-11 Thread Davies Liu
])) exception is like 'numpy.ndarray' object has no attribute '_get_object_id' Regards, Meethu Mathew Engineer Flytxt www.flytxt.com | Visit our blog | Follow us | Connect on Linkedin On Friday 09 January 2015 11:37 PM, Davies Liu wrote: Hey Meethu, The Java API accepts only Vector

Re: Python to Java object conversion of numpy array

2015-01-09 Thread Davies Liu
Hey Meethu, The Java API accepts only Vector, so you should convert the numpy array into a pyspark.mllib.linalg.DenseVector. BTW, which class are you using? KMeansModel.predict() accepts numpy.array; it will do the conversion for you. Davies On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew
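A small sketch of the suggested conversion: wrap the numpy array in a pyspark.mllib.linalg DenseVector before handing it to the Java API.
```
import numpy as np
from pyspark.mllib.linalg import Vectors

arr = np.array([0.8786, -0.7855])
vec = Vectors.dense(arr)   # DenseVector([0.8786, -0.7855])
```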

Re: Adding third party jars to classpath used by pyspark

2014-12-30 Thread Davies Liu
On Mon, Dec 29, 2014 at 7:39 PM, Jeremy Freeman freeman.jer...@gmail.com wrote: Hi Stephen, it should be enough to include --jars /path/to/file.jar in the command line call to either pyspark or spark-submit, as in spark-submit --master local --jars /path/to/file.jar myfile.py

Re: Help, pyspark.sql.List flatMap results become tuple

2014-12-30 Thread Davies Liu
This should be fixed in 1.2, could you try it? On Mon, Dec 29, 2014 at 8:04 PM, guoxu1231 guoxu1...@gmail.com wrote: Hi pyspark guys, I have a json file, and its struct like below: {NAME:George, AGE:35, ADD_ID:1212, POSTAL_AREA:1, TIME_ZONE_ID:1, INTEREST:[{INTEREST_NO:1, INFO:x},

Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Davies Liu
-1 (not binding, +1 for maintainer, -1 for sign off) Agree with Greg and Vinod. In the beginning, everything is better (more efficient, more focused), but after some time, fighting begins. Code style is the hottest topic to fight over (we already saw it in some PRs). If two committers (one of them is

Re: [VOTE] Designating maintainers for some Spark components

2014-11-07 Thread Davies Liu
revert my vote to +1. Sorry for this. Davies On Fri, Nov 7, 2014 at 3:18 PM, Davies Liu dav...@databricks.com wrote: -1 (not binding, +1 for maintainer, -1 for sign off) Agree with Greg and Vinod. In the beginning, everything is better (more efficient, more focus), but after some time, fighting

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-18 Thread Davies Liu
be around to revert if it doesn’t). On October 17, 2014 at 5:26:56 PM, Davies Liu (dav...@databricks.com) wrote: One finding is that all the timeout happened with this command: git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/*:refs/remotes/origin/pr/* I'm thinking

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-18 Thread Davies Liu
that’s still failing with the old fetch failure? - Josh On October 17, 2014 at 11:03:14 PM, Davies Liu (dav...@databricks.com) wrote: How can we know the changes have been applied? I checked several recent builds; they all use the original configs. Davies On Fri, Oct 17, 2014 at 6:17 PM

Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

2014-10-17 Thread Davies Liu
One finding is that all the timeouts happened with this command:

git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/*:refs/remotes/origin/pr/*

I'm thinking that this may be an expensive call; we could try to use a cheaper one: git fetch --tags --progress

Re: TorrentBroadcast slow performance

2014-10-07 Thread Davies Liu
Could you create a JIRA for it? Maybe it's a regression after https://issues.apache.org/jira/browse/SPARK-3119. We would appreciate it if you could tell us how to reproduce it. On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel guillaume.pi...@exensa.com wrote: Hi, I've had no answer to this on

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster is Spark v. PySpark (I didn't really mean Scala v. Python)? I've only used Spark

Re: A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms

2014-08-13 Thread Davies Liu
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu dav...@databricks.com wrote: On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas ignacio.zendejas...@gmail.com wrote: Yep, I thought it was a bogus comparison. I should rephrase my question as it was poorly phrased: on average, how much faster

Re: better compression codecs for shuffle blocks?

2014-07-14 Thread Davies Liu
Maybe we could try LZ4 [1], which has better performance and a smaller footprint than LZF and Snappy. In fast scan mode, its performance is 1.5 - 2x higher than LZF [2], but the memory used is 10x smaller than LZF (16k vs 190k). [1] https://github.com/jpountz/lz4-java [2]