Using 0 for spark.mesos.mesosExecutor.cores is better than dynamic
allocation, but you have to pay a little more overhead for launching a
task, which should be OK if the task is not trivial.
Since the direct result (up to 1 MB by default) will also go through
Mesos, it's better to tune it lower.
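For example, a minimal sketch of setting both options (assuming
spark.task.maxDirectResultSize is the knob that controls the direct-result
threshold mentioned above; not part of the original message):
```python
# Sketch only.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("mesos-fine-grained-tuning")
        # 0 cores per Mesos executor: resources are acquired per task instead
        .set("spark.mesos.mesosExecutor.cores", "0")
        # lower the direct-result threshold so larger task results go through
        # the block manager instead of the Mesos channel
        .set("spark.task.maxDirectResultSize", str(256 * 1024)))

sc = SparkContext(conf=conf)
```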
+1
On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote:
> Please vote on releasing the following candidate as Apache Spark version
> 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a
> majority of at least 3+1 PMC votes are cast.
>
> [ ] +1 Release
+1 for Matei's point.
On Thu, Oct 27, 2016 at 8:36 AM, Matei Zaharia wrote:
> Just to comment on this, I'm generally against removing these types of
> things unless they create a substantial burden on project contributors. It
> doesn't sound like Python 2.6 and Java 7 do
+1
On Thu, Oct 27, 2016 at 12:18 AM, Reynold Xin wrote:
> Greetings from Spark Summit Europe at Brussels.
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if a
> majority of
+1 (non-binding)
On Mon, Sep 26, 2016 at 9:36 AM, Joseph Bradley wrote:
> +1
>
> On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee wrote:
>>
>> +1 (non-binding)
>> On Sun, Sep 25, 2016 at 23:20 Jeff Zhang wrote:
>>>
>>> +1
>>>
>>> On
@Justin, it's fixed by https://github.com/apache/spark/pull/12057
On Thu, Feb 11, 2016 at 11:26 AM, Davies Liu <dav...@databricks.com> wrote:
> I had a quick look at your commit; I think that makes sense. Could you
> send a PR for that, then we can review it.
>
> In order to s
do you intend to say that the underlying state might
> change because of some state update APIs?
> Or is it due to some other rationale?
>
> Regards,
> Rishitesh Mishra,
> SnappyData . (http://www.snappydata.io/)
>
> https://in.linkedin.com/in/rishiteshmishra
>
> On Thu, Mar 3, 20
; about dataset size on disk vs. memory.
>
> -Matt Cheah
>
> On 3/2/16, 10:15 AM, "Davies Liu" <dav...@databricks.com> wrote:
>
>>UnsafeHashedRelation and HashedRelation could also be used in Executor
>>(for non-broadcast hash join), then the UnsafeRow could co
UnsafeHashedRelation and HashedRelation could also be used in Executor
(for non-broadcast hash join), then the UnsafeRow could come from
UnsafeProjection,
so we should copy the rows for safety.
We could have a smarter copy() for UnsafeRow (avoid the copy if it's
already copied),
but I don't think
I had a quick look at your commit; I think that makes sense. Could you
send a PR for that, then we can review it.
In order to support 2), we need to change the serialized Python
function from `f(iter)` to `f(x)`, processing one row at a time (not a
partition);
then we can easily combine them together:
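A rough, hypothetical illustration in plain Python (not the actual pyspark
serializer code) of why per-row functions compose more easily than
per-partition ones:
```python
# Hypothetical illustration only.

# per-partition style: f(iter) -> iter
def f_iter(it):
    return (x + 1 for x in it)

# per-row style: f(x) -> y
def f_row(x):
    return x + 1

def g_row(x):
    return x * 2

# chaining per-row functions is plain function composition...
def combined(x):
    return g_row(f_row(x))

# ...which can then be applied to a whole partition in a single pass:
def combined_iter(it):
    return (combined(x) for x in it)

print(list(combined_iter(range(3))))  # [2, 4, 6]
```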
+1
On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas
wrote:
> +1
>
> Red Hat supports Python 2.6 on RHEL 5 until 2020, but otherwise yes, Python
> 2.6 is ancient history and the core Python developers stopped supporting it
> in 2013. RHEL 5 is not a good enough reason
Created JIRA: https://issues.apache.org/jira/browse/SPARK-12661
On Tue, Jan 5, 2016 at 2:49 PM, Koert Kuipers wrote:
> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us
Is this https://github.com/apache/spark/pull/10134 a valid fix?
(still worse than 1.5)
On Thu, Dec 3, 2015 at 8:45 AM, mkhaitman wrote:
> I reported this in the 1.6 preview thread, but wouldn't mind if someone can
> confirm that ctrl-c is not keyboard interrupting /
We already test CPython 2.6, CPython 3.4 and PyPy 2.5; it took more
than 30 min to run (without parallelization),
and I think that should be enough.
PyPy 2.2 is so old that we don't have enough resources to support it.
On Fri, Nov 6, 2015 at 2:27 AM, Chang Ya-Hsuan wrote:
> Hi I
Can you reproduce it on master?
I can't reproduce it with the following code:
>>> t2 = sqlContext.range(50).selectExpr("concat('A', id) as id")
>>> t1 = sqlContext.range(10).selectExpr("concat('A', id) as id")
>>> t1.join(t2).where(t1.id == t2.id).explain()
ShuffledHashJoin [id#21], [id#19],
Could you tell us a way to reproduce this failure? Reading from JSON or Parquet?
On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov
wrote:
> Hi,
>
> We're building our own framework on top of spark and we give users pretty
> complex schema to work with. That requires from
On Tue, Sep 15, 2015 at 1:46 PM, Renyi Xiong wrote:
> Can anybody help understand why pyspark streaming uses py4j callback to
> execute python code while pyspark batch uses worker.py?
There are two kinds of callbacks in pyspark streaming:
1) one operates on RDDs; it takes an
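For reference, a small sketch (assuming the usual pyspark.streaming API; not
from the original message) of the RDD-level callback style, where a Python
function is invoked from the JVM through the Py4j callback server:
```python
# Sketch only; function and host names are placeholders.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "callback-example")
ssc = StreamingContext(sc, batchDuration=1)

lines = ssc.socketTextStream("localhost", 9999)

def handle_batch(time, rdd):
    # invoked from the JVM through the Py4j callback server, once per batch
    print("batch at %s had %d records" % (time, rdd.count()))

lines.foreachRDD(handle_batch)
# ssc.start(); ssc.awaitTermination()  # needs a socket source on port 9999
```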
I tried with Python 2.7/3.4 and Spark 1.4.1/1.5-RC3, they all work as expected:
```
>>> from pyspark.mllib.linalg import Vectors
>>> df = sqlContext.createDataFrame([(1.0, Vectors.dense([1.0])), (0.0,
>>> Vectors.sparse(1, [], []))], ["label", "featuers"])
>>> df.show()
+-+-+
|label|
+1, built 1.5 from source and ran TPC-DS locally and on clusters, ran
performance benchmarks for aggregation and join at different scales;
all worked well.
On Thu, Sep 3, 2015 at 10:05 AM, Michael Armbrust
wrote:
> +1 Ran TPC-DS and ported several jobs over to 1.5
>
> On
On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu dav...@databricks.com wrote:
We could do that after 1.5 is released; it will have the same release cycle
as Spark in the future.
On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
+1 (once again :) )
2015-07-28 14:51
We could do that after 1.5 is released; it will have the same release cycle
as Spark in the future.
On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
o.girar...@lateral-thoughts.com wrote:
+1 (once again :) )
2015-07-28 14:51 GMT+02:00 Justin Uang justin.u...@gmail.com:
// ping
do we have any
If the map-side combine is not that necessary, given the fact that it cannot
reduce the size of the data for shuffling much (we do need to serialize the key for
each value), but it can reduce the number of key-value pairs and potentially reduce
the number of operations later (repartition and groupBy).
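For illustration, a hedged sketch of that trade-off (assuming an existing
SparkContext `sc`): reduceByKey does a map-side combine before shuffling,
while groupByKey ships every key-value pair:
```python
# Sketch only.
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("a", 4)])

# map-side combine: each mapper pre-aggregates, so fewer pairs cross the shuffle
summed = pairs.reduceByKey(lambda x, y: x + y)

# no map-side combine: every key-value pair is shuffled
grouped = pairs.groupByKey().mapValues(sum)

print(sorted(summed.collect()))   # [('a', 7), ('b', 3)]
print(sorted(grouped.collect()))  # [('a', 7), ('b', 3)]
```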
On
05ac023dc8d9004a27c2f06ee875b0ff3743ccdd
Author: Davies Liu dav...@databricks.com
Date: Fri Jul 10 13:05:23 2015 -0700
[HOTFIX] fix flaky test in PySpark SQL
I looked at the test code, and it seems that precision in microseconds is
lost somewhere in a round trip from Python to DataFrame. Can
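One quick way to check whether microseconds survive the round trip (a sketch
assuming an existing sqlContext; not from the original test):
```python
# Sketch only.
from datetime import datetime

dt = datetime(2015, 7, 10, 13, 5, 23, 123456)  # 123456 microseconds
df = sqlContext.createDataFrame([(dt,)], ["ts"])
roundtripped = df.collect()[0].ts
print(dt.microsecond, roundtripped.microsecond)  # both 123456 if nothing is lost
```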
Will be fixed by https://github.com/apache/spark/pull/7363
On Sun, Jul 12, 2015 at 7:45 PM, Davies Liu dav...@databricks.com wrote:
Thanks for reporting this, I'm working on it. It turned out to be
a bug when running with Python 3.4; I will send out a fix soon.
On Sun, Jul 12, 2015 at 1:33
We finally fix this in 1.5 (next release), see
https://github.com/apache/spark/pull/7301
On Sat, Jul 11, 2015 at 10:32 PM, Jerry Lam chiling...@gmail.com wrote:
Hi guys,
I just hit the same problem. It is very confusing when Row is not the same
Row type at runtime. The worst thing is that
of about 100 when I did the tests,
because I was worried about deadlocks. Do you have any concerns regarding
the batched synchronous version of communication between the Java and Python
processes, and if not, should I file a ticket and start writing it?
On Wed, Jun 24, 2015 at 7:27 PM Davies Liu
of the synchronous
blocking solution.
On Tue, Jun 23, 2015 at 7:21 PM Davies Liu dav...@databricks.com wrote:
Thanks for looking into it, I like the idea of having a
ForkingIterator. If we have an unlimited buffer in it, then we will not have
the problem of deadlock, I think. The writing thread
.)
Punya
On Wed, Jun 24, 2015 at 2:26 AM Davies Liu dav...@databricks.com wrote:
Fair points, I also like simpler solutions.
The overhead of a Python task could be a few milliseconds, which
means we should also evaluate them in batches (one Python task per batch).
Decreasing the batch size for UDF
Thanks for looking into it, I like the idea of having a
ForkingIterator. If we have an unlimited buffer in it, then we will not have
the problem of deadlock, I think. The writing thread will be blocked
by the Python process, so there will not be many rows buffered (it could still be
a reason for OOM). At least,
namespacing would've helped, since
you are in pyspark.sql.types. I guess not?
On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote:
There is a module called 'types' in python 3:
davies@localhost:~/work/spark$ python3
Python 3.4.1 (v3.4.1:c0e311e010fc, May 18 2014, 00:54:21
...@gmail.com wrote:
Thanks for clarifying! I don't understand python package and modules names
that well, but I thought that the package namespacing would've helped, since
you are in pyspark.sql.types. I guess not?
On Tue, May 26, 2015 at 3:03 PM Davies Liu dav...@databricks.com wrote
We have not started to prototype the vectorized one yet; it will be evaluated
in 1.5 and may be targeted for 1.6.
We'd be glad to hear feedback/suggestions/comments from your side!
On Thu, May 21, 2015 at 9:37 AM, Yijie Shen henry.yijies...@gmail.com wrote:
Hi all,
I’ve seen the Blog of Project
toDF() was first introduced in Scala and Python (because
createDataFrame is too long) and is used in lots of places; I think it's
useful.
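For example (a sketch assuming an active sc and sqlContext, which makes
RDD.toDF available):
```python
# Sketch only.
rdd = sc.parallelize([(1, "a"), (2, "b")])
df1 = rdd.toDF(["id", "value"])                          # short form
df2 = sqlContext.createDataFrame(rdd, ["id", "value"])   # equivalent long form
```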
On Fri, May 8, 2015 at 11:03 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
Agree that toDF is not very useful. In fact it was removed from the
(called
`Row`).
--
Davies Liu
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
On Tuesday, May 12, 2015, at 4:49 AM, Nicholas Chammas wrote:
This is really strange.
# Spark 1.3.1
print type(results)
class
that compares the dates.
The query I am using is :
df.filter(df.Datecol datetime.date(2015,1,1)).show()
I do not want to use date as a string to compare them. Please suggest.
On Tue, Apr 14, 2015 at 4:59 AM, Davies Liu dav...@databricks.com wrote:
Hey Suraj,
You should use date
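A minimal sketch of comparing the date column against a Python date directly
(the column name Datecol comes from the question above; the `>` operator is an
assumption, since the original comparison operator was lost):
```python
# Sketch only.
import datetime

df.filter(df.Datecol > datetime.date(2015, 1, 1)).show()
```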
The CallbackServer is part of Py4j; it's only used in the driver, not
in slaves or workers.
On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao
todd.gao.2013+sp...@gmail.com wrote:
Hi all,
I am reading the code of PySpark and its Streaming module.
In PySpark Streaming, when the `compute` method of
Hey,
Without having Python as fast as Scala/Java, I think it's impossible to get similar
performance in PySpark as in Scala/Java. Jython is also much slower than
Scala/Java.
With Jython, we can avoid the cost of managing multiple processes and RPC, but
we may still need to do the data conversion between
On Monday 12 January 2015 11:35 PM, Davies Liu wrote:
On Sun, Jan 11, 2015 at 10:21 PM, Meethu Mathew
meethu.mat...@flytxt.com wrote:
Hi,
This is the code I am running.
mu = (Vectors.dense([0.8786, -0.7855]),Vectors.dense([-0.1863, 0.7799]))
membershipMatrix = callMLlibFunc(findPredict
It's not necessary, I will create a PR to remove them.
For larger dict/list/tuple, the pickle approach may have fewer RPC
calls and better performance.
Davies
On Tue, Jan 13, 2015 at 4:53 AM, Meethu Mathew meethu.mat...@flytxt.com wrote:
Hi all,
In the python object to java conversion done in
looks like? all the arguments of findPredict
should be converted
into java objects, so what should `mu` be converted to?
Regards,
Meethu
On Monday 12 January 2015 11:46 AM, Davies Liu wrote:
Could you post a piece of code here?
On Sun, Jan 11, 2015 at 9:28 PM, Meethu Mathew meethu.mat
])) exception is like
'numpy.ndarray' object has no attribute '_get_object_id'
Regards,
Meethu Mathew
Engineer
Flytxt
www.flytxt.com | Visit our blog | Follow us | Connect on Linkedin
On Friday 09 January 2015 11:37 PM, Davies Liu wrote:
Hey Meethu,
The Java API accepts only Vector
Hey Meethu,
The Java API accepts only Vector, so you should convert the numpy array into
pyspark.mllib.linalg.DenseVector.
BTW, which class are you using? KMeansModel.predict() accepts numpy.array;
it will do the conversion for you.
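A small sketch of that conversion (not from the original message):
```python
# Sketch only.
import numpy as np
from pyspark.mllib.linalg import DenseVector

arr = np.array([0.8786, -0.7855])
vec = DenseVector(arr)  # wrap the numpy array before passing it to the Java API
```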
Davies
On Fri, Jan 9, 2015 at 4:45 AM, Meethu Mathew
On Mon, Dec 29, 2014 at 7:39 PM, Jeremy Freeman
freeman.jer...@gmail.com wrote:
Hi Stephen, it should be enough to include
--jars /path/to/file.jar
in the command line call to either pyspark or spark-submit, as in
spark-submit --master local --jars /path/to/file.jar myfile.py
This should be fixed in 1.2, could you try it?
On Mon, Dec 29, 2014 at 8:04 PM, guoxu1231 guoxu1...@gmail.com wrote:
Hi pyspark guys,
I have a json file, and its struct like below:
{NAME:George, AGE:35, ADD_ID:1212, POSTAL_AREA:1,
TIME_ZONE_ID:1, INTEREST:[{INTEREST_NO:1, INFO:x},
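A hedged sketch (using the DataFrame-style API from later releases, with a
placeholder path; field names come from the question) of loading nested JSON
like this and reaching into the INTEREST array:
```python
# Sketch only.
df = sqlContext.read.json("/path/to/file.json")
df.printSchema()
df.select("NAME", "INTEREST.INFO").show()
```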
-1 (not binding, +1 for maintainer, -1 for sign off)
Agree with Greg and Vinod. In the beginning, everything is better
(more efficient, more focused), but after some time, fighting begins.
Code style is the hottest topic to fight over (we already saw it in some
PRs). If two committers (one of them is
revert my vote to +1. Sorry for this.
Davies
On Fri, Nov 7, 2014 at 3:18 PM, Davies Liu dav...@databricks.com wrote:
-1 (not binding, +1 for maintainer, -1 for sign off)
Agree with Greg and Vinod. In the beginning, everything is better
(more efficient, more focused), but after some time, fighting
be around to revert if it doesn’t).
On October 17, 2014 at 5:26:56 PM, Davies Liu (dav...@databricks.com) wrote:
One finding is that all the timeout happened with this command:
git fetch --tags --progress https://github.com/apache/spark.git
+refs/pull/*:refs/remotes/origin/pr/*
I'm thinking
that’s still failing with the
old fetch failure?
- Josh
On October 17, 2014 at 11:03:14 PM, Davies Liu (dav...@databricks.com)
wrote:
How can we know the changes have been applied? I checked several
recent builds; they all use the original configs.
Davies
On Fri, Oct 17, 2014 at 6:17 PM
One finding is that all the timeout happened with this command:
git fetch --tags --progress https://github.com/apache/spark.git
+refs/pull/*:refs/remotes/origin/pr/*
I'm thinking that maybe this may be an expensive call; we could try to
use a cheaper one:
git fetch --tags --progress
Could you create a JIRA for it? Maybe it's a regression after
https://issues.apache.org/jira/browse/SPARK-3119.
We would appreciate it if you could tell us how to reproduce it.
On Mon, Oct 6, 2014 at 1:27 AM, Guillaume Pitel
guillaume.pi...@exensa.com wrote:
Hi,
I've had no answer to this on
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
ignacio.zendejas...@gmail.com wrote:
Yep, I thought it was a bogus comparison.
I should rephrase my question as it was poorly phrased: on average, how
much faster is Spark v. PySpark (I didn't really mean Scala v. Python)?
I've only used Spark
On Wed, Aug 13, 2014 at 2:31 PM, Davies Liu dav...@databricks.com wrote:
On Wed, Aug 13, 2014 at 2:16 PM, Ignacio Zendejas
ignacio.zendejas...@gmail.com wrote:
Yep, I thought it was a bogus comparison.
I should rephrase my question as it was poorly phrased: on average, how
much faster
Maybe we could try LZ4 [1], which has better performance and a smaller footprint
than LZF and Snappy. In fast scan mode, the performance is 1.5 - 2x
higher than LZF [2],
and the memory used is 10x smaller than LZF (16 KB vs 190 KB).
[1] https://github.com/jpountz/lz4-java
[2]
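For reference, a minimal sketch of switching the codec (assuming a release
where "lz4" is a supported value for spark.io.compression.codec):
```python
# Sketch only.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.io.compression.codec", "lz4")
sc = SparkContext(conf=conf)
```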