Purpose of broadcast timeout

2019-01-30 Thread Justin Uang
Hi all, We have noticed a lot of broadcast timeouts on our pipelines, and from some inspection, it seems that they happen when I have two threads trying to save two different DataFrames. We use the FIFO scheduler, so if I launch a job that needs all the executors, the second DataFrame's collect
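
A minimal sketch of the pattern described above, assuming a SparkSession named spark and two already-built DataFrames df_a and df_b (both hypothetical): each thread submits its own save, and under FIFO scheduling the second job's broadcast can sit idle long enough to hit spark.sql.broadcastTimeout (300s by default), which can be raised as a stopgap.

    import threading

    # Stopgap: raise the 300s default while the real scheduling issue is investigated.
    spark.conf.set("spark.sql.broadcastTimeout", "1200")

    def save(df, path):
        df.write.parquet(path)

    t1 = threading.Thread(target=save, args=(df_a, "/tmp/out_a"))
    t2 = threading.Thread(target=save, args=(df_b, "/tmp/out_b"))
    t1.start(); t2.start()
    t1.join(); t2.join()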

Re: Using UDFs in Java without registration

2017-07-26 Thread Justin Uang
> We added all the typetags for arguments but haven't got around to use them > yet. I think it'd make sense to have them and do the auto cast, but we can > have rules in analysis to forbid certain casts (e.g. don't auto cast double > to int). > > > On Sat, May 30, 2015 at 7:1

Making BatchPythonEvaluation actually Batch

2016-01-31 Thread Justin Uang
Hey guys, BLUF: sorry for the length of this email; trying to figure out how to batch Python UDF executions, and since this is my first time messing with Catalyst, I would like any feedback. My team is starting to use PySpark UDFs quite heavily, and performance is a huge blocker. The extra

Spark SQL: Avoid shuffles when data is already partitioned on disk

2016-01-21 Thread Justin Uang
Hi, If I had a df and I wrote it out via partitionBy("id"), presumably, when I load in the df and do a groupBy("id"), a shuffle shouldn't be necessary, right? Effectively, we can load in the DataFrame with a hash partitioner already set, since each task can simply read all the folders where id=
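
A short sketch of the setup being asked about, assuming a SparkSession named spark and a hypothetical output path: the data is laid out on disk by id, but on read the groupBy still plans a shuffle today because the on-disk layout is not surfaced as a hash partitioning.

    df.write.partitionBy("id").parquet("/tmp/by_id")

    loaded = spark.read.parquet("/tmp/by_id")
    loaded.groupBy("id").count().explain()  # currently still shows an Exchange (shuffle)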

Returning numpy types from udfs

2015-12-05 Thread Justin Uang
Hi, I have fallen into the trap of returning numpy types from udfs, such as np.float64 and np.int. It's hard to find the issue because they behave pretty much like regular pure Python floats and doubles, so can we make PySpark automatically translate them? If so, I'll create a Jira ticket. Justin
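
A small sketch of the workaround until such a translation exists, assuming a hypothetical DataFrame df with a double column value: cast numpy scalars back to builtin Python types before returning them from the UDF.

    import numpy as np
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # Returning np.float64 directly can fail or misbehave during pickling/conversion;
    # wrapping it in float() sidesteps the problem.
    times_pi = udf(lambda x: float(np.float64(x) * np.pi), DoubleType())
    df.select(times_pi(df["value"]))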

Re: Returning numpy types from udfs

2015-12-05 Thread Justin Uang
Filed here: https://issues.apache.org/jira/browse/SPARK-12157 On Sat, Dec 5, 2015 at 3:08 PM Reynold Xin <r...@databricks.com> wrote: > Not aware of any jira ticket, but it does sound like a great idea. > > > On Sat, Dec 5, 2015 at 11:03 PM, Justin Uang <justin.u...@gmail.

Subtract implementation using broadcast

2015-11-27 Thread Justin Uang
Hi, I have seen massive gains with the broadcast hint for joins with DataFrames, and I was wondering if we have thought about allowing the broadcast hint for the implementation of subtract and intersect. Right now, when I try it, it says that there is no plan for the broadcast hint. Justin
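
For reference, a sketch of the hint that already works for joins in newer PySpark releases (pyspark.sql.functions.broadcast; the original discussion used the Scala equivalent), with placeholder DataFrames large_df and small_df. The thread asks for the same hint to be honored by subtract and intersect, where it currently produces no plan.

    from pyspark.sql.functions import broadcast

    joined = large_df.join(broadcast(small_df), "id")   # broadcast hint honored for joins
    # large_df.subtract(broadcast(small_df))            # the case the thread asks about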

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Justin Uang
nd 2, and also substantially > simplify the internals. > > > > > On Tue, Nov 3, 2015 at 7:59 AM, Justin Uang <justin.u...@gmail.com> wrote: > >> Yup, but I'm wondering what happens when an executor does get removed, >> but when we're using tachyon. Will the cached dat

Re: Off-heap storage and dynamic allocation

2015-11-03 Thread Justin Uang
can be idle for long periods of time while holding onto cached rdds. On Tue, Nov 3, 2015 at 10:15 PM Reynold Xin <r...@databricks.com> wrote: > It is lost unfortunately (although can be recomputed automatically). > > > On Tue, Nov 3, 2015 at 1:13 PM, Justin Uang <justin.

Off-heap storage and dynamic allocation

2015-10-30 Thread Justin Uang
Hey guys, According to the docs for 1.5.1, when an executor is removed by dynamic allocation, the cached data is gone. If I use off-heap storage like Tachyon, conceptually this issue goes away, but is the cached data still available in practice? This would be great because then we
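
A minimal sketch of the setup in question, assuming an existing SparkContext sc and a hypothetical input path: data is persisted off-heap (backed by Tachyon in the 1.5.x era), and the question is whether those blocks survive when dynamic allocation later removes the executor.

    from pyspark import StorageLevel

    rdd = sc.textFile("/data/events")
    rdd.persist(StorageLevel.OFF_HEAP)
    rdd.count()   # materialize the off-heap blocks
    # If dynamic allocation now removes an idle executor, are these blocks still
    # readable, or do they have to be recomputed?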

Re: Python UDAFs

2015-10-02 Thread Justin Uang
Cool, filed here: https://issues.apache.org/jira/browse/SPARK-10915 On Fri, Oct 2, 2015 at 3:21 PM Reynold Xin <r...@databricks.com> wrote: > No, not yet. > > > On Fri, Oct 2, 2015 at 12:20 PM, Justin Uang <justin.u...@gmail.com> > wrote: > >> Hi, &

Python UDAFs

2015-10-02 Thread Justin Uang
Hi, Is there a Python API for UDAFs? Thanks! Justin

Fast Iteration while developing

2015-09-07 Thread Justin Uang
Hi, What is the normal workflow for the core devs? - Do we need to build the assembly jar to be able to run it from the Spark repo? - Do you use sbt or maven to do the build? - Is zinc only usable for maven? I'm asking because my current process is to do an sbt build, which

Re: PySpark on PyPi

2015-08-20 Thread Justin Uang
, 2015 at 5:52 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote: +1 (once again :) ) 2015-07-28 14:51 GMT+02:00 Justin Uang justin.u...@gmail.com : // ping do we have any signoff from the pyspark devs to submit a PR to publish to PyPI

Re: PySpark on PyPi

2015-08-20 Thread Justin Uang
One other question: Do we have consensus on publishing the pip-installable source distribution to PyPI? If so, is that something that the maintainers need to add to the process that they use to publish releases? On Thu, Aug 20, 2015 at 5:44 PM Justin Uang justin.u...@gmail.com wrote: I would

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-31 Thread Justin Uang
of the problem as a comment to that ticket and we'll make sure to test both cases and break it out if the root cause ends up being different. On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang justin.u...@gmail.com wrote: Sweet! Does this cover DataFrame#rdd also using the cached query from DataFrame

Re: PySpark on PyPi

2015-07-28 Thread Justin Uang
, 2015 at 12:49 PM Justin Uang justin.u...@gmail.com wrote: // + *Davies* for his comments // + Punya for SA For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only code in the python/ dir) on PyPI. If anyone wants to develop against

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Justin Uang
, Jul 28, 2015 at 2:36 AM, Justin Uang justin.u...@gmail.com wrote: Hey guys, I'm running into some pretty bad performance issues when it comes to using a CrossValidator, because of caching behavior of DataFrames. The root of the problem is that while I have cached my DataFrame representing

Re: PySpark on PyPi

2015-07-22 Thread Justin Uang
// + *Davies* for his comments // + Punya for SA For development and CI, like Olivier mentioned, I think it would be hugely beneficial to publish pyspark (only code in the python/ dir) on PyPI. If anyone wants to develop against PySpark APIs, they need to download the distribution and do a lot of

Re: how can I write a language wrapper?

2015-06-29 Thread Justin Uang
My guess is that if you are just wrapping the Spark SQL APIs, you can get away with not having to reimplement a lot of the complexities in PySpark, like storing everything in RDDs as pickled byte arrays, pipelining RDDs, doing aggregations and joins in the Python interpreters, etc. Since the

Re: Python UDF performance at large scale

2015-06-25 Thread Justin Uang
, you can give it a try. On Wed, Jun 24, 2015 at 4:39 PM, Justin Uang justin.u...@gmail.com wrote: Correct, I was running with a batch size of about 100 when I did the tests, because I was worried about deadlocks. Do you have any concerns regarding the batched synchronous version

Re: Python UDF performance at large scale

2015-06-24 Thread Justin Uang
it? On Wed, Jun 24, 2015 at 7:27 PM Davies Liu dav...@databricks.com wrote: From you comment, the 2x improvement only happens when you have the batch size as 1, right? On Wed, Jun 24, 2015 at 12:11 PM, Justin Uang justin.u...@gmail.com wrote: FYI, just submitted a PR to Pyrolite to remove

Re: Python UDF performance at large scale

2015-06-23 Thread Justin Uang
23, 2015 at 3:27 PM, Justin Uang justin.u...@gmail.com wrote: BLUF: BatchPythonEvaluation's implementation is unusable at large scale, but I have a proof-of-concept implementation that avoids caching the entire dataset. Hi, We have been running into performance problems using Python

Python UDF performance at large scale

2015-06-23 Thread Justin Uang
BLUF: BatchPythonEvaluation's implementation is unusable at large scale, but I have a proof-of-concept implementation that avoids caching the entire dataset. Hi, We have been running into performance problems using Python UDFs with DataFrames at large scale. From the implementation of
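
A minimal sketch of the kind of workload being measured, with a hypothetical DataFrame df and column value: every row has to be serialized out to a Python worker so the lambda can run, which is where the overhead under discussion comes from.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    scale = udf(lambda x: x * 1.5, DoubleType())
    df.select(scale(df["value"]).alias("scaled")).count()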

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-31 Thread Justin Uang
, 2015 at 9:02 AM, Justin Uang justin.u...@gmail.com wrote: On second thought, perhaps can this be done by writing a rule that builds the dag of dependencies between expressions, then convert it into several layers of projections, where each new layer is allowed to depend on expression results from

Re: Using UDFs in Java without registration

2015-05-30 Thread Justin Uang
class, so we can do the proper type conversion (e.g. if the UDF expects a string, but the input expression is an int, Catalyst can automatically add a cast). However, we haven't implemented those in UserDefinedFunction yet. On Fri, May 29, 2015 at 12:54 PM, Justin Uang justin.u...@gmail.com

Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
If I do the following df2 = df.withColumn('y', df['x'] * 7) df3 = df2.withColumn('z', df2.y * 3) df3.explain() Then the result is Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65] PhysicalRDD

Re: Catalyst: Reusing already computed expressions within a projection

2015-05-30 Thread Justin Uang
to this approach? On Sat, May 30, 2015 at 11:30 AM Justin Uang justin.u...@gmail.com wrote: If I do the following df2 = df.withColumn('y', df['x'] * 7) df3 = df2.withColumn('z', df2.y * 3) df3.explain() Then the result is Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64

Using UDFs in Java without registration

2015-05-29 Thread Justin Uang
I would like to define a UDF in Java via a closure and then use it without registration. In Scala, I believe there are two ways to do this: myUdf = functions.udf({ _ + 5}) myDf.select(myUdf(myDf("age"))) or myDf.select(functions.callUDF({_ + 5}, DataTypes.IntegerType, myDf("age")))
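
For comparison, a sketch of the PySpark pattern that already avoids registration (placeholder DataFrame my_df and column age): wrap the function with udf() and call the result directly inside select().

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    add_five = udf(lambda x: x + 5, IntegerType())
    my_df.select(add_five(my_df["age"]))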

Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Justin Uang
In commit 04e44b37, the migration to Python 3, pyspark/sql/types.py was renamed to pyspark/sql/_types.py and then some magic in pyspark/sql/__init__.py dynamically renamed the module back to types. I imagine that this is some naming conflict with Python 3, but what was the error that showed up?
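
A one-line sketch of the kind of import that trips static analysis: it resolves fine at runtime because __init__.py re-exposes the module, but pylint, working from the files on disk, only sees _types.py and will likely report an import error.

    from pyspark.sql.types import StringType   # OK at runtime, flagged by pylint after the rename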

Re: Spark 1.4.0 pyspark and pylint breaking

2015-05-26 Thread Justin Uang
/Frameworks/Python.framework/Versions/3.4/lib/python3.4/types.py' Without renaming, our `types.py` will conflict with it when you run unittests in pyspark/sql/ . On Tue, May 26, 2015 at 11:57 AM, Justin Uang justin.u...@gmail.com wrote: In commit 04e44b37, the migration to Python 3, pyspark/sql

Re: [VOTE] Release Apache Spark 1.4.0 (RC1)

2015-05-22 Thread Justin Uang
I'm working on one of the Palantir teams using Spark, and here is our feedback: We have encountered three issues when upgrading to Spark 1.4.0. I'm not sure they qualify as a -1, as they come from using non-public APIs and multiple Spark contexts for the purposes of testing, but I do want to

UDTs and StringType upgrade issue for Spark 1.4.0

2015-05-22 Thread Justin Uang
We ran into an issue regarding Strings in UDTs when upgrading to Spark 1.4.0-rc. I understand that it's a non-public API, so it's expected, but I just wanted to bring it up for awareness and so we can maybe change the release notes to mention them =) Our UDT was serializing to a StringType, but

Re: RDD split into multiple RDDs

2015-05-19 Thread Justin Uang
To do it in one pass, conceptually what you would need to do is to consume the entire parent iterator and store the values either in memory or on disk, which is generally something you want to avoid given that the parent iterator length is unbounded. If you need to start spilling to disk, you
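
For contrast, a sketch of the usual multi-pass alternative with a throwaway RDD: one filter per output, trading extra passes over the (optionally cached) parent for not having to buffer an unbounded iterator in memory or on disk.

    parent = sc.parallelize(range(100)).cache()
    evens = parent.filter(lambda x: x % 2 == 0)
    odds = parent.filter(lambda x: x % 2 == 1)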

First-class support for pip/virtualenv in pyspark

2015-04-23 Thread Justin Uang
Hi, I have been trying to figure out how to ship a Python package that I have been working on, and this has brought up a couple of questions for me. Please note that I'm fairly new to Python package management, so any feedback/corrections are welcome =) It looks like the --py-files support we have

Infinite recursion when using SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String])

2015-04-19 Thread Justin Uang
Hi, I have a question regarding SQLContext#createDataFrame(JavaRDD[Row], java.util.List[String]). It looks like when I try to call it, it results in an infinite recursion that overflows the stack. I filed it here: https://issues.apache.org/jira/browse/SPARK-6999. What is the best way to fix
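
A hedged sketch of one way to sidestep the affected overload from PySpark while the bug stands, using a placeholder RDD of Rows (row_rdd): pass an explicit StructType instead of a bare list of column names.

    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField("name", StringType(), True),
                         StructField("city", StringType(), True)])
    df = sqlContext.createDataFrame(row_rdd, schema)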