Re: Time for 2.3.1?

2018-05-10 Thread Henry Robinson
+1, I'd like to get a release out with SPARK-23852 fixed. The Parquet
community is about to release 1.8.3 - the voting period closes tomorrow -
and I've tested it with Spark 2.3 and confirmed the bug is fixed. Hopefully
it will be released in time for me to post the version change to branch-2.3
before you start rolling the RC this weekend.

Henry

On 10 May 2018 at 11:09, Marcelo Vanzin  wrote:

> Hello all,
>
> It's been a while since we shipped 2.3.0 and lots of important bug
> fixes have gone into the branch since then. I took a look at Jira and
> it seems there's not a lot of things explicitly targeted at 2.3.1 -
> the only potential blocker (a parquet issue) is being worked on since
> a new parquet with the fix was just released.
>
> So I'd like to propose to release 2.3.1 soon. If there are important
> fixes that should go into the release, please let those be known (by
> replying here or updating the bug in Jira), otherwise I'm volunteering
> to prepare the first RC soon-ish (around the weekend).
>
> Thanks!
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Time for 2.3.1?

2018-05-10 Thread Ryan Blue
Parquet has a Java patch release, 1.8.3, that should pass tomorrow morning.
I think the plan is to get that in to fix a bug with Parquet data written
by Impala.

On Thu, May 10, 2018 at 11:09 AM, Marcelo Vanzin 
wrote:

> Hello all,
>
> It's been a while since we shipped 2.3.0 and lots of important bug
> fixes have gone into the branch since then. I took a look at Jira and
> it seems there's not a lot of things explicitly targeted at 2.3.1 -
> the only potential blocker (a parquet issue) is being worked on since
> a new parquet with the fix was just released.
>
> So I'd like to propose to release 2.3.1 soon. If there are important
> fixes that should go into the release, please let those be known (by
> replying here or updating the bug in Jira), otherwise I'm volunteering
> to prepare the first RC soon-ish (around the weekend).
>
> Thanks!
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Time for 2.3.1?

2018-05-10 Thread Marcelo Vanzin
Hello all,

It's been a while since we shipped 2.3.0 and lots of important bug
fixes have gone into the branch since then. I took a look at Jira and
it seems there's not a lot of things explicitly targeted at 2.3.1 -
the only potential blocker (a parquet issue) is being worked on since
a new parquet with the fix was just released.

So I'd like to propose to release 2.3.1 soon. If there are important
fixes that should go into the release, please let those be known (by
replying here or updating the bug in Jira), otherwise I'm volunteering
to prepare the first RC soon-ish (around the weekend).

Thanks!


-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Felix Cheung
Huge +1 on this!


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Thursday, May 10, 2018 9:39:26 AM
To: Joseph Bradley
Cc: dev
Subject: Re: Revisiting Online serving of Spark models?



On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley wrote:

> Thanks for bringing this up Holden!  I'm a strong supporter of this.

Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local versions of
> MLlib models which could be deployed without the big Spark JARs and without a
> SparkContext or SparkSession.  There are related commercial offerings like
> this : ) but the overhead of maintaining those offerings is pretty high.
> Building good APIs within MLlib to avoid copying logic across libraries will
> be well worth it.
>
> We've talked about this need at Databricks and have also been syncing with
> the creators of MLeap.  It'd be great to get this functionality into Spark
> itself.  Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods taking a
> Row to the current Models.  Instead, it would be ideal to have local,
> lightweight versions of models in mllib-local, outside of the main mllib
> package (for easier deployment with smaller & fewer dependencies).
> * Supporting Pipelines is important.  For this, it would be ideal to utilize
> elements of Spark SQL, particularly Rows and Types, which could be moved into
> a local sql package.
> * This architecture may require some awkward APIs currently to have model
> prediction logic in mllib-local, local model classes in mllib-local, and
> regular (DataFrame-friendly) model classes in mllib.  We might find it
> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
> architecture while making it feasible for 3rd party developers to extend
> MLlib APIs (especially in Java).

I agree this could be interesting, and feed into the other discussion around
when (or if) we should be considering Spark 3.0.
I _think_ we could probably do it with optional traits people could mix in to
avoid breaking the current APIs, but I could be wrong on that point.

> * It could also be worth discussing local DataFrames.  They might not be as
> important as per-Row transformations, but they would be helpful for batching
> for higher throughput.

That could be interesting as well.

> I'll be interested to hear others' thoughts too!
>
> Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau wrote:
Hi y'all,

With the renewed interest in ML in Apache Spark now seems like a good a time as 
any to revisit the online serving situation in Spark ML. DB & other's have done 
some excellent working moving a lot of the necessary tools into a local linear 
algebra package that doesn't depend on having a SparkContext.

There are a few different commercial and non-commercial solutions round this, 
but currently our individual transform/predict methods are private so they 
either need to copy or re-implement (or put them selves in org.apache.spark) to 
access them. How would folks feel about adding a new trait for ML pipeline 
stages to expose to do transformation of single element inputs (or local 
collections) that could be optionally implemented by stages which support this? 
That way we can have less copy and paste code possibly getting out of sync with 
our model training.

I think continuing to have on-line serving grow in different projects is 
probably the right path, forward (folks have different needs), but I'd love to 
see us make it simpler for other projects to build reliable serving tools.

I realize this maybe puts some of the folks in an awkward position with their 
own commercial offerings, but hopefully if we make it easier for everyone the 
commercial vendors can benefit as well.

Cheers,

Holden :)

--
Twitter: https://twitter.com/holdenkarau



--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[http://databricks.com]



--
Twitter: https://twitter.com/holdenkarau


Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Holden Karau
On Thu, May 10, 2018 at 9:25 AM, Joseph Bradley 
wrote:

> Thanks for bringing this up Holden!  I'm a strong supporter of this.
>
Awesome! I'm glad other folks think something like this belongs in Spark.

> This was one of the original goals for mllib-local: to have local versions
> of MLlib models which could be deployed without the big Spark JARs and
> without a SparkContext or SparkSession.  There are related commercial
> offerings like this : ) but the overhead of maintaining those offerings is
> pretty high.  Building good APIs within MLlib to avoid copying logic across
> libraries will be well worth it.
>
> We've talked about this need at Databricks and have also been syncing with
> the creators of MLeap.  It'd be great to get this functionality into Spark
> itself.  Some thoughts:
> * It'd be valuable to have this go beyond adding transform() methods
> taking a Row to the current Models.  Instead, it would be ideal to have
> local, lightweight versions of models in mllib-local, outside of the main
> mllib package (for easier deployment with smaller & fewer dependencies).
> * Supporting Pipelines is important.  For this, it would be ideal to
> utilize elements of Spark SQL, particularly Rows and Types, which could be
> moved into a local sql package.
> * This architecture may require some awkward APIs currently to have model
> prediction logic in mllib-local, local model classes in mllib-local, and
> regular (DataFrame-friendly) model classes in mllib.  We might find it
> helpful to break some DeveloperApis in Spark 3.0 to facilitate this
> architecture while making it feasible for 3rd party developers to extend
> MLlib APIs (especially in Java).
>
I agree this could be interesting, and feed into the other discussion
around when (or if) we should be considering Spark 3.0
I _think_ we could probably do it with optional traits people could mix in
to avoid breaking the current APIs but I could be wrong on that point.

> * It could also be worth discussing local DataFrames.  They might not be
> as important as per-Row transformations, but they would be helpful for
> batching for higher throughput.
>
That could be interesting as well.

>
> I'll be interested to hear others' thoughts too!
>
> Joseph
>
> On Wed, May 9, 2018 at 7:18 AM, Holden Karau  wrote:
>
>> Hi y'all,
>>
>> With the renewed interest in ML in Apache Spark now seems like a good a
>> time as any to revisit the online serving situation in Spark ML. DB &
>> other's have done some excellent working moving a lot of the necessary
>> tools into a local linear algebra package that doesn't depend on having a
>> SparkContext.
>>
>> There are a few different commercial and non-commercial solutions round
>> this, but currently our individual transform/predict methods are private so
>> they either need to copy or re-implement (or put them selves in
>> org.apache.spark) to access them. How would folks feel about adding a new
>> trait for ML pipeline stages to expose to do transformation of single
>> element inputs (or local collections) that could be optionally implemented
>> by stages which support this? That way we can have less copy and paste code
>> possibly getting out of sync with our model training.
>>
>> I think continuing to have on-line serving grow in different projects is
>> probably the right path, forward (folks have different needs), but I'd love
>> to see us make it simpler for other projects to build reliable serving
>> tools.
>>
>> I realize this maybe puts some of the folks in an awkward position with
>> their own commercial offerings, but hopefully if we make it easier for
>> everyone the commercial vendors can benefit as well.
>>
>> Cheers,
>>
>> Holden :)
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: eager execution and debuggability

2018-05-10 Thread Ryan Blue
> it would be fantastic if we could make it easier to debug Spark programs
> without needing to rely on eager execution.

I agree, it would be great if we could make the errors more clear about
where the error happened (user code or in Spark code) and what assumption
was violated. The problem is that this is a really hard thing to do
generally, like Reynold said. I think we should look for individual cases
where we can improve feedback so we can take a deeper look.

For example, we have an error case where users get a `NullPointerException`
in generated code. This was a huge pain to track down the first time, but
the problem is almost always that the user registered a UDF that returns an
object and Spark inferred that it would be non-null but the user's code
returns null. In these cases, we could add better error messages to
generated code, like "Column 'x = some_udf(y)' is required, but the value
was null". That would be really useful.

> I used to use an evaluate(dataframe) -> DataFrame function that simply
> forces the materialization of a dataframe.

We have one of these, too: `display`, which will run a dataframe and format
it for notebooks (html and text output). We also have a `materialize`
method that materializes a dataframe or RDD, like people use `count` for,
but that returns the materialized RDD so we can reuse it from the last
shuffle (we use this to avoid caching). It would be great if it were easier
to reuse the RDDs materialized by these calls, or even automatic. Right
now, if you run `show`, Spark doesn't know that a dataframe was
materialized and won't reuse the results unless you keep a reference to it.
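
As a minimal PySpark sketch of a helper in that spirit (not the actual
implementation described above; it simply triggers a job with `count` and hands
the dataframe back so the caller keeps a reference), it could look like:

    def materialize(df):
        """Force a DataFrame to execute now and return it for reuse.

        Simplified sketch: the helper described above additionally returns the
        materialized RDD so later jobs can reuse the shuffle output.
        """
        df.count()  # triggers the job; the count itself is discarded
        return df

    df = materialize(expensive_df)  # `expensive_df` is a placeholder name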

We also have a problem where a dataframe used multiple times will cause
several table scans when the filters or projected columns change. That's
because each action optimizes the dataframe without knowing about the next.
I'd love to hear ideas on how to fix this.
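
To make the multiple-scan case concrete, here is a small sketch (hypothetical
path and column names; `spark` is an existing session):

    df = spark.read.parquet("/data/events")

    # Each action below is planned and optimized independently, so the source
    # is scanned twice (with different filters and column pruning) unless the
    # user explicitly caches df in between.
    df.filter(df.kind == "click").select("user_id").count()
    df.filter(df.ts > "2018-05-01").select("ts").count()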

On Wed, May 9, 2018 at 5:39 AM, Tim Hunter  wrote:

> The repr() trick is neat when working on a notebook. When working in a
> library, I used to use an evaluate(dataframe) -> DataFrame function that
> simply forces the materialization of a dataframe. As Reynold mentions, this
> is very convenient when working on a lot of chained UDFs, and it is a
> standard trick in lazy environments and languages.
>
> Tim
>
> On Wed, May 9, 2018 at 3:26 AM, Reynold Xin  wrote:
>
>> Yes would be great if possible but it’s non trivial (might be impossible
>> to do in general; we already have stacktraces that point to line numbers
>> when an error occur in UDFs but clearly that’s not sufficient). Also in
>> environments like REPL it’s still more useful to show error as soon as it
>> occurs, rather than showing it potentially 30 lines later.
>>
>> On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> This may be technically impractical, but it would be fantastic if we
>>> could make it easier to debug Spark programs without needing to rely on
>>> eager execution. Sprinkling .count() and .checkpoint() at various
>>> points in my code is still a debugging technique I use, but it always makes
>>> me wish Spark could point more directly to the offending transformation
>>> when something goes wrong.
>>>
>>> Is it somehow possible to have each individual operator (is that the
>>> correct term?) in a DAG include metadata pointing back to the line of code
>>> that generated the operator? That way when an action triggers an error, the
>>> failing operation can point to the relevant line of code — even if it’s a
>>> transformation — and not just the action on the tail end that triggered the
>>> error.
>>>
>>> I don’t know how feasible this is, but addressing it would directly
>>> solve the issue of linking failures to the responsible transformation, as
>>> opposed to leaving the user to break up a chain of transformations with
>>> several debug actions. And this would benefit new and experienced users
>>> alike.
>>>
>>> Nick
>>>
>>> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue rb...@netflix.com.invalid wrote:
>>>
>>> I've opened SPARK-24215 to track this.

 On Tue, May 8, 2018 at 3:58 PM, Reynold Xin 
 wrote:

> Yup. Sounds great. This is something simple Spark can do and provide
> huge value to the end users.
>
>
> On Tue, May 8, 2018 at 3:53 PM Ryan Blue  wrote:
>
>> Would be great if it is something more turn-key.
>>
>> We can easily add the __repr__ and _repr_html_ methods and behavior
>> to PySpark classes. We could also add a configuration property to determine
>> whether the dataset evaluation is eager or not. That would make it turn-key
>> for anyone running PySpark in Jupyter.
>>
>> For JVM languages, we could also add a dependency on jvm-repr and do
>> the same thing.
>>
>> rb
>> ​
>>

Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Joseph Bradley
Thanks for bringing this up Holden!  I'm a strong supporter of this.

This was one of the original goals for mllib-local: to have local versions
of MLlib models which could be deployed without the big Spark JARs and
without a SparkContext or SparkSession.  There are related commercial
offerings like this : ) but the overhead of maintaining those offerings is
pretty high.  Building good APIs within MLlib to avoid copying logic across
libraries will be well worth it.

We've talked about this need at Databricks and have also been syncing with
the creators of MLeap.  It'd be great to get this functionality into Spark
itself.  Some thoughts:
* It'd be valuable to have this go beyond adding transform() methods taking
a Row to the current Models.  Instead, it would be ideal to have local,
lightweight versions of models in mllib-local, outside of the main mllib
package (for easier deployment with smaller & fewer dependencies).
* Supporting Pipelines is important.  For this, it would be ideal to
utilize elements of Spark SQL, particularly Rows and Types, which could be
moved into a local sql package.
* This architecture may require some awkward APIs currently to have model
prediction logic in mllib-local, local model classes in mllib-local, and
regular (DataFrame-friendly) model classes in mllib.  We might find it
helpful to break some DeveloperApis in Spark 3.0 to facilitate this
architecture while making it feasible for 3rd party developers to extend
MLlib APIs (especially in Java).
* It could also be worth discussing local DataFrames.  They might not be as
important as per-Row transformations, but they would be helpful for
batching for higher throughput.
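
To make the per-row motivation concrete, here is a rough PySpark sketch of what
scoring a single record requires today (hypothetical column names; `spark` is a
session and `model` is assumed to be an already-fitted PipelineModel):

    # Even to score one record, we have to build a one-row DataFrame and run a
    # Spark job through the whole pipeline.
    single = spark.createDataFrame([(5.1, 3.5, 1.4, 0.2)], ["a", "b", "c", "d"])
    prediction = model.transform(single).first()["prediction"]

A local, per-row transform on the pipeline stages themselves would avoid that
overhead for online serving.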

I'll be interested to hear others' thoughts too!

Joseph

On Wed, May 9, 2018 at 7:18 AM, Holden Karau  wrote:

> Hi y'all,
>
> With the renewed interest in ML in Apache Spark now seems like a good a
> time as any to revisit the online serving situation in Spark ML. DB &
> other's have done some excellent working moving a lot of the necessary
> tools into a local linear algebra package that doesn't depend on having a
> SparkContext.
>
> There are a few different commercial and non-commercial solutions round
> this, but currently our individual transform/predict methods are private so
> they either need to copy or re-implement (or put them selves in
> org.apache.spark) to access them. How would folks feel about adding a new
> trait for ML pipeline stages to expose to do transformation of single
> element inputs (or local collections) that could be optionally implemented
> by stages which support this? That way we can have less copy and paste code
> possibly getting out of sync with our model training.
>
> I think continuing to have on-line serving grow in different projects is
> probably the right path, forward (folks have different needs), but I'd love
> to see us make it simpler for other projects to build reliable serving
> tools.
>
> I realize this maybe puts some of the folks in an awkward position with
> their own commercial offerings, but hopefully if we make it easier for
> everyone the commercial vendors can benefit as well.
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

[image: http://databricks.com] 


Re: eager execution and debuggability

2018-05-10 Thread Lalwani, Jayesh
If they are struggling to find bugs in their program because of the lazy execution
model of Spark, they are going to struggle to debug issues when the program runs
into problems in production. Learning how to debug Spark is part of learning
Spark. It's better that they run into issues in the classroom and spend the time
and effort learning how to debug such issues than deploy critical code to
production and not know how to resolve the issues.

I would say that if they are struggling to read and analyze a stack trace, then
they are missing a prerequisite. They need to be taught how to look at a stack
trace critically before they start on Spark. Learning how to analyze stack traces
is part of learning Scala/Java/Python. They need to drop Spark and go back to
learning core Scala/Java/Python.



From: Reynold Xin 
Date: Tuesday, May 8, 2018 at 6:45 PM
To: Marco Gaido 
Cc: Ryan Blue , Koert Kuipers , dev 

Subject: Re: eager execution and debuggability

Marco,

There is understanding how Spark works, and there is finding bugs early in 
their own program. One can perfectly understand how Spark works and still find 
it valuable to get feedback asap, and that's why we built eager analysis in the 
first place.

Also I'm afraid you've significantly underestimated the level of technical 
sophistication of users. In many cases they struggle to get anything to work, 
and performance optimization of their programs is secondary to getting things 
working. As John Ousterhout says, "the greatest performance improvement of all 
is when a system goes from not-working to working".

I really like Ryan's approach. Would be great if it is something more turn-key.






On Tue, May 8, 2018 at 2:35 PM Marco Gaido wrote:
I am not sure how useful this is. For students, it is important to understand how
Spark works. This can be critical in many decisions they have to take (whether and
what to cache, for instance) in order to have a performant Spark application.
Eager execution can probably help them get something running more easily, but it
also lets them use Spark while knowing less about how it works, so they are likely
to write worse applications and to have more problems debugging any kind of issue
that may occur later (in production), which would in turn affect their experience
with the tool.

Moreover, as Ryan also mentioned, there are tools/ways to force execution that
help in the debugging phase. So they can achieve the same result without much
effort, but with a big difference: they are aware of what is really happening,
which may help them later.

Thanks,
Marco

2018-05-08 21:37 GMT+02:00 Ryan Blue:

At Netflix, we use Jupyter notebooks and consoles for interactive sessions. For 
anyone interested, this mode of interaction is really easy to add in Jupyter 
and PySpark. You would just define a different repr_html or repr method for 
Dataset that runs a take(10) or take(100) and formats the result.

That way, the output of a cell or console execution always causes the dataframe 
to run and get displayed for that immediate feedback. But, there is no change 
to Spark’s behavior because the action is run by the REPL, and only when a 
dataframe is a result of an execution in order to display it. Intermediate 
results wouldn’t be run, but that gives users a way to avoid too many 
executions and would still support method chaining in the dataframe API (which 
would be horrible with an aggressive execution model).
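
As a rough sketch of that approach (an illustration with assumed formatting, not
the proposed implementation), a notebook user could patch this in themselves
today:

    from pyspark.sql import DataFrame

    def _df_repr_html_(self):
        # Render the first rows as an HTML table so Jupyter displays (and
        # therefore actually runs) the DataFrame whenever it is a cell result.
        rows = self.take(10)
        header = "".join("<th>{}</th>".format(c) for c in self.columns)
        body = "".join(
            "<tr>" + "".join("<td>{}</td>".format(v) for v in row) + "</tr>"
            for row in rows)
        return "<table><tr>{}</tr>{}</table>".format(header, body)

    # Monkey-patched here only for the sketch; the proposal is to ship this in
    # PySpark itself, behind a configuration property for eager evaluation.
    DataFrame._repr_html_ = _df_repr_html_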

There are ways to do this in JVM languages as well if you are using a Scala or
Java interpreter (see jvm-repr). This is actually what we do in our Spark-based
SQL interpreter to display results.

rb
​

On Tue, May 8, 2018 at 12:05 PM, Koert Kuipers wrote:
Yeah, we run into this all the time with new hires. They will send emails
explaining that there is an error in the .write operation, and they are debugging
the writing to disk, focusing on that piece of code :)
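
As a small PySpark sketch of that failure mode (hypothetical column and path
names; `df` is assumed to have a string column called `value`), the bug lives in
the transformation but the error only surfaces at the action:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def parse(value):
        return int(value)  # raises ValueError on bad input

    # Lazy: nothing runs when the transformation is declared.
    parsed = df.withColumn("n", udf(parse, IntegerType())(df.value))

    # The ValueError from parse() only shows up when this action runs, so the
    # stack trace points at the write, not at the buggy UDF.
    parsed.write.parquet("/tmp/out")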

Unrelated, but another frequent cause of confusion is cascading errors, like the
FetchFailedException. They will be debugging the reducer task, not realizing that
the error happened before that and that the FetchFailedException is not the root
cause.


On Tue, May 8, 2018 at 2:52 PM, Reynold Xin wrote:
Similar to the thread yesterday about improving ML/DL integration, I'm sending