Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-19 Thread Bryan Cutler
+1, sounds good

On Wed, May 18, 2022 at 9:16 PM Dongjoon Hyun 
wrote:

> +1
>
> Thank you for the suggestion, Hyukjin.
>
> Dongjoon.
>
> On Wed, May 18, 2022 at 11:08 AM Bjørn Jørgensen 
> wrote:
>
>> +1
>> But can we have the PR title and PR label be the same, PS?
>>
>> ons. 18. mai 2022 kl. 18:57 skrev Xinrong Meng
>> :
>>
>>> Great!
>>>
>>> It saves us from always specifying "Pandas API on Spark" in PR titles.
>>>
>>> Thanks!
>>>
>>>
>>> Xinrong Meng
>>>
>>> Software Engineer
>>>
>>> Databricks
>>>
>>>
>>> On Tue, May 17, 2022 at 1:08 AM Maciej  wrote:
>>>
 Sounds good!

 +1

 On 5/17/22 06:08, Yikun Jiang wrote:
 > It's a pretty good idea, +1.
 >
 > To be clear in Github:
 >
 > - For each PR title: [SPARK-XXX][PYTHON][PS] The pandas-on-Spark PR title
 > (*still keep [PYTHON]*; [PS] is newly added)
 >
 > - For PR labels: newly added: `PANDAS API ON SPARK`; still keep: `PYTHON`,
 > `CORE`
 > (*still keep `PYTHON`, `CORE`*; `PANDAS API ON SPARK` is newly added)
 > https://github.com/apache/spark/pull/36574
 > 
 >
 > Right?
 >
 > Regards,
 > Yikun
 >
 >
 > On Tue, May 17, 2022 at 11:26 AM Hyukjin Kwon >>> > > wrote:
 >
 > Hi all,
 >
 > How about we introduce a JIRA component "Pandas API on Spark"
 > and use "PS" (pandas-on-Spark) in PR titles? We already use "ps" in
 > many places, as in: import pyspark.pandas as ps.
 > This is similar to "Structured Streaming" in JIRA, and "SS" in PR
 title.
 >
 > I think it'd be easier to track the changes here with that.
 > Currently it's a bit difficult to distinguish them from pure PySpark
 > changes.
 >


 --
 Best regards,
 Maciej Szymkiewicz

 Web: https://zero323.net
 PGP: A30CEF0C31A501EC

>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: Welcoming six new Apache Spark committers

2021-03-29 Thread Bryan Cutler
Congratulations everyone!

On Sun, Mar 28, 2021 at 11:00 PM ML Books  wrote:

> Congrats all
>
> On Sat, Mar 27, 2021, 1:58 AM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers. Please join
>> me in welcoming them to their new role! Our new committers are:
>>
>> - Maciej Szymkiewicz (contributor to PySpark)
>> - Max Gekk (contributor to Spark SQL)
>> - Kent Yao (contributor to Spark SQL)
>> - Attila Zsolt Piros (contributor to decommissioning and Spark on
>> Kubernetes)
>> - Yi Wu (contributor to Spark Core and SQL)
>> - Gabor Somogyi (contributor to Streaming and security)
>>
>> All six of them contributed to Spark 3.1 and we’re very excited to have
>> them join as committers.
>>
>> Matei and the Spark PMC
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Bryan Cutler
+1 (non-binding)

On Fri, Mar 26, 2021 at 9:49 AM Maciej  wrote:

> +1 (nonbinding)
>
> On 3/26/21 3:52 PM, Hyukjin Kwon wrote:
>
> Hi all,
>
> I’d like to start a vote for SPIP: Support pandas API layer on PySpark.
>
> The proposal is to embrace Koalas in PySpark to have the pandas API layer
> on PySpark.
>
>
> Please also refer to:
>
>- Previous discussion in dev mailing list: [DISCUSS] Support pandas
>API layer on PySpark
>
> 
>.
>- JIRA: SPARK-34849 
>- Koalas internals documentation:
>
> https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
>
> --
> Best regards,
> Maciej Szymkiewicz
>
> Web: https://zero323.net
> Keybase: https://keybase.io/zero323
> Gigs: https://www.codementor.io/@zero323
> PGP: A30CEF0C31A501EC
>
>


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Bryan Cutler
+1, the proposal sounds good to me. Having a familiar API built in will
really help new users who might only have Pandas experience get into using
Spark. It sounds like maintenance costs should be manageable once the
hurdle of setting up tests is cleared. Just out of curiosity, does Koalas
pretty much implement all of the Pandas APIs now? If some are yet to be
implemented, or others behave differently, are these documented so users
won't be caught off guard?

On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo  wrote:

> Hi,
>
> Integrating Koalas with pyspark might help enable a richer integration
> between the two. Something that would be useful with a tighter
> integration is support for custom column array types. Currently, Spark
> takes dataframes, converts them to arrow buffers then transmits them
> over the socket to Python. On the other side, pyspark takes the arrow
> buffer and converts it to a Pandas dataframe. Unfortunately, the
> default Pandas representation of a list-type for a column causes it to
> turn what was contiguous value/offset arrays in Arrow into
> deserialized Python objects for each row. Obviously, this kills
> performance.
>
> A PR to extend the pyspark API to elide the pandas conversion
> (https://github.com/apache/spark/pull/26783) was submitted and
> rejected, which is unfortunate, but perhaps this proposed integration
> would provide the hooks via Pandas' ExtensionArray interface to allow
> Spark to performantly interchange jagged/ragged lists to/from python
> UDFs.
>
> Cheers
> Andrew
>
> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon  wrote:
> >
> > Thank you guys for all your feedback. I will start working on SPIP with
> Koalas team.
> > I would expect the SPIP can be sent late this week or early next week.
> >
> >
> > I inlined and answered the questions unanswered as below:
> >
> > Is the community developing the pandas API layer for Spark interested in
> being part of Spark or do they prefer having their own release cycle?
> >
> > Yeah, Koalas team used to have its own release cycle to develop and move
> quickly.
> > Now it became pretty mature with reaching 1.7.0, and the team thinks
> that it’s now
> > fine to have less frequent releases, and they are happy to work together
> with Spark with
> > contributing to it. The active contributors in the Koalas community will
> continue to
> > make the contributions in Spark.
> >
> > How about test code? Does it fit into the PySpark test framework?
> >
> > Yes, this will be one of the places where it needs some efforts. Koalas
> currently uses pytest
> > with various dependency version combinations (e.g., Python version,
> conda vs pip) whereas
> > PySpark uses the plain unittests with less dependency version
> combinations.
> >
> > For pytest in Koalas <> unittests in PySpark:
> >
> >   I am currently thinking we will have to convert the Koalas tests to
> use unittests to match
> >   with PySpark for now.
> >   It is a feasible option for PySpark to migrate to pytest too but it
> will need extra effort to
> >   make it working with our own PySpark testing framework seamlessly.
> >   Koalas team (presumably and likely I) will take a look in any event.
> >
> > For the combinations of dependency versions:
> >
> >   Due to the lack of the resources in GitHub Actions, I currently plan
> to just add the
> >   Koalas tests into the matrix PySpark is currently using.
> >
> > one question I have; what’s an initial goal of the proposal?
> > Is that to port all the pandas interfaces that Koalas has already
> implemented?
> > Or, the basic set of them?
> >
> > The goal of the proposal is to port all of Koalas project into PySpark.
> > For example,
> >
> > import koalas
> >
> > will be equivalent to
> >
> > # Names, etc. might change in the final proposal or during the review
> > from pyspark.sql import pandas
> >
> > Koalas supports pandas APIs with a separate layer to cover a bit of
> difference between
> > DataFrame structures in pandas and PySpark, e.g.) other types as column
> names (labels),
> > index (something like row number in DBMSs) and so on. So I think it
> would make more sense
> > to port the whole layer instead of a subset of the APIs.
> >
> >
> >
> >
> >
> > 2021년 3월 17일 (수) 오전 12:32, Wenchen Fan 님이 작성:
> >>
> >> +1, it's great to have Pandas support in Spark out of the box.
> >>
> >> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro <
> linguin@gmail.com> wrote:
> >>>
> >>> +1; the pandas interfaces are pretty popular and supporting them in
> pyspark looks promising, I think.
> >>> one question I have; what's an initial goal of the proposal?
> >>> Is that to port all the pandas interfaces that Koalas has already
> implemented?
> >>> Or, the basic set of them?
> >>>
> >>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía 
> wrote:
> 
>  +1
> 
>  Bringing a Pandas API for pyspark to upstream Spark will only bring
>  benefits for everyone (more eyes to use/see/fix/improve the API) as
>  well 

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Bryan Cutler
Congratulations and welcome!

On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang  wrote:

> Welcome, Huaxin, Jungtaek, and Dilip!
>
> Congratulations!
>
> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers. Please join
>> me in welcoming them to their new roles! The new committers are:
>>
>> - Huaxin Gao
>> - Jungtaek Lim
>> - Dilip Biswal
>>
>> All three of them contributed to Spark 3.0 and we’re excited to have them
>> join the project.
>>
>> Matei and the Spark PMC
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Bryan Cutler
+1 (non-binding)

On Mon, Jun 8, 2020, 1:49 PM Tom Graves 
wrote:

> +1
>
> Tom
>
> On Saturday, June 6, 2020, 03:09:09 PM CDT, Reynold Xin <
> r...@databricks.com> wrote:
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.0.0.
>
> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes are
> cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.0.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is v3.0.0-rc3 (commit
> 3fdfce3120f307147244e5eaf46d61419a723d50):
> https://github.com/apache/spark/tree/v3.0.0-rc3
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1350/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>
> The list of bug fixes going into 3.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>
> This release is using the release script of the tag v3.0.0-rc3.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.0.0?
> ===
>
> The current list of open tickets targeted at 3.0.0 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.0.0
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>


Re: Revisiting Python / pandas UDF (continues)

2019-12-16 Thread Bryan Cutler
Thanks for taking this on Hyukjin! I'm looking forward to the PRs and happy
to help out where I can.

Bryan

On Wed, Dec 4, 2019 at 9:13 PM Hyukjin Kwon  wrote:

> Hi all,
>
> I would like to finish redesigning Pandas UDF ones in Spark 3.0.
> If you guys don't have a minor concern in general about (see
> https://issues.apache.org/jira/browse/SPARK-28264),
> I would like to start soon after addressing existing comments.
>
> Please take a look and comment on the design docs.
>
> Thanks!
>


Re: Slower than usual on PRs

2019-12-16 Thread Bryan Cutler
Sorry to hear this Holden! Hope you get well soon and take it easy!!

On Tue, Dec 3, 2019 at 6:21 PM Hyukjin Kwon  wrote:

> Yeah, please take care of your heath first!
>
> 2019년 12월 3일 (화) 오후 1:32, Wenchen Fan 님이 작성:
>
>> Sorry to hear that. Hope you get better soon!
>>
>> On Tue, Dec 3, 2019 at 1:28 AM Holden Karau  wrote:
>>
>>> Hi Spark dev folks,
>>>
>>> Just an FYI I'm out dealing with recovering from a motorcycle accident
>>> so my lack of (or slow) responses on PRs/docs is health related and please
>>> don't block on any of my reviews. I'll do my best to find some OSS cycles
>>> once I get back home.
>>>
>>> Cheers,
>>>
>>> Holden
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: [build system] Upgrading pyarrow, builds might be temporarily broken

2019-11-14 Thread Bryan Cutler
Update: #26133 <https://github.com/apache/spark/pull/26133> has been merged
and builds should be passing now, thanks all!

On Thu, Nov 14, 2019 at 4:12 PM Bryan Cutler  wrote:

> We are in the process of upgrading pyarrow in the testing environment,
> which might cause pyspark test failures until
> https://github.com/apache/spark/pull/26133 is merged. Apologies for the
> lack of notice beforehand, but I jumped the gun a little and forgot this
> would affect other builds too. The PR for the upgrade is all ready to go
> and *should* pass current tests. I'll keep an eye on it and try to get it
> resolved tonight. If it becomes a problem, we can try to downgrade pyarrow
> to where it was.
>
> Thanks,
> Bryan
>


[build system] Upgrading pyarrow, builds might be temporarily broken

2019-11-14 Thread Bryan Cutler
We are in the process of upgrading pyarrow in the testing environment,
which might cause pyspark test failures until
https://github.com/apache/spark/pull/26133 is merged. Apologies for the
lack of notice beforehand, but I jumped the gun a little and forgot this
would affect other builds too. The PR for the upgrade is all ready to go
and *should* pass current tests. I'll keep an eye on it and try to get it
resolved tonight. If it becomes a problem, we can try to downgrade pyarrow
to where it was.

Thanks,
Bryan


[DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-04 Thread Bryan Cutler
Currently, when a PySpark Row is created with keyword arguments, the fields
are sorted alphabetically. This has caused a lot of confusion for users
because it is not obvious (although it is stated in the pydocs) that the
fields will be sorted. Later, when a schema is applied and the field order
does not match, an error occurs. Here are some of the JIRAs I have been
tracking, all related to this issue: SPARK-24915, SPARK-22232, SPARK-27939,
SPARK-27712; see also the relevant discussion of the issue [1].

The original reason for sorting fields is that kwargs in Python < 3.6 are
not guaranteed to be in the same order they were entered [2]; sorting
alphabetically ensures a consistent order. Matters are further complicated
by the from_dict flag, which allows the Row fields to be referenced by name
when the Row is made from kwargs, but this flag is not serialized with the
Row and leads to inconsistent behavior. For instance:

>>> spark.createDataFrame([Row(A="1", B="2")], "B string, A string").first()
Row(B='2', A='1')
>>> spark.createDataFrame(spark.sparkContext.parallelize([Row(A="1", B="2")]),
...     "B string, A string").first()
Row(B='1', A='2')

I think the best way to fix this is to remove the sorting of fields when
constructing a Row. For users with Python 3.6+, nothing would change
because these versions of Python keep kwargs in the order entered. For
users with Python < 3.6, using kwargs would check a conf to either raise an
error or fall back to a LegacyRow that sorts the fields as before. Since
Python < 3.6 is being deprecated now, this LegacyRow can be removed when
that support is dropped. Other ways to create Rows will not be affected. I
have opened a JIRA [3] to capture this, but I am wondering what others
think about fixing this for Spark 3.0?
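
As a rough illustration of the intended behavior (a sketch only, assuming
Python 3.6+ where keyword arguments preserve insertion order):

    from pyspark.sql import Row

    # Today: fields are sorted alphabetically regardless of entry order.
    r = Row(B="2", A="1")
    print(r)  # currently prints Row(A='1', B='2')

    # Under the proposal on Python 3.6+, the kwargs order would be kept, so the
    # same call would produce Row(B='2', A='1') and line up with a
    # "B string, A string" schema without any surprise reordering.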

[1] https://github.com/apache/spark/pull/20280
[2] https://www.python.org/dev/peps/pep-0468/
[3] https://issues.apache.org/jira/browse/SPARK-29748


Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-31 Thread Bryan Cutler
+1 for deprecating

On Wed, Oct 30, 2019 at 2:46 PM Shane Knapp  wrote:

> sure.  that shouldn't be too hard, but we've historically given very
> little support to it.
>
> On Wed, Oct 30, 2019 at 2:31 PM Maciej Szymkiewicz 
> wrote:
>
>> Could we upgrade to PyPy3.6 v7.2.0?
>> On 10/30/19 9:45 PM, Shane Knapp wrote:
>>
>> one quick thing:  we currently test against python2.7, 3.6 *and*
>> pypy2.5.1 (python2.7).
>>
>> what are our plans for pypy?
>>
>>
>> On Wed, Oct 30, 2019 at 12:26 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you all. I made a PR for that.
>>>
>>> https://github.com/apache/spark/pull/26326
>>>
>>> On Tue, Oct 29, 2019 at 5:45 AM Takeshi Yamamuro 
>>> wrote:
>>>
 +1, too.

 On Tue, Oct 29, 2019 at 4:16 PM Holden Karau 
 wrote:

> +1 to deprecating but not yet removing support for 3.6
>
> On Tue, Oct 29, 2019 at 3:47 AM Shane Knapp 
> wrote:
>
>> +1 to testing the absolute minimum number of python variants as
>> possible.  ;)
>>
>> On Mon, Oct 28, 2019 at 7:46 PM Hyukjin Kwon 
>> wrote:
>>
>>> +1 from me as well.
>>>
>>> 2019년 10월 29일 (화) 오전 5:34, Xiangrui Meng 님이 작성:
>>>
 +1. And we should start testing 3.7 and maybe 3.8 in Jenkins.

 On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun <
 dongjoon.h...@gmail.com> wrote:

> Thank you for starting the thread.
>
> In addition to that, we currently are testing Python 3.6 only in
> Apache Spark Jenkins environment.
>
> Given that Python 3.8 is already out and Apache Spark 3.0.0 RC1
> will start next January
> (https://spark.apache.org/versioning-policy.html), I'm +1 for the
> deprecation (Python < 3.6) at Apache Spark 3.0.0.
>
> It's just a deprecation to prepare the next-step development cycle.
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 24, 2019 at 1:10 AM Maciej Szymkiewicz <
> mszymkiew...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> While deprecation of Python 2 in 3.0.0 has been announced
>> ,
>> there is no clear statement about specific continuing support of 
>> different
>> Python 3 version.
>>
>> Specifically:
>>
>>- Python 3.4 has been retired this year.
>>- Python 3.5 is already in the "security fixes only" mode and
>>should be retired in the middle of 2020.
>>
>> Continued support of these two blocks adoption of many new Python
>> features (PEP 468)  and it is hard to justify beyond 2020.
>>
>> Should these two be deprecated in 3.0.0 as well?
>>
>> --
>> Best regards,
>> Maciej
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


 --
 ---
 Takeshi Yamamuro

>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>> --
>> Best regards,
>> Maciej
>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Welcoming some new committers and PMC members

2019-09-17 Thread Bryan Cutler
Congratulations, all well deserved!

On Thu, Sep 12, 2019, 3:32 AM Jacek Laskowski  wrote:

> Hi,
>
> What a great news! Congrats to all awarded and the community for voting
> them in!
>
> p.s. I think it should go to the user mailing list too.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> The Internals of Spark SQL https://bit.ly/spark-sql-internals
> The Internals of Spark Structured Streaming
> https://bit.ly/spark-structured-streaming
> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
> Follow me at https://twitter.com/jaceklaskowski
>
>
>
> On Tue, Sep 10, 2019 at 2:32 AM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> The Spark PMC recently voted to add several new committers and one PMC
>> member. Join me in welcoming them to their new roles!
>>
>> New PMC member: Dongjoon Hyun
>>
>> New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
>> Weichen Xu, Ruifeng Zheng
>>
>> The new committers cover lots of important areas including ML, SQL, and
>> data sources, so it’s great to have them here. All the best,
>>
>> Matei and the Spark PMC
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] [SPARK-27495] SPIP: Support Stage level resource configuration and scheduling

2019-09-11 Thread Bryan Cutler
+1 (non-binding), looks good!

On Wed, Sep 11, 2019 at 10:05 AM Ryan Blue 
wrote:

> +1
>
> This is going to be really useful. Thanks for working on it!
>
> On Wed, Sep 11, 2019 at 9:38 AM Felix Cheung 
> wrote:
>
>> +1
>>
>> --
>> *From:* Thomas graves 
>> *Sent:* Wednesday, September 4, 2019 7:24:26 AM
>> *To:* dev 
>> *Subject:* [VOTE] [SPARK-27495] SPIP: Support Stage level resource
>> configuration and scheduling
>>
>> Hey everyone,
>>
>> I'd like to call for a vote on SPARK-27495 SPIP: Support Stage level
>> resource configuration and scheduling
>>
>> This is for supporting stage level resource configuration and
>> scheduling.  The basic idea is to allow the user to specify executor
>> and task resource requirements for each stage to allow the user to
>> control the resources required at a finer grain. One good example here
>> is doing some ETL to preprocess your data in one stage and then feed
>> that data into an ML algorithm (like tensorflow) that would run as a
>> separate stage.  The ETL could need totally different resource
>> requirements for the executors/tasks than the ML stage does.
>>
>> The text for the SPIP is in the jira description:
>>
>> https://issues.apache.org/jira/browse/SPARK-27495
>>
>> I split the API and Design parts into a google doc that is linked to
>> from the jira.
>>
>> This vote is open until next Fri (Sept 13th).
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> I'll start with my +1
>>
>> Thanks,
>> Tom
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: [DISCUSS] New sections in Github Pull Request description template

2019-07-26 Thread Bryan Cutler
The k8s template is pretty good. Under the behavior change section, it
would be good to add instructions to also describe previous and new
behavior as Hyukjin proposed.

On Tue, Jul 23, 2019 at 10:07 PM Reynold Xin  wrote:

> I like the spirit, but not sure about the exact proposal. Take a look at
> k8s':
> https://raw.githubusercontent.com/kubernetes/kubernetes/master/.github/PULL_REQUEST_TEMPLATE.md
>
>
>
> On Tue, Jul 23, 2019 at 8:27 PM, Hyukjin Kwon  wrote:
>
>> (Plus, it helps to track history too. Spark's commit logs are growing and
>> now it's pretty difficult to track the history and see what change
>> introduced a specific behaviour)
>>
>> 2019년 7월 24일 (수) 오후 12:20, Hyukjin Kwon 님이 작성:
>>
>> Hi all,
>>
>> I would like to discuss about some new sections under "## What changes
>> were proposed in this pull request?":
>>
>> ### Do the changes affect _any_ user/dev-facing input or output?
>>
>> (Please answer yes or no. If yes, answer the questions below)
>>
>> ### What was the previous behavior?
>>
>> (Please provide the console output, description and/or reproducer about the 
>> previous behavior)
>>
>> ### What is the behavior the changes propose?
>>
>> (Please provide the console output, description and/or reproducer about the 
>> behavior the changes propose)
>>
>> See
>> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>>  .
>>
>> From my experience so far in Spark community, and assuming from the
>> interaction with other
>> committers and contributors, It is pretty critical to know before/after
>> behaviour changes even if it
>> was a bug. In addition, I think this is requested by reviewers often.
>>
>> The new sections will make review process much easier, and we're able to
>> quickly judge how serious the changes are.
>> Given that Spark community still suffer from open PRs just queueing up
>> without review, I think this can help
>> both reviewers and PR authors.
>>
>> I do describe them often when I think it's useful and possible.
>> For instance see https://github.com/apache/spark/pull/24927 - I am sure
>> you guys have clear idea what the
>> PR fixes.
>>
>> I cc'ed some guys I can currently think of for now FYI. Please let me
>> know if you guys have any thought on this!
>>
>>
>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
Yeah, PyArrow is the only other PySpark dependency we check for a minimum
version. We updated that not too long ago to be 0.12.1, which I think we
are still good on for now.

On Fri, Jun 14, 2019 at 11:36 AM Felix Cheung 
wrote:

> How about pyArrow?
>
> --
> *From:* Holden Karau 
> *Sent:* Friday, June 14, 2019 11:06:15 AM
> *To:* Felix Cheung
> *Cc:* Bryan Cutler; Dongjoon Hyun; Hyukjin Kwon; dev; shane knapp
> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>
> Are there other Python dependencies we should consider upgrading at the
> same time?
>
> On Fri, Jun 14, 2019 at 7:45 PM Felix Cheung 
> wrote:
>
>> So to be clear, min version check is 0.23
>> Jenkins test is 0.24
>>
>> I’m ok with this. I hope someone will test 0.23 on releases though before
>> we sign off?
>>
> We should maybe add this to the release instruction notes?
>
>>
>> ------
>> *From:* shane knapp 
>> *Sent:* Friday, June 14, 2019 10:23:56 AM
>> *To:* Bryan Cutler
>> *Cc:* Dongjoon Hyun; Holden Karau; Hyukjin Kwon; dev
>> *Subject:* Re: [DISCUSS] Increasing minimum supported version of Pandas
>>
>> excellent.  i shall not touch anything.  :)
>>
>> On Fri, Jun 14, 2019 at 10:22 AM Bryan Cutler  wrote:
>>
>>> Shane, I think 0.24.2 is probably more common right now, so if we were
>>> to pick one to test against, I still think it should be that one. Our
>>> Pandas usage in PySpark is pretty conservative, so it's pretty unlikely
>>> that we will add something that would break 0.23.X.
>>>
>>> On Fri, Jun 14, 2019 at 10:10 AM shane knapp 
>>> wrote:
>>>
>>>> ah, ok...  should we downgrade the testing env on jenkins then?  any
>>>> specific version?
>>>>
>>>> shane, who is loathe (and i mean LOATHE) to touch python envs ;)
>>>>
>>>> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler 
>>>> wrote:
>>>>
>>>>> I should have stated this earlier, but when the user does something
>>>>> that requires Pandas, the minimum version is checked against what was
>>>>> imported and will raise an exception if it is a lower version. So I'm
>>>>> concerned that using 0.24.2 might be a little too new for users running
>>>>> older clusters. To give some release dates, 0.23.2 was released about a
>>>>> year ago, 0.24.0 in January and 0.24.2 in March.
>>>>>
>>>> I think given that we’re switching to requiring Python 3 and also a bit
> of a way from cutting a release 0.24 could be Ok as a min version
> requirement
>
>>
>>>>>
>>>>> On Fri, Jun 14, 2019 at 9:27 AM shane knapp 
>>>>> wrote:
>>>>>
>>>>>> just to everyone knows, our python 3.6 testing infra is currently on
>>>>>> 0.24.2...
>>>>>>
>>>>>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun <
>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> Thank you for this effort, Bryan!
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I’m +1 for upgrading, although since this is probably the last easy
>>>>>>>> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow
>>>>>>>>> and pandas combinations. Spark 3 should be good time to increase.
>>>>>>>>>
>>>>>>>>> 2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> We would like to discuss increasing the minimum supported version
>>>>>>>>>> of Pandas in Spark, which is currently 0.19.2.
>>>>>>>>>>
>>>>>>>>>> Pandas 0.19.2 was released nearly 3 years ago and there are some
&

Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
Shane, I think 0.24.2 is probably more common right now, so if we were to
pick one to test against, I still think it should be that one. Our Pandas
usage in PySpark is pretty conservative, so it's pretty unlikely that we
will add something that would break 0.23.X.

On Fri, Jun 14, 2019 at 10:10 AM shane knapp  wrote:

> ah, ok...  should we downgrade the testing env on jenkins then?  any
> specific version?
>
> shane, who is loathe (and i mean LOATHE) to touch python envs ;)
>
> On Fri, Jun 14, 2019 at 10:08 AM Bryan Cutler  wrote:
>
>> I should have stated this earlier, but when the user does something that
>> requires Pandas, the minimum version is checked against what was imported
>> and will raise an exception if it is a lower version. So I'm concerned that
>> using 0.24.2 might be a little too new for users running older clusters. To
>> give some release dates, 0.23.2 was released about a year ago, 0.24.0 in
>> January and 0.24.2 in March.
>>
>> On Fri, Jun 14, 2019 at 9:27 AM shane knapp  wrote:
>>
>>> just to everyone knows, our python 3.6 testing infra is currently on
>>> 0.24.2...
>>>
>>> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Thank you for this effort, Bryan!
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>>>> wrote:
>>>>
>>>>> I’m +1 for upgrading, although since this is probably the last easy
>>>>> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>>>>>
>>>>>
>>>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow
>>>>>> and pandas combinations. Spark 3 should be good time to increase.
>>>>>>
>>>>>> 2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We would like to discuss increasing the minimum supported version of
>>>>>>> Pandas in Spark, which is currently 0.19.2.
>>>>>>>
>>>>>>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>>>>>>> workarounds in PySpark that could be removed if such an old version is 
>>>>>>> not
>>>>>>> required. This will help to keep code clean and reduce maintenance 
>>>>>>> effort.
>>>>>>>
>>>>>>> The change is targeted for Spark 3.0.0 release, see
>>>>>>> https://issues.apache.org/jira/browse/SPARK-28041. The current
>>>>>>> thought is to bump the version to 0.23.2, but we would like to discuss
>>>>>>> before making a change. Does anyone else have thoughts on this?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Bryan
>>>>>>>
>>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [DISCUSS] Increasing minimum supported version of Pandas

2019-06-14 Thread Bryan Cutler
I should have stated this earlier, but when the user does something that
requires Pandas, the minimum version is checked against what was imported
and will raise an exception if it is a lower version. So I'm concerned that
using 0.24.2 might be a little too new for users running older clusters. To
give some release dates, 0.23.2 was released about a year ago, 0.24.0 in
January and 0.24.2 in March.
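
As a rough sketch of the kind of check described above (an illustration,
not the actual PySpark code), assuming a simple comparison of the imported
pandas version against a required minimum:

    # Sketch only: an illustrative minimum-version check, not PySpark's code.
    from distutils.version import LooseVersion

    MINIMUM_PANDAS_VERSION = "0.23.2"  # illustrative value under discussion

    def require_minimum_pandas_version():
        try:
            import pandas
        except ImportError:
            raise ImportError(
                "Pandas >= %s must be installed to use this feature"
                % MINIMUM_PANDAS_VERSION)
        if LooseVersion(pandas.__version__) < LooseVersion(MINIMUM_PANDAS_VERSION):
            raise ImportError(
                "Pandas >= %s must be installed; found %s"
                % (MINIMUM_PANDAS_VERSION, pandas.__version__))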

On Fri, Jun 14, 2019 at 9:27 AM shane knapp  wrote:

> just to everyone knows, our python 3.6 testing infra is currently on
> 0.24.2...
>
> On Fri, Jun 14, 2019 at 9:16 AM Dongjoon Hyun 
> wrote:
>
>> +1
>>
>> Thank you for this effort, Bryan!
>>
>> Bests,
>> Dongjoon.
>>
>> On Fri, Jun 14, 2019 at 4:24 AM Holden Karau 
>> wrote:
>>
>>> I’m +1 for upgrading, although since this is probably the last easy
>>> chance we’ll have to bump version numbers easily I’d suggest 0.24.2
>>>
>>>
>>> On Fri, Jun 14, 2019 at 4:38 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> I am +1 to go for 0.23.2 - it brings some overhead to test PyArrow and
>>>> pandas combinations. Spark 3 should be good time to increase.
>>>>
>>>> 2019년 6월 14일 (금) 오전 9:46, Bryan Cutler 님이 작성:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We would like to discuss increasing the minimum supported version of
>>>>> Pandas in Spark, which is currently 0.19.2.
>>>>>
>>>>> Pandas 0.19.2 was released nearly 3 years ago and there are some
>>>>> workarounds in PySpark that could be removed if such an old version is not
>>>>> required. This will help to keep code clean and reduce maintenance effort.
>>>>>
>>>>> The change is targeted for Spark 3.0.0 release, see
>>>>> https://issues.apache.org/jira/browse/SPARK-28041. The current
>>>>> thought is to bump the version to 0.23.2, but we would like to discuss
>>>>> before making a change. Does anyone else have thoughts on this?
>>>>>
>>>>> Regards,
>>>>> Bryan
>>>>>
>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


[DISCUSS] Increasing minimum supported version of Pandas

2019-06-13 Thread Bryan Cutler
Hi All,

We would like to discuss increasing the minimum supported version of Pandas
in Spark, which is currently 0.19.2.

Pandas 0.19.2 was released nearly 3 years ago and there are some
workarounds in PySpark that could be removed if such an old version is not
required. This will help to keep code clean and reduce maintenance effort.

The change is targeted for Spark 3.0.0 release, see
https://issues.apache.org/jira/browse/SPARK-28041. The current thought is
to bump the version to 0.23.2, but we would like to discuss before making a
change. Does anyone else have thoughts on this?

Regards,
Bryan


Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Bryan Cutler
+1 and the draft sounds good

On Thu, May 30, 2019, 11:32 AM Xiangrui Meng  wrote:

> Here is the draft announcement:
>
> ===
> Plan for dropping Python 2 support
>
> As many of you already knew, Python core development team and many
> utilized Python packages like Pandas and NumPy will drop Python 2 support
> in or before 2020/01/01. Apache Spark has supported both Python 2 and 3
> since Spark 1.4 release in 2015. However, maintaining Python 2/3
> compatibility is an increasing burden and it essentially limits the use of
> Python 3 features in Spark. Given the end of life (EOL) of Python 2 is
> coming, we plan to eventually drop Python 2 support as well. The current
> plan is as follows:
>
> * In the next major release in 2019, we will deprecate Python 2 support.
> PySpark users will see a deprecation warning if Python 2 is used. We will
> publish a migration guide for PySpark users to migrate to Python 3.
> * We will drop Python 2 support in a future release in 2020, after Python
> 2 EOL on 2020/01/01. PySpark users will see an error if Python 2 is used.
> * For releases that support Python 2, e.g., Spark 2.4, their patch
> releases will continue supporting Python 2. However, after Python 2 EOL, we
> might not take patches that are specific to Python 2.
> ===
>
> Sean helped make a pass. If it looks good, I'm going to upload it to Spark
> website and announce it here. Let me know if you think we should do a VOTE
> instead.
>
> On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng  wrote:
>
>> I created https://issues.apache.org/jira/browse/SPARK-27884 to track the
>> work.
>>
>> On Thu, May 30, 2019 at 2:18 AM Felix Cheung 
>> wrote:
>>
>>> We don’t usually reference a future release on website
>>>
>>> > Spark website and state that Python 2 is deprecated in Spark 3.0
>>>
>>> I suspect people will then ask when is Spark 3.0 coming out then. Might
>>> need to provide some clarity on that.
>>>
>>
>> We can say the "next major release in 2019" instead of Spark 3.0. Spark
>> 3.0 timeline certainly requires a new thread to discuss.
>>
>>
>>>
>>>
>>> --
>>> *From:* Reynold Xin 
>>> *Sent:* Thursday, May 30, 2019 12:59:14 AM
>>> *To:* shane knapp
>>> *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen
>>> Fen; Xiangrui Meng; dev; user
>>> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>>>
>>> +1 on Xiangrui’s plan.
>>>
>>> On Thu, May 30, 2019 at 7:55 AM shane knapp  wrote:
>>>
 I don't have a good sense of the overhead of continuing to support
> Python 2; is it large enough to consider dropping it in Spark 3.0?
>
> from the build/test side, it will actually be pretty easy to continue
 support for python2.7 for spark 2.x as the feature sets won't be expanding.

>>>
 that being said, i will be cracking a bottle of champagne when i can
 delete all of the ansible and anaconda configs for python2.x.  :)

>>>
>> On the development side, in a future release that drops Python 2 support
>> we can remove code that maintains python 2/3 compatibility and start using
>> python 3 only features, which is also quite exciting.
>>
>>
>>>
 shane
 --
 Shane Knapp
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-08 Thread Bryan Cutler
+1 (non-binding)

On Tue, May 7, 2019 at 12:04 PM Bobby Evans  wrote:

> I am +!
>
> On Tue, May 7, 2019 at 1:37 PM Thomas graves  wrote:
>
>> Hi everyone,
>>
>> I'd like to call for another vote on SPARK-27396 - SPIP: Public APIs
>> for extended Columnar Processing Support.  The proposal is to extend
>> the support to allow for more columnar processing.  We had previous
>> vote and discussion threads and have updated the SPIP based on the
>> comments to clarify a few things and reduce the scope.
>>
>> You can find the updated proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396.
>>
>> Please vote as early as you can, I will leave the vote open until next
>> Monday (May 13th), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thanks!
>> Tom Graves
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-02 Thread Bryan Cutler
I looked at the updated SPIP and I think the reduced scope sounds better.
From the Spark Summit, it seemed like there was a lot of interest in
columnar processing, and this would be a good starting point to enable that.
It would be great to hear some other people's input too.

Bryan

On Tue, Apr 30, 2019 at 7:21 AM Bobby Evans  wrote:

> I wanted to give everyone a heads up that I have updated the SPIP at
> https://issues.apache.org/jira/browse/SPARK-27396 Please take a look and
> add any comments you might have to the JIRA.  I reduced the scope of the
> SPIP to just the non-controversial parts.  In the background, I will be
> trying to work with the Arrow community to get some form of guarantees
> about the stability of the standard.  That should hopefully unblock stable
> APIs so end users can write columnar UDFs in scala/java and ideally get
> efficient Arrow based batch data transfers to external tools as well.
>
> Thanks,
>
> Bobby
>
> On Tue, Apr 23, 2019 at 12:32 PM Matei Zaharia 
> wrote:
>
> > Just as a note here, if the goal is the format not change, why not make
> > that explicit in a versioning policy? You can always include a format
> > version number and say that future versions may increment the number, but
> > this specific version will always be readable in some specific way. You
> > could also put a timeline on how long old version numbers will be
> > recognized in the official libraries (e.g. 3 years).
> >
> > Matei
> >
> > > On Apr 22, 2019, at 6:36 AM, Bobby Evans  wrote:
> > >
> > > Yes, it is technically possible for the layout to change.  No, it is
> not
> > going to happen.  It is already baked into several different official
> > libraries which are widely used, not just for holding and processing the
> > data, but also for transfer of the data between the various
> > implementations.  There would have to be a really serious reason to force
> > an incompatible change at this point.  So in the worst case, we can
> version
> > the layout and bake that into the API that exposes the internal layout of
> > the data.  That way code that wants to program against a JAVA API can do
> so
> > using the API that Spark provides, those who want to interface with
> > something that expects the data in arrow format will already have to know
> > what version of the format it was programmed against and in the worst
> case
> > if the layout does change we can support the new layout if needed.
> > >
> > > On Sun, Apr 21, 2019 at 12:45 AM Bryan Cutler 
> wrote:
> > > The Arrow data format is not yet stable, meaning there are no
> guarantees
> > on backwards/forwards compatibility. Once version 1.0 is released, it
> will
> > have those guarantees but it's hard to say when that will be. The
> remaining
> > work to get there can be seen at
> >
> https://cwiki.apache.org/confluence/display/ARROW/Columnar+Format+1.0+Milestone
> .
> > So yes, it is a risk that exposing Spark data as Arrow could cause an
> issue
> > if handled by a different version that is not compatible. That being
> said,
> > changes to format are not taken lightly and are backwards compatible when
> > possible. I think it would be fair to mark the APIs exposing Arrow data
> as
> > experimental for the time being, and clearly state the version that must
> be
> > used to be compatible in the docs. Also, adding features like this and
> > SPARK-24579 will probably help adoption of Arrow and accelerate a 1.0
> > release. Adding the Arrow dev list to CC.
> > >
> > > Bryan
> > >
> > > On Sat, Apr 20, 2019 at 5:25 PM Matei Zaharia  >
> > wrote:
> > > Okay, that makes sense, but is the Arrow data format stable? If not, we
> > risk breakage when Arrow changes in the future and some libraries using
> > this feature are begin to use the new Arrow code.
> > >
> > > Matei
> > >
> > > > On Apr 20, 2019, at 1:39 PM, Bobby Evans  wrote:
> > > >
> > > > I want to be clear that this SPIP is not proposing exposing Arrow
> > APIs/Classes through any Spark APIs.  SPARK-24579 is doing that, and
> > because of the overlap between the two SPIPs I scaled this one back to
> > concentrate just on the columnar processing aspects. Sorry for the
> > confusion as I didn't update the JIRA description clearly enough when we
> > adjusted it during the discussion on the JIRA.  As part of the columnar
> > processing, we plan on providing arrow formatted data, but that will be
> > exposed through a Spark owned API.
> > > >
> > > > On Sat, Apr 20, 2

Re: Thoughts on dataframe cogroup?

2019-04-23 Thread Bryan Cutler
Apologies for not leaving feedback yet. I'm a little swamped this week with
the Spark Summit, but this is at the top of my list to get to for next week.

Bryan

On Thu, Apr 18, 2019 at 4:18 AM Chris Martin  wrote:

> Yes, totally agreed with Li here.
>
> For clarity, I'm happy to do the work to implement this, but it would be
> good to get feedback from the community in general and some of the Spark
> committers in particular.
>
> thanks,
>
> Chris
>
> On Wed, Apr 17, 2019 at 9:17 PM Li Jin  wrote:
>
>> I have left some comments. This looks a good proposal to me.
>>
>> As a heavy pyspark user, this is a pattern that we see over and over
>> again and I think could be pretty high value to other pyspark users as
>> well. The fact that Chris and I come to same ideas sort of verifies my
>> intuition. Also, this isn't really something new, RDD has cogroup function
>> from very early on.
>>
>> With that being said, I'd like to call out again for community's feedback
>> on the proposal.
>>
>> On Mon, Apr 15, 2019 at 4:57 PM Chris Martin 
>> wrote:
>>
>>> Ah sorry- I've updated the link which should give you access.  Can you
>>> try again now?
>>>
>>> thanks,
>>>
>>> Chris
>>>
>>>
>>>
>>> On Mon, Apr 15, 2019 at 9:49 PM Li Jin  wrote:
>>>
>>>> Hi Chris,
>>>>
>>>> Thanks! The permission to the google doc is maybe not set up properly.
>>>> I cannot view the doc by default.
>>>>
>>>> Li
>>>>
>>>> On Mon, Apr 15, 2019 at 3:58 PM Chris Martin 
>>>> wrote:
>>>>
>>>>> I've updated the jira so that the main body is now inside a google
>>>>> doc.  Anyone should be able to comment- if you want/need write access
>>>>> please drop me a mail and I can add you.
>>>>>
>>>>> Ryan- regarding your specific point regarding why I'm not proposing to
>>>>> add this to the Scala API, I think the main point is that Scala users can
>>>>> already use Cogroup for Datasets.  For Scala this is probably a better
>>>>> solution as (as far as I know) there is no Scala DataFrame library that
>>>>> could be used in place of Pandas for manipulating  local DataFrames. As a
>>>>> result you'd probably be left with dealing with Iterators of Row objects,
>>>>> which almost certainly isn't what you'd want. This is similar to the
>>>>> existing grouped map Pandas Udfs for which there is no equivalent Scala 
>>>>> Api.
>>>>>
>>>>> I do think there might be a place for allowing a (Scala) DataSet
>>>>> Cogroup to take some sort of grouping expression as the grouping key  
>>>>> (this
>>>>> would mean that you wouldn't have to marshal the key into a JVM object and
>>>>> could possible lend itself to some catalyst optimisations) but I don't
>>>>> think that this should be done as part of this SPIP.
>>>>>
>>>>> thanks,
>>>>>
>>>>> Chris
>>>>>
>>>>> On Mon, Apr 15, 2019 at 6:27 PM Ryan Blue  wrote:
>>>>>
>>>>>> I agree, it would be great to have a document to comment on.
>>>>>>
>>>>>> The main thing that stands out right now is that this is only for
>>>>>> PySpark and states that it will not be added to the Scala API. Why not 
>>>>>> make
>>>>>> this available since most of the work would be done?
>>>>>>
>>>>>> On Mon, Apr 15, 2019 at 7:50 AM Li Jin  wrote:
>>>>>>
>>>>>>> Thank you Chris, this looks great.
>>>>>>>
>>>>>>> Would you mind share a google doc version of the proposal? I believe
>>>>>>> that's the preferred way of discussing proposals (Other people please
>>>>>>> correct me if I am wrong).
>>>>>>>
>>>>>>> Li
>>>>>>>
>>>>>>> On Mon, Apr 15, 2019 at 8:20 AM  wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>  As promised I’ve raised SPARK-27463 for this.
>>>>>>>>
>>>>>>>> All feedback welcome!
>>>>>>>>
>>>>>>>> Chris
>>>>>>>>
>>>>>>>> On 9 Apr 2019, a

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-20 Thread Bryan Cutler
 have
> 1.0 release someday.
> > > 2. ML/DL systems that can benefits from columnar format are mostly in
> Python.
> > > 3. Simple operations, though benefits vectorization, might not be
> worth the data exchange overhead.
> > >
> > > So would an improved Pandas UDF API would be good enough? For example,
> SPARK-26412 (UDF that takes an iterator of of Arrow batches).
> > >
> > > Sorry that I should join the discussion earlier! Hope it is not too
> late:)
> > >
> > > On Fri, Apr 19, 2019 at 1:20 PM  wrote:
> > > +1 (non-binding) for better columnar data processing support.
> > >
> > >
> > >
> > > From: Jules Damji 
> > > Sent: Friday, April 19, 2019 12:21 PM
> > > To: Bryan Cutler 
> > > Cc: Dev 
> > > Subject: Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended
> Columnar Processing Support
> > >
> > >
> > >
> > > + (non-binding)
> > >
> > > Sent from my iPhone
> > >
> > > Pardon the dumb thumb typos :)
> > >
> > >
> > > On Apr 19, 2019, at 10:30 AM, Bryan Cutler  wrote:
> > >
> > > +1 (non-binding)
> > >
> > >
> > >
> > > On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:
> > >
> > > +1 (non-binding).  Looking forward to seeing better support for
> processing columnar data.
> > >
> > >
> > >
> > > Jason
> > >
> > >
> > >
> > > On Tue, Apr 16, 2019 at 10:38 AM Tom Graves
>  wrote:
> > >
> > > Hi everyone,
> > >
> > >
> > >
> > > I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
> extended Columnar Processing Support.  The proposal is to extend the
> support to allow for more columnar processing.
> > >
> > >
> > >
> > > You can find the full proposal in the jira at:
> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
> DISCUSS thread in the dev mailing list.
> > >
> > >
> > >
> > > Please vote as early as you can, I will leave the vote open until next
> Monday (the 22nd), 2pm CST to give people plenty of time.
> > >
> > >
> > >
> > > [ ] +1: Accept the proposal as an official SPIP
> > >
> > > [ ] +0
> > >
> > > [ ] -1: I don't think this is a good idea because ...
> > >
> > >
> > >
> > >
> > >
> > > Thanks!
> > >
> > > Tom Graves
> > >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Bryan Cutler
+1 (non-binding)

On Thu, Apr 18, 2019 at 11:41 AM Jason Lowe  wrote:

> +1 (non-binding).  Looking forward to seeing better support for processing
> columnar data.
>
> Jason
>
> On Tue, Apr 16, 2019 at 10:38 AM Tom Graves 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to call for a vote on SPARK-27396 - SPIP: Public APIs for
>> extended Columnar Processing Support.  The proposal is to extend the
>> support to allow for more columnar processing.
>>
>> You can find the full proposal in the jira at:
>> https://issues.apache.org/jira/browse/SPARK-27396. There was also a
>> DISCUSS thread in the dev mailing list.
>>
>> Please vote as early as you can, I will leave the vote open until next
>> Monday (the 22nd), 2pm CST to give people plenty of time.
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>>
>> Thanks!
>> Tom Graves
>>
>


Re: [SPARK-25079] moving from python 3.4 to python 3.6.8, impacts all active branches

2019-04-18 Thread Bryan Cutler
Great work, thanks Shane!

On Thu, Apr 18, 2019 at 2:46 PM shane knapp  wrote:

> alrighty folks, the future is here and we'll be moving to python 3.6
> monday!
>
> all three PRs are green!
> master PR:  https://github.com/apache/spark/pull/24266
> 2.4 PR:  https://github.com/apache/spark/pull/24379
> 2.3 PR:  https://github.com/apache/spark/pull/24380
>
> more detailed email coming out this afternoon about the upgrade.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Thoughts on dataframe cogroup?

2019-04-08 Thread Bryan Cutler
Chris, an SPIP sounds good to me. I agree with Li that it wouldn't be too
difficult to extend the current functionality to transfer multiple
DataFrames. For the SPIP, I would keep it more high-level; I don't think
it's necessary to include details of the Python worker, as we can hash that
out after the SPIP is approved.
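
As a purely illustrative sketch of the per-group pandas function this is
aiming at (the as-of join use case mentioned in the thread below); the
Spark-side call in the comment is hypothetical and only meant to show the
rough shape, not the SPIP's final design:

    import pandas as pd

    def asof_join(left, right):
        # One pandas DataFrame per side for a single group key; merge_asof
        # requires both frames to be sorted by the join key.
        return pd.merge_asof(left.sort_values("time"),
                             right.sort_values("time"), on="time")

    # Local demonstration of the per-group function:
    trades = pd.DataFrame({"time": [2, 4], "price": [10.1, 10.6]})
    quotes = pd.DataFrame({"time": [1, 3, 5], "bid": [10.0, 10.5, 11.0]})
    print(asof_join(trades, quotes))

    # Hypothetical Spark-side invocation (names illustrative only):
    # df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join, schema=...)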

Bryan

On Mon, Apr 8, 2019 at 10:43 AM Li Jin  wrote:

> Thanks Chris, look forward to it.
>
> I think sending multiple dataframes to the python worker requires some
> changes but shouldn't be too difficult. We can probably sth like:
>
>
> [numberOfDataFrames][FirstDataFrameInArrowFormat][SecondDataFrameInArrowFormat]
>
> In:
> https://github.com/apache/spark/blob/86d469aeaa492c0642db09b27bb0879ead5d7166/sql/core/src/main/scala/org/apache/spark/sql/execution/python/ArrowPythonRunner.scala#L70
>
> And have ArrowPythonRunner take multiple input iterator/schema.
>
> Li
>
>
> On Mon, Apr 8, 2019 at 5:55 AM  wrote:
>
>> Hi,
>>
>> Just to say, I really do think this is useful and am currently working on
>> a SPIP to formally propose this. One concern I do have, however, is that
>> the current arrow serialization code is tied to passing through a single
>> dataframe as the udf parameter and so any modification to allow multiple
>> dataframes may not be straightforward.  If anyone has any ideas as to how
>> this might be achieved in an elegant manner I’d be happy to hear them!
>>
>> Thanks,
>>
>> Chris
>>
>> On 26 Feb 2019, at 14:55, Li Jin  wrote:
>>
>> Thank you both for the reply. Chris and I have very similar use cases for
>> cogroup.
>>
>> One of the goals for groupby apply + pandas UDF was to avoid things like
>> collect list and reshaping data between Spark and Pandas. Cogroup feels
>> very similar and can be an extension to the groupby apply + pandas UDF
>> functionality.
>>
>> I wonder if any PMC/committers have any thoughts/opinions on this?
>>
>> On Tue, Feb 26, 2019 at 2:17 AM  wrote:
>>
>>> Just to add to this I’ve also implemented my own cogroup previously and
>>> would welcome a cogroup for datafame.
>>>
>>> My specific use case was that I had a large amount of time series data.
>>> Spark has very limited support for time series (specifically as-of joins),
>>> but pandas has good support.
>>>
>>> My solution was to take my two dataframes and perform a group by and
>>> collect list on each. The resulting arrays could be passed into a udf where
>>> they could be marshaled into a couple of pandas dataframes and processed
>>> using pandas excellent time series functionality.
>>>
>>> If cogroup was available natively on dataframes this would have been a
>>> bit nicer. The ideal would have been some pandas udf version of cogroup
>>> that gave me a pandas dataframe for each spark dataframe in the cogroup!
>>>
>>> Chris
>>>
>>> On 26 Feb 2019, at 00:38, Jonathan Winandy 
>>> wrote:
>>>
>>> For info, in our team have defined our own cogroup on dataframe in the
>>> past on different projects using different methods (rdd[row] based or union
>>> all collect list based).
>>>
>>> I might be biased, but find the approach very useful in project to
>>> simplify and speed up transformations, and remove a lot of intermediate
>>> stages (distinct + join => just cogroup).
>>>
>>> Plus spark 2.4 introduced a lot of new operator for nested data. That's
>>> a win!
>>>
>>>
>>> On Thu, 21 Feb 2019, 17:38 Li Jin,  wrote:
>>>
 I am wondering do other people have opinion/use case on cogroup?

 On Wed, Feb 20, 2019 at 5:03 PM Li Jin  wrote:

> Alessandro,
>
> Thanks for the reply. I assume by "equi-join", you mean "equality
> full outer join" .
>
> Two issues I see with equity outer join is:
> (1) equity outer join will give n * m rows for each key (n and m being
> the corresponding number of rows in df1 and df2 for each key)
> (2) User needs to do some extra processing to transform n * m back to
> the desired shape (two sub dataframes with n and m rows)
>
> I think full outer join is an inefficient way to implement cogroup. If
> the end goal is to have two separate dataframes for each key, why joining
> them first and then unjoin them?
>
>
>
> On Wed, Feb 20, 2019 at 5:52 AM Alessandro Solimando <
> alessandro.solima...@gmail.com> wrote:
>
>> Hello,
>> I fail to see how an equi-join on the key columns is different than
>> the cogroup you propose.
>>
>> I think the accepted answer can shed some light:
>>
>> https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark
>>
>> Now you apply a udf on each iterable, one per key value (obtained
>> with cogroup).
>>
>> You can achieve the same by:
>> 1) join df1 and df2 on the key you want,
>> 2) apply "groupby" on such key
>> 3) finally apply a udaf (you can have a look here if you are not
>> familiar with them
>> 
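
To make the cogroup idea above concrete, here is a rough sketch of how a
cogrouped pandas UDF can be used for the as-of-join use case Chris describes.
The DataFrames (df1, df2) and column names are hypothetical, and the
groupby/cogroup/applyInPandas spelling is the form this eventually took in
Spark 3.0, not an API that existed at the time of this thread:

```
import pandas as pd

# Each key's rows arrive as two separate pandas DataFrames, one per side.
def asof_join(left_pdf, right_pdf):
    return pd.merge_asof(left_pdf, right_pdf, on="time", by="id")

result = (
    df1.groupby("id")
    .cogroup(df2.groupby("id"))
    .applyInPandas(asof_join, schema="time long, id int, v1 double, v2 double")
)
```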

Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-04-02 Thread Bryan Cutler
Nice work Shane! That all sounds good to me.  We might want to use pyarrow
0.12.1 though, since there is a major bug that was fixed, but we can discuss in
the PR.  I will put up the code changes in the next few days.
Felix, I think you're right about Python 3.5; they just list one upcoming
release and that's not necessarily the last. Comparing the histories, it
might still be soon though. I think using 3.6 will be fine; as a point of
reference, pyarrow CI uses 2.7 and 3.6.

On Mon, Apr 1, 2019 at 3:09 PM shane knapp  wrote:

> well now!  color me completely surprised...  i decided to whip up a fresh
> python3.6.8 conda environment this morning to "see if things just worked".
>
> well, apparently they do!  :)
>
> regardless, this is pretty awesome news as i will be able to easily update
> the 'py3k' python3.4 environment to a fresh, less bloated, but still
> package-complete python3.6.8 environment (including pyarrow 0.12.0, pandas
> 0.24.2, scipy 1.2.1).
>
> i tested this pretty extensively today on both the ubuntu and centos
> workers, and i think i'm ready to pull the trigger for a build-system-wide
> upgrade...   however, i'll be out wednesday through friday this week and
> don't want to make a massive change before disappearing for a few days.
>
> so:  how does early next week sound for the python upgrade?  :)
>
> shane
>
> On Mon, Apr 1, 2019 at 8:58 AM shane knapp  wrote:
>
>> i'd much prefer that we minimize the number of python versions that we
>> test against...  would 2.7 and 3.6 be sufficient?
>>
>> On Fri, Mar 29, 2019 at 10:23 PM Felix Cheung 
>> wrote:
>>
>>> I don’t take it as Sept 2019 is end of life for python 3.5 tho. It’s
>>> just saying the next release.
>>>
>>> In any case I think in the next release it will be great to get more
>>> Python 3.x release test coverage.
>>>
>>>
>>>
>>> --
>>> *From:* shane knapp 
>>> *Sent:* Friday, March 29, 2019 4:46 PM
>>> *To:* Bryan Cutler
>>> *Cc:* Felix Cheung; Hyukjin Kwon; dev
>>> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x
>>> [SPARK-27276]
>>>
>>> i'm not opposed to 3.6 at all.
>>>
>>> On Fri, Mar 29, 2019 at 4:16 PM Bryan Cutler  wrote:
>>>
>>>> PyArrow dropping Python 3.4 was mainly due to support going away at
>>>> Conda-Forge and other dependencies also dropping it.  I think we better
>>>> upgrade Jenkins Python while we are at it.  Are you all against jumping to
>>>> Python 3.6 so we are not in the same boat in September?
>>>>
>>>> On Thu, Mar 28, 2019 at 7:58 PM Felix Cheung 
>>>> wrote:
>>>>
>>>>> 3.4 is end of life but 3.5 is not. From your link
>>>>>
>>>>> we expect to release Python 3.5.8 around September 2019.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *From:* shane knapp 
>>>>> *Sent:* Thursday, March 28, 2019 7:54 PM
>>>>> *To:* Hyukjin Kwon
>>>>> *Cc:* Bryan Cutler; dev; Felix Cheung
>>>>> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x
>>>>> [SPARK-27276]
>>>>>
>>>>> looks like the same for 3.5...
>>>>> https://www.python.org/dev/peps/pep-0478/
>>>>>
>>>>> let's pick a python version and start testing.
>>>>>
>>>>> On Thu, Mar 28, 2019 at 7:52 PM shane knapp 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>> If there was, it looks inevitable to upgrade Jenkins's Python from
>>>>>>> 3.4 to 3.5.
>>>>>>>
>>>>>> this is inevitable.  3.4's final release was 10 days ago (
>>>>>> https://www.python.org/dev/peps/pep-0429/) so we're basically EOL.
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-29 Thread Bryan Cutler
PyArrow dropping Python 3.4 was mainly due to support going away at
Conda-Forge and other dependencies also dropping it.  I think we better
upgrade Jenkins Python while we are at it.  Are you all against jumping to
Python 3.6 so we are not in the same boat in September?

On Thu, Mar 28, 2019 at 7:58 PM Felix Cheung 
wrote:

> 3.4 is end of life but 3.5 is not. From your link
>
> we expect to release Python 3.5.8 around September 2019.
>
>
>
> --
> *From:* shane knapp 
> *Sent:* Thursday, March 28, 2019 7:54 PM
> *To:* Hyukjin Kwon
> *Cc:* Bryan Cutler; dev; Felix Cheung
> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]
>
> looks like the same for 3.5...   https://www.python.org/dev/peps/pep-0478/
>
> let's pick a python version and start testing.
>
> On Thu, Mar 28, 2019 at 7:52 PM shane knapp  wrote:
>
>>
>>> If there was, it looks inevitable to upgrade Jenkins's Python from 3.4
>>> to 3.5.
>>>
>>> this is inevitable.  3.4's final release was 10 days ago (
>> https://www.python.org/dev/peps/pep-0429/) so we're basically EOL.
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]

2019-03-26 Thread Bryan Cutler
Thanks Hyukjin.  The plan is to get this done for 3.0 only.  Here is a link
to the JIRA https://issues.apache.org/jira/browse/SPARK-27276.  Shane is
also correct in that newer versions of pyarrow have stopped support for
Python 3.4, so we should probably have Jenkins test against 2.7 and 3.5.

On Mon, Mar 25, 2019 at 9:44 PM Reynold Xin  wrote:

> +1 on doing this in 3.0.
>
>
> On Mon, Mar 25, 2019 at 9:31 PM, Felix Cheung 
> wrote:
>
>> I’m +1 if 3.0
>>
>>
>> --
>> *From:* Sean Owen 
>> *Sent:* Monday, March 25, 2019 6:48 PM
>> *To:* Hyukjin Kwon
>> *Cc:* dev; Bryan Cutler; Takuya UESHIN; shane knapp
>> *Subject:* Re: Upgrading minimal PyArrow version to 0.12.x [SPARK-27276]
>>
>> I don't know a lot about Arrow here, but seems reasonable. Is this for
>> Spark 3.0 or for 2.x? Certainly, requiring the latest for Spark 3
>> seems right.
>>
>> On Mon, Mar 25, 2019 at 8:17 PM Hyukjin Kwon 
>> wrote:
>> >
>> > Hi all,
>> >
>> > We really need to upgrade the minimal version soon. It's actually
>> slowing down PySpark dev, for instance, by the overhead of sometimes having
>> to test against the full matrix of Arrow and Pandas versions. Also, it
>> currently requires adding some weird hacks or ugly code. Some bugs exist
>> in lower versions, and some features are not supported in low PyArrow, for
>> instance.
>> >
>> > Per (Apache Arrow + Spark committer FWIW) Bryan's recommendation and
>> my opinion as well, we should increase the minimal version to
>> 0.12.x. (Also, note that Pandas <> Arrow is an experimental feature).
>> >
>> > So, Bryan and I will proceed with this in a few days if there are no
>> objections, assuming we're fine with increasing it to 0.12.x. Please let me
>> know if there are any concerns.
>> >
>> > For clarification, this requires some jobs in Jenkins to upgrade the
>> minimal version of PyArrow (I cc'ed Shane as well).
>> >
>> > PS: I roughly heard that Shane's busy with some work stuff, but it's
>> kind of important from my perspective.
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
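
As background for why the minimum matters: PySpark's Arrow-based paths only
help if a new enough pyarrow is importable on the driver and workers. Here is
a small sketch of the kind of guard involved (the config name is the Spark 2.x
one, and `spark` and `df` are assumed to already exist):

```
from distutils.version import LooseVersion

import pyarrow

# Roughly the same spirit as PySpark's own minimum-version check.
if LooseVersion(pyarrow.__version__) < LooseVersion("0.12.1"):
    raise ImportError("pyarrow >= 0.12.1 is required for Arrow-based conversion")

# Spark 2.x name; turns on Arrow for DataFrame <-> pandas conversion.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
pdf = df.toPandas()  # transfers data as Arrow record batches when enabled
```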


Re: [pyspark] dataframe map_partition

2019-03-08 Thread Bryan Cutler
Hi Peng,

I just added support for scalar Pandas UDF to return a StructType as a
Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836. Is
that the functionality you are looking for?

Bryan

On Thu, Mar 7, 2019 at 1:13 PM peng yu  wrote:

> right now, i'm using the columns-at-a-time mapping
> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>
>
>
> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen  wrote:
>
>> Maybe, it depends on what you're doing. It sounds like you are trying
>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>> you're doing vectorized? may not help much.
>> Just make the pandas Series into a DataFrame if you want? and a single
>> col back to Series?
>>
>> On Thu, Mar 7, 2019 at 2:45 PM peng yu  wrote:
>> >
>> > pandas/arrow is for the memory efficiency, and mapPartitions is only
>> available to rdds, for sure i can do everything in rdd.
>> >
>> > But i thought that's the whole point of having pandas_udf, so my
>> program runs faster and consumes less memory?
>> >
>> > On Thu, Mar 7, 2019 at 3:40 PM Sean Owen  wrote:
>> >>
>> >> Are you just applying a function to every row in the DataFrame? you
>> >> don't need pandas at all. Just get the RDD of Row from it and map a
>> >> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>> >> that operates on all columns and returns a new value. mapPartitions is
>> >> also available if you want to transform an iterator of Row to another
>> >> iterator of Row.
>> >>
>> >> On Thu, Mar 7, 2019 at 2:33 PM peng yu  wrote:
>> >> >
>> >> > it is very similar to SCALAR, but for SCALAR the output can't be
>> struct/row and the input has to be pd.Series, which doesn't support a row.
>> >> >
>> >> > I'm doing tensorflow batch inference in spark,
>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>> >> >
>> >> > For which i have to do the groupBy in order to use the apply function;
>> i'm wondering why not just enable apply on a df?
>> >> >
>> >> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen  wrote:
>> >> >>
>> >> >> Are you looking for SCALAR? that lets you map one row to one row,
>> but
>> >> >> do it more efficiently in batch. What are you trying to do?
>> >> >>
>> >> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu  wrote:
>> >> >> >
>> >> >> > I'm looking for a mapPartition(pandas_udf) for  a
>> pyspark.Dataframe.
>> >> >> >
>> >> >> > ```
>> >> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>> >> >> > def do_nothing(pandas_df):
>> >> >> > return pandas_df
>> >> >> >
>> >> >> >
>> >> >> > new_df = df.mapPartition(do_nothing)
>> >> >> > ```
>> >> >> > pandas_udf only supports SCALAR or GROUPED_MAP.  Why not support
>> just MAP?
>> >> >> >
>> >> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen 
>> wrote:
>> >> >> >>
>> >> >> >> Are you looking for @pandas_udf in Python? Or just mapPartition?
>> Those exist already
>> >> >> >>
>> >> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu  wrote:
>> >> >> >>>
>> >> >> >>> There is a nice map_partition function in R, `dapply`, so that the
>> user can pass a row to a udf.
>> >> >> >>>
>> >> >> >>> I'm wondering why we don't have that in python?
>> >> >> >>>
>> >> >> >>> I'm trying to have a map_partition function with pandas_udf
>> supported
>> >> >> >>>
>> >> >> >>> thanks!
>>
>
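
A short sketch of the two shapes discussed in this thread, assuming a
DataFrame `df` with columns `id` (long), `v` (double) and `name` (string).
The GROUPED_MAP form is the Spark 2.3/2.4 workaround that needs a groupBy;
the struct-returning SCALAR form is what SPARK-23836 added (available from
Spark 3.0), where the returned pandas DataFrame's columns become the struct
fields:

```
from pyspark.sql.functions import pandas_udf, PandasUDFType

# GROUPED_MAP: pandas DataFrame in, pandas DataFrame out, but only via groupBy.
@pandas_udf("id long, v double, name string", PandasUDFType.GROUPED_MAP)
def do_nothing(pdf):
    return pdf

grouped = df.groupBy("id").apply(do_nothing)

# SCALAR returning a struct: input is a pandas Series, output a pandas DataFrame.
@pandas_udf("first string, last string", PandasUDFType.SCALAR)
def split_name(s):
    return s.str.split(" ", expand=True)

with_struct = df.select(split_name(df["name"]).alias("name_parts"))
```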


Re: Welcome Jose Torres as a Spark committer

2019-01-30 Thread Bryan Cutler
Congrats Jose!

On Tue, Jan 29, 2019, 10:48 AM Shixiong Zhu  wrote:

> Hi all,
>
>
> The Apache Spark PMC recently added Jose Torres as a committer on the
> project. Jose has been a major contributor to Structured Streaming. Please
> join me in welcoming him!
>
> Best Regards,
>
> Shixiong Zhu
>
>


Re: Arrow optimization in conversion from R DataFrame to Spark DataFrame

2018-11-09 Thread Bryan Cutler
Great work Hyukjin!  I'm not too familiar with R, but I'll take a look at
the PR.

Bryan

On Fri, Nov 9, 2018 at 9:19 AM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> Thanks Hyukjin! Very cool results
>
> Shivaram
> On Fri, Nov 9, 2018 at 10:58 AM Felix Cheung 
> wrote:
> >
> > Very cool!
> >
> >
> > 
> > From: Hyukjin Kwon 
> > Sent: Thursday, November 8, 2018 10:29 AM
> > To: dev
> > Subject: Arrow optimization in conversion from R DataFrame to Spark
> DataFrame
> >
> > Hi all,
> >
> > I am trying to introduce R Arrow optimization by reusing PySpark Arrow
> optimization.
> >
> > It boosts R DataFrame > Spark DataFrame up to roughly 900% ~ 1200%
> faster.
> >
> > Looks working fine so far; however, I would appreciate if you guys have
> some time to take a look (https://github.com/apache/spark/pull/22954) so
> that we can directly go ahead as soon as R API of Arrow is released.
> >
> > More importantly, I want some more people who're more into Arrow R API
> side but also interested in Spark side. I have already cc'ed some people I
> know but please come, review and discuss for both Spark side and Arrow side.
> >
> > Thanks.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: welcome a new batch of committers

2018-10-04 Thread Bryan Cutler
Congratulations everyone! Very well deserved!!

On Wed, Oct 3, 2018, 1:59 AM Reynold Xin  wrote:

> Hi all,
>
> The Apache Spark PMC has recently voted to add several new committers to
> the project, for their contributions:
>
> - Shane Knapp (contributor to infra)
> - Dongjoon Hyun (contributor to ORC support and other parts of Spark)
> - Kazuaki Ishizaki (contributor to Spark SQL)
> - Xingbo Jiang (contributor to Spark Core and SQL)
> - Yinan Li (contributor to Spark on Kubernetes)
> - Takeshi Yamamuro (contributor to Spark SQL)
>
> Please join me in welcoming them!
>
>


Re: python tests: any reason for a huge tests.py?

2018-09-13 Thread Bryan Cutler
Hi Imran,

I agree it would be good to split up the tests, but there might be a couple of
things to discuss first. Right now we have a single "tests.py" for each
subpackage. I think it makes sense to roughly have a test file for most
modules, e.g. "test_rdd.py", but it might not always be clear cut and there
could be other ways to split them up.  Also, should we put the test files
in the same directory as the source or in a subdirectory named "tests"? My
preference is for a subdirectory.  As for putting new tests into their own
files right away, it seems better to me to keep them with related tests for
now and separate that as its own task to avoid fragmenting the test suites. If
it's done incrementally, I don't think merge conflicts will cause a
problem. Let me summarize this in SPARK-25344.

Thanks,
Bryan

On Wed, Sep 12, 2018 at 10:48 AM Imran Rashid 
wrote:

> So I've had some offline discussion around this, so I'd like to clarify.
> SPARK-25344 may be some non-trivial work to do, as it's significant
> refactoring.
>
> But can we agree on an *immediate* first step: all new python tests should
> go into their own files?  is there some reason to not do that right away?
>
> I understand that in some cases, you'll want to add a test case that really
> is related to an existing test already in those giant files, and it makes
> sense for you to keep them close.   It's fine to decide on a case-by-case
> basis whether we should do the relevant refactoring for that relevant bit
> at the same or just put it in the same file.  But we should still have this
> *goal* in mind, so you should do it in the cases where it's really an
> independent case.
>
> That avoids making the problem worse until we get to SPARK-25344, and
> furthermore it will allow work on SPARK-25344 to eventually proceed without
> never ending merge conflicts with other changes that are also adding new
> tests.
>
> On Wed, Sep 5, 2018 at 1:27 PM Imran Rashid  wrote:
>
>> I filed https://issues.apache.org/jira/browse/SPARK-25344
>>
>> On Fri, Aug 24, 2018 at 11:57 AM Reynold Xin  wrote:
>>
>>> We should break it.
>>>
>>> On Fri, Aug 24, 2018 at 9:53 AM Imran Rashid
>>>  wrote:
>>>
 Hi,

 another question from looking more at python recently.  Is there any
 reason we've got a ton of tests in one humongous tests.py file, rather than
 breaking it out into smaller files?

 Having one huge file doesn't seem great for code organization, and it
 also makes the test parallelization in run-tests.py not work as well.  On
 my laptop, tests.py takes 150s, and the next longest test file takes only
 20s.

 can we at least try to put new tests into smaller files?

 thanks,
 Imran

>>>


Re: [discuss][minor] impending python 3.x jenkins upgrade... 3.5.x? 3.6.x?

2018-08-20 Thread Bryan Cutler
Thanks for looking into this Shane!  If we are choosing a single python
3.x, I think 3.6 would be good. It might still be nice to test against
other versions too, so we can catch any issues. Is it possible to have more
exhaustive testing as part of a nightly or scheduled build? As a point of
reference for Python 3.6, Arrow is using this version for CI.

Bryan

On Sun, Aug 19, 2018 at 9:49 PM Hyukjin Kwon  wrote:

> Actually Python 3.7 is released (
> https://www.python.org/downloads/release/python-370/) too and I fixed the
> compatibility issues accordingly -
> https://github.com/apache/spark/pull/21714
> There has been an issue for 3.6 (comparing to lower versions of Python
> including 3.5) - https://github.com/apache/spark/pull/16429
>
> I am not yet sure what the best matrix for it is, actually. In the case of R, we
> test the lowest version in Jenkins and the highest version via AppVeyor, FWIW.
> I don't have a strong opinion on this since we have been having
> compatibility issues for each Python version.
>
>
> 2018년 8월 14일 (화) 오전 4:15, shane knapp 님이 작성:
>
>> hey everyone!
>>
>> i was checking out the EOL/release cycle for python 3.5 and it looks like
>> we'll have 3.5.6 released in early 2019.
>>
>> this got me to thinking:  instead of 3.5, what about 3.6?
>>
>> i looked around, and according to the 'docs' and 'collective wisdom of
>> the internets', 3.5 and 3.6 should be fully backwards-compatible w/3.4.
>>
>> of course, this needs to be taken w/a grain of salt, as we're mostly
>> focused on actual python package requirements, rather than worrying about
>> core python functionality.
>>
>> thoughts?  comments?
>>
>> thanks in advance,
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Bryan Cutler
I agree that we should hold off on the Arrow upgrade if it requires major
changes to our testing. I did have another thought that maybe we could just
add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
current testing the same? I'm not sure how doable that is right now and
don't want to make a ton of extra work, so no objections from me to hold
off on things for now.

On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:

> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:
>
>> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
>> it to Spark 3.0, so that we have more time to test. Any objections?
>>
>
> none here.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Bryan Cutler
Hi All,

I'd like to request a few days extension to the code freeze to complete the
upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several
key improvements and bug fixes.  The RC vote just passed this morning and
code changes are complete in https://github.com/apache/spark/pull/21939. We
just need some time for the release artifacts to be available. Thoughts?

Thanks,
Bryan

On Wed, Aug 1, 2018, 5:34 PM shane knapp  wrote:

> ++ssuchter (who kindly set up the initial k8s builds while i hammered on
> the backend)
>
> while i'm pretty confident (read: 99%) that the pull request builds will
> work on the new ubuntu workers:
>
> 1) i'd like to do more stress testing of other spark builds (in progress)
> 2) i'd like to reimage more centos workers before moving the PRB due to
> potential executor starvation, and my lead sysadmin is out until next monday
> 3) we will need to get rid of the ubuntu-specific k8s builds and merge
> that functionality in to the existing PRB job.  after that:  testing and
> babysitting
>
> regarding (1):  if these damn builds didn't take 4+ hours, it would be
> going a lot quicker.  ;)
> regarding (2):  adding two more ubuntu workers would make me comfortable
> WRT number of available executors, and i will guarantee that can happen by
> EOD on the 7th.
> regarding (3):  this should take about a day, and realistically the
> earliest we can get this started is the 8th.  i haven't even had a chance
> to start looking at this stuff yet, either.
>
> if we push release by a week, i think i can get things sorted w/o
> impacting the release schedule.  there will still be a bunch of stuff to
> clean up from the old centos builds (specifically docs, packaging and
> release), but i'll leave the existing and working infrastructure in place
> for now.
>
> shane
>
> On Wed, Aug 1, 2018 at 4:39 PM, Erik Erlandson 
> wrote:
>
>> The PR for SparkR support on the kube back-end is completed, but waiting
>> for Shane to make some tweaks to the CI machinery for full testing support.
>> If the code freeze is being delayed, this PR could be merged as well.
>>
>> On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:
>>
>>> FYI 6 mo is coming up soon since the last release. We will cut the
>>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>>
>>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Bryan Cutler
+1

On Mon, Jun 4, 2018 at 10:18 AM, Joseph Bradley 
wrote:

> +1
>
> On Mon, Jun 4, 2018 at 10:16 AM, Mark Hamstra 
> wrote:
>
>> +1
>>
>> On Fri, Jun 1, 2018 at 3:29 PM Marcelo Vanzin 
>> wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.3.1.
>>>
>>> Given that I expect at least a few people to be busy with Spark Summit
>>> next
>>> week, I'm taking the liberty of setting an extended voting period. The
>>> vote
>>> will be open until Friday, June 8th, at 19:00 UTC (that's 12:00 PDT).
>>>
>>> It passes with a majority of +1 votes, which must include at least 3 +1
>>> votes
>>> from the PMC.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.3.1
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.3.1-rc4 (commit 30aaa5a3):
>>> https://github.com/apache/spark/tree/v2.3.1-rc4
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1272/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.1-rc4-docs/
>>>
>>> The list of bug fixes going into 2.3.1 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12342432
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 2.3.1?
>>> ===
>>>
>>> The current list of open tickets targeted at 2.3.1 can be found at:
>>> https://s.apache.org/Q3Uo
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>> --
>>> Marcelo
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
> [image: http://databricks.com] 
>


Re: Feedback on first commit + jira issue I opened

2018-05-31 Thread Bryan Cutler
Hi Andrew,

Please just go ahead and make the pull request.  It's easier to review and
give feedback, thanks!

Bryan

On Thu, May 31, 2018 at 9:44 AM, Long, Andrew  wrote:

> Hello Friends,
>
>
>
> I’m a new committer and I’ve submitted my first patch and I had some
> questions about documentation standards.  In my patch (JIRA below) I’ve
> added a config parameter to adjust the number of records shown when a user
> calls .show() on a dataframe.  I was hoping someone could double check my
> small diff to make sure I wasn’t making any rookie mistakes before I submit
> a pull request.
>
>
>
> https://issues.apache.org/jira/browse/SPARK-24442
>
>
>
> Cheers Andrew
>


Re: Integrating ML/DL frameworks with Spark

2018-05-15 Thread Bryan Cutler
Thanks for starting this discussion, I'd also like to see some improvements
in this area and glad to hear that the Pandas UDFs / Arrow functionality
might be useful.  I'm wondering if from your initial investigations you
found anything lacking from the Arrow format or possible improvements that
would simplify the data representation?  Also, while data could be handed
off in a UDF, would it make sense to also discuss a more formal way to
externalize the data in a way that would also work for the Scala API?

Thanks,
Bryan

On Wed, May 9, 2018 at 4:31 PM, Xiangrui Meng  wrote:

> Shivaram: Yes, we can call it "gang scheduling" or "barrier
> synchronization". Spark doesn't support it now. The proposal is to have
> proper support in Spark's job scheduler, so we can integrate well with
> MPI-like frameworks.
>
>
> On Tue, May 8, 2018 at 11:17 AM Nan Zhu  wrote:
>
>> .how I skipped the last part
>>
>> On Tue, May 8, 2018 at 11:16 AM, Reynold Xin  wrote:
>>
>>> Yes, Nan, totally agree. To be on the same page, that's exactly what I
>>> wrote wasn't it?
>>>
>>> On Tue, May 8, 2018 at 11:14 AM Nan Zhu  wrote:
>>>
 besides that, one of the things which is needed by multiple frameworks
 is to schedule tasks in a single wave

 i.e.

 if some frameworks like xgboost/mxnet requires 50 parallel workers,
 Spark is desired to provide a capability to ensure that either we run 50
 tasks at once, or we should quit the complete application/job after some
 timeout period

 Best,

 Nan

 On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
 wrote:

> I think that's what Xiangrui was referring to. Instead of retrying a
> single task, retry the entire stage, and the entire stage of tasks need to
> be scheduled all at once.
>
>
> On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>>
>>>
- Fault tolerance and execution model: Spark assumes
fine-grained task recovery, i.e. if something fails, only that task 
 is
rerun. This doesn’t match the execution model of distributed ML/DL
frameworks that are typically MPI-based, and rerunning a single 
 task would
lead to the entire system hanging. A whole stage needs to be re-run.

 This is not only useful for integrating with 3rd-party frameworks,
>>> but also useful for scaling MLlib algorithms. One of my earliest 
>>> attempts
>>> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485
>>> ). But we ended
>>> up with some compromised solutions. With the new execution model, we can
>>> set up a hybrid cluster and do all-reduce properly.
>>>
>>>
>> Is there a particular new execution model you are referring to or do
>> we plan to investigate a new execution model ?  For the MPI-like model, 
>> we
>> also need gang scheduling (i.e. schedule all tasks at once or none of 
>> them)
>> and I dont think we have support for that in the scheduler right now.
>>
>>>
 --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc. [image: http://databricks.com]
>>> 
>>>
>>
>>

>> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. [image: http://databricks.com] 
>
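
For readers following this thread later: the "gang scheduling" / "barrier
synchronization" capability discussed above shipped as barrier execution mode
in Spark 2.4. A minimal PySpark sketch, where the training body is just a
placeholder and `rdd` is assumed to already exist:

```
from pyspark import BarrierTaskContext

def train_partition(iterator):
    ctx = BarrierTaskContext.get()
    rows = list(iterator)     # materialize this partition's records
    ctx.barrier()             # all tasks in the stage start together, or none do
    # ... hand `rows` off to an MPI-style / allreduce worker here ...
    return iter([len(rows)])  # placeholder result per partition

# barrier() forces the whole stage to be scheduled as one wave, and a failure
# re-runs the whole stage rather than a single task.
result = rdd.barrier().mapPartitions(train_partition).collect()
```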


Re: 回复: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Bryan Cutler
Congratulations Zhenhua!

On Mon, Apr 2, 2018 at 12:01 PM, ron8hu  wrote:

> Congratulations, Zhenhua!  Well deserved!!
>
> Ron
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Silencing messages from Ivy when calling spark-submit

2018-03-06 Thread Bryan Cutler
Cool, hopefully it will work.  I don't know what setting that would be
though, but it seems like it might be somewhere under here
http://ant.apache.org/ivy/history/latest-milestone/settings/outputters.html.
It's pretty difficult to sort through the docs, and I often found myself
looking at the source to understand some settings.  If you happen to figure
out the answer, please report back here.  I'm sure others would find it
useful too.

Bryan

On Mon, Mar 5, 2018 at 3:50 PM, Nicholas Chammas <nicholas.cham...@gmail.com
> wrote:

> Oh, I didn't know about that. I think that will do the trick.
>
> Would you happen to know what setting I need? I'm looking here
> <http://ant.apache.org/ivy/history/latest-milestone/settings.html>, but
> it's a bit overwhelming. I'm basically looking for a way to set the overall
> Ivy log level to WARN or higher.
>
> Nick
>
> On Mon, Mar 5, 2018 at 2:11 PM Bryan Cutler <cutl...@gmail.com> wrote:
>
>> Hi Nick,
>>
>> Not sure about changing the default to warnings only because I think some
>> might find the resolution output useful, but you can specify your own ivy
>> settings file with "spark.jars.ivySettings" to point to your
>> ivysettings.xml file.  Would that work for you to configure it there?
>>
>> Bryan
>>
>> On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I couldn’t get an answer anywhere else, so I thought I’d ask here.
>>>
>>> Is there a way to silence the messages that come from Ivy when you call
>>> spark-submit with --packages? (For the record, I asked this question on
>>> Stack Overflow <https://stackoverflow.com/q/49000342/877069>.)
>>>
>>> Would it be a good idea to configure Ivy by default to only output
>>> warnings or errors?
>>>
>>> Nick
>>> ​
>>>
>>
>>


Re: Silencing messages from Ivy when calling spark-submit

2018-03-05 Thread Bryan Cutler
Hi Nick,

Not sure about changing the default to warnings only because I think some
might find the resolution output useful, but you can specify your own ivy
settings file with "spark.jars.ivySettings" to point to your
ivysettings.xml file.  Would that work for you to configure it there?

Bryan

On Mon, Mar 5, 2018 at 8:20 AM, Nicholas Chammas  wrote:

> I couldn’t get an answer anywhere else, so I thought I’d ask here.
>
> Is there a way to silence the messages that come from Ivy when you call
> spark-submit with --packages? (For the record, I asked this question on
> Stack Overflow .)
>
> Would it be a good idea to configure Ivy by default to only output
> warnings or errors?
>
> Nick
> ​
>


Re: Welcoming some new committers

2018-03-05 Thread Bryan Cutler
Thanks everyone, this is very exciting!  I'm looking forward to working
with you all and helping out more in the future.  Also, congrats to the
other committers as well!!


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Bryan Cutler
+1
Tests passed and additionally ran Arrow related tests and did some perf
checks with python 2.7.14

On Fri, Feb 23, 2018 at 6:18 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> Note: given the state of Jenkins I'd love to see Bryan Cutler or someone
> with Arrow experience sign off on this release.
>
> On Fri, Feb 23, 2018 at 6:13 PM, Cheng Lian <lian.cs@gmail.com> wrote:
>
>> +1 (binding)
>>
>> Passed all the tests, looks good.
>>
>> Cheng
>>
>> On 2/23/18 15:00, Holden Karau wrote:
>>
>> +1 (binding)
>> PySpark artifacts install in a fresh Py3 virtual env
>>
>> On Feb 23, 2018 7:55 AM, "Denny Lee" <denny.g@gmail.com> wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Fri, Feb 23, 2018 at 07:08 Josh Goldsborough <
>>> joshgoldsboroughs...@gmail.com> wrote:
>>>
>>>> New to testing out Spark RCs for the community but I was able to run
>>>> some of the basic unit tests without error so for what it's worth, I'm a 
>>>> +1.
>>>>
>>>> On Thu, Feb 22, 2018 at 4:23 PM, Sameer Agarwal <samee...@apache.org>
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 2.3.0. The vote is open until Tuesday February 27, 2018 at 8:00:00
>>>>> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>>>>>
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 2.3.0
>>>>>
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v2.3.0-rc5: https://github.com/apache/spar
>>>>> k/tree/v2.3.0-rc5 (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>>>>>
>>>>> List of JIRA tickets resolved in this release can be found here:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>>>>>
>>>>> Release artifacts are signed with the following key:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapache
>>>>> spark-1266/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs
>>>>> /_site/index.html
>>>>>
>>>>>
>>>>> FAQ
>>>>>
>>>>> ===
>>>>> What are the unresolved issues targeted for 2.3.0?
>>>>> ===
>>>>>
>>>>> Please see https://s.apache.org/oXKi. At the time of writing, there
>>>>> are currently no known release blockers.
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the Java/Scala you
>>>>> can add the staging repository to your projects resolvers and test with 
>>>>> the
>>>>> RC (make sure to clean up the artifact cache before/after so you don't end
>>>>> up building with a out of date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 2.3.0?
>>>>> ===
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should be
>>>>> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 
>>>>> as
>>>>> appropriate.
>>>>>
>>>>> ===
>>>>> Why is my bug not fixed?
>>>>> ===
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from 2.2.0. That being
>>>>> said, if there is something which is a regression from 2.2.0 and has not
>>>>> been correctly targeted please ping me or a committer to help target the
>>>>> issue (you can see the open issues listed as impacting Spark 2.3.0 at
>>>>> https://s.apache.org/WmoI).
>>>>>
>>>>
>>>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: JIRA access

2018-02-23 Thread Bryan Cutler
Hi Arun,

The general process is to just leave a comment in the JIRA that you are
working on it so others know.  Once your pull request is merged, the JIRA
will be assigned to you.  You can read
http://spark.apache.org/contributing.html for details.

On Fri, Feb 23, 2018 at 9:08 PM, Arun Manivannan  wrote:

> Hi,
>
> I would like to attempt SPARK-20592
> .   Can I please have
> access to the JIRA so that I could assign it to myself.  My user id is
> : arunodhaya80
>
> Cheers,
> Arun
>


Re: Thoughts on Cloudpickle Update

2018-01-19 Thread Bryan Cutler
Thanks Holden and Hyukjin.  I agree, let's start doing the work first and
see if the changes are low-risk enough, then we can evaluate how best to
proceed.  I made https://issues.apache.org/jira/browse/SPARK-23159 and will
get started on the update and we can continue to discuss in the PR.

On Fri, Jan 19, 2018 at 1:32 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Yea, that sounds good to me.
>
> 2018-01-19 18:29 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>:
>
>> So it is pretty core, but it's one of the better indirectly tested
>> components. I think probably the most reasonable path is to see what the
>> diff ends up looking like and make a call at that point for if we want it
>> to go to master or master & branch-2.3?
>>
>> On Fri, Jan 19, 2018 at 12:30 AM, Hyukjin Kwon <gurwls...@gmail.com>
>> wrote:
>>
>>> > So given that it fixes some real world bugs, any particular reason
>>> why? Would you be comfortable with doing it in 2.3.1?
>>>
>>> Ah, I don't feel strongly about this, but RC2 will be running soon and
>>> the cloudpickle change is quite a core fix to PySpark. Just thought we might
>>> want to have enough time with it.
>>>
>>> One worry is, upgrading it includes a fix about namedtuple too where
>>> PySpark has a custom fix.
>>> I would like to check few things about this.
>>>
>>> So, yea, it's vague. I wouldn't stay against if you'd prefer.
>>>
>>>
>>>
>>>
>>> 2018-01-19 16:42 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>:
>>>
>>>>
>>>>
>>>> On Jan 19, 2018 7:28 PM, "Hyukjin Kwon" <gurwls...@gmail.com> wrote:
>>>>
>>>> > Is it an option to match the latest version of cloudpickle and still
>>>> set protocol level 2?
>>>>
>>>> IMHO, I think this can be an option but I am not fully sure yet if we
>>>> should/could go ahead for it within Spark 2.X. I need some
>>>> investigations including things about Pyrolite.
>>>>
>>>> Let's go ahead with matching it to 0.4.2 first. I am quite clear on
>>>> matching it to 0.4.2 at least.
>>>>
>>>> So given that there is a follow-up which fixes a regression, if we're
>>>> not comfortable doing the latest version let's double-check that the
>>>> version we do upgrade to doesn't have that regression.
>>>>
>>>>
>>>>
>>>> > I agree that upgrading to try and match version 0.4.2 would be a
>>>> good starting point. Unless no one objects, I will open up a JIRA and try
>>>> to do this.
>>>>
>>>> Yup but I think we shouldn't make this into Spark 2.3.0 to be clear.
>>>>
>>>> So given that it fixes some real world bugs, any particular reason why?
>>>> Would you be comfortable with doing it in 2.3.1?
>>>>
>>>>
>>>>
>>>> > Also lets try to keep track in our commit messages which version of
>>>> cloudpickle we end up upgrading to.
>>>>
>>>> +1: PR description, commit message or any unit to identify each will be 
>>>> useful.
>>>> It should be easier once we have a matched version.
>>>>
>>>>
>>>>
>>>> 2018-01-19 12:55 GMT+09:00 Holden Karau <hol...@pigscanfly.ca>:
>>>>
>>>>> So if there are different versions of Python on the cluster machines, I
>>>>> think that's already unsupported, so I'm not worried about that.
>>>>>
>>>>> I'd suggest going to the highest released version since there appear
>>>>> to be some useful fixes between 0.4.2 & 0.5.2
>>>>>
>>>>> Also lets try to keep track in our commit messages which version of
>>>>> cloudpickle we end up upgrading to.
>>>>>
>>>>> On Thu, Jan 18, 2018 at 5:45 PM, Bryan Cutler <cutl...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks for all the details and background Hyukjin! Regarding the
>>>>>> pickle protocol change, if I understand correctly, it is currently at 
>>>>>> level
>>>>>> 2 in Spark which is good for backwards compatibility for all of Python 2.
>>>>>> Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
>>>>>> above, will pick a level determined by your Python version. So is the
>>>>>> concern here for Spark if someone has

Re: Thoughts on Cloudpickle Update

2018-01-18 Thread Bryan Cutler
Thanks for all the details and background Hyukjin! Regarding the pickle
protocol change, if I understand correctly, it is currently at level 2 in
Spark which is good for backwards compatibility for all of Python 2.
Choosing HIGHEST_PROTOCOL, which is the default for cloudpickle 0.5.0 and
above, will pick a level determined by your Python version. So is the
concern here for Spark that if someone has different versions of Python in their
cluster, like 3.5 and 3.3, then different protocols will be used and
deserialization might fail?  Is it an option to match the latest version of
cloudpickle and still set protocol level 2?

I agree that upgrading to try and match version 0.4.2 would be a good
starting point. If no one objects, I will open up a JIRA and try to do
this.

Thanks,
Bryan

On Mon, Jan 15, 2018 at 7:57 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Hi Bryan,
>
> Yup, I support matching the version. I pushed it forward before to match
> it with https://github.com/cloudpipe/cloudpickle
> a few times in Spark's copy, and also in cloudpickle itself with a few
> fixes. I believe our copy is closest to 0.4.1.
>
> I have been trying to follow the changes in cloudpipe/cloudpickle to see
> which version we should match. I think we should match
> it with 0.4.2 first (I need to double check) because IMHO they have been
> adding rather radical changes from 0.5.0, including
> the pickle protocol change (by default).
>
> Personally, I would like to match it with the latest because there have
> been some important changes. For
> example, see this too - https://github.com/cloudpipe/cloudpickle/pull/138
> (it's still pending review) eventually, but 0.4.2 should be
> a good starting point.
>
> For the strategy, I think we can match it and follow 0.4.x within Spark
> for the conservative and safe choice + minimal cost.
>
>
> I tried to leave few explicit answers to the questions from you, Bryan:
>
> > Spark is currently using a forked version and it seems like updates are
> made every now and then when
> > needed, but it's not really clear where the current state is and how
> much it has diverged.
>
> I am quite sure our cloudpickle copy is closer to 0.4.1 IIRC.
>
>
> > Are there any known issues with recent changes from those that follow
> cloudpickle dev?
>
> I am technically involved in cloudpickle dev although less active.
> They changed default pickle protocol (https://github.com/cloudpipe/
> cloudpickle/pull/127). So, if we target 0.5.x+, we should double check
> the potential compatibility issue, or fix the protocol, which I believe was
> introduced in 0.5.x.
>
>
>
> 2018-01-16 11:43 GMT+09:00 Bryan Cutler <cutl...@gmail.com>:
>
>> Hi All,
>>
>> I've seen a couple issues lately related to cloudpickle, notably
>> https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
>> some feedback on updating the version in PySpark which should fix these
>> issues and allow us to remove some workarounds.  Spark is currently using a
>> forked version and it seems like updates are made every now and then when
>> needed, but it's not really clear where the current state is and how much
>> it has diverged.  This makes back-porting fixes difficult.  There was a
>> previous discussion on moving it to a dependency here
>> <http://apache-spark-developers-list.1001551.n3.nabble.com/PYTHON-DISCUSS-Moving-to-cloudpickle-and-or-Py4J-as-a-dependencies-td20954.html>,
>> but given the status right now I think it would be best to do another
>> update and bring things closer to upstream before we talk about completely
>> moving it outside of Spark.  Before starting another update, it might be
>> good to discuss the strategy a little.  Should the version in Spark be
>> derived from a release or at least tied to a specific commit?  It would
>> also be good if we can document where it has diverged.  Are there any known
>> issues with recent changes from those that follow cloudpickle dev?  Any
>> other thoughts or concerns?
>>
>> Thanks,
>> Bryan
>>
>
>
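
To make the protocol concern above concrete, here is a tiny sketch of the
difference between pinning protocol 2 (what Spark currently does) and relying
on the newer default, assuming a standalone cloudpickle >= 0.5.x rather than
Spark's vendored copy:

```
import pickle

import cloudpickle

payload = {"answer": 42}

# Pinning protocol 2 keeps the bytes loadable on any Python 2 or 3 worker.
pinned = cloudpickle.dumps(payload, protocol=2)

# cloudpickle 0.5.0+ defaults to pickle.HIGHEST_PROTOCOL, which depends on the
# local interpreter and may not load on workers running an older Python.
default = cloudpickle.dumps(payload)

print(pickle.HIGHEST_PROTOCOL, pickle.loads(pinned), pickle.loads(default))
```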


Thoughts on Cloudpickle Update

2018-01-15 Thread Bryan Cutler
Hi All,

I've seen a couple issues lately related to cloudpickle, notably
https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
some feedback on updating the version in PySpark which should fix these
issues and allow us to remove some workarounds.  Spark is currently using a
forked version and it seems like updates are made every now and then when
needed, but it's not really clear where the current state is and how much
it has diverged.  This makes back-porting fixes difficult.  There was a
previous discussion on moving it to a dependency here
,
but given the status right now I think it would be best to do another
update and bring things closer to upstream before we talk about completely
moving it outside of Spark.  Before starting another update, it might be
good to discuss the strategy a little.  Should the version in Spark be
derived from a release or at least tied to a specific commit?  It would
also be good if we can document where it has diverged.  Are there any known
issues with recent changes from those that follow cloudpickle dev?  Any
other thoughts or concerns?

Thanks,
Bryan


Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python

2017-09-07 Thread Bryan Cutler
+1 (non-binding) for the goals and non-goals of this SPIP.  I think it's
fine to work out the minor details of the API during review.

Bryan

On Wed, Sep 6, 2017 at 5:17 AM, Takuya UESHIN 
wrote:

> Hi all,
>
> Thank you for voting and suggestions.
>
> As Wenchen mentioned and also we're discussing at JIRA, we need to discuss
> the size hint for the 0-parameter UDF.
> But I believe we got a consensus about the basic APIs except for the size
> hint, I'd like to submit a pr based on the current proposal and continue
> discussing in its review.
>
> https://github.com/apache/spark/pull/19147
>
> I'd keep this vote open to wait for more opinions.
>
> Thanks.
>
>
> On Wed, Sep 6, 2017 at 9:48 AM, Wenchen Fan  wrote:
>
>> +1 on the design and proposed API.
>>
>> One detail I'd like to discuss is the 0-parameter UDF, how we can specify
>> the size hint. This can be done in the PR review though.
>>
>> On Sat, Sep 2, 2017 at 2:07 AM, Felix Cheung 
>> wrote:
>>
>>> +1 on this and like the suggestion of type in string form.
>>>
>>> Would it be correct to assume there will be a data type check, for example
>>> that the returned pandas data frame column data types match what is specified?
>>> We have seen quite a few issues/confusions with that in R.
>>>
>>> Would it make sense to have a more generic decorator name so that it
>>> could also be usable for other efficient vectorized formats in the future?
>>> Or do we anticipate the decorator to be format specific and will have more
>>> in the future?
>>>
>>> --
>>> *From:* Reynold Xin 
>>> *Sent:* Friday, September 1, 2017 5:16:11 AM
>>> *To:* Takuya UESHIN
>>> *Cc:* spark-dev
>>> *Subject:* Re: [VOTE][SPIP] SPARK-21190: Vectorized UDFs in Python
>>>
>>> Ok, thanks.
>>>
>>> +1 on the SPIP for scope etc
>>>
>>>
>>> On API details (will deal with in code reviews as well but leaving a
>>> note here in case I forget)
>>>
>>> 1. I would suggest having the API also accept data type specification in
>>> string form. It is usually simpler to say "long" than "LongType()".
>>>
>>> 2. Think about what error message to show when the rows numbers don't
>>> match at runtime.
>>>
>>>
>>> On Fri, Sep 1, 2017 at 12:29 PM Takuya UESHIN 
>>> wrote:
>>>
 Yes, the aggregation is out of scope for now.
 I think we should continue discussing the aggregation at JIRA and we
 will be adding those later separately.

 Thanks.


 On Fri, Sep 1, 2017 at 6:52 PM, Reynold Xin 
 wrote:

> Is the idea aggregate is out of scope for the current effort and we
> will be adding those later?
>
> On Fri, Sep 1, 2017 at 8:01 AM Takuya UESHIN 
> wrote:
>
>> Hi all,
>>
>> We've been discussing to support vectorized UDFs in Python and we
>> almost got a consensus about the APIs, so I'd like to summarize and
>> call for a vote.
>>
>> Note that this vote should focus on APIs for vectorized UDFs, not
>> APIs for vectorized UDAFs or Window operations.
>>
>> https://issues.apache.org/jira/browse/SPARK-21190
>>
>>
>> *Proposed API*
>>
>> We introduce a @pandas_udf decorator (or annotation) to define
>> vectorized UDFs which take one or more pandas.Series or one integer
>> value meaning the length of the input value for 0-parameter UDFs. The
>> return value should be a pandas.Series of the specified type and the
>> length of the returned value should be the same as the input value.
>>
>> We can define vectorized UDFs as:
>>
>>   @pandas_udf(DoubleType())
>>   def plus(v1, v2):
>>   return v1 + v2
>>
>> or we can define as:
>>
>>   plus = pandas_udf(lambda v1, v2: v1 + v2, DoubleType())
>>
>> We can use it similar to row-by-row UDFs:
>>
>>   df.withColumn('sum', plus(df.v1, df.v2))
>>
>> As for 0-parameter UDFs, we can define and use as:
>>
>>   @pandas_udf(LongType())
>>   def f0(size):
>>   return pd.Series(1).repeat(size)
>>
>>   df.select(f0())
>>
>>
>>
>> The vote will be up for the next 72 hours. Please reply with your
>> vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thanks!
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>


 --
 Takuya UESHIN
 Tokyo, Japan

 http://twitter.com/ueshin

>>>
>>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


Re: Run a specific PySpark test or group of tests

2017-08-15 Thread Bryan Cutler
This generally works for me to just run tests within a class or even a
single test.  Not as flexible as pytest -k, which would be nice.

$ SPARK_TESTING=1 bin/pyspark pyspark.sql.tests ArrowTests

On Tue, Aug 15, 2017 at 5:49 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Pytest does support unittest-based tests
> , allowing for
> incremental adoption. I'll see how convenient it is to use with our current
> test layout.
>
> On Tue, Aug 15, 2017 at 1:03 AM Hyukjin Kwon  wrote:
>
>> For me, I would like this if this can be done with relatively small
>> changes.
>> How about adding more granular options, for example, specifying or
>> filtering a smaller set of test goals in the run-tests.py script?
>> I think it'd be quite a small change and we could roughly reach this goal
>> if I understood correctly.
>>
>>
>> 2017-08-15 3:06 GMT+09:00 Nicholas Chammas :
>>
>>> Say you’re working on something and you want to rerun the PySpark tests,
>>> focusing on a specific test or group of tests. Is there a way to do that?
>>>
>>> I know that you can test entire modules with this:
>>>
>>> ./python/run-tests --modules pyspark-sql
>>>
>>> But I’m looking for something more granular, like pytest’s -k option.
>>>
>>> On that note, does anyone else think it would be valuable to use a test
>>> runner like pytest to run our Python tests? The biggest benefits would be
>>> the use of fixtures ,
>>> and more flexibility on test running and reporting. Just wondering if we’ve
>>> already considered this.
>>>
>>> Nick
>>> ​
>>>
>>
>>


Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Bryan Cutler
Great work Hyukjin and Sameer!

On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan 
wrote:

> Congratulations Hyukjin, Sameer !
>
> Regards,
> Mridul
>
> On Mon, Aug 7, 2017 at 8:53 AM, Matei Zaharia 
> wrote:
> > Hi everyone,
> >
> > The Spark PMC recently voted to add Hyukjin Kwon and Sameer Agarwal as
> committers. Join me in congratulating both of them and thanking them for
> their contributions to the project!
> >
> > Matei
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Some PRs not automatically linked to JIRAs

2017-08-02 Thread Bryan Cutler
Thanks Hyukjin!  I didn't see your previous message.  It looks like your
manual run worked pretty well for the JIRAs I'm following; the only thing
is that it didn't mark them as "in progress", but that's not a big deal.
Otherwise that helps until we can find out why it's not doing this
automatically.  I'm not familiar with that script; can anyone run it to
apply to a single JIRA they are working on?

On Wed, Aug 2, 2017 at 12:09 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> I was wondering about this too..
>
>
> Yes, actually, I have been manually adding some links by following the
> same steps as the script before.
>
> I was thinking it'd be nicer to run this manually once, and then I
> ran this against a single JIRA
>
> first - https://issues.apache.org/jira/browse/SPARK-21526 to show how it
> looks and check if there
>
> is any issue or objection, just in case.
>
>
> Will run this manually once now. I will revert all my actions manually if
> there is any issue from doing this.
>
>
> 2017-08-03 3:50 GMT+09:00 Sean Owen <so...@cloudera.com>:
>
>> Hyukjin mentioned this here earlier today and had run it manually, but
>> yeah I'm not sure where it normally runs or why it hasn't. Shane not sure
>> if you're the person to ask?
>>
>>
>> On Wed, Aug 2, 2017 at 7:47 PM Bryan Cutler <cutl...@gmail.com> wrote:
>>
>>> Hi Devs,
>>>
>>> I've noticed a couple PRs recently have not been automatically linked to
>>> the related JIRAs.  This was one of mine (I linked it manually)
>>> https://issues.apache.org/jira/browse/SPARK-21583, but I've seen it
>>> happen elsewhere.  I think this is the script that does it, but it hasn't
>>> been changed recently https://github.com/ap
>>> ache/spark/blob/master/dev/github_jira_sync.py.  Anyone else seen this
>>> or know what's going on?
>>>
>>> Thanks,
>>> Bryan
>>>
>>
>


Some PRs not automatically linked to JIRAs

2017-08-02 Thread Bryan Cutler
Hi Devs,

I've noticed a couple PRs recently have not been automatically linked to
the related JIRAs.  This was one of mine (I linked it manually)
https://issues.apache.org/jira/browse/SPARK-21583, but I've seen it happen
elsewhere.  I think this is the script that does it, but it hasn't been
changed recently
https://github.com/apache/spark/blob/master/dev/github_jira_sync.py.
Anyone else seen this or know what's going on?

Thanks,
Bryan


Re: welcoming Burak and Holden as committers

2017-01-25 Thread Bryan Cutler
Congratulations Holden and Burak, well deserved!!!

On Tue, Jan 24, 2017 at 10:13 AM, Reynold Xin  wrote:

> Hi all,
>
> Burak and Holden have recently been elected as Apache Spark committers.
>
> Burak has been very active in a large number of areas in Spark, including
> linear algebra, stats/maths functions in DataFrames, Python/R APIs for
> DataFrames, dstream, and most recently Structured Streaming.
>
> Holden has been a long time Spark contributor and evangelist. She has
> written a few books on Spark, as well as frequent contributions to the
> Python API to improve its usability and performance.
>
> Please join me in welcoming the two!
>
>
>


Re: Belief propagation algorithm is open sourced

2016-12-14 Thread Bryan Cutler
I'll check it out, thanks for sharing Alexander!

On Dec 13, 2016 4:58 PM, "Ulanov, Alexander" 
wrote:

> Dear Spark developers and users,
>
>
> HPE has open sourced the implementation of the belief propagation (BP)
> algorithm for Apache Spark, a popular message passing algorithm for
> performing inference in probabilistic graphical models. It provides exact
> inference for graphical models without loops. While inference for graphical
> models with loops is approximate, in practice it is shown to work well. The
> implementation is generic and operates on factor graph representation of
> graphical models. It handles factors of any order, and variable domains of
> any size. It is implemented with Apache Spark GraphX, and thus can scale to
> large scale models. Further, it supports computations in log scale for
> numerical stability. Large scale applications of BP include fraud detection
> in banking transactions and malicious site detection in computer networks.
>
>
> Source code: https://github.com/HewlettPackard/sandpiper
>
>
> Best regards, Alexander
>


Re: Why is there no flatten method on RDD?

2016-12-14 Thread Bryan Cutler
Hi Tarun,

I think there just hasn't been a strong need for it when you can accomplish
the same with just rdd.flatMap(identity).  I see a JIRA was just opened for
this https://issues.apache.org/jira/browse/SPARK-18855
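
For reference, a minimal sketch of the equivalence in Scala, assuming an
existing SparkContext named sc (the sample data is made up):

  // an RDD whose elements are themselves collections
  val nested = sc.parallelize(Seq(Seq(1, 2), Seq(3), Seq(4, 5)))
  // a hypothetical nested.flatten would give RDD(1, 2, 3, 4, 5); flatMap
  // with the identity function produces the same result today
  val flat = nested.flatMap(identity)
  flat.collect()  // Array(1, 2, 3, 4, 5)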

On Mon, Dec 5, 2016 at 2:55 PM, Tarun Kumar  wrote:

> Hi,
>
> Although a flatMap would yield me the results which I look from map
> followed by flatten, but just out of curiosity, Why there is no flatten
> method provided on RDD.
>
> Would it make sense to submit a PR adding flatten on RDD?
>
> Thanks
> Tarun
>


Re: welcoming Xiao Li as a committer

2016-10-04 Thread Bryan Cutler
Congrats Xiao!

On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau  wrote:

> Congratulations :D :) Yay!
>
> On Tue, Oct 4, 2016 at 11:14 AM, Suresh Thalamati <
> suresh.thalam...@gmail.com> wrote:
>
>> Congratulations, Xiao!
>>
>>
>>
>> > On Oct 3, 2016, at 10:46 PM, Reynold Xin  wrote:
>> >
>> > Hi all,
>> >
>> > Xiao Li, aka gatorsmile, has recently been elected as an Apache Spark
>> committer. Xiao has been a super active contributor to Spark SQL. Congrats
>> and welcome, Xiao!
>> >
>> > - Reynold
>> >
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


Re: AccumulatorV2 += operator

2016-08-03 Thread Bryan Cutler
No, I was referring to the programming guide section on accumulators, which
says "Tasks running on a cluster can then add to it using the add method
or the += operator (in Scala and Python)."
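
To illustrate the gap, a quick sketch in Scala assuming a live SparkContext
named sc (the accumulator name is arbitrary):

  val acc = sc.longAccumulator("counter")            // AccumulatorV2-based
  sc.parallelize(1 to 10).foreach(_ => acc.add(1L))  // add() works
  // acc += 1L does not compile against AccumulatorV2, unlike the old
  // Accumulator API that the guide text still describes
  println(acc.value)                                 // 10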

On Aug 2, 2016 2:52 PM, "Holden Karau" <hol...@pigscanfly.ca> wrote:

> I believe it was intentional with the idea that it would be more unified
> between Java and Scala APIs. If you're talking about the javadoc mention in
> https://github.com/apache/spark/pull/14466/files - I believe the += is
> meant to refer to what the internal implementation of the add function can
> be for someone extending the accumulator (but it certainly could cause
> confusion).
>
> Reynold can provide a more definitive answer in this case.
>
> On Tue, Aug 2, 2016 at 1:46 PM, Bryan Cutler <cutl...@gmail.com> wrote:
>
>> It seems like the += operator is missing from the new accumulator API,
>> although the docs still make reference to it.  Anyone know if it was
>> intentionally not put in?  I'm happy to do a PR for it or update the docs
>> to just use the add() method, just want to check if there was some reason
>> first.
>>
>> Bryan
>>
>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


AccumulatorV2 += operator

2016-08-02 Thread Bryan Cutler
It seems like the += operator is missing from the new accumulator API,
although the docs still make reference to it.  Anyone know if it was
intentionally not put in?  I'm happy to do a PR for it or update the docs
to just use the add() method, just want to check if there was some reason
first.

Bryan


Re: Welcoming Yanbo Liang as a committer

2016-06-05 Thread Bryan Cutler
Congratulations Yanbo!
On Jun 5, 2016 4:03 AM, "Kousuke Saruta"  wrote:

> Congratulations Yanbo!
>
>
> - Kousuke
>
> On 2016/06/04 11:48, Matei Zaharia wrote:
>
>> Hi all,
>>
>> The PMC recently voted to add Yanbo Liang as a committer. Yanbo has been
>> a super active contributor in many areas of MLlib. Please join me in
>> welcoming Yanbo!
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Organizing Spark ML example packages

2016-04-19 Thread Bryan Cutler
+1, adding some organization would make it easier for people to find a
specific example

On Mon, Apr 18, 2016 at 11:52 PM, Yanbo Liang  wrote:

> This sounds good to me, and it will make ML examples more neatly.
>
> 2016-04-14 5:28 GMT-07:00 Nick Pentreath :
>
>> Hey Spark devs
>>
>> I noticed that we now have a large number of examples for ML & MLlib in
>> the examples project - 57 for ML and 67 for MLLIB to be precise. This is
>> bound to get larger as we add features (though I know there are some PRs to
>> clean up duplicated examples).
>>
>> What do you think about organizing them into packages to match the use
>> case and the structure of the code base? e.g.
>>
>> org.apache.spark.examples.ml.recommendation
>>
>> org.apache.spark.examples.ml.feature
>>
>> and so on...
>>
>> Is it worth doing? The doc pages with include_example would need
>> updating, and the run_example script input would just need to change the
>> package slightly. Did I miss any potential issue?
>>
>> N
>>
>
>


Re: pull request template

2016-03-19 Thread Bryan Cutler
+1 on Marcelo's comments.  It would be nice not to pollute commit messages
with the instructions because some people might forget to remove them.
Nobody has suggested removing the template.

On Tue, Mar 15, 2016 at 3:59 PM, Joseph Bradley 
wrote:
> +1 for keeping the template
>
> I figure any template will require conscientiousness & enforcement.
>
> On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen  wrote:
>>
>> The template is a great thing as it gets instructions even more right
>> in front of people.
>>
>> Another idea is to just write a checklist of items, like "did you
>> describe your changes? did you test? etc." with instructions to delete
>> the text and replace with a description. This keeps the boilerplate
>> titles out of the commit message.
>>
>> The special character and post processing just takes that a step further.
>>
>> On Sat, Mar 12, 2016 at 1:31 AM, Marcelo Vanzin 
>> wrote:
>> > Hey all,
>> >
>> > Just wanted to ask: how do people like this new template?
>> >
>> > While I think it's great to have instructions for people to write
>> > proper commit messages, I think the current template has a few
>> > downsides.
>> >
>> > - I tend to write verbose commit messages already when I'm preparing a
>> > PR. Now when I open the PR I have to edit the summary field to remove
>> > all the boilerplate.
>> > - The template ends up in the commit messages, and sometimes people
>> > forget to remove even the instructions.
>> >
>> > Instead, what about changing the template a bit so that it just has
>> > instructions prepended with some character, and have those lines
>> > removed by the merge_spark_pr.py script? We could then even throw in a
>> > link to the wiki as Sean suggested since it won't end up in the final
>> > commit messages.
>> >
>> >
>> > On Fri, Feb 19, 2016 at 11:53 AM, Reynold Xin 
>> > wrote:
>> >> We can add that too - just need to figure out a good way so people
>> >> don't
>> >> leave a lot of the unnecessary "guideline" messages in the template.
>> >>
>> >> The contributing guide is great, but unfortunately it is not as
>> >> noticeable
>> >> and is often ignored. It's good to have this full-fledged contributing
>> >> guide, and then have a very lightweight version of that in the form of
>> >> templates to force contributors to think about all the important
>> >> aspects
>> >> outlined in the contributing guide.
>> >>
>> >>
>> >>
>> >>
>> >> On Fri, Feb 19, 2016 at 2:36 AM, Sean Owen  wrote:
>> >>>
>> >>> All that seems fine. All of this is covered in the contributing wiki,
>> >>> which is linked from CONTRIBUTING.md (and should be from the
>> >>> template), but people don't seem to bother reading it. I don't mind
>> >>> duplicating some key points, and even a more explicit exhortation to
>> >>> read the whole wiki, before considering opening a PR. We spend way too
>> >>> much time asking people to fix things they should have taken 60
>> >>> seconds to do correctly in the first place.
>> >>>
>> >>> On Fri, Feb 19, 2016 at 10:33 AM, Iulian Dragoș
>> >>>  wrote:
>> >>> > It's a good idea. I would add in there the spec for the PR title. I
>> >>> > always
>> >>> > get wrong the order between Jira and component.
>> >>> >
>> >>> > Moreover, CONTRIBUTING.md is also lacking them. Any reason not to
>> >>> > add it
>> >>> > there? I can open PRs for both, but maybe you want to keep that info
>> >>> > on the wiki instead.
>> >>> >
>> >>> > iulian
>> >>> >
>> >>> > On Thu, Feb 18, 2016 at 4:18 AM, Reynold Xin 
>> >>> > wrote:
>> >>> >>
>> >>> >> Github introduced a new feature today that allows projects to
>> >>> >> define
>> >>> >> templates for pull requests. I pushed a very simple template to
the
>> >>> >> repository:
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>> >>> >>
>> >>> >>
>> >>> >> Over time I think we can see how this works and perhaps add a small
>> >>> >> checklist to the pull request template so contributors are reminded
>> >>> >> every
>> >>> >> time they submit a pull request the important things to do in a
>> >>> >> pull
>> >>> >> request
>> >>> >> (e.g. having proper tests).
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> ## What changes were proposed in this pull request?
>> >>> >>
>> >>> >> (Please fill in changes proposed in this fix)
>> >>> >>
>> >>> >>
>> >>> >> ## How was the this patch tested?
>> >>> >>
>> >>> >> (Please explain how this patch was tested. E.g. unit tests,
>> >>> >> integration
>> >>> >> tests, manual tests)
>> >>> >>
>> >>> >>
>> >>> >> (If this patch involves UI changes, please attach a screenshot;
>> >>> >> otherwise,
>> >>> >> remove this)
>> >>> >>
>> >>> >>
>> >>> >
>> >>> >
>> >>> >
>> >>> > --
>> >>> >
>> >>> > --
>> >>> > Iulian Dragos
>> >>> >
>> >>> > --
>> >>> > Reactive Apps on the JVM
>> >>> > 

Re: running lda in spark throws exception

2016-01-14 Thread Bryan Cutler
What I mean is that the input to LDA.run() is an RDD[(Long, Vector)], and the
Vector is a vector of counts for each term that should be the same size as
the vocabulary (so if the vocabulary, or dictionary, has 10 words, each
vector should have a size of 10).  This probably means that there will be
some elements with zero counts, and a sparse vector might be a good way to
handle that.
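
As an illustration, a sketch in Scala of building vocabulary-sized count
vectors from documents of term ids, assuming an existing SparkContext sc
(the vocab size and sample docs are made up):

  import org.apache.spark.mllib.clustering.LDA
  import org.apache.spark.mllib.linalg.Vectors

  val vocabSize = 10
  // each document is a sequence of term ids in the range 0 until vocabSize
  val docs = sc.parallelize(Seq(Seq(0, 2, 4, 6, 8), Seq(1, 3, 5, 6, 7)))
  val corpus = docs.map { termIds =>
    // count occurrences of each term; terms that never appear (e.g. 9 here)
    // are just implicit zeros in the sparse vector
    val counts = termIds.groupBy(identity).map { case (id, occ) => (id, occ.size.toDouble) }
    Vectors.sparse(vocabSize, counts.toSeq)
  }.zipWithIndex.map(_.swap).cache()
  val ldaModel = new LDA().setK(2).setMaxIterations(10).run(corpus)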

On Wed, Jan 13, 2016 at 6:40 PM, Li Li <fancye...@gmail.com> wrote:

> It looks like the problem is the vectors of term counts in the corpus
> are not always the vocabulary size.
> Do you mean some integers that do not occur in the corpus?
> For example, the dictionary is 0 - 9 (total 10 words).
> The docs are:
> 0 2 4 6 8
> 1 3 5 7 9
> Then it will be correct
> If the docs are:
> 0 2 4 6 9
> 1 3 5 6 7
> 8 does not occur in any document; then will it be wrong?
>
> So the workaround is to process the input to re-encode terms?
>
> On Thu, Jan 14, 2016 at 6:53 AM, Bryan Cutler <cutl...@gmail.com> wrote:
> > I was now able to reproduce the exception using the master branch and
> local
> > mode.  It looks like the problem is the vectors of term counts in the
> corpus
> > are not always the vocabulary size.  Once I padded these with zero
> counts to
> > the vocab size, it ran without the exception.
> >
> > Joseph, I also tried calling describeTopics and noticed that with the
> > improper vector size, it will not throw an exception but the term indices
> > will start to be incorrect.  For a small number of iterations, it is ok,
> but
> > increasing iterations causes the indices to get larger also.  Maybe that
> is
> > what is going on in the JIRA you linked to?
> >
> > On Wed, Jan 13, 2016 at 1:17 AM, Li Li <fancye...@gmail.com> wrote:
> >>
> >> I will try spark 1.6.0 to see it is the bug of 1.5.2.
> >>
> >> On Wed, Jan 13, 2016 at 3:58 PM, Li Li <fancye...@gmail.com> wrote:
> >> > I have set up a stand alone spark cluster and use the same codes. it
> >> > still failed with the same exception
> >> > I also preprocessed the data to lines of integers and use the scala
> >> > codes of lda example. it still failed.
> >> > the codes:
> >> >
> >> > import org.apache.spark.mllib.clustering.{ LDA, DistributedLDAModel }
> >> >
> >> > import org.apache.spark.mllib.linalg.Vectors
> >> >
> >> > import org.apache.spark.SparkContext
> >> >
> >> > import org.apache.spark.SparkContext._
> >> >
> >> > import org.apache.spark.SparkConf
> >> >
> >> >
> >> > object TestLDA {
> >> >
> >> >   def main(args: Array[String]) {
> >> >
> >> > if(args.length!=4){
> >> >
> >> >   println("need 4 args inDir outDir topic iternum")
> >> >
> >> >   System.exit(-1)
> >> >
> >> > }
> >> >
> >> > val conf = new SparkConf().setAppName("TestLDA")
> >> >
> >> > val sc = new SparkContext(conf)
> >> >
> >> > // Load and parse the data
> >> >
> >> > val data = sc.textFile(args(0))
> >> >
> >> > val parsedData = data.map(s => Vectors.dense(s.trim.split('
> >> > ').map(_.toDouble)))
> >> >
> >> > // Index documents with unique IDs
> >> >
> >> > val corpus = parsedData.zipWithIndex.map(_.swap).cache()
> >> >
> >> > val topicNum=Integer.valueOf(args(2))
> >> >
> >> > val iterNum=Integer.valueOf(args(1))
> >> >
> >> > // Cluster the documents into three topics using LDA
> >> >
> >> > val ldaModel = new
> >> > LDA().setK(topicNum).setMaxIterations(iterNum).run(corpus)
> >> >
> >> >
> >> > // Output topics. Each is a distribution over words (matching word
> >> > count vectors)
> >> >
> >> > println("Learned topics (as distributions over vocab of " +
> >> > ldaModel.vocabSize + " words):")
> >> >
> >> > val topics = ldaModel.topicsMatrix
> >> >
> >> > for (topic <- Range(0, topicNum)) {
> >> >
> >> >   print("Topic " + topic + ":")
> >> >
> >> >   for (word <- Range(0, ldaModel.vocabSize)) { print(" " +
> >> > topics(word, top

Re: running lda in spark throws exception

2016-01-13 Thread Bryan Cutler
I was now able to reproduce the exception using the master branch and local
mode.  It looks like the problem is the vectors of term counts in the
corpus are not always the vocabulary size.  Once I padded these with zero
counts to the vocab size, it ran without the exception.

Joseph, I also tried calling describeTopics and noticed that with the
improper vector size, it will not throw an exception but the term indices
will start to be incorrect.  For a small number of iterations, it is ok,
but increasing iterations causes the indices to get larger also.  Maybe
that is what is going on in the JIRA you linked to?
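
One way to do that kind of padding, sketched in Scala (not the exact code
used here; the vocab size is hard-coded for illustration, and corpus is
assumed to be an RDD[(Long, Vector)] like the one in the code below):

  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  val vocabSize = 100  // illustrative; use the real dictionary size
  // extend a short count vector with trailing zeros so its length matches
  // the vocabulary size that LDA expects
  def padToVocabSize(v: Vector): Vector = {
    val padded = new Array[Double](vocabSize)
    v.toArray.copyToArray(padded)
    Vectors.dense(padded)
  }
  val paddedCorpus = corpus.mapValues(padToVocabSize)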

On Wed, Jan 13, 2016 at 1:17 AM, Li Li <fancye...@gmail.com> wrote:

> I will try spark 1.6.0 to see it is the bug of 1.5.2.
>
> On Wed, Jan 13, 2016 at 3:58 PM, Li Li <fancye...@gmail.com> wrote:
> > I have set up a stand alone spark cluster and use the same codes. it
> > still failed with the same exception
> > I also preprocessed the data to lines of integers and use the scala
> > codes of lda example. it still failed.
> > the codes:
> >
> > import org.apache.spark.mllib.clustering.{ LDA, DistributedLDAModel }
> >
> > import org.apache.spark.mllib.linalg.Vectors
> >
> > import org.apache.spark.SparkContext
> >
> > import org.apache.spark.SparkContext._
> >
> > import org.apache.spark.SparkConf
> >
> >
> > object TestLDA {
> >
> >   def main(args: Array[String]) {
> >
> > if(args.length!=4){
> >
> >   println("need 4 args inDir outDir topic iternum")
> >
> >   System.exit(-1)
> >
> > }
> >
> > val conf = new SparkConf().setAppName("TestLDA")
> >
> > val sc = new SparkContext(conf)
> >
> > // Load and parse the data
> >
> > val data = sc.textFile(args(0))
> >
> > val parsedData = data.map(s => Vectors.dense(s.trim.split('
> > ').map(_.toDouble)))
> >
> > // Index documents with unique IDs
> >
> > val corpus = parsedData.zipWithIndex.map(_.swap).cache()
> >
> > val topicNum=Integer.valueOf(args(2))
> >
> > val iterNum=Integer.valueOf(args(1))
> >
> > // Cluster the documents into three topics using LDA
> >
> > val ldaModel = new
> > LDA().setK(topicNum).setMaxIterations(iterNum).run(corpus)
> >
> >
> > // Output topics. Each is a distribution over words (matching word
> > count vectors)
> >
> > println("Learned topics (as distributions over vocab of " +
> > ldaModel.vocabSize + " words):")
> >
> > val topics = ldaModel.topicsMatrix
> >
> > for (topic <- Range(0, topicNum)) {
> >
> >   print("Topic " + topic + ":")
> >
> >   for (word <- Range(0, ldaModel.vocabSize)) { print(" " +
> > topics(word, topic)); }
> >
> >   println()
> >
> > }
> >
> >
> > // Save and load model.
> >
> > ldaModel.save(sc, args(1))
> >
> >   }
> >
> >
> > }
> >
> > scripts to submit:
> >
> > ~/spark-1.5.2-bin-hadoop2.6/bin/spark-submit --class
> > com.mobvoi.knowledgegraph.scala_test.TestLDA \
> >
> > --master spark://master:7077 \
> >
> > --num-executors 10 \
> >
> > --executor-memory 4g \
> >
> > --executor-cores 3 \
> >
> > scala_test-1.0-jar-with-dependencies.jar \
> >
> > /test.txt \
> >
> > 100 \
> >
> > 5  \
> >
> > /lda_model
> >
> > test.txt is in attachment
> >
> >
> > On Sat, Jan 9, 2016 at 6:21 AM, Bryan Cutler <cutl...@gmail.com> wrote:
> >> Hi Li,
> >>
> >> I tried out your code and sample data in both local mode and Spark
> >> Standalone and it ran correctly with output that looks good.  Sorry, I
> don't
> >> have a YARN cluster setup right now, so maybe the error you are seeing
> is
> >> specific to that.  Btw, I am running the latest Spark code from the
> master
> >> branch.  Hope that helps some!
> >>
> >> Bryan
> >>
> >> On Mon, Jan 4, 2016 at 8:42 PM, Li Li <fancye...@gmail.com> wrote:
> >>>
> >>> anyone could help? the problem is very easy to reproduce. What's wrong?
> >>>
> >>> On Wed, Dec 30, 2015 at 8:59 PM, Li Li <fancye...@gmail.com> wrote:
> >>> > I use a small data and reproduce the problem.
> >>> > But I do

Re: running lda in spark throws exception

2016-01-08 Thread Bryan Cutler
Hi Li,

I tried out your code and sample data in both local mode and Spark
Standalone and it ran correctly with output that looks good.  Sorry, I
don't have a YARN cluster setup right now, so maybe the error you are
seeing is specific to that.  Btw, I am running the latest Spark code from
the master branch.  Hope that helps some!

Bryan

On Mon, Jan 4, 2016 at 8:42 PM, Li Li  wrote:

> anyone could help? the problem is very easy to reproduce. What's wrong?
>
> On Wed, Dec 30, 2015 at 8:59 PM, Li Li  wrote:
> > I use a small data and reproduce the problem.
> > But I don't know my codes are correct or not because I am not familiar
> > with spark.
> > So I first post my codes here. If it's correct, then I will post the
> data.
> > one line of my data like:
> >
> > { "time":"08-09-17","cmtUrl":"2094361"
> > ,"rvId":"rev_1020","webpageUrl":"
> http://www.dianping.com/shop/2094361","word_vec":[0,1,2,3,4,5,6,2,7,8,9
> >
>  
> ,10,11,12,13,14,15,16,8,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,32,35,36,37,38,15,39,40,41,42,5,43,44,17,45,46,42,47,26,48,49]}
> >
> > it's a json file which contains webpageUrl and word_vec which is the
> > encoded words.
> > The first step is to prase the input rdd to a rdd of VectorUrl.
> > BTW, if public VectorUrl call(String s) return null, is it ok?
> > Then follow the example Index documents with unique IDs
> > Then I create a rdd to map id to url so after lda training, I can find
> > the url of the document. Then save this rdd to hdfs.
> > Then create corpus rdd and train
> >
> > The exception stack is
> >
> > 15/12/30 20:45:42 ERROR yarn.ApplicationMaster: User class threw
> > exception: java.lang.IndexOutOfBoundsException: (454,0) not in
> > [-58,58) x [-100,100)
> > java.lang.IndexOutOfBoundsException: (454,0) not in [-58,58) x [-100,100)
> > at breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
> > at
> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:534)
> > at
> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:531)
> > at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> > at
> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
> > at
> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
> > at
> com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:89)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:606)
> > at
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
> >
> >
> > ==here is my codes==
> >
> > SparkConf conf = new SparkConf().setAppName(ReviewLDA.class.getName());
> >
> > JavaSparkContext sc = new JavaSparkContext(conf);
> >
> >
> > // Load and parse the data
> >
> > JavaRDD<String> data = sc.textFile(inputDir + "/*");
> >
> > JavaRDD<VectorUrl> parsedData = data.map(new Function<String, VectorUrl>() {
> >
> >   public VectorUrl call(String s) {
> >
> > JsonParser parser = new JsonParser();
> >
> > JsonObject jo = parser.parse(s).getAsJsonObject();
> >
> > if (!jo.has("word_vec") || !jo.has("webpageUrl")) {
> >
> >   return null;
> >
> > }
> >
> > JsonArray word_vec = jo.get("word_vec").getAsJsonArray();
> >
> > String url = jo.get("webpageUrl").getAsString();
> >
> > double[] values = new double[word_vec.size()];
> >
> > for (int i = 0; i < values.length; i++)
> >
> >   values[i] = word_vec.get(i).getAsInt();
> >
> > return new VectorUrl(Vectors.dense(values), url);
> >
> >   }
> >
> > });
> >
> >
> >
> > // Index documents with unique IDs
> >
> > JavaPairRDD<Long, VectorUrl> id2doc =
> > JavaPairRDD.fromJavaRDD(parsedData.zipWithIndex().map(
> >
> > new Function<Tuple2<VectorUrl, Long>, Tuple2<Long, VectorUrl>>() {
> >
> >   public Tuple2<Long, VectorUrl> call(Tuple2<VectorUrl, Long> doc_id) {
> >
> > return doc_id.swap();
> >
> >   }
> >
> > }));
> >
> > JavaPairRDD<Long, String> id2Url = JavaPairRDD.fromJavaRDD(id2doc
> >
> > .map(new Function<Tuple2<Long, VectorUrl>, Tuple2<Long, String>>() {
> >
> >   @Override
> >
> >   public Tuple2<Long, String> call(Tuple2<Long, VectorUrl> id2doc) throws Exception {
> >
> > return new Tuple2(id2doc._1, id2doc._2.url);
> >
> >   }
> >
> > }));
> >
> > id2Url.saveAsTextFile(id2UrlPath);
> >
> > JavaPairRDD

Re: A bug in Spark standalone? Worker registration and deregistration

2015-12-10 Thread Bryan Cutler
Hi Jacek,

I also recently noticed those messages, and some others, and am wondering
if there is an issue.  I am also seeing the following when I have event
logging enabled.  The first application is submitted and executes fine, but
all subsequent attempts produce an error in the logs, and the master fails
to load the event log.  Not sure if this is related to the messages you see,
but I would also like to know if others can reproduce.  Here are the logs:

MASTER
15/12/09 21:19:10 INFO Master: Registering app Spark Pi
15/12/09 21:19:10 INFO Master: Registered app Spark Pi with ID
app-20151209211910-0001
15/12/09 21:19:10 INFO Master: Launching executor app-20151209211910-0001/0
on worker worker-20151209211739-***
15/12/09 21:19:14 INFO Master: Received unregister request from application
app-20151209211910-0001
15/12/09 21:19:14 INFO Master: Removing app app-20151209211910-0001
15/12/09 21:19:14 WARN Master: Application Spark Pi is still in progress,
it may be terminated abnormally.
15/12/09 21:19:14 WARN Master: No event logs found for application Spark Pi
in file:/home/bryan/git/spark/logs/.
15/12/09 21:19:14 INFO Master: localhost.localdomain:54174 got
disassociated, removing it.
15/12/09 21:19:14 WARN Master: Got status update for unknown executor
app-20151209211910-0001/0
15/12/09 21:21:59 WARN Master: Got status update for unknown executor
app-20151209211830-/0
15/12/09 21:22:00 INFO Master: localhost.localdomain:54163 got
disassociated, removing it.

WORKER
15/12/09 21:19:14 INFO Worker: Asked to kill executor
app-20151209211910-0001/0
15/12/09 21:19:14 INFO ExecutorRunner: Runner thread for executor
app-20151209211910-0001/0 interrupted
15/12/09 21:19:14 INFO ExecutorRunner: Killing process!
15/12/09 21:19:14 ERROR FileAppender: Error writing stream to file
/home/bryan/git/spark/work/app-20151209211910-0001/0/stderr
java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1730)
at
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/12/09 21:19:14 INFO Worker: Executor app-20151209211910-0001/0 finished
with state KILLED exitStatus 143
15/12/09 21:19:14 INFO Worker: Cleaning up local directories for
application app-20151209211910-0001
15/12/09 21:19:14 INFO ExternalShuffleBlockResolver: Application
app-20151209211910-0001 removed, cleanupLocalDirs = true


On Thu, Dec 10, 2015 at 2:45 AM, Jacek Laskowski  wrote:

> Hi,
>
> I'm on yesterday's master HEAD.
>
> Pozdrawiam,
> Jacek
>
> --
> Jacek Laskowski | https://medium.com/@jaceklaskowski/ |
> http://blog.jaceklaskowski.pl
> Mastering Spark https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
> Follow me at https://twitter.com/jaceklaskowski
> Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski
>
>
> On Thu, Dec 10, 2015 at 9:50 AM, Sasaki Kai 
> wrote:
> > Hi, Jacek
> >
> > What version of Spark do you use?
> > I started sbin/start-master.sh script as you did against master HEAD.
> But there is no warning log such you pasted.
> > While you can specify hostname with -h option, you can also omit it. The
> master name can be set automatically with
> > the name `hostname` command. You can also try it.
> >
> > Kai Sasaki
> >
> >> On Dec 10, 2015, at 5:22 PM, Jacek Laskowski  wrote:
> >>
> >> Hi,
> >>
> >> While toying with Spark Standalone I've noticed the following messages
> >> in the logs of the master:
> >>
> >> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB
> RAM
> >> INFO Master: localhost:59920 got disassociated, removing it.
> >> ...
> >> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> >> we got no heartbeat in 60 seconds
> >> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> >> on 192.168.1.6:59919
> >>
> >> Why does the message "WARN Master: Removing
> >> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> >> 60 seconds" appear when the worker should've been gone already (as
> >> pointed out in "INFO Master: localhost:59920 got disassociated,
> >> removing it.")?
> >>
> >> Could it be that the ids are different - 192.168.1.6:59919 vs
> localhost:59920?
> >>
> >> I started master using "./sbin/start-master.sh -h localhost" and the
> >> workers "./sbin/start-slave.sh 

Re: let spark streaming sample come to stop

2015-11-16 Thread Bryan Cutler
Hi Renyi,

This is the intended behavior of the streaming HdfsWordCount example.  It
makes use of a 'textFileStream', which monitors an HDFS directory for any
newly created files and pushes them into a DStream.  It is meant to run
indefinitely unless interrupted, by ctrl-c for example.
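
The core of that pattern looks roughly like this in Scala - a sketch only,
with an arbitrary directory path and batch interval:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("HdfsWordCountSketch")
  val ssc = new StreamingContext(conf, Seconds(2))
  // only files created in the directory after the stream starts are picked up
  val lines = ssc.textFileStream("hdfs:///tmp/streaming-input")
  val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
  counts.print()
  ssc.start()
  ssc.awaitTermination()  // runs until stopped externally, e.g. ctrl-c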

-bryan
On Nov 13, 2015 10:52 AM, "Renyi Xiong"  wrote:

> Hi,
>
> I try to run the following 1.4.1 sample by putting a words.txt under
> localdir
>
> bin\run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
>
> 2 questions
>
> 1. it does not pick up words.txt because it's 'old', I guess - any option
> to let it be picked up?
> 2. I managed to put a 'new' file on the fly which got picked up, but after
> processing, the program doesn't stop (keeps generating empty RDDs instead),
> any option to let it stop when no new files come in (otherwise it blocks
> others when I want to run multiple samples?)
>
> Thanks,
> Renyi.
>