Re: Welcome Xinrong Meng as a Spark committer

2022-08-09 Thread Weichen Xu
Congrats!

On Tue, Aug 9, 2022 at 5:55 PM Jungtaek Lim 
wrote:

> Congrats Xinrong! Well deserved.
>
> On Tue, Aug 9, 2022 at 5:13 PM, Hyukjin Kwon wrote:
>
>> Hi all,
>>
>> The Spark PMC recently added Xinrong Meng as a committer on the project.
>> Xinrong is the major contributor to PySpark, especially the Pandas API on Spark.
>> She has enthusiastically guided a lot of new contributors. Please join me
>> in welcoming Xinrong!
>>
>>


Re: Is RDD thread safe?

2019-11-25 Thread Weichen Xu
Emmm, I haven't checked the code, but I think if an RDD is referenced in several
places, the correct behavior should be: when the RDD's data is needed, it
is computed and then cached only once; anything else should be treated
as a bug. If you suspect there is a race condition, you could create
a JIRA ticket.
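
For illustration, a minimal sketch of the scenario (assuming Spark 2.4+ in local mode;
names and sizes are placeholders). If the compute-once behavior described above holds,
the accumulator should end up around 1,000,000 rather than 2,000,000 (modulo task retries):

  import org.apache.spark.sql.SparkSession

  object CachedRddFromTwoThreads {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("cache-once-sketch")
        .master("local[4]")
        .getOrCreate()
      val sc = spark.sparkContext

      // Count how many times the map function actually runs.
      val computeCount = sc.longAccumulator("compute-count")
      val cached = sc.parallelize(1 to 1000000, 8)
        .map { i => computeCount.add(1); i * 2 } // the "expensive" work we expect to run once
        .cache()

      // Two jobs submitted from different threads, both touching the cached RDD.
      val threads = Seq.fill(2)(new Thread(new Runnable {
        override def run(): Unit = { cached.count(); () }
      }))
      threads.foreach(_.start())
      threads.foreach(_.join())

      // If blocks are computed and cached only once, this should be roughly 1000000.
      println(s"map invocations: ${computeCount.value}")
      spark.stop()
    }
  }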

On Mon, Nov 25, 2019 at 12:21 PM Chang Chen  wrote:

> Sorry, I didn't describe it clearly. The RDD id itself is thread safe, but how
> about the cached data?
>
> See this code from BlockManager:
>
> def getOrElseUpdate(...) = {
>   get[T](blockId)(classTag) match {
>     case ...
>     case _ =>  // 1. no data is cached.
>       // Need to compute the block
>   }
>   // Initially we hold no locks on this block
>   doPutIterator(...) match { ... }
> }
>
> Considering two DAGs (containing the same cached RDD) running simultaneously:
> if both return None when they get the same block from the BlockManager (i.e. #1
> above), then I guess the same data would be cached twice.
>
> If the later cache overrides the previous data and no memory is
> wasted, then this is OK.
>
> Thanks
> Chang
>
>
> On Mon, Nov 25, 2019 at 11:52 AM, Weichen Xu wrote:
>
>> The RDD id is immutable and is generated when the RDD object is created.
>> So why would there be a race condition on the "rdd id"?
>>
>> On Mon, Nov 25, 2019 at 11:31 AM Chang Chen  wrote:
>>
>>> I am wondering about the concurrency semantics in order to reason about
>>> correctness. If two queries simultaneously run DAGs which use the same cached
>>> DF/RDD, but before the cached data has actually been materialized, what will happen?
>>>
>>> By looking into the code a little, I suspect they have different BlockIds for the
>>> same Dataset, which is unexpected behavior, but there is no race condition.
>>>
>>> However, the RDD id is not lazy, so there is a race condition.
>>>
>>> Thanks
>>> Chang
>>>
>>>
>>> On Tue, Nov 12, 2019 at 1:22 PM, Weichen Xu wrote:
>>>
>>>> Hi Chang,
>>>>
>>>> RDDs/DataFrames are immutable and lazily computed. They are thread safe.
>>>>
>>>> Thanks!
>>>>
>>>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen 
>>>> wrote:
>>>>
>>>>> Hi all
>>>>>
>>>>> I have a case where I need to cache a source RDD and then create
>>>>> different DataFrames from it in different threads to accelerate queries.
>>>>>
>>>>> I know that SparkSession is thread safe (
>>>>> https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
>>>>> whether RDD is thread safe or not.
>>>>>
>>>>> Thanks
>>>>>
>>>>


Re: Is RDD thread safe?

2019-11-24 Thread Weichen Xu
The RDD id is immutable and is generated when the RDD object is created.
So why would there be a race condition on the "rdd id"?

On Mon, Nov 25, 2019 at 11:31 AM Chang Chen  wrote:

> I am wondering about the concurrency semantics in order to reason about
> correctness. If two queries simultaneously run DAGs which use the same cached
> DF/RDD, but before the cached data has actually been materialized, what will happen?
>
> By looking into the code a little, I suspect they have different BlockIds for the
> same Dataset, which is unexpected behavior, but there is no race condition.
>
> However, the RDD id is not lazy, so there is a race condition.
>
> Thanks
> Chang
>
>
> On Tue, Nov 12, 2019 at 1:22 PM, Weichen Xu wrote:
>
>> Hi Chang,
>>
>> RDDs/DataFrames are immutable and lazily computed. They are thread safe.
>>
>> Thanks!
>>
>> On Tue, Nov 12, 2019 at 12:31 PM Chang Chen  wrote:
>>
>>> Hi all
>>>
>>> I have a case where I need to cache a source RDD and then create different
>>> DataFrames from it in different threads to accelerate queries.
>>>
>>> I know that SparkSession is thread safe (
>>> https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
>>> whether RDD is thread safe or not.
>>>
>>> Thanks
>>>
>>


Re: Is RDD thread safe?

2019-11-11 Thread Weichen Xu
Hi Chang,

RDDs/DataFrames are immutable and lazily computed. They are thread safe.

Thanks!

On Tue, Nov 12, 2019 at 12:31 PM Chang Chen  wrote:

> Hi all
>
> I have a case where I need to cache a source RDD and then create different
> DataFrames from it in different threads to accelerate queries.
>
> I know that SparkSession is thread safe (
> https://issues.apache.org/jira/browse/SPARK-15135), but I am not sure
> whether RDD is thread safe or not.
>
> Thanks
>


Re: Add spark dependency on on org.opencypher:okapi-shade.okapi

2019-10-18 Thread Weichen Xu
Attaching the design doc here:
https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit#

I think the initial design intention is to replace GraphX in Spark. At
first, we planned to merge the GraphFrames project into Spark; now the design
has evolved to include not only GraphFrames but also a graph query
(CypherQL) engine.

On Thu, Oct 17, 2019 at 1:41 AM Reynold Xin  wrote:

> Just curious - did we discuss why this shouldn't be another Apache sister
> project?
>
>
> On Wed, Oct 16, 2019 at 10:21 AM, Sean Owen  wrote:
>
>> We don't all have to agree on whether to add this -- there are like 10
>> people with an opinion -- and I certainly would not veto it. In practice a
>> medium-sized change needs someone to review/merge it all the way through,
>> and nobody strongly objecting. I too don't know what to make of the
>> situation; what happened to the supporters here?
>>
>> I am concerned about maintenance, as inevitably any new module falls on
>> everyone to maintain to some degree, and people come and go despite their
>> intentions. But that isn't the substance of why I personally wouldn't merge
>> it. Just doesn't seem like it must live in Spark. But again this is my
>> opinion; you don't need to convince me, just need to
>> (re?)-convince a shepherd, sponsor for this change.
>>
>> Voting on the dependency part or whatever is also not important. It's a
>> detail, and already merged even.
>>
>> The issue to hand is: if nobody supports reviewing and merging the rest
>> of the change, what then? we can't leave it half implemented. The fallback
>> plan is just to back it out and reconsider later. This would be a poor
>> outcome process-wise, but better than leaving it incomplete.
>>
>> On Wed, Oct 16, 2019 at 3:15 AM Martin Junghanns
>>  wrote:
>>
>> I'm slightly confused about this discussion. I worked on all of the
>> aforementioned PRs: the module PR that has been merged, the current PR that
>> introduces the Graph API and the PoC PR that contains the full
>> implementation. The issues around shading were addressed and the module PR
>> eventually got merged. Two PMC members including the SPIP shepherd are
>> working with me (and others) on the current API PR. The SPIP to bring Spark
>> Graph into Apache Spark itself has been successfully voted on earlier this
>> year. I presented this work at Spark Summit in San Francisco in May and was
>> asked by the organizers to present the topic at the European Spark Summit.
>> I'm currently sitting in the speakers room of that conference preparing for
>> the talk and reading this thread. I hope you understand my confusion.
>>
>> I admit - and Xiangrui pointed this out in the other thread, too - that
>> we made the early mistake of not bringing more Spark committers on board
>> which led to a stagnation period during summer when Xiangrui wasn't around
>> to help review and bring progress to the project.
>>
>> Sean, if your concern is the lack of maintainers of that module, I
>> personally would like to volunteer to maintain Spark Graph. I'm also a
>> contributor to the Okapi stack and am able to work on whatever issue might
>> come up on that end including updating dependencies etc. FWIW, Okapi is
>> actively maintained by a team at Neo4j.
>>
>> Best, Martin
>>
>> On Wed, 16 Oct 2019, 4:35 AM Sean Owen  wrote:
>>
>> I do not have a very informed opinion here, so take this with a grain of
>> salt.
>>
>> I'd say that we need to either commit a coherent version of this for
>> Spark 3, or not at all. If it doesn't have support, I'd back out the
>> existing changes.
>> I was initially skeptical about how much this needs to be in Spark vs a
>> third-party package, and that still stands.
>>
>> The addition of another dependency isn't that big a deal IMHO, but, yes,
>> it does add something to the maintenance overhead. But that's all the more
>> true of a new module.
>>
>> I don't feel strongly about it, but if this isn't obviously getting
>> support from any committers, can we keep it as a third party library for
>> now?
>>
>> On Tue, Oct 15, 2019 at 8:53 PM Weichen Xu 
>> wrote:
>>
>> Hi Mats Rydberg,
>>
>> Although this dependency "org.opencypher:okapi-shade.okapi" was already added
>> to Spark, Xiangrui raised two concerns about it (see the mail above), so
>> we'd better rethink this and consider whether it is a good choice. That is
>> why I am calling this vote.
>>
>> Thanks!
>>
>> On Tue, Oct 15, 2019 at 10:56 PM Mats Rydberg 
>> wrote:
>>

Re: Add spark dependency on on org.opencypher:okapi-shade.okapi

2019-10-15 Thread Weichen Xu
Hi Mats Rydberg,

Although this dependency "org.opencypher:okapi-shade.okapi" was already added to
Spark, Xiangrui raised two concerns about it (see the mail above), so we'd
better rethink this and consider whether it is a good choice. That is why I am
calling this vote.

Thanks!

On Tue, Oct 15, 2019 at 10:56 PM Mats Rydberg 
wrote:

> Hello Weichen, community
>
> I'm sorry, I'm feeling a little bit confused about this vote. Is this
> about the PR (https://github.com/apache/spark/pull/24490) that was merged
> in early June and introduced the spark-graph module including the
> okapi-shade dependency?
>
> Regarding the okapi-shade dependency which was developed as part of the
> above PR work, some advice was offered by Scala experts at TripleQuote
> which helped find a satisfactory solution. The shading mechanism used is
> standard and very comparable to a Java library shading solution.
>
> The PR you link (https://github.com/apache/spark/pull/24297) is not meant
> for merging. It is just a proof-of-concept branch containing a full
> implementation of the system, which is kept up-to-date with the API
> discussion on the currently proposed PR:
> https://github.com/apache/spark/pull/24851.
>
> Thank you
> Mats
>
>
> On Tue, Oct 15, 2019 at 10:38 AM Weichen Xu 
> wrote:
>
>> Hi everyone,
>>
>> I'd like to call a new vote on the issue: should we add the dependency
>> "org.opencypher:okapi-shade.okapi" to Spark? The background is:
>>
>> Spark is going to add a big feature, "Spark Graph"; the prototype
>> implementation is here:
>> https://github.com/apache/spark/pull/24297
>> It will introduce the dependency org.opencypher:okapi-shade.okapi
>> <https://github.com/opencypher/morpheus/blob/master/okapi-shade/build.gradle>
>>
>> Xiangrui already mentioned 2 concerns on this dependency change:
>>
>>> On the technical side, my main concern is the runtime dependency on
>>> org.opencypher:okapi-shade.okapi depends on several Scala libraries. We
>>> came out with the solution to shade a few Scala libraries to avoid
>>> pollution. However, I'm not super confident that the approach is
>>> sustainable for two reasons: a) there exists no proper shading libraries
>>> for Scala, 2) We will have to wait for upgrades from those Scala libraries
>>> before we can upgrade Spark to use a newer Scala version. So it would be
>>> great if some Scala experts can help review the current implementation and
>>> help assess the risk.
>>
>>
>> So let's discuss and vote on whether this is a good choice.
>> For the Spark Graph feature to get into Spark ASAP, this issue should
>> be resolved first.
>>
>> This vote is open until next Tuesday (Oct. 22).
>>
>> [ ] +1: Accept the proposal
>> [ ] +0
>> [ ] -1: I don't think this is a good idea because ...
>>
>> Thank you!
>>
>> Weichen
>>
>>


Add spark dependency on on org.opencypher:okapi-shade.okapi

2019-10-15 Thread Weichen Xu
Hi everyone,

I'd like to call a new vote on the issue: should we add the dependency
"org.opencypher:okapi-shade.okapi" to Spark? The background is:

Spark is going to add a big feature, "Spark Graph"; the prototype
implementation is here:
https://github.com/apache/spark/pull/24297
It will introduce the dependency org.opencypher:okapi-shade.okapi


Xiangrui already mentioned 2 concerns on this dependency change:

> On the technical side, my main concern is the runtime dependency on
> org.opencypher:okapi-shade.okapi depends on several Scala libraries. We
> came out with the solution to shade a few Scala libraries to avoid
> pollution. However, I'm not super confident that the approach is
> sustainable for two reasons: a) there exists no proper shading libraries
> for Scala, 2) We will have to wait for upgrades from those Scala libraries
> before we can upgrade Spark to use a newer Scala version. So it would be
> great if some Scala experts can help review the current implementation and
> help assess the risk.
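
To make the relocation part concrete, below is a minimal sketch of bytecode-level
package relocation with sbt-assembly's ShadeRule. This is only an illustration of the
general mechanism: the actual okapi-shade artifact is built with Gradle (see the link
above), and the package names are placeholders rather than the libraries okapi really
bundles. Note that this kind of relocation rewrites class files but not Scala-specific
metadata, which is part of why shading Scala libraries is considered fragile.

  // build.sbt (requires addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "<version>")
  // in project/plugins.sbt). Placeholder package names, for illustration only.
  assemblyShadeRules in assembly := Seq(
    // Rewrite the bundled libraries into a private namespace so they cannot
    // clash with the versions already on Spark's classpath.
    ShadeRule.rename("some.scala.lib.**" -> "org.opencypher.okapi.shaded.some.scala.lib.@1").inAll,
    ShadeRule.rename("another.dep.**" -> "org.opencypher.okapi.shaded.another.dep.@1").inAll
  )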


So let's discuss and vote on whether this is a good choice.
For the Spark Graph feature to get into Spark ASAP, this issue should
be resolved first.

This vote is open until next Tuesday (Oct. 22).

[ ] +1: Accept the proposal
[ ] +0
[ ] -1: I don't think this is a good idea because ...

Thank you!

Weichen


Re: Spark 3.0 preview release feature list and major changes

2019-10-10 Thread Weichen Xu
Wait... I have a few additions:

*New API:*
SPARK-25097 Support prediction on single instance in KMeans/BiKMeans/GMM
SPARK-28045 add missing RankingEvaluator
SPARK-29121 Support Dot Product for Vectors

*Behavior change or new API with behavior change:*
SPARK-23265 Update multi-column error handling logic in QuantileDiscretizer
SPARK-22798 Add multiple column support to PySpark StringIndexer
SPARK-11215 Add multiple columns support to StringIndexer
SPARK-24102 RegressionEvaluator should use sample weight data
SPARK-24101 MulticlassClassificationEvaluator should use sample weight data
SPARK-24103 BinaryClassificationEvaluator should use sample weight data
SPARK-23469 HashingTF should use corrected MurmurHash3 implementation

*Deprecated API removal:*
SPARK-25382 Remove ImageSchema.readImages in 3.0
SPARK-26133 Remove deprecated OneHotEncoder and rename
OneHotEncoderEstimator to OneHotEncoder
SPARK-25867 Remove KMeans computeCost
SPARK-28243 remove setFeatureSubsetStrategy and setSubsamplingRate from
Python TreeEnsembleParams
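
To make two of the items above concrete -- the multi-column StringIndexer (SPARK-11215)
and the OneHotEncoderEstimator rename (SPARK-26133) -- a brief sketch assuming the
Spark 3.0 ml API as released (column names are placeholders):

  import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

  // SPARK-11215: StringIndexer now supports multiple columns in one pass.
  val indexer = new StringIndexer()
    .setInputCols(Array("color", "country"))
    .setOutputCols(Array("colorIdx", "countryIdx"))

  // SPARK-26133: what was OneHotEncoderEstimator in 2.x is now OneHotEncoder.
  val encoder = new OneHotEncoder()
    .setInputCols(Array("colorIdx", "countryIdx"))
    .setOutputCols(Array("colorVec", "countryVec"))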

Thanks!

Weichen

On Fri, Oct 11, 2019 at 6:11 AM Xingbo Jiang  wrote:

> Hi all,
>
> Here is the updated feature list:
>
>
> SPARK-11215  Multiple
> columns support added to various Transformers: StringIndexer
>
> SPARK-11150  Implement
> Dynamic Partition Pruning
>
> SPARK-13677  Support
> Tree-Based Feature Transformation
>
> SPARK-16692  Add
> MultilabelClassificationEvaluator
>
> SPARK-19591  Add
> sample weights to decision trees
>
> SPARK-19712  Pushing
> Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
>
> SPARK-19827  R API for
> Power Iteration Clustering
>
> SPARK-20286  Improve
> logic for timing out executors in dynamic allocation
>
> SPARK-20636  Eliminate
> unnecessary shuffle with adjacent Window expressions
>
> SPARK-22148  Acquire
> new executors to avoid hang because of blacklisting
>
> SPARK-22796  Multiple
> columns support added to various Transformers: PySpark QuantileDiscretizer
>
> SPARK-23128  A new
> approach to do adaptive execution in Spark SQL
>
> SPARK-23155  Apply
> custom log URL pattern for executor log URLs in SHS
>
> SPARK-23539  Add
> support for Kafka headers
>
> SPARK-23674  Add Spark
> ML Listener for Tracking ML Pipeline Status
>
> SPARK-23710  Upgrade
> the built-in Hive to 2.3.5 for hadoop-3.2
>
> SPARK-24333  Add fit
> with validation set to Gradient Boosted Trees: Python API
>
> SPARK-24417  Build and
> Run Spark on JDK11
>
> SPARK-24615 
> Accelerator-aware task scheduling for Spark
>
> SPARK-24920  Allow
> sharing Netty's memory pool allocators
>
> SPARK-25250  Fix race
> condition with tasks running when new attempt for same stage is created
> leads to other task in the next attempt running on the same partition id
> retry multiple times
>
> SPARK-25341  Support
> rolling back a shuffle map stage and re-generate the shuffle files
>
> SPARK-25348  Data
> source for binary files
>
> SPARK-25390  data
> source V2 API refactoring
>
> SPARK-25501  Add Kafka
> delegation token support
>
> SPARK-25603 
> Generalize Nested Column Pruning
>
> SPARK-26132  Remove
> support for Scala 2.11 in Spark 3.0.0
>
> SPARK-26215  define
> reserved keywords after SQL standard
>
> SPARK-26412  Allow
> Pandas UDF to take an iterator of pd.DataFrames
>
> SPARK-26651  Use
> Proleptic Gregorian calendar
>
> SPARK-26759  Arrow
> optimization in SparkR's interoperability
>
> SPARK-26848 

Re: [DISCUSS] Migrate development scripts under dev/ from Python2 to Python 3

2019-08-07 Thread Weichen Xu
All right, we can support both Python 2 and Python 3 for Spark 3.0.

On Wed, Aug 7, 2019 at 6:10 PM Hyukjin Kwon  wrote:

> We haven't dropped Python 2 yet, although it's deprecated. So I think it should
> support both Python 2 and Python 3 for now.
>
> On Wed, Aug 7, 2019 at 6:54 PM, Weichen Xu wrote:
>
>> Hi all,
>>
>> I would like to discuss compatibility for the dev scripts. Because we
>> already decided to deprecate Python 2 in Spark 3.0, for the development scripts
>> under dev/ we have two choices:
>> 1) Migrate from Python 2 to Python 3
>> 2) Support both Python 2 and Python 3
>>
>> I lean toward option (2), which is friendlier for maintenance.
>>
>> Regards,
>> Weichen
>>
>


[DISCUSS] Migrate development scripts under dev/ from Python2 to Python 3

2019-08-07 Thread Weichen Xu
Hi all,

I would like to discuss compatibility for the dev scripts. Because we
already decided to deprecate Python 2 in Spark 3.0, for the development scripts
under dev/ we have two choices:
1) Migrate from Python 2 to Python 3
2) Support both Python 2 and Python 3

I lean toward option (2), which is friendlier for maintenance.

Regards,
Weichen


Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Weichen Xu
+1, nice feature!
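
For context, a rough sketch of what this looks like for users, assuming the
configuration keys and TaskContext API that eventually shipped with SPARK-24615 in
Spark 3.0 (the discovery script path is a placeholder):

  import org.apache.spark.TaskContext
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("gpu-scheduling-sketch")
    .config("spark.executor.resource.gpu.amount", "2")
    .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
    .config("spark.task.resource.gpu.amount", "1")  // each task requests one GPU
    .getOrCreate()

  spark.sparkContext.parallelize(1 to 8, 8).map { _ =>
    // The scheduler tells each task which GPU addresses it was assigned.
    val gpus = TaskContext.get().resources()("gpu").addresses
    s"task GPUs: ${gpus.mkString(",")}"
  }.collect().foreach(println)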

On Sat, Mar 2, 2019 at 6:11 AM Yinan Li  wrote:

> +1
>
> On Fri, Mar 1, 2019 at 12:37 PM Tom Graves 
> wrote:
>
>> +1 for the SPIP.
>>
>> Tom
>>
>> On Friday, March 1, 2019, 8:14:43 AM CST, Xingbo Jiang <
>> jiangxb1...@gmail.com> wrote:
>>
>>
>> Hi all,
>>
>> I want to call for a vote of SPARK-24615
>> . It improves Spark
>> by making it aware of GPUs exposed by cluster managers, and hence Spark can
>> match GPU resources with user task requests properly. The proposal
>> 
>>  and production doc
>> 
>>  was
>> made available on dev@ to collect input. You can also find a design
>> sketch at SPARK-27005 
>> .
>>
>> The vote will be up for the next 72 hours. Please reply with your vote:
>>
>> +1: Yeah, let's go forward and implement the SPIP.
>> +0: Don't really care.
>> -1: I don't think this is a good idea because of the following technical
>> reasons.
>>
>> Thank you!
>>
>> Xingbo
>>
>


Re: [VOTE] SPARK 2.4.0 (RC1)

2018-09-20 Thread Weichen Xu
We need to merge this:
https://github.com/apache/spark/pull/22492
Otherwise MLeap cannot build against Spark 2.4.0.
Thanks!

On Wed, Sep 19, 2018 at 1:16 PM Yinan Li  wrote:

> FYI: SPARK-23200 has been resolved.
>
> On Tue, Sep 18, 2018 at 8:49 AM Felix Cheung 
> wrote:
>
>> If we could work on this quickly - it might get on to future RCs.
>>
>>
>>
>> --
>> *From:* Stavros Kontopoulos 
>> *Sent:* Monday, September 17, 2018 2:35 PM
>> *To:* Yinan Li
>> *Cc:* Xiao Li; eerla...@redhat.com; van...@cloudera.com.invalid; Sean
>> Owen; Wenchen Fan; dev
>> *Subject:* Re: [VOTE] SPARK 2.4.0 (RC1)
>>
>> Hi Xiao,
>>
>> I just tested it, it seems ok. There are some questions about which
>> properties we should keep when restoring the config. Otherwise it looks ok
>> to me.
>> The reason this should go in 2.4 is that streaming on k8s is something
>> people want to try day one (or at least it is cool to try) and since 2.4
>> comes with k8s support being refactored a lot,
>> it would be disappointing not to have it in...IMHO.
>>
>> Best,
>> Stavros
>>
>> On Mon, Sep 17, 2018 at 11:13 PM, Yinan Li  wrote:
>>
>>> We can merge the PR and get SPARK-23200 resolved if the whole point is
>>> to make streaming on k8s work first. But given that this is not a blocker
>>> for 2.4, I think we can take a bit more time here and get it right. With
>>> that being said, I would expect it to be resolved soon.
>>>
>>> On Mon, Sep 17, 2018 at 11:47 AM Xiao Li  wrote:
>>>
 Hi, Erik and Stavros,

 This bug fix SPARK-23200 is not a blocker of the 2.4 release. It sounds
 important for the Streaming on K8S. Could the K8S oriented committers speed
 up the reviews?

 Thanks,

 Xiao

 On Mon, Sep 17, 2018 at 11:04 AM, Erik Erlandson wrote:

>
> I have no binding vote but I second Stavros’ recommendation for
> spark-23200
>
> Per parallel threads on Py2 support I would also like to propose
> deprecating Py2 starting with this 2.4 release
>
> On Mon, Sep 17, 2018 at 10:38 AM Marcelo Vanzin
>  wrote:
>
>> You can log in to https://repository.apache.org and see what's wrong.
>> Just find that staging repo and look at the messages. In your case it
>> seems related to your signature.
>>
>> failureMessageNo public key: Key with id: () was not able to be
>> located on http://gpg-keyserver.de/. Upload your public key and try
>> the operation again.
>> On Sun, Sep 16, 2018 at 10:00 PM Wenchen Fan 
>> wrote:
>> >
>> > I confirmed that
>> https://repository.apache.org/content/repositories/orgapachespark-1285
>> is not accessible. I did it via ./dev/create-release/do-release-docker.sh
>> -d /my/work/dir -s publish , not sure what's going wrong. I didn't see 
>> any
>> error message during it.
>> >
>> > Any insights are appreciated! So that I can fix it in the next RC.
>> Thanks!
>> >
>> > On Mon, Sep 17, 2018 at 11:31 AM Sean Owen 
>> wrote:
>> >>
>> >> I think one build is enough, but haven't thought it through. The
>> >> Hadoop 2.6/2.7 builds are already nearly redundant. 2.12 is
>> probably
>> >> best advertised as a 'beta'. So maybe publish a no-hadoop build of
>> it?
>> >> Really, whatever's the easy thing to do.
>> >> On Sun, Sep 16, 2018 at 10:28 PM Wenchen Fan 
>> wrote:
>> >> >
>> >> > Ah I missed the Scala 2.12 build. Do you mean we should publish
>> a Scala 2.12 build this time? Current for Scala 2.11 we have 3 builds: 
>> with
>> hadoop 2.7, with hadoop 2.6, without hadoop. Shall we do the same thing 
>> for
>> Scala 2.12?
>> >> >
>> >> > On Mon, Sep 17, 2018 at 11:14 AM Sean Owen 
>> wrote:
>> >> >>
>> >> >> A few preliminary notes:
>> >> >>
>> >> >> Wenchen for some weird reason when I hit your key in gpg
>> --import, it
>> >> >> asks for a passphrase. When I skip it, it's fine, gpg can still
>> verify
>> >> >> the signature. No issue there really.
>> >> >>
>> >> >> The staging repo gives a 404:
>> >> >>
>> https://repository.apache.org/content/repositories/orgapachespark-1285/
>> >> >> 404 - Repository "orgapachespark-1285 (staging: open)"
>> >> >> [id=orgapachespark-1285] exists but is not exposed.
>> >> >>
>> >> >> The (revamped) licenses are OK, though there are some minor
>> glitches
>> >> >> in the final release tarballs (my fault) : there's an extra
>> directory,
>> >> >> and the source release has both binary and source licenses.
>> I'll fix
>> >> >> that. Not strictly necessary to reject the release over those.
>> >> >>
>> >> >> Last, when I check the staging repo I'll get my answer, but,
>> were you
>> >> >> able to build 2.12 artifacts as well?
>> >> >>
>> >> >> On Sun, Sep 16, 2018 at 9:48 PM Wenchen Fan <
>> cloud0...@gmail.com> wrote:
>> >> >> >

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-03 Thread Weichen Xu
+1
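
For context, a minimal sketch of the barrier execution mode this SPIP proposes, using
the API names that later shipped in Spark 2.4; the partition function body is a
placeholder for e.g. launching a distributed DL worker:

  import org.apache.spark.{BarrierTaskContext, SparkContext}

  def runBarrierJob(sc: SparkContext): Unit = {
    sc.parallelize(1 to 4, 4)
      .barrier()                       // all 4 tasks are launched together, or none are
      .mapPartitions { iter =>
        val ctx = BarrierTaskContext.get()
        // ... set up the external DL/AI worker for this partition here ...
        ctx.barrier()                  // global synchronization point across tasks
        iter
      }
      .collect()
  }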

On Fri, Jun 1, 2018 at 3:41 PM, Xiao Li  wrote:

> +1
>
> 2018-06-01 15:41 GMT-07:00 Xingbo Jiang :
>
>> +1
>>
>> 2018-06-01 9:21 GMT-07:00 Xiangrui Meng :
>>
>>> Hi all,
>>>
>>> I want to call for a vote of SPARK-24374
>>> . It introduces a
>>> new execution mode to Spark, which would help both integration with
>>> external DL/AI frameworks and MLlib algorithm performance. This is one of
>>> the follow-ups from a previous discussion on dev@
>>> 
>>> .
>>>
>>> The vote will be up for the next 72 hours. Please reply with your vote:
>>>
>>> +1: Yeah, let's go forward and implement the SPIP.
>>> +0: Don't really care.
>>> -1: I don't think this is a good idea because of the following technical
>>> reasons.
>>>
>>> Best,
>>> Xiangrui
>>> --
>>>
>>> Xiangrui Meng
>>>
>>> Software Engineer
>>>
>>> Databricks Inc. [image: http://databricks.com] 
>>>
>>
>>
>


Re: [MLLib] Logistic Regression and standardization

2018-04-20 Thread Weichen Xu
Right. If the regularization term isn't zero, then enabling/disabling
standardization will give different results.
But when comparing results between R's glmnet and MLlib, if we set the same
parameters for regularization/standardization/..., then we should get the
same result. If not, then maybe there's a bug. In that case you can paste
your test code and I can help fix it.
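
For example, a minimal sketch of such a comparison, assuming an active SparkSession
`spark` and the sample dataset shipped with Spark: with a non-zero L1 penalty the two
settings genuinely differ, while with regParam = 0 they should converge to the same
coefficients.

  import org.apache.spark.ml.classification.LogisticRegression

  val train = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

  val withStd = new LogisticRegression()
    .setRegParam(0.1).setElasticNetParam(1.0)   // L1 regularization
    .setStandardization(true)
    .fit(train)

  val withoutStd = new LogisticRegression()
    .setRegParam(0.1).setElasticNetParam(1.0)
    .setStandardization(false)
    .fit(train)

  // Expect different coefficients / sparsity patterns here, because the L1
  // penalty is applied on different scales.
  println(withStd.coefficients)
  println(withoutStd.coefficients)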

On Sat, Apr 21, 2018 at 1:06 AM, Valeriy Avanesov <acop...@gmail.com> wrote:

> Hi all.
>
> Filipp, do you use L1/L2/elastic-net penalization? I believe in this case
> standardization matters.
>
> Best,
>
> Valeriy.
>
> On 04/17/2018 11:40 AM, Weichen Xu wrote:
>
> Not a bug.
>
> When standardization is disabled, MLlib LR will still standardize the
> features internally, but it will scale the coefficients back at the end (after
> training finishes). So it will get the same result as training without
> standardization. The purpose of this is to improve the rate of convergence. So
> the result should always be exactly the same as R's glmnet, whether
> standardization is enabled or disabled.
>
> Thanks!
>
> On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang <yblia...@gmail.com> wrote:
>
>> Hi Filipp,
>>
>> MLlib’s LR implementation handles standardization the same way as R’s
>> glmnet.
>> Actually you don’t need to care about the implementation detail, as the
>> coefficients are always returned on the original scale, so it should
>> return the same result as other popular ML libraries.
>> Could you point me to where glmnet doesn’t scale features?
>> I suspect other issues caused your prediction quality to drop. If you can
>> share the code and data, I can help check it.
>>
>> Thanks
>> Yanbo
>>
>>
>> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin <filipp.zhin...@gmail.com>
>> wrote:
>>
>> Hi all,
>>
>> While migrating from a custom LR implementation to MLlib's LR
>> implementation, my colleagues noticed that prediction quality dropped
>> (according to different business metrics).
>> It turned out that this issue is caused by the feature standardization
>> performed by MLlib's LR: regardless of the 'standardization' option's value, all
>> features are scaled during loss and gradient computation (as well as in a few
>> other places): https://github.com/apache/spark/blob/6cc7021a40b64c
>> 41a51f337ec4be9545a25e838c/mllib/src/main/scala/org/
>> apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
>>
>> According to comments in the code, standardization should be implemented
>> the same way it is implemented in R's glmnet package. I've looked through the
>> corresponding Fortran code, and it seems that glmnet doesn't scale features
>> when standardization is disabled (but MLlib still does).
>>
>> Our models contain multiple one-hot encoded features, and scaling them is
>> a pretty bad idea.
>>
>> Why does MLlib's LR always scale all features? From my POV it's a bug.
>>
>> Thanks in advance,
>> Filipp.
>>
>>
>>
>
>


Re: [MLLib] Logistic Regression and standardization

2018-04-17 Thread Weichen Xu
Not a bug.

When standardization is disabled, MLlib LR will still standardize the
features internally, but it will scale the coefficients back at the end (after
training finishes). So it will get the same result as training without
standardization. The purpose of this is to improve the rate of convergence. So
the result should always be exactly the same as R's glmnet, whether
standardization is enabled or disabled.

Thanks!

On Sat, Apr 14, 2018 at 2:21 AM, Yanbo Liang  wrote:

> Hi Filipp,
>
> MLlib’s LR implementation handles standardization the same way as R’s
> glmnet.
> Actually you don’t need to care about the implementation detail, as the
> coefficients are always returned on the original scale, so it should
> return the same result as other popular ML libraries.
> Could you point me to where glmnet doesn’t scale features?
> I suspect other issues caused your prediction quality to drop. If you can
> share the code and data, I can help check it.
>
> Thanks
> Yanbo
>
>
> On Apr 8, 2018, at 1:09 PM, Filipp Zhinkin 
> wrote:
>
> Hi all,
>
> While migrating from a custom LR implementation to MLlib's LR implementation,
> my colleagues noticed that prediction quality dropped (according to
> different business metrics).
> It turned out that this issue is caused by the feature standardization
> performed by MLlib's LR: regardless of the 'standardization' option's value, all
> features are scaled during loss and gradient computation (as well as in a few
> other places): https://github.com/apache/spark/blob/
> 6cc7021a40b64c41a51f337ec4be9545a25e838c/mllib/src/main/
> scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
>
> According to comments in the code, standardization should be implemented
> the same way it is implemented in R's glmnet package. I've looked through the
> corresponding Fortran code, and it seems that glmnet doesn't scale features
> when standardization is disabled (but MLlib still does).
>
> Our models contain multiple one-hot encoded features, and scaling them is
> a pretty bad idea.
>
> Why does MLlib's LR always scale all features? From my POV it's a bug.
>
> Thanks in advance,
> Filipp.
>
>
>


Re: Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Weichen Xu
Congrats Zhenhua!

On Mon, Apr 2, 2018 at 5:32 PM, Gengliang  wrote:

> Congrats, Zhenhua!
>
>
>
> On Mon, Apr 2, 2018 at 5:19 PM, Marco Gaido 
> wrote:
>
>> Congrats Zhenhua!
>>
>> 2018-04-02 11:00 GMT+02:00 Saisai Shao :
>>
>>> Congrats, Zhenhua!
>>>
>>> 2018-04-02 16:57 GMT+08:00 Takeshi Yamamuro :
>>>
 Congrats, Zhenhua!

 On Mon, Apr 2, 2018 at 4:13 PM, Ted Yu  wrote:

> Congratulations, Zhenhua
>
>  Original message 
> From: 雨中漫步 <601450...@qq.com>
> Date: 4/1/18 11:30 PM (GMT-08:00)
> To: Yuanjian Li , Wenchen Fan <
> cloud0...@gmail.com>
> Cc: dev 
> Subject: Re: Welcome Zhenhua Wang as a Spark committer
>
> Congratulations Zhenhua Wang
>
>
> -- Original Message --
> *From:* "Yuanjian Li";
> *Sent:* Monday, April 2, 2018, 2:26 PM
> *To:* "Wenchen Fan";
> *Cc:* "Spark dev list";
> *Subject:* Re: Welcome Zhenhua Wang as a Spark committer
>
> Congratulations Zhenhua!!
>
> 2018-04-02 13:28 GMT+08:00 Wenchen Fan :
>
>> Hi all,
>>
>> The Spark PMC recently added Zhenhua Wang as a committer on the
>> project. Zhenhua is the major contributor of the CBO project, and has 
>> been
>> contributing across several areas of Spark for a while, focusing 
>> especially
>> on analyzer, optimizer in Spark SQL. Please join me in welcoming Zhenhua!
>>
>> Wenchen
>>
>
>


 --
 ---
 Takeshi Yamamuro

>>>
>>>
>>
>


Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-23 Thread Weichen Xu
+1

On Fri, Feb 23, 2018 at 5:40 PM, Gengliang  wrote:

> +1
>
> On Fri, Feb 23, 2018 at 11:35 AM, Xingbo Jiang 
> wrote:
>
>> +1
>>
>> 2018-02-23 11:26 GMT+08:00 Takuya UESHIN :
>>
>>> +1
>>>
>>> On Fri, Feb 23, 2018 at 12:24 PM, Wenchen Fan 
>>> wrote:
>>>
 +1

 On Fri, Feb 23, 2018 at 6:23 AM, Sameer Agarwal 
 wrote:

> Please vote on releasing the following candidate as Apache Spark
> version 2.3.0. The vote is open until Tuesday February 27, 2018 at 8:00:00
> am UTC and passes if a majority of at least 3 PMC +1 votes are cast.
>
>
> [ ] +1 Release this package as Apache Spark 2.3.0
>
> [ ] -1 Do not release this package because ...
>
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.0-rc5: https://github.com/apache/spar
> k/tree/v2.3.0-rc5 (992447fb30ee9ebb3cf794f2d06f4d63a2d792db)
>
> List of JIRA tickets resolved in this release can be found here:
> https://issues.apache.org/jira/projects/SPARK/versions/12339551
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-bin/
>
> Release artifacts are signed with the following key:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapache
> spark-1266/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs
> /_site/index.html
>
>
> FAQ
>
> ===
> What are the unresolved issues targeted for 2.3.0?
> ===
>
> Please see https://s.apache.org/oXKi. At the time of writing, there
> are currently no known release blockers.
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala you
> can add the staging repository to your projects resolvers and test with 
> the
> RC (make sure to clean up the artifact cache before/after so you don't end
> up building with a out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.0?
> ===
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should be
> worked on immediately. Everything else please retarget to 2.3.1 or 2.4.0 
> as
> appropriate.
>
> ===
> Why is my bug not fixed?
> ===
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from 2.2.0. That being
> said, if there is something which is a regression from 2.2.0 and has not
> been correctly targeted please ping me or a committer to help target the
> issue (you can see the open issues listed as impacting Spark 2.3.0 at
> https://s.apache.org/WmoI).
>


>>>
>>>
>>> --
>>> Takuya UESHIN
>>> Tokyo, Japan
>>>
>>> http://twitter.com/ueshin
>>>
>>
>>
>


Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-20 Thread Weichen Xu
+1

On Wed, Feb 21, 2018 at 10:07 AM, Marcelo Vanzin 
wrote:

> Done, thanks!
>
> On Tue, Feb 20, 2018 at 6:05 PM, Sameer Agarwal 
> wrote:
> > Sure, please feel free to backport.
> >
> > On 20 February 2018 at 18:02, Marcelo Vanzin 
> wrote:
> >>
> >> Hey Sameer,
> >>
> >> Mind including https://github.com/apache/spark/pull/20643
> >> (SPARK-23468)  in the new RC? It's a minor bug since I've only hit it
> >> with older shuffle services, but it's pretty safe.
> >>
> >> On Tue, Feb 20, 2018 at 5:58 PM, Sameer Agarwal 
> >> wrote:
> >> > This RC has failed due to
> >> > https://issues.apache.org/jira/browse/SPARK-23470.
> >> > Now that the fix has been merged in 2.3 (thanks Marcelo!), I'll follow
> >> > up
> >> > with an RC5 soon.
> >> >
> >> > On 20 February 2018 at 16:49, Ryan Blue  wrote:
> >> >>
> >> >> +1
> >> >>
> >> >> Build & tests look fine, checked signature and checksums for src
> >> >> tarball.
> >> >>
> >> >> On Tue, Feb 20, 2018 at 12:54 PM, Shixiong(Ryan) Zhu
> >> >>  wrote:
> >> >>>
> >> >>> I'm -1 because of the UI regression
> >> >>> https://issues.apache.org/jira/browse/SPARK-23470 : the All Jobs
> page
> >> >>> may be
> >> >>> too slow and cause "read timeout" when there are lots of jobs and
> >> >>> stages.
> >> >>> This is one of the most important pages because when it's broken,
> it's
> >> >>> pretty hard to use Spark Web UI.
> >> >>>
> >> >>>
> >> >>> On Tue, Feb 20, 2018 at 4:37 AM, Marco Gaido <
> marcogaid...@gmail.com>
> >> >>> wrote:
> >> 
> >>  +1
> >> 
> >>  2018-02-20 12:30 GMT+01:00 Hyukjin Kwon :
> >> >
> >> > +1 too
> >> >
> >> > 2018-02-20 14:41 GMT+09:00 Takuya UESHIN  >:
> >> >>
> >> >> +1
> >> >>
> >> >>
> >> >> On Tue, Feb 20, 2018 at 2:14 PM, Xingbo Jiang
> >> >> 
> >> >> wrote:
> >> >>>
> >> >>> +1
> >> >>>
> >> >>>
> >> >>> On Tue, Feb 20, 2018 at 1:09 PM, Wenchen Fan wrote:
> >> 
> >>  +1
> >> 
> >>  On Tue, Feb 20, 2018 at 12:53 PM, Reynold Xin
> >>  
> >>  wrote:
> >> >
> >> > +1
> >> >
> >> > On Feb 20, 2018, 5:51 PM +1300, Sameer Agarwal
> >> > , wrote:
> >> >>
> >> >> this file shouldn't be included?
> >> >>
> >> >> https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc4-bin/
> spark-parent_2.11.iml
> >> >
> >> >
> >> > I've now deleted this file
> >> >
> >> >> From: Sameer Agarwal 
> >> >> Sent: Saturday, February 17, 2018 1:43:39 PM
> >> >> To: Sameer Agarwal
> >> >> Cc: dev
> >> >> Subject: Re: [VOTE] Spark 2.3.0 (RC4)
> >> >>
> >> >> I'll start with a +1 once again.
> >> >>
> >> >> All blockers reported against RC3 have been resolved and the
> >> >> builds are healthy.
> >> >>
> >> >> On 17 February 2018 at 13:41, Sameer Agarwal
> >> >> 
> >> >> wrote:
> >> >>>
> >> >>> Please vote on releasing the following candidate as Apache
> >> >>> Spark
> >> >>> version 2.3.0. The vote is open until Thursday February 22,
> >> >>> 2018 at 8:00:00
> >> >>> am UTC and passes if a majority of at least 3 PMC +1 votes
> are
> >> >>> cast.
> >> >>>
> >> >>>
> >> >>> [ ] +1 Release this package as Apache Spark 2.3.0
> >> >>>
> >> >>> [ ] -1 Do not release this package because ...
> >> >>>
> >> >>>
> >> >>> To learn more about Apache Spark, please see
> >> >>> https://spark.apache.org/
> >> >>>
> >> >>> The tag to be voted on is v2.3.0-rc4:
> >> >>> https://github.com/apache/spark/tree/v2.3.0-rc4
> >> >>> (44095cb65500739695b0324c177c19dfa1471472)
> >> >>>
> >> >>> List of JIRA tickets resolved in this release can be found
> >> >>> here:
> >> >>>
> >> >>> https://issues.apache.org/jira/projects/SPARK/versions/
> 12339551
> >> >>>
> >> >>> The release files, including signatures, digests, etc. can
> be
> >> >>> found at:
> >> >>> https://dist.apache.org/repos/
> dist/dev/spark/v2.3.0-rc4-bin/
> >> >>>
> >> >>> Release artifacts are signed with the following key:
> >> >>> https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >>>
> >> >>> The staging repository for this release can be found at:
> >> >>>
> >> >>>
> >> >>> https://repository.apache.org/content/repositories/
> orgapachespark-1265/
> >> >>>
> >> >>> The documentation corresponding to 

Re: I Want to Help with MLlib Migration

2018-02-16 Thread Weichen Xu
>>The goal is to have these algorithms implemented using the Dataset API.
Currently, the implementation of these classes/algorithms uses RDDs by
wrapping the old (mllib) classes, which will eventually be deprecated (and
deleted).

It needs discussion and testing for each algorithm before doing that. Simply
migrating to a DataFrame implementation may bring a performance
regression.
If you have already implemented some algorithms on the DataFrame API and found
that they bring a performance improvement, then you can create a JIRA and I will
join the discussion.
Thanks!


On Thu, Feb 15, 2018 at 10:39 PM, Yacine Mazari  wrote:

> Thanks for the reply @srowen.
>
> >>I don't think you can move or alter the class APIs.
> Agreed. That's not my intention at all.
>
> >>There also isn't much value in copying the code. Maybe there are
> opportunities for moving some internal code.
> There will probably be some copying and moving internal code, but this is
> not the main purpose.
> The goal is to have these algorithms implemented using the Dataset API.
> Currently, the implementation of these classes/algorithms uses RDDs by
> wrapping the old (mllib) classes, which will eventually be deprecated (and
> deleted).
>
> >>But in general I think all this has to wait.
> Do you have any schedule or plan in mind? If deprecation is targeted for
> 3.0, then we roughly have 1.5 years.
> On the other hand, the current situation prevents us from making
> improvements to the existing classes. For example, I'd like to add
> maxDocFreq
> to ml.feature.IDF to make it similar to scikit-learn, but that's hard to do
> because it's just a wrapper around mllib.feature.IDF.
>
>
> Thank you for the discussion.
> Yacine.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Hinge Gradient

2017-12-16 Thread Weichen Xu
Hi Deb,

In which library or paper did you find this loss function used for SVM?

But I prefer the implementation in LIBLINEAR, which uses a coordinate descent
optimizer.

Thanks.
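
For reference, the soft-max smoothing proposed below amounts to replacing the hinge
with a softplus. Writing z = (2y - 1) f_w(x) as in the quoted message, a sketch of the
derivation:

  % Hinge loss and its two-argument soft-max (softplus) smoothing:
  \ell_{\mathrm{hinge}}(z) = \max(0,\; 1 - z),
  \qquad
  \ell_{\mathrm{soft}}(z) = \log\!\bigl(1 + e^{\,1 - z}\bigr)

  % Gradient w.r.t. w (smooth everywhere, so LBFGS/OWLQN apply directly):
  \nabla_w\, \ell_{\mathrm{soft}} = -(2y - 1)\,\sigma(1 - z)\, x,
  \qquad
  \sigma(t) = \frac{1}{1 + e^{-t}}

  % As z \to -\infty the gradient approaches the hinge gradient -(2y - 1) x,
  % and as z \to +\infty it decays to 0.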

On Sun, Dec 17, 2017 at 6:52 AM, Yanbo Liang  wrote:

> Hello Deb,
>
> Optimizing a non-smooth function with LBFGS really should be considered
> carefully.
> Is there any literature showing that changing max to soft-max behaves
> well?
> I’m more than happy to see some benchmarks if you have any.
>
> + Yuhao, who did similar effort in this PR: https://github.com/apache/
> spark/pull/17862
>
> Regards
> Yanbo
>
> On Dec 13, 2017, at 12:20 AM, Debasish Das 
> wrote:
>
> Hi,
>
> I looked into the LinearSVC flow and found the gradient for hinge as
> follows:
>
> Our loss function with {0, 1} labels is max(0, 1 - (2y - 1) (f_w(x)))
> Therefore the gradient is -(2y - 1)*x
>
> max is a non-smooth function.
>
> Did we try using a ReLU/softmax function to smooth the hinge
> loss?
>
> Loss function will change to SoftMax(0, 1 - (2y-1) (f_w(x)))
>
> Since this function is smooth, gradient will be well defined and
> LBFGS/OWLQN should behave well.
>
> Please let me know if this has been tried already. If not I can run some
> benchmarks.
>
> We have soft-max in multinomial regression and it can be reused for the LinearSVC
> flow.
>
> Thanks.
> Deb
>
>
>


Re: [VOTE] Spark 2.2.1 (RC2)

2017-11-29 Thread Weichen Xu
+1

On Thu, Nov 30, 2017 at 6:27 AM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> +1
>
> SHA, MD5 and signatures look fine. Built and ran Maven tests on my Macbook.
>
> Thanks
> Shivaram
>
> On Wed, Nov 29, 2017 at 10:43 AM, Holden Karau 
> wrote:
>
>> +1 (non-binding)
>>
>> PySpark install into a virtualenv works, PKG-INFO looks correctly
>> populated (mostly checking for the pypandoc conversion there).
>>
>> Thanks for your hard work Felix (and all of the testers :)) :)
>>
>> On Wed, Nov 29, 2017 at 9:33 AM, Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Thu, Nov 30, 2017 at 1:28 AM, Kazuaki Ishizaki 
>>> wrote:
>>>
 +1 (non-binding)

 I tested it on Ubuntu 16.04 and OpenJDK8 on ppc64le. All of the tests
 for core/sql-core/sql-catalyst/mllib/mllib-local have passed.

 $ java -version
 openjdk version "1.8.0_131"
 OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.1
 6.04.3-b11)
 OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

 % build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7
 -T 24 clean package install
 % build/mvn -Phive -Phive-thriftserver -Pyarn -Phadoop-2.7 test -pl
 core -pl 'sql/core' -pl 'sql/catalyst' -pl mllib -pl mllib-local
 ...
 Run completed in 13 minutes, 54 seconds.
 Total number of tests run: 1118
 Suites: completed 170, aborted 0
 Tests: succeeded 1118, failed 0, canceled 0, ignored 6, pending 0
 All tests passed.
 [INFO] 
 
 [INFO] Reactor Summary:
 [INFO]
 [INFO] Spark Project Core . SUCCESS
 [17:13 min]
 [INFO] Spark Project ML Local Library . SUCCESS [
  6.065 s]
 [INFO] Spark Project Catalyst . SUCCESS
 [11:51 min]
 [INFO] Spark Project SQL .. SUCCESS
 [17:55 min]
 [INFO] Spark Project ML Library ... SUCCESS
 [17:05 min]
 [INFO] 
 
 [INFO] BUILD SUCCESS
 [INFO] 
 
 [INFO] Total time: 01:04 h
 [INFO] Finished at: 2017-11-30T01:48:15+09:00
 [INFO] Final Memory: 128M/329M
 [INFO] 
 
 [WARNING] The requested profile "hive" could not be activated because
 it does not exist.

 Kazuaki Ishizaki



 From:Dongjoon Hyun 
 To:Hyukjin Kwon 
 Cc:Spark dev list , Felix Cheung <
 felixche...@apache.org>, Sean Owen 
 Date:2017/11/29 12:56
 Subject:Re: [VOTE] Spark 2.2.1 (RC2)
 --



 +1 (non-binding)

 RC2 is tested on CentOS, too.

 Bests,
 Dongjoon.

 On Tue, Nov 28, 2017 at 4:35 PM, Hyukjin Kwon <*gurwls...@gmail.com*
 > wrote:
 +1

 2017-11-29 8:18 GMT+09:00 Henry Robinson <*he...@apache.org*
 >:
 (My vote is non-binding, of course).

 On 28 November 2017 at 14:53, Henry Robinson <*he...@apache.org*
 > wrote:
 +1, tests all pass for me on Ubuntu 16.04.

 On 28 November 2017 at 10:36, Herman van Hövell tot Westerflier <
 *hvanhov...@databricks.com* > wrote:
 +1

 On Tue, Nov 28, 2017 at 7:35 PM, Felix Cheung <*felixche...@apache.org*
 > wrote:
 +1

 Thanks Sean. Please vote!

 Tested various scenarios with R package. Ubuntu, Debian, Windows
 r-devel and release and on r-hub. Verified CRAN checks are clean (only 1
 NOTE!) and no leaked files (.cache removed, /tmp clean)


 On Sun, Nov 26, 2017 at 11:55 AM Sean Owen <*so...@cloudera.com*
 > wrote:
 Yes it downloads recent releases. The test worked for me on a second
 try, so I suspect a bad mirror. If this comes up frequently we can just add
 retry logic, as the closer.lua script will return different mirrors each
 time.

 The tests all pass for me on the latest Debian, so +1 for this release.

 (I committed the change to set -Xss4m for tests consistently, but this
 shouldn't block a release.)


 On Sat, Nov 25, 2017 at 12:47 PM Felix Cheung <*felixche...@apache.org*
 > wrote:
 Ah sorry digging through the history it looks like this is changed
 relatively recently and should only download previous releases.

 Perhaps we are intermittently hitting a mirror that 

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-03 Thread Weichen Xu
+1.
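
For context, a minimal sketch of the user-facing piece this adds, assuming the
Trigger.Continuous API that shipped with Spark 2.3 (brokers, topics, and the checkpoint
path are placeholders):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.Trigger

  val spark = SparkSession.builder().appName("cp-sketch").getOrCreate()

  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "in-topic")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "out-topic")
    .option("checkpointLocation", "/tmp/cp-checkpoint")
    .trigger(Trigger.Continuous("1 second"))  // checkpoint interval, not a micro-batch interval
    .start()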

On Sat, Nov 4, 2017 at 8:04 AM, Matei Zaharia 
wrote:

> +1 from me too.
>
> Matei
>
> > On Nov 3, 2017, at 4:59 PM, Wenchen Fan  wrote:
> >
> > +1.
> >
> > I think this architecture makes a lot of sense to let executors talk to
> source/sink directly, and bring very low latency.
> >
> > On Thu, Nov 2, 2017 at 9:01 AM, Sean Owen  wrote:
> > +0 simply because I don't feel I know enough to have an opinion. I have
> no reason to doubt the change though, from a skim through the doc.
> >
> >
> > On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin  wrote:
> > Earlier I sent out a discussion thread for CP in Structured Streaming:
> >
> > https://issues.apache.org/jira/browse/SPARK-20928
> >
> > It is meant to be a very small, surgical change to Structured Streaming
> to enable ultra-low latency. This is great timing because we are also
> designing and implementing data source API v2. If designed properly, we can
> have the same data source API working for both streaming and batch.
> >
> >
> > Following the SPIP process, I'm putting this SPIP up for a vote.
> >
> > +1: Let's go ahead and design / implement the SPIP.
> > +0: Don't really care.
> > -1: I do not think this is a good idea for the following reasons.
> >
> >
> >
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE][SPIP] SPARK-22026 data source v2 write path

2017-10-11 Thread Weichen Xu
+1
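
For readers skimming the thread: the hierarchy described in the proposal quoted below
(WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter) looks roughly
like the toy rendering here, which uses the names from the SPIP text rather than the
exact interfaces that later shipped:

  import org.apache.spark.sql.{Row, SaveMode}
  import org.apache.spark.sql.types.StructType

  trait WriterCommitMessage extends Serializable

  trait DataWriter[T] {
    def write(record: T): Unit
    def commit(): WriterCommitMessage            // task-level commit
    def abort(): Unit                            // task-level abort
  }

  trait DataWriterFactory[T] extends Serializable {
    def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter[T]
  }

  trait DataSourceV2Writer {
    def createWriterFactory(): DataWriterFactory[Row]
    def commit(messages: Array[WriterCommitMessage]): Unit   // job-level commit
    def abort(messages: Array[WriterCommitMessage]): Unit    // job-level abort
  }

  trait WriteSupport {
    def createWriter(jobId: String, schema: StructType, mode: SaveMode,
                     options: Map[String, String]): DataSourceV2Writer
  }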

On Thu, Oct 12, 2017 at 10:36 AM, Xiao Li  wrote:

> +1
>
> Xiao
>
> On Mon, 9 Oct 2017 at 7:31 PM Reynold Xin  wrote:
>
>> +1
>>
>> One thing with MetadataSupport - It's a bad idea to call it that unless
>> adding new functions in that trait wouldn't break source/binary
>> compatibility in the future.
>>
>>
>> On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan  wrote:
>>
>>> I'm adding my own +1 (binding).
>>>
>>> On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan 
>>> wrote:
>>>
 I'm going to update the proposal: for the last point, although the
 user-facing API (`df.write.format(...).option(...).mode(...).save()`)
 mixes data and metadata operations, we are still able to separate them in
 the data source write API. We can have a mix-in trait `MetadataSupport`
 which has a method `create(options)`, so that data sources can mix in this
 trait and provide metadata creation support. Spark will call this `create`
 method inside `DataFrameWriter.save` if the specified data source has it.

 Note that file format data sources can ignore this new trait and still
 write data without metadata(it doesn't have metadata anyway).

 With this updated proposal, I'm calling a new vote for the data source
 v2 write path.

 The vote will be up for the next 72 hours. Please reply with your vote:

 +1: Yeah, let's go forward and implement the SPIP.
 +0: Don't really care.
 -1: I don't think this is a good idea because of the following
 technical reasons.

 Thanks!

 On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan 
 wrote:

> Hi all,
>
> After we merge the infrastructure of data source v2 read path, and
> have some discussion for the write path, now I'm sending this email to 
> call
> a vote for Data Source v2 write path.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-
> Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for
> the write path:
> https://github.com/apache/spark/pull/19269
>
>
> The Data Source V1 write path asks implementations to write a
> DataFrame directly, which is painful:
> 1. Exposing upper-level API like DataFrame to Data Source API is not
> good for maintenance.
> 2. Data sources may need to preprocess the input data before writing,
> like cluster/sort the input by some columns. It's better to do the
> preprocessing in Spark instead of in the data source.
> 3. Data sources need to take care of transaction themselves, which is
> hard. And different data sources may come up with a very similar approach
> for the transaction, which leads to many duplicated codes.
>
> To solve these pain points, I'm proposing the data source v2 writing
> framework which is very similar to the reading framework, i.e.,
> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>
> Data Source V2 write path follows the existing FileCommitProtocol, and
> have task/job level commit/abort, so that data sources can implement
> transaction easier.
>
> We can create a mix-in trait for DataSourceV2Writer to specify the
> requirement for input data, like clustering and ordering.
>
> Spark provides a very simple protocol for users to connect to data
> sources. A common way to write a dataframe to data sources:
> `df.write.format(...).option(...).mode(...).save()`.
> Spark passes the options and save mode to data sources, and schedules
> the write job on the input data. And the data source should take care of
> the metadata, e.g., the JDBC data source can create the table if it 
> doesn't
> exist, or fail the job and ask users to create the table in the
> corresponding database first. Data sources can define some options for
> users to carry some metadata information like partitioning/bucketing.
>
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following
> technical reasons.
>
> Thanks!
>


>>>
>>


Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Weichen Xu
Congratulations Tejas !

On Sat, Sep 30, 2017 at 4:05 PM, Liang-Chi Hsieh  wrote:

>
> Congrats!
>
>
> Matei Zaharia wrote
> > Hi all,
> >
> > The Spark PMC recently added Tejas Patil as a committer on the
> > project. Tejas has been contributing across several areas of Spark for
> > a while, focusing especially on scalability issues and SQL. Please
> > join me in welcoming Tejas!
> >
> > Matei
> >
> > -
> > To unsubscribe e-mail:
>
> > dev-unsubscribe@.apache
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>