Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-11 Thread Joseph Bradley
ved.) >>>>>>>> >>>>>>>> — >>>>>>>> Sent from my iPhone >>>>>>>> Pardon the dumb thumb typos :) >>>>>>>> >>>>>>>> On Jan 29, 2019, at 7:30 AM, Denny Lee >>>>>>

Re: [VOTE] Spark 2.3.1 (RC4)

2018-06-04 Thread Joseph Bradley
. Everything else please retarget to an >> appropriate release. >> >> == >> But my bug isn't fixed? >> == >> >> In order to make timely releases, we will typically not hold the >> release unless the bug in questi

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Joseph Bradley
>>>> >>>> The vote will be up for the next 72 hours. Please reply with your vote: >>>> >>>> +1: Yeah, let's go forward and implement the SPIP. >>>> +0: Don't really care. >>>> -1: I don't think this is a good idea because of the

Re: [VOTE] SPIP ML Pipelines in R

2018-05-31 Thread Joseph Bradley
ahead and implement the SPIP. > > +0: Don't really care. > > -1: I do not think this is a good idea for the following reasons. > > > > Thanks, > > --Hossein > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Revisiting Online serving of Spark models?

2018-05-21 Thread Joseph Bradley
> _ > From: Felix Cheung <felixcheun...@hotmail.com> > Sent: Thursday, May 10, 2018 10:10 AM > Subject: Re: Revisiting Online serving of Spark models? > To: Holden Karau <hol...@pigscanfly.ca>, Joseph Bradley < > j

Re: Revisiting Online serving of Spark models?

2018-05-10 Thread Joseph Bradley
ects to build reliable serving > tools. > > I realize this maybe puts some of the folks in an awkward position with > their own commercial offerings, but hopefully if we make it easier for > everyone the commercial vendors can benefit as well. > > Cheers, > > Holden :) > > -

SparkR test failures in PR builder

2018-05-02 Thread Joseph Bradley
in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] ``` and suggested that it could be CRAN flakiness. I'm not familiar with CRAN, but do others have thoughts about how to fix this? Thanks! Joseph -- Joseph Bradley Software Engineer - Machine Learning

Re: [build system] jenkins master unreachable, build system currently down

2018-05-01 Thread Joseph Bradley
> UC Berkeley EECS Research / RISELab Staff Technical Lead >>> https://rise.cs.berkeley.edu >>> >> >> >> >> -- >> Shane Knapp >> UC Berkeley EECS Research / RISELab Staff Technical Lead >> https://rise.cs.berkeley.edu >> > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Possible SPIP to improve matrix and vector column type support

2018-04-18 Thread Joseph Bradley
ions to check validity before execution, for example, a matrix > multiply could check dimension match and fail fast. However, there might be > use cases for a column to contain variable shape tensors, I’m open to > discussion here. > > What do you all think? > -- > -- >
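
The fail-fast idea in the quoted proposal can be sketched in plain Python. This is only an illustration of checking shapes before doing any work, not code from the proposal or from Spark; the function name `matmul_result_shape` and the tuple-based shape representation are assumptions made for the example.

```python
def matmul_result_shape(left_shape, right_shape):
    """Return the shape of left @ right, or raise before any computation runs.

    Shapes are (rows, cols) tuples. This helper is hypothetical and only
    illustrates validating dimensions up front.
    """
    rows, inner_left = left_shape
    inner_right, cols = right_shape
    if inner_left != inner_right:
        # Fail fast: the mismatch is reported before any data is touched.
        raise ValueError(
            f"cannot multiply {left_shape} by {right_shape}: "
            f"inner dimensions {inner_left} and {inner_right} differ"
        )
    return (rows, cols)
```

A column holding variable-shape tensors, as the quoted message mentions, would not have a single shape to check this way, which is presumably why that case is left open for discussion.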

Re: Welcome Zhenhua Wang as a Spark committer

2018-04-02 Thread Joseph Bradley
on analyzer, optimizer in Spark SQL. Please join me in welcoming >>> Zhenhua! >>> > >>> > Wenchen >>> >>> - >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>> >>> > > > -- > Takuya UESHIN > Tokyo, Japan > > http://twitter.com/ueshin > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Spark.ml roadmap 2.3.0 and beyond

2018-03-20 Thread Joseph Bradley
be useful too. On Thu, Dec 7, 2017 at 3:55 PM, Stephen Boesch <java...@gmail.com> wrote: > Thanks Joseph. We can wait for post 2.3.0. > > 2017-12-07 15:36 GMT-08:00 Joseph Bradley <jos...@databricks.com>: > >> Hi Stephen, >> >> I used to post

Re: [MLlib] QuantRegForest

2018-03-09 Thread Joseph Bradley
hear from you! > > > > -- > Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Softw

Re: Welcoming some new committers

2018-03-09 Thread Joseph Bradley
ard to working >> with you all and helping out more in the future. Also, congrats to the >> other committers as well!! >> > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Spark.ml roadmap 2.3.0 and beyond

2017-12-07 Thread Joseph Bradley
s for prior releases e.g. 1.6 2.0 2.1 2.2 were available: >>> >>> 2.2.0 https://issues.apache.org/jira/browse/SPARK-18813 >>> >>> 2.1.0 https://issues.apache.org/jira/browse/SPARK-15581 >>> .. >>> >>> It seems those roadmaps were not avail

Re: [ML] Migrating transformers from mllib to ml

2017-11-07 Thread Joseph Bradley
k. >> >> Is there any reason why this has not been done so far? Is it to avoid >> code duplication? If so, is it still an issue since we are going to >> deprecate mllib from 2.3 (at least this is what I read on Spark docs)? If >> no, I can work on this. >> >>

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-07 Thread Joseph Bradley
t;>>> Streaming: >>>>>> > >>>>>> > https://issues.apache.org/jira/browse/SPARK-20928 >>>>>> > >>>>>> > It is meant to be a very small, surgical change to Structured >>>>>> Streaming to enable ultra-low latency. This is great timing because we >>>>>> are >>>>>>

Re: HashingTFModel/IDFModel in Structured Streaming

2017-10-20 Thread Joseph Bradley
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ > > --------- > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: SparkR is now available on CRAN

2017-10-20 Thread Joseph Bradley
g a number of fixes to meet the CRAN requirements and >>> Holden Karau for the 2.1.2 release. >>> >>> Thanks >>> Shivaram >>> >> >> > > > -- > Twitter: https://twitter.com/holdenkarau > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-27 Thread Joseph Bradley
This vote passes with 11 +1s (4 binding) and no +0s or -1s. +1: Sean Owen (binding) Holden Karau Denny Lee Reynold Xin (binding) Joseph Bradley (binding) Noman Khan Weichen Xu Yanbo Liang Dongjoon Hyun Matei Zaharia (binding) Vaquar Khan Thanks everyone! Joseph On Sat, Sep 23, 2017 at 4:23 PM

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-30 Thread Joseph Bradley
Congrats! On Aug 29, 2017 9:55 AM, "Felix Cheung" wrote: > Congrats! > > -- > *From:* Wenchen Fan > *Sent:* Tuesday, August 29, 2017 9:21:38 AM > *To:* Kevin Yu > *Cc:* Meisam Fathi; dev > *Subject:* Re: Welcoming

Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-10 Thread Joseph Bradley
Congrats! On Aug 8, 2017 9:31 PM, "Minho Kim" wrote: > Congrats, Hyukjin and Sameer!! > > 2017-08-09 9:55 GMT+09:00 Sandeep Joshi : > >> Congratulations Hyukjin and Sameer ! >> >> On 7 Aug 2017 9:23 p.m., "Matei Zaharia" wrote:

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-02 Thread Joseph Bradley
at those and triage. Extremely important bug >> fixes, documentation, and API tweaks that impact compatibility should be >> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. >> >> *But my bug isn't fixed!??!* >> >> In order to make timely releases, we will typically not hold the release >> unless the bug in question is a regression from 2.1.1. >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-13 Thread Joseph Bradley
> >>>>> The documentation corresponding to this release can be found at: >>>>> http://people.apache.org/~pwendell/spark-releases/spark- >>>>> 2.2.0-rc4-docs/ >>>>> >>>>> >>>>> *FAQ* >>>>> >>>>> *How can I help test this release?* >>>>> >>>>> If you are a Spark user, you can help us test this release by taking >>>>> an existing Spark workload and running on this release candidate, then >>>>> reporting any regressions. >>>>> >>>>> *What should happen to JIRA tickets still targeting 2.2.0?* >>>>> >>>>> Committers should look at those and triage. Extremely important bug >>>>> fixes, documentation, and API tweaks that impact compatibility should be >>>>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. >>>>> >>>>> *But my bug isn't fixed!??!* >>>>> >>>>> In order to make timely releases, we will typically not hold the >>>>> release unless the bug in question is a regression from 2.1.1. >>>>> >>>> >>>> >>>> -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
/tag/release-0.5.0 *Docs*: http://graphframes.github.io/ *Spark Package*: https://spark-packages.org/package/graphframes/graphframes *Source*: https://github.com/graphframes/graphframes Thanks to all contributors and to the community for feedback! Joseph -- Joseph Bradley Software Engineer

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-16 Thread Joseph Bradley
his release? > >> > >> If you are a Spark user, you can help us test this release by taking an > >> existing Spark workload and running on this release candidate, then > >> reporting any regressions. > >> > >> What should happen to JIRA tickets still targeting 2.2.0? > >> > >> Committers should look at those and triage. Extremely important bug > fixes, > >> documentation, and API tweaks that impact compatibility should be > worked on > >> immediately. Everything else please retarget to 2.3.0 or 2.2.1. > >> > >> But my bug isn't fixed!??! > >> > >> In order to make timely releases, we will typically not hold the release > >> unless the bug in question is a regression from 2.1.1. > >> > > > > > > -- > Marcelo > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-08 Thread Joseph Bradley
*What should happen to JIRA tickets still targeting 2.2.0?* >> >> Committers should look at those and triage. Extremely important bug >> fixes, documentation, and API tweaks that impact compatibility should be >> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. >> >> *But my bug isn't fixed!??!* >> >> In order to make timely releases, we will typically not hold the release >> unless the bug in question is a regression from 2.1.1. >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
>> >>>>> The documentation corresponding to this release can be found at: >>>>> http://people.apache.org/~pwendell/spark-releases/spark- >>>>> 2.2.0-rc1-docs/ >>>>> >>>>> >>>>> *FAQ* >>>>> >>>

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-04-27 Thread Joseph Bradley
t;>> existing Spark workload and running on this release candidate, then >>> reporting any regressions. >>> >>> *What should happen to JIRA tickets still targeting 2.2.0?* >>> >>> Committers should look at those and triage. Extremely important bug >>> fixes, documentation, and API tweaks that impact compatibility should be >>> worked on immediately. Everything else please retarget to 2.3.0 or 2.2.1. >>> >>> *But my bug isn't fixed!??!* >>> >>> In order to make timely releases, we will typically not hold the release >>> unless the bug in question is a regression from 2.1.1. >>> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Pull Request Made, Ignored So Far

2017-03-31 Thread Joseph Bradley
ter pages in the PR list until it’s too deep for anyone to be > expected to find otherwise. > > Best, > > John > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: SPIP docs are live

2017-03-16 Thread Joseph Bradley
t > further if needed. > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Question on Spark's graph libraries roadmap

2017-03-15 Thread Joseph Bradley
t;>> *Md. Rezaul Karim*, BSc, MSc >>>> PhD Researcher, INSIGHT Centre for Data Analytics >>>> National University of Ireland, Galway >>>> IDA Business Park, Dangan, Galway, Ireland >>>> Web: http://www.reza-analytics.eu/index.html >>>> <http://139.59.184.114/index.html> >>>> >>>> On 10 March 2017 at 12:10, Robin East <robin.e...@xense.co.uk> wrote: >>>> >>>> I would love to know the answer to that too. >>>> >>>> --- >>>> Robin East >>>> *Spark GraphX in Action* Michael Malak and Robin East >>>> Manning Publications Co. >>>> http://www.manning.com/books/spark-graphx-in-action >>>> >>>> >>>> >>>> >>>> >>>> On 9 Mar 2017, at 17:42, enzo <e...@smartinsightsfromdata.com> wrote: >>>> >>>> I am a bit confused by the current roadmap for graph and graph >>>> analytics in Apache Spark. >>>> >>>> I understand that we have had for some time two libraries (the >>>> following is my understanding - please amend as appropriate!): >>>> >>>> . GraphX, part of Spark project. This library is based on RDD and it >>>> is only accessible via Scala. It doesn’t look that this library has been >>>> enhanced recently. >>>> . GraphFrames, independent (at the moment?) library for Spark. This >>>> library is based on Spark DataFrames and accessible by Scala & Python. Last >>>> commit on GitHub was 2 months ago. >>>> >>>> GraphFrames cam about with the promise at some point to be integrated >>>> in Apache Spark. >>>> >>>> I can see other projects coming up with interesting libraries and ideas >>>> (e.g. Graphulo on Accumulo, a new project with the goal of >>>> implementing the GraphBlas building blocks for graph algorithms on top >>>> of Accumulo). >>>> >>>> Where is Apache Spark going? >>>> >>>> Where are graph libraries in the roadmap? >>>> >>>> >>>> >>>> Thanks for any clarity brought to this matter. >>>> >>>> Enzo >>>> >>>> >>>> >>>> >>>> >>>> >> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Spark Improvement Proposals

2017-02-24 Thread Joseph Bradley
> >>>>>> with multiple Committers and active users. I heard many > fantastic > >>> >>>>>> ideas. I > >>> >>>>>> believe Spark improvement proposals are good channels to collect > >>> >>>>>> the > >>> >>>>>> requirements/designs. > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> IMO, we also need to consider the priority when working on these > >>> >>>>>> items. > >>> >>>>>> Even if the proposal is accepted, it does not mean it will be > >>> >>>>>> implemented > >>> >>>>>> and merged immediately. It is not a FIFO queue. > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> Even if some PRs are merged, sometimes, we still have to revert > >>> >>>>>> them > >>> >>>>>> back, if the design and implementation are not reviewed > carefully. > >>> >>>>>> We have > >>> >>>>>> to ensure our quality. Spark is not an application software. It > is > >>> >>>>>> an > >>> >>>>>> infrastructure software that is being used by many many > companies. > >>> >>>>>> We have > >>> >>>>>> to be very careful in the design and implementation, especially > >>> >>>>>> adding/changing the external APIs. > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> When I developed the Mainframe infrastructure/middleware > software > >>> >>>>>> in > >>> >>>>>> the past 6 years, I were involved in the discussions with > >>> >>>>>> external/internal > >>> >>>>>> customers. The to-do feature list was always above 100. > Sometimes, > >>> >>>>>> the > >>> >>>>>> customers are feeling frustrated when we are unable to deliver > >>> >>>>>> them on time > >>> >>>>>> due to the resource limits and others. Even if they paid us > >>> >>>>>> billions, we > >>> >>>>>> still need to do it phase by phase or sometimes they have to > >>> >>>>>> accept the > >>> >>>>>> workarounds. That is the reality everyone has to face, I think. 
> >>> >>>>>> > >>> >>>>>> > >>> >>>>>> Thanks, > >>> >>>>>> > >>> >>>>>> > >>> >>>>>> Xiao Li > >>> >>>>>>> > >>> >>>>>>> > >>> >> > >>> > > >>> > > - > >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >>> > > >>> > >>> - > >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >>> > >> > >> > >> > >> -- > >> Ryan Blue > >> Software Engineer > >> Netflix > > > > > > > > > > -- > > Regards, > > Vaquar Khan > > +1 -224-436-0783 > > > > IT Architect / Lead Consultant > > Greater Chicago > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-02-23 Thread Joseph Bradley
implementation is visible and for lower level integration, > > What I tend to do is keep my own code in its package and try to do as thin a bridge over to it from the [private] scope. It's also important to > name things obviously, say, org.apache.spark.microsoft , so stack trace

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-15 Thread Joseph Bradley
gt;>> >>> Hi all, >>> >>> Takuya-san has recently been elected an Apache Spark committer. He's >>> been active in the SQL area and writes very small, surgical patches that >>> are high quality. Please join me in congratulating Takuya-san! >>> >>> >>> >>> >> > > > -- > Takuya UESHIN > Tokyo, Japan > > http://twitter.com/ueshin > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

PSA: Java 8 unidoc build

2017-02-06 Thread Joseph Bradley
and others who have made many fixes for this! See these sample PRs for some issues causing failures (especially around links): https://github.com/apache/spark/pull/16741 https://github.com/apache/spark/pull/16604 Thanks, Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc

Re: Feedback on MLlib roadmap process proposal

2017-01-26 Thread Joseph Bradley
work they > believe needs doing, and shepherd work initiated by others (a clear bug > report, a PR) to a resolution. Things get done by doing them, or by > building influence by doing other things the project needs doing. It isn't > a mechanical, objective process, and can't be. But it does work in a > recognizable way. > >> -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: MLlib mission and goals

2017-01-24 Thread Joseph Bradley
ory_Bandwidth_and_Machine_Balance_in_Current_High_Performance_Computers> > > -- > View this message in context: Re: MLlib mission and goals > <http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-mission-and-goals-tp20715p20754.html> >

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Joseph Bradley
mber of areas in Spark, > including > > > linear algebra, stats/maths functions in DataFrames, Python/R APIs for > > > DataFrames, dstream, and most recently Structured Streaming. > > > > > > Holden has been a long time Spark contributor and evangelist. She has > > > written a few books on Spark, as well as frequent contributions to the > > > Python API to improve its usability and performance. > > > > > > Please join me in welcoming the two! > > > > > > > > > > > > > > > > > > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

MLlib mission and goals

2017-01-23 Thread Joseph Bradley
bilities, and it will be great to hear the community's thoughts! Thanks, Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Feedback on MLlib roadmap process proposal

2017-01-23 Thread Joseph Bradley
M > Subject: Re: Feedback on MLlib roadmap process proposal > To: Seth Hendrickson <seth.hendrickso...@gmail.com> > Cc: Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org> > > > > +1 general abstractions like distributed linear algebra. > > On Thu, Ja

Feedback on MLlib roadmap process proposal

2017-01-17 Thread Joseph Bradley
munication. * This is fairly orthogonal to the SIP discussion since this proposal is more about setting release targets than about proposing future plans. Thanks! Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: ml word2vec finSynonyms return type

2017-01-05 Thread Joseph Bradley
___ > From: Asher Krim <ak...@hubspot.com> > Sent: Tuesday, January 3, 2017 11:58 PM > Subject: Re: ml word2vec finSynonyms return type > To: Felix Cheung <felixcheun...@hotmail.com> > Cc: <manojkumarsivaraj...@gmail.com>, Joseph Bradley < >

Re: Spark Improvement Proposals

2017-01-03 Thread Joseph Bradley
>> >>> run > >> >>> >>> >>> SQL > >> >>> >>> >>> commands on stream but do we really have time to do SQL > >> >>> >>> >>> processing at > >> >>> >>> >

Re: mllib metrics vs ml evaluators and how to improve apis for users

2017-01-02 Thread Joseph Bradley
(dataset: Dataset[_]): RegressionEvaluation (or > classification/multiclass etc) > > > > where the evaluation class returned will have very similar fields to the > corresponding mllib RegressionMetrics class that can be called by the user. > > > > Any thoughts/ideas about s

Re: [VOTE] Apache Spark 2.1.0 (RC5)

2016-12-16 Thread Joseph Bradley
p test this release?* >>>> >>>> If you are a Spark user, you can help us test this release by taking an >>>> existing Spark workload and running on this release candidate, then >>>> reporting any regressions. >>>> >>>> *What should happen to JIRA tickets still targeting 2.1.0?* >>>> >>>> Committers should look at those and triage. Extremely important bug >>>> fixes, documentation, and API tweaks that impact compatibility should be >>>> worked on immediately. Everything else please retarget to 2.1.1 or 2.2.0. >>>> >>>> *What happened to RC3/RC5?* >>>> >>>> They had issues with the release packaging and as a result were skipped. >>>> >>>> >> > > > -- > > Herman van Hövell > > Software Engineer > > Databricks Inc. > > hvanhov...@databricks.com > > +31 6 420 590 27 > > databricks.com > > [image: http://databricks.com] <http://databricks.com/> > -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Please limit commits for branch-2.1

2016-11-21 Thread Joseph Bradley
changes to master (not branch-2.1). Thanks everyone! Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Joseph Bradley
Hi Georg, It's true we need better documentation for this. I'd recommend checking out simple algorithms within Spark for examples: ml.feature.Tokenizer ml.regression.IsotonicRegression You should not need to put your library in Spark's namespace. The shared Params in SPARK-7146 are not

Re: Reduce the memory usage if we do same first in GradientBoostedTrees if subsamplingRate< 1.0

2016-11-15 Thread Joseph Bradley
Thanks for the suggestion. That would be faster, but less accurate in most cases. It's generally better to use a new random sample on each iteration, based on literature and results I've seen. Joseph On Fri, Nov 11, 2016 at 5:13 AM, WangJianfei < wangjianfe...@otcaix.iscas.ac.cn> wrote: > when
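
The mechanical difference between the two strategies discussed here (sampling once and reusing the subset, versus drawing a new sample on each boosting iteration) can be sketched in plain Python. This is not Spark's GradientBoostedTrees code, and it does not measure accuracy; the function names, the Bernoulli sampling, and the toy row indices are all assumptions made for illustration.

```python
import random

def sample_rows(num_rows, rate, rng):
    """Bernoulli-sample row indices: each row is kept with probability `rate`."""
    return [i for i in range(num_rows) if rng.random() < rate]

def fixed_sample_schedule(num_rows, rate, num_iterations, seed):
    """Sample once up front and reuse the same subset on every iteration."""
    rng = random.Random(seed)
    subset = sample_rows(num_rows, rate, rng)
    return [subset for _ in range(num_iterations)]

def fresh_sample_schedule(num_rows, rate, num_iterations, seed):
    """Draw a new subset on every boosting iteration."""
    rng = random.Random(seed)
    return [sample_rows(num_rows, rate, rng) for _ in range(num_iterations)]

def rows_ever_used(schedule):
    """Set of row indices that appear in at least one iteration's sample."""
    return set().union(*schedule)
```

With a fixed subset, rows outside the first sample never influence any tree; with fresh samples, far more of the dataset is seen across iterations. That is one intuition for the accuracy trade-off mentioned above, at the cost of not being able to cache a single smaller dataset.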

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Joseph Bradley
+1 On Fri, Nov 4, 2016 at 11:20 AM, Michael Armbrust wrote: > +1 > > On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 2.0.2. The vote is open until Fri, Nov 4, 2016

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Joseph Bradley
+1 On Thu, Nov 3, 2016 at 9:51 PM, Kousuke Saruta wrote: > +1 (non-binding) > > - Kousuke > > On 2016/11/03 9:40, Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark version >> 1.6.3. The vote is open until Sat, Nov 5, 2016 at

Re: [ML]Random Forest Error : Size exceeds Integer.MAX_VALUE

2016-10-05 Thread Joseph Bradley
Could you please file a bug report JIRA and also include more info about what you ran? * Random forest Param settings * dataset dimensionality, partitions, etc. Thanks! On Tue, Oct 4, 2016 at 10:44 PM, Samkit Shah wrote: > Hello folks, > I am running Random Forest from ml

Re: welcoming Xiao Li as a committer

2016-10-05 Thread Joseph Bradley
Congrats! On Tue, Oct 4, 2016 at 4:09 PM, Kousuke Saruta wrote: > Congratulations Xiao! > > - Kousuke > On 2016/10/05 7:44, Bryan Cutler wrote: > > Congrats Xiao! > > On Tue, Oct 4, 2016 at 11:14 AM, Holden Karau > wrote: > >> Congratulations :D

Re: Nominal Attribute

2016-10-03 Thread Joseph Bradley
There are plans...but not concrete ones yet: https://issues.apache.org/jira/browse/SPARK-8515 I agree categorical data handling is a pain point and that we need to improve it! On Tue, Sep 13, 2016 at 4:45 PM, Danil Kirsanov wrote: > NominalAttribute in MLib is used to

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Joseph Bradley
+1 On Thu, Sep 29, 2016 at 2:11 PM, Dongjoon Hyun wrote: > +1 (non-binding) > > At this time, I tested RC4 on the followings. > > - CentOS 6.8 (Final) > - OpenJDK 1.8.0_101 > - Python 2.7.12 > > /build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver >

Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Joseph Bradley
+1 for 4 months. With QA taking about a month, that's very reasonable. My main ask (especially for MLlib) is for contributors and committers to take extra care not to delay on updating the Programming Guide for new APIs. Documentation debt often collects and has to be paid off during QA, and a

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Joseph Bradley
+1 On Mon, Sep 26, 2016 at 7:47 AM, Denny Lee wrote: > +1 (non-binding) > On Sun, Sep 25, 2016 at 23:20 Jeff Zhang wrote: > >> +1 >> >> On Mon, Sep 26, 2016 at 2:03 PM, Shixiong(Ryan) Zhu < >> shixi...@databricks.com> wrote: >> >>> +1 >>> >>> On Sun,

Re: GraphFrames 0.2.0 released

2016-08-26 Thread Joseph Bradley
This should do it: https://github.com/graphframes/graphframes/releases/tag/release-0.2.0 Thanks for the reminder! Joseph On Wed, Aug 24, 2016 at 10:11 AM, Maciej Bryński wrote: > Hi, > Do you plan to add tag for this release on github ? >

Re: Welcoming Felix Cheung as a committer

2016-08-16 Thread Joseph Bradley
Welcome Felix! On Mon, Aug 15, 2016 at 6:16 AM, mayur bhole wrote: > Congrats Felix! > > On Mon, Aug 15, 2016 at 2:57 PM, Paul Roy wrote: > >> Congrats Felix >> >> Paul Roy. >> >> On Mon, Aug 8, 2016 at 9:15 PM, Matei Zaharia

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Joseph Bradley
+1 Mainly tested ML/Graph/R. Perf tests from Tim Hunter showed minor speedups from 1.6 for common ML algorithms. On Thu, Jul 21, 2016 at 9:41 AM, Ricardo Almeida < ricardo.alme...@actnowib.com> wrote: > +1 (non binding) > > Tested PySpark Core, DataFrame/SQL, MLlib and Streaming on a

Re: Hello

2016-06-20 Thread Joseph Bradley
Hi Harmeet, I'll add one more item to the other advice: The community is in the process of putting together a roadmap JIRA for 2.1 for ML: https://issues.apache.org/jira/browse/SPARK-15581 This JIRA lists some of the major items and links to a few umbrella JIRAs with subtasks. I'd expect this

Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
One more note: When you specify the stages in the Pipeline, they need to be in topological order according to the DAG. On Sun, Jun 12, 2016 at 10:47 AM, Joseph Bradley <jos...@databricks.com> wrote: > Hi Pranay, > > Yes, you can do this. The DAG structure should be specified vi

Re: DAG in Pipeline

2016-06-12 Thread Joseph Bradley
Hi Pranay, Yes, you can do this. The DAG structure should be specified via the various Transformers' input and output columns, where a Transformer can have multiple input and/or output columns. Most of the classification and regression Models are good examples of Transformers with multiple
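
As a rough illustration of the point about column-defined DAGs and stage ordering, the sketch below derives a valid stage order from each stage's input and output columns using Python's standard `graphlib` module (Python 3.9+). It does not use Spark; the stage names and column names are hypothetical, and a real Pipeline would hold Transformer and Estimator objects rather than tuples.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: name -> (input columns, output columns).
# They are deliberately listed out of order.
STAGES = {
    "assembler": (["words_tf", "text_length"], ["features"]),
    "tokenizer": (["text"], ["words"]),
    "hashing_tf": (["words"], ["words_tf"]),
    "length_counter": (["text"], ["text_length"]),
    "classifier": (["features", "label"], ["prediction"]),
}

def stage_order(stages):
    """Return stage names ordered so every stage follows the stages it reads from."""
    # Map each output column to the stage that produces it.
    producer = {}
    for name, (_, outputs) in stages.items():
        for col in outputs:
            producer[col] = name
    # A stage depends on the producers of its input columns; columns with no
    # producer (e.g. "text", "label") are assumed to exist in the input data.
    graph = {
        name: {producer[col] for col in inputs if col in producer}
        for name, (inputs, _) in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

The resulting list is the kind of order the stages would need to be given in; several valid orders can exist for the same DAG, and a cycle in the column dependencies raises `graphlib.CycleError`.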

Re: Welcoming Yanbo Liang as a committer

2016-06-12 Thread Joseph Bradley
Congrats & welcome! On Tue, Jun 7, 2016 at 7:15 AM, Xiangrui Meng wrote: > Congrats!! > > On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali > wrote: > >> Congratulations Yanbo Liang! Well deserved. >> >> >> On Sun, Jun 5, 2016 at 7:10 PM,

Re: Shrinking the DataFrame lineage

2016-06-12 Thread Joseph Bradley
ed problem handled in GraphFrames? Suppose, I want to >> use aggregateMessages in the iterative loop, for implementing PageRank. >> >> >> >> Best regards, Alexander >> >> >> >> *From:* Joseph Bradley [mailto:jos...@databricks.com] >> *Sent:* Fr

Re: Implementing linear albegra operations in the distributed linalg package

2016-06-10 Thread Joseph Bradley
I agree that more distributed matrix ops would be good to have, but I think there are a few things which need to happen first: * Now that the spark.ml package has local linear algebra separate from the spark.mllib package, we should migrate the distributed linear algebra implementations over to

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Joseph Bradley
+1 On Wed, May 18, 2016 at 10:49 AM, Reynold Xin wrote: > Hi Ovidiu-Cristian , > > The best source of truth is change the filter with target version to > 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we > get closer to 2.0 release, more will be

Re: Shrinking the DataFrame lineage

2016-05-13 Thread Joseph Bradley
Here's a JIRA for it: https://issues.apache.org/jira/browse/SPARK-13346 I don't have a great method currently, but hacks can get around it: convert the DataFrame to an RDD and back to truncate the query plan lineage. Joseph On Wed, May 11, 2016 at 12:46 PM, Ulanov, Alexander <
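
A minimal PySpark-style sketch of the workaround described above, assuming `spark` is a session object with a `createDataFrame(rdd, schema)` method and `df` is a DataFrame exposing `.rdd` and `.schema`. The helper names `truncate_lineage` and `iterate`, and the `truncate_every` parameter, are hypothetical; this is a hack in the spirit of the message, not an official API for lineage truncation.

```python
def truncate_lineage(spark, df):
    """Rebuild `df` from its RDD and schema.

    The new DataFrame's query plan starts from the RDD instead of carrying
    the full chain of earlier DataFrame transformations.
    """
    return spark.createDataFrame(df.rdd, df.schema)

def iterate(spark, df, step, num_iterations, truncate_every=10):
    """Apply `step` repeatedly, truncating the plan every few iterations.

    `step` is any function mapping a DataFrame to a DataFrame, e.g. one
    round of an iterative graph algorithm.
    """
    for i in range(1, num_iterations + 1):
        df = step(df)
        if i % truncate_every == 0:
            df = truncate_lineage(spark, df)
    return df
```

The round trip through an RDD is not free, so how often to truncate is a judgment call that depends on how quickly the plan grows per iteration.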

Re: Decrease shuffle in TreeAggregate with coalesce ?

2016-04-27 Thread Joseph Bradley
Do you have code which can reproduce this performance drop in treeReduce? It would be helpful to debug. In the 1.6 release, we profiled it via the various MLlib algorithms and did not see performance drops. It's not just renumbering the partitions; it is reducing the number of partitions by a

Re: net.razorvine.pickle.PickleException in Pyspark

2016-04-25 Thread Joseph Bradley
Thanks for your work on this. Can we continue discussing on the JIRA? On Sun, Apr 24, 2016 at 9:39 AM, Caique Marques wrote: > Hello, everyone! > > I'm trying to implement the association rules in Python. I got implement > an association by a frequent element, works

Re: Organizing Spark ML example packages

2016-04-20 Thread Joseph Bradley
Sounds good to me. I'd request we be strict during this process about requiring *no* changes to the example itself, which will make review easier. On Tue, Apr 19, 2016 at 11:12 AM, Bryan Cutler wrote: > +1, adding some organization would make it easier for people to find a >

Re: Different maxBins value for categorical and continuous features in RandomForest implementation.

2016-04-12 Thread Joseph Bradley
That sounds useful. Would you mind creating a JIRA for it? Thanks! Joseph On Mon, Apr 11, 2016 at 2:06 AM, Rahul Tanwani wrote: > Hi, > > Currently the RandomForest algo takes a single maxBins value to decide the > number of splits to take. This sometimes causes

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Joseph Bradley
+1 By the way, the JIRA for tracking (Scala) API parity is: https://issues.apache.org/jira/browse/SPARK-4591 On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia wrote: > This sounds good to me as well. The one thing we should pay attention to > is how we update the docs so

Re: running lda in spark throws exception

2016-04-04 Thread Joseph Bradley
> at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-29 Thread Joseph Bradley
RDD/DataFrame space. >>> > >>> > So, to promote a more extensive use of Pipelines, PipelineStages, and >>> > Transformers, I was thinking about moving that part to SQL/DataFrame >>> > API where they really belong. If not, I think people might miss the >&

Re: Any plans to migrate Transformer API to Spark SQL (closer to DataFrames)?

2016-03-25 Thread Joseph Bradley
There have been some comments about using Pipelines outside of ML, but I have not yet seen a real need for it. If a user does want to use Pipelines for non-ML tasks, they still can use Transformers + PipelineModels. Will that work? On Fri, Mar 25, 2016 at 8:05 AM, Jacek Laskowski

Re: SparkML algos limitations question.

2016-03-21 Thread Joseph Bradley
e that in most cases I simply won't hit it, but the depth > of the tree would be much more, than 30. > > > -- > Be well! > Jean Morozov > > On Wed, Dec 16, 2015 at 1:00 AM, Joseph Bradley <jos...@databricks.com> > wrote: > >> Hi Eugene, >

Merging ML Estimator and Model

2016-03-21 Thread Joseph Bradley
Spark devs & users, I want to bring attention to a proposal to merge the MLlib (spark.ml) concepts of Estimator and Model in Spark 2.0. Please comment & discuss on SPARK-14033 (not in this email thread). *TL;DR:* *Proposal*: Merge Estimator

Re: pull request template

2016-03-15 Thread Joseph Bradley
+1 for keeping the template I figure any template will require conscientiousness & enforcement. On Sat, Mar 12, 2016 at 1:30 AM, Sean Owen wrote: > The template is a great thing as it gets instructions even more right > in front of people. > > Another idea is to just write

Re: Welcoming two new committers

2016-02-08 Thread Joseph Bradley
Congrats & welcome! On Mon, Feb 8, 2016 at 12:19 PM, Ram Sriharsha wrote: > great job guys! congrats and welcome! > > On Mon, Feb 8, 2016 at 12:05 PM, Amit Chavan wrote: > >> Welcome. >> >> On Mon, Feb 8, 2016 at 2:50 PM, Suresh Thalamati < >>

Re: Adding Naive Bayes sample code in Documentation

2016-01-29 Thread Joseph Bradley
JIRA created! https://issues.apache.org/jira/browse/SPARK-13089 Feel free to pick it up if you're interested. : ) Joseph On Wed, Jan 27, 2016 at 8:43 AM, Vinayak Agrawal wrote: > Hi, > I was reading through Spark ML package and I couldn't find Naive Bayes >

Re: Spark LDA model reuse with new set of data

2016-01-26 Thread Joseph Bradley
Hi, This is more a question for the user list, not the dev list, so I'll CC user. If you're using mllib.clustering.LDAModel (RDD API), then can you make sure you're using a LocalLDAModel (or convert to it from DistributedLDAModel)? You can then call topicDistributions() on the new data. If

Re: running lda in spark throws exception

2015-12-29 Thread Joseph Bradley
Hi Li, I'm wondering if you're running into the same bug reported here: https://issues.apache.org/jira/browse/SPARK-12488 I haven't figured out yet what is causing it. Do you have a small corpus which reproduces this error, and which you can share on the JIRA? If so, that would help a lot in

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-16 Thread Joseph Bradley
This method is tested in the Spark 1.5 unit tests, so I'd guess it's a problem with the Parquet dependency. What version of Parquet are you building Spark 1.5 off of? (I'm not that familiar with Parquet issues myself, but hopefully a SQL person can chime in.) On Tue, Dec 15, 2015 at 3:23 PM,

Re: [VOTE] Release Apache Spark 1.6.0 (RC3)

2015-12-16 Thread Joseph Bradley
+1 On Wed, Dec 16, 2015 at 5:26 PM, Reynold Xin wrote: > +1 > > > On Wed, Dec 16, 2015 at 5:24 PM, Mark Hamstra > wrote: > >> +1 >> >> On Wed, Dec 16, 2015 at 1:32 PM, Michael Armbrust > > wrote: >> >>> Please vote on

Re: BIRCH clustering algorithm

2015-12-15 Thread Joseph Bradley
Hi Dzeno, I'm not familiar with the algorithm myself, but if you have an important use case for it, you could open a JIRA to discuss it. However, if it is a less common algorithm, I'd recommend first submitting it as a Spark package (but publicizing the package on the user list). If it gains

Re: SparkML algos limitations question.

2015-12-15 Thread Joseph Bradley
Hi Eugene, The maxDepth parameter exists because the implementation uses Integer node IDs which correspond to positions in the binary tree. This simplified the implementation. I'd like to eventually modify it to avoid depending on tree node IDs, but that is not yet on the roadmap. There is not
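To illustrate why Integer node IDs tied to tree positions cap the depth, here is a small sketch in plain Python. It assumes heap-style numbering (root = 1, children of node i are 2i and 2i + 1) and a 32-bit signed integer; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the Spark source.

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed integer


def left_child(node_id):
    """ID of the left child under the assumed heap-style numbering."""
    return 2 * node_id


def right_child(node_id):
    """ID of the right child under the assumed heap-style numbering."""
    return 2 * node_id + 1


def max_node_id_at_depth(depth):
    """Largest node ID on a level, with the root (ID 1) at depth 0."""
    return 2 ** (depth + 1) - 1


def deepest_representable_depth():
    """Deepest level whose largest node ID still fits in a signed 32-bit int."""
    depth = 0
    while max_node_id_at_depth(depth + 1) <= INT_MAX:
        depth += 1
    return depth


print(deepest_representable_depth())  # 30
```

Under these assumptions the rightmost node at depth 30 has ID 2^31 - 1, which is exactly the largest signed 32-bit value, so one more level would overflow.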

Re: [ML] Missing documentation for the IndexToString feature transformer

2015-12-05 Thread Joseph Bradley
Thanks for reporting this! I just added a JIRA: https://issues.apache.org/jira/browse/SPARK-12159 That would be great if you could send a PR for it; thanks! Joseph On Sat, Dec 5, 2015 at 5:02 AM, Benjamin Fradet wrote: > Hi, > > I was wondering why the IndexToString

Re: Python API for Association Rules

2015-12-02 Thread Joseph Bradley
If you're working on a feature, please comment on the JIRA first (to avoid conflicts / duplicate work). Could you please copy what you wrote to the JIRA to discuss there? Thanks, Joseph On Wed, Dec 2, 2015 at 4:51 AM, caiquermarques95 wrote: > Hello everyone! >

Re: Problem in running MLlib SVM

2015-12-01 Thread Joseph Bradley
around 57% on training set. > > On Mon, Nov 30, 2015 at 6:33 PM, Joseph Bradley <jos...@databricks.com> > wrote: > >> model.predict should return a 0/1 predicted label. The example code is >> misleading when it calls the prediction a "score." >> >> On M

Re: Grid search with Random Forest

2015-12-01 Thread Joseph Bradley
pache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1 >>> On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" <ndj...@gmail.com> wrote: >>> >>>> Hi Joseph, >>>> >>>> Yes Random Forest support Grid Search on Spark 1.5.+ . But I'

Re: Problem in running MLlib SVM

2015-11-30 Thread Joseph Bradley
model.predict should return a 0/1 predicted label. The example code is misleading when it calls the prediction a "score." On Mon, Nov 30, 2015 at 9:13 AM, Fazlan Nazeem wrote: > You should never use the training data to measure your prediction > accuracy. Always use a fresh
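The difference between a predicted label and a raw score can be sketched in plain Python as below. This is an illustrative stand-in for a linear model with a decision threshold, not the MLlib implementation; the function name and the convention that `threshold=None` returns the raw margin are assumptions made for this example.

```python
def predict(weights, intercept, features, threshold=0.0):
    """Linear-model prediction: a 0/1 label, or the raw margin if threshold is None."""
    # The margin is the signed score w . x + b.
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    if threshold is None:
        # No threshold: return the raw score instead of a label.
        return margin
    # With a threshold: the output is a class label, not a score.
    return 1.0 if margin > threshold else 0.0


print(predict([1.0, -2.0], 0.5, [2.0, 0.5]))                  # 1.0
print(predict([1.0, -2.0], 0.5, [2.0, 0.5], threshold=None))  # 1.5
```

Metrics that need a ranking, such as area under the ROC curve, require the raw margin; computing them from the 0/1 labels gives misleading results.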

Re: Grid search with Random Forest

2015-11-30 Thread Joseph Bradley
It should work with 1.5+. On Thu, Nov 26, 2015 at 12:53 PM, Ndjido Ardo Bar wrote: > > Hi folks, > > Does anyone know whether the Grid Search capability is enabled since the > issue spark-9011 of version 1.4.0 ? I'm getting the "rawPredictionCol > column doesn't exist" when

Re: Unhandled case in VectorAssembler

2015-11-20 Thread Joseph Bradley
Yes, please, could you send a JIRA (and PR)? A custom error message would be better. Thank you! Joseph On Fri, Nov 20, 2015 at 2:39 PM, BenFradet wrote: > Hey there, > > I noticed that there is an unhandled case in the transform method of > VectorAssembler if one of

Re: spark-submit is throwing NPE when trying to submit a random forest model

2015-11-19 Thread Joseph Bradley
Hi, Could you please submit this via JIRA as a bug report? It will be very helpful if you include the Spark version, system details, and other info too. Thanks! Joseph On Thu, Nov 19, 2015 at 1:21 PM, Rachana Srivastava < rachana.srivast...@markmonitor.com> wrote: > *Issue:* > > I have a random

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Joseph Bradley
That sounds useful; would you mind submitting a JIRA (and a PR if you're willing)? Thanks, Joseph On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier wrote: > Hi, > > MLUtils.loadLibSVMFile verifies that indices are 1-based and > increasing, and otherwise triggers an error.
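As a rough picture of the check being discussed, the sketch below parses one LIBSVM-format line in plain Python and reports the offending line when indices are not 1-based and strictly increasing. It is a hypothetical stand-alone function, not the code in MLUtils.loadLibSVMFile, and the error wording is invented for illustration.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' and return (label, zero_based_indices, values)."""
    items = line.split()
    label = float(items[0])
    indices, values = [], []
    previous = 0  # last 1-based index seen; 0 means none yet
    for item in items[1:]:
        index_text, value_text = item.split(":")
        index = int(index_text)
        if index < 1:
            raise ValueError(
                "indices must be 1-based, found %d in line: %r" % (index, line))
        if index <= previous:
            raise ValueError(
                "indices must be strictly increasing, found %d after %d in line: %r"
                % (index, previous, line))
        indices.append(index - 1)  # convert to 0-based for internal use
        values.append(float(value_text))
        previous = index
    return label, indices, values


print(parse_libsvm_line("1 1:0.5 3:2.0"))  # (1.0, [0, 2], [0.5, 2.0])
```

Including the raw line in the message lets a user locate the bad record in a large input file without re-parsing it by hand.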

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about """ 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing the benefits to many tree based
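The sorting idea referred to in the quoted text is commonly understood as ordering the categories by their mean label, so that an unordered categorical feature can be split like an ordered one. The plain-Python sketch below shows that ordering and the resulting integer encoding; the function names are hypothetical, and this is an illustration of the general technique rather than the proposed Transformer.

```python
from collections import defaultdict


def order_categories_by_mean_label(categories, labels):
    """Return the distinct categories sorted by the mean label observed for each."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1
    # Ties on the mean are broken by the category value to keep the order stable.
    return sorted(sums, key=lambda c: (sums[c] / counts[c], c))


def encode(categories, labels):
    """Map each category to its rank in the mean-label ordering."""
    ordering = order_categories_by_mean_label(categories, labels)
    return {category: rank for rank, category in enumerate(ordering)}


cats = ["a", "b", "a", "c", "b", "c"]
ys = [1, 0, 1, 0, 1, 0]
print(order_categories_by_mean_label(cats, ys))  # ['c', 'b', 'a']
```

With k categories ordered this way, a tree learner only needs to consider k - 1 split points along the ordering instead of every subset of categories.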

Re: Unchecked contribution (JIRA and PR)

2015-11-16 Thread Joseph Bradley
Hi Sergio, Apart from apologies about limited review bandwidth (from me too!), I wanted to add: It would be interesting to hear what feedback you've gotten from users of your package. Perhaps you could collect feedback by (a) emailing the user list and (b) adding a note in the Spark Packages

Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-07 Thread Joseph Bradley
+1 tested on OS X On Sat, Nov 7, 2015 at 10:25 AM, Reynold Xin wrote: > +1 myself too > > On Sat, Nov 7, 2015 at 12:01 AM, Robin East > wrote: > >> +1 >> Mac OS X 10.10.5 Yosemite >> >> mvn clean package -DskipTests (13min) >> >> Basic graph tests

Re: Gradient Descent with large model size

2015-10-15 Thread Joseph Bradley
For those numbers of partitions, I don't think you'll actually use tree aggregation. The number of partitions needs to be over a certain threshold (>= 7) before treeAggregate really operates on a tree structure:

Re: What is the difference between ml.classification.LogisticRegression and mllib.classification.LogisticRegressionWithLBFGS

2015-10-07 Thread Joseph Bradley
Hi YiZhi Liu, The spark.ml classes are part of the higher-level "Pipelines" API, which works with DataFrames. When creating this API, we decided to separate it from the old API to avoid confusion. You can read more about it here: http://spark.apache.org/docs/latest/ml-guide.html For (3): We
