Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Nick Pentreath
Wow! End of an era. Thanks so much to you Shane for all your work over 10 (!!) years. And to Amplab also! Farewell Spark Jenkins! N On Tue, Dec 7, 2021 at 6:49 AM Nicholas Chammas wrote: > Farewell to Jenkins and its classic weather forecast build status icons: > > [image:

Re: Welcoming six new Apache Spark committers

2021-03-29 Thread Nick Pentreath
Congratulations to all the new committers. Welcome! On Fri, 26 Mar 2021 at 22:22, Matei Zaharia wrote: > Hi all, > > The Spark PMC recently voted to add several new committers. Please join me > in welcoming them to their new role! Our new committers are: > > - Maciej Szymkiewicz (contributor

Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Nick Pentreath
Congratulations and welcome as Apache Spark committers! On Wed, 15 Jul 2020 at 06:59, Prashant Sharma wrote: > Congratulations all ! It's great to have such committed folks as > committers. :) > > On Wed, Jul 15, 2020 at 9:24 AM Yi Wu wrote: > >> Congrats!! >> >> On Wed, Jul 15, 2020 at 8:02

Re: Revisiting Online serving of Spark models?

2018-06-05 Thread Nick Pentreath
I will aim to join up at 4pm tomorrow (Wed) too. Look forward to it. On Sun, 3 Jun 2018 at 00:24 Holden Karau wrote: > On Sat, Jun 2, 2018 at 8:39 PM, Maximiliano Felice < > maximilianofel...@gmail.com> wrote: > >> Hi! >> >> We're already in San Francisco waiting for the summit. We even think

Re: Welcome Zhenhua Wang as a Spark committer

2018-04-03 Thread Nick Pentreath
Congratulations! On Tue, 3 Apr 2018 at 05:34 wangzhenhua (G) wrote: > > > Thanks everyone! It’s my great pleasure to be part of such a professional > and innovative community! > > > > > > best regards, > > -Zhenhua(Xander) > > >

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-27 Thread Nick Pentreath
+1 (binding) Built and ran Scala tests with "-Phadoop-2.6 -Pyarn -Phive", all passed. Python tests passed (also including pyspark-streaming w/kafka-0.8 and flume packages built) On Tue, 27 Feb 2018 at 10:09 Felix Cheung wrote: > +1 > > Tested R: > > install from

Re: [VOTE] Spark 2.3.0 (RC3)

2018-02-14 Thread Nick Pentreath
-1 for me as we elevated https://issues.apache.org/jira/browse/SPARK-23377 to a Blocker. It should be fixed before release. On Thu, 15 Feb 2018 at 07:25 Holden Karau wrote: > If this is a blocker in your view then the vote thread is an important > place to mention it.

Re: redundant decision tree model

2018-02-13 Thread Nick Pentreath
There is a long outstanding JIRA issue about it: https://issues.apache.org/jira/browse/SPARK-3155. It is probably still a useful feature to have for trees but the priority is not that high since it may not be that useful for the tree ensemble models. On Tue, 13 Feb 2018 at 11:52 Alessandro
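The redundancy in question (an internal split whose two children end up as leaves with the same prediction) can be collapsed bottom-up. A minimal plain-Python sketch of the idea, using a hypothetical dict encoding rather than Spark's tree model classes:

```python
def prune_redundant(node):
    """Bottom-up pass: if both children of a split reduce to leaves with
    the same prediction, the split is redundant and becomes a leaf."""
    if "pred" in node:  # leaf node
        return node
    left = prune_redundant(node["left"])
    right = prune_redundant(node["right"])
    if "pred" in left and "pred" in right and left["pred"] == right["pred"]:
        return {"pred": left["pred"]}  # collapse the redundant split
    return {"left": left, "right": right}

# a depth-2 tree in which every path predicts class 1
tree = {"left": {"pred": 1},
        "right": {"left": {"pred": 1}, "right": {"pred": 1}}}
print(prune_redundant(tree))  # {'pred': 1}
```

As the thread notes, this matters less for ensembles, where individual trees are intentionally kept diverse and slightly overgrown.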

Re: [VOTE] Spark 2.3.0 (RC2)

2018-02-01 Thread Nick Pentreath
All MLlib QA JIRAs resolved. Looks like SparkR too, so from the ML side that should be everything outstanding. On Thu, 1 Feb 2018 at 06:21 Yin Huai wrote: > seems we are not running tests related to pandas in pyspark tests (see my > email "python tests related to pandas

Re: [VOTE] Spark 2.3.0 (RC2)

2018-01-25 Thread Nick Pentreath
I think this has come up before (and Sean mentions it above), but the sub-items on: SPARK-23105 Spark MLlib, GraphX 2.3 QA umbrella are actually marked as Blockers, but are not targeted to 2.3.0. I think they should be, and I'm not comfortable with those not being resolved before voting

Re: CrossValidation distribution - is it in the roadmap?

2017-11-29 Thread Nick Pentreath
Hi Tomasz Parallel evaluation for CrossValidation and TrainValidationSplit was added for Spark 2.3 in https://issues.apache.org/jira/browse/SPARK-19357 On Wed, 29 Nov 2017 at 16:31 Tomasz Dudek wrote: > Hey, > > is there a way to make the following code: > > val
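The parallelism added in SPARK-19357 amounts to scoring the candidate parameter settings concurrently instead of in a sequential loop. A toy sketch of that idea in plain Python (the `fit_and_score` callable is hypothetical, not the Spark API):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_evaluate(candidates, fit_and_score, parallelism=4):
    """Score each candidate parameter setting concurrently and return the
    best (score, params) pair -- the idea behind parallel model selection."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        scores = list(pool.map(fit_and_score, candidates))
    return max(zip(scores, candidates))

# toy scorer: prefer the regularization value closest to 0.3
score = lambda params: -abs(params - 0.3)
best_score, best_params = parallel_evaluate([0.01, 0.1, 0.3, 1.0], score)
print(best_params)  # 0.3
```

In Spark 2.3 the equivalent knob is `setParallelism` on `CrossValidator` / `TrainValidationSplit`.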

Re: Timeline for Spark 2.3

2017-11-09 Thread Nick Pentreath
+1 I think that’s practical On Fri, 10 Nov 2017 at 03:13, Erik Erlandson wrote: > +1 on extending the deadline. It will significantly improve the logistics > for upstreaming the Kubernetes back-end. Also agreed, on the general > realities of reduced bandwidth over the

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
eases/spark-release-2-2-0.html#known-issues > before due to this reason. > I believe It should be fine and probably we should note if possible. I > believe this should not be a regression anyway as, if I understood > correctly, it was there from the very first place. > > Thank

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-06 Thread Nick Pentreath
Checked sigs & hashes. Tested on RHEL build/mvn -Phadoop-2.7 -Phive -Pyarn test passed Python tests passed I ran R tests and am getting some failures: https://gist.github.com/MLnick/ddf4d531d5125208771beee0cc9c697e (I seem to recall similar issues on a previous release but I thought it was

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-04 Thread Nick Pentreath
Ah right! Was using a new cloud instance and didn't realize I was logged in as root! thanks On Tue, 3 Oct 2017 at 21:13 Marcelo Vanzin <van...@cloudera.com> wrote: > Maybe you're running as root (or the admin account on your OS)? > > On Tue, Oct 3, 2017 at 12:12 PM, Nick Pentreath

Re: [VOTE] Spark 2.1.2 (RC4)

2017-10-03 Thread Nick Pentreath
Hmm I'm consistently getting this error in core tests: - SPARK-3697: ignore directories that cannot be read. *** FAILED *** 2 was not equal to 1 (FsHistoryProviderSuite.scala:146) Anyone else? Any insight? Perhaps it's my set up. >> >> On Tue, Oct 3, 2017 at 7:24 AM Holden Karau

Re: Should Flume integration be behind a profile?

2017-10-02 Thread Nick Pentreath
I'd agree with #1 or #2. Deprecation now seems fine. Perhaps this should be raised on the user list also? And perhaps it makes sense to look at moving the Flume support into Apache Bahir if there is interest (I've cc'ed Bahir dev list here)? That way the current state of the connector could keep

Re: Welcoming Tejas Patil as a Spark committer

2017-09-30 Thread Nick Pentreath
Congratulations! >> >> Matei Zaharia wrote >> > Hi all, >> > >> > The Spark PMC recently added Tejas Patil as a committer on the >> > project. Tejas has been contributing across several areas of Spark for >> > a while, focusing especially on scalability issues and SQL. Please >> > join me in

Re: Updates on migration guides

2017-08-30 Thread Nick Pentreath
MLlib has tried quite hard to ensure the migration guide is up to date for each release. I think generally we catch all breaking and most major behavior changes On Wed, 30 Aug 2017 at 17:02, Dongjoon Hyun wrote: > +1 > > On Wed, Aug 30, 2017 at 7:54 AM, Xiao Li

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Nick Pentreath
+1 (binding) On Mon, 3 Jul 2017 at 11:53 Yanbo Liang wrote: > +1 > > On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier < > hvanhov...@databricks.com> wrote: > >> +1 >> >> On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida < >> ricardo.alme...@actnowib.com>

Re: [VOTE] Apache Spark 2.2.0 (RC5)

2017-06-21 Thread Nick Pentreath
As before, release looks good, all Scala, Python tests pass. R tests fail with same issue in SPARK-21093 but it's not a blocker. +1 (binding) On Wed, 21 Jun 2017 at 01:49 Michael Armbrust wrote: > I will kick off the voting with a +1. > > On Tue, Jun 20, 2017 at 4:49

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-21 Thread Nick Pentreath
> structField("Avg", "double")) >>> df4 <- gapply( >>> cols = "Sepal_Length", >>> irisDF, >>> function(key, x) { >>> y <- data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE) >>> }, >>

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-14 Thread Nick Pentreath
he same error reported by Nick below. > > _ > From: Hyukjin Kwon <gurwls...@gmail.com> > Sent: Tuesday, June 13, 2017 8:02 PM > > Subject: Re: [VOTE] Apache Spark 2.2.0 (RC4) > To: dev <dev@spark.apache.org> > Cc: Sean Owen <so...@cloudera.c

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-09 Thread Nick Pentreath
All Scala, Python tests pass. ML QA and doc issues are resolved (as well as R it seems). However, I'm seeing the following test failure on R consistently: https://gist.github.com/MLnick/5f26152f97ae8473f807c6895817cf72 On Thu, 8 Jun 2017 at 08:48 Denny Lee wrote: > +1

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
Now, on the subject of (ML) QA JIRAs. From the ML side, I believe they are required (I think others such as Joseph will agree and in fact have already said as much). Most are marked as Blockers, though of those the Python API coverage is strictly not a Blocker as we will never hold the release

Re: [VOTE] Apache Spark 2.2.0 (RC4)

2017-06-06 Thread Nick Pentreath
The website updates for ML QA (SPARK-20507) are not *actually* critical as the project website certainly can be updated separately from the source code guide and is not part of the release to be voted on. In future that particular work item for the QA process could be marked down in priority, and

Re: RDD MLLib Deprecation Question

2017-05-30 Thread Nick Pentreath
The short answer is those distributed linalg parts will not go away. In the medium term, it's much less likely that the distributed matrix classes will be ported over to DataFrames (though the ideal would be to have DataFrame-backed distributed matrix classes) - given the time and effort it's

Re: [VOTE] Apache Spark 2.2.0 (RC2)

2017-05-19 Thread Nick Pentreath
All the outstanding ML QA doc and user guide items are done for 2.2 so from that side we should be good to cut another RC :) On Thu, 18 May 2017 at 00:18 Russell Spitzer wrote: > Seeing an issue with the DataScanExec and some of our integration tests > for the SCC.

Re: [VOTE] Apache Spark 2.2.0 (RC1)

2017-05-02 Thread Nick Pentreath
I won't +1 just given that it seems certain there will be another RC and there are the outstanding ML QA blocker issues. But clean build and test for JVM and Python tests LGTM on CentOS Linux 7.2.1511, OpenJDK 1.8.0_111 On Mon, 1 May 2017 at 22:42 Frank Austin Nothaft

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Nick Pentreath
As for SPARK-19759 , I don't think that needs to be targeted for 2.1.1 so we don't need to worry about it On Tue, 21 Mar 2017 at 13:49 Holden Karau wrote: > I agree with Michael, I think we've got some outstanding issues

Re: Should we consider a Spark 2.1.1 release?

2017-03-16 Thread Nick Pentreath
Spark 1.5.1 had 87 issues with that fix version, 1 month after 1.5.0. Spark 1.6.1 had 123 issues, 2 months after 1.6.0. 2.0.1 was larger (317 issues) at 3 months after 2.0.0 - makes sense due to how large a release it was. We are at 185 for 2.1.1 and 3 months after (and not released yet so it could slip

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

2017-03-04 Thread Nick Pentreath
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from SPARK-19498 specifically to discuss opening up sharedParams traits. On Fri, 3 Mar 2017 at 23:17 Shouheng Yi wrote: > Hi Spark dev list, > > > > Thank you guys so much for all your inputs.

Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
h low-level libraries. > > Tim > > > On Thu, Feb 23, 2017 at 5:32 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > > Sorry for being late to the discussion. I think Joseph, Sean and others > have covered the issues well. > > Overall I like the pr

Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others have covered the issues well. Overall I like the proposed cleaned up roadmap & process (thanks Joseph!). As for the actual critical roadmap items mentioned on SPARK-18813, I think it makes sense and will comment a bit further

Re: Implementation of RNN/LSTM in Spark

2017-02-23 Thread Nick Pentreath
The short answer is there is none and highly unlikely to be inside of Spark MLlib any time in the near future. The best bets are to look at other DL libraries - for JVM there is Deeplearning4J and BigDL (there are others but these seem to be the most comprehensive I have come across) - that run

Re: Google Summer of Code 2017 is coming

2017-02-05 Thread Nick Pentreath
I think Sean raises valid points - that the result is highly dependent on the particular student, project and mentor involved, and that the actual required time investment is very significant. Having said that, it's not all bad certainly. Scikit-learn started as a GSoC project 10 years ago!

Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-02-01 Thread Nick Pentreath
Hi Maciej If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames *then that seems to point to some other underlying issue as the root cause. Even though adding checkpointing should help, we should understand why it's different between 1.6 and 2.0? On Thu, 2 Feb 2017 at 08:22

Re: Why are ml models repartition(1)'d in save methods?

2017-01-13 Thread Nick Pentreath
Yup - it's because almost all model data in spark ML (model coefficients) is "small" - i.e. Non distributed. If you look at ALS you'll see there is no repartitioning since the factor dataframes can be large On Fri, 13 Jan 2017 at 19:42, Sean Owen wrote: > You're referring to

Re: Issue in using DenseVector in RowMatrix, error could be due to ml and mllib package changes

2016-12-08 Thread Nick Pentreath
Yes, most likely due to HashingTF returning ml vectors while you need mllib vectors for RowMatrix. I'd recommend using the vector conversion utils (I think in mllib.linalg.Vectors but I'm on mobile right now so can't recall exactly). There are util methods for converting single vectors as well as

Re: 2.1.0-rc2 cut; committers please set fix version for branch-2.1 to 2.1.1 instead

2016-12-07 Thread Nick Pentreath
I went ahead and re-marked all the existing 2.1.1 fix version JIRAs (that had gone into branch-2.1 since RC1 but before RC2) for Spark ML to 2.1.0 On Thu, 8 Dec 2016 at 09:20 Reynold Xin wrote: > Thanks. >

Re: unhelpful exception thrown on predict() when ALS trained model doesn't contain user or product?

2016-12-06 Thread Nick Pentreath
Indeed, it's being tracked here: https://issues.apache.org/jira/browse/SPARK-18230 though no PR has been opened yet. On Tue, 6 Dec 2016 at 13:36 chris snow wrote: > I'm using the MatrixFactorizationModel.predict() method and encountered > the following exception: > > Name:

Re: Why don't we imp some adaptive learning rate methods, such as adadelat, adam?

2016-11-30 Thread Nick Pentreath
check out https://github.com/VinceShieh/Spark-AdaOptimizer On Wed, 30 Nov 2016 at 10:52 WangJianfei wrote: > Hi devs: > Normally, the adaptive learning rate methods can have a faster > convergence > than standard SGD, so why don't we imp them? > see the link

Re: Develop custom Estimator / Transformer for pipeline

2016-11-17 Thread Nick Pentreath
@Holden look forward to the blog post - I think a user guide PR based on it would also be super useful :) On Fri, 18 Nov 2016 at 05:29 Holden Karau wrote: > I've been working on a blog post around this and hope to have it published > early next month  > > On Nov 17,

Re: Question about using collaborative filtering in MLlib

2016-11-03 Thread Nick Pentreath
I have a PR for it - https://github.com/apache/spark/pull/12574 Sadly I've been tied up and haven't had a chance to work further on it. The main issue outstanding is deciding on the transform semantics as well as performance testing. Any comments / feedback welcome especially on transform

Re: Is RankingMetrics' NDCG implementation correct?

2016-09-20 Thread Nick Pentreath
(cc'ing dev list also) I think a more general version of ranking metrics that allows arbitrary relevance scores could be useful. Ranking metrics are applicable to other settings like search or other learning-to-rank use cases, so it should be a little more generic than pure recommender settings.

Re: Organizing Spark ML example packages

2016-09-12 Thread Nick Pentreath
lia...@gmail.com> wrote: >> >>> This sounds good to me, and it will make ML examples more neatly. >>> >>> 2016-04-14 5:28 GMT-07:00 Nick Pentreath <nick.pentre...@gmail.com>: >>> >>>> Hey Spark devs >>>> >>>>

Re: Why can't a Transformer have multiple output columns?

2016-08-23 Thread Nick Pentreath
It's not impossible that a Transformer could output multiple columns - it's simply because none of the current ones do. It's true that it might be a relatively less common use case in general. But take StringIndexer for example. It turns strings (categorical features) into ints (0-based indexes).
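The StringIndexer behavior described here (most frequent label gets index 0) can be sketched in a few lines of plain Python. This is a conceptual illustration only, not Spark's implementation; ties are broken alphabetically here for determinism, which may differ from Spark's tie handling:

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each label to a 0-based index, ordered by descending frequency
    (the most frequent string gets index 0), ties broken alphabetically."""
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda s: (-freq[s], s))
    return {label: i for i, label in enumerate(ordered)}

def transform(index_map, labels):
    """Apply the fitted mapping to a new sequence of labels."""
    return [index_map[label] for label in labels]

data = ["cat", "dog", "cat", "bird", "cat", "dog"]
idx = fit_string_indexer(data)
print(idx)                    # {'cat': 0, 'dog': 1, 'bird': 2}
print(transform(idx, data))   # [0, 1, 0, 2, 0, 1]
```

A multi-column variant would simply fit one such map per input column, which is the use case the thread is about.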

Re: Java 8

2016-08-20 Thread Nick Pentreath
Spark already supports compiling with Java 8. What refactoring are you referring to, and where do you expect to see performance gains? On Sat, 20 Aug 2016 at 12:41, Timur Shenkao wrote: > Hello, guys! > > Are there any plans / tickets / branches in repository on Java 8? > >

Re: Serving Spark ML models via a regular Python web app

2016-08-11 Thread Nick Pentreath
Currently there is no direct way in Spark to serve models without bringing in all of Spark as a dependency. For Spark ML, there is actually no way to do it independently of DataFrames either (which for single-instance prediction makes things sub-optimal). That is covered here:

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-02 Thread Nick Pentreath
Note that both HashingTF and CountVectorizer are usually used for creating TF-IDF normalized vectors. The definition ( https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Definition) of term frequency in TF-IDF is actually the "number of times the term occurs in the document". So it's perhaps a bit of a
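The definition quoted above, raw term count times inverse document frequency, is easy to illustrate in plain Python. This is a conceptual sketch (not HashingTF or IDF themselves); the smoothed IDF formula log((N + 1) / (df + 1)) matches what Spark's MLlib IDF documents:

```python
import math
from collections import Counter

def term_frequencies(doc):
    """Raw TF: the number of times each term occurs in the document."""
    return Counter(doc)

def inverse_document_frequency(corpus):
    """Smoothed IDF: log((N + 1) / (df + 1)), where df is the number of
    documents containing the term and N the corpus size."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {t: math.log((n + 1) / (df[t] + 1)) for t in df}

corpus = [["spark", "ml", "spark"], ["spark", "sql"]]
tf = term_frequencies(corpus[0])
idf = inverse_document_frequency(corpus)
tfidf = {t: tf[t] * idf[t] for t in tf}
print(tf["spark"])  # 2 -- the raw count, per the definition above
```

Note how "spark" appears in every document, so its IDF (and hence TF-IDF weight) is zero: ubiquitous terms carry no discriminative signal.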

Re: Internal Deprecation warnings - worth fixing?

2016-07-27 Thread Nick Pentreath
+1 I don't believe there's any reason for the warnings to still be there except for available dev time & focus :) On Wed, 27 Jul 2016 at 21:35, Jacek Laskowski wrote: > Kill 'em all -- one by one slowly yet gradually! :) > > Pozdrawiam, > Jacek Laskowski > >

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-07-04 Thread Nick Pentreath
of these to clean them up and see where they stand). If there are other blockers then we should mark them as such to help tracking progress? On Tue, 28 Jun 2016 at 11:28 Nick Pentreath <nick.pentre...@gmail.com> wrote: > I take it there will be another RC due to some blockers and as there were >

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-28 Thread Nick Pentreath
ss regression from 1.6. >> Looks like the patch is ready though: >> https://github.com/apache/spark/pull/13884 – it would be ideal for this >> patch to make it into the release. >> >> -Matt Cheah >> >> From: Nick Pentreath <nick.pentre...@gmail.com> >>

Re: [VOTE] Release Apache Spark 2.0.0 (RC1)

2016-06-24 Thread Nick Pentreath
I'm getting the following when trying to run ./dev/run-tests (not happening on master) from the extracted source tar. Anyone else seeing this? error: Could not access 'fc0a1475ef' ** File "./dev/run-tests.py", line 69, in

Re: Welcoming Yanbo Liang as a committer

2016-06-04 Thread Nick Pentreath
Congratulations Yanbo and welcome On Sat, 4 Jun 2016 at 10:17, Hortonworks wrote: > Congratulations, Yanbo > > Zhan Zhang > > Sent from my iPhone > > > On Jun 3, 2016, at 8:39 PM, Dongjoon Hyun wrote: > > > > Congratulations > > -- > CONFIDENTIALITY

Re: Cannot build master with sbt

2016-05-25 Thread Nick Pentreath
I've filed https://issues.apache.org/jira/browse/SPARK-15525 For now, you would have to check out sbt-antlr4 at https://github.com/ihji/sbt-antlr4/commit/23eab68b392681a7a09f6766850785afe8dfa53d (since I don't see any branches or tags in the github repo for different versions), and sbt

Re: [VOTE] Removing module maintainer process

2016-05-23 Thread Nick Pentreath
+1 (binding) On Mon, 23 May 2016 at 04:19, Matei Zaharia wrote: > Correction, let's run this for 72 hours, so until 9 PM EST May 25th. > > > On May 22, 2016, at 8:34 PM, Matei Zaharia > wrote: > > > > It looks like the discussion thread on this

Re: Cross Validator to work with K-Fold value of 1?

2016-05-02 Thread Nick Pentreath
There is a JIRA and PR around for supporting polynomial expansion with degree 1. Offhand I can't recall if it's been merged On Mon, 2 May 2016 at 17:45, Julio Antonio Soto de Vicente wrote: > Hi, > > Same goes for the PolynomialExpansion in org.apache.spark.ml.feature. It > would

Re: Duplicated fit into TrainValidationSplit

2016-04-27 Thread Nick Pentreath
You should find that the first set of fits are called on the training set, and the resulting models evaluated on the validation set. The final best model is then retrained on the entire dataset. This is standard practice - usually the dataset passed to the train validation split is itself further
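The flow just described (fit every candidate on the training portion, score on the validation portion, then refit the winner on all the data) can be sketched generically. The `fit` and `evaluate` callables below are hypothetical stand-ins, not Spark's Estimator/Evaluator API:

```python
def train_validation_split(data, candidates, fit, evaluate, train_ratio=0.75):
    """For each candidate parameter setting: fit on the training portion and
    score on the held-out validation portion; then refit the best setting on
    the entire dataset, mirroring the behavior described above."""
    split = int(len(data) * train_ratio)
    train, valid = data[:split], data[split:]
    scores = [(evaluate(fit(params, train), valid), params) for params in candidates]
    best_score, best_params = max(scores)
    return fit(best_params, data), best_params  # final refit on all data

# toy example: the "model" is a constant predictor equal to its parameter
fit = lambda params, data: params
evaluate = lambda model, valid: -sum((model - y) ** 2 for y in valid)
model, best = train_validation_split([1.0, 1.0, 1.0, 2.0], [1.0, 2.0], fit, evaluate)
print(best)  # 2.0 -- closest to the held-out validation point
```

So the "duplicated fit" in the thread title is by design: the extra final fit is what lets the returned model use all available data.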

Organizing Spark ML example packages

2016-04-14 Thread Nick Pentreath
Hey Spark devs I noticed that we now have a large number of examples for ML & MLlib in the examples project - 57 for ML and 67 for MLLIB to be precise. This is bound to get larger as we add features (though I know there are some PRs to clean up duplicated examples). What do you think about

Re: ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Ah I got it - Seq[(Int, Float)] is actually represented as Seq[Row] (seq of struct type) internally. So a further extraction is required, e.g. row => row.getSeq[Row](1).map { r => r.getInt(0) } On Wed, 6 Apr 2016 at 13:35 Nick Pentreath <nick.pentre...@gmail.com>

ClassCastException when extracting and collecting DF array column type

2016-04-06 Thread Nick Pentreath
Hi there, In writing some tests for a PR I'm working on, with a more complex array type in a DF, I ran into this issue (running off latest master). Any thoughts? *// create DF with a column of Array[(Int, Double)]* val df = sc.parallelize(Seq( (0, Array((1, 6.0), (1, 4.0))), (1, Array((1, 3.0),

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Nick Pentreath
+1 for this proposal - as you mention I think it's the defacto current situation anyway. Note that from a developer view it's just the user-facing API that will be only "ml" - the majority of the actual algorithms still operate on RDDs under the hood currently. On Wed, 6 Apr 2016 at 05:03, Chris

Re: Spark ML - Scaling logistic regression for many features

2016-03-19 Thread Nick Pentreath
summariser? If so, > can you give me the issue key(s)? If not, would you like me to create these > tickets? > > I'm going to look into this some more and see if I can figure out how to > implement these fixes. > > ~Daniel Siegmann > > On Sat, Mar 12, 2016 at 5:53 AM, Nick

Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Also adding dev list in case anyone else has ideas / views. On Sat, 12 Mar 2016 at 12:52, Nick Pentreath <nick.pentre...@gmail.com> wrote: > Thanks for the feedback. > > I think Spark can certainly meet your use case when your data size scales > up, as the actual model dimen

Re: Running ALS on comparitively large RDD

2016-03-10 Thread Nick Pentreath
Could you provide more details about: 1. Data set size (# ratings, # users and # products) 2. Spark cluster set up and version Thanks On Fri, 11 Mar 2016 at 05:53 Deepak Gopalakrishnan wrote: > Hello All, > > I've been running Spark's ALS on a dataset of users and rated

Re: ML ALS API

2016-03-08 Thread Nick Pentreath
Hi Maciej Yes, that *train* method is intended to be public, but it is marked as *DeveloperApi*, which means that backward compatibility is not necessarily guaranteed, and that method may change. Having said that, even APIs marked as DeveloperApi do tend to be relatively stable. As the comment

Re: Proposal

2016-01-30 Thread Nick Pentreath
Hi there Sounds like a fun project :) I'd recommend getting familiar with the existing k-means implementation as well as bisecting k-means in Spark, and then implementing yours based off that. You should focus on using the new ML pipelines API, and release it as a package on

Re: Elasticsearch sink for metrics

2016-01-15 Thread Nick Pentreath
I haven't come across anything, but could you provide more detail on what issues you're encountering? On Fri, Jan 15, 2016 at 11:09 AM, Pete Robbins wrote: > Has anyone tried pushing Spark metrics into elasticsearch? We have other > metrics, eg some runtime information,

Re: Write access to wiki

2016-01-12 Thread Nick Pentreath
I'd also like to get Wiki write access - at the least it allows a few of us to amend the "Powered By" and similar pages when those requests come through (Sean has been doing a lot of that recently :) On Mon, Jan 11, 2016 at 11:01 PM, Sean Owen wrote: > ... I forget who can

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
anywhere else (AFAIK it is not, but in case I missed something let me know any good reason to keep the explicit dependency)? N On Fri, Dec 11, 2015 at 6:55 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote: > Yeah also the integration tests need to be specifically run - I would have

Re: Spark streaming with Kinesis broken?

2015-12-11 Thread Nick Pentreath
unning tests on locally now. > Is the AWS SDK not used for reading/writing from S3 or do we get that for > free from the Hadoop dependencies? > On Fri, Dec 11, 2015 at 5:07 AM Nick Pentreath <nick.pentre...@gmail.com> > wrote: >> cc'ing dev list >> >> Ok, looks l

Re: ml.feature.Word2Vec.transform() very slow issue

2015-11-09 Thread Nick Pentreath
Seems a straightforward change that purely enhances efficiency, so yes please submit a JIRA and PR for this On Tue, Nov 10, 2015 at 8:56 AM, Sean Owen wrote: > Since it's a fairly expensive operation to build the Map, I tend to agree > it should not happen in the loop. > >

Re: HyperLogLogUDT

2015-09-13 Thread Nick Pentreath
a UDAF, extending > UserDefinedAggregationFunction is the preferred > approach. AggregateFunction2 is used for built-in aggregate function. > Thanks, > Yin > On Sat, Sep 12, 2015 at 10:40 AM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: >> Ok, that makes sense. So this is

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
gt; 2353 DC Leiderdorp > hvanhov...@questtec.nl > +599 9 521 4402 > > > 2015-09-12 11:06 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>: > >> I should add that surely the idea behind UDT is exactly that it can (a) >> fit automatically into DFs and Tungsten an

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
> than java objects), in order for this to fit the Tungsten execution model > where everything is operating directly against some memory address. > > On Wed, Jul 1, 2015 at 11:00 PM, Nick Pentreath <nick.pentre...@gmail.com> > wrote: > >> Sure I can copy

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
I should add that surely the idea behind UDT is exactly that it can (a) fit automatically into DFs and Tungsten and (b) that it can be used efficiently in writing ones own UDTs and UDAFs? On Sat, Sep 12, 2015 at 11:05 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote: > Can I ask w

Re: HyperLogLogUDT

2015-09-12 Thread Nick Pentreath
eiderdorp > hvanhov...@questtec.nl > +599 9 521 4402 > > > 2015-09-12 10:07 GMT+02:00 Nick Pentreath <nick.pentre...@gmail.com>: > >> Inspired by this post: >> http://eugenezhulenev.com/blog/2015/07/15/interactive-audience-analytics-with-spark-and-hy

Re: HyperLogLogUDT

2015-07-02 Thread Nick Pentreath
/org/apache/spark/rdd/RDD.scala#L1153 and access the HLL directly, or do anything you like. On Wed, Jul 1, 2015 at 5:26 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Any thoughts? — Sent from Mailbox https://www.dropbox.com/mailbox On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath

Re: HyperLogLogUDT

2015-07-01 Thread Nick Pentreath
Any thoughts? — Sent from Mailbox On Tue, Jun 23, 2015 at 11:19 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Hey Spark devs I've been looking at DF UDFs and UDAFs. The approx distinct is using hyperloglog, but there is only an option to return the count as a Long. It can be useful

HyperLogLogUDT

2015-06-23 Thread Nick Pentreath
Hey Spark devs I've been looking at DF UDFs and UDAFs. The approx distinct is using hyperloglog, but there is only an option to return the count as a Long. It can be useful to be able to return and store the actual data structure (ie serialized HLL). This effectively allows one to do aggregation
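The appeal of returning the serialized structure rather than just a Long is that HLL sketches are mergeable: the union of two sketches is simply the element-wise maximum of their registers, so partial aggregates can be combined later. A deliberately tiny plain-Python illustration of that merge property (not Spark's HyperLogLog++, and far too few registers for real use):

```python
import hashlib

NUM_REGISTERS = 16  # 2^4 buckets; real HLLs use far more for accuracy

def hll_add(registers, value):
    """Hash the value; bucket on the low bits and record the maximum
    trailing-zero run length (+1) of the remaining bits."""
    h = int(hashlib.md5(str(value).encode()).hexdigest(), 16)
    bucket = h % NUM_REGISTERS
    rest = h // NUM_REGISTERS
    rank = 1
    while rest % 2 == 0 and rank < 64:
        rank += 1
        rest //= 2
    registers[bucket] = max(registers[bucket], rank)

def hll_merge(a, b):
    """Union of two sketches: the element-wise max of their registers."""
    return [max(x, y) for x, y in zip(a, b)]

r1, r2 = [0] * NUM_REGISTERS, [0] * NUM_REGISTERS
for v in range(100):
    hll_add(r1, v)
for v in range(50, 150):
    hll_add(r2, v)
merged = hll_merge(r1, r2)
# merged now summarizes the union {0..149}; adding those values to a
# fresh sketch directly yields identical registers (merge is lossless)
```

This is exactly what makes storing the sketch per time window useful: daily sketches can be merged into weekly or monthly distinct counts without rescanning the raw data.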

Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-18 Thread Nick Pentreath
If it's going into the DataFrame API (which it probably should rather than in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT) which would mean it doesn't have to implement Serializable, as it appears that serialization is taken care of in the UDT def (e.g.

Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-10 Thread Nick Pentreath
Looks very interesting, thanks for sharing this. I haven't had much chance to do more than a quick glance over the code. Quick question - are the Word2Vec and GLOVE implementations fully parallel on Spark? On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright ewri...@live.com wrote: The deeplearning4j

Re: [discuss] ending support for Java 6?

2015-05-01 Thread Nick Pentreath
+1 for this, think it's high time. We should of course do it with enough warning for users. 1.4 may be too early (not for me though!). Perhaps we specify that 1.5 will officially move to JDK7? — Sent from Mailbox On Fri, May 1, 2015 at 12:16 AM, Ram Sriharsha

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point to read multiple files together in a partition, is it not simpler to use the approach of copy Hadoop conf and set per-RDD settings for min split to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM,

Re: Directly broadcasting (sort of) RDDs

2015-03-21 Thread Nick Pentreath
There is block matrix in Spark 1.3 - http://spark.apache.org/docs/latest/mllib-data-types.html#blockmatrix However I believe it only supports dense matrix blocks. Still, might be possible to use it or extend it. JIRAs:

Re: Welcoming three new committers

2015-02-04 Thread Nick Pentreath
Congrats and welcome Sean, Joseph and Cheng! On Wed, Feb 4, 2015 at 2:10 PM, Sean Owen so...@cloudera.com wrote: Thanks all, I appreciate the vote of trust. I'll do my best to help keep JIRA and commits moving along, and am ramping up carefully this week. Now get back to work reviewing

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
...@gmail.com wrote: HBaseConverter is in Spark source tree. Therefore I think it makes sense for this improvement to be accepted so that the example is more useful. Cheers On Mon, Jan 5, 2015 at 7:54 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Hey These converters are actually just

Re: python converter in HBaseConverter.scala(spark/examples)

2015-01-05 Thread Nick Pentreath
Hey  These converters are actually just intended to be examples of how to set up a custom converter for a specific input format. The converter interface is there to provide flexibility where needed, although with the new SparkSQL data store interface the intention is that most common use

Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
Oh actually I was confused with another project, yours was not LSH sorry! — Sent from Mailbox On Fri, Jan 2, 2015 at 8:19 AM, Nick Pentreath nick.pentre...@gmail.com wrote: I'm sure Spark will sign up for GSoC again this year - and id be surprised if there was not some interest now

Re: Highly interested in contributing to spark

2015-01-01 Thread Nick Pentreath
I'm sure Spark will sign up for GSoC again this year - and id be surprised if there was not some interest now for projects :) If I have the time at that point in the year I'd be happy to mentor a project in MLlib but will have to see how my schedule is at that point! Manoj perhaps some of

Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-13 Thread Nick Pentreath
+1 — Sent from Mailbox On Sat, Dec 13, 2014 at 3:12 PM, GuoQiang Li wi...@qq.com wrote: +1 (non-binding). Tested on CentOS 6.4 -- Original -- From: Patrick Wendell;pwend...@gmail.com; Date: Thu, Dec 11, 2014 05:08 AM To:

Re: [VOTE] Designating maintainers for some Spark components

2014-11-06 Thread Nick Pentreath
+1 (binding) — Sent from Mailbox On Thu, Nov 6, 2014 at 6:52 PM, Debasish Das debasish.da...@gmail.com wrote: +1 The app to track PRs based on component is a great idea... On Thu, Nov 6, 2014 at 8:47 AM, Sean McNamara sean.mcnam...@webtrends.com wrote: +1 Sean On Nov 5, 2014, at 6:32

Re: matrix factorization cross validation

2014-10-30 Thread Nick Pentreath
Looking at https://github.com/apache/spark/blob/814a9cd7fabebf2a06f7e2e5d46b6a2b28b917c2/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RankingMetrics.scala#L82 For each user in test set, you generate an Array of top K predicted item ids (Int or String probably), and an Array of ground
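The per-user inputs described above (a top-K array of predicted ids plus a ground-truth array) feed binary-relevance ranking metrics. A plain-Python sketch of precision@k and one common binary NDCG@k formulation, for illustrating the semantics rather than reproducing RankingMetrics' exact code:

```python
import math

def precision_at_k(predicted, ground_truth, k):
    """Fraction of the top-k predicted items that appear in the ground truth."""
    relevant = set(ground_truth)
    hits = sum(1 for item in predicted[:k] if item in relevant)
    return hits / k

def ndcg_at_k(predicted, ground_truth, k):
    """Binary-relevance NDCG@k: DCG over the top-k predictions divided by
    the ideal DCG (all relevant items ranked first)."""
    relevant = set(ground_truth)
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(predicted[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

preds, truth = ["a", "b", "c"], ["a", "c", "d"]
print(precision_at_k(preds, truth, 2))  # 0.5 -- only "a" hits in the top 2
```

The generalization the thread asks about would replace the 0/1 membership test with an arbitrary relevance score in the DCG numerator.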

Re: matrix factorization cross validation

2014-10-30 Thread Nick Pentreath
Sean, re my point earlier do you know a more efficient way to compute top k for each user, other than to broadcast the item factors?  (I guess one can use the new asymmetric lsh paper perhaps to assist) — Sent from Mailbox On Thu, Oct 30, 2014 at 11:24 PM, Sean Owen so...@cloudera.com wrote:

Re: Oryx + Spark mllib

2014-10-19 Thread Nick Pentreath
We've built a model server internally, based on Scalatra and Akka Clustering. Our use case is more geared towards serving possibly thousands of smaller models. It's actually very basic, just reads models from S3 as strings (!!) (uses HDFS FileSystem so can read from local, HDFS, S3) and uses

Re: Oryx + Spark mllib

2014-10-19 Thread Nick Pentreath
to understand the scalability Thanks. Deb On Sat, Oct 18, 2014 at 11:22 PM, Nick Pentreath nick.pentre...@gmail.com wrote: We've built a model server internally, based on Scalatra and Akka Clustering. Our use case is more geared towards serving possibly thousands of smaller models. It's

Re: Oryx + Spark mllib

2014-10-19 Thread Nick Pentreath
The shared-nothing load-balanced server architecture works for all but the most massive models - and even then a few big EC2 r3 instances should do the trick. One nice thing about Akka (and especially the new HTTP) is fault tolerance, recovery and potential for persistence. For us arguably the

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-10 Thread Nick Pentreath
the API, though, that would allow us to get moving more quickly. On Wed, Jul 9, 2014 at 8:39 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Cool, seems like a good initiative. Adding a couple of extra high quality clustering implementations will be great. I'd say it would make most sense

Re: Contributing to MLlib: Proposal for Clustering Algorithms

2014-07-09 Thread Nick Pentreath
Cool, seems like a good initiative. Adding a couple of extra high quality clustering implementations will be great. I'd say it would make most sense to submit a PR for the standardised API first, agree on that with everyone and then build on it for the specific implementations. — Sent from Mailbox On
