Re: [VOTE] SPIP: Support Docker Official Image for Spark

2022-09-21 Thread Xiangrui Meng
+1 On Wed, Sep 21, 2022 at 6:53 PM Kent Yao wrote: > +1 > > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC > interface for large-scale data processing and

Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-28 Thread Xiangrui Meng
+1. And we should start testing 3.7 and maybe 3.8 in Jenkins. On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun wrote: > Thank you for starting the thread. > > In addition to that, we currently are testing Python 3.6 only in Apache > Spark Jenkins environment. > > Given that Python 3.8 is already

Re: SparkGraph review process

2019-10-04 Thread Xiangrui Meng
ity > > We are the developers behind the SparkGraph SPIP, which is a project > created out of our work on openCypher Morpheus ( > https://github.com/opencypher/morpheus). During this year we have > collaborated with mainly Xiangrui Meng of Databricks to define and develop > a new Sp

[ANNOUNCEMENT] Plan for dropping Python 2 support

2019-06-03 Thread Xiangrui Meng
Hi all, Today we announced the plan for dropping Python 2 support [1] in Apache Spark: As many of you already know, the Python core development team and many widely used Python packages like Pandas and NumPy will drop Python 2

Re: Should python-2 be supported in Spark 3.0?

2019-06-03 Thread Xiangrui Meng
-- > *From:* shane knapp > *Sent:* Friday, May 31, 2019 7:38:10 PM > *To:* Denny Lee > *Cc:* Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark > Hamstra; Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng; > dev; user > *Subj

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
, I'm going to upload it to Spark website and announce it here. Let me know if you think we should do a VOTE instead. On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng wrote: > I created https://issues.apache.org/jira/browse/SPARK-27884 to track the > work. > > On Thu, May 30, 2019 at 2

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Xiangrui Meng
From:* Reynold Xin > *Sent:* Thursday, May 30, 2019 12:59:14 AM > *To:* shane knapp > *Cc:* Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen > Fen; Xiangrui Meng; dev; user > *Subject:* Re: Should python-2 be supported in Spark 3.0? > > +1 on Xiangrui’s plan. &

Re: Should python-2 be supported in Spark 3.0?

2019-05-29 Thread Xiangrui Meng
Hi all, I want to revive this old thread since no action was taken so far. If we plan to mark Python 2 as deprecated in Spark 3.0, we should do it as early as possible and let users know ahead. PySpark depends on Python, numpy, pandas, and pyarrow, all of which are sunsetting Python 2 support by

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-05-13 Thread Xiangrui Meng
My vote is 0. Since the updated SPIP focuses on ETL use cases, I don't feel strongly about it. I would still suggest doing the following: 1. Link the POC mentioned in Q4, so people can verify the POC result. 2. List the public APIs we plan to expose in Appendix A. I did a quick check. Beside

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-22 Thread Xiangrui Meng
different way). It’s a bit harder for a Java API, but maybe Spark could > just expose byte arrays directly and work on those if the API is not > guaranteed to stay stable (that is, we’d still use our own classes to > manipulate the data internally, and end users could use the Arrow library &

Re: [VOTE][SPARK-27396] SPIP: Public APIs for extended Columnar Processing Support

2019-04-19 Thread Xiangrui Meng
I posted my comment in the JIRA. Main concerns here: 1. Exposing third-party Java APIs in Spark is risky. Arrow might have 1.0 release

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
pier. Tom and Andy from NVIDIA are certainly more calibrated on the usefulness of the current proposal. > > On Mon, Mar 25, 2019 at 7:39 PM Xiangrui Meng wrote: > >> There are certainly use cases where different stages require different >> number of CPUs or GPUs under an

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
barrier mode stages having > a need for an inter-task channel resource, gpu-ified stages needing gpu > resources, etc. Have I mentioned that I'm not a fan of the current barrier > mode API, Xiangrui? :) Yes, I know: "Show me something better." > > On Mon, Mar 25, 2019

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-25 Thread Xiangrui Meng
highlight one thing. In >> page 5 of the SPIP, when we talk about DRA, I see: >> >> "For instance, if each executor consists 4 CPUs and 2 GPUs, and each >> task requires 1 CPU and 1GPU, then we shall throw an error on application >> start because we shall always have

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-20 Thread Xiangrui Meng
rLRdQM3y7ejil64/edit#> >> and stories >> <https://docs.google.com/document/d/12JjloksHCdslMXhdVZ3xY5l1Nde3HRhIrqvzGnK_bNE/edit#heading=h.udyua28eu3sg>, >> I hope it now contains clear scope of the changes and enough details for >> SPIP vote. >> Please review the updated docs, thanks! >> >> X

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-19 Thread Xiangrui Meng
eems like a fine position. > > On Mon, Mar 18, 2019 at 1:56 PM Xingbo Jiang > wrote: > > > > Hi all, > > > > I updated the SPIP doc and stories, I hope it now contains clear scope > of the changes and enough details for SPIP vote. > > Please review the

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-05 Thread Xiangrui Meng
to a different thread >> and then come back to this, thoughts? >> >> Note there is a high level design for at least the core piece, which is >> what people seem concerned with, already so including it in the SPIP should >> be straight forward. >> >>

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 3:10 PM Mark Hamstra wrote: > :) Sorry, that was ambiguous. I was seconding Imran's comment. > Could you also help review Xingbo's design sketch and help evaluate the cost? > > On Mon, Mar 4, 2019 at 3:09 PM Xiangrui Meng wrote: > >> >> >

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 1:56 PM Mark Hamstra wrote: > +1 > Mark, just to be clear, are you +1 on the SPIP or Imran's point? > > On Mon, Mar 4, 2019 at 12:52 PM Imran Rashid wrote: > >> On Sun, Mar 3, 2019 at 6:51 PM Xiangrui Meng wrote: >> >>> On Sun,

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
On Mon, Mar 4, 2019 at 8:23 AM Xiangrui Meng wrote: > > > On Mon, Mar 4, 2019 at 7:24 AM Sean Owen wrote: > >> To be clear, those goals sound fine to me. I don't think voting on >> those two broad points is meaningful, but, does no harm per se. If you >> me

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
oc. Yinan mentioned three options to expose the inferences to users. We need to finalize the design and discuss which option is the best to go. You see that such discussions can be done in parallel. It is not efficient if we block the work on K8s because we cannot decide whether we should support M

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-04 Thread Xiangrui Meng
t; greatly concerning, like “oh scheduler is allocating GPU, but how does it > affect memory” and many more, and so I think finer “high level” goals > should be defined. > > > > > -- > *From:* Sean Owen > *Sent:* Sunday, March 3, 2019 5:24 PM > *To:* X

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-03 Thread Xiangrui Meng
Hi Felix, Just to clarify, we are voting on the SPIP, not the companion scoping doc. What is proposed and what we are voting on is to make Spark accelerator-aware. The companion scoping doc and the design sketch are to help demonstrate what features could be implemented based on the use

Re: [VOTE] [SPARK-24615] SPIP: Accelerator-aware Scheduling

2019-03-01 Thread Xiangrui Meng
+1 Btw, as Ryan pointed out last time, +0 doesn't mean "Don't really care." Official definitions here: https://www.apache.org/foundation/voting.html#expressing-votes-1-0-1-and-fractions - +0: 'I don't feel strongly about it, but I'm okay with this.' - -0: 'I won't get in the way,

Re: SPIP: Accelerator-aware Scheduling

2019-02-26 Thread Xiangrui Meng
In case there are issues visiting the Google doc, I attached PDF files to the JIRA. On Tue, Feb 26, 2019 at 7:41 AM Xingbo Jiang wrote: > Hi all, > > I want to send a revised SPIP on implementing Accelerator(GPU)-aware > Scheduling. It improves Spark by making it aware of GPUs exposed by cluster >

[VOTE] [RESULT] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-12 Thread Xiangrui Meng
Hi all, The vote passed with the following +1s (* = binding) and no 0s/-1s: * Denny Lee * Jules Damji * Xiao Li* * Dongjoon Hyun * Mingjie Tang * Yanbo Liang* * Marco Gaido * Joseph Bradley* * Xiangrui Meng* Please watch SPARK-25994 and join future discussions there. Thanks! Best, Xiangrui

Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-12 Thread Xiangrui Meng
+1 from myself. The vote passed with the following +1s and no -1s: * Denny Lee * Jules Damji * Xiao Li* * Dongjoon Hyun * Mingjie Tang * Yanbo Liang* * Marco Gaido * Joseph Bradley* * Xiangrui Meng* I will send a result email soon. Please watch SPARK-25994 for future discussions. Thanks! Best

Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-30 Thread Xiangrui Meng
> Martin > On 29.01.19 18:59, Dongjoon Hyun wrote: > > Hi, Xiangrui Meng. > > +1 for the proposal. > > However, please update the following section for this vote. As we see, it > seems to be inaccurate because today is Jan. 29th. (Almost February). > (Since I c

[VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-29 Thread Xiangrui Meng
Hi all, I want to call for a vote on SPARK-25994. It introduces a new DataFrame-based component to Spark, which supports property graph construction, Cypher queries, and graph algorithms. The proposal

SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-01-15 Thread Xiangrui Meng
Hi all, I want to re-send the previous SPIP on introducing a DataFrame-based graph component to collect more feedback. It supports property graphs, Cypher graph queries, and graph algorithms built on top of the DataFrame API. If you are a GraphX user or your workload is essentially graph queries,

Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
(don't know why your email ends with ".invalid") On Wed, Dec 19, 2018 at 9:13 AM Xiangrui Meng wrote: > > > On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach > wrote: > > > > [Note: I sent this earlier but it looks like the email was blocked > because I h

Re: barrier execution mode with DataFrame and dynamic allocation

2018-12-19 Thread Xiangrui Meng
On Wed, Dec 19, 2018 at 7:34 AM Ilya Matiach wrote: > > [Note: I sent this earlier but it looks like the email was blocked because I had another email group on the CC line] > > Hi Spark Dev, > > I would like to use the new barrier execution mode introduced in spark 2.4 with LightGBM in the spark
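
For context on the API discussed in this thread: a minimal sketch of the barrier execution mode added in Spark 2.4, assuming an existing SparkContext `sc` and enough concurrent task slots for all partitions; the numbers are illustrative.

~~~
import org.apache.spark.BarrierTaskContext

// barrier mode requires every task of the stage to be scheduled at the same time,
// so keep the partition count <= the number of available slots
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  ctx.barrier()            // all tasks in this stage wait here before proceeding
  iter.map(_ * 2)
}
doubled.count()
~~~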

Re: SPIP: Property Graphs, Cypher Queries, and Algorithms

2018-11-13 Thread Xiangrui Meng
jira/browse/SPARK-26028 > Google Doc: > https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing > > Thanks, > > Martin (on behalf of the Neo4j Cypher for Apache Spark team) > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://d

Re: [VOTE] SPARK 2.4.0 (RC5)

2018-11-01 Thread Xiangrui Meng
hitral Verma > Dilip Biswal > Denny Lee > Felix Cheung (binding) > Dongjoon Hyun > > +0: > DB Tsai (binding) > > -1: None > > Thanks, everyone! > > On Thu, Nov 1, 2018 at 1:26 PM Dongjoon Hyun > wrote: > >> +1 >> >> Cheers, >&g

Re: [VOTE] SPARK 2.4.0 (RC2)

2018-10-01 Thread Xiangrui Meng
;>>> > an existing Spark workload and running on this release candidate, >>>>> then >>>>> > reporting any regressions. >>>>> > >>>>> > If you're working in PySpark you can set up a virtual env and install >>>&

Re: 2.4.0 Blockers, Critical, etc

2018-09-21 Thread Xiangrui Meng
> >> SPARK-22809 pyspark is sensitive to imports with dots > >> SPARK-22739 Additional Expression Support for Objects > >> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested > >> list of structures > >> SPARK-21030 extend hint syntax to support any expression for Python and > R > >> SPARK-22386 Data Source V2 improvements > >> SPARK-15117 Generate code that get a value in each compressed column > >> from CachedBatch when DataFrame.cache() is called > >> > >> - > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > >> > > - > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org > > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xiangrui Meng
upport is still missing. >>>>>> Great to have in 2.4. >>>>>> >> >>>>>> >> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect >>>>>> answers >>>>>> >> This is a long-standing correctness bug, great to have in 2.4. >>>>>> >> >>>>>> >> There are some other important features like the adaptive >>>>>> execution, streaming SQL, etc., not in the list, since I think we are not >>>>>> able to finish them before 2.4. >>>>>> >> >>>>>> >> Feel free to add more things if you think they are important to >>>>>> Spark 2.4 by replying to this email. >>>>>> >> >>>>>> >> Thanks, >>>>>> >> Wenchen >>>>>> >> >>>>>> >> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen >>>>>> wrote: >>>>>> >> >>>>>> >> In theory releases happen on a time-based cadence, so it's >>>>>> pretty much wrap up what's ready by the code freeze and ship it. In >>>>>> practice, the cadence slips frequently, and it's very much a negotiation >>>>>> about what features should push the >>>>>> >> code freeze out a few weeks every time. So, kind of a hybrid >>>>>> approach here that works OK. >>>>>> >> >>>>>> >> Certainly speak up if you think there's something that really >>>>>> needs to get into 2.4. This is that discuss thread. >>>>>> >> >>>>>> >> (BTW I updated the page you mention just yesterday, to reflect >>>>>> the plan suggested in this thread.) >>>>>> >> >>>>>> >> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves >>>>>> wrote: >>>>>> >> >>>>>> >> Shouldn't this be a discuss thread? >>>>>> >> >>>>>> >> I'm also happy to see more release managers and agree the time >>>>>> is getting close, but we should see what features are in progress and see >>>>>> how close things are and propose a date based on that. Cutting a branch >>>>>> to >>>>>> soon just creates >>>>>> >> more work for committers to push to more branches. >>>>>> >> >>>>>> >>http://spark.apache.org/versioning-policy.html mentioned the >>>>>> code freeze and release branch cut mid-august. >>>>>> >> >>>>>> >> Tom >>>>>> > >>>>>> > >>>>>> - >>>>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >>>>>> > >>>>>> >>>>>> >>>> >> -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

[SPARK-24579] SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-06-18 Thread Xiangrui Meng
a look and let me know your thoughts in JIRA comments. Thanks! Best, Xiangrui -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: [VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-04 Thread Xiangrui Meng
+1 from myself. The vote passed with the following +1s: * Susham kumar reddy Yerabolu * Xingbo Jiang * Xiao Li* * Weichen Xu * Joseph Bradley* * Henry Robinson * Xiangrui Meng* * Wenchen Fan* Henry, you can find a design sketch at https://issues.apache.org/jira/browse/SPARK-24375. To help

Re: [VOTE] SPIP ML Pipelines in R

2018-06-01 Thread Xiangrui Meng
> > >> > Thanks, >> > --Hossein >> >> - >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org >> >> > > > -- > > Joseph Bradley > > Software Engineer - Machine Learning > > Databricks, Inc. > > [image: http://databricks.com] <http://databricks.com/> > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

[VOTE] [SPARK-24374] SPIP: Support Barrier Scheduling in Apache Spark

2018-06-01 Thread Xiangrui Meng
don't think this is a good idea because of the following technical reasons. Best, Xiangrui -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Integrating ML/DL frameworks with Spark

2018-05-23 Thread Xiangrui Meng
* Bryan Cutler <cutl...@gmail.com> > *Sent:* Monday, May 14, 2018 11:37:20 PM > *To:* Xiangrui Meng > *Cc:* Reynold Xin; dev > > *Subject:* Re: Integrating ML/DL frameworks with Spark > Thanks for starting this discussion, I'd also like to see some > improvement

Re: Integrating ML/DL frameworks with Spark

2018-05-09 Thread Xiangrui Meng
is is not only useful for integrating with 3rd-party frameworks, >>>>>> but also useful for scaling MLlib algorithms. One of my earliest attempts >>>>>> in Spark MLlib was to implement All-Reduce primitive (SPARK-1485 >>>>>> <ht

Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Xiangrui Meng
roposal. We'd like to hear your feedback and past efforts along those directions if they were not fully captured by our JIRA. > Xiangrui - please also chime in if I didn’t capture everything. > > > -- Xiangrui Meng Software Engineer Databricks Inc. [image: http://databricks.com] <http://databricks.com/>

Re: Welcoming Yanbo Liang as a committer

2016-06-07 Thread Xiangrui Meng
Congrats!! On Mon, Jun 6, 2016, 8:12 AM Gayathri Murali wrote: > Congratulations Yanbo Liang! Well deserved. > > > On Sun, Jun 5, 2016 at 7:10 PM, Shixiong(Ryan) Zhu < > shixi...@databricks.com> wrote: > >> Congrats, Yanbo! >> >> On Sun, Jun 5, 2016 at 6:25 PM,

Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
nt: > OutputCommitCoordinator stopped! > 1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully > stopped SparkContext > 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown > hook called > 1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHo

Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
Is it on 1.6.x? On Wed, May 18, 2016, 6:57 PM Sun Rui wrote: > I saw it, but I can’t see the complete error message on it. > I mean the part after “error in invokingJava(…)” > > On May 19, 2016, at 08:37, Gayathri Murali > wrote: > > There was

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
Not exactly the same as the one you suggested, but you can chain it with flatMap to get what you want, if each file is not huge. On Thu, May 19, 2016, 8:41 AM Xiangrui Meng <men...@gmail.com> wrote: > This was implemented as sc.wholeTextFiles. > > On Thu, May 19, 2016, 2:43 AM

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
This was implemented as sc.wholeTextFiles. On Thu, May 19, 2016, 2:43 AM Reynold Xin wrote: > Users would be able to run this already with the 3 lines of code you > supplied right? In general there are a lot of methods already on > SparkContext and we lean towards the more
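
A small sketch of what the two replies above describe, assuming an existing SparkContext `sc`, an illustrative input path, and files small enough to be read as single records.

~~~
// sc.wholeTextFiles returns (path, fileContent) pairs, one record per file;
// chaining flatMap splits each file back into lines, similar in effect to CombineTextInputFormat
val lines = sc.wholeTextFiles("hdfs:///data/small-files/*")
  .flatMap { case (path, content) => content.split("\n") }
lines.count()
~~~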

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
ace in the 2.x series ? > > Thanks > Shivaram > > On Tue, Apr 5, 2016 at 11:48 AM, Sean Owen <so...@cloudera.com> wrote: > > FWIW, all of that sounds like a good plan to me. Developing one API is > > certainly better than two. > > > > On Tue, Apr 5, 20

Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Xiangrui Meng
Hi all, More than a year ago, in Spark 1.2 we introduced the ML pipeline API built on top of Spark SQL’s DataFrames. Since then the new DataFrame-based API has been developed under the spark.ml package, while the old RDD-based API has been developed in parallel under the spark.mllib package.
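
For readers unfamiliar with the two packages being contrasted, a minimal sketch of the DataFrame-based spark.ml pipeline API (the one that stayed under active development); `training` is an assumed DataFrame with "label" and "features" columns, and the parameter values are illustrative.

~~~
import org.apache.spark.ml.classification.LogisticRegression

// spark.ml estimators consume DataFrames and return fitted models (transformers)
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val model = lr.fit(training)            // `training` is an assumed DataFrame
val predictions = model.transform(training)
~~~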

Re: Various forks

2016-03-19 Thread Xiangrui Meng
We made that fork to hide package private classes/members in the generated Java API doc. Otherwise, the Java API doc is very messy. The patch is to map all private[*] to the default scope in the generated Java code. However, this might not be the expected behavior for other packages. So it didn't

Re: [VOTE] Release Apache Spark 1.5.1 (RC1)

2015-09-24 Thread Xiangrui Meng
+1. Checked user guide and API doc, and ran some MLlib and SparkR examples. -Xiangrui On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin wrote: > I'm going to +1 this myself. Tested on my laptop. > > > > On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote: >>

Re: Are These Issues Suitable for our Senior Project?

2015-07-09 Thread Xiangrui Meng
Hi Emrehan, Thanks for asking! There are actually many TODOs for MLlib. I would recommend starting with small tasks before picking a topic for your senior project. Please check https://issues.apache.org/jira/browse/SPARK-8445 for the 1.5 roadmap and see whether there are ones you are interested

Re: [mllib] Refactoring some spark.mllib model classes in Python not inheriting JavaModelWrapper

2015-06-18 Thread Xiangrui Meng
Hi Yu, Reducing the code complexity on the Python side is certainly what we want to see:) We didn't call Java directly in Python models because Java methods don't work inside RDD closures, e.g., rdd.map(lambda x: model.predict(x[1])) But I agree that for model save/load the implementation

Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)

2015-06-17 Thread Xiangrui Meng
Hi Eron, Please register your Spark Package on http://spark-packages.org, which helps users find your work. Do you have some performance benchmark to share? Thanks! Best, Xiangrui On Wed, Jun 10, 2015 at 10:48 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Looks very interesting, thanks

Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
inspect the frame `df.name` gets called and warn users in `df.select(df.name)` but not in `name = df.name`. This could be tricky to implement. -Xiangrui Thanks Shivaram On Thu, May 7, 2015 at 11:59 PM, Xiangrui Meng men...@gmail.com wrote: Hi all, In PySpark, a DataFrame column can

Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

2015-05-08 Thread Xiangrui Meng
Hi all, In PySpark, a DataFrame column can be referenced using df["abcd"] (__getitem__) and df.abcd (__getattr__). There is a discussion on SPARK-7035 on compatibility issues with the __getattr__ approach, and I want to collect more inputs on this. Basically, if in the future we introduce a new

Re: Pickling error when attempting to add a method in pyspark

2015-05-06 Thread Xiangrui Meng
Hi Stephen, I think it would be easier to see what you implemented by showing the branch diff link on github. There are couple utility class to make Rating work between Scala and Python: 1. serializer:

Re: [discuss] ending support for Java 6?

2015-05-06 Thread Xiangrui Meng
+1. One issue with dropping Java 6: if we use Java 7 to build the assembly jar, it will use zip64. Would Python 2.x (or even 3.x) be able to load zip64 files on PYTHONPATH? -Xiangrui On Tue, May 5, 2015 at 3:25 PM, Reynold Xin r...@databricks.com wrote: OK I sent an email. On Tue, May 5, 2015

Re: OOM error with GMMs on 4GB dataset

2015-05-06 Thread Xiangrui Meng
Did you set `--driver-memory` with spark-submit? -Xiangrui On Mon, May 4, 2015 at 5:16 PM, Vinay Muttineni vmuttin...@ebay.com wrote: Hi, I am training a GMM with 10 gaussians on a 4 GB dataset(720,000 * 760). The spark (1.3.1) job is allocated 120 executors with 6GB each and the driver also

Re: Stochastic gradient descent performance

2015-04-06 Thread Xiangrui Meng
Gap sampling is triggered when the sampling probability is small and the underlying storage supports constant-time lookups, in particular ArrayBuffer. This is a very strict requirement. If the RDD is cached in memory, we use ArrayBuffer to store its elements and rdd.sample will trigger gap
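
A minimal sketch of the scenario described above, assuming an existing SparkContext `sc`; the fraction is deliberately small so the optimized gap-sampling path can kick in on the cached, ArrayBuffer-backed partitions.

~~~
val rdd = sc.parallelize(1 to 1000000, numSlices = 8).cache()
rdd.count()                                            // materialize the in-memory cache
val tiny = rdd.sample(withReplacement = false, fraction = 0.001)
tiny.count()
~~~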

Re: Support parallelized online matrix factorization for Collaborative Filtering

2015-04-06 Thread Xiangrui Meng
This is being discussed in https://issues.apache.org/jira/browse/SPARK-6407. Let's move the discussion there. Thanks for providing references! -Xiangrui On Sun, Apr 5, 2015 at 11:48 PM, Chunnan Yao yaochun...@gmail.com wrote: On-line Collaborative Filtering(CF) has been widely used and studied.

Re: [VOTE] Release Apache Spark 1.3.1

2015-04-05 Thread Xiangrui Meng
+1 Verified some MLlib bug fixes on OS X. -Xiangrui On Sun, Apr 5, 2015 at 1:24 AM, Sean Owen so...@cloudera.com wrote: Signatures and hashes are good. LICENSE, NOTICE still check out. Compiles for a Hadoop 2.6 + YARN + Hive profile. I still see the UISeleniumSuite test failure observed in

Re: mllib.recommendation Design

2015-03-30 Thread Xiangrui Meng
On Tue, Feb 17, 2015 at 3:19 PM, Xiangrui Meng men...@gmail.com wrote: The current ALS implementation allows pluggable solvers for NormalEquation, where we put CholeskySolver and the NNLS solver. Please check the current implementation and let us know how your constraint solver would fit

Re: Using CUDA within Spark / boosting linear algebra

2015-03-30 Thread Xiangrui Meng
Hi Alex, Since it is non-trivial to make nvblas work with netlib-java, it would be great if you can send the instructions to netlib-java as part of the README. Hopefully we don't need to modify netlib-java code to use nvblas. Best, Xiangrui On Thu, Mar 26, 2015 at 9:54 AM, Sean Owen

Re: enum-like types in Spark

2015-03-17 Thread Xiangrui Meng
is why I think #4 is fine. But I figured I'd give my spiel, because every developer loves language wars :) Imran On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote: `case object` inside an `object` doesn't show up in Java. This is the minimal code I found to make

Re: enum-like types in Spark

2015-03-16 Thread Xiangrui Meng
. I doubt it really matters that much for Spark internals, which is why I think #4 is fine. But I figured I'd give my spiel, because every developer loves language wars :) Imran On Thu, Mar 5, 2015 at 1:35 AM, Xiangrui Meng men...@gmail.com wrote

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Xiangrui Meng
Krishna, I tested your linear regression example. For linear regression, we changed its objective function from 1/n * \|Ax - b\|_2^2 to 1/(2n) * \|Ax - b\|_2^2 to be consistent with common least squares formulations. It means you can reproduce the same result by multiplying the step size by 2.
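
A short note on why doubling the step size recovers the old results (w denotes the coefficients and gamma the step size; this is the standard gradient calculation, spelled out here rather than quoted from the thread).

~~~
f_old(w) = 1/n * \|Aw - b\|_2^2      =>  grad f_old(w) = 2/n * A^T (Aw - b)
f_new(w) = 1/(2n) * \|Aw - b\|_2^2   =>  grad f_new(w) = 1/n * A^T (Aw - b) = (1/2) * grad f_old(w)

w - (2*gamma) * grad f_new(w) = w - gamma * grad f_old(w)   // identical gradient-descent iterates
~~~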

enum-like types in Spark

2015-03-04 Thread Xiangrui Meng
Hi all, There are many places where we use enum-like types in Spark, but in different ways. Every approach has both pros and cons. I wonder whether there should be an “official” approach for enum-like types in Spark. 1. Scala’s Enumeration (e.g., SchedulingMode, WorkerState, etc) * All types
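
To make the options concrete: the first approach named above is Scala's Enumeration, as used by the real org.apache.spark.scheduler.SchedulingMode; the case-object style debated later in the thread looks roughly like the second block, where the type and member names are made up for illustration.

~~~
// Approach 1: Scala Enumeration (this is what org.apache.spark.scheduler.SchedulingMode uses)
object SchedulingMode extends Enumeration {
  type SchedulingMode = Value
  val FAIR, FIFO, NONE = Value
}

// A sealed-class / case-object alternative also discussed in the thread
// (StorageKind and its members are hypothetical, not actual Spark types)
sealed abstract class StorageKind
object StorageKind {
  case object Memory extends StorageKind
  case object Disk extends StorageKind
}
~~~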

Re: [VOTE] Release Apache Spark 1.3.0 (RC2)

2015-03-03 Thread Xiangrui Meng
On Tue, Mar 3, 2015 at 11:15 PM, Krishna Sankar ksanka...@gmail.com wrote: +1 (non-binding, of course) 1. Compiled OSX 10.10 (Yosemite) OK Total time: 13:53 min mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11 2. Tested

Re: Using CUDA within Spark / boosting linear algebra

2015-03-02 Thread Xiangrui Meng
at it) On 27 Feb 2015 20:26, Xiangrui Meng men...@gmail.com wrote: Hey Sam, The running times are not big O estimates: The CPU version finished in 12 seconds. The CPU-GPU-CPU version finished in 2.2 seconds. The GPU version finished in 1.7 seconds. I think there is something wrong

Re: Using CUDA within Spark / boosting linear algebra

2015-02-27 Thread Xiangrui Meng
of this in my talk, with explanations, I can't stress enough how much I recommend that you watch it if you want to understand high performance hardware acceleration for linear algebra :-) On 27 Feb 2015 01:42, Xiangrui Meng men...@gmail.com wrote: The copying overhead should be quadratic on n, while

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
Hey Alexander, I don't quite understand the part where netlib-cublas is about 20x slower than netlib-openblas. What is the overhead of using a GPU BLAS with netlib-java? CC'ed Sam, the author of netlib-java. Best, Xiangrui On Wed, Feb 25, 2015 at 3:36 PM, Joseph Bradley jos...@databricks.com

Re: Google Summer of Code - ideas

2015-02-26 Thread Xiangrui Meng
There are a couple of things in the Scala/Java API that are missing from the Python API: 1. model import/export 2. evaluation metrics 3. distributed linear algebra 4. streaming algorithms. If you are interested, we can list/create target JIRAs and hunt them down one by one. Best, Xiangrui On Wed, Feb 25, 2015 at 7:37

Re: Using CUDA within Spark / boosting linear algebra

2015-02-26 Thread Xiangrui Meng
am going to ask the developer of BIDMat on his upcoming talk. Best regards, Alexander From: Sam Halliday [mailto:sam.halli...@gmail.com] Sent: Thursday, February 26, 2015 1:56 PM To: Xiangrui Meng Cc: dev@spark.apache.org; Joseph Bradley; Ulanov, Alexander; Evan R. Sparks Subject: Re

Re: Help vote for Spark talks at the Hadoop Summit

2015-02-25 Thread Xiangrui Meng
Cast 3 votes for each of the talks. Looking forward to seeing them at the Hadoop Summit :) -Xiangrui On Tue, Feb 24, 2015 at 9:54 PM, Reynold Xin r...@databricks.com wrote: Hi all, The Hadoop Summit uses community choice voting to decide which talks to feature. It would be great if the community could

Re: Google Summer of Code - ideas

2015-02-24 Thread Xiangrui Meng
Would you be interested in working on MLlib's Python API during the summer? We want everything we implement in Scala to be usable from both Java and Python, but we are not there yet. It would be great if someone is willing to help. -Xiangrui On Sat, Feb 21, 2015 at 11:24 AM, Manoj Kumar

Re: Batch prediciton for ALS

2015-02-18 Thread Xiangrui Meng
a look at it again and try update with the new ALS... On Tue, Feb 17, 2015 at 3:22 PM, Xiangrui Meng men...@gmail.com wrote: It may be too late to merge it into 1.3. I'm going to make another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das debasish.da...@gmail.com

Re: Batch prediciton for ALS

2015-02-17 Thread Xiangrui Meng
It may be too late to merge it into 1.3. I'm going to make another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction

Re: mllib.recommendation Design

2015-02-17 Thread Xiangrui Meng
The current ALS implementation allows pluggable solvers for NormalEquation, where we put CholeskySolver and the NNLS solver. Please check the current implementation and let us know how your constraint solver would fit. For a general matrix factorization package, let's make a JIRA and move our
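
The solver classes named above are internal, but for orientation, here is a minimal sketch of the public mllib ALS entry point they sit behind; `ratings` is an assumed RDD[Rating], and the rank/iterations/lambda values and the (user, product) pair are illustrative.

~~~
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

def trainAndPredict(ratings: RDD[Rating]): Double = {
  // factorize the rating matrix, then score a single (user, product) pair
  val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)
  model.predict(user = 1, product = 42)
}
~~~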

Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-17 Thread Xiangrui Meng
There are three different regParams defined in the grid and there are three folds. For simplicity, we don't split the dataset into three parts and reuse them, but instead do the split for each fold. Then we need to cache 3*3 times. Note that the pipeline API is not yet optimized for performance. It would be
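
For reference, the 3-regParam-by-3-fold setup described above corresponds roughly to the following spark.ml code; the estimator and evaluator choices are just for illustration.

~~~
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))    // three regParams in the grid
  .build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(3)                                  // three folds => 3 * 3 model fits
~~~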

Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
I like the `/* .. */` style more because it is easier for IDEs to recognize it as a block comment. If you press enter in a comment block written in the `//` style, IDEs won't add `//` for you. -Xiangrui On Wed, Feb 4, 2015 at 2:15 PM, Reynold Xin r...@databricks.com wrote: We should update the

Re: multi-line comment style

2015-02-09 Thread Xiangrui Meng
(glmnet(features, label, family=gaussian, alpha = 0, lambda = 0)) */ ~~~ So people can copy paste the R commands directly. Xiangrui On Mon, Feb 9, 2015 at 12:18 PM, Xiangrui Meng men...@gmail.com wrote: I like the `/* .. */` style more. Because it is easier for IDEs to recognize it as a block
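
The two styles under discussion, side by side; the comment contents are illustrative.

~~~
/*
 * Block style: the editor treats the whole thing as one comment, so embedded commands
 * (like the R snippet quoted above) can be copy-pasted without stripping markers.
 */

// Line style: every line carries its own marker, which the editor may or may not
// add for you when you press enter inside the comment.
~~~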

Re: IDF for ml pipeline

2015-02-03 Thread Xiangrui Meng
Yes, we need a wrapper under spark.ml. Feel free to create a JIRA for it. -Xiangrui On Mon, Feb 2, 2015 at 8:56 PM, masaki rikitoku rikima3...@gmail.com wrote: Hi all, I am trying the ml pipeline for text classification now. Recently, I succeeded in executing the pipeline processing in ml
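
The wrapper being requested did land in later releases; a rough sketch of how IDF fits into a spark.ml text pipeline, with illustrative column names and an assumed DataFrame `trainingDF`.

~~~
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
val idf       = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val pipeline  = new Pipeline().setStages(Array(tokenizer, hashingTF, idf))
// val model = pipeline.fit(trainingDF)   // `trainingDF` is an assumed DataFrame with a "text" column
~~~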

Re: KNN for large data set

2015-01-21 Thread Xiangrui Meng
For large datasets, you need hashing in order to compute k-nearest neighbors locally. You can start with LSH + k-nearest in Google scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. msdeva...@gmail.com wrote: Hi all, Please help me

Re: Spectral clustering

2015-01-20 Thread Xiangrui Meng
Fan and Stephen (cc'ed) are working on this feature. They will update the JIRA page and report progress soon. -Xiangrui On Fri, Jan 16, 2015 at 12:04 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: Hi, thinking of picking up this Jira ticket:

Re: DBSCAN for MLlib

2015-01-14 Thread Xiangrui Meng
Please find my comments on the JIRA page. -Xiangrui On Tue, Jan 13, 2015 at 1:49 PM, Muhammad Ali A'råby angelland...@yahoo.com.invalid wrote: I have to say, I have created a Jira task for it: [SPARK-5226] Add DBSCAN Clustering Algorithm to MLlib - ASF JIRA

Re: Re-use scaling means and variances from StandardScalerModel

2015-01-09 Thread Xiangrui Meng
Feel free to create a JIRA for this issue. We might need to discuss what to put in the public constructors. In the meantime, you can use Java serialization to save/load the model: sc.parallelize(Seq(model), 1).saveAsObjectFile("/tmp/model") val model =
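
The workaround quoted above, completed as a sketch; the path and the existing `model`/`sc` values are assumptions, and since the reply is truncated, the load line is reconstructed rather than quoted.

~~~
import org.apache.spark.mllib.feature.StandardScalerModel

// save: wrap the model in a one-element RDD and rely on Java serialization
sc.parallelize(Seq(model), numSlices = 1).saveAsObjectFile("/tmp/scaler-model")

// load it back later
val restored = sc.objectFile[StandardScalerModel]("/tmp/scaler-model").first()
~~~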

Announcing Spark Packages

2014-12-22 Thread Xiangrui Meng
Dear Spark users and developers, I’m happy to announce Spark Packages (http://spark-packages.org), a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find, discuss, rate, and install

Re: CrossValidator API in new spark.ml package

2014-12-15 Thread Xiangrui Meng
Yes, regularization path could be viewed as training multiple models at once. -Xiangrui On Sat, Dec 13, 2014 at 6:53 AM, DB Tsai dbt...@dbtsai.com wrote: Okay, I got it. In Estimator, fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Seq[M] can be overwritten to implement regularization

Re: [VOTE] Release Apache Spark 1.2.0 (RC1)

2014-12-15 Thread Xiangrui Meng
, 2.6000e+01, 2.0770e+03, 4.e+00, 6.9350e+03]), 0)] I had overwritten the naive bayes example. Will chase the older versions down Cheers k/ On Wed, Dec 3, 2014 at 4:19 PM, Xiangrui Meng men...@gmail.com wrote: Krishna, could you send me some code

Re: [mllib] useFeatureScaling likes hardcode in LogisticRegressionWithLBFGS and is not comprehensive for users.

2014-11-26 Thread Xiangrui Meng
Hi Yanbo, We scale the model coefficients back after training, so scaling in prediction is not necessary. We had some discussion about this. I'd like to treat feature scaling as part of the feature transformation, and recommend that users apply feature scaling before training. It is a cleaner
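
A minimal sketch of the recommendation above, scaling features as a preprocessing step; `training` is an assumed RDD[LabeledPoint], and the withMean/withStd settings are illustrative.

~~~
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def scaleFeatures(training: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  // fit the scaler on the feature vectors, then transform them before training a model
  val scaler = new StandardScaler(withMean = true, withStd = true).fit(training.map(_.features))
  training.map(p => p.copy(features = scaler.transform(p.features)))
}
~~~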

Re: [VOTE] Release Apache Spark 1.1.1 (RC2)

2014-11-19 Thread Xiangrui Meng
+1. Checked version numbers and doc. Tested a few ML examples with Java 6 and verified some recently merged bug fixes. -Xiangrui On Wed, Nov 19, 2014 at 2:51 PM, Andrew Or and...@databricks.com wrote: I will start with a +1 2014-11-19 14:51 GMT-08:00 Andrew Or and...@databricks.com: Please

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
`sampleByKey` with the same fraction per stratum acts the same as `sample`. The operation you want is perhaps `sampleByKeyExact` here. However, when you use stratified sampling, there should not be many strata. My question is why we need to split on each user's ratings. If a user is missing in
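
For clarity, the two methods being contrasted, on a toy (userId, rating) pair RDD with an assumed SparkContext `sc`; the per-key fractions are illustrative.

~~~
val data = sc.parallelize(Seq((1, 4.0), (1, 3.5), (2, 5.0), (2, 2.0)))
val fractions = Map(1 -> 0.8, 2 -> 0.8)

val approx = data.sampleByKey(withReplacement = false, fractions)       // one pass, approximate stratum sizes
val exact  = data.sampleByKeyExact(withReplacement = false, fractions)  // extra passes, exact stratum sizes
~~~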

Re: Using sampleByKey

2014-11-18 Thread Xiangrui Meng
in a labeled dataset ~ 100 ? On Tue, Nov 18, 2014 at 10:31 AM, Xiangrui Meng men...@gmail.com wrote: `sampleByKey` with the same fraction per stratum acts the same as `sample`. The operation you want is perhaps `sampleByKeyExact` here. However, when you use stratified sampling, there should

Re: MLlib related query

2014-11-11 Thread Xiangrui Meng
Searched MLlib on Google Scholar and didn't find any :) MLlib implements well-recognized algorithms, each of which may correspond to a paper or several papers. Please find the references in the code if you are interested. -Xiangrui On Sat, Nov 8, 2014 at 1:37 AM, Manu Kaul manohar.k...@gmail.com

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
, Xiangrui Meng men...@gmail.com wrote: Was user presented in training? We can put a check there and return NaN if the user is not included in the model. -Xiangrui On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I am testing MatrixFactorizationModel.predict

Re: MatrixFactorizationModel predict(Int, Int) API

2014-11-06 Thread Xiangrui Meng
a issue... Any idea how to optimize this so that we can calculate MAP statistics on large samples of data ? On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng men...@gmail.com wrote: ALS model contains RDDs. So you cannot put `model.recommendProducts` inside a RDD closure `userProductsRDD.map
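
A sketch of the constraint described above: because MatrixFactorizationModel holds RDDs, recommendProducts has to be called on the driver rather than inside an RDD closure; the user ids below are illustrative and `model` is assumed to exist.

~~~
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

def topTen(model: MatrixFactorizationModel): Array[(Int, Array[Rating])] = {
  val userIds = Array(1, 2, 3)                        // a small set of users, kept on the driver
  userIds.map(u => (u, model.recommendProducts(u, 10)))
}
~~~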

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Xiangrui Meng
+1 (binding) On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 (binding) On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: +1 on this proposal. On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Will these
