Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-23 Thread Yanbo Liang
+1 On Sat, Sep 23, 2017 at 7:08 PM, Noman Khan wrote: > +1 > > Regards > Noman > -- > *From:* Denny Lee > *Sent:* Friday, September 22, 2017 2:59:33 AM > *To:* Apache Spark Dev; Sean Owen; Tim Hunter > *Cc:* Danil Kirsanov; Joseph Bradley; Reynold Xin; Sudarshan Suda

Re: Welcoming Tejas Patil as a Spark committer

2017-10-06 Thread Yanbo Liang
Congratulations Tejas. On Fri, Oct 6, 2017 at 1:31 PM, DB Tsai wrote: > Congratulations! > > On Wed, Oct 4, 2017 at 6:55 PM, Liwei Lin wrote: > > Congratulations! > > > > Cheers, > > Liwei > > > > On Wed, Oct 4, 2017 at 2:27 PM, Yuval Itzchakov > wrote: > >> > >> Congratulations and Good luck!

[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19, 2018. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark for SQL/streaming processing, machine learning and data science. Information on submitting an abstract is at https

Re: Hinge Gradient

2017-12-16 Thread Yanbo Liang
Hello Deb, Optimizing a non-smooth function with LBFGS should be considered carefully. Is there any literature that proves changing max to soft-max behaves well? I’m more than happy to see some benchmarks if you have any. + Yuhao, who made a similar effort in this PR: https://github.com/apac

[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
Hi All, DataWorks Summit, San Jose, 2018 is a good place to share your experience of advanced analytics, data science, machine learning and deep learning. We have an Artificial Intelligence and Data Science session covering technologies such as: Apache Spark, Scikit-learn, TensorFlow, Keras, Apache

Re: [MLLib] Logistic Regression and standardization

2018-04-13 Thread Yanbo Liang
Hi Filipp, MLlib’s LR implementation handles standardization the same way as R’s glmnet. Actually you don’t need to care about the implementation detail, as the coefficients are always returned on the original scale, so it should return the same result as other popular ML libraries. Could yo
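
To illustrate the point about coefficients being returned on the original scale, here is a minimal plain-Python sketch. It is not the MLlib or glmnet code: it assumes each feature is only divided by its standard deviation (no centering), and all numbers and helper names (to_original_scale, margin) are made up for the illustration.

```python
def to_original_scale(scaled_coefs, stds):
    """Map coefficients learned on x_j / std_j back to the original feature scale."""
    return [w / s for w, s in zip(scaled_coefs, stds)]


def margin(coefs, intercept, features):
    """Linear margin w . x + b of a logistic regression model."""
    return intercept + sum(w * x for w, x in zip(coefs, features))


# Hypothetical per-feature standard deviations and a model trained on scaled data.
stds = [2.0, 0.5, 10.0]
scaled_coefs = [1.2, -0.4, 3.0]
intercept = 0.7

x = [4.0, 1.5, 20.0]                         # one raw (unscaled) example
x_scaled = [v / s for v, s in zip(x, stds)]  # what the optimizer saw internally

original_coefs = to_original_scale(scaled_coefs, stds)

# Both parameterizations produce the same margin, hence the same prediction.
margin_scaled = margin(scaled_coefs, intercept, x_scaled)
margin_original = margin(original_coefs, intercept, x)
```

Because the two margins agree, a caller who only sees original_coefs never needs to know that scaling happened during training.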

Re: [VOTE] SPARK 2.3.2 (RC1)

2018-07-11 Thread Yanbo Liang
+1 On Tue, Jul 10, 2018 at 10:15 PM Saisai Shao wrote: > https://issues.apache.org/jira/browse/SPARK-24530 is just merged, I will > cancel this vote and prepare a new RC2 cut with doc fixed. > > Thanks > Saisai > > On Wed, Jul 11, 2018 at 12:25 PM, Wenchen Fan wrote: > >> +1 >> >> On Wed, Jul 11, 2018 at 1:31

Re: [VOTE] [SPARK-25994] SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-06 Thread Yanbo Liang
+1 for the proposal On Thu, Jan 31, 2019 at 12:46 PM Mingjie Tang wrote: > +1, this is a very very important feature. > > Mingjie > > On Thu, Jan 31, 2019 at 12:42 AM Xiao Li wrote: > >> Change my vote from +1 to ++1 >> >> On Wed, Jan 30, 2019 at 6:20 AM, Xiangrui Meng wrote: >> >>> Correction: +0 vote does

Re: MinMaxScaler With features include category variables

2016-07-01 Thread Yanbo Liang
You can combine the columns that need to be normalized into a vector with VectorAssembler and normalize it. Assemble the columns that should not be normalized into a second vector. Finally, assemble the two vectors into one vector as the feature column and feed it into model training. Thank
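
The three steps in the reply above can be sketched in plain Python. This is not the Spark API: min_max_scale and assemble are hypothetical stand-ins for MinMaxScaler and VectorAssembler, the data is invented, and mapping a constant column to 0.0 is an arbitrary choice made for this sketch.

```python
def min_max_scale(columns):
    """Rescale each column to [0, 1]; a constant column is mapped to 0.0 here."""
    scaled = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled.append([(v - lo) / span if span else 0.0 for v in col])
    return scaled


def assemble(*column_groups):
    """Join several groups of columns into one feature vector per row."""
    all_columns = [col for group in column_groups for col in group]
    return [list(row) for row in zip(*all_columns)]


# Step 1: the columns that should be normalized (two numeric columns, three rows).
numeric = [[10.0, 20.0, 30.0], [1.0, 3.0, 2.0]]
# Step 2: the columns that should be left alone (an already-encoded category).
categorical = [[0.0, 1.0, 1.0]]
# Step 3: normalize only the first group, then assemble both groups together.
features = assemble(min_max_scale(numeric), categorical)
```

Each row of features holds the two rescaled numeric values followed by the untouched category value.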

Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset, so they mean the same thing. Actually the ML pipeline API will accept Dataset[_] instead of DataFrame in Spark 2.0. More accurately, we can say that MLlib will focus on the Dataset-based API for further development. Thanks Yanbo 2016-07-10 20:35 GMT-0

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao, HashingTF directly applies a hash function (MurmurHash3) to the features to determine their column index. It does not take the term frequency or the length of the document into account. It works similarly to sklearn's FeatureHasher. The result is increased speed and reduced memory
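
A minimal sketch of the hashing trick described above, in plain Python rather than Spark. Note the assumptions: zlib.crc32 stands in for MurmurHash3 only because it ships with the standard library, and hashing_tf and num_features are names chosen for this illustration. Each slot holds a raw count; nothing is divided by the document length.

```python
import zlib


def hashing_tf(terms, num_features=16):
    """Map each term to a column index by hashing, and count occurrences per index."""
    vec = [0] * num_features
    for term in terms:
        # crc32 is a stand-in hash function, not the one Spark uses.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vec[index] += 1
    return vec


doc = ["spark", "ml", "spark", "hashing"]
vec = hashing_tf(doc)
```

No vocabulary has to be built or stored, which is where the speed and memory savings come from; the trade-off is that distinct terms can collide in the same slot.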

Re: Welcoming Felix Cheung as a committer

2016-08-08 Thread Yanbo Liang
Congrats Felix! 2016-08-08 18:21 GMT-07:00 Kai Jiang : > Congrats Felix! > > On Mon, Aug 8, 2016, 18:14 Jeff Zhang wrote: > >> Congrats Felix! >> >> On Tue, Aug 9, 2016 at 8:49 AM, Hyukjin Kwon wrote: >> >>> Congratulations! >>> >>> 2016-08-09 7:47 GMT+09:00 Xiao Li : >>> Congrats Felix! >

Re: Get data from CSV files to feed SparkML library methods

2016-08-10 Thread Yanbo Liang
You can load the dataset from a CSV file and use VectorAssembler to assemble the necessary columns into a single column of vector type. The output column of VectorAssembler will be the features column, which should be fed into the ML estimator for model training. You can refer to the VectorAssembler documentation: http://s

Re: KMeans calls takeSample() twice?

2016-08-29 Thread Yanbo Liang
I ran KMeans with probes and found that takeSample() was actually called only once. It looks like this issue was caused by a display error in the Spark UI. Thanks Yanbo On Mon, Aug 29, 2016 at 2:34 PM, gsamaras wrote: > After reading the internal code of Spark about it, I wasn't able to > understan

Re: KMeans calls takeSample() twice?

2016-08-31 Thread Yanbo Liang
I added a println at the start of the takeSample function, and found it was printed only once for each run of KMeans. Thanks Yanbo On Tue, Aug 30, 2016 at 10:31 AM, Georgios Samaras < georgesamaras...@gmail.com> wrote: > Good catch Shivaram. However, the very next line states: > > // this shouldn't ha

Discuss SparkR executors/workers support virtualenv

2016-09-06 Thread Yanbo Liang
Hi All, Many users need to use third-party R packages in executors/workers, but SparkR cannot satisfy this requirement elegantly. For example, you have to ask the IT/administrators of the cluster to deploy these R packages on each executor/worker node, which is very inflex

Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-26 Thread Yanbo Liang
+1 On Mon, Sep 26, 2016 at 4:53 PM, akchin wrote: > +1 (non-bind) > -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Psparkr > CentOS 7.2 / openjdk version "1.8.0_101" > > > > > - > IBM Spark Technology Center > -- > View this message in context: http://apache-spark- > developers-list.1001551

Re: welcoming Xiao Li as a committer

2016-10-04 Thread Yanbo Liang
Congrats and welcome! On Tue, Oct 4, 2016 at 9:01 AM, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > Congratulations Xiao! Very well deserved! > > On Mon, Oct 3, 2016 at 10:46 PM, Reynold Xin wrote: > >> Hi all, >> >> Xiao Li, aka gatorsmile, has recently been elected as

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
It's a good question and I had a similar requirement in my work. I'm currently copying the implementation from mllib to ml, and will then expose the maximum log likelihood. I will send this PR soon. Thanks. Yanbo On Fri, Oct 7, 2016 at 1:37 AM, 王磊(安全部) wrote: > > Hi, > > Do you guys sometimes need t

Re: Could we expose log likelihood of EM algorithm in MLLIB?

2016-10-07 Thread Yanbo Liang
; Date: Saturday, October 8, 2016, 12:21 AM > To: Yanbo Liang > > Cc: "dev@spark.apache.org" , "u...@spark.apache.org" > > Subject: Re: Could we expose log likelihood of EM algorithm in MLLIB? > > Thanks for replying. > When could you send out the PR? > > From: Yanb

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-27 Thread Yanbo Liang
+1 On Thu, Oct 27, 2016 at 3:15 AM, Reynold Xin wrote: > I created a JIRA ticket to track this: https://issues.apache. > org/jira/browse/SPARK-18138 > > > > On Thu, Oct 27, 2016 at 10:19 AM, Steve Loughran > wrote: > >> >> On 27 Oct 2016, at 10:03, Sean Owen wrote: >> >> Seems OK by me. >> How

Re: welcoming Burak and Holden as committers

2017-01-24 Thread Yanbo Liang
Congratulations, Burak and Holden. On Tue, Jan 24, 2017 at 7:32 PM, Chester Chen wrote: > Congratulation to both. > > > > Holden, we need catch up. > > > > > > *Chester Chen * > > ■ Senior Manager – Data Science & Engineering > > 3000 Clearview Way > > San Mateo, CA 94402 > > > > > > *From: *Fe

Re: welcoming Takuya Ueshin as a new Apache Spark committer

2017-02-13 Thread Yanbo Liang
Congratulations! On Mon, Feb 13, 2017 at 3:29 PM, Kazuaki Ishizaki wrote: > Congrats! > > Kazuaki Ishizaki > > > > From:Reynold Xin > To:"dev@spark.apache.org" > Date:2017/02/14 04:18 > Subject:welcoming Takuya Ueshin as a new Apache Spark committer > --

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
Do you want a sparse model in which most of the coefficients are zero? If yes, using L1 regularization leads to sparsity. But the size of the LogisticRegressionModel coefficients vector is still equal to the number of features, so you have to extract the non-zero elements manually. Actually, it would be a spar
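
Extracting the non-zero elements manually could look like the following plain-Python sketch. The coefficient values are invented, nonzero_coefficients is a hypothetical helper, and a plain list stands in for the model's coefficients vector.

```python
def nonzero_coefficients(coefficients, tol=0.0):
    """Return {feature_index: weight} for every weight whose magnitude exceeds tol."""
    return {i: w for i, w in enumerate(coefficients) if abs(w) > tol}


# A made-up coefficients vector as L1 regularization might leave it:
# same length as the feature vector, mostly zeros.
coefs = [0.0, 1.3, 0.0, 0.0, -0.7, 0.0]
kept = nonzero_coefficients(coefs)
```

The keys of kept are the indices of the features the model actually retained; a small positive tol can be passed if near-zero weights should be dropped as well.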

[CFP] DataWorks Summit/Hadoop Summit Sydney - Call for abstracts

2017-05-03 Thread Yanbo Liang
The Australia/Pacific version of DataWorks Summit is in Sydney this year, September 20-21. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark. Information on submitting an abstract is at https://dataworkssummit.com/sydney-2017/abstracts/submit-abstract

Re: [VOTE] Apache Spark 2.2.0 (RC6)

2017-07-03 Thread Yanbo Liang
+1 On Mon, Jul 3, 2017 at 5:35 AM, Herman van Hövell tot Westerflier < hvanhov...@databricks.com> wrote: > +1 > > On Sun, Jul 2, 2017 at 11:32 PM, Ricardo Almeida < > ricardo.alme...@actnowib.com> wrote: > >> +1 (non-binding) >> >> Built and tested with -Phadoop-2.7 -Dhadoop.version=2.7.3 -Pyarn

Re: Welcoming Hyukjin Kwon and Sameer Agarwal as committers

2017-08-07 Thread Yanbo Liang
Great. Congratulations, Hyukjin and Sameer! On Tue, Aug 8, 2017 at 7:53 AM, Holden Karau wrote: > Congrats! > > On Mon, Aug 7, 2017 at 3:54 PM Bryan Cutler wrote: > >> Great work Hyukjin and Sameer! >> >> On Mon, Aug 7, 2017 at 10:22 AM, Mridul Muralidharan >> wrote: >> >>> Congratulations Hyu

Re: Welcoming Saisai (Jerry) Shao as a committer

2017-08-28 Thread Yanbo Liang
Congratulations, Jerry. On Tue, Aug 29, 2017 at 9:42 AM, John Deng wrote: > > Congratulations, Jerry ! > > On 8/29/2017 09:28,Matei Zaharia > wrote: > > Hi everyone, > > The PMC recently voted to add Saisai (Jerry) Shao as a > committer. Saisai has been contributing to many areas of > the proje

[mllib] useFeatureScaling likes hardcode in LogisticRegressionWithLBFGS and is not comprehensive for users.

2014-11-26 Thread Yanbo Liang
Hi All, LogisticRegressionWithLBFGS sets useFeatureScaling to true by default, which can improve convergence during optimization. However, other model training methods such as LogisticRegressionWithSGD do not set useFeatureScaling to true by default, and the corresponding setter is private in

Re: Exception in saving MatrixFactorizationModel

2015-09-05 Thread Yanbo Liang
Please check the "outPath" and verify whether the save succeeded. Which version did you use? You may have hit this issue, which is resolved in version 1.5. 2015-09-05 21:47 GMT+08:00 Madawa Soysa : > Hi All, > > I'm getting an error when trying to save

Re: query on SVD++

2015-12-02 Thread Yanbo Liang
Do you mean the SVDPlusPlus in GraphX? If you want to use SVD++ to train a CF model, I recommend using ALS instead, which is more efficient and has a Python interface. 2015-12-02 11:21 GMT+08:00 张志强(旺轩) : > Hi All, > > > > I came across the SVD++ algorithm implementation in Spark code base, but I > was wo

Re: java.lang.NoSuchMethodError while saving a random forest model Spark version 1.5

2015-12-17 Thread Yanbo Liang
Spark 1.5 officially uses Parquet 1.7.0, but Spark 1.3 uses Parquet 1.6.0. It's better to check which version of Parquet is used in your environment. 2015-12-17 10:26 GMT+08:00 Joseph Bradley : > This method is tested in the Spark 1.5 unit tests, so I'd guess it's a > problem with the Parquet depen

Re: Reply: How can I get the column data based on specific column name and then stored these data in array or list ?

2015-12-25 Thread Yanbo Liang
Actually you can call df.collect_list("a"). 2015-12-25 16:00 GMT+08:00 Jeff Zhang : > You can use udf to convert one column for array type. Here's one sample > > val conf = new SparkConf().setMaster("local[4]").setAppName("test") > val sc = new SparkContext(conf) > val sqlContext = new SQLContext

Re: SparkML algos limitations question.

2015-12-27 Thread Yanbo Liang
Hi Eugene, AFAIK, the current implementation of MultilayerPerceptronClassifier has some scalability problems if the model is very large (such as >10M), although I think it can already cover many use cases within this limitation. Yanbo 2015-12-16 6:00 GMT+08:00 Joseph Bradley : > Hi Eugene, > > The maxDept

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
ster should > have more memory because it runs LBFGS) In my experiments, I’ve trained the > models 12M and 32M parameters without issues. > > > > Best regards, Alexander > > > > *From:* Yanbo Liang [mailto:yblia...@gmail.com] > *Sent:* Sunday, December 27, 2015 2:2

Re: Organizing Spark ML example packages

2016-04-18 Thread Yanbo Liang
This sounds good to me, and it will make the ML examples neater. 2016-04-14 5:28 GMT-07:00 Nick Pentreath : > Hey Spark devs > > I noticed that we now have a large number of examples for ML & MLlib in > the examples project - 57 for ML and 67 for MLLIB to be precise. This is > bound to get large

Re: [spark.ml] Why is private class ColumnPruner?

2016-04-18 Thread Yanbo Liang
Hi Jacek, This is because ColumnPruner is currently only used for RFormula, so we did not expose it as a feature transformer. Please feel free to create a JIRA and work on it. Thanks Yanbo 2016-03-25 8:50 GMT-07:00 Jacek Laskowski : > Hi, > > Came across `private class ColumnPruner` with "TODO(ekl) m

Re: Cross Validator to work with K-Fold value of 1?

2016-05-03 Thread Yanbo Liang
Here is the JIRA and PR for supporting PolynomialExpansion with degree 1, and it has been merged. https://issues.apache.org/jira/browse/SPARK-13338 https://github.com/apache/spark/pull/11216 2016-05-02 9:20 GMT-07:00 Nick Pentreath : > There is a JIRA and PR around for supporting polynomial expa

Re: Creation of SparkML Estimators in Java broken?

2016-05-27 Thread Yanbo Liang
This is because we do not have good coverage of Java-friendly wrappers. I found we only implement JavaParams, which is the wrapper of Scala Params. We still need Java-friendly wrappers for other traits that extend Scala Params. For example, in Scala we have: trait HasLabelCol extends

Re: Creation of SparkML Estimators in Java broken?

2016-05-27 Thread Yanbo Liang
Created JIRA https://issues.apache.org/jira/browse/SPARK-15605 . 2016-05-27 1:02 GMT-07:00 Yanbo Liang : > This is because we do not have excellent coverage for Java-friendly > wrappers. > I found we only implement JavaParams who is the wrappers of Scala Params. > We still need J

[mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
Hi All, I found that LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD in master and 1.1 release. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L199 In the above code snippet, u

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
I also found that https://github.com/apache/spark/commit/8f6e2e9df41e7de22b1d1cbd524e20881f861dd0 had resolved this issue, but it seems that the correct code snippet does not appear in master or the 1.1 release. 2014-09-13 17:12 GMT+08:00 Yanbo Liang : > Hi All, > > I found that LogisticRegressionWithLBFGS

Re: Spark SQL use of alias in where clause

2014-09-24 Thread Yanbo Liang
Maybe it's the way SQL works. The select part is executed after the where filter is applied, so you cannot use an alias declared in the select part in the where clause. Hive and Oracle behave the same as Spark SQL. 2014-09-25 8:58 GMT+08:00 Du Li : > Hi, > > The following query does not work in Shark n
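
Two portable ways around the alias restriction are to repeat the expression in the where clause, or to declare the alias in a subquery and filter outside it. The sketch below uses Python's built-in sqlite3 module only as a convenient stand-in engine (an assumption of this example, not something from the thread); SQLite is more permissive about aliases than the engines discussed here, so it demonstrates the workarounds rather than the failure. The table t and column a are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

# Workaround 1: repeat the aliased expression inside WHERE.
repeated = conn.execute(
    "SELECT a + 1 AS b FROM t WHERE a + 1 > 2 ORDER BY b"
).fetchall()

# Workaround 2: declare the alias in a subquery, then filter on it outside.
subquery = conn.execute(
    "SELECT b FROM (SELECT a + 1 AS b FROM t) AS s WHERE b > 2 ORDER BY b"
).fetchall()

conn.close()
```

Both queries return the same rows; the subquery form avoids duplicating a long expression.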

Re: A Spark Compilation Question

2014-09-26 Thread Yanbo Liang
Hi Hansu, I have encountered the same problem. Maven compiled the avro file and generated the corresponding Java file in a new directory, which is not a source directory of the project. I modified the pom.xml file and it works now. The lines marked in red are added; you can add them to your spark-*.*.*/

[MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-28 Thread Yanbo Liang
Hi We have used LogisticRegression with two different optimization methods, SGD and LBFGS, in MLlib. With the same dataset and the same training and test split, we get different weight vectors. For example, we use spark-1.1.0/data/mllib/sample_binary_classification_data.txt as our training and test

Re: [MLlib] LogisticRegressionWithSGD and LogisticRegressionWithLBFGS converge with different weights.

2014-09-29 Thread Yanbo Liang
lihood by multiply a constant to the weights. > > Sincerely, > > DB Tsai > --- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Sun, Sep 28, 2014 at 11:48 AM, Yanbo Liang >