[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
Hi All, DataWorks Summit, San Jose, 2018 is a good place to share your experience with advanced analytics, data science, machine learning, and deep learning. We have an Artificial Intelligence and Data Science session covering technologies such as: Apache Spark, Scikit-learn, TensorFlow, Keras,

[CFP] DataWorks Summit Europe 2018 - Call for abstracts

2017-12-09 Thread Yanbo Liang
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19, 2018. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark for SQL/streaming processing, machine learning and data science. Information on submitting an abstract is at

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-05 Thread Yanbo Liang
aïve Bayes, Random Forest, etc.) in parallel. Am I correct? > > If not, could you please point me to some resources where they have run > multiple algorithms in parallel. > > > > Thank You very much. It is great help, I will try spark-sklearn. > > Prem > > > >

Re: sparkR 3rd library

2017-09-05 Thread Yanbo Liang
I guess you didn't install the R package `genalg` on all worker nodes. It is not a built-in package for base R, so you need to install it on all worker nodes manually or run `install.packages` inside your SparkR UDF. Regarding how to download third-party packages and install them inside of

Re: Apache Spark: Parallelization of Multiple Machine Learning Algorithm

2017-09-05 Thread Yanbo Liang
Hi Prem, How large is your dataset? Can it be fitted on a single node? If not, Spark MLlib provides CrossValidator, which can run multiple machine learning algorithms in parallel on a distributed dataset and do parameter search. FYI: https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
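
A minimal sketch of that cross-validation approach (not from the original thread); `training` is an assumed DataFrame with "label" and "features" columns:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val lr = new LogisticRegression()
    val grid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01)) // candidate parameters to search over
      .build()
    val cv = new CrossValidator()
      .setEstimator(lr)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)              // each fold is fit and evaluated on the distributed dataset
    val cvModel = cv.fit(training) // training: assumed input DataFrame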

Re: Training A ML Model on a Huge Dataframe

2017-08-24 Thread Yanbo Liang
Hi Sea, Could you let us know which ML algorithm you use? What are the number of instances and the dimension of your dataset? AFAIK, Spark MLlib can train a model with several million features if you configure it correctly. Thanks Yanbo On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet

Re: [BlockMatrix] multiply is an action or a transformation ?

2017-08-20 Thread Yanbo Liang
BlockMatrix.multiply returns another BlockMatrix. Inside this function there are many RDD operations, but most of them are transformations. If you don't trigger an action to obtain the blocks (an RDD of ((Int, Int), Matrix) pairs) of the result BlockMatrix, the job will not run. Thanks
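
A small sketch illustrating that laziness, assuming `a` and `b` are existing BlockMatrix values:

    import org.apache.spark.mllib.linalg.distributed.BlockMatrix

    val c: BlockMatrix = a.multiply(b) // only builds the RDD lineage; no job runs yet
    c.blocks.count()                   // an action on the result's blocks RDD triggers the job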

Re: Huber regression in PySpark?

2017-08-20 Thread Yanbo Liang
Hi Jeff, Actually I have had an implementation of robust regression with Huber loss for a long time (https://github.com/apache/spark/pull/14326). It is a fairly straightforward port of scikit-learn's HuberRegressor. The PR makes Huber regression a separate Estimator, and we found it can be

Re: Collecting matrix's entries raises an error only when run inside a test

2017-07-06 Thread Yanbo Liang
Hi Simone, Would you mind sharing the minimized code to reproduce this issue? Yanbo On Wed, Jul 5, 2017 at 10:52 PM, Simone Robutti wrote: > Hello, I have this problem and Google is not helping. Instead, it looks > like an unreported bug and there are no hints to

Re: PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-28 Thread Yanbo Liang
It looks like your Spark job was running as user root, but your file system operation was running as user jomernik. Since Spark calls the corresponding file system (such as HDFS or S3) to commit the job (renaming temporary files to persistent ones), it needs correct authorization for both Spark and

Re: Help in Parsing 'Categorical' type of data

2017-06-23 Thread Yanbo Liang
Please consider using other classification models such as logistic regression or GBT. Naive Bayes usually treats features as counts, which makes it unsuitable for features generated by a one-hot encoder. Thanks Yanbo On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti wrote:

Re: RowMatrix: tallSkinnyQR

2017-06-23 Thread Yanbo Liang
Since this function is used to compute the QR decomposition of a RowMatrix with a tall and skinny shape, the output R always has small rank. On Fri, Jun 9, 2017 at 10:33 PM, Arun wrote: > hi > > *def tallSkinnyQR(computeQ: Boolean = false):

Re: spark higher order functions

2017-06-23 Thread Yanbo Liang
See reply here: http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson wrote: > Hi, > > I have seen that databricks have higher order functions

Re: gfortran runtime library for Spark

2017-06-23 Thread Yanbo Liang
The gfortran runtime library is still required by Spark 2.1 for better performance. If it's not present on your nodes, you will see a warning message and a pure JVM implementation will be used instead, but you will not get the best performance. Thanks Yanbo On Wed, Jun 21, 2017 at 5:30 PM, Saroj C

Re: BinaryClassificationMetrics only supports AreaUnderPR and AreaUnderROC?

2017-05-12 Thread Yanbo Liang
Yeah, for binary data, you can also use MulticlassClassificationEvaluator to evaluate other metrics which BinaryClassificationEvaluator doesn't cover, such as accuracy, f1, weightedPrecision and weightedRecall. Thanks Yanbo On Thu, May 11, 2017 at 10:31 PM, Lan Jiang
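
A minimal sketch of that approach (not from the thread), where `predictions` is assumed to be the output of model.transform on binary data:

    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("f1") // also "accuracy", "weightedPrecision", "weightedRecall"
    val f1 = evaluator.evaluate(predictions)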

[CFP] DataWorks Summit/Hadoop Summit Sydney - Call for abstracts

2017-05-03 Thread Yanbo Liang
The Australia/Pacific version of DataWorks Summit is in Sydney this year, September 20-21. This is a great place to talk about work you are doing in Apache Spark or how you are using Spark. Information on submitting an abstract is at

Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-01 Thread Yanbo Liang
Hi Tim, The Spark ML API doesn't support setting an initial model for GMM currently. I hope we can get this feature into Spark 2.3. Thanks Yanbo On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith wrote: > Hi, > > I am trying to figure out the API to initialize a gaussian mixture model > using

Re: How to create SparkSession using SparkConf?

2017-04-28 Thread Yanbo Liang
sion? The lambda that we > normally pass when we call StreamingContext.getOrCreate. > > > > > > > > > On Thu, Apr 27, 2017 at 8:47 AM, kant kodali <kanth...@gmail.com> wrote: > >> Ahhh Thanks much! I miss my sparkConf.setJars function instead of this >> ha

Re: How to create SparkSession using SparkConf?

2017-04-27 Thread Yanbo Liang
Could you try the following way? val spark = SparkSession.builder.appName("my-application").config("spark.jars", "a.jar, b.jar").getOrCreate() Thanks Yanbo On Thu, Apr 27, 2017 at 9:21 AM, kant kodali wrote: > I am using Spark 2.1 BTW. > > On Wed, Apr 26, 2017 at 3:22

Re: Synonym handling replacement issue with UDF in Apache Spark

2017-04-27 Thread Yanbo Liang
What about JOINing your table with a mapping table? On Thu, Apr 27, 2017 at 9:58 PM, Nishanth wrote: > I am facing a major issue on replacement of Synonyms in my DataSet. > > I am trying to replace the synonym of the Brand names to its equivalent > names. > > I have

Re: how to create List in pyspark

2017-04-27 Thread Yanbo Liang
You can try a UDF, like the following code snippet: from pyspark.sql.functions import udf from pyspark.sql.types import ArrayType, StringType df = spark.read.text("./README.md") split_func = udf(lambda text: text.split(" "), ArrayType(StringType())) df.withColumn("split_value",

Re: how to retain part of the features in LogisticRegressionModel (spark2.0)

2017-03-20 Thread Yanbo Liang
Do you want to get a sparse model where most of the coefficients are zero? If yes, using L1 regularization leads to sparsity. But the size of the LogisticRegressionModel coefficients vector is still equal to the number of features; you can extract the non-zero elements manually. Actually, it would be a
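
A brief sketch of that suggestion, assuming a DataFrame `training` with "label" and "features" columns:

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setElasticNetParam(1.0) // 1.0 = pure L1 penalty, which drives coefficients to zero
      .setRegParam(0.1)
    val model = lr.fit(training)
    // the coefficients vector still has one entry per feature; pick out the non-zeros
    val nonZero = model.coefficients.toArray.zipWithIndex.filter { case (w, _) => w != 0.0 }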

Re: How does preprocessing fit into Spark MLlib pipeline

2017-03-17 Thread Yanbo Liang
Hi Adrian, Did you try SQLTransformer? Your preprocessing steps are SQL operations and can be handled by SQLTransformer within the MLlib pipeline scope. Thanks Yanbo On Thu, Mar 9, 2017 at 11:02 AM, aATv wrote: > I want to start using PySpark Mllib pipelines, but I don't understand
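
A minimal SQLTransformer sketch (the statement and column names are hypothetical); `__THIS__` stands for the input DataFrame:

    import org.apache.spark.ml.feature.SQLTransformer

    val sqlTrans = new SQLTransformer().setStatement(
      "SELECT *, log(v1) AS v1_log FROM __THIS__") // any SQL preprocessing step
    val preprocessed = sqlTrans.transform(df)      // df: assumed input DataFrame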

Re: ML PIC

2016-12-21 Thread Yanbo Liang
You can track https://issues.apache.org/jira/browse/SPARK-15784 for the progress. On Wed, Dec 21, 2016 at 7:08 AM, Nick Pentreath wrote: > It is part of the general feature parity roadmap. I can't recall offhand > any blocker reasons it's just resources > On Wed, 21

Re: Usage of mllib api in ml

2016-11-20 Thread Yanbo Liang
You can refer to this example ( http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation), which uses BinaryClassificationEvaluator, and it should be very straightforward to switch to MulticlassClassificationEvaluator. Thanks Yanbo On Sat, Nov 19, 2016 at 9:03

Re: Spark ML DataFrame API - need cosine similarity, how to convert to RDD Vectors?

2016-11-19 Thread Yanbo Liang
Hi Russell, Do you want to use RowMatrix.columnSimilarities to calculate cosine similarities? If so, you should use the following steps: val dataset: DataFrame // Convert the type of the features column from ml.linalg.Vector to mllib.linalg.Vector val oldDataset: DataFrame =
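
A hedged sketch of the remaining steps (the thread's snippet is truncated), assuming `dataset` has an ml.linalg.Vector column named "features" and Spark 2.x:

    import org.apache.spark.mllib.linalg.{Vector => OldVector}
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.util.MLUtils

    val oldDataset = MLUtils.convertVectorColumnsFromML(dataset, "features")
    val mat = new RowMatrix(
      oldDataset.select("features").rdd.map(_.getAs[OldVector](0)))
    val sims = mat.columnSimilarities() // upper-triangular CoordinateMatrix of cosine similarities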

Re: VectorUDT and ml.Vector

2016-11-19 Thread Yanbo Liang
The reason behind this error can be inferred from the error log: MLUtils.convertMatrixColumnsFromML was used to convert ml.linalg.Matrix to mllib.linalg.Matrix, but it looks like the column type is ml.linalg.Vector in your case. Could you check the type of the column "features" in your dataframe

Re: why is method predict protected in PredictionModel

2016-11-19 Thread Yanbo Liang
This function is only used internally currently; we will expose it as public to support making predictions on a single instance. See the discussion at https://issues.apache.org/jira/browse/SPARK-10413. Thanks Yanbo On Thu, Nov 17, 2016 at 1:24 AM, wobu wrote: > Hi, > > we were using

Re: Spark R guidelines for non-spark functions and coxph (Cox Regression for Time-Dependent Covariates)

2016-11-16 Thread Yanbo Liang
Hi Pietro, Actually we have implemented the counterpart of R's survreg() in Spark: the accelerated failure time model. You can refer to AFTSurvivalRegression if you use Scala/Java/Python. For SparkR users, you can try spark.survreg(). The algorithm is completely distributed and returns the same solution with

Re: HashingTF for TF.IDF computation

2016-10-23 Thread Yanbo Liang
HashingTF was not designed to handle your case; you can try CountVectorizer, which will keep the original terms as a vocabulary for retrieval. CountVectorizer will compute a global term-to-index map, which can be expensive for a large corpus and carries a risk of OOM. IDF can accept feature vectors
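
A small sketch of the CountVectorizer route (column names assumed):

    import org.apache.spark.ml.feature.CountVectorizer

    val cv = new CountVectorizer()
      .setInputCol("words")        // a column of token sequences
      .setOutputCol("rawFeatures")
      .setVocabSize(10000)         // cap the vocabulary to bound memory usage
    val cvModel = cv.fit(df)       // df: assumed input DataFrame
    val vocab = cvModel.vocabulary // index -> term mapping, for retrieving original terms
    val tf = cvModel.transform(df) // term-frequency vectors, which IDF can then consume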

Re: Did anybody come across this random-forest issue with spark 2.0.1.

2016-10-17 Thread Yanbo Liang
Please increase the value of "maxMemoryInMB" of your RandomForestClassifier or RandomForestRegressor. It's a warning that will not affect the result but may make your training slower. Thanks Yanbo On Mon, Oct 17, 2016 at 8:21 PM, 张建鑫(市场部) wrote: > Hi Xi Shen >

Re: Logistic Regression Standardization in ML

2016-10-10 Thread Yanbo Liang
AFAIK, we can guarantee that, with or without standardization, the model always converges to the same solution if there is no regularization. You can refer to the test cases at:

Re: SVD output within Spark

2016-08-31 Thread Yanbo Liang
The signs of the eigenvectors are essentially arbitrary, so both the Spark and Matlab results are correct. Thanks On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers wrote: > > just looking at a comparision between Matlab and Spark for svd with an > input matrix N > > > this is

Re: Spark MLlib question: load model failed with exception:org.json4s.package$MappingException: Did not find value which can be converted into java.lang.String

2016-08-18 Thread Yanbo Liang
It looks like you mixed ALS from the spark.ml and spark.mllib packages. You can train the model with either one, but you should use the corresponding save/load functions. You cannot train/save the model with spark.mllib ALS and then use spark.ml ALS to load the model; it will throw exceptions.

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-18 Thread Yanbo Liang
on and original features > together. My question is how to tie them back to other parts of the data, > which was not in LP. > > For example, I have a bunch of other dimensions which are not part of > features or label. > > Sorry if this is a stupid question. > > On Wed, Au

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-18 Thread Yanbo Liang
. >> For now, I decided to just put my code inside org.apache.spark.ml to be >> able to access private classes. >> >> Thanks, >> Alexey >> >> On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang <yblia...@gmail.com> wrote: >> >>> It seams that Vec

Re: VectorUDT with spark.ml.linalg.Vector

2016-08-16 Thread Yanbo Liang
It seems that VectorUDT is private and cannot be accessed outside of Spark currently. It should be public, but we need to do some refactoring before making it public. You can refer to the discussion at https://github.com/apache/spark/pull/12259 . Thanks Yanbo 2016-08-16 9:48 GMT-07:00 alexeys

Re: SPARK MLLib - How to tie back Model.predict output to original data?

2016-08-16 Thread Yanbo Liang
MLlib keeps the original dataset during transformation; it just appends new columns to the existing DataFrame. That is, you can get both the prediction value and the original features from the output DataFrame of model.transform. Thanks Yanbo 2016-08-16 17:48 GMT-07:00 ayan guha : >
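
For illustration (model, testDF, and the "id" column are assumptions), the appended column sits next to the originals:

    val predictions = model.transform(testDF)          // all testDF columns are preserved
    predictions.select("id", "features", "prediction") // original columns plus the new one
      .show()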

Re: Spark's Logistic Regression runs unstable on Yarn cluster

2016-08-16 Thread Yanbo Liang
Could you check the log to see how many iterations your LoR runs? Does your program output the same model across different attempts? Thanks Yanbo 2016-08-12 3:08 GMT-07:00 olivierjeunen : > I'm using pyspark ML's logistic regression implementation to do some >

Re: Linear regression, weights constraint

2016-08-16 Thread Yanbo Liang
Spark MLlib does not support boxed constraints on model coefficients currently. Thanks Yanbo 2016-08-15 3:53 GMT-07:00 letaiv : > Hi all, > > Is there any approach to add constrain for weights in linear regression? > What I need is least squares regression with

Re: using matrix as column datatype in SparkSQL Dataframe

2016-08-10 Thread Yanbo Liang
A good way is to implement your own data source to load data in matrix format. You can refer to the LibSVM data source ( https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml/source/libsvm), which contains one column of vector type, which is very similar to matrix.

Re: Random forest binary classification H20 difference Spark

2016-08-10 Thread Yanbo Liang
Hi Samir, Did you use VectorAssembler to assemble some columns into the feature column? If there are NULLs in your dataset, VectorAssembler will throw this exception. You can use DataFrame.drop() or DataFrame.replace() to drop/substitute NULL values. Thanks Yanbo 2016-08-07 19:51 GMT-07:00

Re: Logistic regression formula string

2016-08-10 Thread Yanbo Liang
I think you can output the schema of the DataFrame that will be fed into the estimator, such as LogisticRegression. The output array will be the encoded feature names corresponding to the coefficients of the model. Thanks Yanbo 2016-08-08 15:53 GMT-07:00 Cesar : > > I have a data

Re: [MLlib] Term Frequency in TF-IDF seems incorrect

2016-08-01 Thread Yanbo Liang
Hi Hao, HashingTF directly applies a hash function (MurmurHash3) to the features to determine their column index. It takes no account of the term frequency or the length of the document. It does similar work to sklearn's FeatureHasher. The result is increased speed and reduced

Re: K-means Evaluation metrics

2016-07-24 Thread Yanbo Liang
Spark MLlib's KMeansModel provides a "computeCost" function, which returns the sum of squared distances of points to their nearest centers as the k-means cost on the given dataset. Thanks Yanbo 2016-07-24 17:30 GMT-07:00 janardhan shetty : > Hi, > > I was trying to evaluate
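
A minimal sketch, assuming `data` is an RDD[org.apache.spark.mllib.linalg.Vector]:

    import org.apache.spark.mllib.clustering.KMeans

    val model = KMeans.train(data, 3, 20) // k = 3 clusters, 20 iterations (assumed values)
    val wssse = model.computeCost(data)   // sum of squared distances to nearest centers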

Re: Frequent Item Pattern Spark ML Dataframes

2016-07-24 Thread Yanbo Liang
You can refer this JIRA (https://issues.apache.org/jira/browse/SPARK-14501) for porting spark.mllib.fpm to spark.ml. Thanks Yanbo 2016-07-24 11:18 GMT-07:00 janardhan shetty : > Is there any implementation of FPGrowth and Association rules in Spark > Dataframes ? > We

Re: Locality sensitive hashing

2016-07-24 Thread Yanbo Liang
Hi Janardhan, Please refer the JIRA (https://issues.apache.org/jira/browse/SPARK-5992) for the discussion about LSH. Regards Yanbo 2016-07-24 7:13 GMT-07:00 Karl Higley : > Hi Janardhan, > > I collected some LSH papers while working on an RDD-based implementation. > Links

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Sorry for the wrong link; what you should refer to is jpmml-sparkml ( https://github.com/jpmml/jpmml-sparkml). Thanks Yanbo 2016-07-24 4:46 GMT-07:00 Yanbo Liang <yblia...@gmail.com>: > Spark does not support exporting ML models to PMML currently. You can try > the third party jpmml-

Re: Saving a pyspark.ml.feature.PCA model

2016-07-24 Thread Yanbo Liang
Spark does not support exporting ML models to PMML currently. You can try the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package, which supports a subset of ML models. Thanks Yanbo 2016-07-20 11:14 GMT-07:00 Ajinkya Kale : > Just found Google dataproc has

Re: Distributed Matrices - spark mllib

2016-07-24 Thread Yanbo Liang
Hi Gourav, I cannot reproduce your problem. The following code snippet works well on my local machine; you can try to verify it in your environment. Otherwise, could you provide more information so that others can reproduce your problem? from pyspark.mllib.linalg.distributed import CoordinateMatrix,

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-17 Thread Yanbo Liang
t; functionality is not available to me in the python spark 1.4 api. > > Regards, > Tobi > > On Jul 16, 2016 4:53 AM, "Yanbo Liang" <yblia...@gmail.com> wrote: > >> Hi Tobi, >> >> The MLlib RDD-based API does support to apply transformation on both

Re: Feature importance IN random forest

2016-07-16 Thread Yanbo Liang
Spark 1.5 only supports getting feature importances for RandomForestClassificationModel and RandomForestRegressionModel via Scala. This feature was not supported in PySpark until 2.0.0. It's very straightforward with a few lines of code. rf = RandomForestClassifier(numTrees=3, maxDepth=2,

Re: bisecting kmeans model tree

2016-07-16 Thread Yanbo Liang
Currently we do not expose the APIs to get the Bisecting KMeans tree structure; they are private within the ml.clustering package scope. But I think we should make a plan to expose these APIs like we did for Decision Tree. Thanks Yanbo 2016-07-12 11:45 GMT-07:00 roni : >

Re: Dense Vectors outputs in feature engineering

2016-07-16 Thread Yanbo Liang
Since you use two steps (StringIndexer and OneHotEncoder) to encode categories into a Vector, I guess you want to decode the resulting vector back into the original categories. Suppose you have a DataFrame with only one column named "name" and there are three categories: "b", "a", "c" (ranked by frequency).

Re: Filtering RDD Using Spark.mllib's ChiSqSelector

2016-07-16 Thread Yanbo Liang
Hi Tobi, The MLlib RDD-based API does support applying the transformation to both a Vector and an RDD, but you did not use it the appropriate way. Suppose you have an RDD with a LabeledPoint in each line; you can refer to the following code snippet to train a ChiSqSelectorModel and do the transformation:
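
A sketch of that RDD-based usage, assuming `data` is an RDD[LabeledPoint]:

    import org.apache.spark.mllib.feature.ChiSqSelector
    import org.apache.spark.mllib.regression.LabeledPoint

    val selector = new ChiSqSelector(50) // keep the top 50 features (assumed value)
    val model = selector.fit(data)       // fit on the whole RDD
    val filtered = data.map { lp =>      // then transform each Vector inside the RDD
      LabeledPoint(lp.label, model.transform(lp.features))
    }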

Re: QuantileDiscretizer not working properly with big dataframes

2016-07-16 Thread Yanbo Liang
Could you tell us the Spark version you used? We fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one of these versions and retry. If this issue still exists, please let us know. Thanks Yanbo 2016-07-12 11:03 GMT-07:00 Pasquinell Urbani < pasquinell.urb...@exalitica.com>: > In the

Re: Isotonic Regression, run method overloaded Error

2016-07-11 Thread Yanbo Liang
IsotonicRegression can handle a feature column of vector type. It will extract a certain index (controlled by the param "featureIndex") of this feature vector and feed it into model training. It will perform the pool adjacent violators algorithm on each partition, so it's distributed and the data is
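
A minimal sketch of the featureIndex param (column names assumed):

    import org.apache.spark.ml.regression.IsotonicRegression

    val ir = new IsotonicRegression()
      .setFeaturesCol("features") // a vector-typed column is fine
      .setFeatureIndex(1)         // use the second element of each feature vector
    val model = ir.fit(df)        // df: assumed training DataFrame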

Re: Isotonic Regression, run method overloaded Error

2016-07-10 Thread Yanbo Liang
Hi Swaroop, Would you mind sharing your code so that others can help you figure out what caused this error? I can run the isotonic regression examples without problems. Thanks Yanbo 2016-07-08 13:38 GMT-07:00 dsp : > Hi I am trying to perform Isotonic Regression on a data set with

Re: mllib based on dataset or dataframe

2016-07-10 Thread Yanbo Liang
DataFrame is a special case of Dataset, so they mean the same thing. Actually, the ML pipeline API will accept Dataset[_] instead of DataFrame in Spark 2.0. More accurately, we can say that MLlib will focus on the Dataset-based API for further development. Thanks Yanbo 2016-07-10 20:35

Re: Spark MLlib: MultilayerPerceptronClassifier error?

2016-07-04 Thread Yanbo Liang
Would you mind filing a JIRA to track this issue? I will take a look when I have time. 2016-07-04 14:09 GMT-07:00 mshiryae : > Hi, > > I am trying to train model by MultilayerPerceptronClassifier. > > It works on sample data from >

Re: Graphframe Error

2016-07-04 Thread Yanbo Liang
Hi Arun, The command bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6 will automatically load the required graphframes jar file from the Maven repository; it is not affected by the location where the jar file was placed. Your example works well on my laptop. Or you can try with

Re: Several questions about how pyspark.ml works

2016-07-02 Thread Yanbo Liang
Hi Nick, Please see my inline reply. Thanks Yanbo 2016-06-12 3:08 GMT-07:00 XapaJIaMnu : > Hey, > > I have some additional Spark ML algorithms implemented in scala that I > would > like to make available in pyspark. For a reference I am looking at the > available logistic

Re: Trainning a spark ml linear regresion model fail after migrating from 1.5.2 to 1.6.1

2016-07-02 Thread Yanbo Liang
Yes, WeightedLeastSquares cannot solve some ill-conditioned problems currently; community members have put some effort into resolving it (SPARK-13777). As a workaround, you can set the solver to "l-bfgs", which will train the LogisticRegressionModel with the L-BFGS optimization method. 2016-06-09

Re: Get both feature importance and ROC curve from a random forest classifier

2016-07-02 Thread Yanbo Liang
Hi Mathieu, Using the new ml package to train a RandomForestClassificationModel, you can get feature importances. Then you can convert the prediction result to an RDD and feed it into BinaryClassificationMetrics for the ROC curve. You can refer to the following code snippet: val rf = new

Re: Ideas to put a Spark ML model in production

2016-07-02 Thread Yanbo Liang
Let's suppose you have trained a LogisticRegressionModel and saved it at "/tmp/lr-model". You can copy the directory to the production environment and use it to make predictions on users' new data. You can refer to the following code snippet: val model = LogisticRegressionModel.load("/tmp/lr-model") val

Re: Custom Optimizer

2016-07-02 Thread Yanbo Liang
Spark MLlib does not support plugging in a custom optimizer, since the optimizer interface is private. Thanks Yanbo 2016-06-23 16:56 GMT-07:00 Stephen Boesch : > My team has a custom optimization routine that we would have wanted to > plug in as a replacement for the default LBFGS /

Re: Spark ML - Java implementation of custom Transformer

2016-07-02 Thread Yanbo Liang
Hi Mehdi, Could you share your code so that we can help you figure out the problem? Actually, JavaTestParams works well, but there are some compatibility issues with JavaDeveloperApiExample. We removed JavaDeveloperApiExample temporarily in Spark 2.0 in order not to confuse users. Since the

Re: ML regression - spark context dies without error

2016-06-05 Thread Yanbo Liang
Could you tell me which regression algorithm you used, the parameters you set, and the detailed exception information? Even better, paste your code and exception here if applicable, so other members can help you diagnose the problem. Thanks Yanbo 2016-05-12 2:03 GMT-07:00 AlexModestov

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
g and simply run the glm model. String columns will be directly > one-hot encoded by the glm provided by sparkR ? > > Just wanted to clarify as in R we need to apply as.factor for categorical > variables. > > val dfNew = df.withColumn("C0",df.col("C0").cast("

Re: Running glm in sparkR (data pre-processing step)

2016-05-30 Thread Yanbo Liang
Hi Abhi, In SparkR glm, categorical features (columns of type string) will be one-hot encoded automatically, so pre-processing like `as.factor` is not necessary; you can feed your data directly to model training. Thanks Yanbo 2016-05-30 2:06 GMT-07:00 Abhishek Anand :

Re: Possible bug involving Vectors with a single element

2016-05-27 Thread Yanbo Liang
Spark MLlib's Vector only supports data of double type, so it's reasonable to throw an exception when you create a Vector with elements of unicode type. 2016-05-24 7:27 GMT-07:00 flyinggip : > Hi there, > > I notice that there might be a bug in pyspark.mllib.linalg.Vectors when

Re: Reg:Reading a csv file with String label into labelepoint

2016-03-16 Thread Yanbo Liang
Actually, it's unnecessary to convert a csv row to LabeledPoint, because we use DataFrame as the standard data format when training a model with Spark ML. What you should do is convert the double attributes to a Vector column named "features". Then you can train the ML model by specifying the featuresCol and
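
A sketch of that assembling step (the Spark 2.x csv reader and the column names are assumptions):

    import org.apache.spark.ml.feature.VectorAssembler

    val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data.csv")
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3")) // hypothetical numeric columns
      .setOutputCol("features")
    val assembled = assembler.transform(df)  // ready for estimators via featuresCol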

Re: SparkML Using Pipeline API locally on driver

2016-02-28 Thread Yanbo Liang
Hi Jean, DataFrame is connected with SQLContext, which is connected with SparkContext, so I think it's impossible to run `model.transform` without touching Spark. I think what you need is for the model to support prediction on a single instance; then you can make predictions without Spark. You can track

Re: Saving and Loading Dataframes

2016-02-28 Thread Yanbo Liang
oad( InputFile ) > df.show; df.printSchema > > df.write.format("json").mode("overwrite").save( OutputDir ) > val data = sqlc.read.format("json").load( OutputDir ) > data.show; data.printSchema > > def main( args: Array[String]):Unit = {}

Re: Survival Curves using AFT implementation in Spark

2016-02-26 Thread Yanbo Liang
Hi Stuti, AFTSurvivalRegression does not support computing the predicted survival functions/curves currently. I don't know whether the quantile predictions can help you; you can refer to the example

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-25 Thread Yanbo Liang
Actually, Spark SQL's `groupBy` with `count` can get the frequency in each bin. You can also try DataFrameStatFunctions.freqItems() to get the frequent items for columns. Thanks Yanbo 2016-02-24 1:21 GMT+08:00 Burak Yavuz : > You could use the Bucketizer transformer in Spark ML.
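
A sketch combining Bucketizer (suggested earlier in the thread) with groupBy/count; the splits and column names are assumptions:

    import org.apache.spark.ml.feature.Bucketizer

    val bucketizer = new Bucketizer()
      .setInputCol("value")
      .setOutputCol("bin")
      .setSplits(Array(0.0, 10.0, 20.0, 30.0)) // bin edges
    bucketizer.transform(df)                   // df: assumed input DataFrame
      .groupBy("bin").count()                  // frequency of each bin
      .show()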

Re: Saving and Loading Dataframes

2016-02-25 Thread Yanbo Liang
Hi Raj, Could you share your code, which will help others diagnose this issue? Which version did you use? I cannot reproduce this problem in my environment. Thanks Yanbo 2016-02-26 10:49 GMT+08:00 raj.kumar : > Hi, > > I am using mllib. I use the ml vectorization

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-16 Thread Yanbo Liang
val ssModel = standardScaler.fit(ovarian2) val ovarian3 = ssModel.transform(ovarian2) val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features") val model = aft.fit(ovarian3) val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x => x._1 / x._2 }

Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-15 Thread Yanbo Liang
Hi Stuti, This is a bug in AFTSurvivalRegression; we did not handle "lossSum == infinity" properly. I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this issue and will send a PR. Thanks for reporting it. Yanbo 2016-02-12 15:03 GMT+08:00 Stuti Awasthi

Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true. But it is not always correct for a pipeline model; some transformers in the pipeline, such as OneHotEncoder, will change the features. 2016-02-03 1:21 GMT+08:00 jmvllt : > Hi everyone, > > This may sound like a stupid question but I need to be sure of this

Re: Extracting p values in Logistic regression using mllib scala

2016-01-24 Thread Yanbo Liang
Hi Chandan, MLlib only supports getting p-values and t-values from the Linear Regression model; other models such as the Logistic Regression model are not supported currently. This feature is under development and will be released in the next version (Spark 2.0). Thanks Yanbo 2016-01-18 16:45 GMT+08:00 Chandan Verma

Re: has any one implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
t-classification-1.html > I > do not get the same results. I’ll put my code up on github over the weekend > if anyone is interested > > Andy > > From: Yanbo Liang <yblia...@gmail.com> > Date: Tuesday, January 19, 2016 at 1:11 AM > > To: Andrew Davidson <

Re: can we create dummy variables from categorical variables, using sparkR

2016-01-24 Thread Yanbo Liang
Hi Devesh, RFormula will encode categorical variables (columns of string type) as dummy variables automatically. You do not need to do the dummy transformation explicitly if you want to train a machine learning model using SparkR, although SparkR only supports a limited set of ML algorithms (GLM) currently. Thanks

Re: how to save Matrix type result to hdfs file using java

2016-01-24 Thread Yanbo Liang
Matrix can be saved as a column of type MatrixUDT.

Re: has any one implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
n("AEDWIP: indexOfSentence: " + indexOfSentence); > > > int indexOfAnother = tf.indexOf("another"); > > System.err.println("AEDWIP: indexOfAnother: " + indexOfAnother); > > > for (Vector v: localTfIdfs) { > > System.err.println("AEDWIP

Re: has any one implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
Hi Andy, Actually, the output of the ML IDF model is the TF-IDF vector of each instance rather than the IDF vector, so it's unnecessary to do member-wise multiplication to calculate the TF-IDF value. You can refer to the code here:
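
A minimal pipeline sketch of that point: IDF.transform already emits per-instance TF-IDF vectors (column names assumed):

    import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

    val words = new Tokenizer().setInputCol("text").setOutputCol("words").transform(df)
    val tf = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").transform(words)
    val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)
    val tfidf = idfModel.transform(tf) // "features" holds each instance's TF-IDF vector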

Re: Feature importance for RandomForestRegressor in Spark 1.5

2016-01-17 Thread Yanbo Liang
Hi Robin, #1 This feature is available from Spark 1.5.0. #2 You should use the new ML package rather than the old MLlib package to train the Random Forest model and get featureImportances, because it is only exposed in the ML package. You can refer to the documents:

Re: AIC in Linear Regression in ml pipeline

2016-01-15 Thread Yanbo Liang
Hi Arunkumar, Outputting the AIC value for Linear Regression is not supported currently. This feature is under development and will be released in Spark 2.0. Thanks Yanbo 2016-01-15 17:20 GMT+08:00 Arunkumar Pillai : > Hi > > Is it possible to get AIC value in Linear

Re: ml.classification.NaiveBayesModel how to reshape theta

2016-01-13 Thread Yanbo Liang
Yep, the number of rows of the Matrix theta is the number of classes, and the number of columns is the number of features. 2016-01-13 10:47 GMT+08:00 Andy Davidson : > I am trying to debug my trained model by exploring theta > Theta is a Matrix. The java Doc for Matrix says that it is

Re: Deploying model built in SparkR

2016-01-11 Thread Yanbo Liang
Hi Chandan, Could you tell us what you mean by deploying the model? Using the model to make predictions in R? Thanks Yanbo 2016-01-11 20:40 GMT+08:00 Chandan Verma : > Hi All, > > Does any one over here has deployed a model produced in SparkR or atleast > help me with

Re: broadcast params to workers at the very beginning

2016-01-11 Thread Yanbo Liang
Hi, The parameters should be broadcast again after you update them on the driver side; then you can get the updated version on the worker side. Thanks Yanbo 2016-01-09 23:12 GMT+08:00 octavian.ganea : > Hi, > > In my app, I have a Params scala object that keeps all the specific
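
A sketch of that re-broadcast pattern (the Map contents are hypothetical; `sc` is an assumed SparkContext):

    var bc = sc.broadcast(Map("threshold" -> 0.5)) // initial broadcast of the parameters
    // ... workers read bc.value inside tasks ...
    bc.unpersist()                                 // drop the stale copy from executors
    bc = sc.broadcast(Map("threshold" -> 0.7))     // broadcast the updated version
    // tasks created after this point will see the new values via bc.value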

Re: StandardScaler in spark.ml.feature requires vector input?

2016-01-11 Thread Yanbo Liang
Hi Kristina, The input column of StandardScaler must be of vector type, because it's usually used for feature scaling before model training, and the feature column is of vector type in most cases. If you only want to standardize a numeric column, you can wrap it in a vector and feed it into

Re: Date Time Regression as Feature

2016-01-07 Thread Yanbo Liang
First, extract year, month, day, and time from the datetime. Then decide which variables can be treated as categorical features, such as year/month/day, and encode them into boolean form using OneHotEncoder. Finally, use VectorAssembler to assemble the encoded output vector and the other raw

Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is a HiveContext. sc <- sparkR.init() sqlContext <- sparkRHive.init(sc) 2016-01-06 20:35 GMT+08:00 Sandeep Khurana : > Felix > > I tried the option suggested by you. It gave below error. I am going to > try the option suggested by Prem .

Re: finding distinct count using dataframe

2016-01-05 Thread Yanbo Liang
Hi Arunkumar, You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or approxCountDistinct for an approximate result. 2016-01-05 17:11 GMT+08:00 Arunkumar Pillai : > Hi > > Is there any functions to find distinct count of all the variables in > dataframe.

Re: SparkML algos limitations question.

2016-01-04 Thread Yanbo Liang
can handle large models. (master should > have more memory because it runs LBFGS) In my experiments, I’ve trained the > models 12M and 32M parameters without issues. > > > > Best regards, Alexander > > > > *From:* Yanbo Liang [mailto:yblia...@gmail.com] > *Sent:* Sunda

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-04 Thread Yanbo Liang
thanks for info. Is it likely to change in (near :) ) future? Ability to > call this function only on local data (ie not in rdd) seems to be rather > serious limitation. > > cheers, > Tomasz > > On 02.01.2016 09:45, Yanbo Liang wrote: > >> Hi Tomasz, >> >> The GMM is

Re: GLM I'm ml pipeline

2016-01-03 Thread Yanbo Liang
AFAIK, Spark MLlib will improve and support most GLM functions in the next release (Spark 2.0). 2016-01-03 23:02 GMT+08:00 : > keyStoneML could be an alternative. > > Ardo. > > On 03 Jan 2016, at 15:50, Arunkumar Pillai > wrote: > > Is there any road

Re: Problem embedding GaussianMixtureModel in a closure

2016-01-02 Thread Yanbo Liang
Hi Tomasz, The GMM is bound to its peer Java GMM object, so it needs a reference to the SparkContext. Some MLlib (not ML) models are simple objects, such as KMeansModel, LinearRegressionModel, etc., but others refer to the SparkContext. The latter ones and their corresponding member functions should not be called

Re: frequent itemsets

2016-01-02 Thread Yanbo Liang
Hi Roberto, Could you share your code snippet so that others can help diagnose your problem? 2016-01-02 7:51 GMT+08:00 Roberto Pagliari : > When using the frequent itemsets APIs, I’m running into stackOverflow > exception whenever there are too many combinations to
