Hi All,
DataWorks Summit, San Jose, 2018 is a good place to share your experience of
advanced analytics, data science, machine learning and deep learning.
We have an Artificial Intelligence and Data Science session covering
technologies such as:
Apache Spark, Scikit-learn, TensorFlow, Keras,
The DataWorks Summit Europe is in Berlin, Germany this year, on April 16-19,
2018. This is a great place to talk about work you are doing in Apache Spark or
how you are using Spark for SQL/streaming processing, machine learning and data
science. Information on submitting an abstract is at
> Naïve Bayes, Random Forest, etc.) in parallel. Am I correct?
>
> If not, could you please point me to some resources where they have run
> multiple algorithms in parallel.
>
>
>
> Thank You very much. It is great help, I will try spark-sklearn.
>
> Prem
>
>
>
>
I guess you didn't install the R package `genalg` on all worker nodes. It is
not a built-in package in base R, so you need to install it on all worker
nodes manually or run `install.packages` inside your SparkR UDF.
Regarding how to download third-party packages and install them inside of
Hi Prem,
How large is your dataset? Can it fit on a single node?
If not, Spark MLlib provides CrossValidator, which can run multiple machine
learning algorithms in parallel on a distributed dataset and do parameter
search. FYI:
https://spark.apache.org/docs/latest/ml-tuning.html#cross-validation
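For example, a minimal sketch (the `training` DataFrame with "label" and
"features" columns is an assumption):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
val lr = new LogisticRegression()
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
val cvModel = cv.fit(training)  // fits each parameter setting on each fold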
Hi Sea,
Could you let us know which ML algorithm you use? What are the number of
instances and the dimension of your dataset?
AFAIK, Spark MLlib can train models with several million features if you
configure it correctly.
Thanks
Yanbo
On Thu, Aug 24, 2017 at 7:07 AM, Suzen, Mehmet
BlockMatrix.multiply will return another BlockMatrix. Inside this function
there are lots of RDD operations, but most of them are transformations. If
you don't trigger an action to obtain the blocks (an RDD of
((Int, Int), Matrix)) of the resulting BlockMatrix, the job will not run.
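A minimal sketch of triggering the computation (a and b are assumed
BlockMatrix values):
val c = a.multiply(b)  // lazy: only builds the RDD lineage
c.blocks.count()       // an action on the underlying RDD triggers the job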
Thanks
Hi Jeff,
Actually I have had an implementation of robust regression with Huber loss
for a long time (https://github.com/apache/spark/pull/14326). It is a fairly
straightforward port of the scikit-learn HuberRegressor.
The PR makes Huber regression a separate Estimator, and we found it can
be
Hi Simone,
Would you mind sharing the minimized code to reproduce this issue?
Yanbo
On Wed, Jul 5, 2017 at 10:52 PM, Simone Robutti
wrote:
> Hello, I have this problem and Google is not helping. Instead, it looks
> like an unreported bug and there are no hints to
It looks like your Spark job was running under user root, but your file
system operation was running under user jomernik. Since Spark will call the
corresponding file system (such as HDFS, S3) to commit the job (rename the
temporary file to a persistent one), it should have correct authorization
for both Spark and
Please consider using other classification models such as logistic
regression or GBT. Naive Bayes usually treats features as counts, which is
not suitable for features generated by a one-hot encoder.
Thanks
Yanbo
On Wed, May 31, 2017 at 3:58 PM, Amlan Jyoti wrote:
Since this function is used to compute the QR decomposition for a RowMatrix
of tall and skinny shape, the output R is always small.
On Fri, Jun 9, 2017 at 10:33 PM, Arun wrote:
> hi
>
> *def tallSkinnyQR(computeQ: Boolean = false):
See reply here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Will-higher-order-functions-in-spark-SQL-be-pushed-upstream-td21703.html
On Tue, Jun 20, 2017 at 10:02 PM, AssafMendelson
wrote:
> Hi,
>
> I have seen that databricks have higher order functions
The gfortran runtime library is still required for Spark 2.1 for better
performance.
If it's not present on your nodes, you will see a warning message and a
pure JVM implementation will be used instead, but you will not get the best
performance.
Thanks
Yanbo
On Wed, Jun 21, 2017 at 5:30 PM, Saroj C
Yeah, for binary classification data, you can also use
MulticlassClassificationEvaluator to evaluate other metrics which
BinaryClassificationEvaluator doesn't cover, such as accuracy, f1,
weightedPrecision and weightedRecall.
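For example, a sketch (a `predictions` DataFrame with "label" and
"prediction" columns is assumed):
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")  // or "accuracy", "weightedPrecision", "weightedRecall"
val f1 = evaluator.evaluate(predictions)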
Thanks
Yanbo
On Thu, May 11, 2017 at 10:31 PM, Lan Jiang
The Australia/Pacific version of DataWorks Summit is in Sydney this year,
September 20-21. This is a great place to talk about work you are doing in
Apache Spark or how you are using Spark. Information on submitting an
abstract is at
Hi Tim,
The Spark ML API doesn't support setting an initial model for GMM currently.
I hope we can get this feature into Spark 2.3.
Thanks
Yanbo
On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith wrote:
> Hi,
>
> I am trying to figure out the API to initialize a gaussian mixture model
> using
sion? The lambda that we
> normally pass when we call StreamingContext.getOrCreate.
>
>
>
>
>
>
>
>
> On Thu, Apr 27, 2017 at 8:47 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> Ahhh Thanks much! I miss my sparkConf.setJars function instead of this
>> ha
Could you try the following way?
val spark = SparkSession.builder
  .appName("my-application")
  .config("spark.jars", "a.jar, b.jar")
  .getOrCreate()
Thanks
Yanbo
On Thu, Apr 27, 2017 at 9:21 AM, kant kodali wrote:
> I am using Spark 2.1 BTW.
>
> On Wed, Apr 26, 2017 at 3:22
What about JOINing your table with a mapping table?
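For example, a rough sketch (the table and column names are assumptions):
import org.apache.spark.sql.functions.coalesce
// synonyms: DataFrame(variant, canonical); df has a "brand" column
val replaced = df.join(synonyms, df("brand") === synonyms("variant"), "left_outer")
  .withColumn("brand", coalesce(synonyms("canonical"), df("brand")))
  .drop("variant").drop("canonical")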
On Thu, Apr 27, 2017 at 9:58 PM, Nishanth
wrote:
> I am facing a major issue on replacement of Synonyms in my DataSet.
>
> I am trying to replace the synonym of the Brand names to its equivalent
> names.
>
> I have
You can try a UDF, like the following code snippet:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
df = spark.read.text("./README.md")
split_func = udf(lambda text: text.split(" "), ArrayType(StringType()))
df.withColumn("split_value",
Do you want a sparse model in which most of the coefficients are zeros? If
yes, using L1 regularization leads to sparsity. But the
LogisticRegressionModel coefficients vector's size is still equal to the
number of features; you can get the non-zero elements manually. Actually,
it would be a
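A minimal sketch of pulling out the non-zero elements (a trained `lrModel`
is assumed):
val nonZero = lrModel.coefficients.toArray.zipWithIndex
  .filter { case (w, _) => w != 0.0 }  // (weight, featureIndex) pairs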
Hi Adrian,
Did you try SQLTransformer? Your preprocessing steps are SQL operations and
can be handled by SQLTransformer within an MLlib pipeline.
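A minimal sketch; the SELECT statement is a placeholder for your own
preprocessing SQL, and __THIS__ stands for the input DataFrame:
import org.apache.spark.ml.feature.SQLTransformer
val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, v1 + v2 AS v_sum FROM __THIS__")
val transformed = sqlTrans.transform(df)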
Thanks
Yanbo
On Thu, Mar 9, 2017 at 11:02 AM, aATv wrote:
> I want to start using PySpark Mllib pipelines, but I don't understand
You can track https://issues.apache.org/jira/browse/SPARK-15784 for the
progress.
On Wed, Dec 21, 2016 at 7:08 AM, Nick Pentreath
wrote:
> It is part of the general feature parity roadmap. I can't recall offhand
> any blocker reasons it's just resources
> On Wed, 21
You can refer to this example (
http://spark.apache.org/docs/latest/ml-tuning.html#example-model-selection-via-cross-validation),
which uses BinaryClassificationEvaluator; it should be very straightforward
to switch to MulticlassClassificationEvaluator.
Thanks
Yanbo
On Sat, Nov 19, 2016 at 9:03
Hi Russell,
Do you want to use RowMatrix.columnSimilarities to calculate cosine
similarities?
If so, you should use the following steps:
val dataset: DataFrame = ...
// Convert the type of the features column from ml.linalg.Vector to
// mllib.linalg.Vector
val oldDataset: DataFrame = MLUtils.convertVectorColumnsFromML(dataset, "features")
The reason behind this error can be inferred from the error log:
MLUtils.convertMatrixColumnsFromML was used to convert ml.linalg.Matrix
to mllib.linalg.Matrix,
but it looks like the column type is ml.linalg.Vector in your case.
Could you check the type of the column "features" in your dataframe
This function is currently used internally; we will expose it as public to
support making predictions on a single instance.
See discussion at https://issues.apache.org/jira/browse/SPARK-10413.
Thanks
Yanbo
On Thu, Nov 17, 2016 at 1:24 AM, wobu wrote:
> Hi,
>
> we were using
Hi Pietro,
Actually we have implemented the R survreg() counterpart in Spark: the
accelerated failure time model. You can refer to AFTSurvivalRegression if
you use Scala/Java/Python. For SparkR users, you can try spark.survreg().
The algorithm is completely distributed and returns the same solution as
HashingTF was not designed to handle your case; you can try CountVectorizer,
which will keep the original terms as the vocabulary for retrieval.
CountVectorizer will compute a global term-to-index map, which can be
expensive for a large corpus and has the risk of OOM. IDF can accept
feature vectors
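A minimal CountVectorizer sketch (a "words" column holding arrays of
strings is assumed):
import org.apache.spark.ml.feature.CountVectorizer
val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(1 << 18)  // cap the vocabulary size to bound memory
  .fit(df)
cvModel.vocabulary  // the original terms, indexed by vector position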
Please increase the value of "maxMemoryInMB" for your
RandomForestClassifier or RandomForestRegressor.
It's a warning which will not affect the result, but it may make your
training slower.
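For example (a sketch; 512 is an arbitrary value above the 256 MB default):
import org.apache.spark.ml.classification.RandomForestClassifier
val rf = new RandomForestClassifier().setMaxMemoryInMB(512)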
Thanks
Yanbo
On Mon, Oct 17, 2016 at 8:21 PM, 张建鑫 (Marketing Dept.)
wrote:
> Hi Xi Shen
>
AFAIK, we can guarantee that with/without standardization the models always
converge to the same solution if there is no regularization. You can refer
to the test cases at:
The signs of the eigenvectors are essentially arbitrary, so both the Spark
and the Matlab results are right.
Thanks
On Thu, Jul 21, 2016 at 3:50 PM, Martin Somers wrote:
>
> just looking at a comparision between Matlab and Spark for svd with an
> input matrix N
>
>
> this is
It looks like you mixed ALS from the spark.ml and spark.mllib packages.
You can train the model with either one; meanwhile, you should use the
corresponding save/load functions.
You cannot train/save the model with spark.mllib ALS and then use spark.ml
ALS to load it. That will throw exceptions.
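A consistent save/load pair, sketched with the spark.ml API (the path,
column names and `ratings` DataFrame are assumptions):
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
val als = new ALS().setUserCol("user").setItemCol("item").setRatingCol("rating")
val model = als.fit(ratings)
model.save("/tmp/als-model")
val loaded = ALSModel.load("/tmp/als-model")  // load with the matching class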
on and original features
> together. My question is how to tie them back to other parts of the data,
> which was not in LP.
>
> For example, I have a bunch of other dimensions which are not part of
> features or label.
>
> Sorry if this is a stupid question.
>
> On Wed, Au
.
>> For now, I decided to just put my code inside org.apache.spark.ml to be
>> able to access private classes.
>>
>> Thanks,
>> Alexey
>>
>> On Tue, Aug 16, 2016 at 11:13 PM, Yanbo Liang <yblia...@gmail.com> wrote:
>>
>>> It seams that Vec
It seems that VectorUDT is private and cannot be accessed outside of Spark
currently. It should be public, but we need to do some refactoring before
making it public. You can refer to the discussion at
https://github.com/apache/spark/pull/12259 .
Thanks
Yanbo
2016-08-16 9:48 GMT-07:00 alexeys
MLlib keeps the original dataset during transformation; it just appends new
columns to the existing DataFrame. That is, you can get both the prediction
value and the original features from the output DataFrame of model.transform.
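A small sketch (a fitted `model` and a `testData` DataFrame are assumed):
val predictions = model.transform(testData)
predictions.select("features", "prediction").show()  // input columns are kept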
Thanks
Yanbo
2016-08-16 17:48 GMT-07:00 ayan guha :
>
Could you check the log to see how many iterations your LoR runs? Does your
program output the same model across different attempts?
Thanks
Yanbo
2016-08-12 3:08 GMT-07:00 olivierjeunen :
> I'm using pyspark ML's logistic regression implementation to do some
>
Spark MLlib does not support box constraints on model coefficients
currently.
Thanks
Yanbo
2016-08-15 3:53 GMT-07:00 letaiv :
> Hi all,
>
> Is there any approach to add constrain for weights in linear regression?
> What I need is least squares regression with
A good way is to implement your own data source to load data in matrix
format. You can refer to the LibSVM data source (
https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/ml/source/libsvm),
which contains one column of vector type, which is very similar to a matrix.
Hi Samir,
Did you use VectorAssembler to assemble some columns into the feature
column? If there are NULLs in your dataset, VectorAssembler will throw this
exception. You can use DataFrame.na.drop() or DataFrame.na.replace() to
drop/substitute the NULL values.
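For example, a sketch (the column names are assumptions):
import org.apache.spark.ml.feature.VectorAssembler
val cleaned = df.na.drop(Seq("col1", "col2"))  // drop rows with NULLs first
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2"))
  .setOutputCol("features")
val assembled = assembler.transform(cleaned)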
Thanks
Yanbo
2016-08-07 19:51 GMT-07:00
I think you can output the schema of the DataFrame which will be fed into
the estimator, such as LogisticRegression. The output array will be the
encoded feature names corresponding to the coefficients of the model.
Thanks
Yanbo
2016-08-08 15:53 GMT-07:00 Cesar :
>
> I have a data
Hi Hao,
HashingTF directly applies a hash function (MurmurHash3) to the features to
determine their column index. It takes no account of the term frequency or
the length of the document. It does similar work to sklearn's
FeatureHasher. The result is increased speed and reduced
Spark MLlib's KMeansModel provides a "computeCost" function which returns
the sum of squared distances of points to their nearest center as the
k-means cost on the given dataset.
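A sketch with the RDD-based API (`data` is an assumed RDD[Vector]):
import org.apache.spark.mllib.clustering.KMeans
val model = KMeans.train(data, k = 3, maxIterations = 20)
val wssse = model.computeCost(data)  // sum of squared distances to centers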
Thanks
Yanbo
2016-07-24 17:30 GMT-07:00 janardhan shetty :
> Hi,
>
> I was trying to evaluate
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501)
for the work on porting spark.mllib.fpm to spark.ml.
Thanks
Yanbo
2016-07-24 11:18 GMT-07:00 janardhan shetty :
> Is there any implementation of FPGrowth and Association rules in Spark
> Dataframes ?
> We
Hi Janardhan,
Please refer to the JIRA (https://issues.apache.org/jira/browse/SPARK-5992)
for the discussion about LSH.
Regards
Yanbo
2016-07-24 7:13 GMT-07:00 Karl Higley :
> Hi Janardhan,
>
> I collected some LSH papers while working on an RDD-based implementation.
> Links
Sorry for the wrong link; what you should refer to is jpmml-sparkml (
https://github.com/jpmml/jpmml-sparkml).
Thanks
Yanbo
2016-07-24 4:46 GMT-07:00 Yanbo Liang <yblia...@gmail.com>:
> Spark does not support exporting ML models to PMML currently. You can try
> the third party jpmml-
Spark does not support exporting ML models to PMML currently. You can try
the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package,
which supports a subset of ML models.
Thanks
Yanbo
2016-07-20 11:14 GMT-07:00 Ajinkya Kale :
> Just found Google dataproc has
Hi Gourav,
I cannot reproduce your problem. The following code snippet works well on
my local machine; you can try to verify it in your environment. Or could
you provide more information so that others can reproduce your problem?
from pyspark.mllib.linalg.distributed import CoordinateMatrix,
> functionality is not available to me in the python spark 1.4 api.
>
> Regards,
> Tobi
>
> On Jul 16, 2016 4:53 AM, "Yanbo Liang" <yblia...@gmail.com> wrote:
>
>> Hi Tobi,
>>
>> The MLlib RDD-based API does support to apply transformation on both
Spark 1.5 only supports getting feature importances for
RandomForestClassificationModel and RandomForestRegressionModel in Scala.
This feature is not supported in PySpark until 2.0.0.
It's very straightforward with a few lines of code:
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
Currently we do not expose the APIs to get the Bisecting KMeans tree
structure; they are private within the ml.clustering package scope.
But I think we should make a plan to expose those APIs, like what we did
for Decision Tree.
Thanks
Yanbo
2016-07-12 11:45 GMT-07:00 roni :
>
Since you use two steps (StringIndexer and OneHotEncoder) to encode
categories into a Vector, I guess you want to decode the resulting vector
into the original categories.
Suppose you have a DataFrame with only one column named "name" and there
are three categories: "b", "a", "c" (ranked by frequency).
Hi Tobi,
The MLlib RDD-based API does support applying a transformation to both a
Vector and an RDD, but you did not use it in the appropriate way.
Suppose you have an RDD with a LabeledPoint in each line; you can refer to
the following code snippet to train a ChiSqSelectorModel and do the
transformation:
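A rough sketch along those lines (not the original snippet; `data` is an
assumed RDD[LabeledPoint]):
import org.apache.spark.mllib.feature.ChiSqSelector
val selector = new ChiSqSelector(50)  // keep the top 50 features
val selectorModel = selector.fit(data)
val filtered = data.map(lp =>
  lp.copy(features = selectorModel.transform(lp.features)))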
Could you tell us the Spark version you used?
We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one
of these versions and retry.
If the issue still exists, please let us know.
Thanks
Yanbo
2016-07-12 11:03 GMT-07:00 Pasquinell Urbani <
pasquinell.urb...@exalitica.com>:
> In the
IsotonicRegression can handle a feature column of vector type. It will
extract a certain index (controlled by the param "featureIndex") of this
feature vector and feed it into model training. It will perform the pool
adjacent violators algorithm on each partition, so it's distributed and
the data is
Hi Swaroop,
Would you mind sharing your code so that others can help you figure out
what caused this error?
I can run the isotonic regression examples well.
Thanks
Yanbo
2016-07-08 13:38 GMT-07:00 dsp :
> Hi I am trying to perform Isotonic Regression on a data set with
DataFrame is a special case of Dataset, so they mean the same thing.
Actually the ML pipeline API will accept Dataset[_] instead of DataFrame in
Spark 2.0.
More accurately, we can say that MLlib will focus on the Dataset-based API
for further development.
Thanks
Yanbo
2016-07-10 20:35
Would you mind filing a JIRA to track this issue? I will take a look when
I have time.
2016-07-04 14:09 GMT-07:00 mshiryae :
> Hi,
>
> I am trying to train model by MultilayerPerceptronClassifier.
>
> It works on sample data from
>
Hi Arun,
The command
bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
will automatically load the required graphframes jar file from the Maven
repository; it is not affected by the location where the jar file is
placed. Your example works well on my laptop.
Or you can try with
Hi Nick,
Please see my inline reply.
Thanks
Yanbo
2016-06-12 3:08 GMT-07:00 XapaJIaMnu :
> Hey,
>
> I have some additional Spark ML algorithms implemented in scala that I
> would
> like to make available in pyspark. For a reference I am looking at the
> available logistic
Yes, WeightedLeastSquares cannot solve some ill-conditioned problems
currently; community members have made some efforts to resolve it
(SPARK-13777). As a workaround, you can set the solver to "l-bfgs", which
will train the LogisticRegressionModel with the L-BFGS optimization method.
2016-06-09
Hi Mathieu,
Using the new ml package to train a RandomForestClassificationModel, you
can get feature importances. Then you can convert the prediction result to
an RDD and feed it into BinaryClassificationEvaluator for the ROC curve.
You can refer to the following code snippet:
val rf = new RandomForestClassifier()
Let's suppose you have trained a LogisticRegressionModel and saved it at
"/tmp/lr-model". You can copy the directory to the production environment
and use it to make predictions on users' new data. You can refer to the
following code snippets:
val model = LogisticRegressionModel.load("/tmp/lr-model")
val predictions = model.transform(newData)  // newData: DataFrame of new data
Spark MLlib does not support plugging in a custom optimizer, since the
optimizer interface is private.
Thanks
Yanbo
2016-06-23 16:56 GMT-07:00 Stephen Boesch :
> My team has a custom optimization routine that we would have wanted to
> plug in as a replacement for the default LBFGS /
Hi Mehdi,
Could you share your code so that we can help you figure out the problem?
Actually JavaTestParams works well, but there is a compatibility issue with
JavaDeveloperApiExample.
We have removed JavaDeveloperApiExample temporarily in Spark 2.0 in order
not to confuse users. Since the
Could you tell me which regression algorithm you used, the parameters you
set, and the detailed exception information? Or, better, paste your code
and the exception here if applicable, so that other members can help you
diagnose the problem.
Thanks
Yanbo
2016-05-12 2:03 GMT-07:00 AlexModestov
g and simply run the glm model. String columns will be directly
> one-hot encoded by the glm provided by sparkR ?
>
> Just wanted to clarify as in R we need to apply as.factor for categorical
> variables.
>
> val dfNew = df.withColumn("C0",df.col("C0").cast("
Hi Abhi,
In SparkR glm, categorical features (columns of type string) will be
one-hot encoded automatically.
So pre-processing like `as.factor` is not necessary; you can directly feed
your data to the model training.
Thanks
Yanbo
2016-05-30 2:06 GMT-07:00 Abhishek Anand :
Spark MLlib Vector only supports data of double type, so it's reasonable to
throw an exception when you create a Vector with an element of unicode type.
2016-05-24 7:27 GMT-07:00 flyinggip :
> Hi there,
>
> I notice that there might be a bug in pyspark.mllib.linalg.Vectors when
Actually it's unnecessary to convert a CSV row to a LabeledPoint, because we
use DataFrame as the standard data format when training a model with Spark
ML. What you should do is convert the double attributes into a Vector column
named "feature". Then you can train the ML model by specifying the featureCol and
Hi Jean,
A DataFrame is connected with a SQLContext, which is connected with a
SparkContext, so I think it's impossible to run `model.transform` without
touching Spark.
I think what you need is for the model to support prediction on a single
instance; then you could make predictions without Spark. You can track
oad( InputFile )
> df.show; df.printSchema
>
> df.write.format("json").mode("overwrite").save( OutputDir )
> val data = sqlc.read.format("json").load( OutputDir )
> data.show; data.printSchema
>
> def main( args: Array[String]):Unit = {}
Hi Stuti,
AFTSurvivalRegression does not support computing the predicted survival
functions/curves currently.
I don't know whether the quantile predictions can help you; you can refer
to the example
Actually Spark SQL `groupBy` with `count` can get the frequency in each bin.
You can also try DataFrameStatFunctions.freqItems() to get the frequent
items for columns.
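For example, a sketch (assuming Bucketizer wrote its output to a "bucket"
column):
val histogram = bucketed.groupBy("bucket").count().orderBy("bucket")
histogram.show()  // frequency per bin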
Thanks
Yanbo
2016-02-24 1:21 GMT+08:00 Burak Yavuz :
> You could use the Bucketizer transformer in Spark ML.
Hi Raj,
Could you share your code so that others can help diagnose this issue?
Which version did you use?
I cannot reproduce this problem in my environment.
Thanks
Yanbo
2016-02-26 10:49 GMT+08:00 raj.kumar :
> Hi,
>
> I am using mllib. I use the ml vectorization
val ssModel = standardScaler.fit(ovarian2)
val ovarian3 = ssModel.transform(ovarian2)
val aft = new AFTSurvivalRegression().setFeaturesCol("standardized_features")
val model = aft.fit(ovarian3)
val newCoefficients = model.coefficients.toArray.zip(ssModel.std.toArray).map { x =>
  x._1 / x._2
}
Hi Stuti,
This is a bug in AFTSurvivalRegression; we did not handle "lossSum ==
infinity" properly.
I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track
this issue and will send a PR.
Thanks for reporting it.
Yanbo
2016-02-12 15:03 GMT+08:00 Stuti Awasthi
For your case, it's true.
But it's not always correct for a pipeline model; some transformers in the
pipeline, such as OneHotEncoder, will change the features.
2016-02-03 1:21 GMT+08:00 jmvllt :
> Hi everyone,
>
> This may sound like a stupid question but I need to be sure of this
Hi Chandan,
MLlib only supports getting the p-values and t-values from the linear
regression model; other models such as the logistic model are not supported
currently. This feature is under development and will be released in the
next version (Spark 2.0).
Thanks
Yanbo
2016-01-18 16:45 GMT+08:00 Chandan Verma
t-classification-1.html
> I
> do not get the same results. I’ll put my code up on github over the weekend
> if anyone is interested
>
> Andy
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Tuesday, January 19, 2016 at 1:11 AM
>
> To: Andrew Davidson <
Hi Devesh,
RFormula will encode categorical variables (columns of string type) as
dummy variables automatically. You do not need to do the dummy transform
explicitly if you want to train a machine learning model using SparkR,
although SparkR only supports a limited set of ML algorithms (GLM) currently.
Thanks
A Matrix can be saved as a column of type MatrixUDT.
n("AEDWIP: indexOfSentence: " + indexOfSentence);
>
>
> int indexOfAnother = tf.indexOf("another");
>
> System.err.println("AEDWIP: indexOfAnother: " + indexOfAnother);
>
>
> for (Vector v: localTfIdfs) {
>
> System.err.println("AEDWIP
Hi Andy,
Actually, the output of the ML IDF model is the TF-IDF vector of each
instance rather than the IDF vector.
So it's unnecessary to do member-wise multiplication to calculate the
TF-IDF value. You can refer to the code here:
Hi Robin,
#1 This feature is available from Spark 1.5.0.
#2 You should use the new ML rather than the old MLlib package to train the
Random Forest model and get featureImportances, because it is only exposed
in the ML package. You can refer to the documents:
Hi Arunkumar,
Outputting the AIC value for Linear Regression is not supported currently.
This feature is under development and will be released in Spark 2.0.
Thanks
Yanbo
2016-01-15 17:20 GMT+08:00 Arunkumar Pillai :
> Hi
>
> Is it possible to get AIC value in Linear
Yep, the number of rows of the Matrix theta is the number of classes and
the number of columns is the number of features.
2016-01-13 10:47 GMT+08:00 Andy Davidson :
> I am trying to debug my trained model by exploring theta
> Theta is a Matrix. The java Doc for Matrix says that it is
Hi Chandan,
Could you tell us what you mean by deploying the model? Using the model to
make predictions in R?
Thanks
Yanbo
2016-01-11 20:40 GMT+08:00 Chandan Verma :
> Hi All,
>
> Does any one over here has deployed a model produced in SparkR or atleast
> help me with
Hi,
The parameters should be broadcast again after you update them on the
driver side; then you can get the updated version on the worker side.
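A sketch of the pattern (the names are assumptions):
var bcParams = sc.broadcast(params)
// ... after updating `params` on the driver:
bcParams.unpersist()             // drop the stale copies on the executors
bcParams = sc.broadcast(params)  // tasks now see the updated values
val result = rdd.map(x => compute(x, bcParams.value))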
Thanks
Yanbo
2016-01-09 23:12 GMT+08:00 octavian.ganea :
> Hi,
>
> In my app, I have a Params scala object that keeps all the specific
Hi Kristina,
The input column of StandardScaler must be of vector type, because it's
usually used for feature scaling before model training, and the type of the
feature column should be vector in most cases.
If you only want to standardize a numeric column, you can wrap it in a
vector and feed it into
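For example, a sketch of the wrapping step (the column names are
assumptions):
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
val assembled = new VectorAssembler()
  .setInputCols(Array("myNumericCol"))
  .setOutputCol("features")
  .transform(df)
val scaled = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .fit(assembled)
  .transform(assembled)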
First extract the year, month, day and time from the datetime.
Then decide which variables can be treated as categorical features, such as
year/month/day, and encode them into boolean form using OneHotEncoder.
Finally, use VectorAssembler to assemble the encoded output vectors and the
other raw
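A rough sketch of those steps (the column names are assumptions):
import org.apache.spark.sql.functions.{month, dayofmonth}
import org.apache.spark.ml.feature.{OneHotEncoder, VectorAssembler}
val withParts = df
  .withColumn("month", month(df("datetime")))
  .withColumn("day", dayofmonth(df("datetime")))
val encoded = new OneHotEncoder()
  .setInputCol("month").setOutputCol("monthVec")
  .transform(withParts)
val output = new VectorAssembler()
  .setInputCols(Array("monthVec", "day", "otherRawFeature"))
  .setOutputCol("features")
  .transform(encoded)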
You should ensure your sqlContext is a HiveContext.
sc <- sparkR.init()
sqlContext <- sparkRHive.init(sc)
2016-01-06 20:35 GMT+08:00 Sandeep Khurana :
> Felix
>
> I tried the option suggested by you. It gave below error. I am going to
> try the option suggested by Prem .
Hi Arunkumar,
You can use datasetDF.select(countDistinct(col1, col2, col3, ...)) or
approxCountDistinct for an approximate result.
2016-01-05 17:11 GMT+08:00 Arunkumar Pillai :
> Hi
>
> Is there any functions to find distinct count of all the variables in
> dataframe.
can handle large models. (master should
> have more memory because it runs LBFGS) In my experiments, I’ve trained the
> models 12M and 32M parameters without issues.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Yanbo Liang [mailto:yblia...@gmail.com]
> *Sent:* Sunda
thanks for info. Is it likely to change in (near :) ) future? Ability to
> call this function only on local data (ie not in rdd) seems to be rather
> serious limitation.
>
> cheers,
> Tomasz
>
> On 02.01.2016 09:45, Yanbo Liang wrote:
>
>> Hi Tomasz,
>>
>> The GMM is
AFAIK, Spark MLlib will improve and support most GLM functions in the next
release (Spark 2.0).
2016-01-03 23:02 GMT+08:00 :
> keyStoneML could be an alternative.
>
> Ardo.
>
> On 03 Jan 2016, at 15:50, Arunkumar Pillai
> wrote:
>
> Is there any road
Hi Tomasz,
The GMM is bound to its peer Java GMM object, so it needs a reference to
the SparkContext.
Some of the MLlib (not ML) models are simple objects, such as KMeansModel,
LinearRegressionModel, etc., but others refer to the SparkContext. The
latter ones and their corresponding member functions should not be called
Hi Roberto,
Could you share your code snippet so that others can help diagnose your
problem?
2016-01-02 7:51 GMT+08:00 Roberto Pagliari :
> When using the frequent itemsets APIs, I’m running into stackOverflow
> exception whenever there are too many combinations to