Could you attach the executor log? That may help identify the root
cause. -Xiangrui
On Mon, Jan 5, 2015 at 11:12 PM, Priya Ch wrote:
> Hi All,
>
> Word2Vec and TF-IDF algorithms in spark mllib-1.1.0 are working only in
> local mode and not in distributed mode. A null pointer exception has been
> th
This is addressed in https://issues.apache.org/jira/browse/SPARK-4789.
In the new pipeline API, we can simply output two columns, one for the
best predicted class, and the other for probabilities or confidence
scores for each class. -Xiangrui
On Tue, Jan 6, 2015 at 11:43 AM, Jianguo Li wrote:
> H
Which Spark version are you using? We made this configurable in 1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L202
-Xiangrui
On Tue, Jan 6, 2015 at 12:57 PM, Fernando O. wrote:
> Hi,
>I was doing some tests with ALS and I
rk-submit script
> Am I correct?
>
>
>
> On Mon, Jan 5, 2015 at 10:35 PM, Xiangrui Meng wrote:
>>
>> It might be hard to do that with spark-submit, because the executor
>> JVMs may be already up and running before a user runs spark-submit.
>> You can try to
How big is your dataset, and what is the vocabulary size? -Xiangrui
On Sun, Jan 4, 2015 at 11:18 PM, Eric Zhen wrote:
> Hi,
>
> When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU
> usage. Here is the jstack output:
>
> "main" prio=10 tid=0x40112800 nid=0x46f2 runnable
I created a JIRA for it:
https://issues.apache.org/jira/browse/SPARK-5094. Hopefully someone
will work on it and make it available in the 1.3 release. -Xiangrui
On Sun, Jan 4, 2015 at 6:58 PM, Christopher Thom
wrote:
> Hi,
>
>
>
> I wonder if anyone knows when a python API will be added for Grad
It might be hard to do that with spark-submit, because the executor
JVMs may be already up and running before a user runs spark-submit.
You can try to use `System.setProperty` to change the property at
runtime, though it doesn't seem to be a good solution. -Xiangrui
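For illustration only, a minimal sketch (the property name and value are
placeholders; the call has to happen before the SparkContext, or whatever
code reads the property, is created):

import org.apache.spark.{SparkConf, SparkContext}

// hypothetical property; must be set before it is read
System.setProperty("spark.some.property", "some-value")
val conf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(conf)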
On Fri, Jan 2, 2015 at 6:28 AM,
There is an SVD++ implementation in GraphX. It would be nice if you
can compare its performance vs. Mahout. -Xiangrui
On Wed, Dec 24, 2014 at 6:46 AM, Prafulla Wani wrote:
> hi ,
>
> Is there any plan to add SVDPlusPlus based recommender to MLLib ? It is
> implemented in Mahout from this paper -
vice?)
>
> b0c1
>
> --
> Skype: boci13, Hangout: boci.b...@gmail.com
>
> On Tue, Dec 23, 2014 at 1:35 AM, Xiangrui Meng wrote:
>>
>> How big is the dataset you want to use in prediction? -Xiangrui
>>
Could you post your code? It sounds like a bug. One thing to check is
whether you set regType, which is None by default. -Xiangrui
On Tue, Dec 23, 2014 at 3:36 PM, Thomas Kwan wrote:
> Hi there
>
> We are on mllib 1.1.1, and trying different regularization parameters. We
> noticed that the re
Hopefully the new pipeline API addresses this problem. We have a code
example here:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala
-Xiangrui
On Mon, Dec 29, 2014 at 5:22 AM, andy petrella wrote:
> Here is w
Sean's PR may be relevant to this issue
(https://github.com/apache/spark/pull/3702). As a workaround, you can
try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643)
before sending it to BinaryClassificationMetrics. This may not work
well if the score distribution is very skewed. See
We have streaming linear regression (since v1.1) and k-means (v1.2) in
MLlib. You can check the user guide:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering
-Xiangrui
On Tue, D
How big is the dataset you want to use in prediction? -Xiangrui
On Mon, Dec 22, 2014 at 1:47 PM, boci wrote:
> Hi!
>
> I want to try out spark mllib in my spark project, but I got a little
> problem. I have training data (an external file), but the real data comes from
> another rdd. How can I do that
Did you check the indices in the LIBSVM data and the master file? Do
they match? -Xiangrui
On Sat, Dec 20, 2014 at 8:13 AM, Sameer Tilak wrote:
> Hi All,
> I use LIBSVM format to specify my input feature vector, which uses 1-based
> indices. When I run regression the output is 0-based. I have
Dear Spark users and developers,
I’m happy to announce Spark Packages (http://spark-packages.org), a
community package index to track the growing number of open source
packages and libraries that work with Apache Spark. Spark Packages
makes it easy for users to find, discuss, rate, and install pac
Hi Jay,
Please try increasing executor memory (if the available memory is more
than 2GB) and reducing numBlocks in ALS. The current implementation
stores all subproblems in memory and hence the memory requirement is
significant when k is large. You can also try reducing k and see
whether the problem
>> I'll try out setting a smaller number of item blocks. And
>> yes, I've been following the JIRA for the new ALS implementation. I'll try
>> it out when it's ready for testing.
>>
>> On Wed, Dec 3, 2014 at 4:24 AM, Xiangrui Meng wrote:
>>>
>>
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal
wrote:
>
>
> Hi,
>
>
>
> I am a newbie in the Spark and Scala world
>
>
>
> I have been trying to implement Collaborative filtering using MLlib supplied
> out of the box with Spark and Scala
>
>
>
> I have 2 problems
>
>
>
> 1. The best model was
Could you post the full stacktrace? It seems to be some recursive call
in parsing. -Xiangrui
On Tue, Dec 9, 2014 at 7:44 PM, wrote:
> Hi
>
>
>
> I am getting Stack overflow Error
>
> Exception in main java.lang.StackOverflowError
>
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.sc
Please check the number of partitions after sc.textFile. Use
sc.textFile("...", 8) to have at least 8 partitions. -Xiangrui
On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai wrote:
> You just need to use the latest master code without any configuration
> to get performance improvement from my PR.
>
> Since
Is it possible that after filtering the feature dimension changed?
This may happen if you use LIBSVM format but didn't specify the number
of features. -Xiangrui
On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak wrote:
> Hi All,
>
>
> I was able to run LinearRegressionwithSGD for a larger dataset (> 2
If you want to train offline and predict online, you can use the
current LR implementation to train a model and then apply
model.predict on the dstream. -Xiangrui
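A rough sketch of that pattern (the trainingData and featureStream values
are hypothetical placeholders):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// train offline on a static dataset
val trainingData: RDD[LabeledPoint] = ...
val model = LogisticRegressionWithSGD.train(trainingData, 100)

// score each incoming feature vector online
val featureStream: DStream[Vector] = ...
val predictions: DStream[Double] = featureStream.map(v => model.predict(v))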
On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan wrote:
> I am new to spark.
> Lets say i want to develop a machine learning model. which tr
You can use the default toString method to get the string
representation. If you want to customize it, check the indices/values
fields. -Xiangrui
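For example, a small sketch of the two options (just an illustration):

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

val v = Vectors.sparse(5, Array(1, 3), Array(0.5, 2.0))
// default string representation, e.g. (5,[1,3],[0.5,2.0])
println(v.toString)
// customized output built from the indices/values fields
val sv = v.asInstanceOf[SparseVector]
sv.indices.zip(sv.values).foreach { case (i, x) => println(s"feature $i = $x") }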
On Fri, Dec 5, 2014 at 7:32 AM, debbie wrote:
> Basic question:
>
> What is the best way to loop through one of these and print their
> components? Conve
Hi Bharath,
You can try setting a smaller number of item blocks in this case. 1200 is
definitely too large for ALS. Please try 30 or even smaller. I'm not
sure whether this could solve the problem because you have 100 items
connected with 10^8 users. There is a JIRA for this issue:
https://issues.apache.org/
The training RMSE may increase due to regularization. Squared loss
only represents part of the global loss. If you watch the sum of the
squared loss and the regularization, it should be non-increasing.
-Xiangrui
On Wed, Nov 26, 2014 at 9:53 AM, Sean Owen wrote:
> I also modified the example to tr
Sent out a PR to make the constructor public and leave a note in the
doc: https://github.com/apache/spark/pull/3459 . If you load
userFeatures and productFeatures back and want to make predictions on
individual records, it might be useful to call
partitionBy(...).cache() on the userFeatures and pro
Besides API stability concerns, models constructed directly by users
rather than returned by ALS may not work well. The userFeatures and
productFeatures both have partitioners so we can perform quick
lookups for prediction. If you save userFeatures and productFeatures
and load them back, it is
It is data-dependent, and hence needs hyper-parameter tuning, e.g.,
grid search. The first batch is certainly expensive. But after you
figure out a small range for each parameter that fits your data,
following batches should not be that expensive. There is an example
from AMPCamp:
http://ampcamp.b
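As a rough sketch of a grid search with ALS (the names, grid values, and
evaluation metric are only illustrative):

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

val training: RDD[Rating] = ...    // hypothetical
val validation: RDD[Rating] = ...  // hypothetical

def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val predictions = model.predict(data.map(r => (r.user, r.product)))
  val joined = predictions.map(p => ((p.user, p.product), p.rating))
    .join(data.map(r => ((r.user, r.product), r.rating)))
  math.sqrt(joined.values.map { case (p, a) => (p - a) * (p - a) }.mean())
}

// small grid first, then refine around the best combination
val best = (for (rank <- Seq(10, 20); lambda <- Seq(0.01, 0.1, 1.0)) yield {
  val model = ALS.train(training, rank, 20, lambda)
  ((rank, lambda), rmse(model, validation))
}).minBy(_._2)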
There is a simple example here:
https://github.com/apache/spark/blob/master/examples/src/main/python/kmeans.py
. You can take advantage of sparsity by computing the distance via
inner products:
http://spark-summit.org/2014/talk/sparse-data-support-in-mllib-2
-Xiangrui
On Tue, Nov 25, 2014 at 2:39
In 1.2, we added streaming k-means:
https://github.com/apache/spark/pull/2942 . -Xiangrui
On Mon, Nov 24, 2014 at 5:25 PM, Joanne Contact wrote:
> Thank you Tobias!
>
> On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer wrote:
>>
>> Hi,
>>
>> On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact
>> wro
Try building Spark with -Pnetlib-lgpl, which includes the JNI library
in the Spark assembly jar. This is the simplest approach. If you want
to include it as part of your project, make sure the library is inside
the assembly jar or you specify it via `--jars` with spark-submit.
-Xiangrui
On Mon, No
KMeansModel is serializable. So you can use Java serialization, try
sc.parallelize(Seq(model)).saveAsObjectFile(outputDir)
sc.objectFile[KMeansModel](outputDir).first()
We will try to address model export/import more formally in 1.3, e.g.,
https://www.github.com/apache/spark/pull/3062
-Xiangrui
The data is in LIBSVM format. So this line won't work:
values = [float(s) for s in line.split(' ')]
Please use the util function in MLUtils to load it as an RDD of LabeledPoint.
http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point
from pyspark.mllib.util import MLUtils
examp
How many features and how many partitions? You set kmeans_clusters to
1. If the feature dimension is large, it would be really
expensive. You can check the WebUI and see task failures there. The
stack trace you posted is from the driver. Btw, the total memory you
have is 64GB * 10, so you can c
PR: https://github.com/apache/spark/pull/3291 . For now, here is a workaround:
val a = sc.parallelize(1 to 10).zipWithIndex()
a.partitions // call .partitions explicitly
a.repartition(10).count()
Thanks for reporting the bug! -Xiangrui
On Sat, Nov 15, 2014 at 8:38 PM, Xiangrui Meng wrote
I think I understand where the bug is now. I created a JIRA
(https://issues.apache.org/jira/browse/SPARK-4433) and will make a PR
soon. -Xiangrui
On Sat, Nov 15, 2014 at 7:39 PM, Xiangrui Meng wrote:
> This is a bug. Could you make a JIRA? -Xiangrui
>
> On Sat, Nov 15, 2014 at 3:2
This is a bug. Could you make a JIRA? -Xiangrui
On Sat, Nov 15, 2014 at 3:27 AM, lev wrote:
> Hi,
>
> I'm having trouble using both zipWithIndex and repartition. When I use them
> both, the following action will get stuck and won't return.
> I'm using spark 1.1.0.
>
>
> Those 2 lines work as expe
If Spark is not installed on the client side, you won't be able to
deserialize the model. Instead of serializing the model object, you
may serialize the model weights array and implement predict on the
client side. -Xiangrui
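A rough illustration of that idea, assuming the trained model is a logistic
regression model (drop the sigmoid for linear regression):

// driver side: keep only plain numbers, which need no Spark classes
val weights: Array[Double] = model.weights.toArray
val intercept: Double = model.intercept
// ... ship weights and intercept to the client as JSON, CSV, etc. ...

// client side: hand-rolled predict with no Spark dependency
def predict(features: Array[Double], weights: Array[Double], intercept: Double): Double = {
  val margin = features.zip(weights).map { case (x, w) => x * w }.sum + intercept
  1.0 / (1.0 + math.exp(-margin))
}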
On Fri, Nov 14, 2014 at 2:54 PM, xiaoyan yu wrote:
> I had the same need
If you use the Kryo serializer, you need to register mutable.BitSet and Rating:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala#L102
The JIRA was marked resolved because chill resolved the problem in
v0.4.0 and we have this workaro
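A sketch of that registration (the registrator class name is arbitrary):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.serializer.KryoRegistrator
import scala.collection.mutable

class ALSRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Rating])
    kryo.register(classOf[mutable.BitSet])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)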
You need to use maven to include python files. See
https://github.com/apache/spark/pull/1223 . -Xiangrui
On Wed, Nov 12, 2014 at 4:48 PM, jamborta wrote:
> I have figured out that building the fat jar with sbt does not seem to
> included the pyspark scripts using the following command:
>
> sbt/sb
That means the "-Pnetlib-lgpl" option didn't work. Could you use sbt
to build the assembly jar and see whether the ".so" file is inside the
assembly jar? Which system and Java version are you using? -Xiangrui
On Wed, Nov 12, 2014 at 2:22 PM, jpl wrote:
> Hi Xiangrui, thank you very much for your
regParam=1.0 may penalize too much, because we use the average loss
instead of total loss. I just sent a PR to lower the default:
https://github.com/apache/spark/pull/3232
You can try LogisticRegressionWithLBFGS (and configure parameters
through its optimizer), which should converge faster than SG
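A minimal sketch of configuring it through the optimizer (training is a
hypothetical RDD[LabeledPoint]; the parameter values are only examples):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = ...  // hypothetical
val lr = new LogisticRegressionWithLBFGS()
lr.optimizer
  .setRegParam(0.01)
  .setNumIterations(100)
  .setConvergenceTol(1e-4)
val model = lr.run(training)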
Could you try "jar tf" on the assembly jar and grep
"netlib-native_system-linux-x86_64.so"? -Xiangrui
On Tue, Nov 11, 2014 at 7:11 PM, jpl wrote:
> Hi,
> I am having trouble using the BLAS libs with the MLLib functions. I am
> using org.apache.spark.mllib.clustering.KMeans (on a single machine)
I think you need a Java bean class instead of a normal class. See
example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
(switch to the java tab). -Xiangrui
On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
wrote:
> Hi,
>
>
>
> This is my Instrument java constructor.
>
>
>
Could you provide more information? For example, spark version,
dataset size (number of instances/number of features), cluster size,
error messages from both the driver and the executor. -Xiangrui
On Mon, Nov 10, 2014 at 11:28 AM, tsj wrote:
> Hello all,
>
> I have some text data that I am running
> It looks like the project openscoring.io and jpmml-evaluator are some
> amazing systems for this, but they fundamentally use PMML as the model
> representation here.
>
> I have read some JIRA tickets that Xiangrui Meng is interested in getting
> PMML implemented to export MLLi
e...
>
> Any idea how to optimize this so that we can calculate MAP statistics on
> large samples of data ?
>
>
> On Thu, Nov 6, 2014 at 4:41 PM, Xiangrui Meng wrote:
>>
>> ALS model contains RDDs. So you cannot put `model.recommendProducts`
>> inside a RDD closu
his
> is indeed a bug...
>
> Do I have to cache the models to make userFeatures.lookup(user).head work
> ?
>
>
> On Mon, Nov 3, 2014 at 9:24 PM, Xiangrui Meng wrote:
>>
>> Was "user" presented in training? We can put a che
Which Spark version did you use? Could you check the WebUI and attach
the error message on executors? -Xiangrui
On Wed, Nov 5, 2014 at 8:23 AM, rok wrote:
> yes, the training set is fine, I've verified it.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabb
You can use breeze for local sparse-sparse matrix multiplication and
then define an RDD of sub-matrices
RDD[(Int, Int, CSCMatrix[Double])] (blockRowId, blockColId, sub-matrix)
and then use join and aggregateByKey to implement this feature, which
is the same as in MapReduce.
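A rough sketch of that approach (assumes conformable block sizes and a
Breeze version whose CSCMatrix supports sparse * and +; reduceByKey is used
here for brevity in place of aggregateByKey):

import breeze.linalg.CSCMatrix
import org.apache.spark.rdd.RDD

// (blockRowId, blockColId, sub-matrix) for both inputs; hypothetical
val aBlocks: RDD[(Int, Int, CSCMatrix[Double])] = ...
val bBlocks: RDD[(Int, Int, CSCMatrix[Double])] = ...

// key A by its column-block id and B by its row-block id so matching pairs meet in the join
val aByK = aBlocks.map { case (i, k, m) => (k, (i, m)) }
val bByK = bBlocks.map { case (k, j, m) => (k, (j, m)) }

val product: RDD[((Int, Int), CSCMatrix[Double])] =
  aByK.join(bByK)
    .map { case (_, ((i, ma), (j, mb))) => ((i, j), ma * mb) }
    .reduceByKey(_ + _)  // sum the partial products per output block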
-Xiangrui
--
We are working on distributed block matrices. The main JIRA is at:
https://issues.apache.org/jira/browse/SPARK-3434
The goal is to support basic distributed linear algebra, (dense first
and then sparse).
-Xiangrui
On Wed, Nov 5, 2014 at 12:23 AM, ll wrote:
> @sowen.. i am looking for distribut
local matrix-matrix multiplication or distributed?
On Tue, Nov 4, 2014 at 11:58 PM, ll wrote:
> what is the best way to implement a sparse x sparse matrix multiplication
> with spark?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/sparse-x-sparse
The proposed new set of APIs (SPARK-3573, SPARK-3530) will address
this issue. We "carry over" extra columns with training and prediction
and then leverage Spark SQL's execution plan optimization to decide
which columns are really needed. For the current set of APIs, we can
add `predictOnValues`
Was "user" presented in training? We can put a check there and return
NaN if the user is not included in the model. -Xiangrui
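A rough sketch of such a guard on the caller side (names are illustrative):

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

def safePredict(model: MatrixFactorizationModel, user: Int, product: Int): Double = {
  // lookup returns an empty Seq when the user was not in the training data
  if (model.userFeatures.lookup(user).isEmpty) Double.NaN
  else model.predict(user, product)
}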
On Mon, Nov 3, 2014 at 5:25 PM, Debasish Das wrote:
> Hi,
>
> I am testing MatrixFactorizationModel.predict(user: Int, product: Int) but
> the code fails on userFeatures.l
Many ML algorithms are sequential because they were not designed to be
parallel. However, ML is not driven by algorithms in practice, but by
data and applications. As datasets get bigger and bigger, some
algorithms got revised to work in parallel, like SGD and matrix
factorization. MLlib tries
We recently added metrics for regression:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala
and you can use
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificatio
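Typical usage looks roughly like this (the prediction/observation pairs are
assumed to come from applying a trained model to a test set):

import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.rdd.RDD

val predictionAndObservations: RDD[(Double, Double)] = ...  // hypothetical
val metrics = new RegressionMetrics(predictionAndObservations)
println(s"RMSE = ${metrics.rootMeanSquaredError}, R2 = ${metrics.r2}")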
This operation requires two transformers:
1) Indexer, which maps string features into categorical features
2) OneHotEncoder, which flattens categorical features into binary features
We are working on the new dataset implementation, so we can easily
express those transformations. Sorry for the delay! If
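Until then, a rough manual sketch of the two steps for a single string
column (names are illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

val column: RDD[String] = ...  // hypothetical string feature column

// 1) index: map each distinct string to a categorical index
val index: Map[String, Int] = column.distinct().collect().zipWithIndex.toMap
val bcIndex = column.sparkContext.broadcast(index)

// 2) one-hot encode: flatten each categorical index into a sparse binary vector
val encoded: RDD[Vector] = column.map { s =>
  val i = bcIndex.value(s)
  Vectors.sparse(bcIndex.value.size, Array(i), Array(1.0))
}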
s.filter(r => r._1 != r._2).count.toDouble /
> testParsedData.count
> // println("Training Error = " + trainErr)
> println(Calendar.getInstance().getTime())
> }
> }
>
>
>
>
> Thanks,
> Best,
> Peng
>
> On Thu, Oct 30, 2014 at 1:23 PM, Xiangrui M
You can remove 0.5 from all non-zeros. -Xiangrui
On Wed, Oct 29, 2014 at 9:20 PM, Sameer Tilak wrote:
> Hi All,
> I have my sparse data in libsvm format.
>
> val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc,
> "mllib/data/sample_libsvm_data.txt")
>
> I am running Linear regression. Let
Did you cache the data and check the load balancing? How many
features? Which API are you using, Scala, Java, or Python? -Xiangrui
On Thu, Oct 30, 2014 at 9:13 AM, Jimmy wrote:
> Watch the app manager it should tell you what's running and taking awhile...
> My guess it's a "distinct" function on
e past here:
> https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB4QFjAA&url=https%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fincubator-spark-user%2F201410.mbox%2F%253CCAM-S9zS-%2B-MSXVcohWEhjiAEKaCccOKr_N5e0HPXcNgnxZd%3DHw%40mail.gmail.com%253E&
FYI, there is a PR to make mllib.rdd.RDDFunctions public:
https://github.com/apache/spark/pull/2907 -Xiangrui
On Tue, Oct 28, 2014 at 5:18 AM, Yanbo Liang wrote:
> Yes, it can import org.apache.spark.mllib.rdd.RDDFunctions but you can not
> use any method in this class or even new an object of th
Could you save the data before ALS and try to reproduce the problem?
You might try reducing the number of partitions and not using Kryo
serialization, just to narrow down the issue. -Xiangrui
On Mon, Oct 27, 2014 at 1:29 PM, Ilya Ganelin wrote:
> Hi Burak.
>
> I always see this error. I'm running
We are working on the pipeline features, which would make this
procedure much easier in MLlib. This is still a WIP and the main JIRA
is at:
https://issues.apache.org/jira/browse/SPARK-1856
Best,
Xiangrui
On Mon, Oct 27, 2014 at 8:56 AM, chirag lakhani
wrote:
> Hello,
>
> I have been prototyping
If your file is not very large, try
sc.wholeTextFiles("...").values.flatMap(_.split("\n").grouped(4).map(_.mkString("\n")))
-Xiangrui
On Sat, Oct 25, 2014 at 12:57 AM, Parthus wrote:
> Hi,
>
> It might be a naive question, but I still wish that somebody could help me
> handle it.
>
> I have a
Please check out the example code:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/TallSkinnySVD.scala
-Xiangrui
On Tue, Oct 21, 2014 at 5:34 AM, viola wrote:
> Hi,
>
> I am VERY new to spark and mllib and ran into a couple of problems while
> t
Yes. "where the indices are one-based and **in ascending order**". -Xiangrui
On Tue, Oct 21, 2014 at 1:10 PM, Sameer Tilak wrote:
> Hi All,
>
> I have a question regarding the ordering of indices. The document says that
> the indices indices are one-based and in ascending order. However, do the
>
Could you post the error message? -Xiangrui
On Fri, Oct 17, 2014 at 2:00 AM, poiuytrez wrote:
> Hello MLnick,
>
> Have you found a solution on how to install MLlib for Mac OS ? I have also
> some trouble to install the dependencies.
>
> Best,
> poiuytrez
>
>
>
> --
> View this message in context:
Davies is porting features from mllib.feature.* to pyspark
(https://github.com/apache/spark/pull/2819). I'm not aware of anyone
who is working on porting mllib.evaluation.*. So feel free to create a
JIRA and someone may be interested in working on it. -Xiangrui
On Fri, Oct 17, 2014 at 12:58 AM, po
computePrincipalComponents returns a local matrix X, whose columns are
the principal components (ordered), while those column vectors are in
the same feature space as the input feature vectors. -Xiangrui
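A small usage sketch (rows is a hypothetical RDD of feature vectors):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

val rows: RDD[Vector] = ...  // hypothetical
val mat = new RowMatrix(rows)
val pc = mat.computePrincipalComponents(2)  // local matrix; columns are the top-2 principal components
val projected = mat.multiply(pc)            // rows projected into the principal-component space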
On Thu, Oct 16, 2014 at 2:39 AM, al123 wrote:
> Hi,
>
> I don't think anybody answered this q
I used "k-means||", which is the default. And it took less than 1
minute to finish. 50 iterations took less than 25 minutes on a cluster
of 9 m3.2xlarge EC2 nodes. Which deploy mode did you use? Is it
yarn-client? -Xiangrui
On Tue, Oct 14, 2014 at 6:03 PM, Ray wrote:
> Hi Xiangrui,
>
> Thanks for
Just ran a test on mnist8m (8m x 784) with k = 100 and numIter = 50.
It worked fine. Ray, the error log you posted is after cluster
termination, which is not the root cause. Could you search your log
and find the real cause? On the executor tab screenshot, I saw only
200MB is used. Did you cache th
LBFGS is better. If your data is easily separable, LR might return
values very close or equal to either 0.0 or 1.0. It is rare but it may
happen. -Xiangrui
On Tue, Oct 14, 2014 at 3:18 PM, Aris wrote:
> Wow...I just tried LogisticRegressionWithLBFGS, and using clearThreshold()
> DOES IN FACT work.
What is the feature dimension? I saw you used 100 partitions. How many
cores does your cluster have? -Xiangrui
On Tue, Oct 14, 2014 at 1:51 PM, Ray wrote:
> Hi guys,
>
> An interesting thing, for the input dataset which has 1.5 million vectors,
> if set the KMeans's k_value = 100 or k_value = 50,
You cannot recover the document from the TF-IDF vector, because
HashingTF is not reversible. You can assign each document a unique ID,
and join back the result after training. HashingTF can transform
individual records:
val docs: RDD[(String, Seq[String])] = ...
val tf = new HashingTF()
val tfWithI
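A rough end-to-end sketch of keeping the IDs attached (names are illustrative):

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val docs: RDD[(String, Seq[String])] = ...  // (docId, tokens), hypothetical
val tf = new HashingTF()
// transform each record individually so the docId stays attached
val tfWithIds: RDD[(String, Vector)] = docs.mapValues(tokens => tf.transform(tokens))
// run IDF / training on tfWithIds.values, then join the results back by docId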
String has huge overhead in JVM. The official tuning guide is very
useful: http://spark.apache.org/docs/latest/tuning.html#memory-tuning
. In your case, since the input elements are all numbers, please
convert them into doubles right after the split (before groupBy) and
try to use primitive arrays
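For example, a sketch assuming comma-separated numeric lines keyed by their
first field:

val parsed = sc.textFile("data.txt").map { line =>
  val fields = line.split(',')
  // keep primitive doubles instead of Strings from here on
  (fields(0).toLong, fields.drop(1).map(_.toDouble))
}
val grouped = parsed.groupByKey()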
1. No.
2. The seed per partition is fixed. So it should generate
non-overlapping subsets.
3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
Best,
Xiangrui
On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu wrote:
> Hi, all
>
> When we use MLUtils.kfold to generate training and validation set
please re-try with --driver-memory 10g . The default is 256m. -Xiangrui
On Thu, Oct 9, 2014 at 2:33 AM, Clive Cox wrote:
> Hi,
>
> I'm trying out the DIMSUM item similarity from github master commit
> 69c3f441a9b6e942d6c08afecd59a0349d61cc7b . My matrix is:
> Num items : 8860
> Number of users :
Please use --driver-memory 2g instead of --conf
spark.driver.memory=2g. I'm not sure whether this is a bug. -Xiangrui
On Thu, Oct 9, 2014 at 9:00 AM, Jaonary Rabarisoa wrote:
> Dear all,
>
> I have a spark job with the following configuration
>
> val conf = new SparkConf()
> .setAppName("My
The proper step size partially depends on the Lipschitz constant of
the objective. You should let the machine try different combinations
of parameters and select the best. We are working with people from
AMPLab to make hyperparameter tuning easier in MLlib 1.2. For the
theory, Nesterov's book "Int
Did you test different regularization parameters and step sizes? In
the combination that works, I don't see "A + D". Did you test that
combination? Are there any linear dependency between A's columns and
D's columns? -Xiangrui
On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak wrote:
> BTW, one detail:
It really depends on the type of the computation. For example, if
vertices and edges are associated with properties and you want to
operate on (vertex-edge-vertex) triplets or use the Pregel API, GraphX
is the way to go. -Xiangrui
On Sat, Oct 4, 2014 at 9:39 PM, ll wrote:
> hi. i am working on a
It would be really helpful if you can help test the scalability of the
new ALS impl:
https://github.com/mengxr/spark-als/blob/master/src/main/scala/org/apache/spark/ml/SimpleALS.scala
. It should be faster and more scalable, but the code is messy now.
Best,
Xiangrui
On Fri, Oct 3, 2014 at 11:57
The current impl of ALS constructs least squares subproblems in
memory. So for rank 100, the total memory it requires is about 480,189
* 100^2 / 2 * 8 bytes ~ 20GB, divided by the number of blocks. For
rank 1000, this number goes up to 2TB, unfortunately. There is a JIRA
for optimizing ALS: https:/
Did you add a different version of breeze to the classpath? In Spark
1.0, we use breeze 0.7, and in Spark 1.1 we use 0.9. If the breeze
version you used is different from the one that comes with Spark, you might
see class not found. -Xiangrui
On Fri, Oct 3, 2014 at 4:22 AM, Priya Ch wrote:
> Hi Team,
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0.
-Xiangrui
On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain wrote:
> So I am trying to print the model output from MLlib however I am only
> getting things like the following:
>
> org.apache.spark.mllib.tree.model.DecisionTreeMo
The cost depends on the feature dimension, number of instances, number
of classes, and number of partitions. Do you mind sharing those
numbers? -Xiangrui
On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico wrote:
> Hi Everyone,
>
> I'm working on training mllib's Naive Bayes to classify TF/IDF vectoried
Yes, the "bigram" in that demo only has two characters, which could
separate different character sets. -Xiangrui
On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei wrote:
> The program computes hashing bi-gram frequency normalized by total number of
> bigrams then filter out zero values. hashing is a eff
ALS still needs to load and deserialize the in/out blocks (one by one)
from disk and then construct least squares subproblems. All happen in
RAM. The final model is also stored in memory. -Xiangrui
On Wed, Oct 1, 2014 at 4:36 AM, Alex T wrote:
> Hi, thanks for the reply.
>
> I added the ALS.setIn
We don't handle missing value imputation in the current version of
MLlib. In future releases, we can store feature information in the
dataset metadata, which may store the default value to replace missing
values. But no one is committed to work on this feature. For now, you
can filter out examples
You may need a cluster with more memory. The current ALS
implementation constructs all subproblems in memory. With rank=10,
that means (6.5M + 2.5M) * 10^2 / 2 * 8 bytes = 3.5GB. The ratings
need 2GB, not counting the overhead. ALS creates in/out blocks to
optimize the computation, which takes abou
The test accuracy doesn't reflect the total loss. Any point between (-1,
1) can separate the points -1 and +1 and give you 1.0 accuracy, but the
corresponding losses are different. -Xiangrui
On Sun, Sep 28, 2014 at 2:48 AM, Yanbo Liang wrote:
> Hi
>
> We have used LogisticRegression with two different
Hi Krishna,
Some planned features for MLlib 1.2 can be found via Spark JIRA:
http://bit.ly/1ywotkm , though this list is not fixed. The feature
freeze will happen by the end of Oct. Then we will cut branch-1.2 and
start QA. I don't recommend using branch-1.2 for hands-on tutorial
around Oct 29th b
We removed commons-math3 from dependencies to avoid version conflict
with hadoop-common. hadoop-common-2.3+ depends on commons-math3-3.1.1,
while breeze depends on commons-math3-3.3. 3.3 is not backward
compatible with 3.1.1. So we removed it because the breeze functions
we use do not touch commons
Please also check the load balance of the RDD on YARN. How many
partitions are you using? Does it match the number of CPU cores?
-Xiangrui
On Thu, Sep 25, 2014 at 12:28 PM, bhusted wrote:
> What is the size of your vector? Mine is set to 20. I am seeing slow results
> as well with iteration=5, # o
For the vectorizer, what's the output feature dimension and are you
creating sparse vectors or dense vectors? The model on the driver
consists of numClasses * numFeatures doubles. However, the driver
needs more memory in order to receive the task result (of the same
size) from executors. So you nee
7000x7000 is not a tall-and-skinny matrix. Storing the dense matrix
requires 784MB. The driver needs more storage for collecting result
from executors as well as making a copy for LAPACK's dgesvd. So you
need more memory. Do you need the full SVD? If not, try to use a small
k, e.g., 50. -Xiangrui
On
Your dataset is small. NaiveBayes should work under the default
settings, even in local mode. Could you try local mode first without
changing any Spark settings? Since your dataset is small, could you
save the vectorized data (RDD[LabeledPoint]) and send me a sample? I
want to take a look at the fea
Does the feature size 43839 equal the number of terms? Check the output
dimension of your feature vectorizer and reduce number of partitions
to match the number of physical cores. I saw you set
spark.storage.memoryFraction to 0.0. Maybe it is better to keep the
default. Also please confirm the driver