Re: mllib performance on mesos cluster

2014-09-22 Thread Xiangrui Meng
1) MLlib 1.1 should be faster than 1.0 in general. What's the size of your dataset? Is the RDD evenly distributed across nodes? You can check the storage tab of the Spark WebUI. 2) I don't have much experience with Mesos deployment. Someone else may be able to answer your question. -Xiangrui On

Re: Unable to load app logs for MLLib programs in history server

2014-09-18 Thread Xiangrui Meng
Could you create a JIRA for it? We can either remove special characters or encode with alphanumerics. -Xiangrui On Thu, Sep 18, 2014 at 3:50 PM, SK wrote: > Hi, > > The default log files for the Mllib examples use a rather long naming > convention that includes special characters like parentheses

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Xiangrui Meng
We don't support kernels because they don't scale well. Please check "When to use LIBLINEAR but not LIBSVM" on http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html . I like Jey's suggestion on expanding features. -Xiangrui On Thu, Sep 18, 2014 at 12:29 PM, Jey Kottalam wrote: > Hi Aris, > > A s

Re: SVD on larger than taller matrix

2014-09-18 Thread Xiangrui Meng
Did you cache `features`? Without caching it is slow because we need O(k) iterations. The storage requirement on the driver is about 2 * n * k = 2 * 3 million * 200 ~= 9GB, not considering any overhead. Computing U is also an expensive task in your case. We should use some randomized SVD implementa

Re: MLLib regression model weights

2014-09-18 Thread Xiangrui Meng
The importance should be based on some statistics, for example, the standard deviation of the feature column and the magnitude of the weight. If the columns are scaled to unit standard deviation (using StandardScaler), you can tell the importance by the absolute value of the weight. But there are o

Re: Joining multiple rowMatrix

2014-09-18 Thread Xiangrui Meng
You can use CoGroupedRDD (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.CoGroupedRDD) directly. -Xiangrui On Thu, Sep 18, 2014 at 7:09 AM, Debasish Das wrote: > Hi, > > I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j) > and I would like
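A minimal sketch of the idea, assuming each matrix is stored as entries keyed by (i, j); the public cogroup method is backed by CoGroupedRDD:

~~~
import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.x
import org.apache.spark.rdd.RDD

// Align two coordinate-keyed matrices by their (row, col) key.
def joinEntries(
    a: RDD[((Long, Long), Double)],
    b: RDD[((Long, Long), Double)])
  : RDD[((Long, Long), (Iterable[Double], Iterable[Double]))] =
  a.cogroup(b)
~~~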

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread Xiangrui Meng
Hi Jatin, HashingTF should be able to solve the memory problem if you use a small feature dimension. Please do not cache the input documents; cache the output from HashingTF and IDF instead. We don't have a label indexer yet, so you need a label-to-index map to map it to double val
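A minimal sketch of the suggested pipeline, assuming Spark 1.1 and an active SparkContext `sc`:

~~~
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documents: RDD[Seq[String]] =
  sc.parallelize(Seq(Seq("spark", "mllib"), Seq("spark", "streaming")))
// A small feature dimension keeps the hashed vectors compact.
val tf: RDD[Vector] = new HashingTF(1 << 18).transform(documents)
tf.cache() // cache the TF output, not the raw documents
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)
~~~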

Re: MLLib sparse vector

2014-09-15 Thread Xiangrui Meng
Or you can use the factory method `Vectors.sparse`: val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0))) where numProducts should be the largest product id plus one. Best, Xiangrui On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore wrote: > Hi Sameer, > > MLLib uses Breeze’s vector fo
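Filled out as a tiny runnable example (the ids here are made up):

~~~
import org.apache.spark.mllib.linalg.Vectors

val productIds = Seq(0, 3, 7) // hypothetical ids a user interacted with
val numProducts = 8           // largest product id plus one
val sv = Vectors.sparse(numProducts, productIds.map(i => (i, 1.0)))
~~~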

Re: Accuracy hit in classification with Spark

2014-09-15 Thread Xiangrui Meng
Thanks for the update! -Xiangrui On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet wrote: > Hi, > > I have been able to get the same accuracy with MLlib as Mahout's. The > pre-processing phase of Mahout was the reason behind the accuracy mismatch. > After studying and applying the same logic in my co

Re: Define the name of the outputs with Java-Spark.

2014-09-15 Thread Xiangrui Meng
Spark doesn't support MultipleOutput at this time. You can cache the parent RDD. Then create RDDs from it and save them separately. -Xiangrui On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz wrote: > > I would like to define the names of my output in Spark, I have a process > which write many fai

Re: Efficient way to sum multiple columns

2014-09-15 Thread Xiangrui Meng
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui On Mon, Sep 15, 2014 at 1:00 PM, jamborta wrote: > Hi all, > > I have an RDD that contains around 50 columns. I need to sum each column, > which I am doing by running it through a for loop, creating an array and > run
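A minimal sketch, assuming an active SparkContext `sc`: colStats returns per-column means and the row count, from which the column sums follow in one pass.

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))
val summary = Statistics.colStats(rows)
// sum of column j = mean(j) * count
val colSums = summary.mean.toArray.map(_ * summary.count)
~~~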

Re: sc.textFile problem due to newlines within a CSV record

2014-09-12 Thread Xiangrui Meng
I wrote an input format for Redshift tables unloaded with the UNLOAD command and the ESCAPE option: https://github.com/mengxr/redshift-input-format , which can recognize multi-line records. Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and the delimiter character. You can apply the same escaping b

Re: Accuracy hit in classification with Spark

2014-09-09 Thread Xiangrui Meng
If you are using the Mahout's Multinomial Naive Bayes, it should be the same as MLlib's. I tried MLlib with news20.scale downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html and the test accuracy is 82.4%. -Xiangrui On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet wrot

Re: Solving Systems of Linear Equations Using Spark?

2014-09-08 Thread Xiangrui Meng
or ? You want a >> distributed lp / socp solver where each worker solves a partition of the >> constraint and the full objective...and you want to converge to a global >> solution using consensus ? Or your problem has more structure to partition >> the problem cleanly and don't

Re: A problem for running MLLIB in amazon clound

2014-09-08 Thread Xiangrui Meng
Could you attach the driver log? -Xiangrui On Mon, Sep 8, 2014 at 7:23 AM, Hui Li wrote: > I am running a very simple example using the SVMWithSGD on Amazon EMR. I > haven't got any result after one hour long. > > My instance-type is: m3.large > instance-count is: 3 > Dataset is the data pr

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-08 Thread Xiangrui Meng
When you submit the job to yarn with spark-submit, set --conf spark.yarn.user.classpath.first=true . On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza wrote: > I don't understand what you mean. Can you be more specific? > > > > From: Victor Tso-Guillen > Sent: Sat

Re: prepending jars to the driver class path for spark-submit on YARN

2014-09-07 Thread Xiangrui Meng
There is an undocumented configuration to put users jars in front of spark jar. But I'm not very certain that it works as expected (and this is why it is undocumented). Please try turning on spark.yarn.user.classpath.first . -Xiangrui On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen wrote: > I

Re: Solving Systems of Linear Equations Using Spark?

2014-09-07 Thread Xiangrui Meng
You can try LinearRegression with sparse input. It converges to the least-squares solution if the linear system is over-determined, while the convergence rate depends on the condition number. Applying standard scaling is a popular heuristic to reduce the condition number. If you are interested in spars

Re: Is there any way to control the parallelism in LogisticRegression

2014-09-03 Thread Xiangrui Meng
ch also reduce the GC time. That is exactly >> what we observed in our job. >> >> 2. This approach will not hit the 2G limitation, because it not change >> the partition size. >> >> And I also think that, Spark may change this default value, or at least >>

Re: [MLib] How do you normalize features?

2014-09-03 Thread Xiangrui Meng
Maybe copy the implementation of StandardScaler from 1.1 and use it in v1.0.x. -Xiangrui On Wed, Sep 3, 2014 at 5:10 PM, Yana Kadiyska wrote: > It seems like the next release will add a nice > org.apache.spark.mllib.feature package but what is the recommended way to > normalize features in the cu

Re: Accessing neighboring elements in an RDD

2014-09-03 Thread Xiangrui Meng
There is a sliding method implemented in MLlib (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala), which is used in computing Area Under Curve: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluat
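A minimal sketch of the sliding helper (a developer API, so it may change), assuming an active SparkContext `sc`:

~~~
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(1 to 10, 2)
// sliding(2) yields overlapping windows, exposing each element's right neighbor
val neighbors = rdd.sliding(2).map { case Array(a, b) => (a, b) }
~~~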

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-03 Thread Xiangrui Meng
I think they are the same. If you have hub (https://hub.github.com/) installed, you can run hub checkout https://github.com/apache/spark/pull/216 and then `sbt/sbt assembly` -Xiangrui On Wed, Sep 3, 2014 at 12:03 AM, filipus wrote: > howto install? just clone by git clone > https://github.com/

Re: New features (Discretization) for v1.x in xiangrui.pdf

2014-09-02 Thread Xiangrui Meng
We have a pending PR (https://github.com/apache/spark/pull/216) for discretization but it has performance issues. We will try to spend more time to improve it. -Xiangrui On Tue, Sep 2, 2014 at 2:56 AM, filipus wrote: > i guess i found it > > https://github.com/LIDIAgroup/SparkFeatureSelection > >

Re: MLLib decision tree: Weights

2014-09-02 Thread Xiangrui Meng
This is not supported in MLlib. Hopefully, we will add support for weighted examples in v1.2. If you want to train weighted instances with the current tree implementation, please try importance sampling first to adjust the weights. For instance, an example with weight 0.3 is sampled with probabilit

Re: minPartitions ignored for bz2?

2014-08-27 Thread Xiangrui Meng
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files before 1.2. But due to a bug (https://issues.apache.org/jira/browse/HADOOP-10614), you should try hadoop-2.5.0. -Xiangrui On Wed, Aug 27, 2014 at 2:49 PM, jerryye wrote: > Hi, > I'm running on the master br

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
ote: > So, I guess zipWithUniqueId will be similar. > > Is there a way to get unique index? > > > On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng wrote: >> >> No. The indices start at 0 for every RDD. -Xiangrui >> >> On Wed, Aug 27, 2014 at 2:37 PM, Soumitr

Re: Spark Streaming: DStream - zipWithIndex

2014-08-27 Thread Xiangrui Meng
No. The indices start at 0 for every RDD. -Xiangrui On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar wrote: > Hello, > > If I do: > > DStream transform { > rdd.zipWithIndex.map { > > Is the index guaranteed to be unique across all RDDs here? > > } > } > > Thanks, > -Soumitra.
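One possible sketch for globally unique indices, keeping a running offset on the driver (illustrative only; the counter is not fault-tolerant):

~~~
import scala.reflect.ClassTag
import org.apache.spark.streaming.dstream.DStream

// `dstream` is assumed to come from an existing StreamingContext.
def withGlobalIndex[T: ClassTag](dstream: DStream[T]): DStream[(T, Long)] = {
  var offset = 0L // driver-side running count across batches
  dstream.transform { rdd =>
    val start = offset
    offset += rdd.count() // the transform closure runs on the driver once per batch
    rdd.zipWithIndex().map { case (v, i) => (v, start + i) }
  }
}
~~~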

Re: CUDA in spark, especially in MLlib?

2014-08-27 Thread Xiangrui Meng
Hi Wei, Please keep us posted about the performance result you get. This would be very helpful. Best, Xiangrui On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan wrote: > Thank you all. Actually I was looking at JCUDA. Function wise this may be a > perfect solution to offload computation to GPU. Will se

Re: Only master is really busy at KMeans training

2014-08-26 Thread Xiangrui Meng
How many partitions now? Btw, which Spark version are you using? I checked your code and I don't understand why you want to broadcast vectors2, which is an RDD. var vectors2 = vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER) var broadcastVector = sc.bro

Re: What about implementing various hypothesis test for LogisticRegression in MLlib

2014-08-24 Thread Xiangrui Meng
n > using logistic regression to build score cards. > > Xiaobo Gu > > > -- Original -- > From: "Xiangrui Meng";; > Send time: Wednesday, Aug 20, 2014 2:18 PM > To: ""; > Cc: "user@spark.apache.org"; > Sub

Re: What about implementing various hypothesis test for Logistic Regression in MLlib

2014-08-19 Thread Xiangrui Meng
We implemented chi-squared tests in v1.1: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166 and we will add more after v1.1. Feedback on which tests should come first would be greatly appreciated. -Xiangrui On Tue, Aug 19, 2014 at 9:
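A minimal usage sketch of the v1.1 API:

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Pearson goodness-of-fit test against a uniform expected distribution
val observed = Vectors.dense(4.0, 6.0, 5.0)
val result = Statistics.chiSqTest(observed)
println(result.pValue)
~~~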

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
Bayes? > > Phuoc Do > > > On Tue, Aug 19, 2014 at 12:51 AM, Xiangrui Meng wrote: >> >> What is the ratio of examples labeled `s` to those labeled `b`? Also, >> Naive Bayes doesn't work on negative feature values. It assumes term >> frequencies as the i

Re: Only master is really busy at KMeans training

2014-08-19 Thread Xiangrui Meng
There are only 5 worker nodes. So please try to reduce the number of partitions to the number of available CPU cores. 1000 partitions are too many, because the driver needs to collect the task result from each partition. -Xiangrui On Tue, Aug 19, 2014 at 1:41 PM, durin wrote: > When trying to us

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-08-19 Thread Xiangrui Meng
On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani > wrote: >> >> Thanks a lot Xiangrui. This will help. >> >> >> On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng wrote: >>> >>> Hi Rahul, >>> >>> We plan to add online model updates with

Re: Decision tree: categorical variables

2014-08-19 Thread Xiangrui Meng
The categorical features must be encoded into indices starting from 0: 0, 1, ..., numCategories - 1. Then you can provide the categoricalFeaturesInfo map to specify which columns contain categorical features and the number of categories in each. Joseph is updating the user guide. But if you want to
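A minimal sketch, assuming the v1.1 trainClassifier API and an active SparkContext `sc`; here feature 0 is categorical with 3 categories encoded as 0.0, 1.0, 2.0:

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.5)),
  LabeledPoint(1.0, Vectors.dense(2.0, 0.3))))
// column 0 -> 3 categories; unlisted columns are treated as continuous
val categoricalFeaturesInfo = Map(0 -> 3)
val model = DecisionTree.trainClassifier(
  data, 2, categoricalFeaturesInfo, "gini", 5, 32)
~~~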

Re: Naive Bayes

2014-08-19 Thread Xiangrui Meng
What is the ratio of examples labeled `s` to those labeled `b`? Also, Naive Bayes doesn't work on negative feature values. It assumes term frequencies as the input. We should throw an exception on negative feature values. -Xiangrui On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do wrote: > I'm trying Na

Re: ALS checkpoint performance

2014-08-15 Thread Xiangrui Meng
Guoqiang reported some results in his PRs https://github.com/apache/spark/pull/828 and https://github.com/apache/spark/pull/929 . But this is really problem-dependent. -Xiangrui On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das wrote: > Hi, > > Are there any experiments detailing the performance hit

Re: Spark Akka/actor failures.

2014-08-14 Thread Xiangrui Meng
Could you try to map it to row-majored first? Your approach may generate multiple copies of the data. The code should look like this: ~~~ val rows = rdd.map { case (j, values) => values.view.zipWithIndex.map { case (v, i) => (i, (j, v)) } }.groupByKey().map { case (i, entries) => Vectors
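The preview cuts the snippet off; a plausible completion (using flatMap so that groupByKey applies, and assuming the number of columns is known) might look like:

~~~
import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.x
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// rdd holds (columnIndex, columnValues); regroup by row index i.
def toRowMajor(rdd: RDD[(Int, Array[Double])], numCols: Int): RDD[(Int, Vector)] =
  rdd.flatMap { case (j, values) =>
    values.view.zipWithIndex.map { case (v, i) => (i, (j, v)) }
  }.groupByKey().map { case (i, entries) =>
    val arr = new Array[Double](numCols)
    entries.foreach { case (j, v) => arr(j) = v }
    (i, Vectors.dense(arr))
  }
~~~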

Re: training recsys model

2014-08-14 Thread Xiangrui Meng
. > > > On Wed, Aug 13, 2014 at 1:26 PM, Xiangrui Meng wrote: >> >> You can define an evaluation metric first and then use a grid search >> to find the best set of training parameters. Ampcamp has a tutorial >> showing how to do this for ALS: >> >> http://

Re: training recsys model

2014-08-12 Thread Xiangrui Meng
You can define an evaluation metric first and then use a grid search to find the best set of training parameters. Ampcamp has a tutorial showing how to do this for ALS: http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html -Xiangrui On Tue, Aug 12, 2014 at 8:01 PM,

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
will combine multiple partitions into a large partition if I cache it, so > same issues as #1? > > For coalesce, could you share some best practice how to set the right number > of partitions to avoid locality problem? > > Thanks! > > > > On Tue, Aug 12, 2014 at 3:51 PM,

Re: Using very large files for KMeans training -- cluster centers size?

2014-08-12 Thread Xiangrui Meng
What did you set for driver memory? The default value is 256m or 512m, which is too small. Try to set "--driver-memory 10g" with spark-submit or spark-shell and see whether it works or not. -Xiangrui On Mon, Aug 11, 2014 at 6:26 PM, durin wrote: > I'm trying to apply KMeans training to some text

Re: How to save mllib model to hdfs and reload it

2014-08-12 Thread Xiangrui Meng
For linear models, the constructors are now public. You can save the weights to HDFS, then load the weights back and use the constructor to create the model. -Xiangrui On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu wrote: > hello: > > I want to know,if I use history data to training model and I want
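A sketch of one way to do it, assuming a previously trained `model` and an active SparkContext `sc` (the one-line text format here is made up):

~~~
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionModel

// Save: a single line with the intercept followed by the weights.
val line = (model.intercept +: model.weights.toArray).mkString(",")
sc.parallelize(Seq(line), 1).saveAsTextFile("hdfs:///models/lr")

// Load: parse the line and rebuild via the now-public constructor.
val parts = sc.textFile("hdfs:///models/lr").first().split(",").map(_.toDouble)
val restored = new LinearRegressionModel(Vectors.dense(parts.tail), parts.head)
~~~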

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-12 Thread Xiangrui Meng
Assuming that your data is very sparse, I would recommend RDD.repartition. But if it is not the case and you don't want to shuffle the data, you can try a CombineInputFormat and then parse the lines into labeled points. Coalesce may cause locality problems if you didn't use the right number of part

Re: Partitioning a libsvm format file

2014-08-10 Thread Xiangrui Meng
If the file is big enough, you can try MLUtils.loadLibSVMFile with a minPartitions argument. This doesn't shuffle data but it might not give you the exact number of partitions. If you want to have the exact number, use RDD.repartition, which requires data shuffling. -Xiangrui On Sun, Aug 10, 2014
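A minimal sketch, assuming the 1.x signature with numFeatures and minPartitions (pass -1 to infer the feature count) and an active SparkContext `sc`:

~~~
import org.apache.spark.mllib.util.MLUtils

// minPartitions is a lower bound on the number of splits, not an exact count.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm", -1, 64)
// For an exact partition count, shuffle:
val exact = data.repartition(64)
~~~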

Re: Does Spark 1.0.1 stil collect results in serial???

2014-08-08 Thread Xiangrui Meng
For the reduce/aggregate question, the driver collects results sequentially. We now use tree aggregation in MLlib to reduce the driver's load: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89 It is faster than aggregate when there are many
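A minimal sketch of the tree aggregation helper (a developer API in mllib), assuming an active SparkContext `sc`:

~~~
import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(1 to 1000000, 100)
// Partial sums are combined in a multi-level tree rather than all on the driver.
val total = rdd.treeAggregate(0L)(
  (acc, x) => acc + x,
  (a, b) => a + b)
~~~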

Re: scopt.OptionParser

2014-08-08 Thread Xiangrui Meng
Thanks for posting the solution! You can also append `% "provided"` to the `spark-mllib` dependency line and remove `spark-core` (because spark-mllib already depends on spark-core) to make the assembly jar smaller. -Xiangrui On Fri, Aug 8, 2014 at 10:05 AM, SK wrote: > i was using sbt package whe

Re: Where do my partitions go?

2014-08-08 Thread Xiangrui Meng
They are two different RDDs. Spark doesn't guarantee that the first partition of RDD1 and the first partition of RDD2 will stay in the same worker node. If that is the case, if you have 1000 single-partition RDDs the first worker will have very heavy load. -Xiangrui On Thu, Aug 7, 2014 at 2:20 AM,

Re: Spark: Could not load native gpl library

2014-08-07 Thread Xiangrui Meng
Is the GPL library only available on the driver node? If that is the case, you need to add them to `--jars` option of spark-submit. -Xiangrui On Thu, Aug 7, 2014 at 6:59 PM, Jikai Lei wrote: > I had the following error when trying to run a very simple spark job (which > uses logistic regression w

Re: KMeans Input Format

2014-08-07 Thread Xiangrui Meng
Besides durin's suggestion, please also confirm driver and executor memory in the WebUI, since they are small according to the log: 14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 34.6 KB, free 303.3 MB) -Xiangrui -

Re: questions about MLLib recommendation models

2014-08-07 Thread Xiangrui Meng
ratings.map{ case Rating(u,m,r) => { val pred = model.predict(u, m) (r - pred)*(r - pred) } }.mean() The code doesn't work because the userFeatures and productFeatures stored in the model are RDDs. You tried to serialize them into the task closure, and execute `model.predict` on an execu
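The fix is to score on the cluster with the RDD variant of predict; a sketch:

~~~
import org.apache.spark.SparkContext._ // pair-RDD functions in Spark 1.x
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

def mse(model: MatrixFactorizationModel, ratings: RDD[Rating]): Double = {
  val preds = model.predict(ratings.map(r => (r.user, r.product)))
    .map(p => ((p.user, p.product), p.rating))
  val truth = ratings.map(r => ((r.user, r.product), r.rating))
  truth.join(preds).values.map { case (r, p) => (r - p) * (r - p) }.mean()
}
~~~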

Re: Low Performance of Shark over Spark.

2014-08-07 Thread Xiangrui Meng
Did you cache the table? There are a couple of ways of caching a table in Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide On Thu, Aug 7, 2014 at 6:51 AM, wrote: > Dear all, > > I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0. > > 6 worker nodes each with memory 96GB

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Then this may be a bug. Do you mind sharing the dataset that we can use to reproduce the problem? -Xiangrui On Thu, Aug 7, 2014 at 1:20 AM, SK wrote: > Spark 1.0.1 > > thanks > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameter

Re: Naive Bayes parameters

2014-08-07 Thread Xiangrui Meng
It is used in data loading: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76 On Thu, Aug 7, 2014 at 12:47 AM, SK wrote: > I followed the example in > examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveB

Re: Regularization parameters

2014-08-07 Thread Xiangrui Meng
Which Spark version are you using? -Xiangrui On Thu, Aug 7, 2014 at 1:12 AM, SK wrote: > Hi, > > I am following the code in > examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala > For setting the parameters and parsing the command line options, I am just > reusing t

Re: fail to run LBFS in 5G KDD data in spark 1.0.1?

2014-08-06 Thread Xiangrui Meng
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver? I think the problem is with the feature dimension. KDD data has more than 20M features and in v1.0.1, the driver collects the partial gradients one by one, sums them up, does the update, and then sends the new weights back t

Re: about spark and using machine learning model

2014-08-04 Thread Xiangrui Meng
Some extra work is needed to close the loop. One related example is streaming linear regression added by Jeremy very recently: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala You can use a model trained offline to

Re: MLLib: implementing ALS with distributed matrix

2014-08-03 Thread Xiangrui Meng
To be precise, the optimization is not `get all products that are related to this user` but `get all products that are related to users inside this block`. So a product factor won't be sent to the same block more than once. We considered using GraphX to implement ALS, which is much easier to unders

Re: why a machine learning application run slowly on the spark cluster

2014-07-30 Thread Xiangrui Meng
found the total task reduce from 200 t0 64 after first stage just like this: > > > But I don't know if this is reasonable. > ​ > > > On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng wrote: > >> After you load the data in, call `.repartition(number of >> exe

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
nly distributed to the executors. > > The input data is on the HDFS, not on the spark clusters. How can I make the > data distributed to the excutors? > > > On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng wrote: >> >> The weight vector is usually dense and if you have m

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
The weight vector is usually dense and if you have many partitions, the driver may slow down. You can also take a look at the driver memory inside the Executor tab in WebUI. Another setting to check is the HDFS block size and whether the input data is evenly distributed to the executors. Are the ha

Re: why a machine learning application run slowly on the spark cluster

2014-07-29 Thread Xiangrui Meng
Could you share more details about the dataset and the algorithm? For example, if the dataset has 10M+ features, it may be slow for the driver to collect the weights from executors (just a blind guess). -Xiangrui On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim wrote: > Hi, all > > [Setting] > > Input

Re: KMeans: expensiveness of large vectors

2014-07-29 Thread Xiangrui Meng
Before torrent, HTTP was the default way of broadcasting. The driver holds the data and the executors request it via HTTP, making the driver the bottleneck when the data is large. -Xiangrui On Tue, Jul 29, 2014 at 10:32 AM, durin wrote: > Development is really rapid here, that's a great thing

Re: SPARK OWLQN Exception: Iteration Stage is so slow

2014-07-29 Thread Xiangrui Meng
Do you mind sharing more details, for example, specs of nodes and data size? -Xiangrui 2014-07-29 2:51 GMT-07:00 John Wu : > Hi all, > > > > There is a problem we can’t resolve. We implement the OWLQN algorithm in > parallel with SPARK, > > We don’t know why It is very slow in every iteration stag

Re: Reading hdf5 formats with pyspark

2014-07-28 Thread Xiangrui Meng
That looks good to me since there is no Hadoop InputFormat for HDF5. But remember to specify the number of partitions in sc.parallelize to use all the nodes. You can change `process` to `read` which yields records one-by-one. Then sc.parallelize(files, numPartitions).flatMap(read) returns an RDD of

Re: evaluating classification accuracy

2014-07-28 Thread Xiangrui Meng
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and master. If you don't want to switch to 1.0.1 or master, try to cache and count test first. -Xiangrui On Mon, Jul 28, 2014 at 6:07 PM, SK wrote: > Hi, > > In order to evaluate the ML classification accuracy, I am zipping up the > p

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
Great! Thanks for testing the new features! -Xiangrui On Mon, Jul 28, 2014 at 8:58 PM, durin wrote: > Hi Xiangrui, > > using the current master meant a huge improvement for my task. Something > that did not even finish before (training with 120G of dense data) now > completes in a reasonable time

Re: KMeans: expensiveness of large vectors

2014-07-28 Thread Xiangrui Meng
1. I meant in the n (1k) by m (10k) case, we need to broadcast k centers and hence the total size is m * k. In 1.0, the driver needs to send the current centers to each partition one by one. In the current master, we use torrent to broadcast the centers to workers, which should be much faster. 2.

Re: Kmeans: set initial centers explicitly

2014-07-27 Thread Xiangrui Meng
I think this is nice to have. Feel free to create a JIRA for it and it would be great if you can send a PR. Thanks! -Xiangrui On Thu, Jul 24, 2014 at 12:39 PM, SK wrote: > > Hi, > > The mllib.clustering.kmeans implementation supports a random or parallel > initialization mode to pick the initial

Re: KMeans: expensiveness of large vectors

2014-07-27 Thread Xiangrui Meng
If you have an m-by-n dataset and train a k-means model with k clusters, the cost of each iteration is O(m * n * k) (assuming dense data). Since m * n * k == n * m * k, ideally you would expect the same run time. However, 1. Communication. We need to broadcast the current centers (m * k), do the computati

Re: Questions about disk IOs

2014-07-25 Thread Xiangrui Meng
17:51:48 INFO executor.Executor: Finished task ID 742 > <—— I have shutdown the App > 14/07/25 18:16:36 INFO executor.CoarseGrainedExecutorBackend: Driver > commanded a shutdown > > On Jul 2, 2014, at 0:08, Xiangrui Meng wrote: > >> Try to reduce number of partitions

Re: NMF implementaion is Spark

2014-07-25 Thread Xiangrui Meng
It is ALS with setNonnegative. -Xiangrui On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia wrote: > Hi, > > Is there an implementation for Nonnegative Matrix Factorization in Spark? I > understand that MLlib comes with matrix factorization, but it does not seem > to cover the nonnegative case.
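A minimal sketch, assuming Spark 1.1 (where setNonnegative is available) and an active SparkContext `sc`:

~~~
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setNonnegative(true) // constrain both factor matrices to be nonnegative
  .run(ratings)
~~~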

Announcing Spark 0.9.2

2014-07-23 Thread Xiangrui Meng
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is a maintenance release with bug fixes across several areas of Spark, including Spark Core, PySpark, MLlib, Streaming, and GraphX. We recommend all 0.9.x users to upgrade to this stable release. Contributions to this release came f

Re: spark github source build error

2014-07-23 Thread Xiangrui Meng
try `sbt/sbt clean` first? -Xiangrui On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma wrote: > I am trying to build spark after cloning from github repo: > > I am executing: > ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly > > I am getting following error: > [warn] ^

Re: akka disassociated on GC

2014-07-23 Thread Xiangrui Meng
ast patch, the evaluation has been > processed successfully. > > I expect that these patches are merged in the next major release (v1.1?). > Without them, it would be hard to use mllib for a large dataset. > > Thanks, > Makoto > > > (2014/07/16 15:05), Xiangrui Meng wrot

Re: error from DecisonTree Training:

2014-07-21 Thread Xiangrui Meng
This is a known issue: https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working on it. -Xiangrui On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang wrote: > So this is a bug unsolved (for java) yet? > > > > From: Jack Yang [mailto:j...@uow.edu.au] > Sent: Friday, 18 July 2014 4:52 PM > To: us

Re: Launching with m3.2xlarge instances: /mnt and /mnt2 mounted on 7gb drive

2014-07-21 Thread Xiangrui Meng
You can also try a different region. I tested us-west-2 yesterday, and it worked well. -Xiangrui On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia wrote: > Actually the script in the master branch is also broken (it's pointing to an > older AMI). Try 1.0.1 for launching clusters. > > On Jul 20, 2014

Re: LabeledPoint with weight

2014-07-21 Thread Xiangrui Meng
This is a useful feature but it may be hard to have it in v1.1 due to limited time. Hopefully, we can support it in v1.2. -Xiangrui On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen wrote: > It seems MLlib right now doesn't support weighted training, training samples > have equal importance. Weight

Re: Large Task Size?

2014-07-20 Thread Xiangrui Meng
It was because of the latest change to task serialization: https://github.com/apache/spark/commit/1efb3698b6cf39a80683b37124d2736ebf3c9d9a The task size is no longer limited by akka.frameSize but we show warning messages if the task size is above 100KB. Please check the objects referenced in the t

Re: Python: saving/reloading RDD

2014-07-18 Thread Xiangrui Meng
You can save RDDs to text files using RDD.saveAsTextFile and load it back using sc.textFile. But make sure the record to string conversion is correctly implemented if the type is not primitive and you have the parser to load them back. -Xiangrui > On Jul 18, 2014, at 8:39 AM, Roch Denis wrote:

Re: Large scale ranked recommendation

2014-07-18 Thread Xiangrui Meng
Nick's suggestion is a good approach for your data. The item factors to broadcast should be a few MBs. -Xiangrui > On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux wrote: > > And you might want to apply clustering before. It is likely that every user > and every item are not unique. > > Bertran

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
chine. It did bring down the time taken to less than 200 > seconds from over 700 seconds. > > I am not sure how to repartition the data to match the CPU cores. How do I > do it? > > Thank you. > > Ravi > > > On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng wrote: &

Re: Speeding up K-Means Clustering

2014-07-17 Thread Xiangrui Meng
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g and repartition your data to match the number of CPU cores so that the data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to store the data. Make sure there is enough memory for caching. -Xiangrui On Thu, Jul 17, 2014 at 1

Re: MLLib - Regularized logistic regression in python

2014-07-17 Thread Xiangrui Meng
1) This is a miss, unfortunately ... We will add support for regularization and intercept in the coming v1.1. (JIRA: https://issues.apache.org/jira/browse/SPARK-2550) 2) It has overflow problems in Python but not in Scala. We can stabilize the computation by ensuring exp only takes a negative value
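For reference, a numerically stable logistic function in which exp never sees a positive argument:

~~~
def sigmoid(x: Double): Double =
  if (x >= 0) {
    1.0 / (1.0 + math.exp(-x)) // exp(-x) <= 1, no overflow
  } else {
    val e = math.exp(x) // x < 0, so e < 1
    e / (1.0 + e)
  }
~~~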

Re: Error: No space left on device

2014-07-17 Thread Xiangrui Meng
hat timing sound about > right, or does it point to a poor configuration? The same script with > MovieLens 1M runs fine in about 30-40s with the same settings. (In both > cases I'm training on 70% of the data.) > > Thanks for your help! > Chris > > > On Wed, Jul 16, 20

Re: Kmeans

2014-07-17 Thread Xiangrui Meng
k-means|| > > > Best Regards > > ... > > Amin Mohebbi > > PhD candidate in Software Engineering > at university of Malaysia > > H/P : +60 18 2040 017 > > > > E-Mail : tp025...@ex.apiit.edu.my > > amin_...@me.com > > >

Re: Kmeans

2014-07-16 Thread Xiangrui Meng
kmeans.py contains a naive implementation of k-means in Python, serving as an example of how to use PySpark. Please use MLlib's implementation in practice. There is a JIRA for making it clear: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui On Wed, Jul 16, 2014 at 8:16 PM, amin mohebbi
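For reference, the MLlib call looks like this (Scala sketch, assuming an active SparkContext `sc`):

~~~
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0)))
val model = KMeans.train(points, 2, 10) // k = 2, maxIterations = 10
~~~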

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
>> tmpfs 1917974 1 1917973 1% /dev/shm >> /dev/xvdv 524288000 30 524287970 1% /vol >> >> I have reproduced the error while using the MovieLens 10M data set on a >> newly created cluster. >> >> Thanks for the help. >>

Re: Error: No space left on device

2014-07-16 Thread Xiangrui Meng
}' | xargs | sed >> 's/ /,/g') >> >> Then adding -Dspark.local.dir=$SPACE or simply >> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver >> application >> >> Chris >> >> On Jul 15, 2014, at 11:39 PM, Xiangrui

Re: Error: No space left on device

2014-07-15 Thread Xiangrui Meng
Check the number of inodes (df -i). The assembly build may create many small files. -Xiangrui On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois wrote: > Hi all, > > I am encountering the following error: > > INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space > left on devic

Re: Error when testing with large sparse svm

2014-07-15 Thread Xiangrui Meng
Then it may be a new issue. Do you mind creating a JIRA to track this issue? It would be great if you can help locate the line in BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui On Tue, Jul 15, 2014 at 10:56 PM, crater wrote: > I don't really have "my code", I was just runn

Re: akka disassociated on GC

2014-07-15 Thread Xiangrui Meng
ul 15, 2014 at 10:48 PM, Makoto Yui wrote: > Hello, > > (2014/06/19 23:43), Xiangrui Meng wrote: >>> >>> The execution was slow for more large KDD cup 2012, Track 2 dataset >>> (235M+ records of 16.7M+ (2^24) sparse features in about 33.6GB) due to the >>

Re: Error when testing with large sparse svm

2014-07-15 Thread Xiangrui Meng
crater, was the error message the same as what you posted before: 14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote Akka client disassociated 14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0) 14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on nod

Re: KMeansModel Construtor error

2014-07-14 Thread Xiangrui Meng
I don't think MLlib supports model serialization/deserialization. You got the error because the constructor is private. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2488 and we'll try to make sure it is implemented in v1.1. For now, you can modify the KMeansModel and remove p

Re: ALS on EC2

2014-07-14 Thread Xiangrui Meng
Could you share the code of RecommendationALS and the complete spark-submit command line options? Thanks! -Xiangrui On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S wrote: > Using properties file: null > Main class: > RecommendationALS > Arguments: > _train.csv > _validation.csv > _test.csv > Syste

Re: Error when testing with large sparse svm

2014-07-14 Thread Xiangrui Meng
Is it on a standalone server? There are several settings worth checking: 1) the number of partitions, which should match the number of cores; 2) driver memory (you can see it in the executor tab of the Spark WebUI and set it with "--driver-memory 10g"); 3) the version of Spark you are running. Best

Re: Error when testing with large sparse svm

2014-07-14 Thread Xiangrui Meng
You need to set a larger `spark.akka.frameSize`, e.g., 128, for the serialized weight vector. There is a JIRA about switching automatically between sending through akka or broadcast: https://issues.apache.org/jira/browse/SPARK-2361 . -Xiangrui On Mon, Jul 14, 2014 at 12:15 AM, crater wrote: > Hi,

Re: mapPartitionsWithIndex

2014-07-14 Thread Xiangrui Meng
You should return an iterator in mapPartitionsWithIndex. This is from the programming guide (http://spark.apache.org/docs/latest/programming-guide.html): mapPartitionsWithIndex(func): Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so
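A minimal sketch, assuming an active SparkContext `sc`:

~~~
val rdd = sc.parallelize(1 to 10, 3)
// The function must return an Iterator; here each element is tagged
// with the id of the partition it lives in.
val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(x => (idx, x))
}
~~~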

Re: Putting block rdd failed when running example svm on large data

2014-07-12 Thread Xiangrui Meng
By default, Spark uses half of the memory for caching RDDs (configurable by spark.storage.memoryFraction). That is about 25 * 8 / 2 = 100G for your setup, which is smaller than the 202G data size. So you don't have enough memory to fully cache the RDD. You can confirm it in the storage tab of the W

Re: ML classifier and data format for dataset with variable number of features

2014-07-11 Thread Xiangrui Meng
You can load the dataset as an RDD of JSON objects and use a flatMap to extract feature vectors at the object level. Then you can filter the training examples you want for binary classification. If you want to try multiclass, check out DB's PR at https://github.com/apache/spark/pull/1379 Best, Xiangrui
