1) MLlib 1.1 should be faster than 1.0 in general. What's the size of
your dataset? Is the RDD evenly distributed across nodes? You can
check the storage tab of the Spark WebUI.
2) I don't have much experience with mesos deployment. Someone else
may be able to answer your question.
-Xiangrui
On
Could you create a JIRA for it? We can either remove special
characters or encode with alphanumerics. -Xiangrui
On Thu, Sep 18, 2014 at 3:50 PM, SK wrote:
> Hi,
>
> The default log files for the Mllib examples use a rather long naming
> convention that includes special characters like parentheses
We don't support kernels because it doesn't scale well. Please check
"When to use LIBLINEAR but not LIBSVM" on
http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html . I like Jey's
suggestion on expanding features. -Xiangrui
On Thu, Sep 18, 2014 at 12:29 PM, Jey Kottalam wrote:
> Hi Aris,
>
> A s
Did you cache `features`? Without caching it is slow because we need
O(k) iterations. The storage requirement on the driver is about 2 * n
* k = 2 * 3 million * 200 ~= 9GB, not considering any overhead.
Computing U is also an expensive task in your case. We should use some
randomized SVD implementa
The importance should be based on some statistics, for example, the
standard deviation of the feature column and the magnitude of the
weight. If the columns are scaled to unit standard deviation (using
StandardScaler), you can tell the importance by the absolute value of
the weight. But there are o
You can use CoGroupedRDD
(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.CoGroupedRDD)
directly. -Xiangrui
On Thu, Sep 18, 2014 at 7:09 AM, Debasish Das wrote:
> Hi,
>
> I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j)
> and I would like
Hi Jatin,
HashingTF should be able to solve the memory problem if you use a
small feature dimension in HashingTF. Please do not cache the input
document, but cache the output from HashingTF and IDF instead. We
don't have a label indexer yet, so you need a label to index map to
map it to double val
Or you can use the factory method `Vectors.sparse`:
val sv = Vectors.sparse(numProducts, productIds.map(x => (x, 1.0)))
where numProducts should be the largest product id plus one.
Best,
Xiangrui
On Mon, Sep 15, 2014 at 12:46 PM, Chris Gore wrote:
> Hi Sameer,
>
> MLLib uses Breeze’s vector fo
Thanks for the update! -Xiangrui
On Sun, Sep 14, 2014 at 11:33 PM, jatinpreet wrote:
> Hi,
>
> I have been able to get the same accuracy with MLlib as Mahout's. The
> pre-processing phase of Mahout was the reason behind the accuracy mismatch.
> After studying and applying the same logic in my co
Spark doesn't support MultipleOutput at this time. You can cache the
parent RDD. Then create RDDs from it and save them separately.
-Xiangrui
On Fri, Sep 12, 2014 at 7:45 AM, Guillermo Ortiz wrote:
>
> I would like to define the names of my output in Spark, I have a process
> which write many fai
Please check the colStats method defined under mllib.stat.Statistics. -Xiangrui
On Mon, Sep 15, 2014 at 1:00 PM, jamborta wrote:
> Hi all,
>
> I have an RDD that contains around 50 columns. I need to sum each column,
> which I am doing by running it through a for loop, creating an array and
> run
I wrote an input format for Redshift's tables unloaded UNLOAD the
ESCAPE option: https://github.com/mengxr/redshift-input-format , which
can recognize multi-line records.
Redshift puts a backslash before any in-record `\\`, `\r`, `\n`, and
the delimiter character. You can apply the same escaping b
If you are using the Mahout's Multinomial Naive Bayes, it should be
the same as MLlib's. I tried MLlib with news20.scale downloaded from
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html
and the test accuracy is 82.4%. -Xiangrui
On Tue, Sep 9, 2014 at 4:58 AM, jatinpreet wrot
or ? You want a
>> distributed lp / socp solver where each worker solves a partition of the
>> constraint and the full objective...and you want to converge to a global
>> solution using consensus ? Or your problem has more structure to partition
>> the problem cleanly and don&
Could you attach the driver log? -Xiangrui
On Mon, Sep 8, 2014 at 7:23 AM, Hui Li wrote:
> I am running a very simple example using the SVMWithSGD on Amazon EMR. I
> haven't got any result after one hour long.
>
> My instance-type is: m3.large
> instance-count is: 3
> Dataset is the data pr
When you submit the job to yarn with spark-submit, set --conf
spark.yarn.user.classpath.first=true .
On Mon, Sep 8, 2014 at 10:46 AM, Penny Espinoza
wrote:
> I don't understand what you mean. Can you be more specific?
>
>
>
> From: Victor Tso-Guillen
> Sent: Sat
There is an undocumented configuration to put users jars in front of
spark jar. But I'm not very certain that it works as expected (and
this is why it is undocumented). Please try turning on
spark.yarn.user.classpath.first . -Xiangrui
On Sat, Sep 6, 2014 at 5:13 PM, Victor Tso-Guillen wrote:
> I
You can try LinearRegression with sparse input. It converges the least
squares solution if the linear system is over-determined, while the
convergence rate depends on the condition number. Applying standard
scaling is popular heuristic to reduce the condition number.
If you are interested in spars
ch also reduce the GC time. That is exactly
>> what we observed in our job.
>>
>> 2. This approach will not hit the 2G limitation, because it not change
>> the partition size.
>>
>> And I also think that, Spark may change this default value, or at least
>>
Maybe copy the implementation of StandardScaler from 1.1 and use it in
v1.0.x. -Xiangrui
On Wed, Sep 3, 2014 at 5:10 PM, Yana Kadiyska wrote:
> It seems like the next release will add a nice
> org.apache.spark.mllib.feature package but what is the recommended way to
> normalize features in the cu
There is a sliding method implemented in MLlib
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala),
which is used in computing Area Under Curve:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluat
I think they are the same. If you have hub (https://hub.github.com/)
installed, you can run
hub checkout https://github.com/apache/spark/pull/216
and then `sbt/sbt assembly`
-Xiangrui
On Wed, Sep 3, 2014 at 12:03 AM, filipus wrote:
> howto install? just clone by git clone
> https://github.com/
We have a pending PR (https://github.com/apache/spark/pull/216) for
discretization but it has performance issues. We will try to spend
more time to improve it. -Xiangrui
On Tue, Sep 2, 2014 at 2:56 AM, filipus wrote:
> i guess i found it
>
> https://github.com/LIDIAgroup/SparkFeatureSelection
>
>
This is not supported in MLlib. Hopefully, we will add support for
weighted examples in v1.2. If you want to train weighted instances
with the current tree implementation, please try importance sampling
first to adjust the weights. For instance, an example with weight 0.3
is sampled with probabilit
Are you using hadoop-1.0? Hadoop doesn't support splittable bz2 files
before 1.2 (or a later version). But due to a bug
(https://issues.apache.org/jira/browse/HADOOP-10614), you should try
hadoop-2.5.0. -Xiangrui
On Wed, Aug 27, 2014 at 2:49 PM, jerryye wrote:
> Hi,
> I'm running on the master br
ote:
> So, I guess zipWithUniqueId will be similar.
>
> Is there a way to get unique index?
>
>
> On Wed, Aug 27, 2014 at 2:39 PM, Xiangrui Meng wrote:
>>
>> No. The indices start at 0 for every RDD. -Xiangrui
>>
>> On Wed, Aug 27, 2014 at 2:37 PM, Soumitr
No. The indices start at 0 for every RDD. -Xiangrui
On Wed, Aug 27, 2014 at 2:37 PM, Soumitra Kumar
wrote:
> Hello,
>
> If I do:
>
> DStream transform {
> rdd.zipWithIndex.map {
>
> Is the index guaranteed to be unique across all RDDs here?
>
> }
> }
>
> Thanks,
> -Soumitra.
Hi Wei,
Please keep us posted about the performance result you get. This would
be very helpful.
Best,
Xiangrui
On Wed, Aug 27, 2014 at 10:33 AM, Wei Tan wrote:
> Thank you all. Actually I was looking at JCUDA. Function wise this may be a
> perfect solution to offload computation to GPU. Will se
How many partitions now? Btw, which Spark version are you using? I
checked your code and I don't understand why you want to broadcast
vectors2, which is an RDD.
var vectors2 =
vectors.repartition(1000).persist(org.apache.spark.storage.StorageLevel.MEMORY_AND_DISK_SER)
var broadcastVector = sc.bro
n
> using logistic regression to build score cards.
>
> Xiaobo Gu
>
>
> -- Original --
> From: "Xiangrui Meng";;
> Send time: Wednesday, Aug 20, 2014 2:18 PM
> To: "";
> Cc: "user@spark.apache.org";
> Sub
We implemented chi-squared tests in v1.1:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166
and we will add more after v1.1. Feedback on which tests should come
first would be greatly appreciated. -Xiangrui
On Tue, Aug 19, 2014 at 9:
Bayes?
>
> Phuoc Do
>
>
> On Tue, Aug 19, 2014 at 12:51 AM, Xiangrui Meng wrote:
>>
>> What is the ratio of examples labeled `s` to those labeled `b`? Also,
>> Naive Bayes doesn't work on negative feature values. It assumes term
>> frequencies as the i
There are only 5 worker nodes. So please try to reduce the number of
partitions to the number of available CPU cores. 1000 partitions are
too bigger, because the driver needs to collect to task result from
each partition. -Xiangrui
On Tue, Aug 19, 2014 at 1:41 PM, durin wrote:
> When trying to us
On Tue, Jul 8, 2014 at 1:30 PM, Rahul Bhojwani
> wrote:
>>
>> Thanks a lot Xiangrui. This will help.
>>
>>
>> On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng wrote:
>>>
>>> Hi Rahul,
>>>
>>> We plan to add online model updates with
The categorical features must be encoded into indices starting from 0:
0, 1, ..., numCategories - 1. Then you can provide the
categoricalFeatureInfo map to specify which columns contain
categorical features and the number of categories in each. Joseph is
updating the user guide. But if you want to
What is the ratio of examples labeled `s` to those labeled `b`? Also,
Naive Bayes doesn't work on negative feature values. It assumes term
frequencies as the input. We should throw an exception on negative
feature values. -Xiangrui
On Tue, Aug 19, 2014 at 12:07 AM, Phuoc Do wrote:
> I'm trying Na
Guoqiang reported some results in his PRs
https://github.com/apache/spark/pull/828 and
https://github.com/apache/spark/pull/929 . But this is really
problem-dependent. -Xiangrui
On Fri, Aug 15, 2014 at 12:30 PM, Debasish Das wrote:
> Hi,
>
> Are there any experiments detailing the performance hit
Could you try to map it to row-majored first? Your approach may
generate multiple copies of the data. The code should look like this:
~~~
val rows = rdd.map { case (j, values) =>
values.view.zipWithIndex.map { case (v, i) =>
(i, (j, v))
}
}.groupByKey().map { case (i, entries) =>
Vectors
.
>
>
> On Wed, Aug 13, 2014 at 1:26 PM, Xiangrui Meng wrote:
>>
>> You can define an evaluation metric first and then use a grid search
>> to find the best set of training parameters. Ampcamp has a tutorial
>> showing how to do this for ALS:
>>
>> http://
You can define an evaluation metric first and then use a grid search
to find the best set of training parameters. Ampcamp has a tutorial
showing how to do this for ALS:
http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html
-Xiangrui
On Tue, Aug 12, 2014 at 8:01 PM,
will combine multiple partitions into a large partition if I cache it, so
> same issues as #1?
>
> For coalesce, could you share some best practice how to set the right number
> of partitions to avoid locality problem?
>
> Thanks!
>
>
>
> On Tue, Aug 12, 2014 at 3:51 PM,
What did you set for driver memory? The default value is 256m or 512m,
which is too small. Try to set "--driver-memory 10g" with spark-submit
or spark-shell and see whether it works or not. -Xiangrui
On Mon, Aug 11, 2014 at 6:26 PM, durin wrote:
> I'm trying to apply KMeans training to some text
For linear models, the constructors are now public. You can save the
weights to HDFS, then load the weights back and use the constructor to
create the model. -Xiangrui
On Mon, Aug 11, 2014 at 10:27 PM, XiaoQinyu wrote:
> hello:
>
> I want to know,if I use history data to training model and I want
Assuming that your data is very sparse, I would recommend
RDD.repartition. But if it is not the case and you don't want to
shuffle the data, you can try a CombineInputFormat and then parse the
lines into labeled points. Coalesce may cause locality problems if you
didn't use the right number of part
If the file is big enough, you can try MLUtils.loadLibSVMFile with a
minPartitions argument. This doesn't shuffle data but it might not
give you the exact number of partitions. If you want to have the exact
number, use RDD.repartition, which requires data shuffling. -Xiangrui
On Sun, Aug 10, 2014
For the reduce/aggregate question, driver collects results in
sequence. We now use tree aggregation in MLlib to reduce driver's
load:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/RDDFunctions.scala#L89
It is faster than aggregate when there are many
Thanks for posting the solution! You can also append `% "provided"` to
the `spark-mllib` dependency line and remove `spark-core` (because
spark-mllib already depends on spark-core) to make the assembly jar
smaller. -Xiangrui
On Fri, Aug 8, 2014 at 10:05 AM, SK wrote:
> i was using sbt package whe
They are two different RDDs. Spark doesn't guarantee that the first
partition of RDD1 and the first partition of RDD2 will stay in the
same worker node. If that is the case, if you have 1000
single-partition RDDs the first worker will have very heavy load.
-Xiangrui
On Thu, Aug 7, 2014 at 2:20 AM,
Is the GPL library only available on the driver node? If that is the
case, you need to add them to `--jars` option of spark-submit.
-Xiangrui
On Thu, Aug 7, 2014 at 6:59 PM, Jikai Lei wrote:
> I had the following error when trying to run a very simple spark job (which
> uses logistic regression w
Besides durin's suggestion, please also confirm driver and executor
memory in the WebUI, since they are small according to the log:
14/08/07 19:59:10 INFO MemoryStore: Block broadcast_0 stored as values to
memory (estimated size 34.6 KB, free 303.3 MB)
-Xiangrui
-
ratings.map{ case Rating(u,m,r) => {
val pred = model.predict(u, m)
(r - pred)*(r - pred)
}
}.mean()
The code doesn't work because the userFeatures and productFeatures
stored in the model are RDDs. You tried to serialize them into the
task closure, and execute `model.predict` on an execu
Did you cache the table? There are couple ways of caching a table in
Shark: https://github.com/amplab/shark/wiki/Shark-User-Guide
On Thu, Aug 7, 2014 at 6:51 AM, wrote:
> Dear all,
>
> I am using Spark 0.9.2 in Standalone mode. Hive and HDFS in CDH 5.1.0.
>
> 6 worker nodes each with memory 96GB
Then this may be a bug. Do you mind sharing the dataset that we can
use to reproduce the problem? -Xiangrui
On Thu, Aug 7, 2014 at 1:20 AM, SK wrote:
> Spark 1.0.1
>
> thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Regularization-parameter
It is used in data loading:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveBayes.scala#L76
On Thu, Aug 7, 2014 at 12:47 AM, SK wrote:
> I followed the example in
> examples/src/main/scala/org/apache/spark/examples/mllib/SparseNaiveB
Which Spark version are you using? -Xiangrui
On Thu, Aug 7, 2014 at 1:12 AM, SK wrote:
> Hi,
>
> I am following the code in
> examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala
> For setting the parameters and parsing the command line options, I am just
> reusing t
Do you mind testing 1.1-SNAPSHOT and allocating more memory to the driver?
I think the problem is with the feature dimension. KDD data has more than
20M features and in v1.0.1, the driver collects the partial gradients one
by one, sums them up, does the update, and then sends the new weights back
t
Some extra work is needed to close the loop. One related example is
streaming linear regression added by Jeremy very recently:
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingLinearRegression.scala
You can use a model trained offline to
To be precise, the optimization is not `get all products that are
related to this user` but `get all products that are related to users
inside this block`. So a product factor won't be sent to the same
block more than once. We considered using GraphX to implement ALS,
which is much easier to unders
found the total task reduce from 200 t0 64 after first stage just like this:
>
>
> But I don't know if this is reasonable.
>
>
>
> On Wed, Jul 30, 2014 at 2:11 PM, Xiangrui Meng wrote:
>
>> After you load the data in, call `.repartition(number of
>> exe
nly distributed to the executors.
>
> The input data is on the HDFS, not on the spark clusters. How can I make the
> data distributed to the excutors?
>
>
> On Wed, Jul 30, 2014 at 1:52 PM, Xiangrui Meng wrote:
>>
>> The weight vector is usually dense and if you have m
The weight vector is usually dense and if you have many partitions,
the driver may slow down. You can also take a look at the driver
memory inside the Executor tab in WebUI. Another setting to check is
the HDFS block size and whether the input data is evenly distributed
to the executors. Are the ha
Could you share more details about the dataset and the algorithm? For
example, if the dataset has 10M+ features, it may be slow for the driver to
collect the weights from executors (just a blind guess). -Xiangrui
On Tue, Jul 29, 2014 at 9:15 PM, Tan Tim wrote:
> Hi, all
>
> [Setting]
>
> Input
Before torrent, http is the default way for broadcasting. The driver
holds the data and the executors request the data via http, making the
driver the bottleneck if the data is large. -Xiangrui
On Tue, Jul 29, 2014 at 10:32 AM, durin wrote:
> Development is really rapid here, that's a great thing
Do you mind sharing more details, for example, specs of nodes and data
size? -Xiangrui
2014-07-29 2:51 GMT-07:00 John Wu :
> Hi all,
>
>
>
> There is a problem we can’t resolve. We implement the OWLQN algorithm in
> parallel with SPARK,
>
> We don’t know why It is very slow in every iteration stag
That looks good to me since there is no Hadoop InputFormat for HDF5.
But remember to specify the number of partitions in sc.parallelize to
use all the nodes. You can change `process` to `read` which yields
records one-by-one. Then sc.parallelize(files,
numPartitions).flatMap(read) returns an RDD of
Are you using 1.0.0? There was a bug, which was fixed in 1.0.1 and
master. If you don't want to switch to 1.0.1 or master, try to cache
and count test first. -Xiangrui
On Mon, Jul 28, 2014 at 6:07 PM, SK wrote:
> Hi,
>
> In order to evaluate the ML classification accuracy, I am zipping up the
> p
Great! Thanks for testing the new features! -Xiangrui
On Mon, Jul 28, 2014 at 8:58 PM, durin wrote:
> Hi Xiangrui,
>
> using the current master meant a huge improvement for my task. Something
> that did not even finish before (training with 120G of dense data) now
> completes in a reasonable time
1. I meant in the n (1k) by m (10k) case, we need to broadcast k
centers and hence the total size is m * k. In 1.0, the driver needs to
send the current centers to each partition one by one. In the current
master, we use torrent to broadcast the centers to workers, which
should be much faster.
2.
I think this is nice to have. Feel free to create a JIRA for it and it
would be great if you can send a PR. Thanks! -Xiangrui
On Thu, Jul 24, 2014 at 12:39 PM, SK wrote:
>
> Hi,
>
> The mllib.clustering.kmeans implementation supports a random or parallel
> initialization mode to pick the initial
If you have an m-by-n dataset and train a k-means model with k, the
cost for each iteration is
O(m * n * k) (assuming dense data)
Since m * n * k == n * m * k, so ideally you would expect the same run
time. However,
1. Communication. We need to broadcast current centers (m * k), do the
computati
17:51:48 INFO executor.Executor: Finished task ID 742
> <—— I have shutdown the App
> 14/07/25 18:16:36 INFO executor.CoarseGrainedExecutorBackend: Driver
> commanded a shutdown
>
> On Jul 2, 2014, at 0:08, Xiangrui Meng wrote:
>
>> Try to reduce number of partitions
It is ALS with setNonnegative. -Xiangrui
On Fri, Jul 25, 2014 at 7:38 AM, Aureliano Buendia wrote:
> Hi,
>
> Is there an implementation for Nonnegative Matrix Factorization in Spark? I
> understand that MLlib comes with matrix factorization, but it does not seem
> to cover the nonnegative case.
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is
a maintenance release with bug fixes across several areas of Spark,
including Spark Core, PySpark, MLlib, Streaming, and GraphX. We
recommend all 0.9.x users to upgrade to this stable release.
Contributions to this release came f
try `sbt/sbt clean` first? -Xiangrui
On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma wrote:
> I am trying to build spark after cloning from github repo:
>
> I am executing:
> ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly
>
> I am getting following error:
> [warn] ^
ast patch, the evaluation has been
> processed successfully.
>
> I expect that these patches are merged in the next major release (v1.1?).
> Without them, it would be hard to use mllib for a large dataset.
>
> Thanks,
> Makoto
>
>
> (2014/07/16 15:05), Xiangrui Meng wrot
This is a known issue:
https://issues.apache.org/jira/browse/SPARK-2197 . Joseph is working
on it. -Xiangrui
On Mon, Jul 21, 2014 at 4:20 PM, Jack Yang wrote:
> So this is a bug unsolved (for java) yet?
>
>
>
> From: Jack Yang [mailto:j...@uow.edu.au]
> Sent: Friday, 18 July 2014 4:52 PM
> To: us
You can also try a different region. I tested us-west-2 yesterday, and
it worked well. -Xiangrui
On Sun, Jul 20, 2014 at 4:35 PM, Matei Zaharia wrote:
> Actually the script in the master branch is also broken (it's pointing to an
> older AMI). Try 1.0.1 for launching clusters.
>
> On Jul 20, 2014
This is a useful feature but it may be hard to have it in v1.1 due to
limited time. Hopefully, we can support it in v1.2. -Xiangrui
On Mon, Jul 21, 2014 at 12:58 AM, Jiusheng Chen wrote:
> It seems MLlib right now doesn't support weighted training, training samples
> have equal importance. Weight
It was because of the latest change to task serialization:
https://github.com/apache/spark/commit/1efb3698b6cf39a80683b37124d2736ebf3c9d9a
The task size is no longer limited by akka.frameSize but we show
warning messages if the task size is above 100KB. Please check the
objects referenced in the t
You can save RDDs to text files using RDD.saveAsTextFile and load it back using
sc.textFile. But make sure the record to string conversion is correctly
implemented if the type is not primitive and you have the parser to load them
back. -Xiangrui
> On Jul 18, 2014, at 8:39 AM, Roch Denis wrote:
Nick's suggestion is a good approach for your data. The item factors to
broadcast should be a few MBs. -Xiangrui
> On Jul 18, 2014, at 12:59 AM, Bertrand Dechoux wrote:
>
> And you might want to apply clustering before. It is likely that every user
> and every item are not unique.
>
> Bertran
chine. It did bring down the time taken to less than 200
> seconds from over 700 seconds.
>
> I am not sure how to repartition the data to match the CPU cores. How do I
> do it?
>
> Thank you.
>
> Ravi
>
>
> On Thu, Jul 17, 2014 at 3:17 PM, Xiangrui Meng wrote:
&
Is it v0.9? Did you run in local mode? Try to set --driver-memory 4g
and repartition your data to match number of CPU cores such that the
data is evenly distributed. You need 1m * 50 * 8 ~ 400MB to storage
the data. Make sure there are enough memory for caching. -Xiangrui
On Thu, Jul 17, 2014 at 1
1) This is a miss, unfortunately ... We will add support for
regularization and intercept in the coming v1.1. (JIRA:
https://issues.apache.org/jira/browse/SPARK-2550)
2) It has overflow problems in Python but not in Scala. We can
stabilize the computation by ensuring exp only takes a negative value
hat timing sound about
> right, or does it point to a poor configuration? The same script with
> MovieLens 1M runs fine in about 30-40s with the same settings. (In both
> cases I'm training on 70% of the data.)
>
> Thanks for your help!
> Chris
>
>
> On Wed, Jul 16, 20
k-means||
>
>
> Best Regards
>
> ...
>
> Amin Mohebbi
>
> PhD candidate in Software Engineering
> at university of Malaysia
>
> H/P : +60 18 2040 017
>
>
>
> E-Mail : tp025...@ex.apiit.edu.my
>
> amin_...@me.com
>
>
>
kmeans.py contains a naive implementation of k-means in python, served
as an example of how to use pyspark. Please use MLlib's implementation
in practice. There is a JIRA for making it clear:
https://issues.apache.org/jira/browse/SPARK-2434
-Xiangrui
On Wed, Jul 16, 2014 at 8:16 PM, amin mohebbi
>> tmpfs1917974 1 19179731% /dev/shm
>> /dev/xvdv524288000 30 5242879701% /vol
>>
>> I have reproduced the error while using the MovieLens 10M data set on a
>> newly created cluster.
>>
>> Thanks for the help.
>>
}' | xargs | sed
>> 's/ /,/g’)
>>
>> Then adding -Dspark.local.dir=$SPACE or simply
>> -Dspark.local.dir=/mnt/spark/,/mnt2/spark/ when you run your driver
>> application
>>
>> Chris
>>
>> On Jul 15, 2014, at 11:39 PM, Xiangrui
Check the number of inodes (df -i). The assembly build may create many
small files. -Xiangrui
On Tue, Jul 15, 2014 at 11:35 PM, Chris DuBois wrote:
> Hi all,
>
> I am encountering the following error:
>
> INFO scheduler.TaskSetManager: Loss was due to java.io.IOException: No space
> left on devic
Then it may be a new issue. Do you mind creating a JIRA to track this
issue? It would be great if you can help locate the line in
BinaryClassificationMetrics that caused the problem. Thanks! -Xiangrui
On Tue, Jul 15, 2014 at 10:56 PM, crater wrote:
> I don't really have "my code", I was just runn
ul 15, 2014 at 10:48 PM, Makoto Yui wrote:
> Hello,
>
> (2014/06/19 23:43), Xiangrui Meng wrote:
>>>
>>> The execution was slow for more large KDD cup 2012, Track 2 dataset
>>> (235M+ records of 16.7M+ (2^24) sparse features in about 33.6GB) due to the
>>
crater, was the error message the same as what you posted before:
14/07/14 11:32:20 ERROR TaskSchedulerImpl: Lost executor 1 on node7: remote
Akka client disassociated
14/07/14 11:32:20 WARN TaskSetManager: Lost TID 20 (task 13.0:0)
14/07/14 11:32:21 ERROR TaskSchedulerImpl: Lost executor 3 on nod
I don't think MLlib supports model serialization/deserialization. You
got the error because the constructor is private. I created a JIRA for
this: https://issues.apache.org/jira/browse/SPARK-2488 and we try to
make sure it is implemented in v1.1. For now, you can modify the
KMeansModel and remove p
Could you share the code of RecommendationALS and the complete
spark-submit command line options? Thanks! -Xiangrui
On Mon, Jul 14, 2014 at 11:23 PM, Srikrishna S wrote:
> Using properties file: null
> Main class:
> RecommendationALS
> Arguments:
> _train.csv
> _validation.csv
> _test.csv
> Syste
Is it on a standalone server? There are several settings worthing checking:
1) number of partitions, which should match the number of cores
2) driver memory (you can see it from the executor tab of the Spark
WebUI and set it with "--driver-memory 10g"
3) the version of Spark you were running
Best
You need to set a larger `spark.akka.frameSize`, e.g., 128, for the
serialized weight vector. There is a JIRA about switching
automatically between sending through akka or broadcast:
https://issues.apache.org/jira/browse/SPARK-2361 . -Xiangrui
On Mon, Jul 14, 2014 at 12:15 AM, crater wrote:
> Hi,
You should return an iterator in mapPartitionsWIthIndex. This is from
the programming guide
(http://spark.apache.org/docs/latest/programming-guide.html):
mapPartitionsWithIndex(func): Similar to mapPartitions, but also
provides func with an integer value representing the index of the
partition, so
By default, Spark uses half of the memory for caching RDDs
(configurable by spark.storage.memoryFraction). That is about 25 * 8 /
2 = 100G for your setup, which is smaller than the 202G data size. So
you don't have enough memory to fully cache the RDD. You can confirm
it in the storage tab of the W
You can load the dataset as an RDD of JSON object and use a flatMap to
extract feature vectors at object level. Then you can filter the
training examples you want for binary classification. If you want to
try multiclass, checkout DB's PR at
https://github.com/apache/spark/pull/1379
Best,
Xiangrui
301 - 400 of 530 matches
Mail list logo