Hi all,
I am using the following dataset from Kaggle (train.csv):
https://www.kaggle.com/c/lshtc/data
The dataset is in libSVM format.
However, while trying to load it using load_svmlight_file, I get the
following error:

File "_svmlight_format.pyx", line 72, in
sklearn.datasets._svmlight_format._load_svmlight_file
Hi Gunjan,
Apparently the dataset is multi-label, so you need to use the
multilabel=True option.
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html
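Something like this should work (assuming the downloaded file is train.csv
in your working directory):

    from sklearn.datasets import load_svmlight_file

    # With multilabel=True, y is returned as a list of tuples of labels
    # instead of a one-dimensional array of labels.
    X, y = load_svmlight_file("train.csv", multilabel=True)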
Mathieu
On Fri, Feb 12, 2016 at 10:04 PM, Gunjan Dewan wrote:
Hi Mathieu,
Thanks a lot for the help.
But even after setting the multilabel option, it is giving a ValueError:

File "_svmlight_format.pyx", line 67, in
sklearn.datasets._svmlight_format._load_svmlight_file
(sklearn\datasets\_svmlight_format.c:2055)
ValueError: could not convert string to float
Hi,
I am trying to fit my model using regression trees, but the problem is that it
consumes a lot of RAM, which makes my code unresponsive. Judging by
different forums and platforms, this seems to be a common problem. I was
wondering how you free up memory, or what the best ways are to run the
fitting process.
Hi, Waseem,
I think lowering the value of n_jobs would help; as far as I know, each process
gets a copy of the data. I just stumbled upon spark-sklearn a few days ago; maybe
that could help as well:
https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html
Hi Sebastian,
This is true, but only if the data is less than 1M. Above that threshold it is
memmapped to a temp folder and shared by all processes:
https://pythonhosted.org/joblib/parallel.html#working-with-numerical-data-in-shared-memory-memmaping
You can try varying the "max_nbytes" parameter wherever Parallel is used.
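For reference, here is roughly what that looks like at the joblib level (a
toy sketch, not scikit-learn's internal code):

    import numpy as np
    from joblib import Parallel, delayed

    data = np.random.rand(1000, 100)

    # max_nbytes is the threshold above which joblib memmaps input arrays
    # to a temp folder so worker processes share them instead of each
    # getting a pickled copy; it takes an int or strings like "1M", "50M".
    results = Parallel(n_jobs=2, max_nbytes="1M")(
        delayed(np.mean)(data[i::2]) for i in range(2)
    )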
Hi Sebastian and Manoj,
@Manoj: What should be the value of the max_nbytes parameter, and will it
affect the results and the time it takes to run cross_validation, grid_search,
etc.?
@Sebastian: Will the Spark implementation also improve the memory use, or
just the CPU usage?
Thanks
Kindest Regards
Thanks for the note, Manoj, didn't know that!
@muhammad So if there's no duplication of data across the processes, I guess
you would also run into trouble with n_jobs=1. But just to make sure
that data duplication is not the issue, could you try running it with n_jobs=1?
In this case, we would at least know whether multiprocessing is to blame.
Thanks Jacob V. and Jacob S.
I have forked scikit-learn on GitHub and will start making my changes
on my branch. I will send a code review once I am done.
Mahesh
On Thu, Feb 11, 2016 at 11:18 AM, Jacob Vanderplas <jake...@cs.washington.edu> wrote:
I would be interested in knowing whether using typed memoryviews decreased
performance or not. Please ping me once you have results!
On Fri, Feb 12, 2016 at 11:04 AM, mahesh ravishankar <mahesh.ravishan...@gmail.com> wrote:
Hi,
That would depend on the size of the original dataset.
But I think you should try Sebastian's suggestion first, to determine whether
the real issue is data duplication or not.
On Fri, Feb 12, 2016 at 12:29 PM, muhammad waseem wrote:
@Sebastian: I tried with n_jobs=10 (out of 12 in total) and it still
caused the same problem. I could try running it with n_jobs=1, but it
would be so slow that it would take ages to complete. The machine has 32 GB of
RAM, and it started using swap memory after consuming all of it.
Is there a way to reduce the memory usage?
I'd suggest trying n_jobs=1 and checking whether swap memory is used (you
don't have to run it until completion). If this runs fine without swap, we can
work further from there.
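Something along these lines (a hypothetical random forest stand-in with
random data, since I don't know your exact model; watch top/htop while it
runs):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    # Stand-in data; substitute your real X and y.
    X = np.random.rand(100000, 20)
    y = np.random.rand(100000)

    # Single worker: if memory already spills into swap here, the problem
    # is not data copies across processes.
    model = RandomForestRegressor(n_estimators=100, n_jobs=1)
    model.fit(X, y)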
Sent from my iPhone
> On Feb 12, 2016, at 2:57 PM, muhammad waseem wrote:
I don't think that the data is copied for tree-based classifiers. They use
the threading backend, so each thread should be sharing memory.
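For illustration (a toy example, not scikit-learn's actual code), this is
the kind of sharing the threading backend gives you:

    import numpy as np
    from joblib import Parallel, delayed

    data = np.random.rand(1000000)

    # With backend="threading", all workers operate on the same array in
    # memory; nothing is pickled or copied into child processes.
    partial_sums = Parallel(n_jobs=4, backend="threading")(
        delayed(np.sum)(data[i::4]) for i in range(4)
    )
    total = sum(partial_sums)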
On Fri, Feb 12, 2016 at 12:32 PM, Sebastian Raschka wrote:
It seems like our svmlight reader doesn't support spaces between labels:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/datasets/_svmlight_format.pyx#L71
Could you report an issue on GitHub?
In the meantime, you can write a small Python script that deletes the
spaces between labels.
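For example, something like this (a rough sketch; it assumes commas only
ever appear between labels, and train_fixed.csv is just a made-up output
name):

    import re

    # Turn label lists like "314, 25, 7" into "314,25,7" so that the
    # svmlight reader can parse each line.
    with open("train.csv") as fin, open("train_fixed.csv", "w") as fout:
        for line in fin:
            fout.write(re.sub(r",\s+", ",", line))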
I'll do that.
Thanks a lot.
Gunjan
On Sat, Feb 13, 2016 at 6:04 AM, Mathieu Blondel wrote: