Re: [scikit-learn] Dimension Reduction - MDS

2018-10-11 Thread Brown J.B. via scikit-learn
Hi Guillaume,

The good news is that your script works as-is on smaller datasets and,
hopefully, performs your task's logic correctly.

In addition to Alex's comment about data size and MDS tractability, I would
also raise a philosophical question: why consider MDS for such a large
dataset?
At least in two dimensions, once MDS goes beyond 1000 samples or so, the
resulting sample coordinates and their visualization tend to be highly
dispersed (e.g., resembling a 2D-uniform distribution) and may not be
interpretable.
One can move to three-dimensional MDS, but even then a few thousand
samples is probably the limit of graphical interpretability.
It obviously depends on the relationships in your data.
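
To make that concrete, here is a minimal sketch (assuming scikit-learn and
NumPy are installed) of embedding a few hundred random points with classic
MDS; the data and sizes are illustrative placeholders, and beyond roughly
1000 samples both the runtime and the visual spread grow quickly:

```python
import numpy as np
from sklearn.manifold import MDS

# Toy data: 300 samples with 10 features (stand-ins for descriptors).
rng = np.random.RandomState(0)
X = rng.rand(300, 10)

# Classic metric MDS down to 2 coordinates for plotting.
mds = MDS(n_components=2, random_state=0)
coords = mds.fit_transform(X)

print(coords.shape)  # one 2-D coordinate pair per sample
```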

Also, as you continue your work, keep in mind that the per-sample
dimensionality (the number of entries in a single sample's descriptor
vector) is not the primary driver of the MDS algorithm's memory
requirements. In any case you must compute (either inline or
pre-computed) the pairwise distance matrix over all samples, and that
matrix stays in memory during coordinate generation (as far as I know).
So 10 chemical descriptors (since I noticed you mentioning Dragon) or 1000
descriptors will still result in the same memory requirement for the
distance matrix, and scaling to hundreds of thousands of samples will
exhaust the compute node's RAM: 200,000 samples means a 200000x200000
matrix, roughly 320 GB in float64.
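
A quick back-of-the-envelope check of that distance-matrix footprint
(plain arithmetic, no scikit-learn required):

```python
def distance_matrix_gb(n_samples, bytes_per_entry=8):
    """Memory needed for a dense n x n float64 distance matrix, in GB."""
    return n_samples ** 2 * bytes_per_entry / 1e9

print(distance_matrix_gb(70_000))   # ~39.2 GB: near the limit of a 64 GB node
print(distance_matrix_gb(200_000))  # ~320 GB: far beyond it
```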

Since you have 200k samples, you could potentially do some form of repeated
partial clustering (e.g., on random subsamples of the data), find a
reasonable number of clusters per repetition, use those results to
estimate a cluster count for a global clustering, and then select a
limited number of samples per cluster to project into a coordinate space
by MDS.
Alternatively, a diversity selection (by vector distance or, in your case,
by differing compound scaffolds) may be a quick way to obtain a subset and
visualize distance relationships.
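
A rough sketch of that cluster-then-project idea, assuming scikit-learn;
the sample counts, cluster count, and per-cluster quota below are
illustrative placeholders, not recommendations:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.rand(5_000, 10)  # stand-in for a large descriptor table

# 1) Cluster the full set cheaply with a scalable k-means variant.
km = MiniBatchKMeans(n_clusters=20, random_state=0, n_init=3)
labels = km.fit_predict(X)

# 2) Keep a small quota of representative samples per cluster.
quota = 5
keep = np.concatenate([
    np.flatnonzero(labels == c)[:quota] for c in range(km.n_clusters)
])

# 3) Run MDS only on the representatives, which is tractable at this size.
coords = MDS(n_components=2, random_state=0).fit_transform(X[keep])
print(coords.shape)  # at most (20 * 5, 2)
```

A diversity selection would simply replace steps 1-2 with a different
rule for choosing `keep`.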

Hope this helps.

Sincerely,
J.B. Brown

On Thu, Oct 11, 2018 at 20:14, Alexandre Gramfort wrote:

> hi Guillaume,
>
> I cannot use our MDS solver at this scale. Even if you fit it in RAM
> it will be slow.
>
> I would play with https://github.com/lmcinnes/umap unless you really
> what a classic MDS.
>
> Alex
>
> On Thu, Oct 11, 2018 at 10:31 AM Guillaume Favelier wrote:
> >
> > Hello J.B,
> >
> > Thank you for your quick reply.
> >
> > > If you try with a very small (e.g., 100 sample) data file, does your
> > > code employing MDS work?
> > > As you increase the number of samples, does the script continue to
> > > work?
> > So I tried the same script while increasing the number of samples (100,
> > 1000, and 10000), and it indeed works without swapping on my workstation.
> >
> > > That is 4,900,000,000 entries, plus overhead for a data structure.
> > I thought that even 4,900,000,000 entries of doubles would be able to be
> > processed with 64G of RAM. Is there something to configure to allow this
> > computation?
> >
> > The typical datasets I use can have around 200-300k rows with a few
> columns
> > (usually up to 3).
> >
> > Best regards,
> >
> > Guillaume
> >
> > Quoting "Brown J.B. via scikit-learn" :
> >
> > > Hello Guillaume,
> > >
> > > You are computing a distance matrix of shape 70000x70000 to generate
> > > MDS coordinates.
> > > That is 4,900,000,000 entries, plus overhead for a data structure.
> > >
> > > If you try with a very small (e.g., 100 sample) data file, does your
> > > code employing MDS work?
> > > As you increase the number of samples, does the script continue to
> > > work?
> > >
> > > Hope this helps you get started.
> > > J.B.
> > >
> > > On Tue, Oct 9, 2018 at 18:22, Guillaume Favelier wrote:
> > >
> > >> Hi everyone,
> > >>
> > >> I'm trying to use a dimension reduction algorithm [1] on my dataset
> > >> [2] in a Python script [3], but for some reason Python seems to
> > >> consume a lot of my main memory and even swap on my configuration
> > >> [4], so I don't get the expected result but a memory error instead.
> > >>
> > >> I have the impression that this behaviour is not intended, so can
> > >> you help me find what I did wrong or missed somewhere, please?
> > >>
> > >> [1]: MDS -
> > >> http://scikit-learn.org/stable/modules/generated/sklearn.manifold.MDS.html
> > >> [2]: dragon.csv - 69827 rows, 3 columns (x,y,z)
> > >> [3]: dragon.py - 10 lines
> > >> [4]: dragon_swap.png - htop on my workstation
> > >>
> > >> TAR archive:
> > >> https://drive.google.com/open?id=1d1S99XeI7wNEq131wkBUCBrctPQRgpxn
> > >>
> > >> Best regards,
> > >>
> > >> Guillaume Favelier
> > >>
> > >> ___
> > >> scikit-learn mailing list
> > >> scikit-learn@python.org
> > >> https://mail.python.org/mailman/listinfo/scikit-learn
> > >>
