Re: [Scikit-learn-general] [GSoC2015 metric learning]

2015-05-31 Thread Michael Eickenberg
On Sun, May 31, 2015 at 7:25 PM, Artem barmaley@gmail.com wrote:

 I added a simple benchmark
 https://github.com/Barmaley-exe/scikit-learn/blob/metric-learning/benchmarks/bench_nca.py
 that compares NCA-assisted 1NN with the default one (with Euclidean distance)
 on the Wine dataset (it was one of the datasets reported in the NCA paper).
 See my output here: https://gist.github.com/Barmaley-exe/a713f23f74eb53a2f2bd.

 It also compares the semivectorized and vectorized implementations:
 surprisingly, the semivectorized one is about 2 times faster. I think this
 might be a reason to throw the fully vectorized implementation
 (nca_vectorized_oracle) away.


This is very interesting. I like the way you built the semi-vectorized
version, and it seems to show that you are still benefiting from the
remaining vectorizations you have.
However, I am surprised to see such a difference between the two methods.
Do you have any idea why this could be the case? Where do you think the
vectorized version loses all this time?





 On Sat, May 30, 2015 at 12:33 AM, Michael Eickenberg 
 michael.eickenb...@gmail.com wrote:

 Thanks for the update.

  So, what's the consensus on benchmarks? I can share ipython notebooks
 via gist, for example.

 My (weak) preference would be to have a script within the sklearn repo,
 just to keep stuff in one place for easy future reference.

 Michael


 On Fri, May 29, 2015 at 6:24 PM, Artem barmaley@gmail.com wrote:

 So, I created a WIP PR dedicated to NCA:
 https://github.com/scikit-learn/scikit-learn/pull/4789

 As suggested by Michael, I refactored the meat into a function. I also
 rewrote it as a first-order oracle, so I can (and do) use scipy's
 optimizers. I've seen scipy.optimize.minimize (apparently with BFGS)
 sometimes stop at some weird point (a local minimum / saddle point?),
 whereas gradient descent seems to always converge. I didn't test either
 of them extensively, though.
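
For context, a minimal sketch of this kind of first-order-oracle setup, with
a toy quadratic objective standing in for the actual NCA oracle from the PR;
the point is only the jac=True calling convention:

import numpy as np
from scipy.optimize import minimize

def oracle(w):
    # Toy objective f(w) = ||w - 1||^2 and its gradient 2 * (w - 1); the real
    # oracle would return the NCA objective and its gradient w.r.t. L.
    diff = w - 1.0
    return np.dot(diff, diff), 2.0 * diff

w0 = np.zeros(5)
# jac=True tells scipy the function returns (value, gradient) in one call.
res = minimize(oracle, w0, jac=True, method='BFGS')
print(res.x)  # close to all ones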

 I also fully vectorized function and gradient calculations, no loops
 involved.

 So, what's the consensus on benchmarks? I can share ipython notebooks
 via gist, for example.

 On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg 
 michael.eickenb...@gmail.com wrote:

 Hi Aurélien,

 thanks for these very good pointers!
 (Now we also know who else to bug periodically for opinions ;))

 Michael


 On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet 
 aurelien.bel...@telecom-paristech.fr wrote:

 Hi everyone,

 A few additional things to consider for scaling-up NCA to large
 datasets:

 - Take a look at the t-SNE (technique for visualization/dim reduction
 very similar to NCA) implementations, I think they have a few speed-up
 tricks that you could potentially re-use:
 http://lvdmaaten.github.io/tsne/

 - Like you said, SGD can help reduce the computational cost - you could
 also consider recent improvements of SGD, such as SAG/SAGA, SVRG, etc.

 - Similarly to what was suggested in previous replies, a general idea is
 to only consider a neighborhood around each point (either fixed in
 advance, or updated every now and then during the course of
 optimization), since the probabilities decrease very fast with the
 distance so farther points can be safely ignored in the computation.
 This is explored for instance in:
 http://dl.acm.org/citation.cfm?id=2142432

 - Another related idea is to construct class representatives (for
 instance using k-means), and to model the distribution only wrt these
 points instead of the entire dataset. This is especially useful if some
 classes are very large. An extreme version of this is to reframe NCA for
 a Nearest Class Mean classifier, where each class is only modeled by its
 center:

 https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf
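
As a rough sketch of that nearest-class-mean idea (just an illustration, not
the method from the linked paper), assuming a linear map L has already been
learned:

import numpy as np

def nearest_class_mean_predict(L, X_train, y_train, X_test):
    # Project both sets with the learned linear map L, then assign each test
    # point to the class whose projected mean is closest.
    Z_train, Z_test = X_train.dot(L.T), X_test.dot(L.T)
    classes = np.unique(y_train)
    means = np.array([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = ((Z_test[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]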

 Hope this helps.

 Aurelien

 On 5/28/15 11:20 PM, Andreas Mueller wrote:
 
 
  On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
 
  Code-wise, I would attack the problem as a function first. Write a
  function that takes X and y (plus maybe some options) and gives back
  L. You can put a skeleton of a sklearn estimator around it by calling
  this function from fit.
  Please keep your code either in a sklearn WIP PR or a public gist, so
  it can be reviewed. Writing benchmarks can be framed as writing
  examples, i.e. plot_* functions (maybe Andy or Olivier have a comment
  on how benchmarks have been handled in the past?).
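
A minimal sketch of that structure, with placeholder names (nca_fit, NCA,
n_components) rather than the actual API of the PR:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

def nca_fit(X, y, n_components=None):
    # Placeholder for the actual solver: optimize the NCA objective on (X, y)
    # and return the learned linear map L of shape (n_components, n_features).
    n_features = X.shape[1]
    n_components = n_components or n_features
    return np.eye(n_components, n_features)

class NCA(BaseEstimator, TransformerMixin):
    def __init__(self, n_components=None):
        self.n_components = n_components

    def fit(self, X, y):
        # The estimator is only a thin wrapper: all the work happens in nca_fit.
        self.L_ = nca_fit(X, y, n_components=self.n_components)
        return self

    def transform(self, X):
        return X.dot(self.L_.T)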
 
  There is a benchmark folder, which is in horrible shape.
  Basically there are three ways to do it: examples (with or without plot
  depending on the runtime), a script in the benchmark folder, or a gist.
  Often we just use a gist and the PR person posts the output. Not that
  great for reproducibility, though.
 
 

Re: [Scikit-learn-general] [GSoC2015 metric learning]

2015-05-31 Thread Artem
I added a simple benchmark
https://github.com/Barmaley-exe/scikit-learn/blob/metric-learning/benchmarks/bench_nca.py
that compares NCA-assisted 1NN with the default one (with Euclidean distance)
on the Wine dataset (it was one of the datasets reported in the NCA paper).
See my output here: https://gist.github.com/Barmaley-exe/a713f23f74eb53a2f2bd.

It also compares the semivectorized and vectorized implementations:
surprisingly, the semivectorized one is about 2 times faster. I think this
might be a reason to throw the fully vectorized implementation
(nca_vectorized_oracle) away.
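
For a rough idea of what the benchmark does, here is a stripped-down sketch;
nca_fit below is only a stand-in for the solver in the PR, and the real
script is the bench_nca.py linked above:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def nca_fit(X, y):
    # Stand-in for the PR's solver; returns an identity map so the script
    # runs end to end (with it, both scores are of course identical).
    return np.eye(X.shape[1])

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Baseline: 1NN with the plain Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("Euclidean 1NN:", knn.score(X_test, y_test))

# NCA-assisted: learn L on the training set, then run 1NN in the space X L^T.
L = nca_fit(X_train, y_train)
knn_nca = KNeighborsClassifier(n_neighbors=1).fit(X_train.dot(L.T), y_train)
print("NCA 1NN:", knn_nca.score(X_test.dot(L.T), y_test))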

On Sat, May 30, 2015 at 12:33 AM, Michael Eickenberg 
michael.eickenb...@gmail.com wrote:

 Thanks for the update.

  So, what's the consensus on benchmarks? I can share ipython notebooks
 via gist, for example.

 My (weak) preference would be to have a script within the sklearn repo,
 just to keep stuff in one place for easy future reference.

 Michael


 On Fri, May 29, 2015 at 6:24 PM, Artem barmaley@gmail.com wrote:

 So, I created a WIP PR dedicated to NCA:
 https://github.com/scikit-learn/scikit-learn/pull/4789

 As suggested by Michael, I refactored the meat into a function. I also
 rewrote it as a first-order oracle, so I can (and do) use scipy's
 optimizers. I've seen scipy.optimize.minimize (apparently with BFGS)
 sometimes stop at some weird point (a local minimum / saddle point?),
 whereas gradient descent seems to always converge. I didn't test either
 of them extensively, though.

 I also fully vectorized function and gradient calculations, no loops
 involved.

 So, what's the consensus on benchmarks? I can share ipython notebooks via
 gist, for example.

 On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg 
 michael.eickenb...@gmail.com wrote:

 Hi Aurélien,

 thanks for these very good pointers!
 (Now we also know who else to bug periodically for opinions ;))

 Michael


 On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet 
 aurelien.bel...@telecom-paristech.fr wrote:

 Hi everyone,

 A few additional things to consider for scaling-up NCA to large
 datasets:

 - Take a look at the t-SNE (technique for visualization/dim reduction
 very similar to NCA) implementations, I think they have a few speed-up
 tricks that you could potentially re-use:
 http://lvdmaaten.github.io/tsne/

 - Like you said, SGD can help reduce the computational cost - you could
 also consider recent improvements of SGD, such as SAG/SAGA, SVRG, etc.

 - Similarly to what was suggested in previous replies, a general idea is
 to only consider a neighborhood around each point (either fixed in
 advance, or updated every now and then during the course of
 optimization), since the probabilities decrease very fast with the
 distance so farther points can be safely ignored in the computation.
 This is explored for instance in:
 http://dl.acm.org/citation.cfm?id=2142432

 - Another related idea is to construct class representatives (for
 instance using k-means), and to model the distribution only wrt these
 points instead of the entire dataset. This is especially useful if some
 classes are very large. An extreme version of this is to reframe NCA for
 a Nearest Class Mean classifier, where each class is only modeled by its
 center:

 https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf

 Hope this helps.

 Aurelien

 On 5/28/15 11:20 PM, Andreas Mueller wrote:
 
 
  On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
 
  Code-wise, I would attack the problem as a function first. Write a
  function that takes X and y (plus maybe some options) and gives back
  L. You can put a skeleton of a sklearn estimator around it by calling
  this function from fit.
  Please keep your code either in a sklearn WIP PR or a public gist, so
  it can be reviewed. Writing benchmarks can be framed as writing
  examples, i.e. plot_* functions (maybe Andy or Olivier have a comment
  on how benchmarks have been handled in the past?).
 
  There is a benchmark folder, which is in horrible shape.
  Basically there are three ways to do it: examples (with or without plot
  depending on the runtime), a script in the benchmark folder, or a gist.
  Often we just use a gist and the PR person posts the output. Not that
  great for reproducibility, though.
 
 

Re: [Scikit-learn-general] [GSoC2015 metric learning]

2015-05-31 Thread Artem
Apparently, the problem was with multiplying big 4D tensors:

grad = np.sum(proba[:, :, np.newaxis, np.newaxis] * outer, axis=1).sum(axis=0)

I don't know the details, but the line profiler said that most of the time
was spent here (and on the other similar line), probably due to temporary
copies. I recalled your words on np.einsum, replaced the sums with it, and
got quite a speedup (~4x)!
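
For reference, a self-contained sketch of that replacement, assuming proba
has shape (n, n) and outer has shape (n, n, d, d):

import numpy as np

n, d = 50, 4
proba = np.random.rand(n, n)
outer = np.random.rand(n, n, d, d)

# Original: broadcasts to an extra (n, n, d, d) temporary before summing.
grad_old = np.sum(proba[:, :, np.newaxis, np.newaxis] * outer, axis=1).sum(axis=0)

# einsum: sums over i and j directly, without materializing the product.
grad_new = np.einsum('ij,ijkl->kl', proba, outer)

assert np.allclose(grad_old, grad_new)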

New output log:
https://gist.github.com/Barmaley-exe/d5ca99d7b94fb665ad03. Now the fully
vectorized version takes ~2s per iteration and is faster than the
semivectorized one.

P.S. I made the same optimization in the semivectorized version, but its
speedup isn't that significant.

On Sun, May 31, 2015 at 9:29 PM, Michael Eickenberg 
michael.eickenb...@gmail.com wrote:



 On Sun, May 31, 2015 at 7:25 PM, Artem barmaley@gmail.com wrote:

 I added a simple benchmark
 https://github.com/Barmaley-exe/scikit-learn/blob/metric-learning/benchmarks/bench_nca.py
 that compares NCA-assisted 1NN with the default one (with Euclidean distance)
 on the Wine dataset (it was one of the datasets reported in the NCA paper).
 See my output here: https://gist.github.com/Barmaley-exe/a713f23f74eb53a2f2bd.

 It also compares the semivectorized and vectorized implementations:
 surprisingly, the semivectorized one is about 2 times faster. I think this
 might be a reason to throw the fully vectorized implementation
 (nca_vectorized_oracle) away.


 This is very interesting. I like the way you built the semi-vectorized
 version, and it seems to show that you are still benefiting from the
 remaining vectorizations you have.
 However, I am surprised to see such a difference between the two methods.
 Do you have any idea why this could be the case? Where do you think the
 vectorized version loses all this time?





 On Sat, May 30, 2015 at 12:33 AM, Michael Eickenberg 
 michael.eickenb...@gmail.com wrote:

 Thanks for the update.

  So, what's the consensus on benchmarks? I can share ipython notebooks
 via gist, for example.

 My (weak) preference would be to have a script within the sklearn repo,
 just to keep stuff in one place for easy future reference.

 Michael


 On Fri, May 29, 2015 at 6:24 PM, Artem barmaley@gmail.com wrote:

 So, I created a WIP PR dedicated to NCA:
 https://github.com/scikit-learn/scikit-learn/pull/4789

 As suggested by Michael, I refactored the meat into a function. I also
 rewrote it as a first-order oracle, so I can (and do) use scipy's
 optimizers. I've seen scipy.optimize.minimize (apparently with BFGS)
 sometimes stop at some weird point (a local minimum / saddle point?),
 whereas gradient descent seems to always converge. I didn't test either
 of them extensively, though.

 I also fully vectorized function and gradient calculations, no loops
 involved.

 So, what's the consensus on benchmarks? I can share ipython notebooks
 via gist, for example.

 On Fri, May 29, 2015 at 10:51 AM, Michael Eickenberg 
 michael.eickenb...@gmail.com wrote:

 Hi Aurélien,

 thanks for these very good pointers!
 (Now we also know who else to bug periodically for opinions ;))

 Michael


 On Fri, May 29, 2015 at 12:05 AM, Aurélien Bellet 
 aurelien.bel...@telecom-paristech.fr wrote:

 Hi everyone,

 A few additional things to consider for scaling-up NCA to large
 datasets:

 - Take a look at the t-SNE (technique for visualization/dim reduction
 very similar to NCA) implementations, I think they have a few speed-up
 tricks that you could potentially re-use:
 http://lvdmaaten.github.io/tsne/

 - Like you said, SGD can help reduce the computational cost - you could
 also consider recent improvements of SGD, such as SAG/SAGA, SVRG, etc.

 - Similarly to what was suggested in previous replies, a general idea is
 to only consider a neighborhood around each point (either fixed in
 advance, or updated every now and then during the course of
 optimization), since the probabilities decrease very fast with the
 distance so farther points can be safely ignored in the computation.
 This is explored for instance in:
 http://dl.acm.org/citation.cfm?id=2142432

 - Another related idea is to construct class representatives (for
 instance using k-means), and to model the distribution only wrt these
 points instead of the entire dataset. This is especially useful if some
 classes are very large. An extreme version of this is to reframe NCA for
 a Nearest Class Mean classifier, where each class is only modeled by its
 center:

 https://hal.archives-ouvertes.fr/file/index/docid/722313/filename/mensink12eccv.final.pdf

 Hope this helps.

 Aurelien

 On 5/28/15 11:20 PM, Andreas Mueller wrote:
 
 
  On 05/28/2015 05:11 PM, Michael Eickenberg wrote:
 
  Code-wise, I would attack the problem as a function first. Write a
  function that takes X and y (plus maybe some options) and gives back
  L. You can put a skeleton of a sklearn estimator around it by calling
  this function from fit.
  Please keep your code either in a sklearn WIP PR or a public gist, so
  it can be reviewed. 

[Scikit-learn-general] my silence

2015-05-31 Thread Joel Nothman
Just a quick note that I've been silent lately because I've been Busy With
Life, but also because github was notifying an email address hosted at my
previous employer, which was deactivated a fortnight ago. If there were
issues that sought my particular attention, please let me know.
--
___
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general