Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-11 Thread Łukasz Ręcławowicz
On 12 August 2011 at 05:19, Chris Howden <
ch...@trickysolutions.com.au> wrote:

> Thanks for the suggestion, I'll look into it
>

It seems to work! :)

library(multiv)
data(iris)
iris <- as.matrix(iris[, 1:4])
h <- hierclust(iris, method = 2)
d <- dist(iris)
hk <- hclust(d)
> str(hk)
List of 7
$ merge : int [1:149, 1:2] -102 -8 -1 -10 -129 -11 -5 -20 -30 -58 ...
$ height : num [1:149] 0 0.1 0.1 0.1 0.1 ...
$ order : int [1:150] 108 131 103 126 130 119 106 123 118 132 ...
$ labels : NULL
$ method : chr "complete"
$ call : language hclust(d = d)
$ dist.method: chr "euclidean"
- attr(*, "class")= chr "hclust"
> str(h)
List of 3
$ merge : int [1:149, 1:2] -102 -8 -1 -10 -129 -11 -41 -5 -20 7 ...
$ height: num [1:149] 0 0.01 0.01 0.01 0.01 ...
$ order : int [1:150] 42 23 15 16 45 34 33 17 21 32 ...

test.mat <- matrix(rnorm(90523 * 24), ncol = 24)
out <- hierclust(test.mat, method = 1, bign = TRUE)
> print(object.size(out), units = "Mb")
1.7 Mb
> str(out)
List of 3
$ merge : int [1:90522, 1:2] -35562 -19476 -60344 -66060 -38949 -14537 -3322 -20248 -19464 -78693 ...
$ height: num [1:90522] 1.93 1.94 1.96 1.98 2 ...
$ order : int [1:90523] 24026 61915 71685 16317 85828 11577 36034 37324 65754 55381 ...
> R.version$os
[1] "mingw32"


-- 
Miłego dnia


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-11 Thread Łukasz Ręcławowicz
2011/8/11 Chris Howden 

> In that my distance matrix has too many entries for R's architecture to
> know how to store in memory
>

There was a multiv package with a hierclust() function and a bign option.
From its help page: "Is n big? If storage is problemsome, a different
implementation of the Ward criterion may be tried. This determines
dissimilarities on the fly, and hence requires O(n) storage."
But the package is orphaned now, and Makeconf has trouble building the DLL.



-- 
Miłego dnia



Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-10 Thread Chris Howden
Thanks Corey,



I’ve looked into them before and I don’t think they can help me with this
problem. The Big functions are great for handling and analysing data sets
that are too big for R to store in memory.



However I believe my problem goes one step beyond that: my distance
matrix has too many entries for R’s architecture to index in memory,
even if I had enough memory to hold it.



Again, I’m no expert in this so I may be wrong.



Chris Howden

Founding Partner

Tricky Solutions

Tricky Solutions 4 Tricky Problems

Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training

(mobile) 0410 689 945

(fax / office)

ch...@trickysolutions.com.au



Disclaimer: The information in this email and any attachments to it are
confidential and may contain legally privileged information. If you are not
the named or intended recipient, please delete this communication and
contact us immediately. Please note you are not authorised to copy, use or
disclose this communication or any attachments without our consent. Although
this email has been checked by anti-virus software, there is a risk that
email messages may be corrupted or infected by viruses or other
interferences. No responsibility is accepted for such interference. Unless
expressly stated, the views of the writer are not those of the company. Tricky
Solutions always does our best to provide accurate forecasts and analyses
based on the data supplied, however it is possible that some important
predictors were not included in the data sent to us. Information provided by
us should not be solely relied upon when making decisions and clients should
use their own judgement.






Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-10 Thread Corey Dow-Hygelund
You might want to look into the packages bigmemory and biganalytics.

Corey




-- 
*The mark of a successful man is one that has spent an entire day on the
bank of a river without feeling guilty about it.*



Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-09 Thread Prof Brian Ripley

On Tue, 9 Aug 2011, Peter Langfelder wrote:

>> Assuming you need the full distance matrix at one time (which you do not
>> for hierarchical clustering, itself a highly dubious method for more than
>> a few hundred points).
>
> Apologies if this hijacks the thread, but why is hierarchical
> clustering "highly dubious for more than a few hundred points"?
>
> Peter

That is off-topic for R-help: see the posting guide.



--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-09 Thread Peter Langfelder
> Assuming you need the full distance matrix at one time (which you do not for
> hierarchical clustering, itself a highly dubious method for more than a few
> hundred points).

Apologies if this hijacks the thread, but why is hierarchical
clustering "highly dubious for more than a few
hundred points"?

Peter



[R] Can R handle a matrix with 8 billion entries?

2011-08-09 Thread Peter Langfelder
Sorry if this is a duplicate... my email is giving me trouble this evening...

On Tue, Aug 9, 2011 at 8:38 PM, Chris Howden
 wrote:
> However the R memory limit help mentions  “On all builds of R, the maximum
> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Now as the
> distance matrix I require has more elements than this does this mean it’s
> too big for R no matter what I do?

You have understood correctly.

>
> Any ideas would be welcome.

You have a couple of options, some more involved than others. If you want
to stick with R, I would suggest a two-step clustering approach: first use
k-means (assuming your distance is Euclidean) or a modification of it (for
example, for correlation-based distances, the WGCNA package contains a
function called projectiveKMeans) to pre-cluster your 90k+ variables into
"blocks" of about 8-10k each (roughly as much as your computer will
handle). The k-means algorithm only requires storage of order n*k, where k
is the number of clusters (or blocks), which can be small, say 500, and n
is the number of your variables. Then do hierarchical clustering within
each block separately. Make sure you install and load the flashClust or
fastcluster package so that the hierarchical clustering runs reasonably
fast (the stock R implementation of hclust is very slow on large data
sets).

The WGCNA package mentioned above contains a function called
blockwiseModules that carries out just such a procedure, but there the
distance is based on correlations, which may or may not suit your problem.
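The two-step approach can be sketched as follows; the matrix size, the
choice of k, and the use of stats::hclust in place of flashClust or
fastcluster are all illustrative assumptions, not the real settings:

```r
## Two-step clustering sketch: k-means pre-clustering into blocks,
## then hierarchical clustering within each block.
## A small random matrix stands in for the real 90523 x 24 data.
set.seed(1)
x <- matrix(rnorm(1000 * 5), ncol = 5)

## Step 1: k-means needs only O(n * k) storage -- never the full
## n x n distance matrix. k = 4 here; ~500 was suggested above.
km <- kmeans(x, centers = 4)

## Step 2: hclust() within each block; each dist() call now sees only
## a few hundred rows, so the distance objects stay small.
trees <- lapply(split(seq_len(nrow(x)), km$cluster), function(idx) {
  hclust(dist(x[idx, , drop = FALSE]))
})
length(trees)  # one dendrogram per block
```

With real data, fastcluster::hclust (or flashClust::hclust) is a drop-in
replacement for the stats::hclust call inside the loop.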

HTH,

Peter



Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-09 Thread Prof Brian Ripley

On Wed, 10 Aug 2011, David Winsemius wrote:

> On Aug 9, 2011, at 11:38 PM, Chris Howden wrote:
>
>> Hi,
>>
>> I’m trying to do a hierarchical cluster analysis in R with a Big Data set.
>> I’m running into problems using the dist() function.
>>
>> I’ve been looking at a few threads about R’s memory and have read the
>> memory limits section in R help. However I’m no computer expert so I’m
>> hoping I’ve misunderstood something and R can handle my Big Data set,
>> somehow. Although at the moment I think my dataset is simply too big and
>> there is no way around it, but I’d like to be proved wrong!
>>
>> My data set has 90523 rows of data and 24 columns.
>>
>> My understanding is that this means the distance matrix has a min of
>> 90523^2 elements which is 8194413529. Which roughly translates as 8GB of

A bit less than half that: it is symmetric.

>> memory being required (if I assume each entry requires 1 bit).

Hmm, that would be a 0/1 distance: there are simpler methods to
cluster such distances.

>> I only have 4GB on a 32bit build of windows and R. So there is no
>> way that’s going to work.
>>
>> So then I thought of getting access to a more powerful computer, and maybe
>> using cloud computing.
>>
>> However the R memory limit help mentions “On all builds of R, the maximum
>> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Now as the
>> distance matrix I require has more elements than this does this mean it’s
>> too big for R no matter what I do?
>
> Yes. Vector indexing is done with 4 byte integers.

Assuming you need the full distance matrix at one time (which you do
not for hierarchical clustering, itself a highly dubious method for
more than a few hundred points).

> --
> David Winsemius, MD
> West Hartford, CT



--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595


Re: [R] Can R handle a matrix with 8 billion entries?

2011-08-09 Thread David Winsemius


On Aug 9, 2011, at 11:38 PM, Chris Howden wrote:

> Hi,
>
> I’m trying to do a hierarchical cluster analysis in R with a Big Data set.
> I’m running into problems using the dist() function.
>
> I’ve been looking at a few threads about R’s memory and have read the
> memory limits section in R help. However I’m no computer expert so I’m
> hoping I’ve misunderstood something and R can handle my Big Data set,
> somehow. Although at the moment I think my dataset is simply too big and
> there is no way around it, but I’d like to be proved wrong!
>
> My data set has 90523 rows of data and 24 columns.
>
> My understanding is that this means the distance matrix has a min of
> 90523^2 elements which is 8194413529. Which roughly translates as 8GB of
> memory being required (if I assume each entry requires 1 bit). I only have
> 4GB on a 32bit build of windows and R. So there is no way that’s going to
> work.
>
> So then I thought of getting access to a more powerful computer, and maybe
> using cloud computing.
>
> However the R memory limit help mentions “On all builds of R, the maximum
> length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Now as the
> distance matrix I require has more elements than this does this mean it’s
> too big for R no matter what I do?

Yes. Vector indexing is done with 4 byte integers.

--

David Winsemius, MD
West Hartford, CT



[R] Can R handle a matrix with 8 billion entries?

2011-08-09 Thread Chris Howden
Hi,

I’m trying to do a hierarchical cluster analysis in R with a Big Data set.
I’m running into problems using the dist() function.

I’ve been looking at a few threads about R’s memory and have read the
memory limits section in R help. However I’m no computer expert so I’m
hoping I’ve misunderstood something and R can handle my Big Data set,
somehow. Although at the moment I think my dataset is simply too big and
there is no way around it, but I’d like to be proved wrong!

My data set has 90523 rows of data and 24 columns.

My understanding is that this means the distance matrix has a min of
90523^2 elements which is 8194413529. Which roughly translates as 8GB of
memory being required (if I assume each entry requires 1 bit). I only have
4GB on a 32bit build of windows and R. So there is no way that’s going to
work.

So then I thought of getting access to a more powerful computer, and maybe
using cloud computing.

However the R memory limit help mentions  “On all builds of R, the maximum
length (number of elements) of a vector is 2^31 - 1 ~ 2*10^9”. Now as the
distance matrix I require has more elements than this does this mean it’s
too big for R no matter what I do?
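The sizes involved can be checked directly at the console (numbers
recomputed here for illustration; the byte counts assume 8-byte doubles):

```r
## Sanity-checking the sizes quoted above.
n <- 90523
n^2                   # 8194413529 entries in the full n x n matrix
n * (n - 1) / 2       # 4097161503: dist() stores only the lower triangle
n^2 * 8 / 2^30        # ~61 GB if every entry is an 8-byte double
.Machine$integer.max  # 2147483647 = 2^31 - 1, the vector-length limit quoted
```

Even the lower triangle alone exceeds 2^31 - 1 elements, so it cannot fit
in a single vector regardless of available RAM.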

Any ideas would be welcome.

Thanks.


Chris Howden
Founding Partner
Tricky Solutions
Tricky Solutions 4 Tricky Problems
Evidence Based Strategic Development, IP Commercialisation and Innovation,
Data Analysis, Modelling and Training
(mobile) 0410 689 945
(fax / office)
ch...@trickysolutions.com.au

