Hello R experts,
I need to draw a random subsample from a data set, count the frequency of
each taxon in it, and then take the median of those frequencies over, say,
1000 replicates. Should I do this by subsampling in a loop, or with a
bootstrap?
My data format is:

> str(Data)
'data.frame': 155752 obs. of  2 variables:
 $ ReadName: Factor w/ 155752 levels "HWI-ST884185C1PEWACXX:3:1101:10047:62439#0/2",..: 49 325 800 624 786 77 203 825 249 369 ...
 $ Taxa    : Factor w/ 25 levels "Acidimicrobium",..: 1 1 1 1 1 1 1 1 1 1 ..

If I then take a subsample of 10 rows, for example:

> Data[sample(nrow(Data), 10), ]
                                           ReadName          Taxa
122657 HWI-ST884185C1PEWACXX:4:2105:16386:68246#0/2       Frankia
91721  HWI-ST884185C1PEWACXX:3:2314:16967:14996#0/1   Rhodococcus
62980  HWI-ST884185C1PEWACXX:4:2101:13052:29946#0/1 Mycobacterium
::::
::::

and, with that subsample stored in Sample, count the frequencies using ddply
from the plyr package:

> counts <- ddply(Sample, .(Sample$Taxa), nrow)

I get a result like this:

> counts
    Sample$Taxa V1
1   Actinomyces  1
2       Frankia  3
3      Gordonia  1
4 Modestobacter  1
5 Mycobacterium  2
6   Rhodococcus  1
7  Tsukamurella  1
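
For what it is worth, I think the same per-taxon count could be sketched in base R with table(). Unlike the ddply output above, which lists only the taxa that were drawn, table() reports every factor level and gives 0 for taxa missing from the subsample; that may matter once medians are taken across replicates. The toy data frame and the names toy, toy_sample and counts below are made up for illustration and are not my real data.

```r
# Toy stand-in for Data (hypothetical values, only three taxa)
set.seed(1)
toy <- data.frame(
  ReadName = paste0("read", 1:20),
  Taxa     = factor(rep(c("Frankia", "Gordonia", "Rhodococcus"),
                        times = c(12, 7, 1)))
)

# Draw 10 rows without replacement and keep the subsample
toy_sample <- toy[sample(nrow(toy), 10), ]

# table() counts every factor level, so taxa not drawn show up as 0
counts <- table(toy_sample$Taxa)
print(counts)
```

The counts always sum to the subsample size, and the names of the result are the levels of Taxa in their original order.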

Now I need to do this 1000 times and take the median of the counts (the V1
column) for each taxon. Can you please suggest the quickest way?

I want to do this with really big data: the subsample size will be 1 million
rows, with 1000 replicates, drawn from a data set of 10 million rows.

Thanks a lot for your help.
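
To make the question concrete, here is a minimal sketch of the loop-style approach I have in mind, on toy data. It assumes that only the Taxa column is needed for the counts, so it subsamples the integer codes of the factor rather than whole rows. The sizes and the names taxa, n_sub, n_rep, count_mat and median_counts are made up for illustration; I do not know whether this is the quickest way at full scale.

```r
set.seed(42)

# Toy stand-in for the Taxa column (hypothetical sizes and taxa)
taxa <- factor(sample(c("Frankia", "Gordonia", "Rhodococcus"),
                      size = 10000, replace = TRUE,
                      prob = c(0.6, 0.3, 0.1)))

n_sub <- 1000  # subsample size (1 million in my real case)
n_rep <- 200   # number of replicates (1000 in my real case)

codes <- as.integer(taxa)  # integer codes of the factor levels
n_lev <- nlevels(taxa)

# One column per replicate, one row per taxon; taxa not drawn count as 0.
# sample.int() draws without replacement by default.
count_mat <- replicate(
  n_rep,
  tabulate(codes[sample.int(length(codes), n_sub)], nbins = n_lev)
)
rownames(count_mat) <- levels(taxa)

# Median count per taxon across all replicates
median_counts <- apply(count_mat, 1, median)
print(median_counts)
```

Passing replace = TRUE to sample.int() would turn each draw into a bootstrap-style resample instead of a subsample without replacement.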

Mitra


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
