Hello R experts,

I need to draw a random subsample from a data set and count the frequency of each taxon in it, then take the median of those frequencies over, say, 1000 replicates. Should I do this by subsampling in a loop, or with a bootstrap? My data look like this:

> str(Data)
'data.frame':   155752 obs. of  2 variables:
 $ ReadName: Factor w/ 155752 levels "HWI-ST884185C1PEWACXX:3:1101:10047:62439#0/2",..: 49 325 800 624 786 77 203 825 249 369 ...
 $ Taxa    : Factor w/ 25 levels "Acidimicrobium",..: 1 1 1 1 1 1 1 1 1 1 ...

If I take a sample of 10 rows, I get:

> Data[sample(nrow(Data), 10), ]
                                           ReadName          Taxa
122657 HWI-ST884185C1PEWACXX:4:2105:16386:68246#0/2       Frankia
91721  HWI-ST884185C1PEWACXX:3:2314:16967:14996#0/1   Rhodococcus
62980  HWI-ST884185C1PEWACXX:4:2101:13052:29946#0/1 Mycobacterium
::::                                                         ::::

I then count the frequencies in that sample with:

counts <- ddply(Sample, .(Sample$Taxa), nrow)

which gives:

> counts
    Sample$Taxa V1
1   Actinomyces  1
2       Frankia  3
3      Gordonia  1
4 Modestobacter  1
5 Mycobacterium  2
6   Rhodococcus  1
7  Tsukamurella  1

Now I need to do this 1000 times and get the median of the counts (the V1 column). Can you please suggest the quickest way? I want to do this with really big data: the subsample size will be 1 million rows, with 1000 replicates, drawn from a data set of 10 million rows.

Thanks a lot for your help.

Mitra
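
For reference, here is a minimal sketch of one way the repeated subsampling described above could be done. It is an illustration, not a benchmarked answer. It assumes sampling without replacement (plain subsampling; a bootstrap would normally resample with replacement), and it uses base R's tabulate() on the factor codes in place of ddply(), so that taxa absent from a subsample are counted as 0 rather than dropped. The toy data frame, the names n_sub, n_rep, count_mat and median_counts, and the small sizes are all made up for the example.

```r
# Sketch: repeated subsampling, then the median count per taxon.
set.seed(1)  # assumption: fixed seed only to make the example reproducible

# Toy stand-in for the real data (hypothetical values, 4 taxa instead of 25)
taxa_levels <- c("Acidimicrobium", "Frankia", "Mycobacterium", "Rhodococcus")
Data <- data.frame(
  ReadName = paste0("read", seq_len(1000)),
  Taxa     = factor(sample(taxa_levels, 1000, replace = TRUE),
                    levels = taxa_levels)
)

n_sub <- 100  # subsample size (1 million in the real problem)
n_rep <- 50   # number of replicates (1000 in the real problem)

# Only the Taxa column is needed, so work on its integer codes
codes  <- as.integer(Data$Taxa)
n_taxa <- nlevels(Data$Taxa)

# One column per replicate, one row per taxon; tabulate() keeps zero counts
count_mat <- replicate(
  n_rep,
  tabulate(codes[sample.int(length(codes), n_sub)], nbins = n_taxa)
)
rownames(count_mat) <- levels(Data$Taxa)

# Median count of each taxon across the replicates
median_counts <- apply(count_mat, 1, median)
print(median_counts)
```

Working on the integer codes avoids copying the ReadName column in every replicate, which is likely to matter more as the data grow. Whether this is fast enough at 1 million rows by 1000 replicates would need to be timed on the real data.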