Re: [R] problem with white space

2008-03-30 Thread jim holtman
Here is one way of doing it.  I would suggest that you read in the
data with readLines and then combine it into one single string so that
you can use substring on it.  Since you did not provide commented,
minimal, self-contained, reproducible code, I will take a guess at
what your data looks like:

# create some test data -- in practice this might be read in with readLines
sdata <- sapply(1:10, function(x){  # 10 lines of strings with 50 characters
    paste(sample(LETTERS, 50, TRUE), collapse='')
})
# put into one large string so you can do substring on it
sdata <- paste(sdata, collapse='')
# now create 10 samples of size 20 and write them to files (file1, file2, ... file10)
for (i in 1:10){
    x <- sample(nchar(sdata), 20)
    writeLines(paste(substring(sdata, x, x), collapse=''),
        con=paste("file", i, sep=''))
}
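
As a variation on the code above, here is a minimal sketch that splits the text into single characters with strsplit, resamples with replacement, and folds the result back into 50-character lines with no spaces.  The temporary input file and the helper name wrap_lines are assumptions made for illustration; they are not part of the original post.

```r
# Hypothetical stand-in for the real data file: 10 lines of 50 characters
infile <- tempfile(fileext = ".txt")
writeLines(replicate(10, paste(sample(LETTERS, 50, TRUE), collapse = "")), infile)

# Read the lines, join them, and split into a vector of single characters
chars <- strsplit(paste(readLines(infile), collapse = ""), "")[[1]]

# Resample with replacement
n <- 200
boot <- sample(chars, n, replace = TRUE)

# Hypothetical helper: fold a character vector into lines of `width` characters
wrap_lines <- function(v, width = 50) {
  groups <- ceiling(seq_along(v) / width)
  unname(vapply(split(v, groups), paste, character(1), collapse = ""))
}

out <- wrap_lines(boot)
outfile <- tempfile(fileext = ".txt")
writeLines(out, outfile)
```

Splitting once up front avoids one substring call per sampled position, which may matter when millions of characters are drawn.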





On Sun, Mar 30, 2008 at 3:41 PM, Suraaga Kulkarni
[EMAIL PROTECTED] wrote:
 Hi,

 I need to resample characters from a dataset that consists of an extremely
 long string that is written over hundreds of thousands of lines, each of
 length 50 characters.  I am currently doing this by first inserting a space
 after each character in the dataset and then using the following commands:

 y <- as.matrix(read.table("data.txt"), stringsAsFactors=FALSE)
 bstrap <- sample(length(y), 10, TRUE)
 write(y[bstrap], file="Rep1.txt", ncolumns=50, append=FALSE)
 bstrap <- sample(length(y), 10, TRUE)
 write(y[bstrap], file="Rep2.txt", ncolumns=50, append=FALSE)
 bstrap <- sample(length(y), 10, TRUE)
 .
 .
 .
 and so on for 500 reps.


 I think there should be a better way of doing this.  My specific questions:

 1. Is there a way to avoid inserting spaces between the characters before
 calling the sample command (because I don't want spaces between the
 resampled characters in the output either; see number 2 below)?

 2. If I have no choice but to insert the spaces in my data before
 resampling, is there a way to output the resampled data without spaces,
 simply as 50-character long strings one below the other?  I tried adding
 the argument strip.white=TRUE to the write command, but it
 gave me an error as it did not understand the argument.

 3. Finally, since I have to get 500 such resampled reps from each dataset
 (and there are over 20 such huge datasets) is there a way around having to
 write a separate write command for each rep?

 Any suggestions will be greatly appreciated.

 Thanks,

 S.
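
Regarding question 3 above, the following is a minimal sketch of how a loop with generated file names could replace the repeated write commands.  The function name write_replicates, its arguments, and the stand-in data are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: write `reps` bootstrap replicates of `chars` into `outdir`
write_replicates <- function(chars, reps, size, outdir, prefix = "Rep") {
  # Build Rep1.txt, Rep2.txt, ... in one step
  files <- file.path(outdir, sprintf("%s%d.txt", prefix, seq_len(reps)))
  for (f in files) {
    boot <- sample(chars, size, replace = TRUE)  # resample with replacement
    writeLines(paste(boot, collapse = ""), f)    # one string, no spaces
  }
  invisible(files)
}

# Stand-in data; the real input would be read from the data file
chars <- sample(c("A", "C", "G", "T"), 1000, TRUE)
outdir <- tempfile("reps")
dir.create(outdir)
files <- write_replicates(chars, reps = 5, size = 100, outdir = outdir)
```

The same function could be called once per dataset, for example inside a loop over the input file names, with reps = 500.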


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?



Re: [R] problem with white space

2008-03-30 Thread jim holtman
How long is it taking?  Can you send me the code that you are using?

Another technique is to recode your characters into numbers and store
them as raw bytes.  You can then sample the values and reconstruct the
output.  Here is a faster way:

# create some test data -- in practice this might be read in with readLines
# use 'raw' class for the data.
sdata <- sapply(1:10, function(x){
    # encode the characters as numbers
    charToRaw(paste(sample(LETTERS, 50, TRUE), collapse=''))
})
# now create 10 samples of size 10 (with replacement) and write them to files
for (i in 1:10){
    x <- sample(sdata, 10, TRUE)
    # convert back to characters
    writeLines(rawToChar(x), con=paste("file", i, sep=''))
}
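
To show how the raw approach above might be applied to a file rather than generated test data, here is a minimal sketch that reads the file as bytes with readBin, drops the line terminators, resamples with replacement, and writes 50-character lines.  It assumes single-byte characters (such as A, C, G, T), and the temporary files stand in for the real data.

```r
# Hypothetical stand-in for the real data file: 10 lines of 50 characters
infile <- tempfile(fileext = ".txt")
writeLines(replicate(10, paste(sample(c("A", "C", "G", "T"), 50, TRUE),
                               collapse = "")), infile)

# Read the whole file as raw bytes and drop line terminators (LF = 0a, CR = 0d)
bytes <- readBin(infile, what = "raw", n = file.size(infile))
bytes <- bytes[!(bytes %in% as.raw(c(0x0a, 0x0d)))]

# Resample byte positions with replacement
size <- 120
boot <- bytes[sample.int(length(bytes), size, replace = TRUE)]

# Fold the sample into lines of at most 50 characters
starts <- seq(1, size, by = 50)
lines <- vapply(starts, function(s) rawToChar(boot[s:min(s + 49, size)]),
                character(1))
outfile <- tempfile(fileext = ".txt")
writeLines(lines, outfile)
```

Working on bytes keeps the data as one compact vector, so sampling is a single indexing operation regardless of how many characters are drawn.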






On Sun, Mar 30, 2008 at 6:06 PM, Suraaga Kulkarni
[EMAIL PROTECTED] wrote:
 Jim,

 Thanks very much.  I am very new to R and am trying to understand your code.
 It works perfectly on your sample data of course.  I tried your code on my
 data.  While it works, it takes too much time to generate each replicate.
 At present I'm outputting the replicates with only 2000 resampled
 characters.  I actually need to resample something like 1-5 million
 characters.  I work with the human genome, and need to generate 500
 bootstrap replicates of a scaled down version (about 2%) of each chromosome
 by means of resampling with replacement.

 Sorry about the cryptic code but I thought my initial description of the
 problem explained it.  In any case, your guess was correct.

 Let me see if I can rework your code to suit my purposes.  In the meanwhile,
 if you have any other suggestions, I'll be happy to hear them.

 Thanks again for the prompt response.

 S.








