[R] Sum of column from another df based of row values of df1
Dear All, I need to sum a column from another dataframe based on the row values of one dataframe. I am stuck in a loop trying to accomplish it and at current speed it will take more than 80 hours to complete. Needless to say I am looking for a more elegant/quicker solution. Really need some help here. Here is the issue: I have a dataframe CALL (the dput of head is given below) which has close to a million rows. There are 2 date columns which are of importance, DATE and EXPDATE. There is another dataframe, VOL (dput of head given), which has 2 columns, DATE and VOL. It has the volatility corresponding to each day and it has a total of 124 records (corresponding to 6 months). I want to add another column in the CALL dataframe which would contain the sum of all the volatilities from the VOL df for the period specified by the interval of DATE and EXPDATE in each row of CALL df. For ex: In the first row, DATE is '03-01-2011' and EXPDATE is '27-01-2011'. So I want the SUM column (A new column in CALL df) to contain the sum of volatilities of 03-01, 04-01, 05-01 till 27-01 from the VOL dataframe. I have to repeat this process for all the rows in the dataframe. Here is the for-loop version of the solution: for (k in 1:nrow(CALL)){ CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE = CALL$DATE[k] VOL$DATE = CALL$EXPDATE[k])) } The loop will run for close to a million times, it has been running for more than 10 hours and its just 12% complete. It would take more than 80 hours to complete, not the mention the toll it would take on my laptop. So is there a better way that I can accomplish this task? Any input would be greatly appreciated. Below are the dput of the two dataframes. One point of note is that there are only 124 DISTINCT values of DATE and 6 DISTINCT values of EXPDATE, in case it can be used in some way. dput(CALL) structure(list(NAME = c(STK, STK, STK, STK, STK, STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L, 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30, 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3, 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, 14977), class = Date), DTTM = structure(c(1294044516, 1294048422, 1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct, POSIXt), tzone = ), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c(NAME, EXPDATE, STRIKE, TMSTMP, PRICE, QUANT, DATE, DTTM, TTE), row.names = c(1, 2, 3, 4, 5, 6), class = data.frame) dput(VOL) structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, 1294252200, 1294338600, 1294597800), class = c(POSIXct, POSIXt ), tzone = ), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, 0.000216076713994619)), .Names = c(DATE, VOL), row.names = c(NA, 6L), class = data.frame) Please do let me know if any more information from my side would help or if I need to explain the issue more clearly. Any minor improvement will be great help. Thanks in advance. -Shivam -- *Victoria Concordia Crescit* __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Sum of column from another df based of row values of df1
Thanks a lot Jim, it works a treat. Just had to change the date format in the mCALL as well. But you saved me 80 hours of fretting and frustration. Really thankful for it. Regards, Shivam On Mon, Sep 10, 2012 at 1:33 AM, jim holtman jholt...@gmail.com wrote: How about an improvement to 16 seconds. The first thing to do is to convert you data to a matrix because accessing data in a dataframe is very expensive. If you run Rprof on your code you will see that all the time is spent in retrieving the information. Converting to a matrix and using matrix accessing is considerably faster. I did convert the POSIXct to Date. You were also paying a lot in the constant conversion of POSIXct to Date for your comparisons. I just replicated your CALL to 1 million rows for testing. CALL - + structure(list(NAME = c(STK, STK, STK, STK, STK, + STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, + 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L, + 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30, + 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3, + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, + 14977), class = Date), DTTM = structure(c(1294044516, 1294048422, + 1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct, + POSIXt), tzone = ), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c(NAME, + EXPDATE, STRIKE, TMSTMP, PRICE, QUANT, DATE, DTTM, + TTE), row.names = c(1, 2, 3, 4, 5, 6), class = data.frame) VOL - + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, + 1294252200, 1294338600, 1294597800), class = c(POSIXct, POSIXt + ), tzone = ), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, + 0.000216076713994619)), .Names = c(DATE, VOL), row.names = c(NA, + 6L), class = data.frame) # convert to matrices for faster testing mCALL - cbind(CALL$DATE, CALL$EXPDATE) mVOL - cbind(as.Date(VOL$DATE), VOL$VOL) # convert POSIXct to Date # create 1M rows in mCALL mCALL - rbind(mCALL, mCALL[rep(1L, 1e6),]) result - numeric(nrow(mCALL)) system.time({ + for (i in 1:nrow(mCALL)){ + result[i] - sum(mVOL[(mVOL[, 1L] = mCALL[i,1L]) + (mVOL[, 1L] = mCALL[i, 2L]), 2L]) + } + }) user system elapsed 15.940.00 16.07 On Sun, Sep 9, 2012 at 2:58 PM, Shivam shivamsi...@gmail.com wrote: Dear All, I need to sum a column from another dataframe based on the row values of one dataframe. I am stuck in a loop trying to accomplish it and at current speed it will take more than 80 hours to complete. Needless to say I am looking for a more elegant/quicker solution. Really need some help here. Here is the issue: I have a dataframe CALL (the dput of head is given below) which has close to a million rows. There are 2 date columns which are of importance, DATE and EXPDATE. There is another dataframe, VOL (dput of head given), which has 2 columns, DATE and VOL. It has the volatility corresponding to each day and it has a total of 124 records (corresponding to 6 months). I want to add another column in the CALL dataframe which would contain the sum of all the volatilities from the VOL df for the period specified by the interval of DATE and EXPDATE in each row of CALL df. For ex: In the first row, DATE is '03-01-2011' and EXPDATE is '27-01-2011'. So I want the SUM column (A new column in CALL df) to contain the sum of volatilities of 03-01, 04-01, 05-01 till 27-01 from the VOL dataframe. I have to repeat this process for all the rows in the dataframe. Here is the for-loop version of the solution: for (k in 1:nrow(CALL)){ CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE = CALL$DATE[k] VOL$DATE = CALL$EXPDATE[k])) } The loop will run for close to a million times, it has been running for more than 10 hours and its just 12% complete. It would take more than 80 hours to complete, not the mention the toll it would take on my laptop. So is there a better way that I can accomplish this task? Any input would be greatly appreciated. Below are the dput of the two dataframes. One point of note is that there are only 124 DISTINCT values of DATE and 6 DISTINCT values of EXPDATE, in case it can be used in some way. dput(CALL) structure(list(NAME = c(STK, STK, STK, STK, STK, STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L, 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30, 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3, 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, 14977), class = Date), DTTM = structure(c(1294044516, 1294048422, 1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct, POSIXt), tzone = ), TTE
Re: [R] Sum of column from another df based of row values of df1
Just to add, I did not know that the speed of data access is so much different in matrix and dataframes. This is one for the future. Thanks again Jim :) -Shivam On Mon, Sep 10, 2012 at 3:29 AM, Shivam shivamsi...@gmail.com wrote: Thanks a lot Jim, it works a treat. Just had to change the date format in the mCALL as well. But you saved me 80 hours of fretting and frustration. Really thankful for it. Regards, Shivam On Mon, Sep 10, 2012 at 1:33 AM, jim holtman jholt...@gmail.com wrote: How about an improvement to 16 seconds. The first thing to do is to convert you data to a matrix because accessing data in a dataframe is very expensive. If you run Rprof on your code you will see that all the time is spent in retrieving the information. Converting to a matrix and using matrix accessing is considerably faster. I did convert the POSIXct to Date. You were also paying a lot in the constant conversion of POSIXct to Date for your comparisons. I just replicated your CALL to 1 million rows for testing. CALL - + structure(list(NAME = c(STK, STK, STK, STK, STK, + STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, + 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L, + 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30, + 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3, + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L, + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977, + 14977), class = Date), DTTM = structure(c(1294044516, 1294048422, + 1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct, + POSIXt), tzone = ), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c(NAME, + EXPDATE, STRIKE, TMSTMP, PRICE, QUANT, DATE, DTTM, + TTE), row.names = c(1, 2, 3, 4, 5, 6), class = data.frame) VOL - + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800, + 1294252200, 1294338600, 1294597800), class = c(POSIXct, POSIXt + ), tzone = ), VOL = c(2.32666706461792e-05, 6.79164443640051e-05, + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865, + 0.000216076713994619)), .Names = c(DATE, VOL), row.names = c(NA, + 6L), class = data.frame) # convert to matrices for faster testing mCALL - cbind(CALL$DATE, CALL$EXPDATE) mVOL - cbind(as.Date(VOL$DATE), VOL$VOL) # convert POSIXct to Date # create 1M rows in mCALL mCALL - rbind(mCALL, mCALL[rep(1L, 1e6),]) result - numeric(nrow(mCALL)) system.time({ + for (i in 1:nrow(mCALL)){ + result[i] - sum(mVOL[(mVOL[, 1L] = mCALL[i,1L]) + (mVOL[, 1L] = mCALL[i, 2L]), 2L]) + } + }) user system elapsed 15.940.00 16.07 On Sun, Sep 9, 2012 at 2:58 PM, Shivam shivamsi...@gmail.com wrote: Dear All, I need to sum a column from another dataframe based on the row values of one dataframe. I am stuck in a loop trying to accomplish it and at current speed it will take more than 80 hours to complete. Needless to say I am looking for a more elegant/quicker solution. Really need some help here. Here is the issue: I have a dataframe CALL (the dput of head is given below) which has close to a million rows. There are 2 date columns which are of importance, DATE and EXPDATE. There is another dataframe, VOL (dput of head given), which has 2 columns, DATE and VOL. It has the volatility corresponding to each day and it has a total of 124 records (corresponding to 6 months). I want to add another column in the CALL dataframe which would contain the sum of all the volatilities from the VOL df for the period specified by the interval of DATE and EXPDATE in each row of CALL df. For ex: In the first row, DATE is '03-01-2011' and EXPDATE is '27-01-2011'. So I want the SUM column (A new column in CALL df) to contain the sum of volatilities of 03-01, 04-01, 05-01 till 27-01 from the VOL dataframe. I have to repeat this process for all the rows in the dataframe. Here is the for-loop version of the solution: for (k in 1:nrow(CALL)){ CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE = CALL$DATE[k] VOL$DATE = CALL$EXPDATE[k])) } The loop will run for close to a million times, it has been running for more than 10 hours and its just 12% complete. It would take more than 80 hours to complete, not the mention the toll it would take on my laptop. So is there a better way that I can accomplish this task? Any input would be greatly appreciated. Below are the dput of the two dataframes. One point of note is that there are only 124 DISTINCT values of DATE and 6 DISTINCT values of EXPDATE, in case it can be used in some way. dput(CALL) structure(list(NAME = c(STK, STK, STK, STK, STK, STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029, 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L, 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30, 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3, 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L
[R] Query about TSRV
Hi All, This question may not belong here, but I asked this on the R-SIG-FINANCE list and so far have not got any reply, so was hoping someone here may help. I have a basic query about TSRV and was hoping you all can shed some light on the issue. I have 22500 records for each day. So if I take the slow frequency as 5 mins I can have 75 subgrids. An estimate of the 5-minute realized volatility can be obtained by selecting the first observation from each of the 75 five-minute periods (each containing 300 observations). Another estimate can be made by selecting the second price observation from each of the 75 sets. Similarly, 298 additional estimates can be obtained from the remaining data. Thus, 300 estimates of realized volatility can be obtained for a 5-minute sampling interval. The average of these 300 estimates is the low frequency volatility. If P1,P2,...P301,P302,...P22500 are the prices at each timestamp, then we are effectively doing the below (using simple returns for illustration only): (P301/P1)^2 + (P601/P301)^2 + (P901/P601)^2 + + (P22201/P21901)^2 = V1 (P302/P2)^2 + (P602/P302)^2 + (P902/P602)^2 + + (P22202/P21902)^2 = V2 (P303/P3)^2 + (P603/P303)^2 + (P903/P603)^2 + + (P22203/P21903)^2 = V3 (P600/P300)^2 + (P900/P600)^2 + (P1200/P900)^2 + + (P22500/P22200)^2 = V300 Then we average the Vi's (V1+V2++V300)/300. This gives us the 5 min realized volatility. Please correct me if I've made a mistake somewhere. As I think this was the approach suggested by Zhang et al in their paper. What my question is, is the intermediate step of calculating V1,V2 etc necessary? Cant we simply calculate returns at a lag of 300 for the entire day and then divide it by 300 to achieve the same result. What am I missing here? Can anyone point out please. Thanks and regards, Shivam __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Query about creating time sequences
Thanks for the effort Michael, but the problem here is that the dates for which the sequences need to be created have gaps in between. Basically I need the sequence for only those days on which the security market is open (I have the dates in a file which is present at the end of THIS mail). What I have been able to do is to create a list, where each element of the list is a sequence for a single day. It was done like below: for(i in 1:124){ seqtimes[[i]] = xts(,seq(as.POSIXct(paste(fdates[i],'09:15:00', sep= )), as.POSIXct(paste(fdates[i],'15:30:00', sep= )), by = 1))} where 'fdates' is the file which contains the dates for which the sequences need to be created. But now I am stuck. I need a way to get all these sequences in a (vector/dataframe/xts object) where all the list items are sequentially present. I tried merge.xts, but to no avail. seqtimes[[1]] Data: numeric(0) Index: POSIXct[1:22501], format: 2011-01-03 09:15:00 2011-01-03 09:15:01 2011-01-03 09:15:02 2011-01-03 09:15:03 2011-01-03 09:15:04 2011-01-03 09:15:05 ... seqtimes[[2]] Data: numeric(0) Index: POSIXct[1:22501], format: 2011-01-04 09:15:00 2011-01-04 09:15:01 2011-01-04 09:15:02 2011-01-04 09:15:03 2011-01-04 09:15:04 2011-01-04 09:15:05 ... tseq = merge.xts(seqtimes[[1]],seqtimes[[2]], all = TRUE) tseq Data: numeric(0) Index: integer(0) Any help would be greatly appreciated. Thanks in advance, Regards, Shivam P.S. - The dput of the fdates file: dput(fdates) structure(c(2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, 2011-01-10, 2011-01-11, 2011-01-12, 2011-01-13, 2011-01-14, 2011-01-17, 2011-01-18, 2011-01-19, 2011-01-20, 2011-01-21, 2011-01-24, 2011-01-25, 2011-01-27, 2011-01-28, 2011-01-31, 2011-02-01, 2011-02-02, 2011-02-03, 2011-02-04, 2011-02-07, 2011-02-08, 2011-02-09, 2011-02-10, 2011-02-11, 2011-02-14, 2011-02-15, 2011-02-16, 2011-02-17, 2011-02-18, 2011-02-21, 2011-02-22, 2011-02-23, 2011-02-24, 2011-02-25, 2011-02-28, 2011-03-01, 2011-03-03, 2011-03-04, 2011-03-07, 2011-03-08, 2011-03-09, 2011-03-10, 2011-03-11, 2011-03-14, 2011-03-15, 2011-03-16, 2011-03-17, 2011-03-18, 2011-03-21, 2011-03-22, 2011-03-23, 2011-03-24, 2011-03-25, 2011-03-28, 2011-03-29, 2011-03-30, 2011-03-31, 2011-04-01, 2011-04-04, 2011-04-05, 2011-04-06, 2011-04-07, 2011-04-08, 2011-04-11, 2011-04-13, 2011-04-15, 2011-04-18, 2011-04-19, 2011-04-20, 2011-04-21, 2011-04-25, 2011-04-26, 2011-04-27, 2011-04-28, 2011-04-29, 2011-05-02, 2011-05-03, 2011-05-04, 2011-05-05, 2011-05-06, 2011-05-09, 2011-05-10, 2011-05-11, 2011-05-12, 2011-05-13, 2011-05-16, 2011-05-17, 2011-05-18, 2011-05-19, 2011-05-20, 2011-05-23, 2011-05-24, 2011-05-25, 2011-05-26, 2011-05-27, 2011-05-30, 2011-05-31, 2011-06-01, 2011-06-02, 2011-06-03, 2011-06-06, 2011-06-07, 2011-06-08, 2011-06-09, 2011-06-10, 2011-06-13, 2011-06-14, 2011-06-15, 2011-06-16, 2011-06-17, 2011-06-20, 2011-06-21, 2011-06-22, 2011-06-23, 2011-06-24, 2011-06-27, 2011-06-28, 2011-06-29, 2011-06-30 ), .Dim = c(124L, 1L)) On Sat, May 26, 2012 at 6:22 AM, R. Michael Weylandt michael.weyla...@gmail.com michael.weyla...@gmail.com wrote: One (somewhat kludgy) way would be to use seq() to make one day's worth of times then to pass those to outer() to add in the needed days and then coerce the whole thing back to a sorted vector. I'm not at a computer right now so this won't be quite right but something like x - seq(x.start.first.day, x.end.first.day, by = sec) y - 24*60*60 *(1:n.days) sort(as.vector(outer(x, y, +))) Changing the order of x and y might make the sort unnecessary. M On May 25, 2012, at 1:14 PM, Shivam shivamsi...@gmail.com wrote: Hi All, I have a query about time based sequences. I know such questions have been asked a lot on forums, but I couldnt find the exact thing that I was looking for. I want to create a time-based sequence which will mimic the trading window AND would span multiple days. Something like below: 2011-01-03 09:15:00 IST 2011-01-03 09:15:01 IST 2011-01-03 15:29:59 IST 2011-01-03 15:30:00 IST 2011-01-04 09:15:00 IST 2011-01-04 09:15:01 IST 2011-01-04 15:29:59 IST 2011-01-04 15:30:00 IST Kindly notice the change of date in the sequence. The Indian Equity markets open at 09:15:00 and close at 15:30:00. I have equity data that spans 124 days, and I need to create a corresponding sequence which I will later use to regularize the irregular dataset to make a regular time-series. I was able to accomplish this task for a single day (i.e. creating a sequence then merging my dataset with it and use na.locf to make my dataset regular) but am unable to create a sequence for 'n' number of days. Can anyone help me with this? If it is of any help, I have a file which contains all the dates for which I need the sequence. The dput of the file is placed at the end of the email. One option is to create sequences for the entire days
Re: [R] Query about creating time sequences
Thanks for the responses ppl. @Gabor - The issue with your approach was that I had to select the time window for many days (124), which would be very difficult to achieve. I really appreciate you time though. @Jeff - Your solution works. I had to tweak it a little as I needed the sequence for each second (as against the 15 min interval you had exhibited). But I was able to achieve that without much hassle. Thanks a lot, been stuck at this for almost a week :) Pasting the tweaked version for future reference. pdates - as.POSIXct( fdates ) hstart - new_period( hour=9, minute=15, second=0) hend - new_period( hour=15, minute=30,second=0 ) mperiod - new_period( second=1 ) numperday - (hend-hstart)/mperiod dtms - expand.grid( dt=pdates, tm=hstart + mperiod * seq( from=0, to=numperday ) ) dtms$dtm - with( dtms, dt + tm ) dtms - dtms[ order( dtms$dtm ), ] Regards, Shivam On Mon, May 28, 2012 at 2:51 AM, Jeff Newmiller jdnew...@dcn.davis.ca.uswrote: Try this: # Setting TZ is optional, but I find it helps me to be more aware of # timezone effects Sys.setenv( TZ=Asia/Kolkata) library(lubridate) pdates - as.POSIXct( fdates ) hstart - new_period( hour=9, minute=15 ) hend - new_period( hour=15, minute=30 ) mperiod - new_period( minute=15 ) numperday - (hend-hstart)/mperiod dtms - expand.grid( dt=pdates, tm=hstart + mperiod * seq( from=0, to=numperday ) ) dtms$dtm - with( dtyms, dt + tm ) dtms - dtms[ order( dtms$dtm ), ] You can discard everything but dtms$dtm once it has been created. On Mon, 28 May 2012, Shivam wrote: Thanks for the effort Michael, but the problem here is that the dates for which the sequences need to be created have gaps in between. Basically I need the sequence for only those days on which the security market is open (I have the dates in a file which is present at the end of THIS mail). What I have been able to do is to create a list, where each element of the list is a sequence for a single day. It was done like below: for(i in 1:124){ seqtimes[[i]] = xts(,seq(as.POSIXct(paste(**fdates[i],'09:15:00', sep= )), as.POSIXct(paste(fdates[i],'**15:30:00', sep= )), by = 1))} where 'fdates' is the file which contains the dates for which the sequences need to be created. But now I am stuck. I need a way to get all these sequences in a (vector/dataframe/xts object) where all the list items are sequentially present. I tried merge.xts, but to no avail. seqtimes[[1]] Data: numeric(0) Index: POSIXct[1:22501], format: 2011-01-03 09:15:00 2011-01-03 09:15:01 2011-01-03 09:15:02 2011-01-03 09:15:03 2011-01-03 09:15:04 2011-01-03 09:15:05 ... seqtimes[[2]] Data: numeric(0) Index: POSIXct[1:22501], format: 2011-01-04 09:15:00 2011-01-04 09:15:01 2011-01-04 09:15:02 2011-01-04 09:15:03 2011-01-04 09:15:04 2011-01-04 09:15:05 ... tseq = merge.xts(seqtimes[[1]],**seqtimes[[2]], all = TRUE) tseq Data: numeric(0) Index: integer(0) Any help would be greatly appreciated. Thanks in advance, Regards, Shivam P.S. - The dput of the fdates file: dput(fdates) structure(c(2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, 2011-01-10, 2011-01-11, 2011-01-12, 2011-01-13, 2011-01-14, 2011-01-17, 2011-01-18, 2011-01-19, 2011-01-20, 2011-01-21, 2011-01-24, 2011-01-25, 2011-01-27, 2011-01-28, 2011-01-31, 2011-02-01, 2011-02-02, 2011-02-03, 2011-02-04, 2011-02-07, 2011-02-08, 2011-02-09, 2011-02-10, 2011-02-11, 2011-02-14, 2011-02-15, 2011-02-16, 2011-02-17, 2011-02-18, 2011-02-21, 2011-02-22, 2011-02-23, 2011-02-24, 2011-02-25, 2011-02-28, 2011-03-01, 2011-03-03, 2011-03-04, 2011-03-07, 2011-03-08, 2011-03-09, 2011-03-10, 2011-03-11, 2011-03-14, 2011-03-15, 2011-03-16, 2011-03-17, 2011-03-18, 2011-03-21, 2011-03-22, 2011-03-23, 2011-03-24, 2011-03-25, 2011-03-28, 2011-03-29, 2011-03-30, 2011-03-31, 2011-04-01, 2011-04-04, 2011-04-05, 2011-04-06, 2011-04-07, 2011-04-08, 2011-04-11, 2011-04-13, 2011-04-15, 2011-04-18, 2011-04-19, 2011-04-20, 2011-04-21, 2011-04-25, 2011-04-26, 2011-04-27, 2011-04-28, 2011-04-29, 2011-05-02, 2011-05-03, 2011-05-04, 2011-05-05, 2011-05-06, 2011-05-09, 2011-05-10, 2011-05-11, 2011-05-12, 2011-05-13, 2011-05-16, 2011-05-17, 2011-05-18, 2011-05-19, 2011-05-20, 2011-05-23, 2011-05-24, 2011-05-25, 2011-05-26, 2011-05-27, 2011-05-30, 2011-05-31, 2011-06-01, 2011-06-02, 2011-06-03, 2011-06-06, 2011-06-07, 2011-06-08, 2011-06-09, 2011-06-10, 2011-06-13, 2011-06-14, 2011-06-15, 2011-06-16, 2011-06-17, 2011-06-20, 2011-06-21, 2011-06-22, 2011-06-23, 2011-06-24, 2011-06-27, 2011-06-28, 2011-06-29, 2011-06-30 ), .Dim = c(124L, 1L)) On Sat, May 26, 2012 at 6:22 AM, R. Michael Weylandt michael.weyla...@gmail.com michael.weyla...@gmail.com wrote: One (somewhat kludgy) way would be to use seq() to make one day's worth of times then to pass those to outer() to add in the needed days and then coerce the whole thing back to a sorted vector. I'm not at a computer right now so
Re: [R] Query about creating time sequences
Its not the number of days per se, it is the random gaps between the dates (corresponding to the dates on which the security market was closed) which will be difficult to accommodate in the solution proposed by you. So I would have to remove the sequence corresponding to those days from the entire sequence. This was the part which I deemed as difficult to achieve. I had mentioned this issue in my previous mails but you might have missed it. Thanks again for your time :) Regards, Shivam On Mon, May 28, 2012 at 5:06 AM, Gabor Grothendieck ggrothendi...@gmail.com wrote: On Sun, May 27, 2012 at 7:03 PM, Shivam shivamsi...@gmail.com wrote: Thanks for the responses ppl. @Gabor - The issue with your approach was that I had to select the time window for many days (124), which would be very difficult to achieve. I really appreciate you time though. Why does the number of days make it difficult to achieve? The number of days does not affect the code at all. Is there some aspect of the problem you haven't mentioned? -- Statistics Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com -- *Victoria Concordia Crescit* [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Query about creating time sequences
Hi All, I have a query about time based sequences. I know such questions have been asked a lot on forums, but I couldnt find the exact thing that I was looking for. I want to create a time-based sequence which will mimic the trading window AND would span multiple days. Something like below: 2011-01-03 09:15:00 IST 2011-01-03 09:15:01 IST 2011-01-03 15:29:59 IST 2011-01-03 15:30:00 IST 2011-01-04 09:15:00 IST 2011-01-04 09:15:01 IST 2011-01-04 15:29:59 IST 2011-01-04 15:30:00 IST Kindly notice the change of date in the sequence. The Indian Equity markets open at 09:15:00 and close at 15:30:00. I have equity data that spans 124 days, and I need to create a corresponding sequence which I will later use to regularize the irregular dataset to make a regular time-series. I was able to accomplish this task for a single day (i.e. creating a sequence then merging my dataset with it and use na.locf to make my dataset regular) but am unable to create a sequence for 'n' number of days. Can anyone help me with this? If it is of any help, I have a file which contains all the dates for which I need the sequence. The dput of the file is placed at the end of the email. One option is to create sequences for the entire days and then later remove all these records after merging. Although I havent checked the feasibility of this method, it would be complex and more so it will increase the data four folds (I already have 2 million records in the dataframe which I have to make regular). Another approach that I could think of was to make a timebased sequence based on the date from the file and then use a loop to append one sequence after another. But am not having much success there either. Any kind of help would be greatly appreciated. Thanks and regards, Shivam structure(list(20110103, 20110104, 20110105, 20110106, 20110107, 20110110, 20110111, 20110112, 20110113, 20110114, 20110117, 20110118, 20110119, 20110120, 20110121, 20110124, 20110125, 20110127, 20110128, 20110131, 20110201, 20110202, 20110203, 20110204, 20110207, 20110208, 20110209, 20110210, 20110211, 20110214, 20110215, 20110216, 20110217, 20110218, 20110221, 20110222, 20110223, 20110224, 20110225, 20110228, 20110301, 20110303, 20110304, 20110307, 20110308, 20110309, 20110310, 20110311, 20110314, 20110315, 20110316, 20110317, 20110318, 20110321, 20110322, 20110323, 20110324, 20110325, 20110328, 20110329, 20110330, 20110331, 20110401, 20110404, 20110405, 20110406, 20110407, 20110408, 20110411, 20110413, 20110415, 20110418, 20110419, 20110420, 20110421, 20110425, 20110426, 20110427, 20110428, 20110429, 20110502, 20110503, 20110504, 20110505, 20110506, 20110509, 20110510, 20110511, 20110512, 20110513, 20110516, 20110517, 20110518, 20110519, 20110520, 20110523, 20110524, 20110525, 20110526, 20110527, 20110530, 20110531, 20110601, 20110602, 20110603, 20110606, 20110607, 20110608, 20110609, 20110610, 20110613, 20110614, 20110615, 20110616, 20110617, 20110620, 20110621, 20110622, 20110623, 20110624, 20110627, 20110628, 20110629, 20110630), .Dim = c(124L, 1L), .Dimnames = list(c(X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X27, X28, X29, X30, X31, X32, X33, X34, X35, X36, X37, X38, X39, X40, X41, X42, X43, X44, X45, X46, X47, X48, X49, X50, X51, X52, X53, X54, X55, X56, X57, X58, X59, X60, X61, X62, X63, X64, X65, X66, X67, X68, X69, X70, X71, X72, X73, X74, X75, X76, X77, X78, X79, X80, X81, X82, X83, X84, X85, X86, X87, X88, X89, X90, X91, X92, X93, X94, X95, X96, X97, X98, X99, X100, X101, X102, X103, X104, X105, X106, X107, X108, X109, X110, X111, X112, X113, X114, X115, X116, X117, X118, X119, X120, X121, X122, X123, X124), NULL)) [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Query regarding date as argument in functions - and about sqldf
Hi, I have a query about sqldf, and dates in general. I couldnt find much on the net or on the forums, hence I am here. Here is the issue: I want to write a function that accepts 3 arguments: date1, date2 and a dataframe, say 'df'. Within the function, I want to populate a temp dataframe which essentially contains the output of the query select * from df where DATE between date1 and date2. DATE is a column (of class Date) which will be present in the input dataframe. This is how the function will look like: testfun = function(date1, date2, df){ temp = sqldf(select * from df where DATE between 'date1' and 'date2') return(temp) #not sure if I'm doing this correctly } There are multiple issues here: One is, how do I accept date as a parameter? Do I need to change the class of date1 and date2 at input level or within the sql query? In general how to format the input and how to return the temp dataframe once the function ends, kindly help. It may be basic but I've spent hours on it with no success. The above was the bigger picture. Another issue that I faced is when running a simple select (using sqldf) on a sample dataframe, I could not get a result. But when I use 'subset', I get an output. Below is the sample dataset named 'dlfhead': structure(list(NAME = c(DLF, DLF, DLF, DLF, DLF, DLF ), EXPDATE = structure(c(15001, 15001, 15001, 15001, 15001, 15001 ), class = Date), STRIKE = c(280, 280, 280, 280, 280, 280), TMSTMP = c(14:54:17, 14:42:38, 14:09:04, 14:20:25, 12:27:26, 11:59:36), PRICE = c(22, 23, 23, 23, 23, 23.2 ), DATE = structure(c(14977, 14977, 14977, 14977, 14977, 14977), class = Date), TTE = c(24, 24, 24, 24, 24, 24), DTTM = structure(c(1294046657, 1294045958, 1294043944, 1294044625, 1294037846, 1294036176), class = c(POSIXct, POSIXt)), SPOT = c(295, 294.5, 294.8, 293.85, 294.7, 294.55), MONEYNESS = c(1.05357142857143, 1.05178571428571, 1.05285714285714, 1.04946428571429, 1.0525, 1.05196428571429), TTEBY7 = c(4, 4, 4, 4, 4, 4), COMB = c(2011-01-034, 2011-01-034, 2011-01-034, 2011-01-034, 2011-01-034, 2011-01-034), RATE = c(7.55219988366433, 7.55260732560891, 7.55302228192484, 7.55344470721609, 7.55387455629645, 7.82914555610361 )), .Names = c(NAME, EXPDATE, STRIKE, TMSTMP, PRICE, DATE, TTE, DTTM, SPOT, MONEYNESS, TTEBY7, COMB, RATE), row.names = c(NA, 6L), class = data.frame) Below are the commands and their outputs: dlftest = sqldf(select * from dlfhead where DATE = '2011-01-03') nrow(dlftest) [1] 0 Kindly note that in the above, the query does complete successfully with 'dlftest' created with 0 rows. When I run the select on a column with class other than date I get the output. dlftest = sqldf(select * from dlfhead where TTE=24) nrow(dlftest) [1] 6 But when I use the 'subset' command, I do get an answer: dlftest = subset(dlfhead, DATE == '2011-01-03') nrow(dlftest) [1] 6 So the issue has to be the date-class columns in the dataframe, but what is the issue I have no idea. Can anyone help please? Thanks a lot for reading though this long email. Kindly reply if you have an idea about either issue. Regards, Shivam [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problem with SQLDF - Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such table:
Thanks Gabor, Jim, POSIXct is working fine :) Regards, Shivam On Thu, May 10, 2012 at 5:22 AM, Gabor Grothendieck ggrothendi...@gmail.com wrote: On Wed, May 9, 2012 at 7:24 PM, Shivam shivamsi...@gmail.com wrote: Ohh ... Thanks Gabor. I have a few related queries then, kindly have a look: 1. Does it only hinder the sqldf package or are there other issues with using POSIXlt in a dataframe? Am asking because I have a few dataframes with columns of class(POSIXlt ). 2. I have columns containing 'date+timestamp', something like '2011-01-03 09:07:07' which are of class POSIXlt. I need to perform some arithmetic operations on these columns. Which class would be most appropriate for such kind of data? Its not just sqldf. You will have other problems too if you put POSIXlt objects in data frames too. See R News 4/1. -- Statistics Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com -- *Victoria Concordia Crescit* [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Problem with SQLDF - Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such table:
Hi All, I am having trouble executing SQL statements on a few dataframes, but the funny thing is that I am able to execute the statement on some other dataframes. To test, I have 2 very small dataframes (6 rows and some columns). One is 'lessliq', the dput is given below. dput(head(lessliq)) structure(list(V1 = c(50464677L, 50464846L, 50432581L, 50426614L, 50504329L, 50504735L), V2 = c(TATASTEEL, TATASTEEL, TATASTEEL, TATASTEEL, TATASTEEL, TATASTEEL), V3 = c(OPTSTK, OPTSTK, OPTSTK, OPTSTK, OPTSTK, OPTSTK), V4 = structure(c(15029, 15029, 15029, 15029, 15029, 15029), class = Date), V5 = c(CE, CE, CE, CE, CE, CE), V6 = c(0L, 0L, 0L, 0L, 0L, 0L), V7 = c(700, 700, 700, 700, 700, 700), V8 = c(14:15:45, 14:15:51, 13:51:12, 13:45:13, 14:39:53, 14:40:08 ), V9 = c(37, 37, 37.75, 37, 37.5, 37.5), V10 = c(500L, 500L, 500L, 500L, 2000L, 500L), V11 = structure(c(14977, 14977, 14977, 14977, 14977, 14977), class = Date), V12 = c(52, 52, 52, 52, 52, 52)), .Names = c(V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12), row.names = c(NA, 6L), class = data.frame) I run the below command: new2 = sqldf(select * from lessliq) This works fine. But on many other dataframes it is not working. I have a dataframe 'testeq'. dput given below: dput(head(testeq)) structure(list(NAME = c(DLF, DLF, DLF, DLF, DLF, DLF ), TMSTMP = c(09:07:07, 09:15:03, 09:15:03, 09:15:03, 09:15:03, 09:15:04), PRICE = c(295, 294.5, 293.9, 294.9, 295, 294.5), DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c(2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, 2011-01-10, 2011-01-11, 2011-01-12, 2011-01-13, 2011-01-14, 2011-01-17, 2011-01-18, 2011-01-19, 2011-01-20, 2011-01-21, 2011-01-24, 2011-01-25, 2011-01-27, 2011-01-28, 2011-01-31, 2011-02-01, 2011-02-02, 2011-02-03, 2011-02-04, 2011-02-07, 2011-02-08, 2011-02-09, 2011-02-10, 2011-02-11, 2011-02-14, 2011-02-15, 2011-02-16, 2011-02-17, 2011-02-18, 2011-02-21, 2011-02-22, 2011-02-23, 2011-02-24, 2011-02-25, 2011-02-28, 2011-03-01, 2011-03-03, 2011-03-04, 2011-03-07, 2011-03-08, 2011-03-09, 2011-03-10, 2011-03-11, 2011-03-14, 2011-03-15, 2011-03-16, 2011-03-17, 2011-03-18, 2011-03-21, 2011-03-22, 2011-03-23, 2011-03-24, 2011-03-25, 2011-03-28, 2011-03-29, 2011-03-30, 2011-03-31, 2011-04-01, 2011-04-04, 2011-04-05, 2011-04-06, 2011-04-07, 2011-04-08, 2011-04-11, 2011-04-13, 2011-04-15, 2011-04-18, 2011-04-19, 2011-04-20, 2011-04-21, 2011-04-25, 2011-04-26, 2011-04-27, 2011-04-28, 2011-04-29, 2011-05-02, 2011-05-03, 2011-05-04, 2011-05-05, 2011-05-06, 2011-05-09, 2011-05-10, 2011-05-11, 2011-05-12, 2011-05-13, 2011-05-16, 2011-05-17, 2011-05-18, 2011-05-19, 2011-05-20, 2011-05-23, 2011-05-24, 2011-05-25, 2011-05-26, 2011-05-27, 2011-05-30, 2011-05-31, 2011-06-01, 2011-06-02, 2011-06-03, 2011-06-06, 2011-06-07, 2011-06-08, 2011-06-09, 2011-06-10, 2011-06-13, 2011-06-14, 2011-06-15, 2011-06-16, 2011-06-17, 2011-06-20, 2011-06-21, 2011-06-22, 2011-06-23, 2011-06-24, 2011-06-27, 2011-06-28, 2011-06-29, 2011-06-30), class = factor), DTTM = structure(list(sec = c(7, 3, 3, 3, 3, 4), min = c(7L, 15L, 15L, 15L, 15L, 15L), hour = c(9L, 9L, 9L, 9L, 9L, 9L ), mday = c(3L, 3L, 3L, 3L, 3L, 3L), mon = c(0L, 0L, 0L, 0L, 0L, 0L), year = c(111L, 111L, 111L, 111L, 111L, 111L), wday = c(1L, 1L, 1L, 1L, 1L, 1L), yday = c(2L, 2L, 2L, 2L, 2L, 2L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c(sec, min, hour, mday, mon, year, wday, yday, isdst ), class = c(POSIXlt, POSIXt))), .Names = c(NAME, TMSTMP, PRICE, DATE, DTTM), row.names = c(NA, 6L), class = data.frame) I run the above command again but this time I get an error: new3 = sqldf(select * from testeq) Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such table: testeq) In addition: Warning message: In value[[3L]](cond) : RAW() can only be applied to a 'raw', not a 'double' Can anyone guide me if there is any difference in the structure of the two dataframes or what else can be the issue? Thanks in advance, Regards, Shivam Singh [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problem with SQLDF - Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such table:
Ohh ... Thanks Gabor. I have a few related queries then, kindly have a look: 1. Does it only hinder the sqldf package or are there other issues with using POSIXlt in a dataframe? Am asking because I have a few dataframes with columns of class(POSIXlt ). 2. I have columns containing 'date+timestamp', something like '2011-01-03 09:07:07' which are of class POSIXlt. I need to perform some arithmetic operations on these columns. Which class would be most appropriate for such kind of data? Thanks and Regards, Shivam On Thu, May 10, 2012 at 4:45 AM, Gabor Grothendieck ggrothendi...@gmail.com wrote: On Wed, May 9, 2012 at 5:58 PM, Shivam shivamsi...@gmail.com wrote: Hi All, I am having trouble executing SQL statements on a few dataframes, but the funny thing is that I am able to execute the statement on some other dataframes. Never put POSIXlt objects into data frames. -- Statistics Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com -- *Victoria Concordia Crescit* [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Query about memory used in list and dataframe
Hi, I had a query regarding which object, a list or a dataframe, consumes more R memory. Let me clarify this: For example, I have a df of 6 rows and 12 columns, say 'test'. I do object.size() and find it uses 3.3 KB of memory. I run a loop and make a list, say 'testlist', of 6 elements, each element being the above mentioned df 'test'. The size of this list is 19.9 KB, understandably. Now I combine this list into a dataframe using rbind. The df formed has 12 cols and 36 rows. The size of this df is just 5.8 KB, almost a 75% reduction in memory. I had to work with a much larger list, and I thought of using the same method to convert my bigger list (62 dataframes, each having 4 cols and close to 200,000 rows) into a single dataframe. The big list, sat LIST A, had a size of 571 MB. But when I convert it into a dataframe, say DF A, using rbind, the object size increases to 1.35 GB. This was in contradiction to the earlier result. What am I missing? Why a 75% reduction in size in one case and double size in other? Anyone with any explanation? Sorry for the verbose email, just wanted to make my case clear. Thanks in advance, Regards Shivam [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to insert filename as column in a file
Thanks Jeff. I had tried the 'list' approach as well but got stuck with the below error: Error in `$-.data.frame`(`*tmp*`, date, value = 20100701) : replacement has 1 rows, data has 0 Couldnt find a work around to this, hence resorted to the multiple dataframes approach. Any insights into this? On Tue, Apr 24, 2012 at 9:51 PM, Jeff Newmiller jdnew...@dcn.davis.ca.uswrote: Programatically dealing with large numbers of separately-named objects leads to syntactically complicated code that is hard to read and maintain. Load the data frames into a list so you can access them by numeric or named index, and then getting at the loaded data will be much easier. fnames = list.files(path = getwd()) # preallocating the list for efficiency (execution speed) dtalist - vector( list, length(fnames) ) for (i in seq_len(length(fnames))){ dtalist[[i]] - read.csv.sql(fnames[i], sql = select * from file where V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol =\n)) dtalist[[i]]$date - substr(fnames[i],1,8)) } names(dtalist) - fnames # now you can optionally refer to dtalist$file20120424.csv or dtalist[[file20120424]] if you wish. --- Jeff NewmillerThe . . Go Live... DCN:jdnew...@dcn.davis.ca.usBasics: ##.#. ##.#. Live Go... Live: OO#.. Dead: OO#.. Playing Research Engineer (Solar/BatteriesO.O#. #.O#. with /Software/Embedded Controllers) .OO#. .OO#. rocks...1k --- Sent from my phone. Please excuse my brevity. Shivam shivamsi...@gmail.com wrote: Reposting in hope of a reply. On Tue, Apr 24, 2012 at 1:12 AM, Shivam shivamsi...@gmail.com wrote: Thanks for the quick response. It works for an individual dataframe, but I have many dataframes. This is the code so far fnames = list.files(path = getwd()) for (i in 1:length(fnames)){ assign(paste(file,i,sep=),read.csv.sql(fnames[i], sql = select * from file where V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol = \n)) } This generates dataframes named as as file1,file2,...,file250. Is there a way to do something like below within the same loop? file1$date = substr(fnames[1],1,8)) file2$date = substr(fnames[2],1,8)) . . file250$date = substr(fnames[250],1,8)) assign(paste(file,i,sep=)$date doesnt work. Any help? On Tue, Apr 24, 2012 at 12:01 AM, MacQueen, Don macque...@llnl.govwrote: This little example might help. foo - data.frame(a=1:10, b=letters[1:0]) foo a b 1 1 a 2 2 a 3 3 a 4 4 a 5 5 a 6 6 a 7 7 a 8 8 a 9 9 a 10 10 a foo$date - '20120423' foo a b date 1 1 a 20120423 2 2 a 20120423 3 3 a 20120423 4 4 a 20120423 5 5 a 20120423 6 6 a 20120423 7 7 a 20120423 8 8 a 20120423 9 9 a 20120423 10 10 a 20120423 In other words, immediately after reading the data into a data frame, add a date column as in the example. You'll have to extract the date from the filename, of course. -Don -- Don MacQueen Lawrence Livermore National Laboratory 7000 East Ave., L-627 Livermore, CA 94550 925-423-1062 On 4/23/12 9:29 AM, Shivam shivamsi...@gmail.com wrote: Hi, I am relatively new to R. Have scourged the help files and the www but havent been able to get a solution. I have around 250 csv files, one file for each date. They have columns of all types, numeric, string etc. The name of each file is the date in the form of 'mmdd'. There is no column within the file which helps me identify the date on which the file was generated, only the filename has that info. I am selecting some data (using read.csv.sql) from each file and creating a dataset for each day. Ultimately I will combine all the datasets. I can accomplish the select and combine part, but after combining I wont have a record as to the date corresponding to the data. Hence I want to insert the filename as a column in the respective file to help me in identifying to what date each data row belongs to. Sorry for the long mail, but wanted to make myself clear. Any help would be greatly appreciated. Thanks in advance, Shivam [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- *Victoria Concordia Crescit* -- *Victoria Concordia Crescit* [[alternative HTML version deleted]] __ R
Re: [R] How to insert filename as column in a file
Reposting in hope of a reply. On Tue, Apr 24, 2012 at 1:12 AM, Shivam shivamsi...@gmail.com wrote: Thanks for the quick response. It works for an individual dataframe, but I have many dataframes. This is the code so far fnames = list.files(path = getwd()) for (i in 1:length(fnames)){ assign(paste(file,i,sep=),read.csv.sql(fnames[i], sql = select * from file where V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol = \n)) } This generates dataframes named as as file1,file2,...,file250. Is there a way to do something like below within the same loop? file1$date = substr(fnames[1],1,8)) file2$date = substr(fnames[2],1,8)) . . file250$date = substr(fnames[250],1,8)) assign(paste(file,i,sep=)$date doesnt work. Any help? On Tue, Apr 24, 2012 at 12:01 AM, MacQueen, Don macque...@llnl.govwrote: This little example might help. foo - data.frame(a=1:10, b=letters[1:0]) foo a b 1 1 a 2 2 a 3 3 a 4 4 a 5 5 a 6 6 a 7 7 a 8 8 a 9 9 a 10 10 a foo$date - '20120423' foo a b date 1 1 a 20120423 2 2 a 20120423 3 3 a 20120423 4 4 a 20120423 5 5 a 20120423 6 6 a 20120423 7 7 a 20120423 8 8 a 20120423 9 9 a 20120423 10 10 a 20120423 In other words, immediately after reading the data into a data frame, add a date column as in the example. You'll have to extract the date from the filename, of course. -Don -- Don MacQueen Lawrence Livermore National Laboratory 7000 East Ave., L-627 Livermore, CA 94550 925-423-1062 On 4/23/12 9:29 AM, Shivam shivamsi...@gmail.com wrote: Hi, I am relatively new to R. Have scourged the help files and the www but havent been able to get a solution. I have around 250 csv files, one file for each date. They have columns of all types, numeric, string etc. The name of each file is the date in the form of 'mmdd'. There is no column within the file which helps me identify the date on which the file was generated, only the filename has that info. I am selecting some data (using read.csv.sql) from each file and creating a dataset for each day. Ultimately I will combine all the datasets. I can accomplish the select and combine part, but after combining I wont have a record as to the date corresponding to the data. Hence I want to insert the filename as a column in the respective file to help me in identifying to what date each data row belongs to. Sorry for the long mail, but wanted to make myself clear. Any help would be greatly appreciated. Thanks in advance, Shivam [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- *Victoria Concordia Crescit* -- *Victoria Concordia Crescit* [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] How to insert filename as column in a file
Hi, I am relatively new to R. Have scourged the help files and the www but havent been able to get a solution. I have around 250 csv files, one file for each date. They have columns of all types, numeric, string etc. The name of each file is the date in the form of 'mmdd'. There is no column within the file which helps me identify the date on which the file was generated, only the filename has that info. I am selecting some data (using read.csv.sql) from each file and creating a dataset for each day. Ultimately I will combine all the datasets. I can accomplish the select and combine part, but after combining I wont have a record as to the date corresponding to the data. Hence I want to insert the filename as a column in the respective file to help me in identifying to what date each data row belongs to. Sorry for the long mail, but wanted to make myself clear. Any help would be greatly appreciated. Thanks in advance, Shivam [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] How to insert filename as column in a file
Thanks for the quick response. It works for an individual dataframe, but I have many dataframes. This is the code so far fnames = list.files(path = getwd()) for (i in 1:length(fnames)){ assign(paste(file,i,sep=),read.csv.sql(fnames[i], sql = select * from file where V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol = \n)) } This generates dataframes named as as file1,file2,...,file250. Is there a way to do something like below within the same loop? file1$date = substr(fnames[1],1,8)) file2$date = substr(fnames[2],1,8)) . . file250$date = substr(fnames[250],1,8)) assign(paste(file,i,sep=)$date doesnt work. Any help? On Tue, Apr 24, 2012 at 12:01 AM, MacQueen, Don macque...@llnl.gov wrote: This little example might help. foo - data.frame(a=1:10, b=letters[1:0]) foo a b 1 1 a 2 2 a 3 3 a 4 4 a 5 5 a 6 6 a 7 7 a 8 8 a 9 9 a 10 10 a foo$date - '20120423' foo a b date 1 1 a 20120423 2 2 a 20120423 3 3 a 20120423 4 4 a 20120423 5 5 a 20120423 6 6 a 20120423 7 7 a 20120423 8 8 a 20120423 9 9 a 20120423 10 10 a 20120423 In other words, immediately after reading the data into a data frame, add a date column as in the example. You'll have to extract the date from the filename, of course. -Don -- Don MacQueen Lawrence Livermore National Laboratory 7000 East Ave., L-627 Livermore, CA 94550 925-423-1062 On 4/23/12 9:29 AM, Shivam shivamsi...@gmail.com wrote: Hi, I am relatively new to R. Have scourged the help files and the www but havent been able to get a solution. I have around 250 csv files, one file for each date. They have columns of all types, numeric, string etc. The name of each file is the date in the form of 'mmdd'. There is no column within the file which helps me identify the date on which the file was generated, only the filename has that info. I am selecting some data (using read.csv.sql) from each file and creating a dataset for each day. Ultimately I will combine all the datasets. I can accomplish the select and combine part, but after combining I wont have a record as to the date corresponding to the data. Hence I want to insert the filename as a column in the respective file to help me in identifying to what date each data row belongs to. Sorry for the long mail, but wanted to make myself clear. Any help would be greatly appreciated. Thanks in advance, Shivam [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- *Victoria Concordia Crescit* [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.