[R] Sum of column from another df based of row values of df1

2012-09-09 Thread Shivam
Dear All,

I need to sum a column from another dataframe based on the row values
of one dataframe. I am stuck in a loop trying to accomplish it and at
current speed it will take more than 80 hours to complete. Needless to
say I am looking for a more elegant/quicker solution. Really need some
help here. Here is the issue:

I have a dataframe CALL (the dput of head is given below) which has
close to a million rows. There are 2 date columns which are of
importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
head given), which has 2 columns, DATE and VOL. It has the volatility
corresponding to each day and it has a total of 124 records
(corresponding to 6 months). I want to add another column in the CALL
dataframe which would contain the sum of all the volatilities from the
VOL df for the period specified by the interval of DATE and EXPDATE in
each row of CALL df.

For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
'27-01-2011'. So I want the SUM column (A new column in CALL df) to
contain the sum of volatilities of 03-01, 04-01, 05-01  till 27-01
from the VOL dataframe.

I have to repeat this process for all the rows in the dataframe. Here
is the for-loop version of the solution:

for (k in 1:nrow(CALL)){
CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE = CALL$DATE[k]  VOL$DATE
= CALL$EXPDATE[k]))
}

The loop will run for close to a million times, it has been running
for more than 10 hours and its just 12% complete. It would take more
than 80 hours to complete, not the mention the toll it would take on
my laptop. So is there a better way that I can accomplish this task?
Any input would be greatly appreciated. Below are the dput of the two
dataframes.

One point of note is that there are only 124 DISTINCT values of DATE
and 6 DISTINCT values of EXPDATE, in case it can be used in some way.

 dput(CALL)
structure(list(NAME = c(STK, STK, STK, STK, STK,
STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L,
6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30,
15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3,
101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
14977), class = Date), DTTM = structure(c(1294044516, 1294048422,
1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct,
POSIXt), tzone = ), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c(NAME,
EXPDATE, STRIKE, TMSTMP, PRICE, QUANT, DATE, DTTM,
TTE), row.names = c(1, 2, 3, 4, 5, 6), class = data.frame)


 dput(VOL)
structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
1294252200, 1294338600, 1294597800), class = c(POSIXct, POSIXt
), tzone = ), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
0.000216076713994619)), .Names = c(DATE, VOL), row.names = c(NA,
6L), class = data.frame)

Please do let me know if any more information from my side would help
or if I need to explain the issue more clearly.

Any minor improvement will be great help.

Thanks in advance.

-Shivam


-- 
*Victoria Concordia Crescit*

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Sum of column from another df based of row values of df1

2012-09-09 Thread Shivam
Thanks a lot Jim, it works a treat. Just had to change the date format
in the mCALL as well. But you saved me 80 hours of fretting and
frustration. Really thankful for it.

Regards,
Shivam

On Mon, Sep 10, 2012 at 1:33 AM, jim holtman jholt...@gmail.com wrote:
 How about an improvement to 16 seconds.  The first thing to do is to
 convert you data to a matrix because accessing data in a dataframe is
 very expensive.  If you run Rprof on your code you will see that all
 the time is spent in retrieving the information.  Converting to a
 matrix and using matrix accessing is considerably faster.  I did
 convert the POSIXct to Date.  You were also paying a lot in the
 constant conversion of POSIXct to Date for your comparisons.  I just
 replicated your CALL to 1 million rows for testing.


 CALL -
 + structure(list(NAME = c(STK, STK, STK, STK, STK,
 + STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
 + 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L,
 + 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30,
 + 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3,
 + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
 + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
 + 14977), class = Date), DTTM = structure(c(1294044516, 1294048422,
 + 1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct,
 + POSIXt), tzone = ), TTE = c(38, 38, 38, 38, 38, 38)), .Names = c(NAME,
 + EXPDATE, STRIKE, TMSTMP, PRICE, QUANT, DATE, DTTM,
 + TTE), row.names = c(1, 2, 3, 4, 5, 6), class = data.frame)

 VOL -
 + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
 + 1294252200, 1294338600, 1294597800), class = c(POSIXct, POSIXt
 + ), tzone = ), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
 + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
 + 0.000216076713994619)), .Names = c(DATE, VOL), row.names = c(NA,
 + 6L), class = data.frame)

 # convert to matrices for faster testing
 mCALL - cbind(CALL$DATE, CALL$EXPDATE)
 mVOL - cbind(as.Date(VOL$DATE), VOL$VOL)  # convert POSIXct to Date

 # create 1M rows in mCALL
 mCALL - rbind(mCALL, mCALL[rep(1L, 1e6),])

 result - numeric(nrow(mCALL))
 system.time({
 + for (i in 1:nrow(mCALL)){
 + result[i] - sum(mVOL[(mVOL[, 1L] = mCALL[i,1L])
 +  (mVOL[, 1L] = mCALL[i, 2L]), 2L])
 + }
 + })
user  system elapsed
   15.940.00   16.07






 On Sun, Sep 9, 2012 at 2:58 PM, Shivam shivamsi...@gmail.com wrote:
 Dear All,

 I need to sum a column from another dataframe based on the row values
 of one dataframe. I am stuck in a loop trying to accomplish it and at
 current speed it will take more than 80 hours to complete. Needless to
 say I am looking for a more elegant/quicker solution. Really need some
 help here. Here is the issue:

 I have a dataframe CALL (the dput of head is given below) which has
 close to a million rows. There are 2 date columns which are of
 importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
 head given), which has 2 columns, DATE and VOL. It has the volatility
 corresponding to each day and it has a total of 124 records
 (corresponding to 6 months). I want to add another column in the CALL
 dataframe which would contain the sum of all the volatilities from the
 VOL df for the period specified by the interval of DATE and EXPDATE in
 each row of CALL df.

 For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
 '27-01-2011'. So I want the SUM column (A new column in CALL df) to
 contain the sum of volatilities of 03-01, 04-01, 05-01  till 27-01
 from the VOL dataframe.

 I have to repeat this process for all the rows in the dataframe. Here
 is the for-loop version of the solution:

 for (k in 1:nrow(CALL)){
 CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE = CALL$DATE[k]  VOL$DATE
 = CALL$EXPDATE[k]))
 }

 The loop will run for close to a million times, it has been running
 for more than 10 hours and its just 12% complete. It would take more
 than 80 hours to complete, not the mention the toll it would take on
 my laptop. So is there a better way that I can accomplish this task?
 Any input would be greatly appreciated. Below are the dput of the two
 dataframes.

 One point of note is that there are only 124 DISTINCT values of DATE
 and 6 DISTINCT values of EXPDATE, in case it can be used in some way.

 dput(CALL)
 structure(list(NAME = c(STK, STK, STK, STK, STK,
 STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L,
 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30,
 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3,
 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
 14977), class = Date), DTTM = structure(c(1294044516, 1294048422,
 1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct,
 POSIXt), tzone = ), TTE

Re: [R] Sum of column from another df based of row values of df1

2012-09-09 Thread Shivam
Just to add, I did not know that the speed of data access is so much
different in matrix and dataframes. This is one for the future.

Thanks again Jim :)

-Shivam

On Mon, Sep 10, 2012 at 3:29 AM, Shivam shivamsi...@gmail.com wrote:
 Thanks a lot Jim, it works a treat. Just had to change the date format
 in the mCALL as well. But you saved me 80 hours of fretting and
 frustration. Really thankful for it.

 Regards,
 Shivam

 On Mon, Sep 10, 2012 at 1:33 AM, jim holtman jholt...@gmail.com wrote:
 How about an improvement to 16 seconds.  The first thing to do is to
 convert you data to a matrix because accessing data in a dataframe is
 very expensive.  If you run Rprof on your code you will see that all
 the time is spent in retrieving the information.  Converting to a
 matrix and using matrix accessing is considerably faster.  I did
 convert the POSIXct to Date.  You were also paying a lot in the
 constant conversion of POSIXct to Date for your comparisons.  I just
 replicated your CALL to 1 million rows for testing.


 CALL -
 + structure(list(NAME = c(STK, STK, STK, STK, STK,
 + STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
 + 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L,
 + 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30,
 + 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3,
 + 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L, 2000L, 1700L, 2000L,
 + 2000L), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
 + 14977), class = Date), DTTM = structure(c(1294044516, 1294048422,
 + 1294048350, 1294048453, 1294048327, 1294048347), class = c(POSIXct,
 + POSIXt), tzone = ), TTE = c(38, 38, 38, 38, 38, 38)), .Names = 
 c(NAME,
 + EXPDATE, STRIKE, TMSTMP, PRICE, QUANT, DATE, DTTM,
 + TTE), row.names = c(1, 2, 3, 4, 5, 6), class = data.frame)

 VOL -
 + structure(list(DATE = structure(c(1293993000, 1294079400, 1294165800,
 + 1294252200, 1294338600, 1294597800), class = c(POSIXct, POSIXt
 + ), tzone = ), VOL = c(2.32666706461792e-05, 6.79164443640051e-05,
 + 5.66390788200039e-05, 7.25422438459608e-05, 0.000121727951296865,
 + 0.000216076713994619)), .Names = c(DATE, VOL), row.names = c(NA,
 + 6L), class = data.frame)

 # convert to matrices for faster testing
 mCALL - cbind(CALL$DATE, CALL$EXPDATE)
 mVOL - cbind(as.Date(VOL$DATE), VOL$VOL)  # convert POSIXct to Date

 # create 1M rows in mCALL
 mCALL - rbind(mCALL, mCALL[rep(1L, 1e6),])

 result - numeric(nrow(mCALL))
 system.time({
 + for (i in 1:nrow(mCALL)){
 + result[i] - sum(mVOL[(mVOL[, 1L] = mCALL[i,1L])
 +  (mVOL[, 1L] = mCALL[i, 2L]), 2L])
 + }
 + })
user  system elapsed
   15.940.00   16.07






 On Sun, Sep 9, 2012 at 2:58 PM, Shivam shivamsi...@gmail.com wrote:
 Dear All,

 I need to sum a column from another dataframe based on the row values
 of one dataframe. I am stuck in a loop trying to accomplish it and at
 current speed it will take more than 80 hours to complete. Needless to
 say I am looking for a more elegant/quicker solution. Really need some
 help here. Here is the issue:

 I have a dataframe CALL (the dput of head is given below) which has
 close to a million rows. There are 2 date columns which are of
 importance, DATE and EXPDATE. There is another dataframe, VOL (dput of
 head given), which has 2 columns, DATE and VOL. It has the volatility
 corresponding to each day and it has a total of 124 records
 (corresponding to 6 months). I want to add another column in the CALL
 dataframe which would contain the sum of all the volatilities from the
 VOL df for the period specified by the interval of DATE and EXPDATE in
 each row of CALL df.

 For ex: In the first row, DATE is '03-01-2011' and EXPDATE is
 '27-01-2011'. So I want the SUM column (A new column in CALL df) to
 contain the sum of volatilities of 03-01, 04-01, 05-01  till 27-01
 from the VOL dataframe.

 I have to repeat this process for all the rows in the dataframe. Here
 is the for-loop version of the solution:

 for (k in 1:nrow(CALL)){
 CALL$SUM[k] = sum(subset(VOL$VOL, VOL$DATE = CALL$DATE[k]  VOL$DATE
 = CALL$EXPDATE[k]))
 }

 The loop will run for close to a million times, it has been running
 for more than 10 hours and its just 12% complete. It would take more
 than 80 hours to complete, not the mention the toll it would take on
 my laptop. So is there a better way that I can accomplish this task?
 Any input would be greatly appreciated. Below are the dput of the two
 dataframes.

 One point of note is that there are only 124 DISTINCT values of DATE
 and 6 DISTINCT values of EXPDATE, in case it can be used in some way.

 dput(CALL)
 structure(list(NAME = c(STK, STK, STK, STK, STK,
 STK), EXPDATE = structure(c(15029, 15029, 15029, 15029, 15029,
 15029), class = Date), STRIKE = c(6300L, 6300L, 6300L, 6300L,
 6300L, 6300L), TMSTMP = c(14:18:36, 15:23:42, 15:22:30,
 15:24:13, 15:22:07, 15:22:27), PRICE = c(107, 102.05, 101.3,
 101.5, 101.2, 101.2), QUANT = c(1850L, 2000L

[R] Query about TSRV

2012-06-14 Thread Shivam
Hi All,

This question may not belong here, but I asked this on the
R-SIG-FINANCE list and so far have not got any reply, so was hoping
someone here may help.

I have a basic query about TSRV and was hoping you all can shed some
light on the issue.

I have 22500 records for each day. So if I take the slow frequency as
5 mins I can have 75 subgrids. An estimate of the 5-minute realized
volatility can be obtained by selecting the first observation from
each of the 75 five-minute periods (each containing 300 observations).
Another estimate can be made by selecting the second price observation
from each of the 75 sets. Similarly, 298 additional estimates can be
obtained from the remaining data. Thus, 300 estimates of realized
volatility can be obtained for a 5-minute sampling interval. The
average of these 300 estimates is the low frequency volatility.

If P1,P2,...P301,P302,...P22500 are the prices at each timestamp, then
we are effectively doing the below (using simple returns for
illustration only):

(P301/P1)^2 + (P601/P301)^2 + (P901/P601)^2 +  + (P22201/P21901)^2  = V1

(P302/P2)^2 + (P602/P302)^2 + (P902/P602)^2 +  + (P22202/P21902)^2 = V2

(P303/P3)^2 + (P603/P303)^2 + (P903/P603)^2 +  + (P22203/P21903)^2 = V3



(P600/P300)^2 + (P900/P600)^2 + (P1200/P900)^2 +  + (P22500/P22200)^2 = V300

Then we average the Vi's (V1+V2++V300)/300. This gives us the 5
min realized volatility.

Please correct me if I've made a mistake somewhere. As I think this
was the approach suggested by Zhang et al in their paper.

What my question is, is the intermediate step of calculating V1,V2
etc necessary? Cant we simply calculate returns at a lag of 300 for
the entire day and then divide it by 300 to achieve the same result.
What am I missing here? Can anyone point out please.

Thanks and regards,
Shivam

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Query about creating time sequences

2012-05-27 Thread Shivam
Thanks for the effort Michael, but the problem here is that the dates for
which the sequences need to be created have gaps in between. Basically I
need the sequence for only those days on which the security market is open
(I have the dates in a file which is present at the end of THIS mail).

What I have been able to do is to create a list, where each element of the
list is a sequence for a single day. It was done like below:

for(i in 1:124){
seqtimes[[i]] = xts(,seq(as.POSIXct(paste(fdates[i],'09:15:00', sep= )),
as.POSIXct(paste(fdates[i],'15:30:00', sep= )), by = 1))}

where 'fdates' is the file which contains the dates for which the sequences
need to be created.

But now I am stuck. I need a way to get all these sequences in a
(vector/dataframe/xts object) where all the list items are sequentially
present.

I tried merge.xts, but to no avail.

 seqtimes[[1]]
Data:
numeric(0)

Index:
 POSIXct[1:22501], format: 2011-01-03 09:15:00 2011-01-03 09:15:01
2011-01-03 09:15:02 2011-01-03 09:15:03 2011-01-03 09:15:04
2011-01-03 09:15:05 ...

 seqtimes[[2]]
Data:
numeric(0)

Index:
 POSIXct[1:22501], format: 2011-01-04 09:15:00 2011-01-04 09:15:01
2011-01-04 09:15:02 2011-01-04 09:15:03 2011-01-04 09:15:04
2011-01-04 09:15:05 ...

 tseq = merge.xts(seqtimes[[1]],seqtimes[[2]], all = TRUE)
 tseq
Data:
numeric(0)

Index:
integer(0)

Any help would be greatly appreciated.

Thanks in advance,
Regards,
Shivam

P.S. - The dput of the fdates file:

 dput(fdates)
structure(c(2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06,
2011-01-07, 2011-01-10, 2011-01-11, 2011-01-12, 2011-01-13,
2011-01-14, 2011-01-17, 2011-01-18, 2011-01-19, 2011-01-20,
2011-01-21, 2011-01-24, 2011-01-25, 2011-01-27, 2011-01-28,
2011-01-31, 2011-02-01, 2011-02-02, 2011-02-03, 2011-02-04,
2011-02-07, 2011-02-08, 2011-02-09, 2011-02-10, 2011-02-11,
2011-02-14, 2011-02-15, 2011-02-16, 2011-02-17, 2011-02-18,
2011-02-21, 2011-02-22, 2011-02-23, 2011-02-24, 2011-02-25,
2011-02-28, 2011-03-01, 2011-03-03, 2011-03-04, 2011-03-07,
2011-03-08, 2011-03-09, 2011-03-10, 2011-03-11, 2011-03-14,
2011-03-15, 2011-03-16, 2011-03-17, 2011-03-18, 2011-03-21,
2011-03-22, 2011-03-23, 2011-03-24, 2011-03-25, 2011-03-28,
2011-03-29, 2011-03-30, 2011-03-31, 2011-04-01, 2011-04-04,
2011-04-05, 2011-04-06, 2011-04-07, 2011-04-08, 2011-04-11,
2011-04-13, 2011-04-15, 2011-04-18, 2011-04-19, 2011-04-20,
2011-04-21, 2011-04-25, 2011-04-26, 2011-04-27, 2011-04-28,
2011-04-29, 2011-05-02, 2011-05-03, 2011-05-04, 2011-05-05,
2011-05-06, 2011-05-09, 2011-05-10, 2011-05-11, 2011-05-12,
2011-05-13, 2011-05-16, 2011-05-17, 2011-05-18, 2011-05-19,
2011-05-20, 2011-05-23, 2011-05-24, 2011-05-25, 2011-05-26,
2011-05-27, 2011-05-30, 2011-05-31, 2011-06-01, 2011-06-02,
2011-06-03, 2011-06-06, 2011-06-07, 2011-06-08, 2011-06-09,
2011-06-10, 2011-06-13, 2011-06-14, 2011-06-15, 2011-06-16,
2011-06-17, 2011-06-20, 2011-06-21, 2011-06-22, 2011-06-23,
2011-06-24, 2011-06-27, 2011-06-28, 2011-06-29, 2011-06-30
), .Dim = c(124L, 1L))



On Sat, May 26, 2012 at 6:22 AM, R. Michael Weylandt 
michael.weyla...@gmail.com michael.weyla...@gmail.com wrote:

 One (somewhat kludgy) way would be to use seq() to make one day's worth of
 times then to pass those to outer() to add in the needed days and then
 coerce the whole thing back to a sorted vector.

 I'm not at a computer right now so this won't be quite right but something
 like


 x - seq(x.start.first.day, x.end.first.day, by = sec)

 y - 24*60*60 *(1:n.days)

 sort(as.vector(outer(x, y, +)))

 Changing the order of x and y might make the sort unnecessary.

 M

 On May 25, 2012, at 1:14 PM, Shivam shivamsi...@gmail.com wrote:

  Hi All,
 
  I have a query about time based sequences. I know such questions have
 been
  asked a lot on forums, but I couldnt find the exact thing that I was
  looking for.
 
  I want to create a time-based sequence which will mimic the trading
 window
  AND would span multiple days. Something like below:
 
  2011-01-03 09:15:00 IST
  2011-01-03 09:15:01 IST
  
  
  
  2011-01-03 15:29:59 IST
  2011-01-03 15:30:00 IST
  2011-01-04 09:15:00 IST
  2011-01-04 09:15:01 IST
  
  
  
  2011-01-04 15:29:59 IST
  2011-01-04 15:30:00 IST
 
  Kindly notice the change of date in the sequence.
 
  The Indian Equity markets open at 09:15:00 and close at 15:30:00. I have
  equity data that spans 124 days, and I need to create a corresponding
  sequence which I will later use to regularize the irregular dataset to
 make
  a regular time-series.
 
  I was able to accomplish this task for a single day (i.e. creating a
  sequence then merging my dataset with it and use na.locf to make my
 dataset
  regular) but am unable to create a sequence for 'n' number of days. Can
  anyone help me with this?
 
  If it is of any help, I have a file which contains all the dates for
 which
  I need the sequence. The dput of the file is placed at the end of the
  email.
 
  One option is to create sequences for the entire days

Re: [R] Query about creating time sequences

2012-05-27 Thread Shivam
Thanks for the responses ppl.

@Gabor - The issue with your approach was that I had to select the time
window for many days (124), which would be very difficult to achieve. I
really appreciate you time though.

@Jeff - Your solution works. I had to tweak it a little as I needed the
sequence for each second (as against the 15 min interval you had
exhibited). But I was able to achieve that without much hassle. Thanks a
lot, been stuck at this for almost a week :)

Pasting the tweaked version for future reference.

pdates - as.POSIXct( fdates )
hstart - new_period( hour=9, minute=15, second=0)
hend - new_period( hour=15, minute=30,second=0 )
mperiod - new_period( second=1 )
numperday - (hend-hstart)/mperiod
dtms - expand.grid( dt=pdates, tm=hstart + mperiod * seq( from=0,
to=numperday ) )
dtms$dtm - with( dtms, dt + tm )
dtms - dtms[ order( dtms$dtm ), ]


Regards,
Shivam

On Mon, May 28, 2012 at 2:51 AM, Jeff Newmiller jdnew...@dcn.davis.ca.uswrote:

 Try this:

 # Setting TZ is optional, but I find it helps me to be more aware of #
 timezone effects
 Sys.setenv( TZ=Asia/Kolkata)
 library(lubridate)

 pdates - as.POSIXct( fdates )
 hstart - new_period( hour=9, minute=15 )
 hend - new_period( hour=15, minute=30 )
 mperiod - new_period( minute=15 )
 numperday - (hend-hstart)/mperiod
 dtms - expand.grid( dt=pdates, tm=hstart + mperiod * seq( from=0,
 to=numperday ) )
 dtms$dtm - with( dtyms, dt + tm )
 dtms - dtms[ order( dtms$dtm ), ]

 You can discard everything but dtms$dtm once it has been created.


 On Mon, 28 May 2012, Shivam wrote:

  Thanks for the effort Michael, but the problem here is that the dates for
 which the sequences need to be created have gaps in between. Basically I
 need the sequence for only those days on which the security market is open
 (I have the dates in a file which is present at the end of THIS mail).

 What I have been able to do is to create a list, where each element of the
 list is a sequence for a single day. It was done like below:

 for(i in 1:124){
 seqtimes[[i]] = xts(,seq(as.POSIXct(paste(**fdates[i],'09:15:00', sep=
 )),
 as.POSIXct(paste(fdates[i],'**15:30:00', sep= )), by = 1))}

 where 'fdates' is the file which contains the dates for which the
 sequences
 need to be created.

 But now I am stuck. I need a way to get all these sequences in a
 (vector/dataframe/xts object) where all the list items are sequentially
 present.

 I tried merge.xts, but to no avail.

  seqtimes[[1]]

 Data:
 numeric(0)

 Index:
 POSIXct[1:22501], format: 2011-01-03 09:15:00 2011-01-03 09:15:01
 2011-01-03 09:15:02 2011-01-03 09:15:03 2011-01-03 09:15:04
 2011-01-03 09:15:05 ...

  seqtimes[[2]]

 Data:
 numeric(0)

 Index:
 POSIXct[1:22501], format: 2011-01-04 09:15:00 2011-01-04 09:15:01
 2011-01-04 09:15:02 2011-01-04 09:15:03 2011-01-04 09:15:04
 2011-01-04 09:15:05 ...

  tseq = merge.xts(seqtimes[[1]],**seqtimes[[2]], all = TRUE)
 tseq

 Data:
 numeric(0)

 Index:
 integer(0)

 Any help would be greatly appreciated.

 Thanks in advance,
 Regards,
 Shivam

 P.S. - The dput of the fdates file:

  dput(fdates)

 structure(c(2011-01-03, 2011-01-04, 2011-01-05, 2011-01-06,
 2011-01-07, 2011-01-10, 2011-01-11, 2011-01-12, 2011-01-13,
 2011-01-14, 2011-01-17, 2011-01-18, 2011-01-19, 2011-01-20,
 2011-01-21, 2011-01-24, 2011-01-25, 2011-01-27, 2011-01-28,
 2011-01-31, 2011-02-01, 2011-02-02, 2011-02-03, 2011-02-04,
 2011-02-07, 2011-02-08, 2011-02-09, 2011-02-10, 2011-02-11,
 2011-02-14, 2011-02-15, 2011-02-16, 2011-02-17, 2011-02-18,
 2011-02-21, 2011-02-22, 2011-02-23, 2011-02-24, 2011-02-25,
 2011-02-28, 2011-03-01, 2011-03-03, 2011-03-04, 2011-03-07,
 2011-03-08, 2011-03-09, 2011-03-10, 2011-03-11, 2011-03-14,
 2011-03-15, 2011-03-16, 2011-03-17, 2011-03-18, 2011-03-21,
 2011-03-22, 2011-03-23, 2011-03-24, 2011-03-25, 2011-03-28,
 2011-03-29, 2011-03-30, 2011-03-31, 2011-04-01, 2011-04-04,
 2011-04-05, 2011-04-06, 2011-04-07, 2011-04-08, 2011-04-11,
 2011-04-13, 2011-04-15, 2011-04-18, 2011-04-19, 2011-04-20,
 2011-04-21, 2011-04-25, 2011-04-26, 2011-04-27, 2011-04-28,
 2011-04-29, 2011-05-02, 2011-05-03, 2011-05-04, 2011-05-05,
 2011-05-06, 2011-05-09, 2011-05-10, 2011-05-11, 2011-05-12,
 2011-05-13, 2011-05-16, 2011-05-17, 2011-05-18, 2011-05-19,
 2011-05-20, 2011-05-23, 2011-05-24, 2011-05-25, 2011-05-26,
 2011-05-27, 2011-05-30, 2011-05-31, 2011-06-01, 2011-06-02,
 2011-06-03, 2011-06-06, 2011-06-07, 2011-06-08, 2011-06-09,
 2011-06-10, 2011-06-13, 2011-06-14, 2011-06-15, 2011-06-16,
 2011-06-17, 2011-06-20, 2011-06-21, 2011-06-22, 2011-06-23,
 2011-06-24, 2011-06-27, 2011-06-28, 2011-06-29, 2011-06-30
 ), .Dim = c(124L, 1L))



 On Sat, May 26, 2012 at 6:22 AM, R. Michael Weylandt 
 michael.weyla...@gmail.com michael.weyla...@gmail.com wrote:

  One (somewhat kludgy) way would be to use seq() to make one day's worth
 of
 times then to pass those to outer() to add in the needed days and then
 coerce the whole thing back to a sorted vector.

 I'm not at a computer right now so

Re: [R] Query about creating time sequences

2012-05-27 Thread Shivam
Its not the number of days per se, it is the random gaps between the dates
(corresponding to the dates on which the security market was closed) which
will be difficult to accommodate in the solution proposed by you. So I
would have to remove the sequence corresponding to those days from the
entire sequence. This was the part which I deemed as difficult to achieve.
I had mentioned this issue in my previous mails but you might have missed
it.

Thanks again for your time :)

Regards,
Shivam

On Mon, May 28, 2012 at 5:06 AM, Gabor Grothendieck ggrothendi...@gmail.com
 wrote:

 On Sun, May 27, 2012 at 7:03 PM, Shivam shivamsi...@gmail.com wrote:
  Thanks for the responses ppl.
 
  @Gabor - The issue with your approach was that I had to select the time
  window for many days (124), which would be very difficult to achieve. I
  really appreciate you time though.

 Why does the number of days make it difficult to achieve?  The
 number of days does not affect the code at all.  Is there some aspect
 of the problem you haven't mentioned?

 --
 Statistics  Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com




-- 
*Victoria Concordia Crescit*

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Query about creating time sequences

2012-05-25 Thread Shivam
Hi All,

I have a query about time based sequences. I know such questions have been
asked a lot on forums, but I couldnt find the exact thing that I was
looking for.

I want to create a time-based sequence which will mimic the trading window
AND would span multiple days. Something like below:

2011-01-03 09:15:00 IST
2011-01-03 09:15:01 IST



2011-01-03 15:29:59 IST
2011-01-03 15:30:00 IST
2011-01-04 09:15:00 IST
2011-01-04 09:15:01 IST



2011-01-04 15:29:59 IST
2011-01-04 15:30:00 IST

Kindly notice the change of date in the sequence.

The Indian Equity markets open at 09:15:00 and close at 15:30:00. I have
equity data that spans 124 days, and I need to create a corresponding
sequence which I will later use to regularize the irregular dataset to make
a regular time-series.

I was able to accomplish this task for a single day (i.e. creating a
sequence then merging my dataset with it and use na.locf to make my dataset
regular) but am unable to create a sequence for 'n' number of days. Can
anyone help me with this?

If it is of any help, I have a file which contains all the dates for which
I need the sequence. The dput of the file is placed at the end of the
email.

One option is to create sequences for the entire days and then later remove
all these records after merging. Although I havent checked the feasibility
of this method, it would be complex and more so it will increase the data
four folds (I already have 2 million records in the dataframe which I have
to make regular).

Another approach that I could think of was to make a timebased sequence
based on the date from the file and then use a loop to append one sequence
after another. But am not having much success there either.

Any kind of help would be greatly appreciated.

Thanks and regards,
Shivam

structure(list(20110103, 20110104, 20110105, 20110106,
20110107, 20110110, 20110111, 20110112, 20110113,
20110114, 20110117, 20110118, 20110119, 20110120,
20110121, 20110124, 20110125, 20110127, 20110128,
20110131, 20110201, 20110202, 20110203, 20110204,
20110207, 20110208, 20110209, 20110210, 20110211,
20110214, 20110215, 20110216, 20110217, 20110218,
20110221, 20110222, 20110223, 20110224, 20110225,
20110228, 20110301, 20110303, 20110304, 20110307,
20110308, 20110309, 20110310, 20110311, 20110314,
20110315, 20110316, 20110317, 20110318, 20110321,
20110322, 20110323, 20110324, 20110325, 20110328,
20110329, 20110330, 20110331, 20110401, 20110404,
20110405, 20110406, 20110407, 20110408, 20110411,
20110413, 20110415, 20110418, 20110419, 20110420,
20110421, 20110425, 20110426, 20110427, 20110428,
20110429, 20110502, 20110503, 20110504, 20110505,
20110506, 20110509, 20110510, 20110511, 20110512,
20110513, 20110516, 20110517, 20110518, 20110519,
20110520, 20110523, 20110524, 20110525, 20110526,
20110527, 20110530, 20110531, 20110601, 20110602,
20110603, 20110606, 20110607, 20110608, 20110609,
20110610, 20110613, 20110614, 20110615, 20110616,
20110617, 20110620, 20110621, 20110622, 20110623,
20110624, 20110627, 20110628, 20110629, 20110630), .Dim =
c(124L,
1L), .Dimnames = list(c(X1, X2, X3, X4, X5, X6, X7,
X8, X9, X10, X11, X12, X13, X14, X15, X16,
X17, X18, X19, X20, X21, X22, X23, X24, X25,
X26, X27, X28, X29, X30, X31, X32, X33, X34,
X35, X36, X37, X38, X39, X40, X41, X42, X43,
X44, X45, X46, X47, X48, X49, X50, X51, X52,
X53, X54, X55, X56, X57, X58, X59, X60, X61,
X62, X63, X64, X65, X66, X67, X68, X69, X70,
X71, X72, X73, X74, X75, X76, X77, X78, X79,
X80, X81, X82, X83, X84, X85, X86, X87, X88,
X89, X90, X91, X92, X93, X94, X95, X96, X97,
X98, X99, X100, X101, X102, X103, X104, X105,
X106, X107, X108, X109, X110, X111, X112, X113,
X114, X115, X116, X117, X118, X119, X120, X121,
X122, X123, X124), NULL))

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Query regarding date as argument in functions - and about sqldf

2012-05-12 Thread Shivam
Hi,

I have a query about sqldf, and dates in general. I couldnt find much on
the net or on the forums, hence I am here. Here is the issue:

I want to write a function that accepts 3 arguments: date1, date2 and a
dataframe, say 'df'. Within the function, I want to populate a temp
dataframe which essentially contains the output of the query select * from
df where DATE between date1 and date2. DATE is a column (of class Date)
which will be present in the input dataframe. This is how the function will
look like:

testfun = function(date1, date2, df){
temp = sqldf(select * from df where DATE between 'date1' and 'date2')
return(temp) #not sure if I'm doing this correctly
}

There are multiple issues here:

One is, how do I accept date as a parameter? Do I need to change the class
of date1 and date2 at input level or within the sql query? In general how
to format the input and how to return the temp dataframe once the function
ends, kindly help. It may be basic but I've spent hours on it with no
success.

The above was the bigger picture. Another issue that I faced is when
running a simple select (using sqldf) on a sample dataframe, I could not
get a result. But when I use 'subset', I get an output.

Below is the sample dataset named 'dlfhead':

structure(list(NAME = c(DLF, DLF, DLF, DLF, DLF, DLF
), EXPDATE = structure(c(15001, 15001, 15001, 15001, 15001, 15001
), class = Date), STRIKE = c(280, 280, 280, 280, 280, 280),
TMSTMP = c(14:54:17, 14:42:38, 14:09:04, 14:20:25,
12:27:26, 11:59:36), PRICE = c(22, 23, 23, 23, 23, 23.2
), DATE = structure(c(14977, 14977, 14977, 14977, 14977,
14977), class = Date), TTE = c(24, 24, 24, 24, 24, 24),
DTTM = structure(c(1294046657, 1294045958, 1294043944, 1294044625,
1294037846, 1294036176), class = c(POSIXct, POSIXt)),
SPOT = c(295, 294.5, 294.8, 293.85, 294.7, 294.55), MONEYNESS =
c(1.05357142857143,
1.05178571428571, 1.05285714285714, 1.04946428571429, 1.0525,
1.05196428571429), TTEBY7 = c(4, 4, 4, 4, 4, 4), COMB =
c(2011-01-034,
2011-01-034, 2011-01-034, 2011-01-034, 2011-01-034,
2011-01-034), RATE = c(7.55219988366433, 7.55260732560891,
7.55302228192484, 7.55344470721609, 7.55387455629645, 7.82914555610361
)), .Names = c(NAME, EXPDATE, STRIKE, TMSTMP, PRICE,
DATE, TTE, DTTM, SPOT, MONEYNESS, TTEBY7, COMB,
RATE), row.names = c(NA, 6L), class = data.frame)

Below are the commands and their outputs:

 dlftest = sqldf(select * from dlfhead where DATE = '2011-01-03')
 nrow(dlftest)
[1] 0

Kindly note that in the above, the query does complete successfully with
'dlftest' created with 0 rows. When I run the select on a column with class
other than date I get the output.

 dlftest = sqldf(select * from dlfhead where TTE=24)
 nrow(dlftest)
[1] 6

But when I use the 'subset' command, I do get an answer:

 dlftest = subset(dlfhead, DATE == '2011-01-03')
 nrow(dlftest)
[1] 6

So the issue has to be the date-class columns in the dataframe, but what is
the issue I have no idea. Can anyone help please?

Thanks a lot for reading though this long email. Kindly reply if you have
an idea about either issue.

Regards,
Shivam

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with SQLDF - Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such table:

2012-05-10 Thread Shivam
Thanks Gabor, Jim, POSIXct is working fine :)

Regards,
Shivam

On Thu, May 10, 2012 at 5:22 AM, Gabor Grothendieck ggrothendi...@gmail.com
 wrote:

 On Wed, May 9, 2012 at 7:24 PM, Shivam shivamsi...@gmail.com wrote:
  Ohh ... Thanks Gabor. I have a few related queries then, kindly have a
 look:
 
  1. Does it only hinder the sqldf package or are there other issues with
  using POSIXlt in a dataframe? Am asking because I have a few dataframes
 with
  columns of class(POSIXlt ).
 
  2. I have columns containing 'date+timestamp', something like '2011-01-03
  09:07:07' which are of class POSIXlt. I need to perform some arithmetic
  operations on these columns. Which class would be most appropriate for
 such
  kind of data?
 

 Its not just sqldf.  You will have other problems too if you put
 POSIXlt objects in data frames too.

 See R News 4/1.

 --
 Statistics  Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com




-- 
*Victoria Concordia Crescit*

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Problem with SQLDF - Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such table:

2012-05-09 Thread Shivam
Hi All,

I am having trouble executing SQL statements on a few dataframes, but the
funny thing is that I am able to execute the statement on some other
dataframes.

To test, I have 2 very small dataframes (6 rows and some columns). One is
'lessliq', the dput is given below.

 dput(head(lessliq))
structure(list(V1 = c(50464677L, 50464846L, 50432581L, 50426614L,
50504329L, 50504735L), V2 = c(TATASTEEL, TATASTEEL, TATASTEEL,
TATASTEEL, TATASTEEL, TATASTEEL), V3 = c(OPTSTK, OPTSTK,
OPTSTK, OPTSTK, OPTSTK, OPTSTK), V4 = structure(c(15029,
15029, 15029, 15029, 15029, 15029), class = Date), V5 = c(CE,
CE, CE, CE, CE, CE), V6 = c(0L, 0L, 0L, 0L, 0L, 0L),
V7 = c(700, 700, 700, 700, 700, 700), V8 = c(14:15:45,
14:15:51, 13:51:12, 13:45:13, 14:39:53, 14:40:08
), V9 = c(37, 37, 37.75, 37, 37.5, 37.5), V10 = c(500L, 500L,
500L, 500L, 2000L, 500L), V11 = structure(c(14977, 14977,
14977, 14977, 14977, 14977), class = Date), V12 = c(52,
52, 52, 52, 52, 52)), .Names = c(V1, V2, V3, V4,
V5, V6, V7, V8, V9, V10, V11, V12), row.names = c(NA,
6L), class = data.frame)

I run the below command:

 new2 = sqldf(select * from lessliq)
This works fine.

But on many other dataframes it is not working. I have a dataframe
'testeq'. dput given below:

 dput(head(testeq))
structure(list(NAME = c(DLF, DLF, DLF, DLF, DLF, DLF
), TMSTMP = c(09:07:07, 09:15:03, 09:15:03, 09:15:03,
09:15:03, 09:15:04), PRICE = c(295, 294.5, 293.9, 294.9,
295, 294.5), DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c(2011-01-03,
2011-01-04, 2011-01-05, 2011-01-06, 2011-01-07, 2011-01-10,
2011-01-11, 2011-01-12, 2011-01-13, 2011-01-14, 2011-01-17,
2011-01-18, 2011-01-19, 2011-01-20, 2011-01-21, 2011-01-24,
2011-01-25, 2011-01-27, 2011-01-28, 2011-01-31, 2011-02-01,
2011-02-02, 2011-02-03, 2011-02-04, 2011-02-07, 2011-02-08,
2011-02-09, 2011-02-10, 2011-02-11, 2011-02-14, 2011-02-15,
2011-02-16, 2011-02-17, 2011-02-18, 2011-02-21, 2011-02-22,
2011-02-23, 2011-02-24, 2011-02-25, 2011-02-28, 2011-03-01,
2011-03-03, 2011-03-04, 2011-03-07, 2011-03-08, 2011-03-09,
2011-03-10, 2011-03-11, 2011-03-14, 2011-03-15, 2011-03-16,
2011-03-17, 2011-03-18, 2011-03-21, 2011-03-22, 2011-03-23,
2011-03-24, 2011-03-25, 2011-03-28, 2011-03-29, 2011-03-30,
2011-03-31, 2011-04-01, 2011-04-04, 2011-04-05, 2011-04-06,
2011-04-07, 2011-04-08, 2011-04-11, 2011-04-13, 2011-04-15,
2011-04-18, 2011-04-19, 2011-04-20, 2011-04-21, 2011-04-25,
2011-04-26, 2011-04-27, 2011-04-28, 2011-04-29, 2011-05-02,
2011-05-03, 2011-05-04, 2011-05-05, 2011-05-06, 2011-05-09,
2011-05-10, 2011-05-11, 2011-05-12, 2011-05-13, 2011-05-16,
2011-05-17, 2011-05-18, 2011-05-19, 2011-05-20, 2011-05-23,
2011-05-24, 2011-05-25, 2011-05-26, 2011-05-27, 2011-05-30,
2011-05-31, 2011-06-01, 2011-06-02, 2011-06-03, 2011-06-06,
2011-06-07, 2011-06-08, 2011-06-09, 2011-06-10, 2011-06-13,
2011-06-14, 2011-06-15, 2011-06-16, 2011-06-17, 2011-06-20,
2011-06-21, 2011-06-22, 2011-06-23, 2011-06-24, 2011-06-27,
2011-06-28, 2011-06-29, 2011-06-30), class = factor),
DTTM = structure(list(sec = c(7, 3, 3, 3, 3, 4), min = c(7L,
15L, 15L, 15L, 15L, 15L), hour = c(9L, 9L, 9L, 9L, 9L, 9L
), mday = c(3L, 3L, 3L, 3L, 3L, 3L), mon = c(0L, 0L, 0L,
0L, 0L, 0L), year = c(111L, 111L, 111L, 111L, 111L, 111L),
wday = c(1L, 1L, 1L, 1L, 1L, 1L), yday = c(2L, 2L, 2L,
2L, 2L, 2L), isdst = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = c(sec,
min, hour, mday, mon, year, wday, yday, isdst
), class = c(POSIXlt, POSIXt))), .Names = c(NAME, TMSTMP,
PRICE, DATE, DTTM), row.names = c(NA, 6L), class = data.frame)

I run the above command again but this time I get an error:

 new3 = sqldf(select * from testeq)
Error in sqliteExecStatement(con, statement, bind.data) :
  RS-DBI driver: (error in statement: no such table: testeq)
In addition: Warning message:
In value[[3L]](cond) : RAW() can only be applied to a 'raw', not a 'double'

Can anyone guide me if there is any difference in the structure of the two
dataframes or what else can be the issue?

Thanks in advance,

Regards,
Shivam Singh

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Problem with SQLDF - Error in sqliteExecStatement(con, statement, bind.data) : RS-DBI driver: (error in statement: no such table:

2012-05-09 Thread Shivam
Ohh ... Thanks Gabor. I have a few related queries then, kindly have a look:

1. Does it only hinder the sqldf package or are there other issues with
using POSIXlt in a dataframe? Am asking because I have a few dataframes
with columns of class(POSIXlt ).

2. I have columns containing 'date+timestamp', something like '2011-01-03
09:07:07' which are of class POSIXlt. I need to perform some arithmetic
operations on these columns. Which class would be most appropriate for such
kind of data?

Thanks and Regards,
Shivam

On Thu, May 10, 2012 at 4:45 AM, Gabor Grothendieck ggrothendi...@gmail.com
 wrote:

 On Wed, May 9, 2012 at 5:58 PM, Shivam shivamsi...@gmail.com wrote:
  Hi All,
 
  I am having trouble executing SQL statements on a few dataframes, but the
  funny thing is that I am able to execute the statement on some other
  dataframes.
 

 Never put POSIXlt objects into data frames.

 --
 Statistics  Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com




-- 
*Victoria Concordia Crescit*

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Query about memory used in list and dataframe

2012-05-05 Thread Shivam
Hi,

I had a query regarding which object, a list or a dataframe, consumes more
R memory. Let me clarify this:

For example, I have a df of 6 rows and 12 columns, say 'test'. I do
object.size() and find it uses 3.3 KB of memory.

I run a loop and make a list, say 'testlist', of 6 elements, each element
being the above mentioned df 'test'. The size of this list is 19.9 KB,
understandably.

Now I combine this list into a dataframe using rbind. The df formed has 12
cols and 36 rows. The size of this df is just 5.8 KB, almost a 75%
reduction in memory.

I had to work with a much larger list, and I thought of using the same
method to convert my bigger list (62 dataframes, each having 4 cols and
close to 200,000 rows) into a single dataframe. The big list, sat LIST A,
had a size of 571 MB. But when I convert it into a dataframe, say DF A,
using rbind, the object size increases to 1.35 GB. This was in
contradiction to the earlier result.

What am I missing? Why a 75% reduction in size in one case and double size
in other? Anyone with any explanation?

Sorry for the verbose email, just wanted to make my case clear.

Thanks in advance,
Regards
Shivam

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to insert filename as column in a file

2012-04-25 Thread Shivam
Thanks Jeff. I had tried the 'list' approach as well but got stuck with the
below error:

Error in `$-.data.frame`(`*tmp*`, date, value = 20100701) :
  replacement has 1 rows, data has 0

Couldnt find a work around to this, hence resorted to the multiple
dataframes approach. Any insights into this?


On Tue, Apr 24, 2012 at 9:51 PM, Jeff Newmiller jdnew...@dcn.davis.ca.uswrote:

 Programatically dealing with large numbers of separately-named objects
 leads to syntactically complicated code that is hard to read and maintain.

 Load the data frames into a list so you can access them by numeric or
 named index, and then getting at the loaded data will be much easier.

 fnames = list.files(path = getwd())
 # preallocating the list for efficiency (execution speed)
 dtalist - vector( list, length(fnames) )
 for (i in seq_len(length(fnames))){
  dtalist[[i]] - read.csv.sql(fnames[i], sql = select * from file where
 V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol =\n))
  dtalist[[i]]$date -  substr(fnames[i],1,8))
 }
 names(dtalist) - fnames
 # now you can optionally refer to dtalist$file20120424.csv or
 dtalist[[file20120424]] if you wish.
 ---
 Jeff NewmillerThe .   .  Go Live...
 DCN:jdnew...@dcn.davis.ca.usBasics: ##.#.   ##.#.  Live
 Go...
  Live:   OO#.. Dead: OO#..  Playing
 Research Engineer (Solar/BatteriesO.O#.   #.O#.  with
 /Software/Embedded Controllers)   .OO#.   .OO#.  rocks...1k
 ---
 Sent from my phone. Please excuse my brevity.



 Shivam shivamsi...@gmail.com wrote:

 Reposting in hope of a reply.
 
 On Tue, Apr 24, 2012 at 1:12 AM, Shivam shivamsi...@gmail.com wrote:
 
  Thanks for the quick response. It works for an individual dataframe,
 but I
  have many dataframes. This is the code so far
 
  fnames = list.files(path = getwd())
  for (i in 1:length(fnames)){
  assign(paste(file,i,sep=),read.csv.sql(fnames[i], sql = select *
 from
  file where V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol =
 \n))
  }
 
  This generates dataframes named as as file1,file2,...,file250. Is
 there a
  way to do something like below within the same loop?
 
  file1$date = substr(fnames[1],1,8))
  file2$date = substr(fnames[2],1,8))
  .
  .
  file250$date = substr(fnames[250],1,8))
 
  assign(paste(file,i,sep=)$date doesnt work.
 
  Any help?
 
 
 
 
 
  On Tue, Apr 24, 2012 at 12:01 AM, MacQueen, Don
 macque...@llnl.govwrote:
 
  This little example might help.
 
   foo - data.frame(a=1:10, b=letters[1:0])
   foo
 a b
  1   1 a
  2   2 a
  3   3 a
  4   4 a
  5   5 a
  6   6 a
  7   7 a
  8   8 a
  9   9 a
  10 10 a
   foo$date - '20120423'
   foo
 a b date
  1   1 a 20120423
  2   2 a 20120423
  3   3 a 20120423
  4   4 a 20120423
  5   5 a 20120423
  6   6 a 20120423
  7   7 a 20120423
  8   8 a 20120423
  9   9 a 20120423
  10 10 a 20120423
 
 
  In other words, immediately after reading the data into a data
 frame, add
  a date column as in the example. You'll have to extract the date
 from the
  filename, of course.
 
  -Don
 
 
  --
  Don MacQueen
 
  Lawrence Livermore National Laboratory
  7000 East Ave., L-627
  Livermore, CA 94550
  925-423-1062
 
 
 
 
 
  On 4/23/12 9:29 AM, Shivam shivamsi...@gmail.com wrote:
 
  Hi,
  
  I am relatively new to R. Have scourged the help files and the www
 but
  havent been able to get a solution.
  
  I have around 250 csv files, one file for each date. They have
 columns of
  all types, numeric, string etc. The name of each file is the date
 in the
  form of 'mmdd'. There is no column within the file which helps
 me
  identify the date on which the file was generated, only the
 filename has
  that info.
  
  I am selecting some data (using read.csv.sql) from each file and
 creating
  a
  dataset for each day. Ultimately I will combine all the datasets. I
 can
  accomplish the select and combine part, but after combining I wont
 have a
  record as to the date corresponding to the data.
  
  Hence I want to insert the filename as a column in the respective
 file to
  help me in identifying to what date each data row belongs to.
  
  Sorry for the long mail, but wanted to make myself clear. Any help
 would
  be
  greatly appreciated.
  
  Thanks in advance,
  Shivam
  
 [[alternative HTML version deleted]]
  
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
  http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
 
 
  --
  *Victoria Concordia Crescit*
 
 
 
 
 --
 *Victoria Concordia Crescit*
 
[[alternative HTML version deleted]]
 
 __
 R

Re: [R] How to insert filename as column in a file

2012-04-24 Thread Shivam
Reposting in hope of a reply.

On Tue, Apr 24, 2012 at 1:12 AM, Shivam shivamsi...@gmail.com wrote:

 Thanks for the quick response. It works for an individual dataframe, but I
 have many dataframes. This is the code so far

 fnames = list.files(path = getwd())
 for (i in 1:length(fnames)){
 assign(paste(file,i,sep=),read.csv.sql(fnames[i], sql = select * from
 file where V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol = \n))
 }

 This generates dataframes named as as file1,file2,...,file250. Is there a
 way to do something like below within the same loop?

 file1$date = substr(fnames[1],1,8))
 file2$date = substr(fnames[2],1,8))
 .
 .
 file250$date = substr(fnames[250],1,8))

 assign(paste(file,i,sep=)$date doesnt work.

 Any help?





 On Tue, Apr 24, 2012 at 12:01 AM, MacQueen, Don macque...@llnl.govwrote:

 This little example might help.

  foo - data.frame(a=1:10, b=letters[1:0])
  foo
a b
 1   1 a
 2   2 a
 3   3 a
 4   4 a
 5   5 a
 6   6 a
 7   7 a
 8   8 a
 9   9 a
 10 10 a
  foo$date - '20120423'
  foo
a b date
 1   1 a 20120423
 2   2 a 20120423
 3   3 a 20120423
 4   4 a 20120423
 5   5 a 20120423
 6   6 a 20120423
 7   7 a 20120423
 8   8 a 20120423
 9   9 a 20120423
 10 10 a 20120423


 In other words, immediately after reading the data into a data frame, add
 a date column as in the example. You'll have to extract the date from the
 filename, of course.

 -Don


 --
 Don MacQueen

 Lawrence Livermore National Laboratory
 7000 East Ave., L-627
 Livermore, CA 94550
 925-423-1062





 On 4/23/12 9:29 AM, Shivam shivamsi...@gmail.com wrote:

 Hi,
 
 I am relatively new to R. Have scourged the help files and the www but
 havent been able to get a solution.
 
 I have around 250 csv files, one file for each date. They have columns of
 all types, numeric, string etc. The name of each file is the date in the
 form of 'mmdd'. There is no column within the file which helps me
 identify the date on which the file was generated, only the filename has
 that info.
 
 I am selecting some data (using read.csv.sql) from each file and creating
 a
 dataset for each day. Ultimately I will combine all the datasets. I can
 accomplish the select and combine part, but after combining I wont have a
 record as to the date corresponding to the data.
 
 Hence I want to insert the filename as a column in the respective file to
 help me in identifying to what date each data row belongs to.
 
 Sorry for the long mail, but wanted to make myself clear. Any help would
 be
 greatly appreciated.
 
 Thanks in advance,
 Shivam
 
[[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




 --
 *Victoria Concordia Crescit*




-- 
*Victoria Concordia Crescit*

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] How to insert filename as column in a file

2012-04-23 Thread Shivam
Hi,

I am relatively new to R. Have scourged the help files and the www but
havent been able to get a solution.

I have around 250 csv files, one file for each date. They have columns of
all types, numeric, string etc. The name of each file is the date in the
form of 'mmdd'. There is no column within the file which helps me
identify the date on which the file was generated, only the filename has
that info.

I am selecting some data (using read.csv.sql) from each file and creating a
dataset for each day. Ultimately I will combine all the datasets. I can
accomplish the select and combine part, but after combining I wont have a
record as to the date corresponding to the data.

Hence I want to insert the filename as a column in the respective file to
help me in identifying to what date each data row belongs to.

Sorry for the long mail, but wanted to make myself clear. Any help would be
greatly appreciated.

Thanks in advance,
Shivam

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to insert filename as column in a file

2012-04-23 Thread Shivam
Thanks for the quick response. It works for an individual dataframe, but I
have many dataframes. This is the code so far

fnames = list.files(path = getwd())
for (i in 1:length(fnames)){
assign(paste(file,i,sep=),read.csv.sql(fnames[i], sql = select * from
file where V3 == 'XXX' and V5=='YYY',header = FALSE, sep= '|', eol = \n))
}

This generates dataframes named as as file1,file2,...,file250. Is there a
way to do something like below within the same loop?

file1$date = substr(fnames[1],1,8))
file2$date = substr(fnames[2],1,8))
.
.
file250$date = substr(fnames[250],1,8))

assign(paste(file,i,sep=)$date doesnt work.

Any help?





On Tue, Apr 24, 2012 at 12:01 AM, MacQueen, Don macque...@llnl.gov wrote:

 This little example might help.

  foo - data.frame(a=1:10, b=letters[1:0])
  foo
a b
 1   1 a
 2   2 a
 3   3 a
 4   4 a
 5   5 a
 6   6 a
 7   7 a
 8   8 a
 9   9 a
 10 10 a
  foo$date - '20120423'
  foo
a b date
 1   1 a 20120423
 2   2 a 20120423
 3   3 a 20120423
 4   4 a 20120423
 5   5 a 20120423
 6   6 a 20120423
 7   7 a 20120423
 8   8 a 20120423
 9   9 a 20120423
 10 10 a 20120423


 In other words, immediately after reading the data into a data frame, add
 a date column as in the example. You'll have to extract the date from the
 filename, of course.

 -Don


 --
 Don MacQueen

 Lawrence Livermore National Laboratory
 7000 East Ave., L-627
 Livermore, CA 94550
 925-423-1062





 On 4/23/12 9:29 AM, Shivam shivamsi...@gmail.com wrote:

 Hi,
 
 I am relatively new to R. Have scourged the help files and the www but
 havent been able to get a solution.
 
 I have around 250 csv files, one file for each date. They have columns of
 all types, numeric, string etc. The name of each file is the date in the
 form of 'mmdd'. There is no column within the file which helps me
 identify the date on which the file was generated, only the filename has
 that info.
 
 I am selecting some data (using read.csv.sql) from each file and creating
 a
 dataset for each day. Ultimately I will combine all the datasets. I can
 accomplish the select and combine part, but after combining I wont have a
 record as to the date corresponding to the data.
 
 Hence I want to insert the filename as a column in the respective file to
 help me in identifying to what date each data row belongs to.
 
 Sorry for the long mail, but wanted to make myself clear. Any help would
 be
 greatly appreciated.
 
 Thanks in advance,
 Shivam
 
[[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




-- 
*Victoria Concordia Crescit*

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.