Re: [R] Controlling number of numbers before R rewrites to +e18 etc

2010-10-25 Thread ZeMajik
Thanks Jim, but I still have the problem that the pre-processing becomes way
too computationally expensive. R seems to handle characters and factors much,
much worse than numeric IDs. I don't have enough RAM to even write the file
when the IDs are treated as characters instead of numeric values!

Does anyone have any other ideas? Is it not possible to tell R not to rewrite
the IDs upon import? It wouldn't matter if R wrote the correct IDs to
the exported csv file, but it exports the abbreviated version, which is of no
use.

Mike


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Controlling number of numbers before R rewrites to +e18 etc

2010-10-25 Thread jim holtman
You can always read a portion of the file and then write it out.  For
large files, I will read in 10,000 lines, fix them up, write them out,
and then go back and process the next batch of lines.  You haven't
shown us what a sample of your input/output looks like, or how you are
processing it.  Depending on what type of preprocessing needs to be
done to the data, Perl is also an option.  But most things I used to
use Perl for, I can do within R these days.

Here is an example of reading in your IDs:

> x <- read.table(textConnection("1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543
+ 1234567890123456789012 987654321234567898765432 98765432123456789876543")
+ , colClasses = rep('character', 3))
> closeAllConnections()
> str(x)
'data.frame':   7 obs. of  3 variables:
 $ V1: chr  "1234567890123456789012" "1234567890123456789012" "1234567890123456789012" "1234567890123456789012" ...
 $ V2: chr  "987654321234567898765432" "987654321234567898765432" "987654321234567898765432" "987654321234567898765432" ...
 $ V3: chr  "98765432123456789876543" "98765432123456789876543" "98765432123456789876543" "98765432123456789876543" ...
> x
                      V1                       V2                      V3
1 1234567890123456789012 987654321234567898765432 98765432123456789876543
2 1234567890123456789012 987654321234567898765432 98765432123456789876543
3 1234567890123456789012 987654321234567898765432 98765432123456789876543
4 1234567890123456789012 987654321234567898765432 98765432123456789876543
5 1234567890123456789012 987654321234567898765432 98765432123456789876543
6 1234567890123456789012 987654321234567898765432 98765432123456789876543
7 1234567890123456789012 987654321234567898765432 98765432123456789876543
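
The batch approach mentioned at the top of this message (read a block of lines, fix it up, write it out, repeat) might look roughly like the sketch below. This is only an illustration under assumptions: the two-column layout, the column names `id` and `value`, the temporary files, and the doubling step are all made up, since the real data and processing were never posted. The chunk size is kept small so the example runs instantly.

```r
# Hypothetical input: 25 rows of "id value", written to a temp file
# so the sketch is self-contained.
in_file  <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".csv")
ids <- sprintf("12345678901234567890%02d", 1:25)   # 22-character IDs
writeLines(paste(ids, seq_along(ids)), in_file)

chunk_size  <- 10    # 10,000 in the message; small here for illustration
con         <- file(in_file, open = "r")
first_chunk <- TRUE

repeat {
  # Read the next batch of raw lines; an empty result means end of file.
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break

  # Parse the batch, keeping the ID column as character.
  chunk <- read.table(text = lines,
                      colClasses = c("character", "numeric"),
                      col.names  = c("id", "value"))

  # Placeholder for the real pre-processing step (an assumption).
  chunk$value <- chunk$value * 2

  # Append to the output file, writing the header only once.
  write.table(chunk, out_file, sep = ",", quote = FALSE,
              row.names = FALSE, col.names = first_chunk,
              append = !first_chunk)
  first_chunk <- FALSE
}
close(con)
```

Only one chunk is held in memory at a time, so the character IDs never have to fit in RAM all at once.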




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



[R] Controlling number of numbers before R rewrites to +e18 etc

2010-10-22 Thread ZeMajik
Hey,

I'm using R as a pre-processor for a large dataset with IDs which are
numeric (but have no numeric meaning, so they can be seen as factors).
I do some data formatting and then write it out to a csv file.

However, the problem is that the IDs are very long, 18-22 characters to be
precise. R is constantly rewriting these IDs to the abbreviated +eX form, which
prevents me from exporting the data to the csv, since the IDs are no longer
intact.
I've tried telling R that the ID column is a factor, but this results in two
problems: 1) Since I have millions of rows and R is slower handling factors
than numbers, my computer can't run the process in any reasonable time,
and 2) some IDs STILL seem to be rewritten somehow. The second point made me
believe that perhaps R is rewriting upon import?

Does anyone have any tips on how to solve this problem?

Thanks,
Mike
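
For anyone reading along, the symptom can be reproduced with a small sketch like the one below. The two IDs are made-up examples, not taken from the actual dataset. It shows what happens once a 22-digit ID is stored as a number: it becomes a double-precision value, the default text conversion switches to the e+ form, and distinct IDs can collapse into the same value.

```r
# Two hypothetical 22-digit IDs that differ only in the last digit.
id_a <- "1234567890123456789012"
id_b <- "1234567890123456789013"

# Converting to numeric stores each ID as a double-precision float.
num_a <- as.numeric(id_a)
num_b <- as.numeric(id_b)

# The default conversion back to text gives the abbreviated e+ form.
print(as.character(num_a))

# Forcing fixed notation prints 22 digits again, but the trailing
# digits no longer match the original ID.
fixed_a <- sprintf("%.0f", num_a)
print(fixed_a)
print(identical(fixed_a, id_a))

# The two different IDs are now indistinguishable.
print(num_a == num_b)
```

So the digits are already lost at the moment the column is parsed as numeric, which is consistent with the suspicion that the rewriting happens on import rather than on export.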



Re: [R] Controlling number of numbers before R rewrites to +e18 etc

2010-10-22 Thread jim holtman
Your best bet is to make sure that you read the IDs in as characters.
If they are being read in as floating point numbers, then there are
only about 15 significant digits of accuracy, so if you have IDs of
18-22 digits, you will lose the trailing digits.  So if you are using
read.table, then look at colClasses to see how to do this.

Provide a subset of your data and the statements that you are using to
read in the data.
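
As a minimal sketch of the colClasses suggestion, assuming a comma-separated file with a long `id` column and one ordinary numeric column (both column names and the sample rows are invented for illustration), naming the ID column in colClasses reads only that column as character and leaves the rest numeric:

```r
# Hypothetical sample: a long ID column plus one ordinary numeric column.
csv_text <- "id,value
1234567890123456789012,1.5
987654321234567898765432,2.5"

# Name the columns in colClasses so only the ID is read as character;
# the other column is still parsed as numeric.
dat <- read.csv(text = csv_text,
                colClasses = c(id = "character", value = "numeric"))
str(dat)

# Writing the data back out keeps every digit of the IDs.
out_file <- tempfile(fileext = ".csv")
write.csv(dat, out_file, row.names = FALSE, quote = FALSE)
readLines(out_file)
```

Because the IDs never pass through a numeric type, the exported file contains them exactly as they appeared in the input.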


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
