Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread jim holtman
Try reading the lines in with readLines, then count the number of each type of
quote in each line. Find the lines where either count is odd and investigate.
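
A minimal sketch of that check (file_name as in the original post; note that
gregexpr returns -1 when there is no match, so matches are counted as
positions > 0):

count_char <- function(x, ch)
  vapply(gregexpr(ch, x, fixed = TRUE), function(m) sum(m > 0), integer(1))
lines <- readLines(file_name)
odd <- count_char(lines, '"') %% 2 == 1 | count_char(lines, "'") %% 2 == 1
which(odd)  # line numbers whose quote counts are odd - investigate these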

On Mon, Apr 8, 2024, 15:24 Dave Dixon  wrote:

> I solved the mystery, but not the problem. The problem is that there's
> an unclosed quote somewhere in those 5 additional records I'm trying to
> access. So read.csv is reading million-character fields. It's slow at
> that. That mystery solved.
>
> However, the problem persists: how to fix what is obvious to the
> naked eye - a quote not adjacent to a comma - but that read.csv can't
> handle. readLines followed by read.csv(text= ) works great because, in
> that case, read.csv knows where the record terminates. Meaning, read.csv
> throws an exception that I can catch and handle with a quick and clean
> regular expression.
>
> Thanks, I'll take a look at vroom.
>
> -dave
>
> On 4/8/24 09:18, Stevie Pederson wrote:
> > Hi Dave,
> >
> > That's rather frustrating. I've found vroom (from the package vroom)
> > to be helpful with large files like this.
> >
> > Does the following give you any better luck?
> >
> > vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
> >
> > Of course, when you know you've got errors & the files are big like
> > that it can take a bit of work resolving things. The command line
> > tools awk & sed might even be a good plan for finding lines that have
> > errors & figuring out a fix, but I certainly don't envy you.
> >
> > All the best
> >
> > Stevie
> >
> > On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
> >
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I know
> > that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with
> > read.csv(text
> > = ...) is really slow. I know that the first 2459465 records are
> > good.
> > So I try this:
> >
> >  > startTime <- Sys.time()
> >  > first_records <- read.csv(file_name, nrows = 2459465)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time =   24.12598
> >
> >  > startTime <- Sys.time()
> >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of magnitude
> > longer
> > than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >


Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Eberhard W Lisse
I find qsv (a command-line CSV toolkit) very helpful.

el

On 08/04/2024 22:21, Dave Dixon wrote:
> I solved the mystery, but not the problem. The problem is that
> there's an unclosed quote somewhere in those 5 additional records I'm
> trying to access. So read.csv is reading million-character fields.
> It's slow at that. That mystery solved.
> 
> However, the problem persists: how to fix what is obvious to the
> naked eye - a quote not adjacent to a comma - but that read.csv
> can't handle. readLines followed by read.csv(text= ) works great
> because, in that case, read.csv knows where the record terminates.
> Meaning, read.csv throws an exception that I can catch and handle
> with a quick and clean regular expression.
> 
> Thanks, I'll take a look at vroom.
[...]



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
Right, I meant to add header=FALSE. And it now looks like the next line 
is the one with the unclosed quote, so read.csv is trying to read 
million-character headers!


On 4/8/24 12:42, Ivan Krylov wrote:

On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon wrote:


  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)

It may or may not be important that read.csv defaults to header =
TRUE. Having skipped 2459465 lines, it may attempt to parse the next
one as a header, so the second call read.csv() should probably include
header = FALSE.

Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.





Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
Good suggestion - I'll look into data.table.

On 4/8/24 12:14, CALUM POLWART wrote:
> data.table's fread is also fast. Not sure about error handling. But I 
> can merge 300 csvs with a total of 0.5m lines and 50 columns in a 
> couple of minutes versus a lifetime with read.csv or readr::read_csv
>
>
>
> On Mon, 8 Apr 2024, 16:19 Stevie Pederson, 
>  wrote:
>
> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package
> vroom) to be
> helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big
> like that it
> can take a bit of work resolving things. The command line tools
> awk & sed
> might even be a good plan for finding lines that have errors &
> figuring out
> a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
>
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I
> know that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with
> read.csv(text
> > = ...) is really slow. I know that the first 2459465 records are
> good.
> > So I try this:
> >
> >  > startTime <- Sys.time()
> >  > first_records <- read.csv(file_name, nrows = 2459465)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time =   24.12598
> >
> >  > startTime <- Sys.time()
> >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of
> magnitude longer
> > than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >


Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
Thanks, yeah, I think scan is more promising. I'll check it out.

On 4/8/24 11:49, Bert Gunter wrote:
> No idea, but have you tried using ?scan to read those next 5 rows? It 
> might give you a better idea of the pathologies that are causing 
> problems. For example, an unmatched quote might result in some huge 
> number of characters trying to be read into a single element of a 
> character variable. As your previous respondent said, resolving such 
> problems can be a challenge.
>
> Cheers,
> Bert
>
>
>
> On Mon, Apr 8, 2024 at 8:06 AM Dave Dixon  wrote:
>
> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know
> that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with
> read.csv(text
> = ...) is really slow. I know that the first 2459465 records are
> good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude
> longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>


Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon
I solved the mystery, but not the problem. The problem is that there's 
an unclosed quote somewhere in those 5 additional records I'm trying to 
access. So read.csv is reading million-character fields. It's slow at 
that. That mystery solved.

However, the problem persists: how to fix what is obvious to the 
naked eye - a quote not adjacent to a comma - but that read.csv can't 
handle. readLines followed by read.csv(text= ) works great because, in 
that case, read.csv knows where the record terminates. Meaning, read.csv 
throws an exception that I can catch and handle with a quick and clean 
regular expression.
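
For reference, a minimal sketch of that per-line approach. The repair pattern
is a placeholder assumption - it deletes double quotes that have a non-comma
character on both sides, i.e. quotes not adjacent to a comma - and the right
pattern would depend on the actual data:

lines <- readLines(file_name)
fix_quotes <- function(ln) gsub('(?<=[^,])"(?=[^,])', "", ln, perl = TRUE)
parse_line <- function(ln) {
  # parse one record; on failure, repair the stray quote and retry
  tryCatch(read.csv(text = ln, header = FALSE),
           error   = function(e) read.csv(text = fix_quotes(ln), header = FALSE),
           warning = function(w) read.csv(text = fix_quotes(ln), header = FALSE))
}
records <- do.call(rbind, lapply(lines[-1], parse_line))  # [-1] skips the header row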

Thanks, I'll take a look at vroom.

-dave

On 4/8/24 09:18, Stevie Pederson wrote:
> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom) 
> to be helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like 
> that it can take a bit of work resolving things. The command line 
> tools awk & sed might even be a good plan for finding lines that have 
> errors & figuring out a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
>
> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know
> that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with
> read.csv(text
> = ...) is really slow. I know that the first 2459465 records are
> good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude
> longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>


Re: [R] Building R-4.3.3 fails

2024-04-08 Thread Rich Shepard

On Mon, 8 Apr 2024, Ivan Krylov wrote:


A Web search suggests that texi2dvi may output this message by mistake
when the TeX installation is subject to a different problem:
https://web.archive.org/web/20191006123002/https://lists.gnu.org/r/bug-texinfo/2016-10/msg00036.html


Ivan,

That thread is 8 years old and may no longer apply to TeXLive2024.


Find the reshape.Rnw file and try running bin/R CMD Sweave
path/to/reshape.Rnw. It should produce reshape.tex. When you run pdflatex
reshape.tex, do you get a more useful error message?


The error occurs when building R. Since R's not installed I cannot run it to
build reshape.tex.

Thanks,

Rich



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Rui Barradas

At 19:42 on 08/04/2024, Ivan Krylov via R-help wrote:

On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon wrote:


  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)


It may or may not be important that read.csv defaults to header =
TRUE. Having skipped 2459465 lines, it may attempt to parse the next
one as a header, so the second call read.csv() should probably include
header = FALSE.



This will throw an error; call read.table with sep="," instead.
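
Something along these lines, with the same skip/nrows as before:

second_records <- read.table(file_name, sep = ",", skip = 2459465,
                             nrows = 5, header = FALSE)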




Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.



Hope this helps,

Rui Barradas





Re: [R] Building R-4.3.3 fails

2024-04-08 Thread Rich Shepard

On Mon, 8 Apr 2024, Ivan Krylov wrote:


Questions about building R do get asked here and R-devel. Since you're
compiling a released version of R and we don't have an R-SIG-Slackware
mailing list, R-help sounds like the right place.


Ivan,

Okay:


What are the last lines of the build log, containing the error message? If
it's a LaTeX error, it may be that you need some extra TeX Live packages,
there is a list in R-admin:
https://cran.r-project.org/doc/manuals/R-admin.html#Making-the-manuals


* DONE (mgcv)
make[2]: Leaving directory '/tmp/SBo/R-4.3.3/src/library/Recommended'
make[1]: Leaving directory '/tmp/SBo/R-4.3.3/src/library/Recommended'
make[1]: Entering directory '/tmp/SBo/R-4.3.3/src/library'
building/updating vignettes for package 'grid' ...
building/updating vignettes for package 'parallel' ...
building/updating vignettes for package 'utils' ...
building/updating vignettes for package 'stats' ...
processing 'reshape.Rnw'
Error: compiling TeX file 'reshape.tex' failed with message:
Running 'texi2dvi' on 'reshape.tex' failed.
Messages:
/usr/bin/texi2dvi: TeX neither supports -recorder nor outputs \openout lines in 
its log file
Execution halted
make[1]: *** [Makefile:103: vignettes] Error 1
make[1]: Leaving directory '/tmp/SBo/R-4.3.3/src/library'
make: *** [Makefile:81: vignettes] Error 2
# (running time: 12m3.057s)

I don't know why /usr/bin/texi2dvi doesn't support -recorder, nor what causes
the other error. TeXLive2023 had no issues building R-4.1.1.

Thanks,

Rich



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Ivan Krylov via R-help
On Sun, 7 Apr 2024 23:47:52 -0600, Dave Dixon wrote:

>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)

It may or may not be important that read.csv defaults to header =
TRUE. Having skipped 2459465 lines, it may attempt to parse the next
one as a header, so the second call read.csv() should probably include
header = FALSE.
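
For example, a minimal version of that call:

second_records <- read.csv(file_name, skip = 2459465, nrows = 5, header = FALSE)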

Bert's advice to try scan() is on point, though. It's likely that the
default-enabled header is not the most serious problem here.

-- 
Best regards,
Ivan



[R] Building R-4.3.3 fails

2024-04-08 Thread Rich Shepard

I've been building R versions for years with no issues. Now I'm trying to
build R-4.3.3 on Slackware64-15.0 (fully patched) with TeXLive2024 (fully
patched) installed. The error occurs building a vignette.

Is this mailing list the appropriate place to ask for help or should I post the
request on stackoverflow.com?

TIA,

Rich



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread CALUM POLWART
data.table's fread is also fast. Not sure about error handling. But I can
merge 300 csvs with a total of 0.5m lines and 50 columns in a couple of
minutes versus a lifetime with read.csv or readr::read_csv
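
A rough sketch of both uses ("csv_dir" is an assumed location for the files,
and quote/fill settings may need tuning on a malformed file):

library(data.table)
# read just the five suspect rows
second_records <- fread(file_name, skip = 2459465, nrows = 5, header = FALSE)
# merge many CSVs into one table
files  <- list.files("csv_dir", pattern = "\\.csv$", full.names = TRUE)
merged <- rbindlist(lapply(files, fread), use.names = TRUE, fill = TRUE)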



On Mon, 8 Apr 2024, 16:19 Stevie Pederson, 
wrote:

> Hi Dave,
>
> That's rather frustrating. I've found vroom (from the package vroom) to be
> helpful with large files like this.
>
> Does the following give you any better luck?
>
> vroom(file_name, delim = ",", skip = 2459465, n_max = 5)
>
> Of course, when you know you've got errors & the files are big like that it
> can take a bit of work resolving things. The command line tools awk & sed
> might even be a good plan for finding lines that have errors & figuring out
> a fix, but I certainly don't envy you.
>
> All the best
>
> Stevie
>
> On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:
>
> > Greetings,
> >
> > I have a csv file of 76 fields and about 4 million records. I know that
> > some of the records have errors - unmatched quotes, specifically.
> > Reading the file with readLines and parsing the lines with read.csv(text
> > = ...) is really slow. I know that the first 2459465 records are good.
> > So I try this:
> >
> >  > startTime <- Sys.time()
> >  > first_records <- read.csv(file_name, nrows = 2459465)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > elapsed time =   24.12598
> >
> >  > startTime <- Sys.time()
> >  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> >  > endTime <- Sys.time()
> >  > cat("elapsed time = ", endTime - startTime, "\n")
> >
> > This appears to never finish. I have been waiting over 20 minutes.
> >
> > So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> > than (nrows = 2459465) ?
> >
> > Thanks!
> >
> > -dave
> >
> > PS: readLines(n=2459470) takes 10.42731 seconds.
> >


Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Bert Gunter
No idea, but have you tried using ?scan to read those next 5 rows? It might
give you a better idea of the pathologies that are causing problems. For
example, an unmatched quote might result in some huge number of characters
trying to be read into a single element of a character variable. As your
previous respondent said, resolving such problems can be a challenge.
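
One sketch of that idea - with quoting disabled, the unmatched quote cannot
swallow the rest of the file, and each element comes back as one raw line:

suspect <- scan(file_name, what = character(), sep = "\n", quote = "",
                skip = 2459465, n = 5)
suspect                            # eyeball the five raw lines
nchar(gsub('[^"]', "", suspect))   # double-quote count per line; odd is suspect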

Cheers,
Bert



On Mon, Apr 8, 2024 at 8:06 AM Dave Dixon  wrote:

> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with read.csv(text
> = ...) is really slow. I know that the first 2459465 records are good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>


Re: [R] duplicated() on zero-column data frames returns empty

2024-04-08 Thread Jorgen Harmse via R-help
I appreciate the compliment from Ivan and still share the puzzlement at the 
empty return.
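
For reference, a minimal reproduction of the behaviour in question (a
zero-column data frame with five rows):

df0 <- data.frame(row.names = 1:5)
str(duplicated(df0))  # logi(0) - empty, rather than one value per row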

What is the policy for changing something that is wrong? There is a trade-off 
between breaking old code that worked around a problem and breaking new code 
written by people who make reasonable assumptions. Mathematically, it seems 
obvious to me that duplicated.matrix(A) should do something like this:

v <- matrix(FALSE, nrow = nrow(A) -> nr, ncol=1L) # or an ordinary vector?
if (nr > 1L) # Check because 2:0 & 2:1 do not do what we want.
{ for (i in 2:nr)
  { for (j in 1:(i-1))
    if (identical(A[i,],A[j,])) # or something more complicated to handle incomparables
    { v[i] <- TRUE; break }
  }
}
v

Of course my code is horribly inefficient, but the difference should be just in 
computing the same result faster. An empty vector of some type is identical to 
an empty vector of the same type, so this computes:

      [,1]
[1,] FALSE
[2,]  TRUE
[3,]  TRUE
[4,]  TRUE
[5,]  TRUE

I argue that that is correct.

A gap in documentation makes a change to the correct behaviour easier. (If the 
current behaviour were documented then the first step in changing the behaviour 
would be to issue a warning that the change is coming in a future version.) The 
protection for old code could be just a warning that can be turned off with a 
call to options. The new documentation should be more explicit.

Regards,
Jorgen.

From: Mark Webster 
To: Jorgen Harmse , Ivan Krylov

Cc: "r-help@r-project.org" 
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <603481690.9150754.1712522666...@mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

duplicated.matrix is an interesting one. I think a similar change would make 
sense, because it would have the dimensions that people would expect when using 
the default MARGIN = 1. However, it could be argued that it's not a needed 
change, because the Value section of its documentation only guarantees the 
dimensions of the output when using MARGIN = 0. In that case, duplicated.matrix 
does indeed return the expected 5x0 matrix for your example:

str(duplicated(matrix(0, 5, 0), MARGIN = 0)) # logi [1:5, 0 ]
Best Regards,
Mark Webster

From: Mark Webster <markwebster...@yahoo.co.uk>
To: Ivan Krylov <ikry...@disroot.org>, r-help@r-project.org
Subject: Re: [R] duplicated() on zero-column data frames returns empty vector
Message-ID: <1379736116.7985600.1712306452...@mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

Do you mean the row names imply that all the rows should be counted as 
non-duplicates? Yes, I can see the argument for that, thanks. I must say I'm 
still puzzled at what interpretation would motivate the current behaviour of 
returning a logical(0), however.

Date: Sun, 7 Apr 2024 11:00:51 +0300
From: Ivan Krylov <ikry...@disroot.org>
To: Jorgen Harmse <jhar...@roku.com>
Cc: "r-help@r-project.org" <r-help@r-project.org>, "markwebster...@yahoo.co.uk" <markwebster...@yahoo.co.uk>
Subject: Re: [R] duplicated() on zero-column data frames returns empty
Message-ID: <20240407110051.7924c03c@Tarkus>
Content-Type: text/plain; charset="utf-8"

On Fri, 5 Apr 2024 16:08:13 +
Jorgen Harmse <jhar...@roku.com> wrote:

> if duplicated really treated a row name as part of the row then
> any(duplicated(data.frame(...))) would always be FALSE. My expectation
> is that if key1 is a subset of key2 then all(duplicated(df[key1]) >=
> duplicated(df[key2])) should always be TRUE.

That's a good argument, thank you!
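
A quick check of that invariant on a toy data frame (illustrative data):

df <- data.frame(a = c(1, 1, 2), b = c(1, 2, 2))
all(duplicated(df["a"]) >= duplicated(df[c("a", "b")]))  # TRUE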

Would you suggest similar changes to duplicated.matrix too? Currently
it too returns 0-length output for 0-column inputs:

# 0-column matrix for 0-column input
str(duplicated(matrix(0, 5, 0)))
# logi[1:5, 0 ]

# 1-column matrix for 1-column input
str(duplicated(matrix(0, 5, 1)))
# logi [1:5, 1] FALSE TRUE TRUE TRUE TRUE

# a dim-1 array for >1-column input
str(duplicated(matrix(0, 5, 10)))
# logi [1:5(1d)] FALSE TRUE TRUE TRUE TRUE

--
Best regards,
Ivan






[R] Questions about ks.test function {stats}

2024-04-08 Thread Jin, Ziyan
Dear R-help,

Hope this email finds you well. My name is Ziyan, and I am a graduate student 
at Zhejiang University. My research involves ks.test in the stats package. 
Based on the code, I have two main questions. Could you provide me with some 
more information?

I downloaded different versions of the R source code from the R website 
(https://www.r-project.org/). While reading 
R-4.3.3/src/library/stats/R/ks.test.R, I ran into the following question: 
before the default psmirnov function is called (in the two-sample case), 
z <- NULL. Depending on whether TIES is TRUE or FALSE, z is then assigned 
the value of w or left as NULL. However, when psmirnov is called, z = w is 
always used. I am curious whether the TIES parameter can be omitted.
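
For context, a small two-sample call with ties (toy data), which is the case
the TIES flag concerns:

x <- c(1, 2, 2, 3)
y <- c(2, 3, 3, 4)
ks.test(x, y)  # per the change described below, 4.3.3 can report an exact
               # p-value here despite the ties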

Compared with the previous ks.test, such as the one in version 4.1.3, the 
method of calculating the p-value in version 4.3.3 has changed a lot: 
ks.test() now provides exact p-values also with ties. For the 
psmirnov_exact_uniq_upper function in the ks.c file 
(R-4.3.3/src/library/stats/src/ks.c), could you please provide some details 
on the mathematical basis used to calculate the p-value? If you could 
provide me with some references, I would be grateful.

Thank you for your patience. I am eagerly awaiting your response.

Best,
Ziyan



Re: [R] Exceptional slowness with read.csv

2024-04-08 Thread Stevie Pederson
Hi Dave,

That's rather frustrating. I've found vroom (from the package vroom) to be
helpful with large files like this.

Does the following give you any better luck?

vroom(file_name, delim = ",", skip = 2459465, n_max = 5)

Of course, when you know you've got errors & the files are big like that it
can take a bit of work resolving things. The command line tools awk & sed
might even be a good plan for finding lines that have errors & figuring out
a fix, but I certainly don't envy you.

All the best

Stevie

On Tue, 9 Apr 2024 at 00:36, Dave Dixon  wrote:

> Greetings,
>
> I have a csv file of 76 fields and about 4 million records. I know that
> some of the records have errors - unmatched quotes, specifically.
> Reading the file with readLines and parsing the lines with read.csv(text
> = ...) is really slow. I know that the first 2459465 records are good.
> So I try this:
>
>  > startTime <- Sys.time()
>  > first_records <- read.csv(file_name, nrows = 2459465)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> elapsed time =   24.12598
>
>  > startTime <- Sys.time()
>  > second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
>  > endTime <- Sys.time()
>  > cat("elapsed time = ", endTime - startTime, "\n")
>
> This appears to never finish. I have been waiting over 20 minutes.
>
> So why would (skip = 2459465, nrows = 5) take orders of magnitude longer
> than (nrows = 2459465) ?
>
> Thanks!
>
> -dave
>
> PS: readLines(n=2459470) takes 10.42731 seconds.
>


[R] Exceptional slowness with read.csv

2024-04-08 Thread Dave Dixon

Greetings,

I have a csv file of 76 fields and about 4 million records. I know that 
some of the records have errors - unmatched quotes, specifically.  
Reading the file with readLines and parsing the lines with read.csv(text 
= ...) is really slow. I know that the first 2459465 records are good. 
So I try this:


> startTime <- Sys.time()
> first_records <- read.csv(file_name, nrows = 2459465)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")

elapsed time =   24.12598

> startTime <- Sys.time()
> second_records <- read.csv(file_name, skip = 2459465, nrows = 5)
> endTime <- Sys.time()
> cat("elapsed time = ", endTime - startTime, "\n")

This appears to never finish. I have been waiting over 20 minutes.

So why would (skip = 2459465, nrows = 5) take orders of magnitude longer 
than (nrows = 2459465) ?


Thanks!

-dave

PS: readLines(n=2459470) takes 10.42731 seconds.



Re: [R] How to set the correct libomp for R

2024-04-08 Thread Ivan Krylov via R-help
On Mon, 8 Apr 2024 10:29:53 +0200, gernophil--- via R-help wrote:

> I have some weird issue with using multithreaded data.table in macOS
> and I am trying to figure out, if it’s connected to my libomp.dylib.
> I started using libomp as stated here:
> https://mac.r-project.org/openmp/

Does the behaviour change if you temporarily move away
/usr/local/lib/libomp.dylib? 

> P.S.: If you need some more details about the actual issue with
> data.table you can also check here
> (https://github.com/rstudio/rstudio/issues/14517) and here
> (https://github.com/Rdatatable/data.table/issues/5957)

The debugger may be able to shed more light on the problem than just
"yes, this is due to OpenMP":
https://github.com/rstudio/rstudio/issues/14517#issuecomment-2040231196

When you reproduce the crash, what does the backtrace say?

-- 
Best regards,
Ivan



[R] How to set the correct libomp for R

2024-04-08 Thread gernophil--- via R-help
Hey everyone,

I have some weird issue with using multithreaded data.table on macOS and I am 
trying to figure out if it's connected to my libomp.dylib. I started using 
libomp as stated here: https://mac.r-project.org/openmp/
 
Everything worked fine until the beginning of this year, but all of a sudden I 
get random fatal errors when using data.table with multithreading. I figured 
out that R (used in RStudio) is not loading this libomp.dylib (located at 
/usr/local/lib/libomp.dylib), but the one bundled with R 
(/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libomp.dylib).
I used this command to check for this:

system(paste("lsof -p", Sys.getpid(), "| grep dylib"))
 
I never checked which one was used before, but this raised some questions for 
me:
 
1. Was libomp.dylib always bundled with R and if yes, what’s the point of this 
separate libomp.dylib from this page (https://mac.r-project.org/openmp/)?
2. Is there a way to set the libomp.dylib to another path and does this even 
make sense or should I always use the one bundled with R?
3. Could it be that one of the libraries is used when packages are installed by 
Xcode's clang and the other is used while the package runs?
 
Maybe someone could shed some light onto this topic :).
 
P.S.: If you need some more details about the actual issue with data.table you 
can also check here (https://github.com/rstudio/rstudio/issues/14517) and here 
(https://github.com/Rdatatable/data.table/issues/5957). Or you can of course 
ask me, but it would be a little overkill to share everything that has been 
tried yet :).
