Thanks for checking this out. I am leaning towards readr::read_tsv, which
is very explicit about untoward content:
Browse[2]>
debug: tab = readr::read_tsv(tf)
Browse[2]>
Parsed with column specification:
cols(
  .default = col_character(),
  `DATE ADDED TO CATALOG` = col_date(format
Everything works fine for me with quote="":
> system.time(gwas <-
+     read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv",
+                quote=""))
   user  system elapsed
  4.435   0.052   4.487
> dim(gwas)
[1] 179364 38
> sessionInfo()
R version 4.0.0 Patched (2020-04-27 r78316)
Platform:
This file trips up fread around record 170349, inconsistently ... I haven't
figured that out yet.
readLines followed by strsplit may be the ultimate solution.
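A minimal sketch of the readLines/strsplit route, assuming the file is strictly tab-delimited with one record per line. The helper name read_tsv_raw and the toy file are hypothetical, and this is an illustration rather than what was ultimately run on the catalog. Records whose field count differs from the header are set aside and reported instead of being parsed.

```r
# Hypothetical helper: split each line on tabs only, with no quote or
# comment processing, so a stray quote cannot swallow later records.
read_tsv_raw <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # strsplit() drops a trailing empty field; appending one tab preserves it
  fields <- strsplit(paste0(lines, "\t"), "\t", fixed = TRUE)
  header <- fields[[1]]
  n <- lengths(fields)
  keep <- n == length(header)
  keep[1] <- FALSE                      # the header is not a data row
  mat <- do.call(rbind, fields[keep])   # character matrix, one row per record
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(df) <- header
  list(data = df, bad_lines = which(n != length(header)))
}

# Toy file (made up for illustration): an unclosed quote and an empty
# trailing field in record 1, and one short record.
toy <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc", "1\tx \"y\t", "2\tz", "3\tw\tv"), toy)
res <- read_tsv_raw(toy)
res$data        # complete records, all columns character
res$bad_lines   # line numbers whose field count differs from the header
```

All columns come back as character, so type conversion (for example with type.convert) would be a separate step, and the lines listed in bad_lines can be inspected by hand.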
On Thu, Apr 30, 2020 at 7:15 AM Vincent Carey
wrote:
right, line 35265 of
http://www.ebi.ac.uk/gwas/api/search/downloads/alternative has an unclosed
quote in a field.
35265 2019-04-10 30804558 Grove J 2019-02-25 Nat Genet
www.ncbi.nlm.nih.gov/pubmed/30804558 Identification of common
genetic risk variants for autism
I'd look instead at or around line 35264 for use of quotes, e.g., "3' DNA", and
set the quote argument, e.g. read.delim(quote = "") (though I never get that
right so probably wrong again...). A comment character might also be a problem.
If you point to the location of the file I could investigate.
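The quote suggestion can be tried on a small made-up file before running it on the full catalog. The sketch below assumes a tab-delimited file with an unclosed double quote in one field; the file contents and variable names are hypothetical. count.fields() is used first to check that every line has the same number of fields once quote and comment processing are switched off.

```r
# Toy tab-delimited file (made up): record 2 contains an unclosed double quote
tf <- tempfile(fileext = ".tsv")
writeLines(c(
  "id\ttitle\tgene",
  "1\tfirst study\tABC",
  "2\tan \"unclosed quote\tDEF",
  "3\tthird study\tGHI"
), tf)

# Fields per line with quote and comment handling turned off
n_fields <- count.fields(tf, sep = "\t", quote = "", comment.char = "")

# quote = "" makes read.delim() treat quote characters as ordinary text
gwas_toy <- read.delim(tf, quote = "", comment.char = "",
                       stringsAsFactors = FALSE)
dim(gwas_toy)
```

With the default quote setting, the same read.delim() call will typically warn about an end of file inside a quoted string and return fewer rows, which is one way to confirm that quoting is the culprit.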
The EBI GWAS catalog is large -- now the download is over 100MB for 179K
associations. A "bug" in the
package was reported, so I acquired the file by hand.
> nn = read.delim("gwas_catalog_v1.0.2-associations_e98_r2020-03-08.tsv",
+                 sep="\t")
Warning message:
In scan(file = file, what = what,