Re: [R] EOF within quoted string
I tried reading it with read.table, as below, and didn't see any obvious problems.

> txt <- readLines("http://ssc.wisc.edu/~ahanna/20_newsgroups.csv")
> str(txt)
 chr [1:11315] ",target,text" ...
> writeLines(substring(txt[1:5],1,40))
,target,text
0,9,"From: cub...@garnet.berkeley.edu (
1,4,"From: gnel...@pion.rutgers.edu (Gre
2,11,"From: crypt-comme...@math.ncsu.edu
3,4,"From: () Subject: Re: Quadra SCSI
> z <- read.table(text=txt, sep=",", quote="\"", header=TRUE, stringsAsFactors=FALSE)
> str(z)
'data.frame': 11314 obs. of 3 variables:
 $ X     : int 0 1 2 3 4 5 6 7 8 9 ...
 $ target: int 9 4 11 4 0 4 5 5 13 12 ...
 $ text  : chr "From: cub...@garnet.berkeley.edu ( ) Subject: Re: Cubs behind Marlins? How? Artic"| __truncated__
               "From: gnel...@pion.rutgers.edu (Gregory Nelson) Subject: Thanks Apple: Free Ethernet on my C610! Article-I.D.: "| __truncated__
               "From: crypt-comme...@math.ncsu.edu Subject: Cryptography FAQ 10/10 - References Organization: The Crypt Cabal L"| __truncated__
               "From: () Subject: Re: Quadra SCSI Problems??? Organization: Apple Computer Inc. Lines: 28 > ATTENTION: Mac Qu"| __truncated__ ...
> summary(z)
       X             target           text
 Min.   :    0   Min.   : 0.000   Length:11314
 1st Qu.: 2828   1st Qu.: 5.000   Class :character
 Median : 5656   Median : 9.000   Mode  :character
 Mean   : 5656   Mean   : 9.293
 3rd Qu.: 8485   3rd Qu.:14.000
 Max.   :11313   Max.   :19.000
> sum(is.na(z$text))
[1] 0
> table(substring(z$text,1,5))

 cs.u  egsn  howl  sgib  uune  wisc  wupo  zaph Distr From: Nntp- Organ Orgin
    4     1     2     1     1     1     2     4    14 10795    13   125     1
Reply Subje To: g X-Mai
    3   343     1     3

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Aug 11, 2017 at 1:58 AM, <mohan.radhakrish...@cognizant.com> wrote:
> Yes. I tried that already. Not straightforward.
> [...]
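For anyone who wants to try the read.table() call above without downloading the file, here is a minimal sketch on made-up data. The three lines in txt are invented for illustration (they are not rows from 20_newsgroups.csv); they only mimic its layout: an unnamed first column, an integer target, and a double-quoted text field that may contain commas and apostrophes.

```r
# Made-up stand-in for the first lines of the CSV (assumption: same layout,
# i.e. an empty first header field and a double-quoted text column).
txt <- c(",target,text",
         "0,9,\"From: a@example.org Subject: Cubs, Marlins\"",
         "1,4,\"From: b@example.org Subject: it's early\"")

# quote = "\"" makes the double quote the only quoting character, so the
# apostrophe in the second record is ordinary text and the comma inside
# the quoted field does not start a new column.
z <- read.table(text = txt, sep = ",", quote = "\"",
                header = TRUE, stringsAsFactors = FALSE)

str(z)               # the empty header field shows up as column X
sum(is.na(z$text))   # number of records whose text came back as NA
```

Whether this explains the original warning is not certain from the thread; the sketch only shows that the quoted fields parse cleanly when the double quote is the sole quoting character.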
Re: [R] EOF within quoted string
Yes. I tried that already. Not straightforward.

data <- read.csv("20_newsgroups.csv",fill=TRUE,as.is=T,header=F,
                 quote="", sep=",", encoding="UTF-8")

This line does read the file, but haphazardly. The emails in the text column are split across multiple columns, and several columns contain only ‘NA’. There are 202 columns in total.

I then removed the columns that contain only NA, concatenated all the text columns, and finally got it.

munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
munged <- munged[-1,]
munged$text <- apply( munged[ , c(3:ncol(munged)) ] , 1 , paste0 , collapse = " ")
munged <- munged[,c("V1","V2","text")]
print(head(munged$text))

Mohan

From: Adams, Jean [mailto:jvad...@usgs.gov]
Sent: Thursday, August 10, 2017 8:03 PM
To: Radhakrishnan, Mohan (Cognizant) <mohan.radhakrish...@cognizant.com>
Cc: R help <r-help@r-project.org>
Subject: Re: [R] EOF within quoted string

You might want to try some of the suggestions mentioned in this post:
https://stackoverflow.com/q/17414776/2140956

Jean

[...]

This e-mail and any files transmitted with it are for the sole use of the intended recipient(s) and may contain confidential and privileged information. If you are not the intended recipient(s), please reply to the sender and destroy all copies of the original message. Any unauthorized review, use, disclosure, dissemination, forwarding, printing or copying of this email, and/or any action taken in reliance on the contents of this e-mail is strictly prohibited and may be unlawful. Where permitted by applicable law, this e-mail and other e-mail communications sent to and from Cognizant e-mail addresses may be monitored.
[[alternative HTML version deleted]] __ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
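A note on the 202 columns: with quote="" every comma inside the message text is treated as a field separator, so the number of fields varies from line to line. count.fields() from the utils package shows this directly. The sketch below uses made-up lines rather than the real file; with the real data one would presumably pass the file name instead of a text connection.

```r
# Made-up lines (assumption: they only imitate a quoted text field that
# contains a comma; they are not taken from 20_newsgroups.csv).
txt <- c("id,text",
         "1,\"one, two\"",
         "2,\"three\"")

# Fields per line when the double quote is honoured as a quoting character.
con <- textConnection(txt)
n_quoted <- count.fields(con, sep = ",", quote = "\"")
close(con)

# Fields per line with quoting switched off, as in the read.csv() call above.
con <- textConnection(txt)
n_unquoted <- count.fields(con, sep = ",", quote = "")
close(con)

n_quoted     # every line has the same number of fields
n_unquoted   # the line with an embedded comma gains an extra field
```

Comparing the two vectors on the real file should point to the lines where quoting matters, which may be a quicker check than pasting the split columns back together.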
Re: [R] EOF within quoted string
You might want to try some of the suggestions mentioned in this post:
https://stackoverflow.com/q/17414776/2140956

Jean

On Thu, Aug 10, 2017 at 7:59 AM, <mohan.radhakrish...@cognizant.com> wrote:
> Hi,
>
> Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading
> it using
>
> data <- read.csv("20_newsgroups.csv",header=TRUE)
>
> throws this.
>
> Warning message:
> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
> EOF within quoted string
>
> So, for example, the first line in the file is this. This column contains
> only such text. Is there a way to read it?
>
> From: cub...@garnet.berkeley.edu () Subject: Re: Cubs behind Marlins?
> How? Article-I.D.: agate.1pt592$f9a Organization: University of California,
> Berkeley Lines: 12 NNTP-Posting-Host: garnet.berkeley.edu
> gajar...@pilot.njin.net writes: morgan and guzman will have era's 1 run
> higher than last year, and the cubs will be idiots and not pitch harkey as
> much as hibbard. castillo won't be good (i think he's a stud pitcher)
> This season so far, Morgan and Guzman helped to lead the Cubs at
> top in ERA, even better than THE rotation at Atlanta. Cubs ERA at
> 0.056 while Braves at 0.059. We know it is early in the season, we
> Cubs fans have learned how to enjoy the short triumph while it is
> still there.
>
> Thanks,
> Mohan