Re: [R] EOF within quoted string

2017-08-11 Thread William Dunlap via R-help
I tried reading it with read.table, as below, and didn't see any obvious
problems.

> txt <- readLines("http://ssc.wisc.edu/~ahanna/20_newsgroups.csv")
> str(txt)
 chr [1:11315] ",target,text" ...
> writeLines(substring(txt[1:5],1,40))
,target,text
0,9,"From: cub...@garnet.berkeley.edu (
1,4,"From: gnel...@pion.rutgers.edu (Gre
2,11,"From: crypt-comme...@math.ncsu.edu
3,4,"From:  () Subject: Re: Quadra SCSI
> z <- read.table(text=txt, sep=",", quote="\"", header=TRUE,
stringsAsFactors=FALSE)
> str(z)
'data.frame':   11314 obs. of  3 variables:
 $ X     : int  0 1 2 3 4 5 6 7 8 9 ...
 $ target: int  9 4 11 4 0 4 5 5 13 12 ...
 $ text  : chr  "From: cub...@garnet.berkeley.edu (
  ) Subject: Re: Cubs behind Marlins? How? Artic"| __truncated__ "From:
gnel...@pion.rutgers.edu (Gregory Nelson) Subject: Thanks Apple: Free
Ethernet on my C610! Article-I.D.: "| __truncated__ "From:
crypt-comme...@math.ncsu.edu Subject: Cryptography FAQ 10/10 - References
Organization: The Crypt Cabal L"| __truncated__ "From:  () Subject: Re:
Quadra SCSI Problems??? Organization: Apple Computer Inc. Lines: 28  >
ATTENTION: Mac Qu"| __truncated__ ...
> summary(z)
       X             target           text
 Min.   :    0   Min.   : 0.000   Length:11314
 1st Qu.: 2828   1st Qu.: 5.000   Class :character
 Median : 5656   Median : 9.000   Mode  :character
 Mean   : 5656   Mean   : 9.293
 3rd Qu.: 8485   3rd Qu.:14.000
 Max.   :11313   Max.   :19.000
> sum(is.na(z$text))
[1] 0
> table(substring(z$text,1,5))

 cs.u  egsn  howl  sgib  uune  wisc  wupo  zaph Distr From: Nntp- Organ Orgin
    4     1     2     1     1     1     2     4    14 10795    13   125     1
Reply Subje To: g X-Mai
    3   343     1     3

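For anyone who wants to try the same call without downloading the file, here is a minimal sketch on made-up data. The three lines in txt are invented stand-ins for the real file (the addresses and subjects are assumptions, not taken from 20_newsgroups.csv); only the read.table arguments match the transcript above.

```r
# Made-up stand-in for the first lines of the file; the addresses and
# subjects are invented for illustration.
txt <- c(
  ",target,text",
  "0,9,\"From: a@example.org Subject: Re: Cubs, Marlins? It's early\"",
  "1,4,\"From: b@example.org Subject: Thanks, Apple: Free Ethernet\""
)

# Only the double quote is treated as a quoting character, so commas and
# apostrophes inside the quoted text field stay part of that field.
z <- read.table(text = txt, sep = ",", quote = "\"", header = TRUE,
                stringsAsFactors = FALSE)
str(z)
```

The empty first header name is turned into X by read.table, which is why the first column is called X in the output above.
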



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Fri, Aug 11, 2017 at 1:58 AM, <mohan.radhakrish...@cognizant.com> wrote:

> Yes. I tried that already. Not straightforward.
>
> data <- read.csv("20_newsgroups.csv",fill=TRUE,as.is=T,header=F,
> quote="", sep=",", encoding="UTF-8")
>
> This line does read the file, but haphazardly. The emails in the text column
> are split into multiple columns, and several columns contain only ‘NA’, for
> 202 columns in total.
>
> I then removed the all-NA columns, concatenated all the text columns, and
> finally got the text.
>
> munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
> munged <- munged[-1,]
> munged$text <- apply( munged[ , c(3:ncol(munged)) ] , 1 , paste0 ,
> collapse = " ")
>
> munged <- munged[,c("V1","V2","text")]
>
> print(head(munged$text))
>
> Mohan
>
> From: Adams, Jean [mailto:jvad...@usgs.gov]
> Sent: Thursday, August 10, 2017 8:03 PM
> To: Radhakrishnan, Mohan (Cognizant) <mohan.radhakrish...@cognizant.com>
> Cc: R help <r-help@r-project.org>
> Subject: Re: [R] EOF within quoted string
>
> You might want to try some of the suggestions mentioned in this post:
> https://stackoverflow.com/q/17414776/2140956
>
> Jean
>
> On Thu, Aug 10, 2017 at 7:59 AM, <mohan.radhakrish...@cognizant.com> wrote:
> Hi,
>
> Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading
> it using
>
> data <- read.csv("20_newsgroups.csv",header=TRUE)
>
> throws this.
>
> Warning message:
> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   EOF within quoted string
>
> So, for example, the first line in the file is this. This column contains
> only such text. Is there a way to read it?
>
> From: cub...@garnet.berkeley.edu ()
> Subject: Re: Cubs behind Marlins? How? Article-I.D.: agate.1pt592$f9a
> Organization: University of California, Berkeley Lines: 12
> NNTP-Posting-Host: garnet.berkeley.edu
> gajar...@pilot.njin.net writes:  morgan
> and guzman will have era's 1 run higher than last year, and  the cubs will
> be idiots and not pitch harkey as much as hibbard.  castillo won't be good
> (i think he's a stud pitcher) This season so far, Morgan and Guzman
> helped to lead the Cubs at top in ERA, even better than THE rotation
> at Atlanta. Cubs ERA at 0.056 while Braves at 0.059. We know it is
> early in the season, we Cubs fans have learned how to enjoy the
> short triumph while it is still there.
>
> Thanks,
> Mohan

Re: [R] EOF within quoted string

2017-08-11 Thread Mohan.Radhakrishnan
Yes. I tried that already. Not straightforward.

data <- read.csv("20_newsgroups.csv",fill=TRUE,as.is=T,header=F, quote="", 
sep=",", encoding="UTF-8")

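As a rough illustration of what quote="" in the call above does to a quoted text column, the field counts for a single line can be compared with count.fields. The line below is made up (not taken from the file), and count_fields_in is a hypothetical helper introduced only for this sketch.

```r
# Made-up CSV line: the text field is double-quoted and contains two commas.
line <- "0,9,\"From: a@example.org Subject: Cubs, Marlins, and more\""

# Hypothetical helper: count the comma-separated fields in one line of text
# for a given set of quoting characters.
count_fields_in <- function(line, quote) {
  con <- textConnection(line)
  on.exit(close(con))
  count.fields(con, sep = ",", quote = quote)
}

# With quoting disabled, every comma is a separator.
n_noquote <- count_fields_in(line, quote = "")
# With the double quote honoured, the quoted text stays in one field.
n_quote <- count_fields_in(line, quote = "\"")

c(no_quote = n_noquote, quote = n_quote)
```

On this line the first count is 5 and the second is 3, so every comma inside an email adds one more column when quoting is switched off.
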
This line does read the file, but haphazardly. The emails in the text column
are split into multiple columns, and several columns contain only ‘NA’, for
202 columns in total.

I then removed the all-NA columns, concatenated all the text columns, and
finally got the text.

munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
munged <- munged[-1,]
munged$text <- apply( munged[ , c(3:ncol(munged)) ] , 1 , paste0 , collapse = " ")

munged <- munged[,c("V1","V2","text")]

print(head(munged$text))

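A small sketch of the same workaround on made-up data, to show what the rebuilt text looks like. The three input lines are invented for illustration; the steps follow the code above, with the encoding argument left out.

```r
# Made-up stand-in for the file; the real data is not reproduced here.
txt <- c(
  ",target,text",
  "0,9,\"From: a@example.org Subject: Cubs, Marlins, and more\"",
  "1,4,\"From: b@example.org Subject: Thanks Apple\""
)

# Quoting disabled: the text field is split at every comma.
data <- read.csv(text = txt, fill = TRUE, as.is = TRUE, header = FALSE,
                 quote = "", sep = ",")

# Drop columns that are entirely NA, then drop the header row.
munged <- data[, unlist(lapply(data, function(x) !all(is.na(x))))]
munged <- munged[-1, ]

# Glue the pieces of the text field back together with spaces.
munged$text <- apply(munged[, 3:ncol(munged)], 1, paste0, collapse = " ")
munged <- munged[, c("V1", "V2", "text")]

print(munged$text)
```

Two side effects are visible in this toy case: the commas that were inside the text are gone, and the literal double quotes around each email are kept as part of the string.
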
Mohan

From: Adams, Jean [mailto:jvad...@usgs.gov]
Sent: Thursday, August 10, 2017 8:03 PM
To: Radhakrishnan, Mohan (Cognizant) <mohan.radhakrish...@cognizant.com>
Cc: R help <r-help@r-project.org>
Subject: Re: [R] EOF within quoted string

You might want to try some of the suggestions mentioned in this post: 
https://stackoverflow.com/q/17414776/2140956

Jean

On Thu, Aug 10, 2017 at 7:59 AM, <mohan.radhakrish...@cognizant.com> wrote:
Hi,

Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading it using

data <- read.csv("20_newsgroups.csv",header=TRUE)

throws this.

Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  EOF within quoted string

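One common way to get the warning above is a quote character that is opened and never closed. The sketch below reproduces it on made-up data; whether this is what happens with 20_newsgroups.csv is an assumption and is not established in this thread.

```r
# Made-up data: ten simple rows, with the quote on row 8 never closed.
rows <- sprintf("%d,row %d", 1:10, 1:10)
rows[8] <- "8,\"unterminated field"
txt <- c("id,text", rows)

# Collect warning messages instead of printing them.
warnings_seen <- character()
res <- withCallingHandlers(
  read.csv(text = txt, header = TRUE),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

warnings_seen
```

Everything after the unmatched quote is read as part of one field, so rows that follow it do not appear as separate rows in res.
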
So, for example, the first line in the file is this. This column contains only
such text. Is there a way to read it?

From: cub...@garnet.berkeley.edu () Subject:
Re: Cubs behind Marlins? How? Article-I.D.: agate.1pt592$f9a Organization:
University of California, Berkeley Lines: 12 NNTP-Posting-Host:
garnet.berkeley.edu
gajar...@pilot.njin.net writes:  morgan and
guzman will have era's 1 run higher than last year, and  the cubs will be
idiots and not pitch harkey as much as hibbard.  castillo won't be good (i
think he's a stud pitcher) This season so far, Morgan and Guzman helped
to lead the Cubs at top in ERA, even better than THE rotation at
Atlanta. Cubs ERA at 0.056 while Braves at 0.059. We know it is early
in the season, we Cubs fans have learned how to enjoy the short
triumph while it is still there.

Thanks,
Mohan
This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information. 
If you are not the intended recipient(s), please reply to the sender and 
destroy all copies of the original message. Any unauthorized review, use, 
disclosure, dissemination, forwarding, printing or copying of this email, 
and/or any action taken in reliance on the contents of this e-mail is strictly 
prohibited and may be unlawful. Where permitted by applicable law, this e-mail 
and other e-mail communications sent to and from Cognizant e-mail addresses may 
be monitored.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] EOF within quoted string

2017-08-10 Thread Adams, Jean
You might want to try some of the suggestions mentioned in this post:
https://stackoverflow.com/q/17414776/2140956

Jean

On Thu, Aug 10, 2017 at 7:59 AM,  wrote:

> Hi,
>
> Reading http://ssc.wisc.edu/~ahanna/20_newsgroups.csv after downloading
> it using
>
> data <- read.csv("20_newsgroups.csv",header=TRUE)
>
> throws this.
>
> Warning message:
> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   EOF within quoted string
>
> So, for example, the first line in the file is this. This column contains
> only such text. Is there a way to read it?
>
> From: cub...@garnet.berkeley.edu () Subject: Re: Cubs behind Marlins?
> How? Article-I.D.: agate.1pt592$f9a Organization: University of California,
> Berkeley Lines: 12 NNTP-Posting-Host: garnet.berkeley.edu
> gajar...@pilot.njin.net writes:  morgan and guzman will have era's 1 run
> higher than last year, and  the cubs will be idiots and not pitch harkey as
> much as hibbard.  castillo won't be good (i think he's a stud pitcher)
> This season so far, Morgan and Guzman helped to lead the Cubs at
> top in ERA, even better than THE rotation at Atlanta. Cubs ERA at
> 0.056 while Braves at 0.059. We know it is early in the season, we
> Cubs fans have learned how to enjoy the short triumph while it is
> still there.
>
> Thanks,
> Mohan
