[R] Text Mining in R

2016-05-17 Thread Burhan ul haq
Hi,

Wishing you all well.

I am exploring text mining with R. Here is where I need help:

1. The starting point is a data frame

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")
df1 <- data.frame(id = 1:5, words = worder1)

here in dput format:

dput(df1)
structure(list(id = 1:5, words = structure(c(3L, 1L, 2L, 5L,
4L), .Label = c("are these the three samples?",
"He speaks differently to you, aint it !",
"I am, taking 2", "I saved 2500 this month.",
"This is distilled - my dear, now give me $3"
), class = "factor")), .Names = c("id", "words"), row.names = c(NA,
-5L), class = "data.frame")


2. The corpus rituals ...

corp1 <- Corpus(VectorSource(df1$words))
inspect(corp1)
class(corp1)

corp1 <- tm_map(corp1, removeNumbers)
corp1 <- tm_map(corp1, removePunctuation)
corp1 <- tm_map(corp1, removeWords, stopwords("english"))
corp1 <- tm_map(corp1, stripWhitespace)
class(corp1)


3. Getting to the analysis

tdm1 <- TermDocumentMatrix(corp1)
inspect(tdm1[1:5,])
dtm1 <- DocumentTermMatrix(corp1)
inspect(dtm1[1:5,])

4. Now here is the problem

If I apply a transformation that is not listed in getTransformations(), I can
no longer convert the corpus to a tdm or dtm:

corp1 <- tm_map(corp1, tolower)
class(corp1)
tdm1.2 <- TermDocumentMatrix(corp1)
dtm1.2 <- DocumentTermMatrix(corp1)

The error returned is:

Error: inherits(doc, "TextDocument") is not TRUE

5. The explanations on the internet suggest either

a) corp1 <- tm_map(corp1, content_transformer(tolower))
which in my case returns error:
Error in UseMethod("content", x) :
  no applicable method for 'content' applied to an object of class
"character"

b) corpus_clean <- tm_map(corp1, PlainTextDocument)
which results in loss of all the meta data

I will appreciate any help. Lastly, to keep the doc ids with the R corpus,
should step 2 be changed to:
corp1 <- Corpus(DataframeSource(df1))

from:
corp1 <- Corpus(VectorSource(df1$words))
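One possible way around the error in step 4, as a sketch under assumptions. The second error in 5(a) is what one would expect if tm_map(corp1, tolower) has already been run on the same corpus, because the documents are by then plain character vectors. Rebuilding the corpus and wrapping tolower in content_transformer() from the start should avoid both errors. The second half assumes a tm version (0.7-2 or later, as far as I know) in which DataframeSource expects columns named doc_id and text. The names corp, corp2 and df2 are made up for illustration.

```r
library(tm)

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")

# Start from a fresh corpus; do not reuse one that went through tm_map(x, tolower).
corp <- VCorpus(VectorSource(worder1))
corp <- tm_map(corp, content_transformer(tolower))  # keeps TextDocument objects
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

dtm <- DocumentTermMatrix(corp)
tdm <- TermDocumentMatrix(corp)

# Assumption: newer tm versions read document ids from a "doc_id" column.
df2 <- data.frame(doc_id = as.character(1:5), text = worder1,
                  stringsAsFactors = FALSE)
corp2 <- VCorpus(DataframeSource(df2))
meta(corp2[[1]], "id")  # the id carried over from the data frame
```

Lower-casing first also means the stop word list, which is lower case, matches words such as "I" and "He".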

Thanks /


-

Some of the references I explored:
http://stackoverflow.com/questions/25638503/tm-loses-the-metadata-when-applying-tm-map
http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
http://stackoverflow.com/questions/24771165/r-project-no-applicable-method-for-meta-applied-to-an-object-of-class-charact
http://stackoverflow.com/questions/25551514/termdocumentmatrix-errors-in-r
http://stackoverflow.com/questions/20699111/tm-map-error-message-in-r
http://stackoverflow.com/questions/31996891/error-in-usemethodmeta-x-no-applicable-method-for-meta-applied-to-an-ob
http://stackoverflow.com/questions/11876740/r-stemming-a-string-document-corpus


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Text Mining - Remove punctuation not removing quotes and dashes

2015-06-08 Thread Anindya Sankar Dey
Hi,

I have been doing some text mining. I created the DTM matrix using the
following steps.

corpus1 <- VCorpus(VectorSource(resume1$Dat1))

corpus1 <- tm_map(corpus1, content_transformer(tolower))

dtm <- DocumentTermMatrix(corpus1,
                          control = list(removePunctuation = TRUE,
                                         removeNumbers = TRUE,
                                         removeSparseTerms = TRUE,
                                         stopwords = TRUE))


After running all of this I am still getting words like -quotation, fun, model,
etc.

What can I do about it? I do not need these dashes and extra quotation marks.
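A minimal base R sketch of one way to deal with this, assuming the leftover characters are typographic ("curly") quotes and long dashes, which ASCII-oriented punctuation removal can miss. The helper name clean_punct is hypothetical. Separately, as far as I know removeSparseTerms() is a standalone function applied to a finished matrix rather than a control option, so that entry in the list above is probably ignored.

```r
# Hypothetical helper: replace every Unicode punctuation or symbol character
# with a space, then collapse runs of whitespace and trim the ends.
clean_punct <- function(x) {
  x <- gsub("[\\p{P}\\p{S}]", " ", x, perl = TRUE)
  trimws(gsub("[[:space:]]+", " ", x))
}

# Curly quotes (U+201C, U+201D) and an em dash (U+2014), written as escapes.
x <- c("\u201cquotation\u201d", "fun \u2014 model", "a-b")
cleaned <- clean_punct(x)
cleaned
```

With tm, such a helper could be applied before building the matrix via tm_map(corpus1, content_transformer(clean_punct)).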

-- 
Anindya Sankar Dey


[R] Text Mining in Non English Speaking Countries

2014-09-24 Thread ziad.elmously
Hello All,

I am interested in conducting text mining in languages other than English.  My 
understanding is that the following R packages can analyze languages other 
than English:


1.   topicmodels

2.   snowball

3.   tm

Can anyone confirm?  Specifically, I am interested in Hindi and Chinese (2 or 
so most popular dialects).  If so, can you recommend relevant documentation and 
share your experiences with these packages.

Thank you in advance.

Ziad Elmously







Re: [R] Text Mining in Non English Speaking Countries

2014-09-24 Thread Flavio Barros
I have already used them with Portuguese. No problems.


Flavio Barros



[R] Text mining

2013-01-26 Thread Steve Stephenson
Hello everybody,
I would like to perform an analysis, but I don't know how to proceed or
whether R packages are available for my purpose. Therefore I'm here
to request your support.
*The idea is the following:* I noticed that the names of towns and
villages in northern Italy usually sound different from the names of
towns in southern Italy. Just to give you an idea, Caronno
Pertusella is a village in northern Italy, while Frascati is a town in
central Italy. Most of the time I can tell where a town is located just by
hearing its name, but I cannot say why; that is, I have not found a
rule.
What I would like to do is find a classification rule/engine that is able
to locate a town starting from its name. *I think the classification
method should be based on the sequence of letters in the town's
name*. But this is just an intuition, not yet formalized!
I know that mine is a strange request and idea; any advice is very
much appreciated and welcome!
Many thanks in advance to all.

Steve
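
One common way to turn the "sequence of letters" intuition into features is character n-grams: overlapping runs of n letters taken from each name. The base R sketch below only illustrates that idea; the helper name char_ngrams and the choice n = 3 are assumptions, not something proposed in this thread.

```r
# Hypothetical helper: overlapping character n-grams of a name, lower-cased,
# with everything that is not a letter (spaces, apostrophes) removed first.
char_ngrams <- function(name, n = 3) {
  s <- tolower(gsub("[^[:alpha:]]", "", name))
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}

char_ngrams("Frascati")

# Counts of n-grams over several names could then serve as predictor columns.
towns <- c("Caronno Pertusella", "Frascati")
ngram_counts <- table(unlist(lapply(towns, char_ngrams)))
```

Any classifier that accepts count features could then be trained on such columns, with part of the town list held back for validation.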



--
View this message in context: 
http://r.789695.n4.nabble.com/Text-mining-tp4656732.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Text mining

2013-01-26 Thread Giovanni Azua
Hi Steve,

IMO this problem does not need a classifier but rather a database and a
simple query. I would just build a database with all city names including
the geo information, and then say whether it is north or south exactly. 

If there was such a rule (which I doubt) I would expect it to have many
exceptions and therefore a bunch of false-positives on both sides. Why
overcomplicate a simple problem? 

HTH,
Ciao,
Giovanni



Re: [R] Text mining

2013-01-26 Thread Steve Stephenson
Hi Giovanni,
thanks a lot for your quick reply!
I will try to answer in a few points:
1 - A database containing all the towns and the region they belong to
(North, South...) is already available on the ISTAT site (www.ISTAT.it);
2 - My goal was just to find a method supporting my idea, that is to say
that northern town names sound different from southern names;
3 - To build this method I would use the ISTAT DB, partly as a training set
and partly as a validation set;
4 - The idea was born just for fun, since I find data mining very
interesting and also challenging;
5 - I absolutely agree with you: I will find a lot of exceptions. If the
exceptions outnumber the rule (this could happen), this would
imply that my initial idea is wrong. In any case I would be satisfied,
because this would mean that I have been able to test whether an intuition is
right or wrong.

I hope this clarifies my previous post.
Many thanks and *sorry for the lack of clarity*.

Steve




--
View this message in context: 
http://r.789695.n4.nabble.com/Text-mining-tp4656732p4656738.html
Sent from the R help mailing list archive at Nabble.com.



[R] Text mining? Text manipulation? Both? Predicting KRAS test results in cancer patients

2012-09-28 Thread Paul Miller
Happy Friday Everyone,
 
Hope Friday afternoon doesn't turn out to be a terrible time to post a 
question. I've been doing a little data mining of patient text medical records 
as of late. I started out trying to predict whether or not cancer patients had 
received KRAS mutation testing and did quite well with that. Now I'm trying to 
predict the results of KRAS testing (mutated vs. wild type). This is proving to 
be a little more difficult.
 
With the first classification task, I created counts of terms (e.g., "kras", 
"mutated") in the text medical records using the tm package and then used those 
counts to predict whether or not patients had had KRAS mutation testing. I 
tried a few different analyses here, but found that random forests worked the 
best.
 
Predicting the results of testing is harder though because of the way 
physicians and other healthcare professionals write about testing. For example, 
I'm finding phrases like "KRAS mutation returned wild-type". In this example, 
if we're counting, we get 1 instance of "kras", 1 instance of "mutated", and 
one instance of "wild". So you can see how it might be difficult to accurately 
predict the results of testing based on counts alone.
 
My question is how best to deal with this. Are there any R text mining packages 
or related software that would be particularly suited to my problem? I took a 
look at the CRAN Task View: Natural Language Processing and there were so many 
options I didn't really know where to start (and it's not even clear that an 
R-based solution will work best for my problem). Alternatively, is there any 
real chance one could simply write code that would be able to identify true 
references to the results of KRAS testing and then create counts only of what 
are likely to be true references?
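
As a rough illustration of the "write code that identifies true references" idea, the base R sketch below looks for a result-bearing phrase in the same sentence as "kras" instead of counting single words. The function name kras_result and both patterns are assumptions made up for this example; they have not been checked against real clinical notes and would need negation handling and many more phrasings before being useful.

```r
# Hypothetical rule-based labeller. A note is called "wild-type" or "mutated"
# only when exactly one kind of result phrase follows "kras" within a sentence.
kras_result <- function(text) {
  s <- tolower(text)
  is_wild <- grepl("kras[^.]*wild[- ]?type", s)
  is_mut  <- grepl("kras[^.]*(mutation (detected|positive|present)|mutated|mutant)", s)
  ifelse(is_wild & !is_mut, "wild-type",
         ifelse(is_mut & !is_wild, "mutated", "unclear"))
}

notes <- c("KRAS mutation returned wild-type.",
           "KRAS was found to be mutated.",
           "No KRAS testing has been ordered.")
labels <- kras_result(notes)
labels
```

Labels produced this way could be added as extra predictors next to the term counts rather than replacing them.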
 
I'd greatly appreciate it if someone could point me in the right direction.
 
Thanks,
 
Paul 
 
 



[R] Text mining: Narrowing a field of 27,855 predictors using semi-partial correlations or some other means

2012-04-18 Thread Paul Miller
Hello Everyone,

Trying to learn a little bit about data mining. I'm working on a text mining 
project that will attempt to predict whether cancer patients got a particular 
type of genetic testing. A subsequent stage then will be aimed at predicting 
what the results of that testing were. 
 
I've used the tm package to prepare my data and am planning to use rattle to do 
the actual data mining. The tm package has proved to be a great help so far. 
I've managed to perform a variety of transformations of my data. I've also 
managed to create a document-term matrix that has a row for each of my patients 
and columns for each of the terms in my patient medical records. 
 
Because I'm not yet a particularly good R programmer, I've converted my 
document-term matrix to a data frame and then added information about the 
genetic testing. 
 
So here's the thing. The tm package has a feature that would allow me to drop 
words that occur infrequently in patient medical records. However, I've been 
asked not to use it because it's believed that even infrequently occurring 
terms may be highly diagnostic. The consequence is that my data frame has a 
large number of columns for the various words. In fact, over 27,000 of them.
 
So my question is how to reduce this to some more manageable number. One 
thought has been to look at semi-partial correlations. Here these would be 
between tested(y/n) and each predictor, controlling for length of medical 
record. The idea would be to use only those predictors that were significant in 
the actual data mining.
 
Is this likely to be a good approach? Or is there likely to be a better way of 
doing it?
 
If it is a good approach, I’m wondering how to go about obtaining the necessary 
results. I’ve managed to figure out how to compute semi-partial correlations 
using the spcor.test() function in the ppcor package, as in:
 
 spcor.test(as.numeric(Tested$TestStatus == "Yes"), Tested$predictor,
            Tested$nchar_record)
 
   estimate      p.value statistic   n gp  Method
1 0.3853547 2.307562e-08  5.587203 182  1 pearson
 
This is fine for a single pair of variables. What I’d need though is to combine 
a whole series of such outputs, one for each of my predictors. After that, I’d 
need to be able to determine which semi-partial correlations were significant 
(or perhaps substantial) and to create a list that I could use to eliminate a 
lot of the predictors from my data frame. I’m just beginning to use R in my 
day-to-day work. So it’s not clear to me how to do this. 
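
A minimal sketch of how such a loop might look, written in base R so that it does not depend on ppcor. It assumes the semi-partial correlation of interest is the correlation between the outcome and each predictor after the control variable has been regressed out of the predictor, with a t statistic on n - 3 degrees of freedom for one control variable (this appears consistent with the statistic in the output above). The function name semipartial_screen and the toy data are hypothetical.

```r
# Hypothetical helper: one row of results per predictor column in X.
# y: numeric outcome (0/1), X: data frame of predictors, z: control variable.
semipartial_screen <- function(y, X, z) {
  n <- length(y)
  res <- t(sapply(X, function(x) {
    x_res <- resid(lm(x ~ z))              # predictor with z removed
    r <- suppressWarnings(cor(y, x_res))   # NA for constant columns
    stat <- r * sqrt((n - 3) / (1 - r^2))
    c(estimate = r, statistic = stat,
      p.value = 2 * pt(-abs(stat), df = n - 3))
  }))
  out <- data.frame(term = names(X), res, row.names = NULL)
  # With thousands of tests, some correction for multiple testing is advisable.
  out$p.adjusted <- p.adjust(out$p.value, method = "BH")
  out[order(out$p.value), ]
}

# Toy data standing in for the real document-term data frame.
set.seed(1)
n <- 182
z <- rnorm(n, mean = 5000, sd = 1000)               # record length
X <- data.frame(kras = rpois(n, 2), noise = rpois(n, 2))
y <- as.numeric(X$kras + rnorm(n) > 2)              # tested yes/no

screen <- semipartial_screen(y, X, z)
keep <- as.character(screen$term[which(screen$p.adjusted < 0.05)])
```

The vector keep can then be used to subset the columns of the full data frame. Whether a significance filter is a good selection method at all is a separate question that this sketch does not settle.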
 
Thanks,
 
Paul

 



Re: [R] Text Mining with Facebook Reviews (XML and FQL)

2011-10-11 Thread Duncan Temple Lang

Hi Kenneth

  First off, you probably don't need to use xmlParseDoc(), but rather
  xmlParse().  (Both are fine, but xmlParseDoc() allows you to control many of
  the options in the libxml2 parser, which you don't need here.)

  xmlParse() has some capabilities to fetch the content of URLs. However,
  it cannot deal with HTTPS requests, which this call to Facebook is.
  The approach to this is to
     i) make the request
    ii) parse the resulting string via xmlParse(txt, asText = TRUE)

 As for i), there are several ways to do this, but the RCurl
 package allows you to do it entirely within R and gives you
 more control over the request than you would ever want.

   library(RCurl)
   txt = getForm('https://api.facebook.com/method/fql.query', query = QUERY)

   mydata.xml = xmlParse(txt, asText = TRUE)

However, you are most likely going to have to login / get a token
before you make this request. And then, if you are using RCurl,
you will want to use the same curl object with the token or cookies, etc.

D.



[R] Text Mining with Facebook Reviews (XML and FQL)

2011-10-10 Thread Kenneth Zhang
Hello,

I am trying to use XML package to download Facebook reviews in the following
way:

require(XML)
mydata.vectors <- character(0)
Qword <- URLencode('#IBM')
QUERY <- paste('SELECT review_id, message, rating from review where message LIKE %',
               Qword, '%', sep = '')
Facebook_url <- paste('https://api.facebook.com/method/fql.query?query=',
                      QUERY, sep = '')
mydata.xml <- xmlParseDoc(Facebook_url, asText = FALSE)
mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue,
                             namespaces = c('s' = 'http://www.w3.org/2005/Atom'))

mydata.xml is NULL, so no further steps can be executed. I am not very
familiar with XML or FQL. Any suggestions will be appreciated. Thank you!

Best regards,
Kenneth



[R] Text mining analysis methods

2011-08-28 Thread mpavlic
Hi all, 

I am trying to do some text mining in R. So far I have managed to build
TermDocumentMatrices, do word counts, and similar things.

What I would like to do is this:
I have soil descriptions from borehole logs that correspond to soil
classes. The problem is that some of the descriptions are assigned to the
wrong class. I have made a DocumentTermMatrix for each of the classes. Now I
would like to use some kind of statistical method to determine which of
the classes each description actually belongs to.
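
One simple possibility, sketched in base R under assumptions: compare each description with the pooled term counts of every class (leaving the description itself out) and suggest the class with the highest cosine similarity. The toy data, the column names and the function names below are all made up for illustration, and this is only one of many methods that could be tried.

```r
# Toy data: descriptions with their recorded class; the last label is suspect.
logs <- data.frame(
  text  = c("grey silty clay", "stiff brown clay", "coarse sandy gravel",
            "dense gravel with sand", "soft plastic clay"),
  class = c("clay", "clay", "gravel", "gravel", "gravel"),
  stringsAsFactors = FALSE
)

# Document-term matrix of raw counts (rows = descriptions, columns = terms).
tokens <- strsplit(tolower(logs$text), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm    <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Suggest a class for description i from the other descriptions only.
predict_class <- function(i) {
  classes <- unique(logs$class)
  sims <- sapply(classes, function(cl) {
    rows <- setdiff(which(logs$class == cl), i)
    cosine(dtm[i, ], colSums(dtm[rows, , drop = FALSE]))
  })
  classes[which.max(sims)]
}

logs$suggested <- sapply(seq_len(nrow(logs)), predict_class)
logs[logs$class != logs$suggested, ]   # candidates for a wrong label
```

A class left with no other descriptions gives an undefined similarity (NaN), which which.max() skips, so very small classes need separate care.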

Hope that explains it. Any help or info would be greatly appreciated,

matevz

--
View this message in context: 
http://r.789695.n4.nabble.com/Text-mining-analysis-methods-tp3774283p3774283.html
Sent from the R help mailing list archive at Nabble.com.



[R] text mining

2011-05-30 Thread rgui
Hi,

I have a problem when indexing the corpus. I used the following syntax:

 Setwd (c :/)
 Library (tm)
 Txt = Corpus (DirSource (.); readerControl = list (language = frensh))

an error message comes:

 Warning messages:
1: In readLines(y, encoding = x$Encoding) :
  incomplete final line found on './n3.txt'
2: In readLines(y, encoding = x$Encoding) :
  incomplete final line found on './n32.

Another question:
how can I read different document types (.pdf, .html, ...) using the
tm package?

Thank you very much for your help.



--
View this message in context: 
http://r.789695.n4.nabble.com/text-mining-tp3560367p3560367.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] text mining

2011-05-30 Thread Duncan Murdoch

On 30/05/2011 6:17 AM, rgui wrote:

> Hi,
>
> I have a problem when indexing the corpus. I used the following syntax:
>
>   Setwd (c :/)
>   Library (tm)
>   Txt = Corpus (DirSource (.); readerControl = list (language = frensh))

Capitalization is important in R, so when asking a question, please cut 
and paste what you actually did.  In this case, it doesn't matter.



> an error message comes:
>
>   Warning messages:
> 1: In readLines(y, encoding = x$Encoding) :
>    incomplete final line found on './n3.txt'
> 2: In readLines(y, encoding = x$Encoding) :
>    incomplete final line found on './n32.


Those are warnings, not errors.   readLines gives those warnings when 
the last line of the file stops abruptly, rather than having an end of 
line marker.  On Unix systems this usually signals a problem with the 
file.  Windows is more tolerant, so many editors don't bother to add the 
final marker.
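
A small base R sketch that reproduces the situation described above, assuming nothing beyond a temporary file: the text is still read in full, and the warning only reports the missing end-of-line marker. The file contents are made up for illustration.

```r
# Write a file whose last line has no trailing newline.
path <- tempfile(fileext = ".txt")
cat("first line\nlast line, no newline at the end", file = path)

# Capture the warning text instead of printing it.
msg <- tryCatch(readLines(path), warning = function(w) conditionMessage(w))

# The content is read completely either way.
lines <- suppressWarnings(readLines(path))

# warn = FALSE switches off this particular check.
same <- readLines(path, warn = FALSE)
```

So the corpus built in the original post should contain the full text of each file despite the messages.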

> another question:
> how can I read different document types (.pdf, .html, ...) using the
> tm package?


I think you need to convert them to text first (by some tool outside of 
R), but I might be wrong.


Duncan Murdoch




Re: [R] text mining

2011-05-27 Thread rgui
Thank you very much.

--
View this message in context: 
http://r.789695.n4.nabble.com/text-mining-tp3552221p3554849.html
Sent from the R help mailing list archive at Nabble.com.



[R] text mining

2011-05-26 Thread rgui
Hi,

How can I import a .txt document using the tm package? What is the
command, given that my document is not stored in the tm package's
library directory?

thanks.

--
View this message in context: 
http://r.789695.n4.nabble.com/text-mining-tp3552221p3552221.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] text mining

2011-05-26 Thread Matevž Pavlič
Hi,

I do it like this:

setwd("C:/Users/mpavlic/Desktop/Temp")
library(tm)

tekst <- Corpus(DirSource("."), readerControl = list(language = "ansi"))

where the *.txt files are stored in a folder Temp on my desktop,

HTH, m



Re: [R] text mining analysis and word visualization of pdfs

2011-05-19 Thread Mike Marchywka

Date: Wed, 18 May 2011 15:24:49 +0530
From: ashimkap...@gmail.com
To: k...@huftis.org
CC: r-h...@stat.math.ethz.ch
Subject: Re: [R] text mining analysis and word visualization of pdfs


> On Wed, May 18, 2011 at 1:44 PM, Karl Ove Hufthammer wrote:
>
>> Ajay Ohri wrote:
>>
>>> What is the appropriate software package for dumping say 20 PDFS in a
>>> folder, then creating data visualization with frequency counts of
>>> certain words as well as measure correlation within each file for
>>> certain key relationships or key words.
>>
>> pdftotext + Unix™ for Poets + R (ggplot2)
>
> What about the tm package? I am a beginner and I don't know much about
> this but I recall that it does have the ability to handle PDFs. A few words
> from the experts would be nice.

I don't know if I'm an expert (I can't even get a browser that echoes
keystrokes in a reasonable time with a 4-core CPU on Windows), but PDF
could mean just about anything in terms of how text is represented. Whatever
R packages do, they will not be able to read the mind of the author.
Even with pdftotext there are many options, and even simple things like
US IRS instruction forms can be almost impossible to extract in a coherent
manner. Many authors couldn't care less about the information as long as the
thing looks like a paper copy. If you are stuck with PDF, I'd be looking
for more tools first, as you will probably want to know how the files are
constructed.

I would just reiterate that the best approach for many data analysts would
be to contact the data source and explain the problems with improperly authored
PDFs or other specialized file formats that are only supported by limited
proprietary tools or that obfuscate the information of interest.


  









  


[R] text mining analysis and word visualization of pdfs

2011-05-18 Thread Ajay Ohri
Dear Lists,

What is the appropriate software package for dumping say 20 PDFS in a
folder, then creating data visualization with frequency counts of
certain words as well as measure correlation within each file for
certain key relationships or key words.

I am doing text analysis of biases in enterprise software sponsored
publications- and need to come up with a statistical threshold.

Regards,

Ajay Ohri

Websites-
http://decisionstats.com



Re: [R] text mining analysis and word visualization of pdfs

2011-05-18 Thread Karl Ove Hufthammer
Ajay Ohri wrote:

> What is the appropriate software package for dumping say 20 PDFS in a
> folder, then creating data visualization with frequency counts of
> certain words as well as measure correlation within each file for
> certain key relationships or key words.

pdftotext + Unix™ for Poets + R (ggplot2)

HTH.

-- 
Karl Ove Hufthammer



Re: [R] text mining analysis and word visualization of pdfs

2011-05-18 Thread Ashim Kapoor
On Wed, May 18, 2011 at 1:44 PM, Karl Ove Hufthammer k...@huftis.org wrote:

> Ajay Ohri wrote:
>
>> What is the appropriate software package for dumping say 20 PDFS in a
>> folder, then creating data visualization with frequency counts of
>> certain words as well as measure correlation within each file for
>> certain key relationships or key words.
>
> pdftotext + Unix™ for Poets + R (ggplot2)

What about the tm package? I am a beginner and I don't know much about
this, but I recall that it does have the ability to handle PDFs. A few words
from the experts would be nice.
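
For reference, a hedged sketch of how the tm route might look. It assumes the tm package plus a PDF engine that its readPDF reader can use are installed; the wrapper name pdf_term_matrix and the folder path are hypothetical, and the search term in the commented usage is only an example.

```r
# Build a document-term matrix from all PDFs in a folder using tm.
# Nothing from tm is evaluated until the function is actually called.
pdf_term_matrix <- function(pdf_dir) {
  corpus <- tm::VCorpus(
    tm::DirSource(pdf_dir, pattern = "\\.pdf$"),
    readerControl = list(reader = tm::readPDF)
  )
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
  tm::DocumentTermMatrix(corpus)
}

# Hypothetical usage:
# dtm <- pdf_term_matrix("path/to/pdfs")
# tm::findFreqTerms(dtm, lowfreq = 10)              # terms occurring at least 10 times
# tm::findAssocs(dtm, "software", corlimit = 0.5)   # terms correlated with "software"
```

Note that findAssocs measures correlation of term counts across documents, so it addresses associations over the whole folder rather than within a single file.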



[R] text mining problem using TM package

2011-05-18 Thread Andy Adamiec
Hi, I’m using R (TM package) for text mining and I’m having problems
filtering articles out of my data set by local meta data.



Here is the code:

data <- ("C:/… /19970331")

rs <- ReutersSource(data, encoding = "UTF-8")
RC <- VCorpus(DirSource(data),
              readerControl = list(reader = readRCV1asPlain,
                                   language = "en_US",
                                   load = TRUE),
              dbControl = list(useDb = TRUE,
                               dbName = "texts.db",
                               dbType = "DB1"))

tm_index(RC, FUN = sFilter, doclevel = F, useMeta = T, "Topics == 'MCAT'")



When I use  sFilter, I can only filter fields in yellow, I want to filter
fields in red, what am I doing wrong?



Thanks, Andy

This is the meta data that is attached to each article:

Available meta data pairs are:
  Author       :
  DateTimeStamp: 1997-03-31
  Description  :
  Heading      : USA: WHX begins tender offer for Dynamics Corp.
  ID           : 476871
  Language     : en_US
  Origin       : Reuters Corpus Volume 1
User-defined local meta data pairs are:
$Publisher
[1] "Reuters Holdings Plc"

$Topics
[1] "C18"  "C181" "CCAT"

$Industries
[1] "I22100" "I34000"

$Countries
[1] "USA"
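
One point the listing makes visible is that Topics holds several codes per article, so a test for a single code is a membership test rather than an equality test. The base R sketch below illustrates that idea on toy stand-ins for documents; it does not use tm's API. The first entry mirrors the listing above, while the second entry and the names docs and has_topic are made up for illustration. With tm itself, a similar predicate could presumably be passed to tm_filter(), reading the field with meta(); whether the tag is stored under exactly this name is an assumption.

```r
# Toy documents carrying local metadata. The second entry is invented
# purely so that the filter has something to match.
docs <- list(
  list(id = "476871",
       meta = list(Topics = c("C18", "C181", "CCAT"), Countries = "USA")),
  list(id = "doc-2",
       meta = list(Topics = c("M14", "MCAT"), Countries = "UK"))
)

# TRUE when the document's Topics vector contains the given code.
has_topic <- function(doc, topic) topic %in% doc$meta$Topics

keep <- vapply(docs, has_topic, logical(1), topic = "MCAT")
selected_ids <- vapply(docs[keep], function(d) d$id, character(1))
print(selected_ids)
```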



[R] Text Mining in R

2009-10-10 Thread Axel Urbiz
Dear R users,

I'm new to text mining applications and have just started to look into the tm
package. If any of you has experience with this package, I'd appreciate it
if you could share your thoughts on it. Also, what's the best way to
store large amounts of text data with limited RAM when using this package?

Thanks in advance for your help.

Axel.
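
One option that may help with the RAM question is tm's permanent corpus, PCorpus, which keeps documents in an on-disk database instead of in memory. The sketch below assumes the tm and filehash packages are installed; the wrapper name build_disk_corpus, the folder argument, and the database file name are hypothetical.

```r
# Create a corpus whose documents live in a filehash database on disk.
# `text_dir` is a hypothetical folder of plain-text files.
build_disk_corpus <- function(text_dir, db_file = "texts.db") {
  tm::PCorpus(
    tm::DirSource(text_dir),
    readerControl = list(reader = tm::readPlain, language = "en"),
    dbControl = list(dbName = db_file, dbType = "DB1")
  )
}

# Hypothetical usage:
# corpus <- build_disk_corpus("path/to/texts")
```

Whether this is enough depends on the later steps: a term-document matrix is stored sparsely but is still built in memory.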



[R] text mining

2009-10-02 Thread PDXRugger

The following code is derived from a paper titled "Text Mining Infrastructure
in R" (http://www.jstatsoft.org/v25/i05/paper).  The example below seems to
load some default documents for analysis, some sort of Latin document.  I
cannot for the life of me figure out how to load my own document, let alone an
entire corpus.  I have searched the above document as well as related
documentation.  Any leads or help would be appreciated.  Thanks everyone.

From the document:

txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt),
                readerControl = list(reader = readPlain,
                                     language = "la",
                                     load = TRUE)))

My attempt:

txt <- system.file("Speeches/speech", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt),
                readerControl = list(reader = readPlain,
                                     language = "la",
                                     load = TRUE)))


-- 
View this message in context: 
http://www.nabble.com/text-mining-tp25717142p25717142.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] text mining

2009-10-02 Thread Corey Dow-Hygelund
Your problem lies in the use of system.file.  This function looks inside the
installed folder of a package (here tm) for the named subfolders.  See
?system.file.

Basically, for the document example, it is assigning to txt a directory
string like "C:/Program Files (x86)/R/R-2.9.0/library/tm/texts/txt".
Then DirSource(txt) constructs a directory source from the directory string
txt.
Finally, Corpus constructs a tm corpus from the DirSource object (with some
extra arguments to boot).

So, to solve your problem, replace txt with the directory containing your
files:

txt <- "C:/location to folder/docs"

and then run the subsequent command:

ovid <- Corpus(DirSource(txt),
               readerControl = list(reader = readPlain,
                                    language = "la",
                                    load = TRUE))

(though you may want to change the object name ovid to something more
descriptive)
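
To make the difference concrete, here is a small sketch. It shows that system.file returns an empty string for a folder that does not exist inside a package, and that an ordinary directory path is what DirSource needs. The temporary folder, the file names, and the sample text are made up for illustration, and the tm step only runs if tm is installed; VCorpus is used here as the current constructor for an in-memory corpus.

```r
# system.file only searches inside installed packages, so a folder that is
# not shipped with the package yields an empty string.
not_found <- system.file("Speeches/speech", "txt", package = "base")
print(not_found)

# Create a throwaway folder holding two plain-text documents.
docs_dir <- file.path(tempdir(), "speeches")
dir.create(docs_dir, showWarnings = FALSE)
writeLines("first sample document", file.path(docs_dir, "speech1.txt"))
writeLines("second sample document", file.path(docs_dir, "speech2.txt"))
files <- list.files(docs_dir, pattern = "\\.txt$")

# Pass the directory path itself to DirSource (skipped if tm is missing).
if (requireNamespace("tm", quietly = TRUE)) {
  speeches <- tm::VCorpus(
    tm::DirSource(docs_dir),
    readerControl = list(reader = tm::readPlain, language = "en")
  )
  print(speeches)
}
```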

C



[R] text mining in italian

2009-05-05 Thread Laura Arsanto



Hello everybody,

I'm trying to do text mining on a series of texts in Italian.

I would like to know if it is possible to find Italian synonyms and/or whether
something like the WordNet database for English also exists for Italian.

Thank you very much in advance.

Regards,

Laura



Re: [R] Text Mining

2009-02-05 Thread spiketide

hi everyone...

i am a newbie to text mining, and i gotta do my project in it. i've
looked up various info online but still haven't got an idea of where to
start, so please, if anyone could give suggestions on this, it would be really
helpful...

thanks a lot in advance

-- 
View this message in context: 
http://www.nabble.com/Text-Mining-tp11467848p21864801.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Text Mining

2009-02-05 Thread cruz
There is an interesting article, "An Introduction to Text Mining in R"
by Ingo Feinerer, in R News Volume 8/2, October 2008
(http://www.r-project.org/doc/Rnews/Rnews_2008-2.pdf).

Check it out

