[R] Text Mining in R
Hi,

Wishing you all well. I am exploring text mining with R. Here is where I need help:

1. The starting point is a data frame:

    worder1 <- c("I am, taking 2", "are these the three samples?",
                 "He speaks differently to you, aint it !",
                 "This is distilled - my dear, now give me $3",
                 "I saved 2500 this month.")
    df1 <- data.frame(id = 1:5, words = worder1)

Here it is in dput format:

    dput(df1)
    structure(list(id = 1:5, words = structure(c(3L, 1L, 2L, 5L, 4L),
        .Label = c("are these the three samples?",
                   "He speaks differently to you, aint it !",
                   "I am, taking 2", "I saved 2500 this month.",
                   "This is distilled - my dear, now give me $3"),
        class = "factor")), .Names = c("id", "words"),
        row.names = c(NA, -5L), class = "data.frame")

2. The corpus rituals:

    library(tm)
    corp1 <- Corpus(VectorSource(df1$words))
    inspect(corp1)
    class(corp1)
    corp1 <- tm_map(corp1, removeNumbers)
    corp1 <- tm_map(corp1, removePunctuation)
    corp1 <- tm_map(corp1, removeWords, stopwords("english"))
    corp1 <- tm_map(corp1, stripWhitespace)
    class(corp1)

3. Getting to the analysis:

    tdm1 <- TermDocumentMatrix(corp1)
    inspect(tdm1[1:5,])
    dtm1 <- DocumentTermMatrix(corp1)
    inspect(dtm1[1:5,])

4. Now here is the problem. If I apply a transformation that is not in getTransformations(), I am unable to convert the corpus to a tdm or dtm:

    corp1 <- tm_map(corp1, tolower)
    class(corp1)
    tdm1.2 <- TermDocumentMatrix(corp1)
    dtm1.2 <- DocumentTermMatrix(corp1)

The error returned is:

    Error: inherits(doc, "TextDocument") is not TRUE

5. The explanations on the internet suggest either

a)

    corp1 <- tm_map(corp1, content_transformer(tolower))

which in my case returns the error:

    Error in UseMethod("content", x) :
      no applicable method for 'content' applied to an object of class "character"

b)

    corpus_clean <- tm_map(corp1, PlainTextDocument)

which results in the loss of all the meta data.

I will appreciate any help.

Lastly, to keep the doc ids with the R corpus, should step 2 be changed to

    corp1 <- Corpus(DataframeSource(df1))

from

    corp1 <- Corpus(VectorSource(df1$words))

Thanks

Some of the references I explored:

http://stackoverflow.com/questions/25638503/tm-loses-the-metadata-when-applying-tm-map
http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
http://stackoverflow.com/questions/24771165/r-project-no-applicable-method-for-meta-applied-to-an-object-of-class-charact
http://stackoverflow.com/questions/25551514/termdocumentmatrix-errors-in-r
http://stackoverflow.com/questions/20699111/tm-map-error-message-in-r
http://stackoverflow.com/questions/31996891/error-in-usemethodmeta-x-no-applicable-method-for-meta-applied-to-an-ob
http://stackoverflow.com/questions/11876740/r-stemming-a-string-document-corpus

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
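For what it is worth, here is a minimal sketch of one way around the error in step 4, assuming a recent tm (0.7 or later; not checked against the poster's version). The likely cause is that a bare tm_map(corp1, tolower) replaces each document with a plain character string, after which content_transformer() has nothing document-like to work on. Starting from a fresh corpus and wrapping tolower in content_transformer() should keep the documents intact. The doc_id and text column names are what newer tm versions expect from DataframeSource; they are an assumption here, not part of the original post.

```r
library(tm)

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")

# Assumption: tm >= 0.7, where DataframeSource wants columns doc_id and text
df1 <- data.frame(doc_id = as.character(1:5), text = worder1,
                  stringsAsFactors = FALSE)

# Build the corpus from scratch, so no document has been turned into a bare string
corp1 <- VCorpus(DataframeSource(df1))

# Wrap base functions in content_transformer() so documents stay TextDocuments
corp1 <- tm_map(corp1, content_transformer(tolower))
corp1 <- tm_map(corp1, removeNumbers)
corp1 <- tm_map(corp1, removePunctuation)
corp1 <- tm_map(corp1, removeWords, stopwords("english"))
corp1 <- tm_map(corp1, stripWhitespace)

dtm1 <- DocumentTermMatrix(corp1)
inspect(dtm1)
```

If this behaves as expected, the row names of dtm1 should carry the ids from doc_id, which would also answer the question about keeping document ids.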
[R] Text Mining - Remove punctuation not removing quotes and dashes
Hi,

I have been doing some text mining. I created the document-term matrix using the following steps:

    corpus1 <- VCorpus(VectorSource(resume1$Dat1))
    corpus1 <- tm_map(corpus1, content_transformer(tolower))
    dtm <- DocumentTermMatrix(corpus1,
                              control = list(removePunctuation = TRUE,
                                             removeNumbers = TRUE,
                                             removeSparseTerms = TRUE,
                                             stopwords = TRUE))

After running all of this I am still getting words like -quotation, fun, model, etc. What can I do about it? I do not need these dashes and extra quotation marks.

--
Anindya Sankar Dey
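One possible explanation, offered as a guess since the data were not posted: the leftover quotes and dashes may be typographic characters (curly quotes, en dashes) that fall outside the ASCII punctuation class. A minimal base R sketch that strips Unicode punctuation before the text reaches tm is shown below; the sample strings are made up for illustration. Two further hedged notes: recent tm versions appear to accept removePunctuation(x, ucp = TRUE) for the same purpose, and removeSparseTerms() is, as far as I know, a function applied to a finished matrix rather than a control option.

```r
# Hypothetical sample strings containing curly quotes and an en dash
x <- c("\u201cquotation\u201d", "fun \u2013 model", "\u2018quoted\u2019 text")

strip_unicode_punct <- function(txt) {
  txt <- enc2utf8(txt)
  # \p{P} matches any Unicode punctuation character when perl = TRUE
  # (currency and math symbols are \p{S} and are left alone here)
  txt <- gsub("\\p{P}+", " ", txt, perl = TRUE)
  # collapse the runs of whitespace left behind and trim the ends
  trimws(gsub("[[:space:]]+", " ", txt))
}

cleaned <- strip_unicode_punct(x)
print(cleaned)
```

The cleaned vector could then be passed to VCorpus(VectorSource(cleaned)) as in the original post.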
[R] Text Mining in Non English Speaking Countries
Hello All,

I am interested in conducting text mining in languages other than English. My understanding is that the following R packages can analyze languages other than English:

1. topicmodels
2. snowball
3. tm

Can anyone confirm? Specifically, I am interested in Hindi and Chinese (the 2 or so most popular dialects). If so, can you recommend relevant documentation and share your experiences with these packages?

Thank you in advance.

Ziad Elmously
Re: [R] Text Mining in Non English Speaking Countries
I have already used them with Portuguese. No problems.

Flavio Barros
www.flaviobarros.net

On Wed, Sep 24, 2014 at 8:30 AM, ziad.elmou...@tnsglobal.com wrote:
> I am interested in conducting text mining in languages other than English.
> [...]
[R] Text mining
Hello everybody,

I would like to perform an analysis, but I don't know how to proceed or whether R packages are available for my purpose. Therefore I'm here to request your support.

*The idea is the following:*
I noticed that the names of towns and villages in northern Italy most of the time sound different from the names of towns in southern Italy. Just to give you an idea, Caronno Pertusella is a village in northern Italy while Frascati is a town in central Italy. Most of the time I am able to recognize where a town is located just by hearing its name, but I cannot say why; that is to say, I haven't found a rule.

What I would like to do is find a classification rule/engine that is able to locate a town starting from its name.
*I think the classification method should be based on the sequence of letters in the town's name*. But this is just an intuition, not yet formalized!

I know that mine is a strange request and idea; anyway, advice is very much appreciated and welcome!

Many thanks in advance to all.

Steve
Re: [R] Text mining
Hi Steve,

IMO this problem does not need a classifier but rather a database and a simple query. I would just build a database with all city names including the geo information, and then say whether it is north or south exactly.

If there was such a rule (which I doubt) I would expect it to have many exceptions and therefore a bunch of false-positives on both sides. Why overcomplicate a simple problem?

HTH,
Ciao,
Giovanni

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Steve Stephenson
Sent: Saturday, January 26, 2013 10:08 PM
To: r-help@r-project.org
Subject: [R] Text mining

> What I would like to do is to find a classification rule/engine that is able
> to locate the city starting from its name.
> [...]
Re: [R] Text mining
Hi Giovanni,

thanks a lot for your quick reply! I will try to answer you in a few points:

1 - A database containing all the towns and the region they belong to (North, South...) is already available on the ISTAT site (www.ISTAT.it);
2 - My goal was just to find a method supporting my idea, that is to say that northern town names sound different from southern names;
3 - To build this method I would use the ISTAT database, partially as a training set and partially as a validation set;
4 - The idea was born just for fun, since I find data mining very interesting and also challenging;
5 - I absolutely agree with you: I will find a lot of exceptions; if the exceptions outnumber the rule (this could happen), this would imply that my initial idea is wrong. In any case I would be satisfied, because this would mean that I have been able to prove whether an intuition is right or wrong.

I hope this clarifies my previous post.
Many thanks and *sorry for the lack of clarity*.

Steve
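The "sequence of letters" idea from this thread could be sketched as character n-gram counts, which can then feed any classifier trained on the ISTAT list. This is only an illustration in base R: the trigram size, the ^ and $ boundary markers, and the function name char_ngrams are my assumptions, and the two towns are simply the ones named in the original post.

```r
# Split a town name into overlapping character n-grams.
# "^" and "$" mark the start and end of the name (an assumption of this sketch),
# so that typical prefixes and endings become features of their own.
char_ngrams <- function(name, n = 3) {
  s <- paste0("^", tolower(gsub("[^[:alpha:]]", "", name)), "$")
  if (nchar(s) < n) return(character(0))
  starts <- seq_len(nchar(s) - n + 1)
  substring(s, starts, starts + n - 1)
}

towns <- c("Caronno Pertusella", "Frascati")
grams <- lapply(towns, char_ngrams)

# Town-by-trigram count matrix: one row per town, one column per trigram
vocab <- sort(unique(unlist(grams)))
X <- t(sapply(grams, function(g) table(factor(g, levels = vocab))))
rownames(X) <- towns

print(char_ngrams("Frascati"))
print(dim(X))
```

With the full ISTAT list, X and a north/south label could be split into training and validation sets as described in point 3, and the validation error would show whether the intuition holds.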
[R] Text mining? Text manipulation? Both? Predicting KRAS test results in cancer patients
Happy Friday Everyone,

Hope Friday afternoon doesn't turn out to be a terrible time to post a question.

I've been doing a little data mining of patient text medical records as of late. I started out trying to predict whether or not cancer patients had received KRAS mutation testing and did quite well with that. Now I'm trying to predict the results of KRAS testing (mutated vs. wild type). This is proving to be a little more difficult.

With the first classification task, I created counts of terms (e.g., kras, mutated) in the text medical records using the tm package and then used those counts to predict whether or not patients had had KRAS mutation testing. I tried a few different analyses here, but found that random forests worked the best.

Predicting the results of testing is harder, though, because of the way physicians and other healthcare professionals write about testing. For example, I'm finding phrases like "KRAS mutation returned wild-type." In this example, if we're counting, we get 1 instance of kras, 1 instance of mutated, and 1 instance of wild. So you can see how it might be difficult to accurately predict the results of testing based on counts alone.

My question is how best to deal with this. Are there any R text mining packages or related software that would be particularly suited to my problem? I took a look at the CRAN Task View: Natural Language Processing and there were so many options I didn't really know where to start (and it's not even clear that an R-based solution will work best for my problem).

Alternatively, is there any real chance one could simply write code that would be able to identify true references to the results of KRAS testing and then create counts only of what are likely to be true references?

I'd greatly appreciate it if someone could point me in the right direction.

Thanks,
Paul
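On the last question (code that picks out true references to a test result), a rough rule-based sketch in base R follows. It treats a result as "KRAS ... reporting verb ... result term" within one sentence. The verb list, the result terms, the example notes other than the first, and the function name extract_kras_result are all assumptions made for illustration; real clinical notes would need a much richer pattern and handling of negation.

```r
# Example notes; only the first phrase comes from the original post
notes <- c("KRAS mutation returned wild-type.",
           "KRAS testing showed a mutation in codon 12.",
           "Will order KRAS mutation testing.")

extract_kras_result <- function(text) {
  txt <- tolower(text)
  # "kras", then a reporting verb, then the first result term, all before a period.
  # The capture group holds the result term that follows the verb.
  pat <- paste0("kras[^.]*\\b(?:returned|showed|revealed|is|was)\\b",
                "[^.]*?\\b(wild[- ]?type|mutated|mutation|mutant)\\b")
  m <- regmatches(txt, regexec(pat, txt, perl = TRUE))
  vapply(m, function(x) {
    if (length(x) < 2) NA_character_           # no result statement found
    else if (grepl("^wild", x[2])) "wild-type"
    else "mutated"
  }, character(1))
}

result <- extract_kras_result(notes)
print(result)
```

The extracted label (or NA) could then be used as a feature alongside the term counts, rather than replacing them.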
[R] Text mining: Narrowing a field of 27,855 predictors using semi-partial correlations or some other means
Hello Everyone,

Trying to learn a little bit about data mining. I'm working on a text mining project that will attempt to predict whether cancer patients got a particular type of genetic testing. A subsequent stage then will be aimed at predicting what the results of that testing were. I've used the tm package to prepare my data and am planning to use rattle to do the actual data mining.

The tm package has proved to be a great help so far. I've managed to perform a variety of transformations of my data. I've also managed to create a document-term matrix that has a row for each of my patients and columns for each of the terms in my patient medical records. Because I'm not yet a particularly good R programmer, I've converted my document-term matrix to a data frame and then added information about the genetic testing.

So here's the thing. The tm package has a feature that would allow me to drop words that occur infrequently in patient medical records. However, I've been asked not to use it because it's believed that even infrequently occurring terms may be highly diagnostic. The consequence is that my data frame has a large number of columns for the various words. In fact, over 27,000 of them. So my question is how to reduce this to some more manageable number.

One thought has been to look at semi-partial correlations. Here these would be between tested (y/n) and each predictor, controlling for length of medical record. The idea would be to use in the actual data mining only those predictors that were significant. Is this likely to be a good approach? Or is there likely to be a better way of doing it?

If it is a good approach, I’m wondering how to go about obtaining the necessary results. I’ve managed to figure out how to compute semi-partial correlations using the spcor.test() function in the ppcor package, as in:

    spcor.test(as.numeric(Tested$TestStatus == "Yes"), Tested$predictor, Tested$nchar_record)

       estimate      p.value statistic   n gp  Method
    1 0.3853547 2.307562e-08  5.587203 182  1 pearson

This is fine for a single pair of variables. What I’d need though is to combine a whole series of such outputs, one for each of my predictors. After that, I’d need to be able to determine which semi-partial correlations were significant (or perhaps substantial) and to create a list that I could use to eliminate a lot of the predictors from my data frame.

I’m just beginning to use R in my day-to-day work. So it’s not clear to me how to do this.

Thanks,
Paul
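One way the loop over predictors might look is sketched below in base R, so that it runs without ppcor. It uses the fact that a semi-partial correlation is the correlation between y and the residuals of the predictor after regressing it on the control variable; the p-values from cor.test() may differ slightly from those of spcor.test() because of the degrees of freedom used. The data are simulated, the column names other than TestStatus and nchar_record are hypothetical, and the Benjamini-Hochberg adjustment and 0.05 cut-off are my own choices, not something from the post.

```r
set.seed(42)
n <- 182
status <- rbinom(n, 1, 0.5)

# Simulated stand-in for the real data frame (term columns are hypothetical)
Tested <- data.frame(
  TestStatus   = ifelse(status == 1, "Yes", "No"),
  nchar_record = round(rnorm(n, 5000, 1000)),
  kras         = rpois(n, 1 + 3 * status),   # strongly related to the outcome
  mutated      = rpois(n, 1 + 2 * status),
  noise1       = rpois(n, 2),
  noise2       = rpois(n, 2),
  stringsAsFactors = FALSE
)

# Semi-partial correlation of y with x, removing z from x only
semi_partial <- function(y, x, z) {
  x_resid <- resid(lm(x ~ z))
  ct <- cor.test(y, x_resid)
  c(estimate = unname(ct$estimate), p.value = ct$p.value)
}

y <- as.numeric(Tested$TestStatus == "Yes")
predictors <- setdiff(names(Tested), c("TestStatus", "nchar_record"))

# One row per predictor, collected into a single data frame
res <- as.data.frame(t(sapply(predictors, function(p)
  semi_partial(y, Tested[[p]], Tested$nchar_record))))

# With thousands of tests, adjust for multiple comparisons before filtering
res$p.adj <- p.adjust(res$p.value, method = "BH")
keep <- rownames(res)[res$p.adj < 0.05]

reduced <- Tested[, c("TestStatus", "nchar_record", keep), drop = FALSE]
print(res)
```

Columns with zero variance would give NA correlations and would need to be dropped first. Whether significance filtering is a good screening rule at all is a separate question that this sketch does not settle.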
Re: [R] Text Mining with Facebook Reviews (XML and FQL)
Hi Kenneth

First off, you probably don't need to use xmlParseDoc(), but rather xmlParse(). (Both are fine, but xmlParseDoc() allows you to control many of the options in the libxml2 parser, which you don't need here.)

xmlParse() has some capabilities to fetch the content of URLs. However, it cannot deal with HTTPS requests, which this call to Facebook is. The approach to this is to

 i) make the request
 ii) parse the resulting string via xmlParse(txt, asText = TRUE)

As for i), there are several ways to do this, but the RCurl package allows you to do it entirely within R and gives you more control over the request than you would ever want.

    library(RCurl)
    library(XML)
    txt = getForm('https://api.facebook.com/method/fql.query', query = QUERY)
    mydata.xml = xmlParse(txt, asText = TRUE)

However, you are most likely going to have to login / get a token before you make this request. And then, if you are using RCurl, you will want to use the same curl object with the token or cookies, etc.

 D.

On 10/10/11 3:52 PM, Kenneth Zhang wrote:
> I am trying to use XML package to download Facebook reviews
> [...]
> The mydata.xml is NULL therefore no further step can be execute.
[R] Text Mining with Facebook Reviews (XML and FQL)
Hello,

I am trying to use the XML package to download Facebook reviews in the following way:

    require(XML)
    mydata.vectors <- character(0)
    Qword <- URLencode('#IBM')
    QUERY <- paste('SELECT review_id, message, rating from review where message LIKE %', Qword, '%', sep='')
    Facebook_url = paste('https://api.facebook.com/method/fql.query?query= ', QUERY, sep='')
    mydata.xml <- xmlParseDoc(Facebook_url, asText=F)
    mydata.vector <- xpathSApply(mydata.xml, '//s:entry/s:title', xmlValue,
                                 namespaces = c('s'='http://www.w3.org/2005/Atom'))

mydata.xml is NULL, therefore no further step can be executed. I am not so familiar with XML or FQL. Any suggestion will be appreciated. Thank you!

Best regards,
Kenneth
[R] Text mining analysis methods
Hi all,

I am trying to do some text mining in R. So far I have managed things like term-document matrices, word counts and similar.

What I would like to do is this: I have soil descriptions from borehole logs that correspond to soil classes. The problem is that some of these descriptions are wrongly classified. What I did is make a DocumentTermMatrix for each of the classes. Now I would like to use some kind of statistical method to determine to which of the classes the descriptions actually belong.

Hope that explains it. Any help or info would be greatly appreciated,

matevz
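One simple statistical approach to the problem above, sketched in base R: compare each description with the average term profile (centroid) of every class, leaving the description itself out, and flag cases where the most similar class differs from the recorded one. The soil descriptions, the class labels and the function names below are invented for illustration; with real data the matrix could come from tm's DocumentTermMatrix instead.

```r
# Toy data: the last description is deliberately labeled with the wrong class
docs <- data.frame(
  text  = c("brown silty clay", "grey clay stiff", "coarse sand gravel",
            "fine sand with gravel", "silty clay soft"),
  class = c("clay", "clay", "sand", "sand", "sand"),
  stringsAsFactors = FALSE
)

# Document-term count matrix built by hand (rows = descriptions)
tokens <- strsplit(tolower(docs$text), "[^a-z]+")
vocab  <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(tk) as.numeric(table(factor(tk, levels = vocab)))))
colnames(dtm) <- vocab

cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# For description i, find the class whose centroid (computed without i) is closest
suggest_class <- function(i) {
  others <- setdiff(seq_len(nrow(dtm)), i)
  classes <- unique(docs$class[others])
  sims <- sapply(classes, function(cl) {
    centroid <- colMeans(dtm[others[docs$class[others] == cl], , drop = FALSE])
    cosine(dtm[i, ], centroid)
  })
  names(sims)[which.max(sims)]
}

docs$suggested <- vapply(seq_len(nrow(docs)), suggest_class, character(1))
docs$suspect   <- docs$suggested != docs$class
print(docs)
```

Rows marked as suspect are candidates for manual review rather than automatic relabeling, since a handful of shared words can easily mislead a method this simple.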
[R] text mining
Hi,

I have a problem when indexing the corpus. I used the following syntax:

    Setwd (c :/)
    Library (tm)
    Txt = Corpus (DirSource (.); readerControl = list (language = frensh))

An error message appears:

    Warning messages:
    1: In readLines(y, encoding = x$Encoding) :
      incomplete final line found on './n3.txt'
    2: In readLines(y, encoding = x$Encoding) :
      incomplete final line found on './n32.

Another question: how can I read different document types (.pdf, .html, ...) using the tm package?

Thank you very much for your help.
Re: [R] text mining
On 30/05/2011 6:17 AM, rgui wrote:
> Hi,
>
> I have a problem when indexing the corpus. I used the following syntax:
>
> Setwd (c :/)
> Library (tm)
> Txt = Corpus (DirSource (.); readerControl = list (language = frensh))

Capitalization is important in R, so when asking a question, please cut and paste what you actually did. In this case, it doesn't matter.

> An error message appears:
> Warning messages:
> 1: In readLines(y, encoding = x$Encoding) :
>   incomplete final line found on './n3.txt'
> 2: In readLines(y, encoding = x$Encoding) :
>   incomplete final line found on './n32.

Those are warnings, not errors. readLines gives those warnings when the last line of the file stops abruptly, rather than having an end of line marker. On Unix systems this usually signals a problem with the file. Windows is more tolerant, so many editors don't bother to add the final marker.

> Another question: how can I read different document types (.pdf, .html, ...)
> using the tm package?

I think you need to convert them to text first (by some tool outside of R), but I might be wrong.

Duncan Murdoch
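The warning described above is easy to reproduce in base R, which may help confirm that it is harmless here. The sketch below writes a file without a final end-of-line marker, reads it, then appends the marker and reads it again; the file contents and the helper name saw_warning are made up for the example.

```r
f <- tempfile(fileext = ".txt")

# cat() writes exactly what it is given, so the last line has no end-of-line marker
cat("premiere ligne\nderniere ligne", file = f)

# Helper: TRUE if readLines() raised a warning for this file
saw_warning <- function(path) {
  warned <- FALSE
  withCallingHandlers(
    readLines(path),
    warning = function(w) {
      warned <<- TRUE
      invokeRestart("muffleWarning")
    }
  )
  warned
}

before <- saw_warning(f)            # incomplete final line: a warning is expected

cat("\n", file = f, append = TRUE)  # add the missing end-of-line marker
after <- saw_warning(f)             # now the file reads cleanly

lines <- readLines(f)
print(c(before = before, after = after))
print(lines)
```

Either way both lines are read, so the corpus contents should not be affected by the warning.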
Re: [R] text mining
Thank you very much.
[R] text mining
Hi,

how can I import a .txt document using the tm package? What is the command, given that my document is not stored in the tm package's library folder?

Thanks.
Re: [R] text mining
Hi,

I do it like this:

    setwd("C:/Users/mpavlic/Desktop/Temp")
    library(tm)
    tekst <- Corpus(DirSource("."), readerControl = list(language = "ansi"))

where the *.txt files are stored in a folder Temp on my desktop.

HTH, m

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of rgui
Sent: Thursday, May 26, 2011 1:02 PM
To: r-help@r-project.org
Subject: [R] text mining

> how can I import a document whose type is .txt using the package tm?
> [...]
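A self-contained variant of the same recipe, for anyone who wants to try it without touching their own folders. It assumes a reasonably recent tm and writes two throwaway files into a temporary directory; the directory name, file names and the "en" language code are placeholders, not values from the thread.

```r
library(tm)

# Throwaway folder with two small .txt files (stand-ins for your own documents)
doc_dir <- file.path(tempdir(), "tm_demo_docs")
dir.create(doc_dir, showWarnings = FALSE)
writeLines("First plain text document.",  file.path(doc_dir, "a.txt"))
writeLines("Second plain text document.", file.path(doc_dir, "b.txt"))

# DirSource points at any folder on disk; it does not have to be inside the tm package
corp <- VCorpus(DirSource(doc_dir, pattern = "\\.txt$", encoding = "UTF-8"),
                readerControl = list(reader = readPlain, language = "en"))

print(corp)
print(content(corp[[1]]))
```

Passing the folder to DirSource directly avoids the need for setwd(), which keeps the working directory unchanged for the rest of the session.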
Re: [R] text mining analysis and word visualization of pdfs
Date: Wed, 18 May 2011 15:24:49 +0530
From: ashimkap...@gmail.com
To: k...@huftis.org
CC: r-h...@stat.math.ethz.ch
Subject: Re: [R] text mining analysis and word visualization of pdfs

> On Wed, May 18, 2011 at 1:44 PM, Karl Ove Hufthammer wrote:
>> Ajay Ohri wrote:
>>> What is the appropriate software package for dumping say 20 PDFs in a
>>> folder, then creating data visualization with frequency counts of
>>> certain words as well as measure correlation within each file for
>>> certain key relationships or key words.
>>
>> pdftotext + Unix™ for Poets + R (ggplot2)
>
> What about the tm package? I am a beginner and I don't know much about
> this but I recall that it does have the ability to handle PDFs. A few
> words from the experts would be nice.

I don't know if I'm an expert (I can't even get a browser that echoes keystrokes in a reasonable time with a 4-core CPU on 'dohs), but PDF could mean just about anything in terms of how text is represented. Whatever R packages do, they will not be able to read the mind of the author. Even with pdftotext there are many options, and even simple things like US IRS instruction forms can be almost impossible to extract in a coherent manner. Many authors couldn't care less about the information as long as the thing looks like a paper copy.

If you are stuck with PDF, I'd be looking for more tools first, as you will probably want to know how the files are constructed. I would just reiterate that the best approach for many data analysts would be to contact the data source and explain the problems with improperly authored PDFs or other specialized file formats that are only supported by limited proprietary tools or that obfuscate the information of interest.
[R] text mining analysis and word visualization of pdfs
Dear Lists,

What is the appropriate software package for dumping say 20 PDFs in a folder, then creating data visualization with frequency counts of certain words, as well as measuring correlation within each file for certain key relationships or key words?

I am doing text analysis of biases in enterprise software sponsored publications and need to come up with a statistical threshold.

Regards,
Ajay Ohri
Websites- http://decisionstats.com
Re: [R] text mining analysis and word visualization of pdfs
Ajay Ohri wrote:
> What is the appropriate software package for dumping say 20 PDFS in a
> folder, then creating data visualization with frequency counts of
> certain words as well as measure correlation within each file for
> certain key relationships or key words.

pdftotext + Unix™ for Poets + R (ggplot2)

HTH.

--
Karl Ove Hufthammer
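The R end of that pipeline might look like the sketch below. It assumes the PDFs have already been converted outside R, for example with something like `pdftotext report.pdf report.txt` for each file, and that the resulting .txt files sit in one folder. The folder, the file contents, the keyword list and the function name count_keywords are all invented for the example; here two small text files are written first so that the code runs on its own.

```r
# Stand-ins for text files produced by pdftotext
txt_dir <- file.path(tempdir(), "pdf_text_demo")
dir.create(txt_dir, showWarnings = FALSE)
writeLines("The vendor praises the cloud. Cloud costs fall.",
           file.path(txt_dir, "a.txt"))
writeLines("Analytics and cloud analytics lead the market.",
           file.path(txt_dir, "b.txt"))

keywords <- c("cloud", "analytics", "vendor")   # hypothetical words of interest

# Count how often each keyword occurs in one file (case-insensitive, whole words)
count_keywords <- function(path, keywords) {
  words <- unlist(strsplit(tolower(paste(readLines(path), collapse = " ")), "[^a-z]+"))
  vapply(keywords, function(k) sum(words == k), integer(1))
}

files <- file.path(txt_dir, c("a.txt", "b.txt"))
freq <- t(sapply(files, count_keywords, keywords = keywords))
rownames(freq) <- basename(files)

print(freq)   # files in rows, keywords in columns
```

From here, barplot(t(freq), beside = TRUE) or ggplot2 would give the frequency picture, and cor(freq) the keyword correlations across files, provided there are enough documents for the correlations to mean anything.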
Re: [R] text mining analysis and word visualization of pdfs
On Wed, May 18, 2011 at 1:44 PM, Karl Ove Hufthammer k...@huftis.org wrote:
> Ajay Ohri wrote:
>> What is the appropriate software package for dumping say 20 PDFS in a
>> folder, then creating data visualization with frequency counts of
>> certain words as well as measure correlation within each file for
>> certain key relationships or key words.
>
> pdftotext + Unix for Poets + R (ggplot2)

What about the tm package? I am a beginner and I don't know much about this, but I recall that it does have the ability to handle PDFs. A few words from the experts would be nice.
[R] text mining problem using TM package
Hi,

I'm using R (tm package) for text mining and I'm having problems filtering articles out of my data set by local meta data. Here is the code:

    data <- ("C:/ /19970331")
    rs <- ReutersSource(data, encoding = "UTF-8")
    RC <- VCorpus(DirSource(data),
                  readerControl = list(reader = readRCV1asPlain,
                                       language = "en_US",
                                       load = TRUE),
                  dbControl = list(useDb = TRUE,
                                   dbName = "texts.db",
                                   dbType = "DB1"))

    tm_index(RC, FUN = sFilter, doclevel = F, useMeta = T, "Topics == 'MCAT'")

When I use sFilter, I can only filter fields in yellow; I want to filter fields in red. What am I doing wrong?

Thanks,
Andy

This is the meta data that is attached to each article:

    Available meta data pairs are:
      Author       :
      DateTimeStamp: 1997-03-31
      Description  :
      Heading      : USA: WHX begins tender offer for Dynamics Corp.
      ID           : 476871
      Language     : en_US
      Origin       : Reuters Corpus Volume 1
    User-defined local meta data pairs are:
    $Publisher
    [1] "Reuters Holdings Plc"
    $Topics
    [1] "C18" "C181" "CCAT"
    $Industries
    [1] "I22100" "I34000"
    $Countries
    [1] "USA"
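The tm interface has changed since this was posted, and as far as I know sFilter is no longer part of the package, so the following is only a sketch of how filtering on a user-defined meta data field might look in a recent tm. It builds a two-document toy corpus rather than reading RCV1; the texts and the second document's topic codes are invented, while the first document's codes are the ones shown in the post.

```r
library(tm)

# Toy corpus standing in for the Reuters articles
docs <- VCorpus(VectorSource(c("WHX begins tender offer for Dynamics Corp.",
                               "Markets rally on rate hopes.")))

# Attach local (document-level) meta data; "Topics" mirrors the field in the post
meta(docs[[1]], "Topics") <- c("C18", "C181", "CCAT")
meta(docs[[2]], "Topics") <- c("M11", "MCAT")   # hypothetical codes

# Keep only the documents whose Topics field contains the code of interest
has_topic <- function(doc, code) code %in% meta(doc, "Topics")
mcat <- tm_filter(docs, has_topic, code = "MCAT")

print(length(mcat))
print(meta(mcat[[1]], "Topics"))
```

Because the filter function receives the whole document, any of the user-defined fields (Industries, Countries, and so on) can be tested the same way.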
[R] Text Mining in R
Dear R users,

I'm new to text mining applications and have just started to look into the tm package. If any of you has experience with this package, I would appreciate it if you could share your thoughts on it. Also, what is the best way to store large amounts of text data on limited RAM when using this package?

Thanks in advance for your help,
Axel.
[R] text mining
The following code is derived from a paper titled "Text Mining Infrastructure in R" (http://www.jstatsoft.org/v25/i05/paper). The example below seems to load some default documents for analysis, some sort of Latin document. I cannot for the life of me figure out how to load my own document, let alone an entire corpus. I have searched the above document as well as related documentation. Any leads or help would be appreciated. Thanks everyone.

From the document:

    txt <- system.file("texts", "txt", package = "tm")
    (ovid <- Corpus(DirSource(txt),
                    readerControl = list(reader = readPlain,
                                         language = "la",
                                         load = TRUE)))

My attempt:

    txt <- system.file("Speeches/speech", "txt", package = "tm")
    (ovid <- Corpus(DirSource(txt),
                    readerControl = list(reader = readPlain,
                                         language = "la",
                                         load = TRUE)))
Re: [R] text mining
On Fri, Oct 2, 2009 at 10:15 AM, PDXRugger j_r...@hotmail.com wrote:
> I cannot for the life of me figure out how to load my own document, let
> alone an entire corpus.

Your problem lies in the use of system.file. This command looks inside the installed folder of the tm package for the specified subfolders. See ?system.file.

Basically, in the example from the document, it assigns to txt a directory string like

    C:/Program Files (x86)/R/R-2.9.0/library/tm/texts/txt

Then DirSource(txt) constructs a directory source from the directory string txt. Finally, Corpus constructs a tm corpus from the DirSource object (with some extra arguments to boot).

So, to solve your problem, replace txt with the directory containing your files:

    txt <- "C:/location to folder/docs"

and then run the subsequent command

    ovid <- Corpus(DirSource(txt),
                   readerControl = list(reader = readPlain,
                                        language = "la",
                                        load = TRUE))

(though you may want to change the object name ovid to something more descriptive, and language to match your documents).

C
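The steps in this reply can be tried end to end with a minimal sketch like the one below. It assumes a recent tm release and uses VCorpus explicitly; the temporary folder, the two file names, and their contents are hypothetical placeholders for a real folder of documents.

```r
library(tm)

# system.file() only resolves paths inside an installed package, so for the
# example shipped with tm it returns a directory string.
example_dir <- system.file("texts", "txt", package = "tm")

# For your own files, point DirSource() at an ordinary directory instead.
# A throwaway folder with two small files stands in for it here.
doc_dir <- file.path(tempdir(), "my_speeches")
dir.create(doc_dir, showWarnings = FALSE)
writeLines("First sample speech.", file.path(doc_dir, "speech1.txt"))
writeLines("Second sample speech.", file.path(doc_dir, "speech2.txt"))

# Build the corpus: one document per file in the directory.
speeches <- VCorpus(DirSource(doc_dir, encoding = "UTF-8"),
                    readerControl = list(reader = readPlain,
                                         language = "en"))

length(speeches)
```

The language value "en" is an assumption for English-language files; the "la" in the paper's example refers to the Latin sample texts.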
[R] text mining in italian
Hello everybody,

I'm trying to do text mining on a series of texts in Italian. I would like to know whether it is possible to find Italian synonyms, and/or whether something like the WordNet database for English also exists for Italian.

Thank you very much in advance.

Regards,
Laura
Re: [R] Text Mining
Hi everyone,

I am a newbie to text mining, and I have to do my project in it. I've looked up various information online but still haven't got an idea of where to start. So please, if anyone could give suggestions on this, it would be really helpful.

Thanks a lot in advance.
Re: [R] Text Mining
On Fri, Feb 6, 2009 at 9:16 AM, spiketide spiket...@gmail.com wrote:
> I am a newbie to text mining, and I have to do my project in it. I've looked
> up various information online but still haven't got an idea of where to
> start.

There is an interesting article, "An Introduction to Text Mining in R" by Ingo Feinerer, in R News Volume 8/2, October 2008 (http://www.r-project.org/doc/Rnews/Rnews_2008-2.pdf). Check it out.