[R] [R-pkgs] Natural Language Processing for non-English languages with udpipe
Dear R users,

I'm happy to announce the release of version 0.3 of the udpipe R package on CRAN (https://CRAN.R-project.org/package=udpipe).

The udpipe R package is a Natural Language Processing toolkit that provides language-agnostic tokenization, parts-of-speech tagging, lemmatization, morphological feature tagging and dependency parsing of raw text. Besides parsing text, the package also allows you to train annotation models based on treebank data in CoNLL-U format, as provided at http://universaldependencies.org/format.html.

The package provides direct access to annotation models trained on more than 50 languages. The following languages are directly available: afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt, czech, danish, dutch-lassysmall, dutch, english-lines, english-partut, english, estonian, finnish-ftb, finnish, french-partut, french-sequoia, french, galician-treegal, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal, norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br, portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian, slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese.

We hope that the package will allow other R users to build natural language applications on top of the resulting parts-of-speech tags, tokens, morphological features and dependency parsing output. In particular, we hope that applications will arise which are not limited to English (built for example with the textrank or cleanNLP packages, to name a few).

Note that the package has no external software dependencies (neither Java nor Python) and depends on only 2 R packages (Rcpp and data.table), which makes it easy to install on any platform.

The package is available on CRAN at https://CRAN.R-project.org/package=udpipe and is developed at https://github.com/bnosac/udpipe. A small Docusaurus website is available at https://bnosac.github.io/udpipe/en.

We hope you enjoy using it, and we would like to thank Milan Straka for all the work done on UDPipe, as well as everyone involved in http://universaldependencies.org.

all the best,
Jan

Jan Wijffels
Statistician
www.bnosac.be | +32 486 611708

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
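A minimal sketch of typical usage, following the package's documented workflow (the Dutch example sentence is illustrative, and downloading the model requires a network connection):

```r
library(udpipe)

# Download and load a pre-trained annotation model (here: Dutch)
dl <- udpipe_download_model(language = "dutch")
model <- udpipe_load_model(file = dl$file_model)

# Annotate raw text: tokenization, POS tagging, lemmatization, dependency parsing
x <- udpipe_annotate(model, x = "Ik ging op reis en ik nam mee: mijn laptop.")

# One row per token, with upos, lemma, head_token_id, dep_rel, ... columns
head(as.data.frame(x))
```

The resulting data.frame is the starting point for the downstream applications mentioned above (keyword extraction, summarization, and so on).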
[R] [R-pkgs] release of version 0.2 of the textrank package
Hello R users,

I'm pleased to announce the release of version 0.2 of the textrank package on CRAN: https://CRAN.R-project.org/package=textrank

The package is a natural language processing package which allows one to summarize text by finding
- relevant sentences
- relevant keywords

This is done by constructing a sentence network which captures how sentences are related to one another (word overlap). The Google PageRank algorithm is then applied on that network in order to find the most relevant sentences. In a similar way, textrank can also be used to extract keywords. How? A word network is constructed by checking which words follow one another; the PageRank algorithm is applied on top of that network to extract relevant words, and relevant words which follow one another are then pasted together to form keywords.

The package has a vignette at https://cran.r-project.org/web/packages/textrank/vignettes/textrank.html and it also plays nicely with the udpipe package (https://CRAN.R-project.org/package=udpipe), which takes care of parts-of-speech tagging, lemmatisation, dependency parsing and general NLP processing.

all the best,
Jan

Jan Wijffels
Statistician
www.bnosac.be
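As a sketch of the sentence-ranking workflow, assuming the input layout used in the package vignette (one row per sentence plus a table of the words each sentence contains; the toy sentences below are illustrative):

```r
library(textrank)

# Toy input: each sentence has an identifier and its raw text
sentences <- data.frame(
  textrank_id = 1:3,
  sentence = c("R is a language for statistics.",
               "R has many packages for text mining.",
               "The weather is nice today."),
  stringsAsFactors = FALSE)

# Terminology: the (lemmatised) words occurring in each sentence,
# typically obtained from udpipe annotation
terminology <- data.frame(
  textrank_id = c(1, 1, 2, 2, 3, 3),
  lemma = c("R", "statistics", "R", "text", "weather", "nice"),
  stringsAsFactors = FALSE)

# PageRank on the sentence network built from word overlap
tr <- textrank_sentences(data = sentences, terminology = terminology)
summary(tr, n = 2)  # the 2 most relevant sentences
```

In a real application the sentences and terminology tables would come from udpipe output rather than being typed by hand.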
[R] [R-pkgs] RMOA data stream modelling using MOA (Massive Online Analysis)
Dear R community,

For users interested in streaming classification, or in building classification models with a limited amount of RAM on your whole data set, I would like to announce the release of a new package called RMOA on CRAN (http://cran.r-project.org/web/packages/RMOA).

MOA is the most popular open source framework for data stream mining and is being developed at the University of Waikato: http://moa.cms.waikato.ac.nz. RMOA interfaces with MOA version 2014.04 and focuses on building streaming classification models on data streams (the stream package in R already covers clustering).

Classification models which are possible through RMOA are:
- Classification trees: AdaHoeffdingOptionTree, ASHoeffdingTree, DecisionStump, HoeffdingAdaptiveTree, HoeffdingOptionTree, HoeffdingTree, LimAttHoeffdingTree, RandomHoeffdingTree
- Bayesian classification: NaiveBayes, NaiveBayesMultinomial
- Active learning classification: ActiveClassifier
- Ensemble (meta) classifiers:
  * Bagging: LeveragingBag, OzaBag, OzaBagAdwin, OzaBagASHT
  * Boosting: OCBoost, OzaBoost, OzaBoostAdwin
  * Stacking: LimAttClassifier
  * Other: AccuracyUpdatedEnsemble, AccuracyWeightedEnsemble, ADACC, DACC, OnlineAccuracyUpdatedEnsemble, TemporallyAugmentedClassifier, WeightedMajorityAlgorithm

Interfaces are implemented to model data in standard files (csv, txt, delimited), ffdf data (from the ff package), data.frames and matrices.

Documentation of MOA directed towards RMOA users can be found at http://jwijffels.github.io/RMOA/. Examples of the use of RMOA can be found in the documentation, on GitHub at https://github.com/jwijffels/RMOA, or e.g. by viewing the showcase at http://bnosac.be/index.php/blog/32-rmoa-massive-online-data-stream-classifications-with-r-a-moa

If you have any remarks or requests, don't hesitate to get in contact.
stream on,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be | +32 486 611708
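A small sketch of the intended workflow, based on the function names in the RMOA documentation (it requires a working Java installation via rJava; the choice of iris and of the predictor columns is purely illustrative):

```r
library(RMOA)

# Set up a Hoeffding tree classifier and wrap a data.frame as a data stream
hdt <- HoeffdingTree(numericEstimator = "GaussianNumericAttributeClassObserver")
iristream <- datastream_dataframe(data = iris)

# Train the model incrementally on chunks of the stream
fit <- trainMOA(model = hdt,
                formula = Species ~ Sepal.Length + Sepal.Width + Petal.Length,
                data = iristream)

# Score new data with the trained streaming model
scores <- predict(fit, newdata = iris, type = "response")
table(scores, iris$Species)
```

The same trainMOA call works with the other classifiers listed above by swapping out the model object.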
[R] * operator overloading setOldClass
Dear R gurus,

I am trying to overload some operators in order to let them work with the ff package, by registering the S3 classes from the ff package and overloading the operators as shown below, in a reproducible example where the * operator is overloaded.

require(ff)
setOldClass(Classes = "ff_vector")
setMethod(
  f = "*",
  signature = signature(e1 = "ff_vector", e2 = "ff_vector"),
  definition = function(e1, e2) {
    e1[] * e2[]
  })
ff(1:10) * ff(1:10)

It looks like ff(1:10) * ff(1:10) is not recognising the fact that both objects are of class ff_vector. Can someone tell me why this is, point me to some documentation on how to solve this, and possibly indicate what needs to be done so that the above code gives similar behaviour as

1:10 * 1:10
[1]   1   4   9  16  25  36  49  64  81 100

thanks in advance,
Jan

--
groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be | +32 486 611708
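For context, arithmetic primitives such as * do dispatch on plain S3 methods defined for the class of their arguments, so an alternative worth trying is the S3 route. A base-R sketch with a hypothetical stand-in class (myvec, not ff itself) would be:

```r
# Hypothetical S3 class standing in for ff_vector: primitives like `*`
# dispatch on S3 methods named after the class of their operands.
"*.myvec" <- function(e1, e2) unclass(e1) * unclass(e2)

a <- structure(1:10, class = "myvec")
b <- structure(1:10, class = "myvec")
res <- a * b
print(res)  # 1 4 9 16 25 36 49 64 81 100
```

For ff the analogous method would subset the ff objects into RAM first, as in the setMethod body above; whether that is desirable for very large ff vectors is a separate question.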
[R] last observation carried forward +1
Hi R-helpers,

I'm looking for a vectorised function which does missing value replacement as in last observation carried forward (na.locf in the zoo package), but instead of a plain locf I would like the function to add +1 each time a missing value occurs. See below for an example.

require(zoo)
x <- 5:15
x[4:7] <- NA
coredata(na.locf(zoo(x)))
[1]  5  6  7  7  7  7  7 12 13 14 15

But what I need is
5 6 7 7+1 7+1+1 7+1+1+1 7+1+1+1+1 12 13 14 15
to obtain
[1]  5  6  7  8  9 10 11 12 13 14 15

I could program this in C, but if anyone has already done this I would be interested in seeing their vectorised solution.

thanks,
Jan

--
groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be | +32 486 611708
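One vectorised base-R sketch of such a replacement (the function name is made up, and it assumes the first element is not NA):

```r
# Carry the last observation forward, adding +1 for every position since
# the most recent non-missing value. Assumes x[1] is not NA.
locf_plus_one <- function(x) {
  last <- cummax(seq_along(x) * !is.na(x))  # index of most recent non-NA value
  x[last] + (seq_along(x) - last)           # add the gap length since that value
}

x <- 5:15
x[4:7] <- NA
locf_plus_one(x)
# [1]  5  6  7  8  9 10 11 12 13 14 15
```

The cummax trick yields, for each position, the index of the last non-missing element, so no explicit loop over the NA runs is needed.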
Re: [R] issue with ... in write.fwf in gdata
Hi Greg,

That's indeed the solution. Thanks for updating the package. I'm looking forward to seeing it on CRAN.

kind regards,
Jan

From: g...@warnes.net
Date: Fri, 12 Nov 2010 14:39:46 -0500
Subject: Re: [R] issue with ... in write.fwf in gdata
To: janwijff...@hotmail.com
CC: r-help@r-project.org

Hi Jan,

The issue isn't that the ... arguments aren't passed on. Rather, the problem is that in the current implementation the ... arguments are passed to format(), which doesn't understand the eol argument. The solution is to modify write.fwf() to explicitly accept all of the appropriate arguments for write.table() and to pass the ... arguments only to format() and format.info(). I've just modified gdata to make this change, and have submitted the new version to CRAN as gdata version 2.8.1.

-Greg

On Fri, Nov 12, 2010 at 7:08 AM, Jan Wijffels janwijff...@hotmail.com wrote:

Dear R-list,

This is just a message to inform you that there is an issue with write.fwf in the gdata library (from version 2.5.0 on). It does not seem to accept further arguments to write.table, such as eol, as the help file indicates: it stops when executing tmp <- lapply(x, format.info, ...). Great package though - I use it a lot, except for this function :) See example below.

require(gdata)
saveto <- tempfile(pattern = "test.txt", tmpdir = tempdir())
write.fwf(x = data.frame(a = 1:length(LETTERS), b = LETTERS), file = saveto, eol = "\r\n")
Error in FUN(X[[1L]], ...) : unused argument(s) (eol = "\r\n")

sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] gdata_2.8.0

loaded via a namespace (and not attached):
[1] gtools_2.6.2

kind regards,
Jan
[R] issue with ... in write.fwf in gdata
Dear R-list,

This is just a message to inform you that there is an issue with write.fwf in the gdata library (from version 2.5.0 on). It does not seem to accept further arguments to write.table, such as eol, as the help file indicates: it stops when executing tmp <- lapply(x, format.info, ...). Great package though - I use it a lot, except for this function :) See example below.

require(gdata)
saveto <- tempfile(pattern = "test.txt", tmpdir = tempdir())
write.fwf(x = data.frame(a = 1:length(LETTERS), b = LETTERS), file = saveto, eol = "\r\n")
Error in FUN(X[[1L]], ...) : unused argument(s) (eol = "\r\n")

sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] gdata_2.8.0

loaded via a namespace (and not attached):
[1] gtools_2.6.2

kind regards,
Jan
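Until a fixed gdata reaches CRAN, a base-R workaround along these lines sidesteps write.fwf entirely by formatting the columns first and then using write.table, which does understand eol (the column width of 4 is an illustrative choice, not taken from the original post):

```r
# Pad each column to a fixed width with format(), then write the result
# with write.table(), which accepts the eol argument directly.
df <- data.frame(a = seq_along(LETTERS), b = LETTERS)
out <- tempfile(fileext = ".txt")
write.table(format(df, width = 4), file = out, quote = FALSE,
            row.names = FALSE, eol = "\r\n")
```

This lacks write.fwf's niceties (column alignment options, formatInfo output), but is enough to get fixed-width output with DOS line endings.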
[R] glibc detected *** /usr/lib64/R/bin/exec/R: double free or corruption ???? tm package
Hi,

I have a collection of .txt documents in my working folder for which I want to do some text mining. If I run TextDocCol from the tm package, R crashes with what looks like a memory issue. Does anyone have any idea whether this is related to R itself or to the tm package? Below you can find what is happening here.

setwd("/home/jan/Work/2008/Profacts/textmining/tryouts/workfolder")
require(tm)
Loading required package: tm
Loading required package: filehash
Simple key-value database (1.0-1 2007-08-13)
Loading required package: Matrix
Loading required package: lattice
Loading required package: Snowball
Loading required package: RWeka
Loading required package: rJava
Loading required package: grid
Loading required package: XML

sessionInfo()
R version 2.6.1 (2007-11-26)
x86_64-redhat-linux-gnu

locale:
LC_CTYPE=nl_BE.UTF-8;LC_NUMERIC=C;LC_TIME=nl_BE.UTF-8;LC_COLLATE=nl_BE.UTF-8;LC_MONETARY=nl_BE.UTF-8;LC_MESSAGES=nl_BE.UTF-8;LC_PAPER=nl_BE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=nl_BE.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base

other attached packages:
[1] tm_0.2-3.7 XML_1.93-2 Snowball_0.0-3 RWeka_0.3-9
[5] rJava_0.5-1 Matrix_0.999375-3 lattice_0.17-2 filehash_1.0-1

loaded via a namespace (and not attached):
[1] proxy_0.3 rcompgen_0.1-17

R.Version()
$platform
[1] "x86_64-redhat-linux-gnu"
$arch
[1] "x86_64"
$os
[1] "linux-gnu"
$system
[1] "x86_64, linux-gnu"
$status
[1] ""
$major
[1] "2"
$minor
[1] "6.1"
$year
[1] "2007"
$month
[1] "11"
$day
[1] "26"
$`svn rev`
[1] "43537"
$language
[1] "R"
$version.string
[1] "R version 2.6.1 (2007-11-26)"

test <- TextDocCol(DirSource(getwd()), readerControl = list(reader = readPlain, load = TRUE, language = "nl_BE"))
*** glibc detected *** /usr/lib64/R/bin/exec/R: double free or corruption (!prev): 0x22e20680 ***

=== Backtrace: ===
/lib64/libc.so.6[0x359946f4f4]
/lib64/libc.so.6(cfree+0x8c)[0x3599472b1c]
/usr/lib64/R/lib/libR.so[0x305b670a3d]
/usr/lib64/R/lib/libR.so[0x305b6f875e]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x291)[0x305b6c6c01]
/usr/lib64/R/lib/libR.so(Rf_eval+0x303)[0x305b6c3873]
/usr/lib64/R/lib/libR.so[0x305b6c4648]
/usr/lib64/R/lib/libR.so(Rf_eval+0x502)[0x305b6c3a72]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c6504]
/usr/lib64/R/lib/libR.so(R_execMethod+0x239)[0x305b6c6889]
/usr/lib64/R/library/methods/libs/methods.so[0x2e4367e9]
/usr/lib64/R/lib/libR.so[0x305b6f9cf7]
/usr/lib64/R/lib/libR.so(Rf_eval+0x55d)[0x305b6c3acd]
/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x291)[0x305b6c6c01]
/usr/lib64/R/lib/libR.so(Rf_eval+0x303)[0x305b6c3873]
/usr/lib64/R/lib/libR.so[0x305b6c638e]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c5009]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c6504]
/usr/lib64/R/lib/libR.so(R_execMethod+0x239)[0x305b6c6889]
/usr/lib64/R/library/methods/libs/methods.so[0x2e4367e9]
/usr/lib64/R/lib/libR.so[0x305b6f9cf7]
/usr/lib64/R/lib/libR.so(Rf_eval+0x55d)[0x305b6c3acd]
/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x291)[0x305b6c6c01]
/usr/lib64/R/lib/libR.so(Rf_eval+0x303)[0x305b6c3873]
/usr/lib64/R/lib/libR.so[0x305b6c638e]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so(Rf_ReplIteration+0x183)[0x305b6e7893]
/usr/lib64/R/lib/libR.so(run_Rmainloop+0xc2)[0x305b6e7bc2]
/usr/lib64/R/bin/exec/R(main+0x1b)[0x40080b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x359941d8a4]
/usr/lib64/R/bin/exec/R[0x400709]
[R] guidelines for the use of the R logo
Hi,

I was wondering if there are any guidelines for the use of the R logo on websites which are for commercial use? Similar to http://www.python.org/community/logos/ for Python and to http://www.postgresql.org/community/propaganda for PostgreSQL perhaps?

thanks,
Jan