[R] [R-pkgs] Natural Language Processing for non-English languages with udpipe

2018-01-16 Thread Jan Wijffels
Dear R users,

I'm happy to announce the release of version 0.3 of the udpipe R package on
CRAN (https://CRAN.R-project.org/package=udpipe). The udpipe R package is a
Natural Language Processing toolkit that provides language-agnostic
'tokenization', 'parts of speech tagging', 'lemmatization', 'morphological
feature tagging' and 'dependency parsing' of raw text. Besides text
parsing, the R package also allows you to train annotation models on
'treebank' data in 'CoNLL-U' format, as described at
http://universaldependencies.org/format.html.
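
To give an idea of what 'CoNLL-U' data looks like, here is a minimal base R
sketch that reads a tiny hand-written snippet into a data.frame. It does not
use udpipe; the sentence and its annotations are made up for illustration
and are not udpipe output, and the names conllu, fields and tokens are
hypothetical. The only thing taken from the format itself is its layout:
comment lines starting with # and ten tab-separated fields per token.

```r
# Hand-written toy sentence in CoNLL-U layout (illustrative annotations only).
conllu <- c(
  "# text = Het regent.",
  "1\tHet\thet\tPRON\t_\t_\t2\tnsubj\t_\t_",
  "2\tregent\tregenen\tVERB\t_\t_\t0\troot\t_\t_",
  "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_"
)

# The ten CoNLL-U fields, in order (lower-cased here for use as column names).
fields <- c("id", "form", "lemma", "upos", "xpos",
            "feats", "head", "deprel", "deps", "misc")

# Drop comment lines and blank sentence separators, then split on tabs.
token_lines <- conllu[!grepl("^#", conllu) & nzchar(conllu)]
tokens <- read.delim(text = token_lines, header = FALSE, sep = "\t",
                     quote = "", comment.char = "", col.names = fields,
                     colClasses = "character", stringsAsFactors = FALSE)

# id and head are token indices; head == 0 marks the root of the sentence.
tokens$id <- as.integer(tokens$id)
tokens$head <- as.integer(tokens$head)

print(tokens[, c("id", "form", "lemma", "upos", "head", "deprel")])
```

Real treebank files also contain multiword-token and empty-node lines, which
this sketch does not handle.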

The R package provides direct access to language models trained on more
than 50 languages. The following languages are directly available:

afrikaans, ancient_greek-proiel, ancient_greek, arabic, basque, belarusian,
bulgarian, catalan, chinese, coptic, croatian, czech-cac, czech-cltt,
czech, danish, dutch-lassysmall, dutch, english-lines, english-partut,
english, estonian, finnish-ftb, finnish, french-partut, french-sequoia,
french, galician-treegal, galician, german, gothic, greek, hebrew, hindi,
hungarian, indonesian, irish, italian, japanese, kazakh, korean,
latin-ittb, latin-proiel, latin, latvian, lithuanian, norwegian-bokmaal,
norwegian-nynorsk, old_church_slavonic, persian, polish, portuguese-br,
portuguese, romanian, russian-syntagrus, russian, sanskrit, serbian,
slovak, slovenian-sst, slovenian, spanish-ancora, spanish, swedish-lines,
swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese

We hope that the package will allow other R users to build natural language
applications on top of the resulting parts of speech tags, tokens,
morphological features and dependency parsing output. In particular, we
hope to see applications which are not limited to English (like the
textrank R package or the cleanNLP package, to name a few).

Note that the package has no external software dependencies (neither Java
nor Python) and depends on only 2 R packages (Rcpp and data.table), which
makes the package easy to install on any platform.

The package is available on CRAN at
https://CRAN.R-project.org/package=udpipe and is developed at
https://github.com/bnosac/udpipe
A small Docusaurus website is available at
https://bnosac.github.io/udpipe/en

We hope you enjoy using it, and we would like to thank Milan Straka for all
his work on UDPipe, as well as everyone involved in
http://universaldependencies.org

all the best,
Jan

Jan Wijffels
Statistician
www.bnosac.be  | +32 486 611708


___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] release of version 0.2 of the textrank package

2017-12-21 Thread Jan Wijffels
Hello R users,

I'm pleased to announce the release of version 0.2 of the textrank package
on CRAN: https://CRAN.R-project.org/package=textrank

The package is a natural language processing package which allows one to
summarize text by finding
- relevant sentences
- relevant keywords

This is done by constructing a sentence network which captures how sentences
are related to one another (word overlap). Google's PageRank algorithm is
then applied to that network in order to find relevant sentences.

In a similar way, 'textrank' can also be used to extract keywords. How? A
word network is constructed by checking whether words follow one another.
The PageRank algorithm is then applied to that network to extract
relevant words. Relevant words which follow one another are then
pasted together to form keywords.
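
The sentence-ranking idea can be sketched in a few lines of base R. This is
not the code of the textrank package: it is a plain power-iteration PageRank
on a Jaccard word-overlap matrix, and the toy sentences as well as the names
sentences, overlap and pagerank are made up for illustration.

```r
# Toy input: each sentence is already reduced to its relevant words.
sentences <- list(
  s1 = c("udpipe", "annotates", "text"),
  s2 = c("textrank", "summarizes", "text"),
  s3 = c("textrank", "ranks", "sentences"),
  s4 = c("cats", "sleep")
)

# Sentence network: edge weight = Jaccard overlap of the two word sets.
n <- length(sentences)
overlap <- matrix(0, n, n, dimnames = list(names(sentences), names(sentences)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    if (i != j) {
      shared <- length(intersect(sentences[[i]], sentences[[j]]))
      total  <- length(union(sentences[[i]], sentences[[j]]))
      overlap[i, j] <- shared / total
    }
  }
}

# Plain power-iteration PageRank on a weighted adjacency matrix.
pagerank <- function(w, damping = 0.85, iterations = 100) {
  n <- nrow(w)
  out <- rowSums(w)
  p <- w / ifelse(out > 0, out, 1)   # row-normalise the edge weights
  p[out == 0, ] <- 1 / n             # unconnected sentences jump uniformly
  rank <- rep(1 / n, n)
  for (k in seq_len(iterations)) {
    rank <- (1 - damping) / n + damping * as.vector(rank %*% p)
  }
  setNames(rank, rownames(w))
}

scores <- pagerank(overlap)
print(sort(scores, decreasing = TRUE))
```

In this toy network s2 shares a word with both s1 and s3, so it ends up with
the highest score, which is the kind of sentence a summary would keep.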

The package has a vignette at
https://cran.r-project.org/web/packages/textrank/vignettes/textrank.html
and it also plays nicely with the udpipe package
https://CRAN.R-project.org/package=udpipe which is good for parts-of-speech
tagging, lemmatisation, dependency parsing and general NLP processing.

all the best,
Jan


Jan Wijffels
Statistician
www.bnosac.be


[R] [R-pkgs] RMOA data stream modelling using MOA (Massive Online Analysis)

2014-09-19 Thread Jan Wijffels
Dear R community,

For users interested in streaming classification, or in building
classification models on a whole data set with a limited amount of RAM, I
would like to announce the release of a new package called RMOA on CRAN
(http://cran.r-project.org/web/packages/RMOA).

MOA is the most popular open source framework for data stream mining and is
being developed at the University of Waikato: http://moa.cms.waikato.ac.nz.
RMOA interfaces with MOA version 2014.04 and focusses on building streaming
classification & regression models on data streams (the stream package in R
already allows clustering).

Classification models which are possible through RMOA are:

- Classification trees:
  * AdaHoeffdingOptionTree
  * ASHoeffdingTree
  * DecisionStump
  * HoeffdingAdaptiveTree
  * HoeffdingOptionTree
  * HoeffdingTree
  * LimAttHoeffdingTree
  * RandomHoeffdingTree
- Bayesian classification:
  * NaiveBayes
  * NaiveBayesMultinomial
- Active learning classification:
  * ActiveClassifier
- Ensemble (meta) classifiers:
  * Bagging
    + LeveragingBag
    + OzaBag
    + OzaBagAdwin
    + OzaBagASHT
  * Boosting
    + OCBoost
    + OzaBoost
    + OzaBoostAdwin
  * Stacking
    + LimAttClassifier
  * Other
    + AccuracyUpdatedEnsemble
    + AccuracyWeightedEnsemble
    + ADACC
    + DACC
    + OnlineAccuracyUpdatedEnsemble
    + TemporallyAugmentedClassifier
    + WeightedMajorityAlgorithm

Interfaces are implemented to model data in standard files (csv, txt,
delimited), ffdf data (from the ff package), data.frames and matrices.

Documentation of MOA directed towards RMOA users can be found at
http://jwijffels.github.io/RMOA/
Examples on the use of RMOA can be found in the documentation, on github at
https://github.com/jwijffels/RMOA or e.g. by viewing the showcase at
http://bnosac.be/index.php/blog/32-rmoa-massive-online-data-stream-classifications-with-r-a-moa

If you have any remarks or requests, don't hesitate to get in contact.

stream on,
Jan


Jan Wijffels
Statistical Data Miner
www.bnosac.be  | +32 486 611708



[R] * operator overloading setOldClass

2012-05-04 Thread Jan Wijffels
Dear R gurus,

I am trying to overload some operators in order to let these work with the
ff package by registering the S3 objects from the ff package and
overloading the operators as shown below in a reproducible example where
the * operator is overloaded.

require(ff)
setOldClass(Classes = c("ff_vector"))
setMethod(
  f = "*",
  signature = signature(e1 = c("ff_vector"), e2 = c("ff_vector")),
  definition = function(e1, e2) {
    e1[] * e2[]
  }
)
ff(1:10) * ff(1:10)

It looks like ff(1:10) * ff(1:10) is not recognising the fact that both
objects are of class ff_vector.
Can someone tell me why this is, point me to some documentation on how to
solve this, and possibly indicate what needs to be done so that the above
code gives behaviour similar to the following?
> 1:10 * 1:10
 [1]   1   4   9  16  25  36  49  64  81 100
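
One possible explanation, offered as a sketch rather than a confirmed
diagnosis of the ff case: R dispatches S4 methods on primitive operators
such as * only when at least one operand is a true S4 object, and
setOldClass() does not turn an S3 object into one. An S3 group-generic
method is one way to get the desired behaviour. The toy_vector class below
is hypothetical and merely stands in for ff_vector; nothing here uses the ff
package.

```r
# Hypothetical S3 class standing in for ff_vector (not the ff package itself):
# the data live inside the object and must be extracted before computing.
toy_vector <- function(x) structure(list(data = x), class = "toy_vector")

# S3 group generic: called for *, +, -, / and friends whenever an operand
# has class toy_vector. .Generic holds the operator name, e.g. "*".
Ops.toy_vector <- function(e1, e2) {
  v1 <- if (inherits(e1, "toy_vector")) e1$data else e1
  v2 <- if (inherits(e2, "toy_vector")) e2$data else e2
  get(.Generic)(v1, v2)
}

res <- toy_vector(1:10) * toy_vector(1:10)
print(res)
```

The result is a plain vector here, which mirrors the e1[] * e2[] definition
in the question.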


thanks in advance,
Jan

-- 
groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be  | +32 486 611708



[R] last observation carried forward +1

2011-09-30 Thread Jan Wijffels
Hi R-helpers

I'm looking for a vectorised function which does missing value replacement
like last observation carried forward (na.locf) in the zoo package, but
instead of a plain locf I would like it to add +1 for each consecutive
missing value since the last observation. See below for an example.

> require(zoo)
> x <- 5:15
> x[4:7] <- NA
> coredata(na.locf(zoo(x)))
 [1]  5  6  7  7  7  7  7 12 13 14 15
But what I need is
5  6  7  7+1  7+1+1  7+1+1+1  7+1+1+1+1 12 13 14 15
to obtain
 [1]  5  6  7  8  9 10 11 12 13 14 15
I could program this in C, but if anyone has already done this I would be
interested in seeing their vectorized solution.
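
One way this could be vectorised in base R, as a sketch: use cummax() to find
the position of the last non-missing value, then add the distance to that
position. The function name locf_plus_one is made up here, and the sketch
leaves leading missing values as NA because there is nothing to carry
forward.

```r
locf_plus_one <- function(x) {
  pos <- seq_along(x)
  # Position of the most recent non-missing value (0 if none seen yet).
  last_obs <- cummax(ifelse(is.na(x), 0L, pos))
  # Number of consecutive missing values since that observation.
  steps <- pos - last_obs
  # Leading NAs have no observation to carry forward: keep them NA.
  last_obs[last_obs == 0L] <- NA
  x[last_obs] + steps
}

x <- 5:15
x[4:7] <- NA
filled <- locf_plus_one(x)
print(filled)
```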

thanks,
Jan

-- 
groeten/kind regards,
Jan

Jan Wijffels
Statistical Data Miner
www.bnosac.be  | +32 486 611708



Re: [R] issue with ... in write.fwf in gdata

2010-11-13 Thread Jan Wijffels

Hi Greg,

That's indeed the solution. Thanks for updating the package. I'm looking
forward to seeing it on CRAN.

kind regards,
Jan

From: g...@warnes.net
Date: Fri, 12 Nov 2010 14:39:46 -0500
Subject: Re: [R] issue with ... in write.fwf in gdata
To: janwijff...@hotmail.com
CC: r-help@r-project.org

Hi Jan,

The issue isn't that the ... arguments aren't passed on.  Rather, the problem 
is that in the current implementation the ... arguments are passed to format(), 
which doesn't understand the eol argument.



The solution is to modify write.fwf() to explicitly accept all of the
appropriate arguments for write.table() and to only pass the ... arguments
to format() and format.info().



I've just modified gdata to make this change, and have submitted the new 
version to CRAN as gdata version 2.8.1.
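
The pattern described above can be illustrated with a small base R sketch.
This is not gdata's write.fwf() code: write_fixed is a hypothetical helper
that takes eol as a named argument of its own, so that only the remaining
... arguments are forwarded to format().

```r
# Hypothetical helper, NOT gdata's write.fwf(): eol is a named argument,
# so it never ends up in the ... that is forwarded to format().
write_fixed <- function(x, path, sep = " ", eol = "\n", ...) {
  cols <- lapply(x, format, ...)            # ... goes to format() only
  lines <- do.call(paste, c(unname(cols), sep = sep))
  con <- file(path, open = "wb")            # binary mode: write eol as given
  on.exit(close(con))
  writeLines(lines, con, sep = eol)
  invisible(path)
}

saveto <- tempfile(fileext = ".txt")
write_fixed(data.frame(a = 1:3, b = c("x", "y", "z")), path = saveto,
            eol = "\r\n")

# Read the file back byte for byte to inspect the line endings.
written <- readChar(saveto, nchars = file.info(saveto)$size, useBytes = TRUE)
print(written)
```

Had eol been left in ..., format() would have received it as well, which is
the situation reported in this thread.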

-Greg



[R] issue with ... in write.fwf in gdata

2010-11-12 Thread Jan Wijffels

Dear R-list,

This is just a message to inform you that there is an issue with write.fwf
in the gdata library (from version 2.5.0 on). It does not seem to accept
further arguments to write.table, like eol, as the help file indicates: it
stops when executing tmp <- lapply(x, format.info, ...).
Great package though - I use it a lot except for this function :)
See example below.

> require(gdata)
> saveto <- tempfile(pattern = "test.txt", tmpdir = tempdir())
> write.fwf(x = data.frame(a=1:length(LETTERS), b=LETTERS), file=saveto, eol="\r\n")
Error in FUN(X[[1L]], ...) : unused argument(s) (eol = "\r\n")
> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base 

other attached packages:
[1] gdata_2.8.0

loaded via a namespace (and not attached):
[1] gtools_2.6.2


kind regards,
Jan

  


[R] glibc detected *** /usr/lib64/R/bin/exec/R: double free or corruption ???? tm package

2008-01-07 Thread Jan Wijffels

Hi,

I have a collection of .txt documents in my working folder on which I want
to do some text mining. If I run TextDocCol from the tm package, R crashes
with a memory error. Does anyone have any idea whether this is related to R
itself or to the tm package?
Below you can find what is happening.

> setwd("/home/jan/Work/2008/Profacts/textmining/tryouts/workfolder")
> require(tm)
Loading required package: tm
Loading required package: filehash
Simple key-value database (1.0-1 2007-08-13)
Loading required package: Matrix
Loading required package: lattice
Loading required package: Snowball
Loading required package: RWeka
Loading required package: rJava
Loading required package: grid
Loading required package: XML
> sessionInfo()
R version 2.6.1 (2007-11-26)
x86_64-redhat-linux-gnu

locale:
LC_CTYPE=nl_BE.UTF-8;LC_NUMERIC=C;LC_TIME=nl_BE.UTF-8;LC_COLLATE=nl_BE.UTF-8;LC_MONETARY=nl_BE.UTF-8;LC_MESSAGES=nl_BE.UTF-8;LC_PAPER=nl_BE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=nl_BE.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] grid  stats graphics  grDevices utils datasets  methods
[8] base

other attached packages:
[1] tm_0.2-3.7        XML_1.93-2        Snowball_0.0-3    RWeka_0.3-9
[5] rJava_0.5-1       Matrix_0.999375-3 lattice_0.17-2    filehash_1.0-1

loaded via a namespace (and not attached):
[1] proxy_0.3   rcompgen_0.1-17
> R.Version()
$platform
[1] "x86_64-redhat-linux-gnu"

$arch
[1] "x86_64"

$os
[1] "linux-gnu"

$system
[1] "x86_64, linux-gnu"

$status
[1] ""

$major
[1] "2"

$minor
[1] "6.1"

$year
[1] "2007"

$month
[1] "11"

$day
[1] "26"

$`svn rev`
[1] "43537"

$language
[1] "R"

$version.string
[1] "R version 2.6.1 (2007-11-26)"

> test <- TextDocCol(DirSource(getwd()), readerControl = list(reader = readPlain, load = TRUE, language = "nl_BE"))
*** glibc detected *** /usr/lib64/R/bin/exec/R: double free or corruption (!prev): 0x22e20680 ***
======= Backtrace: =========
/lib64/libc.so.6[0x359946f4f4]
/lib64/libc.so.6(cfree+0x8c)[0x3599472b1c]
/usr/lib64/R/lib/libR.so[0x305b670a3d]
/usr/lib64/R/lib/libR.so[0x305b6f875e]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x291)[0x305b6c6c01]
/usr/lib64/R/lib/libR.so(Rf_eval+0x303)[0x305b6c3873]
/usr/lib64/R/lib/libR.so[0x305b6c4648]
/usr/lib64/R/lib/libR.so(Rf_eval+0x502)[0x305b6c3a72]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c6504]
/usr/lib64/R/lib/libR.so(R_execMethod+0x239)[0x305b6c6889]
/usr/lib64/R/library/methods/libs/methods.so[0x2e4367e9]
/usr/lib64/R/lib/libR.so[0x305b6f9cf7]
/usr/lib64/R/lib/libR.so(Rf_eval+0x55d)[0x305b6c3acd]
/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x291)[0x305b6c6c01]
/usr/lib64/R/lib/libR.so(Rf_eval+0x303)[0x305b6c3873]
/usr/lib64/R/lib/libR.so[0x305b6c638e]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c5009]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c4ce2]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so[0x305b6c6504]
/usr/lib64/R/lib/libR.so(R_execMethod+0x239)[0x305b6c6889]
/usr/lib64/R/library/methods/libs/methods.so[0x2e4367e9]
/usr/lib64/R/lib/libR.so[0x305b6f9cf7]
/usr/lib64/R/lib/libR.so(Rf_eval+0x55d)[0x305b6c3acd]
/usr/lib64/R/lib/libR.so(Rf_applyClosure+0x291)[0x305b6c6c01]
/usr/lib64/R/lib/libR.so(Rf_eval+0x303)[0x305b6c3873]
/usr/lib64/R/lib/libR.so[0x305b6c638e]
/usr/lib64/R/lib/libR.so(Rf_eval+0x436)[0x305b6c39a6]
/usr/lib64/R/lib/libR.so(Rf_ReplIteration+0x183)[0x305b6e7893]
/usr/lib64/R/lib/libR.so(run_Rmainloop+0xc2)[0x305b6e7bc2]
/usr/lib64/R/bin/exec/R(main+0x1b)[0x40080b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x359941d8a4]
/usr/lib64/R/bin/exec/R[0x400709]
======= Memory map: ========
0040-00401000 r-xp  fd:00 12868864   
/usr/lib64/R/bin/exec/R
0060-00602000 rw-p  fd:00 12868864   
/usr/lib64/R/bin/exec/R
1de8b000-2305b000 rw-p 1de8b000 00:00 0
4000-40001000 ---p 4000 00:00 0
40001000-40101000 rwxp 40001000 00:00 0
40101000-40102000 ---p 40101000 00:00 0
40102000-40202000 rwxp 40102000 00:00 0
40202000-40203000 ---p 40202000 00:00 0
40203000-40303000 rwxp 40203000 00:00 0
40303000-40306000 ---p 40303000 00:00 0
40306000-40404000 rwxp 40306000 00:00 0
40404000-40407000 ---p 40404000 00:00 0
40407000-40505000 rwxp 40407000 00:00 0
40505000-40508000 ---p 40505000 00:00 0
40508000-40606000 rwxp 40508000 00:00 0
40606000-40609000 ---p 40606000 00:00 0
40609000-40707000 rwxp 40609000 00:00 0
40707000-4070a000 ---p 40707000 00:00 0
4070a000-40808000 rwxp 4070a000 00:00 0
40808000-4080b000 ---p 40808000 00:00 0
4080b000-40909000 rwxp 4080b000 00:00 0
40909000-4090a000 

[R] guidelines for the use of the R logo

2008-01-02 Thread Jan Wijffels

Hi,

I was wondering if there are any guidelines for the use of the R logo on 
websites which are for commercial use? Similar to 
http://www.python.org/community/logos/ for Python and to 
http://www.postgresql.org/community/propaganda for PostgreSQL perhaps?


thanks,
Jan
