Hello Jerad,

> It was suggested I contact you for possible help with this issue. Well,
> as you can see for the emails below, that is what I was told at R-help.
> Any insight to my lsa problems (also listed below) would be of great
> help.

From what I see, the problem probably does lie within the text files: for performance reasons, it was not possible to include any "check" routines that exclude a file if it contains no words (or only words below a docFrequency) and thus produces an empty column vector.

I am pretty sure that you do not want to use docFrequency with a value like 50 (it would mean that a term in a document is only included if it appears more than 50 times in *that* document). I will send you the alpha release of the updated lsa package in a separate message. It also includes a parameter called minGlobFreq, which filters out terms that appear fewer than x times in the whole document collection. I guess that is what you were looking for.

Regarding the sanitizing: if you set minDocFreq to 1 and minWordLength to 1, you should not get an error with your document collection, as you are then basically taking everything (even a single character appearing only once). That is probably not very problematic, as the LSA step will group these low-frequency terms in a lower-order factor anyway. Of course, you will still get an error if you use documents that are completely empty, so delete all 0-byte documents beforehand.

I am thinking about what to do with this sanitizing part. It is not a good idea to integrate it into the textmatrix method: it would slow things down tremendously. So what about this idea: does it make sense to provide a collection of sanitizing methods that help to select the files you want to work with (copy them to a different directory, or just return a list with the filenames of the ones that are "good")? What should we do with other sanitizing options (deleting URLs from texts, deleting short words, etc.)?

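To make the "return a list with the filenames of the good ones" idea more concrete, here is a minimal sketch in base R. The function name good_files and its tokenization rule (split on anything that is not a letter or digit) are assumptions made up for illustration; they are not part of the lsa package and may not match how textmatrix splits words.

```r
# Hypothetical helper, not part of lsa: return the paths of all files in
# `dir` that contain at least one word of `minWordLength` or more characters.
good_files <- function(dir, minWordLength = 1) {
  paths <- list.files(dir, full.names = TRUE)
  paths <- paths[!file.info(paths)$isdir]        # ignore sub-directories
  keep <- vapply(paths, function(p) {
    if (file.info(p)$size == 0) return(FALSE)    # skip 0-byte files
    txt <- paste(readLines(p, warn = FALSE), collapse = " ")
    # Assumed tokenization: split on runs of non-alphanumeric characters.
    words <- unlist(strsplit(tolower(txt), "[^[:alnum:]]+"))
    any(nchar(words) >= minWordLength)
  }, logical(1))
  unname(paths[keep])
}
```

The returned paths could then be copied into a clean directory with file.copy(good_files("corpus"), "corpus_clean"), where both directory names are placeholders, so that the check runs once up front and not inside textmatrix.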
Hope I could be of help,

Best,
Fridolin

-- 
Fridolin Wild, Institute for Information Systems and New Media,
Vienna University of Economics and Business Administration (WUW),
Augasse 2-6, A-1090 Wien, Austria
fon +43-1-31336-4488, fax +43-1-31336-746

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.