I suspect that this was not a bug, but rather that the words were singletons. I'm a little surprised, given that these were financial documents (is "check" really a singleton?), but it looks like that's what happened.
When tomorrow's demo for the client is over, I will run seq2sparse with the floor set to 1 instead of 2 and see if that changes things.

On 05/29/2015 03:13 PM, Suneel Marthi wrote:
> Allen, could u please file a JIRA for this?
>
> On Fri, May 29, 2015 at 8:58 AM, Allen McIntosh <[email protected]>
> wrote:
>
>> This shows up with Mahout 0.10.0 (the distribution archive) and Hadoop
>> 2.2.0
>>
>> When I run seq2sparse on a document containing the following tokens:
>>
>> cash cash equival cash cash equival consist highli liquid instrument
>> commerci paper time deposit other monei market instrument which origin
>> matur three month less aggreg cash balanc bank reclassifi neg balanc
>> consist mainli unclear check account payabl neg balanc reclassifi
>> account payabl decemb
>>
>> the tokens mainli, check and unclear are dropped on the floor (they do
>> not appear in the dictionary file). The issue persists if I change the
>> analyzer to SimpleAnalyzer (-a
>> org.apache.lucene.analysis.core.SimpleAnalyzer). I can understand an
>> English analyzer doing something like this, but it seems a little
>> strange that it would happen with SimpleAnalyzer. (I wonder if it is
>> coincidence that these tokens appear consecutively in the input.)
>>
>> What I am trying to do: The standard analyzers don't do enough, and I
>> have no access to the client's cluster to preload a custom analyzer.
>> Processing the text before stuffing it into the initial sequence file
>> seemed to be the cleanest alternative, since there doesn't seem to be
>> any way to add a custom jar when using a stock Mahout app.
>>
>> Why dropped or mangled tokens matter, other than as missing information:
>> Ultimately what I need to do is calculate topic weights for an
>> arbitrary chunk of text. (See next post.) If I can't get the tokens
>> right, I don't think I can do this.
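If the singleton hypothesis is right, the dropped tokens are just the ones falling below a count floor. A minimal Python sketch of that behavior, assuming the floor in question is seq2sparse's -s/--minSupport option (default 2) and counting only within this one document; in the real corpus other documents would presumably supply extra occurrences of most terms, which is why only three tokens vanished there:

```python
from collections import Counter

def build_dictionary(tokens, min_support=2):
    """Mimic a min-support floor: keep only tokens whose count
    meets the threshold, as --minSupport (default 2) would."""
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if c >= min_support)

doc = ("cash cash equival cash cash equival consist highli liquid "
       "instrument commerci paper time deposit other monei market "
       "instrument which origin matur three month less aggreg cash "
       "balanc bank reclassifi neg balanc consist mainli unclear check "
       "account payabl neg balanc reclassifi account payabl decemb").split()

vocab = build_dictionary(doc, min_support=2)
print("check" in vocab)   # singleton in this document: dropped
print("cash" in vocab)    # appears five times: kept
```

Rerunning with min_support=1 (the planned -s 1 test) keeps every token, which is the outcome that would confirm or refute the hypothesis.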
