I suspect that this was not a bug, but rather that the words were
singletons in the corpus.  I'm a little surprised, given that these were
financial documents ("check" a singleton?), but it looks like that's what
happened.

Once tomorrow's demo for the client is over, I will rerun seq2sparse with
the floor (minimum support, -s) set to 1 instead of 2 and see if that
changes things.
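
To illustrate the singleton theory, here is a minimal Python sketch of
minimum-support pruning.  It is not Mahout's code: the function name
build_dictionary and its min_support parameter are hypothetical, and it
assumes that a term is left out of the dictionary whenever its total count
over the whole corpus is below the floor.

```python
from collections import Counter


def build_dictionary(docs, min_support=2):
    """Assign an integer id to every term whose total count over all
    documents is at least min_support; rarer terms are dropped.

    Hypothetical helper for illustration only, not part of Mahout.
    """
    counts = Counter(token for doc in docs for token in doc.split())
    kept = sorted(term for term, n in counts.items() if n >= min_support)
    return {term: idx for idx, term in enumerate(kept)}


# The pre-tokenized document from the original report.
doc = (
    "cash cash equival cash cash equival consist highli liquid instrument "
    "commerci paper time deposit other monei market instrument which origin "
    "matur three month less aggreg cash balanc bank reclassifi neg balanc "
    "consist mainli unclear check account payabl neg balanc reclassifi "
    "account payabl decemb"
)

# Compare a floor of 2 with a floor of 1 for the three tokens in question.
for floor in (2, 1):
    dictionary = build_dictionary([doc], min_support=floor)
    missing = [t for t in ("mainli", "check", "unclear") if t not in dictionary]
    print(f"min_support={floor}: missing {missing}")
```

With this one document standing in for the whole corpus, a floor of 2 drops
mainli, check and unclear, and a floor of 1 keeps them.  It also drops every
other token that occurs only once here (highli, liquid, ...), so in the real
corpus those presumably occur in other documents as well.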

On 05/29/2015 03:13 PM, Suneel Marthi wrote:
> Allen, could you please file a JIRA for this?
> 
> On Fri, May 29, 2015 at 8:58 AM, Allen McIntosh <[email protected]>
> wrote:
> 
>> This shows up with Mahout 0.10.0 (the distribution archive) and Hadoop
>> 2.2.0.
>>
>> When I run seq2sparse on a document containing the following tokens:
>>
>> cash cash equival cash cash equival consist highli liquid instrument
>> commerci paper time deposit other monei market instrument which origin
>> matur three month less aggreg cash balanc bank reclassifi neg balanc
>> consist mainli unclear check account payabl neg balanc reclassifi
>> account payabl decemb
>>
>> the tokens mainli, check and unclear are dropped on the floor (they do
>> not appear in the dictionary file).  The issue persists if I change the
>> analyzer to SimpleAnalyzer (-a
>> org.apache.lucene.analysis.core.SimpleAnalyzer).  I can understand an
>> English analyzer doing something like this, but it seems a little
>> strange that it would happen with SimpleAnalyzer.  (I wonder if it is
>> coincidence that these tokens appear consecutively in the input.)
>>
>> What I am trying to do:  The standard analyzers don't do enough, and I
>> have no access to the client's cluster to preload a custom analyzer.
>> Processing the text before stuffing it into the initial sequence file
>> seemed to be the cleanest alternative, since there doesn't seem to be
>> any way to add a custom jar when using a stock Mahout app.
>>
>> Why dropped or mangled tokens matter, other than as missing information:
>>  Ultimately what I need to do is calculate topic weights for an
>> arbitrary chunk of text.  (See next post.)  If I can't get the tokens
>> right, I don't think I can do this.
>>
> 
