After thinking more about this, it seems it would be even better to fold 
some of these ideas into the DocumentProcessor, though I think it is useful 
not to have to go to SeqFile first.  It might also be worth borrowing Solr's 
FilterFactory stuff for configuring the Lucene analyzers.  I'm not sure how 
easy that would be to do.
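
As a rough illustration of the capturing-group idea quoted below (map each 
regex group to a label, then hand the labeled values to a pluggable output), 
something like the following Java sketch might work.  All of the names here 
(GroupLabeler, RecordSink, extract, process) are made up for illustration; 
they are not taken from the MAHOUT-403 patch or from Mahout itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: maps regex capturing groups to labels and passes
// each labeled record to a pluggable sink. Not part of Mahout.
class GroupLabeler {

    // Hypothetical output hook; one implementation per target format
    // (e.g. input for FPG or for a classifier).
    interface RecordSink {
        void accept(Map<String, String> record);
    }

    private final Pattern pattern;
    private final Map<Integer, String> labels; // group index -> label

    GroupLabeler(String regex, Map<Integer, String> labels) {
        this.pattern = Pattern.compile(regex);
        this.labels = new LinkedHashMap<>(labels);
    }

    // Returns label -> captured value for the first match in the line,
    // or an empty map if the pattern does not match.
    Map<String, String> extract(String line) {
        Map<String, String> record = new LinkedHashMap<>();
        Matcher m = pattern.matcher(line);
        if (m.find()) {
            for (Map.Entry<Integer, String> e : labels.entrySet()) {
                int group = e.getKey();
                // Skip indexes the pattern does not define and groups
                // that did not participate in the match.
                if (group >= 0 && group <= m.groupCount() && m.group(group) != null) {
                    record.put(e.getValue(), m.group(group));
                }
            }
        }
        return record;
    }

    // Feeds every matching line to the sink; non-matching lines are skipped.
    void process(Iterable<String> lines, RecordSink sink) {
        for (String line : lines) {
            Map<String, String> record = extract(line);
            if (!record.isEmpty()) {
                sink.accept(record);
            }
        }
    }
}
```

Swapping the output format would then only mean supplying a different 
RecordSink, and a Groovy closure (or whatever) could presumably stand in for 
the same single-method interface.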

-Grant
On May 28, 2010, at 4:53 PM, Grant Ingersoll wrote:

> OK, I posted a draft patch of this.  Would appreciate a review.  I think one 
> could even slip Groovy (or whatever) into it by properly implementing a 
> single interface.  Feedback welcome on MAHOUT-403.
> 
> 
> On May 28, 2010, at 10:05 AM, Grant Ingersoll wrote:
> 
>> https://issues.apache.org/jira/browse/MAHOUT-403
>> 
>> On May 28, 2010, at 8:58 AM, Grant Ingersoll wrote:
>> 
>>> 
>>> On May 27, 2010, at 7:06 PM, Ted Dunning wrote:
>>> 
>>>> That should be a small change (and helpful for a lot of mining tasks).
>>>> 
>>>> But once you jump on that slippery slope, why not allow a tiny Groovy
>>>> closure to be injected?  Or to pass in an object that will extract a map of
>>>> values from each line?
>>> 
>>> Expanding on this, I think we could do the following:
>>> 
>>> Map capturing groups to labels, then have pluggable output so that one 
>>> could easily output to FPG, Classifiers, etc.
>>> 
>>> I'm not all that familiar w/ Groovy, so I'll put up my variation and then 
>>> people can expand on it.
>>> 
>>> -Grant
>> 
>> 
> 
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search
