Re: [Free Text] Field Tokenizing

Erick Erickson Thu, 09 Jun 2011 09:27:56 -0700

The problem here is that none of the built-in filters or tokenizers
has a prayer of recognizing what #you# think are phrases, since what
counts as a phrase is unique to your situation.


If you have a list of phrases you care about, you could substitute a
single token for each of them...
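
As a rough illustration of the single-token idea, here is a minimal
Python sketch that could run over the text before it is indexed. The
phrase list and the function name collapse_phrases are made up for the
example, and joining words with underscores is just one possible
convention, not something Solr requires.

```python
import re

# Hypothetical phrase list; in practice it would come from your own data.
PHRASES = ["Joe's coffee shop", "New York"]

def collapse_phrases(text, phrases=PHRASES):
    """Replace each known phrase with one underscore-joined token."""
    # Longest phrases first, so a longer phrase wins over one it contains.
    for phrase in sorted(phrases, key=len, reverse=True):
        token = re.sub(r"\s+", "_", phrase)
        # Lookarounds keep us from matching inside a larger word.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", re.IGNORECASE
        )
        text = pattern.sub(lambda m: token, text)
    return text

print(collapse_phrases("I met her at Joe's coffee shop in New York."))
```

Any tokenizer that leaves underscores alone (a whitespace tokenizer,
for instance) would then see each phrase as one token.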

But the overriding question is: what determines a phrase you're
interested in? Is it a list, or is there some heuristic you want to apply?

Or could you just recognize them at query time and make them into a
literal phrase (i.e. with quotation marks)?
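
A minimal sketch of that query-time approach, in Python, assuming you
have a list of known phrases that do not overlap one another. The
names KNOWN_PHRASES and quote_known_phrases are hypothetical; the
output is just a query string you would hand to Solr as usual.

```python
import re

# Hypothetical list of phrases to treat as literal phrases.
KNOWN_PHRASES = ["Joe's coffee shop"]

def quote_known_phrases(query, phrases=KNOWN_PHRASES):
    """Wrap each known phrase found in the query in double quotes."""
    for phrase in phrases:
        # Skip matches that are inside a larger word or already quoted.
        pattern = re.compile(
            r'(?<![\w"])' + re.escape(phrase) + r'(?![\w"])', re.IGNORECASE
        )
        query = pattern.sub(lambda m: '"' + m.group(0) + '"', query)
    return query

print(quote_known_phrases("reviews of Joe's coffee shop downtown"))
```

This leaves the index alone, so it only helps if the field is
tokenized in a way that lets phrase queries match.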

Best
Erick

On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada
<estrada.adam.gro...@gmail.com> wrote:
> All,
>
> I am at a bit of a loss here so any help would be greatly appreciated. I am
> using the DIH to grab data from a DB. The field that I am most interested in
> has anywhere from 1 word to several paragraphs worth of free text. What I
> would really like to do is pull out phrases like "Joe's coffee shop" rather
> than the 3 individual words. I have tried the KeywordTokenizerFactory, which
> mostly does what I want, but it is not actually tokenizing anything, so it
> is not creating the tokens that I need for further analysis in apps like
> Mahout.
>
> We can play with the combination of tokenizers and filters all day long and
> see what the results are after a quick reindex. I typically just view them
> in Solritas as facets, which may be the problem for me too. Does anyone have
> an example fieldType they can share with me that shows how to
> extract phrases, if they are there, from the data I described earlier? Am I
> even going about this the right way? I am using today's trunk build of Solr
> and here is what I have munged together this morning.
>
> <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
> autoGeneratePhraseQueries="true">
>  <analyzer >
>  <charFilter class="solr.HTMLStripCharFilterFactory"/>
>  <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>  <tokenizer class="solr.KeywordTokenizerFactory"/>
>  <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true"/>
>  <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
> outputUnigrams="true" outputUnigramIfNoNgram="false"/>
>  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>  <filter class="solr.EnglishPossessiveFilterFactory"/>
>  <filter class="solr.EnglishMinimalStemFilterFactory"/>
>  <filter class="solr.ASCIIFoldingFilterFactory"/>
>  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>  <filter class="solr.TrimFilterFactory"/>
>  </analyzer>
> </fieldType>
>
> Thanks,
> Adam
>
