The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what *you* think are phrases, since those will be unique to your situation.
If you have a list of phrases you care about, you could substitute a single token for each of them. But the overriding question is: what determines a phrase you're interested in? Is it a list, or is there some heuristic you want to apply? Or could you just recognize them at query time and make them into literal phrases (i.e., with quotation marks)?

Best
Erick

On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
> All,
>
> I am at a bit of a loss here so any help would be greatly appreciated. I am
> using the DIH to grab data from a DB. The field that I am most interested in
> has anywhere from 1 word to several paragraphs worth of free text. What I
> would really like to do is pull out phrases like "Joe's coffee shop" rather
> than the 3 individual words. I have tried the KeywordTokenizerFactory and
> that does seem to do what I want it to do but it is not actually tokenizing
> anything so it does what I want it to for the most part but it's not
> creating the tokens that I need for further analysis in apps like Mahout.
>
> We can play with the combination of tokenizers and filters all day long and
> see what the results are after a quick reindex. I typically just view them
> in Solritas as facets which may be the problem for me too. Does anyone have
> an example fieldType they can share with me that shows how to
> extract phrases if they are there from the data I described earlier? Am I
> even going about this the right way? I am using today's trunk build of Solr
> and here is what I have munged together this morning.
>
> <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
>     autoGeneratePhraseQueries="true">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <charFilter class="solr.MappingCharFilterFactory"
>         mapping="mapping-ISOLatin1Accent.txt"/>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>         words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
>         outputUnigrams="true" outputUnigramIfNoNgram="false"/>
>     <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
>     <filter class="solr.EnglishPossessiveFilterFactory"/>
>     <filter class="solr.EnglishMinimalStemFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>     <filter class="solr.TrimFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Thanks,
> Adam
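
To make the two ideas in the reply concrete (substituting a single token for each known phrase before indexing, and quoting known phrases at query time), here is a rough Python sketch. It runs outside Solr as a preprocessing step and is not a Solr API. The phrase list and the function names join_phrases and quote_phrases are made up for illustration, and it assumes you already have a list of the phrases you care about.

```python
import re

# Hypothetical phrase list; in practice this would come from your own data.
PHRASES = ["Joe's coffee shop", "coffee shop"]


def _phrase_pattern(phrases):
    # Longest phrases first so "Joe's coffee shop" wins over "coffee shop".
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in ordered)
    # The lookarounds keep a phrase from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)


def join_phrases(text, phrases=PHRASES):
    """Replace each known phrase with one underscore-joined token."""
    if not phrases:
        return text
    pattern = _phrase_pattern(phrases)
    return pattern.sub(lambda m: "_".join(m.group(0).split()), text)


def quote_phrases(query, phrases=PHRASES):
    """Wrap each known phrase in double quotes for a literal phrase query."""
    if not phrases:
        return query
    pattern = _phrase_pattern(phrases)
    return pattern.sub(lambda m: '"' + m.group(0) + '"', query)


if __name__ == "__main__":
    print(join_phrases("Meet me at Joe's coffee shop today"))
    print(quote_phrases("best coffee shop downtown"))
```

Whether an underscore-joined token survives analysis intact depends on the tokenizer and filters in your fieldType, so it is worth checking the output on the analysis page before reindexing everything.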