[Free Text] Field Tokenizing
All,

I am at a bit of a loss here, so any help would be greatly appreciated. I am using the DIH to grab data from a DB. The field that I am most interested in has anywhere from 1 word to several paragraphs' worth of free text. What I would really like to do is pull out phrases like "Joe's coffee shop" rather than the 3 individual words.

I have tried the KeywordTokenizerFactory, and for the most part it seems to do what I want, but it is not actually tokenizing anything, so it's not creating the tokens that I need for further analysis in apps like Mahout. We can play with the combination of tokenizers and filters all day long and see what the results are after a quick reindex. I typically just view them in Solritas as facets, which may be part of the problem for me too.

Does anyone have an example fieldType they can share with me that shows how to extract phrases, if they are there, from the data I described above? Am I even going about this the right way? I am using today's trunk build of Solr, and here is what I have munged together this morning.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks,
Adam
Re: [Free Text] Field Tokenizing
The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what *you* think are phrases, since that will be unique to your situation. If you have a list of phrases you care about, you could substitute a single token for each of them...

But the overriding question is: what determines a phrase you're interested in? Is it a list, or is there some heuristic you want to apply?

Or could you just recognize them at query time and make them into a literal phrase (i.e. with quotation marks)?

Best
Erick

On Thu, Jun 9, 2011 at 10:56 AM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
> All, I am at a bit of a loss here so any help would be greatly appreciated. [...]
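The idea of substituting a single token for each known phrase can be sketched with a synonym filter applied at index time. This is only an illustration under stated assumptions: the field type name text_phrases and the file phrases.txt are hypothetical, and using SynonymFilterFactory for this is one possible reading of the suggestion, not something the thread prescribes.

```xml
<!-- Hypothetical field type: collapses listed phrases into single tokens -->
<fieldType name="text_phrases" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split the text into words first so multi-word entries can be matched -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- phrases.txt (hypothetical) would hold one mapping per line, for example:
         joe's coffee shop => joes_coffee_shop -->
    <filter class="solr.SynonymFilterFactory" synonyms="phrases.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```

Because the whitespace tokenizer leaves punctuation attached to words, a phrase followed by a comma or period would likely not match its entry in this sketch. The admin/analysis page is the place to check what actually comes out for real data.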
Re: [Free Text] Field Tokenizing
Erick,

I totally understand that, BUT the keyword tokenizer factory does a really good job of extracting phrases (or what look like phrases) from my data. I don't know exactly why, but it does. I am going to continue working through it to see if I can't figure it out ;-)

Adam

On Thu, Jun 9, 2011 at 12:26 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> The problem here is that none of the built-in filters or tokenizers have a prayer of recognizing what *you* think are phrases, since that will be unique to your situation. [...]
Re: [Free Text] Field Tokenizing
The KeywordTokenizer doesn't do anything to break up the input stream; it just treats the whole input to the field as a single token. So I don't think you'll be able to extract anything starting with that tokenizer.

Look at the admin/analysis page to see a step-by-step breakdown of what your analyzer chain does. Be sure to check the "verbose" checkbox.

Best
Erick

On Thu, Jun 9, 2011 at 12:35 PM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
> Erick, I totally understand that BUT the keyword tokenizer factory does a really good job extracting phrases (or what look like phrases) from my data. [...]
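Since the KeywordTokenizer hands the whole field value downstream as one token, a ShingleFilter placed after it has no adjacent words to combine. That is presumably also why short field values look like phrases when viewed as facets. A minimal sketch of the alternative, assuming a hypothetical field type name text_shingles and a deliberately reduced filter chain, uses a tokenizer that actually splits the text:

```xml
<!-- Hypothetical field type: single words plus shingles of 2 to 4 words -->
<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- Breaks the input into word tokens, unlike KeywordTokenizerFactory -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits each word and every run of 2 to 4 adjacent words -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

Shingles are every run of adjacent words, not only meaningful phrases, so something downstream (a phrase list, frequency cutoffs, or the analysis done in Mahout) would still have to decide which ones matter.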