[Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
All,

I am at a bit of a loss here, so any help would be greatly appreciated. I am
using the DIH (DataImportHandler) to grab data from a DB. The field that I am
most interested in has anywhere from one word to several paragraphs' worth of
free text. What I would really like to do is pull out phrases like "Joe's
coffee shop" rather than the three individual words. I have tried the
KeywordTokenizerFactory, and for the most part it seems to do what I want, but
it is not actually tokenizing anything, so it's not creating the tokens that I
need for further analysis in apps like Mahout.

We can play with the combination of tokenizers and filters all day long and
see what the results are after a quick reindex. I typically just view them
in Solritas as facets, which may be part of the problem too. Does anyone have
an example fieldType they can share that shows how to extract phrases, if they
are there, from the data I described above? Am I even going about this the
right way? I am using today's trunk build of Solr, and here is what I have
munged together this morning.

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="true">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="4"
            outputUnigrams="true" outputUnigramIfNoNgram="false"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Thanks,
Adam


Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The problem here is that none of the built-in filters or tokenizers have a
prayer of recognizing what *you* think are phrases, since that will be unique
to your situation.

If you have a list of phrases you care about, you could substitute a single
token for each of those phrases...
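
One way to sketch the "substitute a single token" idea is with a synonym
filter applied at index time. This is only an illustration under assumptions:
the field type name text_phrases, the file name phrases.txt, and the
replacement token joes_coffee_shop are all hypothetical, and the phrase list
has to be maintained by hand.

```xml
<!-- Hypothetical field type: maps known multi-word phrases to one token. -->
<fieldType name="text_phrases" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- phrases.txt (hypothetical file) would contain lines such as:
           joe's coffee shop => joes_coffee_shop
         so the three matching words are replaced by a single token. -->
    <filter class="solr.SynonymFilterFactory" synonyms="phrases.txt"
            ignoreCase="true" expand="false"/>
  </analyzer>
</fieldType>
```

With a mapping like this, text containing "Joe's coffee shop" should be
indexed with joes_coffee_shop in place of the three words. Punctuation
attached to a word (for example "shop.") would prevent the match with this
whitespace tokenizer, so real data may need a different tokenizer or extra
cleanup.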

But the overriding question is: what determines a phrase you're interested
in? Is it a list, or is there some heuristic you want to apply?

Or could you just recognize them at query time and make them into a literal
phrase (i.e., with quotation marks)?

Best
Erick


Re: [Free Text] Field Tokenizing

2011-06-09 Thread Adam Estrada
Erick,

I totally understand that, BUT the keyword tokenizer factory does a really
good job of extracting phrases (or what look like phrases) from my data. I
don't know exactly why, but it does. I am going to keep working through it
to see if I can figure it out ;-)

Adam


Re: [Free Text] Field Tokenizing

2011-06-09 Thread Erick Erickson
The KeywordTokenizer doesn't do anything to break up the input stream;
it just treats the whole input to the field as a single token. So I don't think
you'll be able to extract anything starting with that tokenizer.
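
To illustrate why the tokenizer matters: ShingleFilterFactory builds word
n-grams out of the tokens it receives, so when KeywordTokenizerFactory hands it
one token per field value it has nothing to combine. Below is a minimal
sketch, assuming a word-splitting tokenizer is acceptable for this data. The
field type name text_shingles is hypothetical, and the shingles are every run
of adjacent words, not only the "real" phrases.

```xml
<!-- Hypothetical field type: word tokens first, then 2- to 4-word shingles. -->
<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits the text into individual words -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits adjacent-word combinations such as "coffee shop";
         outputUnigrams="false" drops the single words -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="4" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

Note that with outputUnigrams="false" a one-word field value yields no tokens
at all, so a field that ranges from one word to several paragraphs may need
the unigrams kept, or a separate field for single words.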

Look at the admin/analysis page to see a step-by-step breakdown of what
your analyzer chain does. Be sure to check the verbose checkbox.

Best
Erick
