Give it a try. The first time I tried ngramming I was surprised: the actual increase in my index size was much less than I feared.
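For reference, a minimal sketch of what such an n-gram analysis chain might look like (the field-type name and gram sizes here are illustrative, not taken from this thread):

```xml
<!-- Sketch of a substring-search field type using n-grams at index time.
     minGramSize should match the shortest substring you need to find;
     maxGramSize is the main driver of index growth. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gramming at query time: the query term is matched
         against the grams produced at index time. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Keeping the gram-size range tight is usually what keeps the index growth smaller than feared.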
Best
Erick

On Wed, Feb 8, 2012 at 11:41 AM, Robert Brown <r...@intelcompute.com> wrote:
> Attempting to reproduce legacy behaviour (I know!) of simple SQL
> substring searching, with and without phrases.
>
> I feel simply NGram'ing 4m CVs may be pushing it?
>
> ---
> IntelCompute
> Web Design & Local Online Marketing
> http://www.intelcompute.com
>
> On Wed, 8 Feb 2012 11:27:24 -0500, Erick Erickson
> <erickerick...@gmail.com> wrote:
>> You'll probably have to index them in separate fields to
>> get what you want. The question is always whether it's
>> worth it: is the use case really well served by having a
>> variant that keeps dots and the like? But that's always more
>> a question for your product manager...
>>
>> Best
>> Erick
>>
>> On Wed, Feb 8, 2012 at 9:23 AM, Robert Brown <r...@intelcompute.com> wrote:
>>> Thanks Erick,
>>>
>>> I didn't get confused with multiple tokens vs multiValued :)
>>>
>>> Before I go ahead and re-index 4m docs (and believe me, I'm using the
>>> analysis page like a madman!), what do I need to configure to have the
>>> following indexed both with and without the dots?
>>>
>>> .net
>>> sales manager.
>>> £12.50
>>>
>>> Currently...
>>>
>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>> <filter class="solr.WordDelimiterFilterFactory"
>>>         generateWordParts="1"
>>>         generateNumberParts="1"
>>>         catenateWords="1"
>>>         catenateNumbers="1"
>>>         catenateAll="1"
>>>         splitOnCaseChange="1"
>>>         splitOnNumerics="1"
>>>         types="wdftypes.txt"
>>> />
>>>
>>> with nothing specific in wdftypes.txt for full stops.
>>>
>>> Should there also be any difference when quoting my searches?
>>>
>>> The analysis page seems to just drop the quotes, but surely actual
>>> calls don't do this?
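One likely answer to the question above (a sketch, not verified against this exact schema): WordDelimiterFilterFactory's `preserveOriginal` attribute emits the unsplit token alongside the generated parts, so ".net" would be indexed both as-is and as "net":

```xml
<!-- Same filter as in the quoted config, with preserveOriginal added
     so the original token survives next to the split/catenated parts. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="1"
        catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"
        splitOnNumerics="1"
        preserveOriginal="1"
        types="wdftypes.txt"/>
```

With `preserveOriginal="1"` the analysis page should show ".net" surviving as a token in addition to "net"; existing documents still need re-indexing to pick it up.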
>>>
>>> ---
>>> IntelCompute
>>> Web Design & Local Online Marketing
>>> http://www.intelcompute.com
>>>
>>> On Wed, 8 Feb 2012 07:38:42 -0500, Erick Erickson
>>> <erickerick...@gmail.com> wrote:
>>>> Yes, WDF creates multiple tokens. But that has
>>>> nothing to do with the multiValued suggestion.
>>>>
>>>> You can get exactly what you want by:
>>>> 1> setting multiValued="true" in your schema file and re-indexing. Say
>>>>    positionIncrementGap is set to 100.
>>>> 2> adding the field once per sentence when you index, so your doc
>>>>    looks something like:
>>>>    <doc>
>>>>      <field name="sentences">i am a sales-manager in here</field>
>>>>      <field name="sentences">using asp.net and .net daily</field>
>>>>      .....
>>>>    </doc>
>>>> 3> searching like "sales manager"~100
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Wed, Feb 8, 2012 at 3:05 AM, Rob Brown <r...@intelcompute.com> wrote:
>>>>> Apologies if things were a little vague.
>>>>>
>>>>> Given the example snippet to index (numbered to show the searches
>>>>> that need to match)...
>>>>>
>>>>> 1: i am a sales-manager in here
>>>>> 2: using asp.net and .net daily
>>>>> 3: working in design.
>>>>> 4: using something called sage 200. and i'm fluent
>>>>> 5: german sausages.
>>>>> 6: busy A&E dept earning £10,000 annually
>>>>>
>>>>> ... all with newlines in place.
>>>>>
>>>>> These should match...
>>>>>
>>>>> 1. sales
>>>>> 1. "sales manager"
>>>>> 1. sales-manager
>>>>> 1. "sales-manager"
>>>>> 2. .net
>>>>> 2. asp.net
>>>>> 3. design
>>>>> 4. sage 200
>>>>> 6. A&E
>>>>> 6. £10,000
>>>>>
>>>>> But "fluent german" should NOT match across 4 + 5, since there's a
>>>>> newline between them when indexed, but not when searched.
>>>>>
>>>>> Don't the filters (WDF in this case) create multiple tokens? So
>>>>> splitting on the period in "asp.net" would create tokens for all of
>>>>> "asp", "asp.", "asp.net", ".net", "net".
>>>>>
>>>>> Cheers,
>>>>> Rob
>>>>>
>>>>> --
>>>>> IntelCompute
>>>>> Web Design and Online Marketing
>>>>> http://www.intelcompute.com
>>>>>
>>>>> -----Original Message-----
>>>>> From: Chris Hostetter <hossman_luc...@fucit.org>
>>>>> Reply-to: solr-user@lucene.apache.org
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: Which Tokeniser (and/or filter)
>>>>> Date: Tue, 7 Feb 2012 15:02:36 -0800 (PST)
>>>>>
>>>>> : This all seems a bit too much work for such a real-world scenario?
>>>>>
>>>>> You haven't really told us what your scenario is.
>>>>>
>>>>> You said you want to split tokens on whitespace, full stop (aka
>>>>> period) and comma only, but then in response to some suggestions you
>>>>> added comments about other things that you never mentioned
>>>>> previously...
>>>>>
>>>>> 1) evidently you don't want the "." in foo.net to cause a split in
>>>>>    tokens?
>>>>> 2) evidently you not only want token splits on newlines, but also
>>>>>    position gaps to prevent phrases matching across newlines.
>>>>>
>>>>> ...these are kind of important details that affect the suggestions
>>>>> people might give you.
>>>>>
>>>>> Can you please provide some concrete examples of the types of data
>>>>> you have, the types of queries you want them to match, and the types
>>>>> of queries you *don't* want to match?
>>>>>
>>>>> -Hoss
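Erick's multiValued suggestion earlier in the thread can be sketched in schema terms as follows (the type and field names here are assumed, not from the thread):

```xml
<!-- positionIncrementGap inserts 100 phantom term positions between
     successive values of a multiValued field, so phrases cannot match
     across a value (i.e. sentence/newline) boundary. -->
<fieldType name="text_sentences" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="sentences" type="text_sentences" indexed="true"
       stored="true" multiValued="true"/>
```

A phrase query whose slop stays below the gap, e.g. "sales manager"~99, can then match words within one value but not a phrase straddling two values.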