Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
To avoid wildcard queries, you can write a TokenFilter that emits both tokens, "ADJ" and "ADJ:brown", at the same position, so you can use your index for both lookups without a wildcard. On Tue, Aug 7, 2012 at 12:31 PM, Carsten Schnober wrote: > Hi Danil, > >>> Just transform your input like
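
The stacked-token idea above can be sketched without Lucene. This is a minimal illustration, not the poster's actual filter: the function names `stacked_tokens` and `build_index` are hypothetical, and the position increment of 0 mimics how a filter would place the bare tag at the same position as the full token.

```python
from collections import defaultdict


def stacked_tokens(tagged_words):
    """Yield (term, position_increment) pairs for each (tag, word) input."""
    for tag, word in tagged_words:
        yield f"{tag}:{word}", 1  # full token advances to the next position
        yield tag, 0              # bare tag is stacked on the same position


def build_index(tagged_words):
    """Map each term to the list of positions at which it occurs."""
    index = defaultdict(list)
    position = -1
    for term, increment in stacked_tokens(tagged_words):
        position += increment
        index[term].append(position)
    return dict(index)


index = build_index([("ADJ", "brown"), ("NOUN", "fox")])
print(index["ADJ"], index["ADJ:brown"])  # both terms share one position
print(index["NOUN"], index["NOUN:fox"])
```

Because "ADJ" and "ADJ:brown" are both exact terms in the index, either lookup is a plain term query, and position-based (phrase) matching still works across the two forms.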

Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
I mean "ADJ:brown" as a token and only the as payload, since you probably only use it for some scoring/postprocessing, not the actual matching. You can even write a filter that will emit both tokens, "ADJ" and "ADJ:brown", at the same position (so you'll be able to do phrase queries), and still maintain

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
Hi Danil, >> Just transform your input like "brown fox" into "ADJ:brown|> payload> NOUN:fox|" > > I understand that this denotes "ADJ" and "NOUN" to be interpreted as the > actual token and "brown" and "fox" as payloads (followed by payload>), right? Sorry for replying to myself, but I've reali

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 07.08.2012 10:20, Danil ŢORIN wrote: Hi Danil, > If you do intersection (not join), maybe it makes sense to put > everything into 1 index? Just a note on that: my application performs intersections and joins (unions) on the results, depending on the query. So the index structure has to be r

Re: Small Vocabulary

2012-08-07 Thread Danil ŢORIN
If you do intersection (not join), maybe it makes sense to put everything into one index? Just transform your input like "brown fox" into "ADJ:brown| NOUN:fox|". Write a custom tokenizer, some filters, and that's it. Of course I'm not aware of all the details, so my solution might not be applicable
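
As a rough illustration of the "token|payload" input convention suggested above, the following sketch splits each whitespace-separated chunk at the delimiter. The payload values `p1` and `p2` are placeholders invented for this example (the original message does not show what follows the `|`), and `split_payloads` is a hypothetical helper, not a Lucene API.

```python
def split_payloads(text, delimiter="|"):
    """Split 'term|payload' chunks into (term, payload_bytes) pairs."""
    tokens = []
    for chunk in text.split():
        term, _, payload = chunk.partition(delimiter)
        # An empty or missing payload is stored as None.
        tokens.append((term, payload.encode("utf-8") if payload else None))
    return tokens


tokens = split_payloads("ADJ:brown|p1 NOUN:fox|p2")
for term, payload in tokens:
    print(term, payload)
```

Only the term would be matched at query time; the payload travels with the position and can be read back later for scoring or post-processing.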

Re: Small Vocabulary

2012-08-07 Thread Carsten Schnober
On 06.08.2012 20:29, Mike Sokolov wrote: Hi Mike, > There was some interesting work done on optimizing queries including > very common words (stop words) that I think overlaps with your problem. > See this blog post > http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-wo

Re: Small Vocabulary

2012-08-06 Thread Mike Sokolov
There was some interesting work done on optimizing queries including very common words (stop words) that I think overlaps with your problem. See this blog post http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2 from the Hathi Trust. The upshot in a nutshel

Re: Small Vocabulary

2012-08-02 Thread Carsten Schnober
On 31.07.2012 12:10, Ian Lea wrote: Hi Ian, > Lucene 4.0 allows you to use custom codecs and there may be one that > would be better for this sort of data, or you could write one. > > In your tests is it the searching that is slow or are you reading lots > of data for lots of docs? The latter

Re: Small Vocabulary

2012-07-31 Thread Ian Lea
Lucene 4.0 allows you to use custom codecs and there may be one that would be better for this sort of data, or you could write one. In your tests is it the searching that is slow or are you reading lots of data for lots of docs? The latter is always likely to be slow. General performance advice a