Re: Sharding Techniques
Hi Tom,

The more responses I get in this thread, the more I feel that our application needs optimization. 350 GB and under 2 seconds! That's much better than my expectation :-) (in the current scenario).

*Characteristics of slow queries:*
There are a few reasons for the longer search times.

1. Two of our fields contain decimal values but are not NumericFields :( . These fields are searched as a range. Whenever the ranges are larger and/or both fields are used in a search, the search time and server load go up. I have already started work to convert them to NumericField - but suggestions and experiences are most welcome.

2. When queries (without the two fields mentioned above) have a lot of words/phrases, search time is high. E.g. I took a query with around 80 unique terms (not words) in 5 fields. These terms occur repeatedly and add up to 225 terms (non-unique). This particular query took 4.2 seconds. The 15 indexes used for this query had a total size of 5 GB.
Is 225 terms (80 unique) a very big number?

And yes, slow queries are always slow - but obviously high load will add to their slowness.

Here I have another curiosity about something I noticed. If I have a query like the following:

title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz title:xyz

*Will Lucene search for the term 11 times, or will it reuse the results of the first term?*
If the latter is true (which I think it is), is there a particular reason, or could it be optimized inside Lucene?

On Tue, May 10, 2011 at 9:46 PM, Burton-West, Tom wrote:
> Hi Samar,
>
> >> Normal queries go fine under 500 ms but when people start searching
> >> "anything" some queries take up to > 100 seconds. Don't you think
> >> distributing smaller indexes on different machines would reduce the average
> >> search time. (Although I have a feeling that search time for smaller queries
> >> may be slightly increased)
>
> What are the characteristics of your slow queries? Can you give examples?
> Are the slow queries always slow or only under heavy load? Whether splitting
> into smaller indexes would help depends on just what your bottleneck is.
> It's not clear that your index is large enough that the size of the index is
> causing your bottleneck.
>
> We run indexes of about 350GB with average response times under 200ms and
> 99th percentile response times of under 2 seconds. (We have a very low qps
> rate, however.)
>
> Tom

--
Regards,
Samar
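For point 1, a minimal sketch of what the NumericField conversion might look like (Lucene 2.9/3.x trie API; the "price" field name and the precisionStep of 4 are placeholder choices, not taken from the thread):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.NumericField;
    import org.apache.lucene.search.NumericRangeQuery;

    public class NumericRangeSketch {
        // Index time: trie-encode the decimal value instead of indexing it as text.
        static Document buildDoc(double value) {
            Document doc = new Document();
            doc.add(new NumericField("price", 4, Field.Store.NO, true) // precisionStep 4
                    .setDoubleValue(value));
            return doc;
        }

        // Search time: NumericRangeQuery walks the trie terms instead of
        // enumerating every term in the range, which is what keeps large
        // ranges cheap compared to text range queries.
        static NumericRangeQuery<Double> rangeQuery(double min, double max) {
            return NumericRangeQuery.newDoubleRange("price", 4, min, max, true, true);
        }
    }

The precisionStep trades index size against range-query speed; the same value has to be used at index and query time for a given field.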
Re: Sharding Techniques
Ganesh,

Nobody is saying that sharding is never a good idea - it just doesn't seem to be applicable in the case being discussed.

On my indexes I care much more about the speed of searching than the speed of indexing. The latter typically happens in the background in the dead of night and, within reason, I don't really care how long it takes. Your application and requirements will be different and you may come to different conclusions.

--
Ian.

On Wed, May 11, 2011 at 6:09 AM, Ganesh wrote:
>
> We also use a similar kind of technique, breaking indexes into smaller ones and
> searching using ParallelMultiSearcher. We have to do incremental indexing, and
> records older than 6 months or 1 year (based on an age-out setting) should be
> deleted. Having multiple small indexes is really fast in terms of indexing.
>
> Since you guys mentioned keeping a single large index: search time would be
> faster, but indexing and index optimization will take more time. How are you
> handling it in the case of incremental indexing? If we keep the index size to
> 100+ GB then each file (fdt, fdx etc.) would be GBs in size. A small addition
> or deletion to the file will not cause more IO, as it has to skip those bytes
> and write at the end of the file.
>
> Regards
> Ganesh
>
> ----- Original Message -----
> From: "Burton-West, Tom"
> To:
> Sent: Tuesday, May 10, 2011 9:46 PM
> Subject: RE: Sharding Techniques
> [...]
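As an aside on the age-out requirement Ganesh describes: with a single large index, the usual alternative to dropping whole small indexes is deleting by a date range. A rough sketch, assuming a numeric "timestamp" field (a made-up field name) holding epoch millis:

    import java.io.IOException;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.NumericRangeQuery;

    public class AgeOutSketch {
        // Delete everything whose "timestamp" falls before the cutoff and let
        // normal segment merging reclaim the space later.
        static void ageOut(IndexWriter writer, long cutoffMillis) throws IOException {
            writer.deleteDocuments(
                NumericRangeQuery.newLongRange("timestamp", 0L, cutoffMillis, true, false));
            writer.commit();  // make the deletes visible to new readers
        }
    }

This avoids rewriting the large fdt/fdx files on every age-out pass; deletes are only marked and physically removed when segments merge.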
Re: Sharding Techniques
I'm sure that you should try building one large index and converting to NumericField wherever you can. I'm convinced that will be faster - but as ever, the proof will be in the numbers.

On repeated terms, I believe that Lucene will search multiple times. If so, I'd guess it is just something that has never been optimized. 225 terms is not a very big number, but not very small either. Complex queries with lots of terms can be expected to be slower than simple queries with few terms. If you have a particular problem with repeated terms, perhaps you could dedup them yourself.

--
Ian.

On Wed, May 11, 2011 at 9:10 AM, Samarendra Pratap wrote:
> Hi Tom,
> the more i am getting responses in this thread the more i feel that our
> application needs optimization.
> [...]
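For the repeated-terms case, a minimal sketch of the client-side dedup Ian mentions, done before the query is built (field and term values are placeholders). Note that dropping duplicates changes scoring slightly, since a query term repeated n times otherwise contributes n times:

    import java.util.Arrays;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class DedupSketch {
        // Collapse exact duplicates (e.g. "title:xyz" repeated 11 times) before
        // handing the terms to Lucene; LinkedHashSet keeps the original order.
        static BooleanQuery buildDeduped(String field, String[] words) {
            Set<String> unique = new LinkedHashSet<String>(Arrays.asList(words));
            BooleanQuery query = new BooleanQuery();
            for (String w : unique) {
                query.add(new TermQuery(new Term(field, w)), BooleanClause.Occur.SHOULD);
            }
            return query;
        }
    }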
RE: Can I omit ShingleFilter's filler tokens
Hi Bill,

I can think of two possible interpretations of "removing filler tokens":

1. Don't create shingles across stopwords, e.g. for text "one two three four five" and stopword "three", bigrams only, you'd get ("one two", "four five"), instead of the current ("one two", "two _", "_ four", "four five").

2. Create shingles as if the stopwords were never there, e.g. for the same text and stopword, bigrams only, you'd get ("one two", "two four", "four five").

Which one did you have in mind? #2 can be achieved by adding PositionFilter after StopFilter and before ShingleFilter. I think #1 requires ShingleFilter modifications.

Steve

> -----Original Message-----
> From: William Koscho [mailto:wkos...@gmail.com]
> Sent: Wednesday, May 11, 2011 12:05 AM
> To: java-user@lucene.apache.org
> Subject: Can I omit ShingleFilter's filler tokens
>
> Hi,
>
> Can I remove the filler token _ from the n-gram tokens that are generated by
> a ShingleFilter?
>
> I'm using a chain of filters: ClassicFilter, StopFilter, LowerCaseFilter,
> and ShingleFilter to create phrase n-grams. The ShingleFilter inserts
> FILLER_TOKENs in place of the stopwords, but I don't want them.
>
> How can I omit the filler tokens?
>
> thanks
> Bill
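For option #2, a rough sketch of where PositionFilter (from the contrib analyzers module) would sit in the chain Bill described, using Lucene 3.1-era APIs. The stopword set, shingle size, and class name are placeholder choices, and the lower-case filter is moved ahead of the stop filter so stopwords match regardless of case:

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.position.PositionFilter;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.analysis.standard.ClassicFilter;
    import org.apache.lucene.analysis.standard.ClassicTokenizer;
    import org.apache.lucene.util.Version;

    public class ShingleNoFillerAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream ts = new ClassicTokenizer(Version.LUCENE_31, reader);
            ts = new ClassicFilter(ts);
            ts = new LowerCaseFilter(Version.LUCENE_31, ts);
            ts = new StopFilter(Version.LUCENE_31, ts, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            // PositionFilter collapses the position holes left by StopFilter, so
            // ShingleFilter builds bigrams as if the stopwords were never there
            // (no "_" filler tokens).
            ts = new PositionFilter(ts);
            ts = new ShingleFilter(ts, 2);  // bigrams (unigrams are also emitted by default)
            return ts;
        }
    }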
Re: Can I omit ShingleFilter's filler tokens
Another idea is to call .setEnablePositionIncrements(false) on your StopFilter.

On Wed, May 11, 2011 at 8:27 AM, Steven A Rowe wrote:
> Hi Bill,
>
> I can think of two possible interpretations of "removing filler tokens":
>
> 1. Don't create shingles across stopwords, e.g. for text "one two three four
> five" and stopword "three", bigrams only, you'd get ("one two", "four five"),
> instead of the current ("one two", "two _", "_ four", "four five").
>
> 2. Create shingles as if the stopwords were never there, e.g. for the same
> text and stopword, bigrams only, you'd get ("one two", "two four", "four
> five").
> [...]
RE: Can I omit ShingleFilter's filler tokens
Yes, StopFilter.setEnablePositionIncrements(false) will almost certainly get higher throughput than inserting PositionFilter. Like PositionFilter, this will buy you #2 (create shingles as if stopwords were never there), but not #1 (don't create shingles across stopwords).

> -----Original Message-----
> From: Robert Muir [mailto:rcm...@gmail.com]
> Sent: Wednesday, May 11, 2011 9:02 AM
> To: java-user@lucene.apache.org
> Subject: Re: Can I omit ShingleFilter's filler tokens
>
> another idea is to .setEnablePositionIncrements(false) on your
> stopfilter.
> [...]
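And a correspondingly small sketch of Robert's setEnablePositionIncrements(false) variant, under the same assumptions as the chain above. Here "lowerCased" stands in for the output of the tokenizer / ClassicFilter / LowerCaseFilter stages from the earlier sketch; only the tail of the chain changes:

    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleFilter;
    import org.apache.lucene.util.Version;

    public class ShingleTailSketch {
        static TokenStream shingleTail(TokenStream lowerCased) {
            StopFilter stop = new StopFilter(Version.LUCENE_31, lowerCased,
                                             StopAnalyzer.ENGLISH_STOP_WORDS_SET);
            stop.setEnablePositionIncrements(false); // no position holes at stopwords...
            return new ShingleFilter(stop, 2);       // ...so no "_" filler tokens
        }
    }

Disabling position increments does mean phrase queries and highlighting on that field no longer "know" where stopwords were, which is the trade-off against the extra PositionFilter stage.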
Re: Non-English Languages Search
On Mon, May 9, 2011 at 5:32 PM, Provalov, Ivan wrote:
> We are planning to ingest some non-English content into our application. All
> content is OCR'ed and there are a lot of misspellings and garbage terms
> because of this. Each document has one primary language with some
> exceptions (e.g. a few English terms mixed in with primarily non-English
> document terms).

Sounds like you should talk to Tom Burton-West!

> 1. Does it make sense to mix two or more different Latin-based languages in
> the same index directory in Lucene (e.g. Spanish/French/English)?

I think it depends upon the application. If the user is specifying the language via the UI somehow, then it's probably simplest to just use different indexes for each collection.

> 2. What about mixing Latin and non-Latin languages? We ran tests on English
> and Chinese collections mixed together and didn't see any negative impact
> (precision/recall). Any other potential issues?

Right, none of the terms would overlap here... the only "issue" would be a skewed maxDoc, but this is probably not a big deal at all. But what's the benefit of mixing them?

> 3. Any recommendations for an Urdu analyzer?

You can always start with StandardAnalyzer, as it will tokenize the text... you might be able to make use of resources such as
http://www.crulp.org/software/ling_resources/UrduClosedClassWordsList.htm
and
http://www.crulp.org/software/ling_resources/UrduHighFreqWords.htm
as a stoplist.
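A minimal sketch of the StandardAnalyzer-plus-stoplist idea (Lucene 3.x API; the file path, class name, and the assumption of a UTF-8 file with one word per line - e.g. saved from one of the CRULP lists above - are all placeholders, not from the thread):

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.util.Version;

    public class UrduAnalyzerSketch {
        // Load a one-word-per-line UTF-8 stopword file into a Set.
        static Set<String> loadStopwords(File f) throws IOException {
            Set<String> stop = new HashSet<String>();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream(f), "UTF-8"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.length() > 0) stop.add(line);
                }
            } finally {
                in.close();
            }
            return stop;
        }

        // StandardAnalyzer tokenizes Urdu text on Unicode word boundaries;
        // the custom set replaces the default English stopwords.
        static StandardAnalyzer buildAnalyzer(File stopwordFile) throws IOException {
            return new StandardAnalyzer(Version.LUCENE_31, loadStopwords(stopwordFile));
        }
    }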
Bug in BrazilianAnalyzer?
Hi,

I did a test to understand the use of '*' and '?'.
If I use StandardAnalyzer I get the expected results, but if I use BrazilianAnalyzer I get a wrong result.
Please, where is my mistake? The JUnit is at the end.

Paulo Cesar

cities = {"Brasília", "Brasilândia", "Braslândia", "São Paulo", "São Roque", "Salvador"};

>>> Using StandardAnalyzer

>>> Using BrazilianAnalyzer

>>> JUnit
Re: Bug in BrazilianAnalyzer?
Hi,

I think you forgot to attach the JUnit.

On Wed, May 11, 2011 at 10:04 AM, wrote:
> Hi,
> I did a test to understand the use of '*' and '?'.
> If I use StandardAnalyzer I have expected results but if I use
> BrazilianAnalyzer I have a mistake result.
> Please, where is my mistake? Junit is at the end.
> Paulo Cesar
> cities = {"Brasília", "Brasilândia", "Braslândia", "São Paulo",
> "São Roque", "Salvador"};
> [...]
Re: Can I omit ShingleFilter's filler tokens
#1 is what I'm trying for, so I'll give setEnablePositionIncrements(false) a try. Thanks for everyone's help.

Bill

On 5/11/11, Steven A Rowe wrote:
> Yes, StopFilter.setEnablePositionIncrements(false) will almost certainly get
> higher throughput than inserting PositionFilter. Like PositionFilter, this
> will buy you #2 (create shingles as if stopwords were never there), but not
> #1 (don't create shingles across stopwords).
> [...]

--
Sent from my mobile device
Re: Can I omit ShingleFilter's filler tokens
I meant I'm trying for #2, so this should work (got my numbers mixed up). Thanks again.

Bill

On 5/11/11, William Koscho wrote:
> #1 is what I'm trying for, so I'll give setEnablePositionIncrements(false) a
> try. Thanks for everyone's help.
>
> Bill
> [...]

--
Sent from my mobile device
found workaround: Query on using Payload with MoreLikeThis class
Hi All,

I am not sure if anyone got a chance to go over my question (below). The question was to check whether I can modify the MoreLikeThis.like() result using index-time boosting.

I have found a workaround, as there is no easy way to influence the MoreLikeThis result using an index-time payload value. The workaround is to write a class similar to MoreLikeThis (it cannot be extended as it is final) and, in the createQuery method of that class, change the query class from TermQuery to PayloadTermQuery.

Change:

  TermQuery tq = new TermQuery(new Term((String) ar[1], (String) ar[0]));

To:

  Term payloadTerm = new Term((String) ar[1], (String) ar[0]);
  Query tq = new PayloadTermQuery(payloadTerm, new AveragePayloadFunction());

That's it, the rest of the MoreLikeThis code stays the same :)

With this change, I could boost my MoreLikeThis result with the payload value set up at index time.

If anyone has any better thoughts, I would be glad to hear about them.

Thanks
Saurabh

On Tue, May 10, 2011 at 1:36 PM, Saurabh Gokhale wrote:
> Hi,
>
> In the Lucene 2.9.4 project, there is a requirement to boost some of the
> keywords in the document using payload.
>
> Now while searching, is there a way I can boost the MoreLikeThis result
> using the index time payload values?
>
> Or can I merge MoreLikeThis output and PayloadTermQuery output somehow to
> get the final percentage output?
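One thing worth flagging (an assumption on my part, not something stated in Saurabh's mail): PayloadTermQuery only changes scores if the Similarity in use decodes the payload bytes. A hedged sketch of what that might look like, using the Lucene 2.9/3.x scorePayload signature and assuming the payloads were written as 4-byte floats (e.g. with PayloadHelper.encodeFloat):

    import org.apache.lucene.analysis.payloads.PayloadHelper;
    import org.apache.lucene.search.DefaultSimilarity;

    public class PayloadSimilarity extends DefaultSimilarity {
        @Override
        public float scorePayload(int docId, String fieldName,
                                  int start, int end,
                                  byte[] payload, int offset, int length) {
            // Decode the 4-byte float written at index time; fall back to 1.0
            // (no boost) for terms that carry no payload.
            if (payload == null || length < 4) {
                return 1.0f;
            }
            return PayloadHelper.decodeFloat(payload, offset);
        }
    }

It would then be registered on the searcher (searcher.setSimilarity(new PayloadSimilarity())) before running the payload-aware MoreLikeThis query.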
Help needed on Ant build script for creating Lucene index
Hi,

Can someone please direct me to an example of an Ant build script for creating a Lucene index? It is part of Lucene contrib, but I did not get much of an idea from the documentation on the Lucene site.

Thanks
Saurabh
Re: How do I sort lucene search results by relevance and time?
If only you were using Solr:
http://wiki.apache.org/solr/DisMaxQParserPlugin#bf_.28Boost_Functions.29

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: Johnbin Wang
> To: java-user@lucene.apache.org
> Sent: Sun, May 8, 2011 11:59:11 PM
> Subject: How do I sort lucene search results by relevance and time?
>
> What I want to do is just like Google search results. The results on the
> first page are the most relevant and also recent documents, but not
> absolutely sorted by time desc.
>
> --
> cheers,
> Johnbin Wang
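On the plain Lucene side, the closest out-of-the-box option is a Sort that uses relevance first and the time field only as a tie-breaker - not the blended recency boost Solr's bf gives, but a minimal sketch (the "created" long timestamp field name is a placeholder):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TopDocs;

    public class RecencySortSketch {
        // Relevance first, then newest first among equally scored hits.
        // "created" is assumed to be an indexed numeric (long) timestamp field.
        static TopDocs search(IndexSearcher searcher, Query query) throws IOException {
            Sort sort = new Sort(new SortField[] {
                SortField.FIELD_SCORE,
                new SortField("created", SortField.LONG, true)  // true = descending
            });
            return searcher.search(query, null, 20, sort);
        }
    }

A genuinely blended "relevant and recent" ranking, like the Google-style behavior Johnbin describes, would instead need the recency folded into the score itself (e.g. a custom score or function query), which is exactly what the Solr bf link above does declaratively.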