Re: Weighted Query Sequence
Sounds custom-made for boosting. Depending on how you are structuring your fields and queries you could use either index-time or query-time boosts, or even both.
http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_the_difference_between_field_.28or_document.29_boosting_and_query_boosting.3F

--
Ian.

2011/10/31 Shengtao Lei :
> Hello everyone!
>
> I'm struggling with my degree paper. My research project is to build a search
> engine for a language which has many affixes and prefixes.
> I have read many papers; the common approach is stemming.
> My segmentation processor can cut off the affixes and prefixes, but for this
> language I can't simply remove them (my supervisor said so).
>
> What I should do is:
> If the user inputs a query like "root + affix1 + affix2", it means "root" is
> the most important term, and "affix1" and "affix2" follow "root".
> If "root + affix1 + affix2" is found in the doc, that is the best result. If
> not, a "root + affix1" match is better; if not, a "root" match is also OK.
>
> How can I construct my query and search using the existing API?
> Every piece of advice is appreciated! Thank you very much!
>
> Sincerely
> Scott Lei

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
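[Editor's note: a minimal sketch of the query-time boosting Ian suggests, using the Lucene 3.x API current at the time of this thread. The field name and boost values are illustrative assumptions, not part of the original message. The idea: add the full phrase and each shorter prefix of it as SHOULD clauses, with the longest sequence boosted highest, so "root + affix1 + affix2" matches outrank "root + affix1", which outrank bare "root".]

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class AffixBoostQuery {

    // Build a query where "root affix1 affix2" scores higher than
    // "root affix1", which scores higher than "root" alone.
    public static BooleanQuery build(String field, String root, String... affixes) {
        BooleanQuery bq = new BooleanQuery();
        // Add one phrase clause per prefix length, longest first.
        for (int len = affixes.length; len >= 0; len--) {
            PhraseQuery pq = new PhraseQuery();
            pq.add(new Term(field, root));
            for (int i = 0; i < len; i++) {
                pq.add(new Term(field, affixes[i]));
            }
            pq.setBoost(1.0f + len);  // illustrative: longer match, higher boost
            bq.add(pq, BooleanClause.Occur.SHOULD);
        }
        return bq;
    }
}
```

Because every clause is SHOULD, a document matching only the bare root still matches, but anything matching a longer affix sequence also matches the higher-boosted clauses and ranks above it.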
Re: index bigger than it should be?
Do the individual docs get bigger after 28 million? Can you try loading the last few million docs, from when the size jumps, and see what happens? Or load them in reverse order or something, again to see what happens? I don't have indexes with that many docs, but I believe that plenty of people do.

--
Ian.

On Sun, Oct 30, 2011 at 9:01 AM, wrote:
> Hi,
>
> I did the following on the existing index:
> - expunge deletes
> - optimize(5)
> - check index
>
> Then from the existing index I exported all docs into a new one, then on
> the new one I did:
> - optimize(5)
> - check index
>
> The entire log is in http://dl.dropbox.com/u/47469698/lucene/index.txt
>
> During the export, I also monitored the size on disk at each chunk of
> 10 docs added to the new index:
> http://dl.dropbox.com/u/47469698/lucene/index.xls
>
> What I found was that the index was taking around 2400 Mb/million docs
> almost all the time, and from time to time it would take a little bit more
> (<3500) during a short period of time. This stays true until around 28
> million docs, where the size on disk increases a lot (4500 Mb/million docs
> = 135 Gb on disk) until the end of the export (my index contains 32
> million docs). At the end the space on disk went from 134 Gb to 91 Gb
> thanks to the optimize. But even at 91 Gb for 32 million docs, it is
> still 3000 Mb/million docs, far more than the 2400 I was seeing most of
> the time.
>
> I understand that merges happen; what I was surprised about was that the
> behavior between 28 and 32 million docs was a lot bigger in scale than the
> other merges before, and even an optimize would not solve this entirely.
> Did I reach a limit? Should I maintain the index at 25 million to avoid
> this behavior?
>
> I am using lucene 3.4 with the tiered merge policy and all the fields are
> stored.
>
> Thanks,
>
> Vincent Sevel
>
>
> Ian Lea
> Sent by: java-user-return-51136-v.sevel=lombardodier@lucene.apache.org
> 27.10.2011 15:28
> Please respond to java-user@lucene.apache.org
>
> To: java-user@lucene.apache.org
> Subject: Re: index bigger than it should be?
>
> There's org.apache.lucene.index.CheckIndex which will report assorted
> stats about the index, as well as checking it for correctness. It can
> fix it too but you don't need that. I hope. Will take quite a while
> to run on a large index.
>
> What version of lucene? Does a before/after (or large/small)
> directory listing give any clues?
>
> --
> Ian.
>
> On Thu, Oct 27, 2011 at 12:44 PM, wrote:
>> Hi,
>>
>> I have an application that has an index with 30 million docs in it. Every
>> day, I add around 1 million docs, and I remove the oldest 1 million, to
>> keep it stable at 30 million.
>> For the most part doc fields are indexed and stored. Each doc weighs
>> from a few Kb to 1 Mb (a few Mb in some cases).
>> I used to be able to maintain the index at around 60 Gb on disk, but
>> recently the index has had a tendency to keep growing (90 Gb). I can see
>> that the expunge is doing what it should do, because after it executes,
>> the size on disk does go down, but never as low as the previous day. From
>> the outside, it looks like a leak, but since I do not remove the docs I
>> added during the day, it might be that the new docs are just bigger than
>> the old ones. Still, I am surprised by the increase.
>>
>> Are there any tools to dig into the index structure and help justify the
>> space taken on disk?
>> I was thinking about something that would help identify terms that take up
>> the most space, or some sort of dump that I could compare from one day to
>> the other.
>>
>> Any help appreciated,
>>
>> Thanks,
>>
>> Vince
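[Editor's note: the CheckIndex tool Ian mentions can also be run from the command line. The jar name and index path below are illustrative; adjust them to your installation.]

```shell
# Run CheckIndex as a diagnostic against an index directory.
# Without -fix it is read-only and just reports per-segment stats
# (doc counts, deletions, sizes) plus any corruption it finds.
java -cp lucene-core-3.4.0.jar org.apache.lucene.index.CheckIndex /path/to/index
```

The per-segment size breakdown in its output is a reasonable first step toward explaining where the disk space is going.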
Re: IndexReader#reopen() on externally changed index
That's a good idea, if your index is "large enough", and/or you make heavy use of FieldCache (eg, sorting by field), regardless of whether you use NRT or "normal" commit + reopen to reopen your reader.

Mike McCandless
http://blog.mikemccandless.com

On Sun, Oct 30, 2011 at 7:36 PM, Denis Bazhenov wrote:
> Well, if so I guess I should use IndexWarmer to warm up the IndexReader before
> publishing the reference to search clients. At least it will pre-read all the
> segments into RAM before issuing a search.
>
> On Oct 17, 2011, at 9:47 PM, Michael McCandless wrote:
>
>> You'll have to call .commit() from the IndexWriter to make the changes
>> externally visible.
>>
>> Then call IndexReader.reopen to get a reader seeing the committed
>> changes; the reopen will be efficient (it only opens "new" segments vs the
>> old reader).
>>
>> It's still best to use a near-real-time reader when possible (ie, open
>> the IndexReader from the IndexWriter), but it sounds like in your case
>> this is not possible since the writer and reader are on different
>> JVMs/machines across a network.
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Sun, Oct 16, 2011 at 10:32 PM, Denis Bazhenov wrote:
>>> We have a situation where the lucene index is replicated over the network, and on that
>>> machine a reader reopen doesn't make new documents visible to a search.
>>>
>>> As far as I know the IndexReader.reopen() call only works if changes are
>>> applied using the linked IndexWriter. My question is: how can I implement
>>> an efficient index reopen (only new segments should be read) when the index is
>>> changed externally?
>>> ---
>>> Denis Bazhenov
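[Editor's note: a minimal sketch of the commit + reopen pattern Mike describes, against the Lucene 3.x API of this thread. Method and class names are from that API; the wrapper class itself is illustrative.]

```java
import org.apache.lucene.index.IndexReader;

public class ReaderRefresh {

    // Refresh a reader after the index has been changed (and committed)
    // externally. reopen() is cheap: the returned reader shares the
    // unchanged segments with the old one and only opens new segments.
    // When a new reader is returned, the caller owns closing the old one.
    public static IndexReader refresh(IndexReader current) throws Exception {
        IndexReader newReader = current.reopen();
        if (newReader != current) {
            current.close();  // release the superseded reader
        }
        return newReader;
    }
}
```

Warming (running representative queries and sorts against the new reader before publishing it to search clients, as Denis suggests) fits naturally between `reopen()` and returning the reader.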
Re: multiple phrase search for topic
Thanks Ian for your response. This is a one-time offline program so I am not bothered about performance (i.e. speed etc.).

One more question: there are some situations where I need to run an AND clause (i.e. more than one phrase, such as "Apple" AND "Steve Jobs"). My approach was something like:

**
String searchString = "(" + phrase1 + ")" + " AND " + "(" + phrase2 + ")";
QueryParser queryParser = new QueryParser(Version.LUCENE_33, "content", new StandardAnalyzer(Version.LUCENE_33));
Query query = queryParser.parse(searchString);
bQuery.add(query, BooleanClause.Occur.SHOULD);
**

Thanks for the carrot2 pointer.

-d

--
View this message in context: http://lucene.472066.n3.nabble.com/multiple-phrase-search-for-topic-tp3461423p3468005.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: multiple phrase search for topic
Nice not to have to worry about performance.

You say there is another question, but not what it is. The code you show looks like it should do what you want. For anything non-trivial I prefer to build the queries directly in code rather than concatenating strings to be parsed, because I find it hard to work out the quotes and brackets and what the result will be. But your way is fine.

--
Ian.

On Mon, Oct 31, 2011 at 2:51 PM, deb.lucene wrote:
> Thanks Ian for your response. This is a one-time offline program so I am not
> bothered about performance (i.e. speed etc.).
>
> One more question: there are some situations where I need to run an AND
> clause (i.e. more than one phrase, such as "Apple" AND "Steve Jobs"). My
> approach was something like:
>
> String searchString = "(" + phrase1 + ")" + " AND " + "(" + phrase2 + ")";
> QueryParser queryParser = new QueryParser(Version.LUCENE_33, "content", new
> StandardAnalyzer(Version.LUCENE_33));
>
> Query query = queryParser.parse(searchString);
> bQuery.add(query, BooleanClause.Occur.SHOULD);
>
> Thanks for the carrot2 pointer.
>
> -d
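[Editor's note: a sketch of the "build queries directly in code" approach Ian prefers, for the two-phrase AND case in this thread, using the Lucene 3.x API. The field name is the "content" field from the quoted code; the terms assume the index was built with an analyzer that lowercases.]

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;

public class TwoPhraseAnd {

    // Equivalent of parsing '("Apple") AND ("Steve Jobs")', but with no
    // quoting or bracket ambiguity: both phrases are MUST clauses.
    public static BooleanQuery build(String field) {
        PhraseQuery apple = new PhraseQuery();
        apple.add(new Term(field, "apple"));

        PhraseQuery steveJobs = new PhraseQuery();
        steveJobs.add(new Term(field, "steve"));
        steveJobs.add(new Term(field, "jobs"));

        BooleanQuery bq = new BooleanQuery();
        bq.add(apple, BooleanClause.Occur.MUST);      // AND semantics
        bq.add(steveJobs, BooleanClause.Occur.MUST);
        return bq;
    }
}
```

One caveat with hand-built queries: the terms bypass the analyzer, so they must already match the indexed token form (lowercased here, on that assumption).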
Re: Bet you didn't know Lucene can...
On 22/10/2011 11:11, Grant Ingersoll wrote:
> Hi All,
>
> I'm giving a talk at ApacheCon titled "Bet you didn't know Lucene can..."
> (http://na11.apachecon.com/talks/18396). It's based on my observation that, over the
> years, a number of us in the community have done some pretty cool things using Lucene
> that don't fit under the core premise of full text search. I've got a fair number of
> ideas for the talk (easily enough for 1 hour), but I wanted to reach out to hear your
> stories of ways you've (ab)used Lucene and Solr, to see if we couldn't extend the
> conversation beyond the conference and also see if I can't inject more ideas beyond
> the ones I have. I don't need deep technical details, just the high-level use case
> and the basic insight that led you to believe Lucene could solve the problem.

Better late than never ... :)

I briefly mentioned this use case to you at Eurocon, but here it is for the record. I used Lucene in a duplicate-detection scenario where instead of documents, individual sentences would be indexed (with a fuzz). A similarity-preserving hash function was calculated on each sentence, and the hash was added as a field. The property of the hash was that similar documents (sentences) would produce a similar hash, with only some bit-level perturbation. The challenge was to find a ranked list of possible duplicates with similar (not exactly the same) hashes, which in this case meant finding a ranked list of documents that have the smallest bit-level distance in their hashes from the query hash.

The solution is described in SOLR-1918 - Bit-wise scoring field type.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
Re: Bet you didn't know Lucene can...
On Oct 31, 2011, at 9:32 PM, Andrzej Bialecki wrote:

> A similarity-preserving hash function was calculated on each sentence, and the
> hash was added as a field. The property of the hash was that similar
> documents (sentences) would produce a similar hash, with only some bit-level
> perturbation. The challenge was to find a ranked list of possible duplicates
> with similar (not exact same) hashes, which in this case meant to find a
> ranked list of documents that have the smallest bit-level distance in their
> hashes from the query hash.
>
> The solution is described in SOLR-1918 - Bit-wise scoring field type.

In other words, a simhash, no?

Similarity Estimation Techniques from Rounding Algorithms
http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
http://www.matpalm.com/resemblance/simhash/
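[Editor's note: a minimal plain-Java simhash sketch following the Charikar construction linked above: each token's hash casts a +1/-1 vote per bit position, and near-identical token sets end up with fingerprints at a small Hamming distance. This is a generic illustration, not the application-specific hash used in SOLR-1918; the 64-bit string hash is an arbitrary FNV-1a-style choice.]

```java
public class SimHash {

    // 64-bit simhash over a token sequence: each token's hash contributes
    // +1/-1 votes per bit position; the sign of the total sets the bit.
    public static long simhash(String[] tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = hash64(token);
            for (int bit = 0; bit < 64; bit++) {
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long fingerprint = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1L << bit;
            }
        }
        return fingerprint;
    }

    // Bit-level distance between fingerprints: the ranking key from the
    // thread (smallest distance = most similar).
    public static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }

    // FNV-1a-style 64-bit string hash (illustrative choice).
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        return h;
    }
}
```

Ranking candidates then reduces to sorting stored fingerprints by `hammingDistance` from the query fingerprint, which is what the bit-wise scoring field type in SOLR-1918 does inside scoring.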
Re: idf calculation in Lucene ?
Thanks! Is there any way to extend the Similarity class to override the behavior (e.g., using the max idf instead of the sum of the individual term idfs)?

On Thu, Oct 27, 2011 at 5:41 AM, Robert Muir wrote:
> On Thu, Oct 20, 2011 at 3:11 PM, David Ryan wrote:
>
>> However, in some cases, when I search for o'reilly, I see
>>
>> 44.0865 = idf(title: o''reilli=4 o=1488 reilli=14 oreilli=4)
>>
>> In this case, how is the IDF calculated?
>
> That's a phrase or multiphrase query.
>
> In this case it sums up the idf of each term:
> http://lucene.apache.org/java/3_4_0/api/all/org/apache/lucene/search/Similarity.html#idfExplain(java.util.Collection, org.apache.lucene.search.Searcher)
>
> --
> lucidimagination.com
Re: idf calculation in Lucene ?
Yes: override that method, idfExplain(java.util.Collection, org.apache.lucene.search.Searcher).

On Mon, Oct 31, 2011 at 5:24 PM, David Ryan wrote:
> Thanks! Is there any way to extend the Similarity class to override the
> behavior (e.g., using the max idf instead of the sum of the individual term idfs)?

--
lucidimagination.com
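[Editor's note: a hedged sketch of the override Robert describes, against the Lucene 3.x Similarity API linked in the thread. The explanation string format is an illustrative assumption; check the 3.4 javadocs for the exact `IDFExplanation` contract before relying on this.]

```java
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Explanation.IDFExplanation;
import org.apache.lucene.search.Searcher;

// Score phrase/multiphrase queries by the maximum term idf
// instead of the default sum over all terms.
public class MaxIdfSimilarity extends DefaultSimilarity {

    @Override
    public IDFExplanation idfExplain(Collection<Term> terms, Searcher searcher)
            throws IOException {
        final int maxDoc = searcher.maxDoc();
        float max = 0f;
        final StringBuilder exp = new StringBuilder();
        for (Term term : terms) {
            int df = searcher.docFreq(term);
            max = Math.max(max, idf(df, maxDoc));  // keep only the largest idf
            exp.append(term.text()).append('=').append(df).append(' ');
        }
        final float maxIdf = max;
        return new IDFExplanation() {
            @Override public float getIdf() { return maxIdf; }
            @Override public String explain() { return "max idf(" + exp + ")"; }
        };
    }
}
```

Install it via Searcher.setSimilarity (or IndexWriterConfig for index-time factors) so phrase queries pick it up.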
Re: Bet you didn't know Lucene can...
On 31/10/2011 21:42, Petite Abeille wrote:

> In other words, a simhash, no?
>
> Similarity Estimation Techniques from Rounding Algorithms
> http://www.cs.princeton.edu/courses/archive/spr04/cos598B/bib/CharikarEstim.pdf
> http://www.matpalm.com/resemblance/simhash/

Yes, you could use this. In that project we used a different application-specific hash.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com