Grouping field using pruned terms?
TermFirstPassGroupingCollector loads all terms for a given group-by field through the FieldCache. Is it possible to instruct the class to group only a pruned subset of a field's terms, based on a user-supplied query (RangeQuery, TermQuery, etc.)? That way, only the pruned terms would be grouped and all others ignored. Is such a pre-processing step possible in a group-by query? -- Ravi
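For reference, a minimal sketch of the first-pass grouping setup in question, assuming Lucene 4.x (searcher and userQuery are stand-ins; "author" is an illustrative group-by field). Note that the query only restricts which documents are collected; the collector still loads every term of the field through FieldCache, which is exactly the cost being asked about:

import java.util.Collection;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.grouping.SearchGroup;
import org.apache.lucene.search.grouping.term.TermFirstPassGroupingCollector;
import org.apache.lucene.util.BytesRef;

// Collect the top 10 groups of the "author" field, but only over
// documents matching the user-supplied query.
TermFirstPassGroupingCollector firstPass =
    new TermFirstPassGroupingCollector("author", Sort.RELEVANCE, 10);
searcher.search(userQuery, firstPass);
Collection<SearchGroup<BytesRef>> topGroups = firstPass.getTopGroups(0, true);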
Matched words from document - Stemmed and Synonyms
I am looking for a feature in Solr that will give me all the matched words in a document when I search with a word. My field uses stemming as well as synonym filters. For example, I have documents where part of the text goes like below:

1. We were very careful about my surgery
2. are still needing shunts, or Decompression Surgeries, Taps work for som
3. prior to surgeries and Berinert as a rescue drug.
4. identified the disease as we were in the same area of operations as outcome from dioxins

When I search with "surgery" it returns all 4 documents, which is fine, but I need to know which words in each doc matched my search query "surgery": here, 1. surgery, 2. Surgeries, 3. surgeries, and 4. operations. I have tried using the Highlighter, but it not only gives the matched words but also unwanted words, and it does not highlight the synonym matches (in this example, "operations").

Thanks, Venkatesham Gundu
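For context, the usual Highlighter setup being referred to looks roughly like this, as a sketch assuming Lucene 4.x and the lucene-highlighter module (query, analyzer, and docText stand in for the poster's own objects; "text" is an illustrative field name):

import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

// Score fragments against the parsed query and return the best three;
// the analyzer must match the one used at index time, otherwise stemmed
// and synonym matches will not line up with the stored text.
QueryScorer scorer = new QueryScorer(query, "text");
Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), scorer);
String[] fragments = highlighter.getBestFragments(analyzer, "text", docText, 3);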
Search a Part of the Sentence/Complete sentence in lucene 4.3
Dear All,

Suppose I have 3 documents with the following sample text:

File 1: Mr X David is a manager of the company. He is the senior most manager. I also want to become manager of the company.

File 2: Mr X David manager of the company is also very senior. He happens to be the senior most manager. I wish even I could reach that place.

File 3: Mr X David is working for a company. He happens to be the manager of the company. In fact he is the senior most manager. I don't want to become like him.

String I wish to search: "X David is a manager of the company."

Ideally I should get only File 1 in the hit result. I have no clue how to achieve this. Basically I am trying to match part of a sentence or a complete sentence. What would be the best methodology? I presume "is", "a", etc. are stop words and will be skipped during indexing by the StandardAnalyzer. What puzzles me is how I would then search for a part of a sentence, or a complete sentence, if the sentence contains some/many stop words. I am using StandardAnalyzer. Please guide.

-- Regards, Ankit
Re: Search a Part of the Sentence/Complete sentence in lucene 4.3
PhraseQuery?

You can skip the holes created by stopwords; e.g. QueryParser does this. That is, the PhraseQuery becomes "X David _ _ manager _ _ company" if is/a/of/the are stop words, which isn't perfect (it could return false matches) but should work well in practice.

Mike McCandless
http://blog.mikemccandless.com
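A minimal sketch of letting QueryParser build that holed PhraseQuery, assuming Lucene 4.3 ("body" is an illustrative field name; parse() throws ParseException):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// StandardAnalyzer drops is/a/of/the but keeps their position gaps, so
// the PhraseQuery the parser builds has holes where the stop words were.
QueryParser qp = new QueryParser(Version.LUCENE_43, "body",
    new StandardAnalyzer(Version.LUCENE_43));
Query phrase = qp.parse("\"X David is a manager of the company\"");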
Re: Search a Part of the Sentence/Complete sentence in lucene 4.3
I tried using PhraseQuery with slops. But since I am specifying the slop, I also need to specify a 2nd term. In my case the 2nd term is not present; the whole string to be searched is still one single term.

How do I skip the holes created by stopwords? I do not know beforehand how many stop words are skipped or what string the user is going to enter. Is there a definite way to skip the holes created by stopwords?

I was now looking at MultiPhraseQuery: splitting the user-provided string on spaces and providing each word as a term to the MultiPhraseQuery. Will it help? Is there any alternative?

-- Regards, Ankit
Re: Question on Lucene hot-backup functionality.
This is unfortunately very trappy; this happened with LUCENE-4876, where we added cloning of the IndexDeletionPolicy on IndexWriter construction. It's very confusing that the IDP you set on your IWC is not in fact the one that IW uses...

Mike McCandless
http://blog.mikemccandless.com

On Wed, Jul 24, 2013 at 2:35 AM, Shai Erera wrote:
> Hi
>
> In Lucene 4.4 we've improved the snapshotting process so that you don't need to specify an ID. Also, there's a new Replicator module which can be used for just that purpose: taking hot backups of the index. It pretty much hides most of the snapshotting from you. You can read about it here: http://shaierera.blogspot.com/2013/05/the-replicator.html
>
> As for your problem, I think it's related to the fact that IndexWriter clones the given IndexWriterConfig, including the SnapshotDeletionPolicy. So you should obtain it from IW.getLiveConfig().getIndexDeletionPolicy(), rather than IndexWriterConfig.getIndexDeletionPolicy(). I'm not sure what Indexer.getSnapshotter() does, but I'd make sure that it uses IW.
>
> Shai
>
> On Wed, Jul 24, 2013 at 7:34 AM, Marcos Juarez Lopez wrote:
>> I'm trying to get Lucene's hot-backup functionality to work. I posted the question in detail over at StackOverflow, but it seems there's very little Lucene knowledge over there.
>>
>> Basically, I think I have set up everything correctly, but I can't get a valid snapshot when trying to do a backup. I'm following both the Lucene book's instructions and the latest Lucene Javadocs, to no avail. Original question at the link below; I'll copy the relevant bits here:
>>
>> http://stackoverflow.com/questions/17753226/lucene-4-3-1-backup-process
>>
>> This is the code I have up to now:
>>
>> public Indexer(Directory indexDir, PrintStream printStream) throws IOException {
>>     IndexWriterConfig writerConfig = new IndexWriterConfig(Version.LUCENE_43, new StandardAnalyzer(Version.LUCENE_43));
>>     snapshotter = new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
>>     writerConfig.setIndexDeletionPolicy(snapshotter);
>>     indexWriter = new IndexWriter(indexDir, writerConfig);
>> }
>>
>> And when starting the backup, you can't just do snapshotter.snapshot(). You now have to specify an arbitrary commitIdentifier id, and use that after you're done to release the snapshot:
>>
>> SnapshotDeletionPolicy snapshotter = indexer.getSnapshotter();
>> String commitIdentifier = generateCommitIdentifier();
>> try {
>>     IndexCommit commit = snapshotter.snapshot(commitIdentifier);
>>     for (String fileName : commit.getFileNames()) {
>>         backupFile(fileName);
>>     }
>> } catch (Exception e) {
>>     logger.error("Exception", e);
>> } finally {
>>     snapshotter.release(commitIdentifier);
>>     indexer.deleteUnusedFiles();
>> }
>>
>> However, this doesn't seem to be working. Regardless of whether docs have been indexed or not, and regardless of whether I have committed or not, my call to snapshotter.snapshot(commitIdentifier) always throws an IllegalStateException saying "No index commit to snapshot". Looking at the code, the SnapshotDeletionPolicy seems to think there have been no commits, even though I'm committing to disk every 5 seconds or so. I've verified that docs are being written and committed to indexes all the time, but the snapshotter always thinks there have been zero commits.
>>
>> Any idea of what I'm doing wrong?
>>
>> Thanks!
>> Marcos Juarez
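The fix Shai points at, as a minimal sketch (assuming Lucene 4.3, where IndexWriter.getConfig() exposes the live configuration and snapshot() still takes an id; indexWriter and commitIdentifier are the poster's own objects):

import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.SnapshotDeletionPolicy;

// IndexWriter clones the IndexWriterConfig it is given (LUCENE-4876), so
// ask the writer's live config for the policy it actually uses, instead
// of keeping a reference to the one set on the original config.
SnapshotDeletionPolicy liveSnapshotter =
    (SnapshotDeletionPolicy) indexWriter.getConfig().getIndexDeletionPolicy();
IndexCommit commit = liveSnapshotter.snapshot(commitIdentifier);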
Re: Search a Part of the Sentence/Complete sentence in lucene 4.3
Did you consider using shingles? It solves the "to be or not to be" problem quite nicely.

Dawn
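A minimal sketch of what that could look like, assuming Lucene 4.3's analyzers-common module (the empty stop set and the shingle sizes are illustrative choices):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.shingle.ShingleAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

// Keep stop words (empty stop set) so phrases like "to be or not to be"
// survive, then index 2- and 3-word shingles alongside single tokens.
Analyzer base = new StandardAnalyzer(Version.LUCENE_43, CharArraySet.EMPTY_SET);
Analyzer shingles = new ShingleAnalyzerWrapper(base, 2, 3);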
Re: Search a Part of the Sentence/Complete sentence in lucene 4.3
With PhraseQuery you can specify where each term must occur in the phrase. So X must occur in position 0, David in position 1, and then manager in position 4 (skipping 2 holes). QueryParser does this for you: when it analyzes the user's phrase, if the resulting tokens have holes, it sets the positions accordingly.

And I agree: shingles are a good solution here too, but they make your index larger. CommonGramsFilter lets you shingle only specific words, e.g. you could pass your stop words to it.

Mike McCandless
http://blog.mikemccandless.com
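A sketch of that positional form for the example phrase, assuming Lucene 4.3 (the field name "body" is illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// "X David is a manager of the company" with is/a/of/the removed: each
// surviving term keeps its original position, leaving holes in between.
PhraseQuery pq = new PhraseQuery();
pq.add(new Term("body", "x"), 0);
pq.add(new Term("body", "david"), 1);
pq.add(new Term("body", "manager"), 4); // positions 2-3 held "is a"
pq.add(new Term("body", "company"), 7); // positions 5-6 held "of the"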
Performance measurements
I did some performance tests on a real index using a query having the following pattern:

termA AND (termB1 OR termB2 OR ... OR termBn)

The results were not good, and I was wondering if I may be doing something wrong (and what I would need to do to improve performance), or is it just that the OR is very inefficient. The format of the data below is illustrated by example:

5|10  time: 0.092728962; scored: 18

Here, n=5, and we measure performance for retrieval of 10 results, which takes 0.0927 ms. Had we not early terminated, we would have obtained 18 results. As you will see in the data below, the performance for n=0 is very good, but degrades drastically as n is increased.

Sriram.

0|10    time: 0.007941587; scored: 10887
0|1000  time: 0.018967384; scored: 10887
0|5000  time: 0.061943552; scored: 10887
0|1     time: 0.115327001; scored: 10887
1|10    time: 0.053950965; scored: 0
5|20    time: 0.274681853; scored: 18
10|10   time: 0.14251254; scored: 22
10|20   time: 0.282503313; scored: 22
20|10   time: 0.251964067; scored: 32
20|30   time: 0.52860957; scored: 32
50|10   time: 0.888969702; scored: 57
50|30   time: 1.078579956; scored: 57
50|50   time: 1.601169195; scored: 57
100|10  time: 1.396391061; scored: 79
100|40  time: 1.8083494; scored: 79
100|80  time: 2.921094513; scored: 79
200|10  time: 2.848105701; scored: 119
200|50  time: 3.472198462; scored: 119
200|100 time: 4.722673648; scored: 119
400|10  time: 4.463727049; scored: 235
400|100 time: 6.554119665; scored: 235
400|200 time: 9.591892527; scored: 235
Re: Performance measurements
Clarification - I used an MMap'd index and warmed it up with similar queries, as well as running the identical query many times before starting measurements. I had ample heap space.

Sriram.
Re: Performance measurements
Thanks for the detailed numbers. Nothing seems unexpected to me. Increasing query complexity or term count is simply going to increase query execution time.

I think I'll add a new rule to my informal performance guidance: query complexity of no more than ten to twenty terms is a "slam dunk", but more than that is "uncharted territory" that risks queries taking more than half a second or even multiple seconds, and requires a proof-of-concept implementation to validate reasonable query times.

-- Jack Krupansky
Re: Performance measurements
Hi,

On Wed, Jul 24, 2013 at 6:11 PM, Sriram Sankar wrote:
> termA AND (termB1 OR termB2 OR ... OR termBn)

Maybe this comment is not appropriate for your use-case, but if you don't actually need scoring from the disjunction on the right of the query, a TermsFilter will be faster when n gets large.

-- Adrien
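A minimal sketch of that suggestion, assuming Lucene 4.3's lucene-queries module (searcher and friendIds are stand-ins; the field and value names follow the example later in the thread):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Keep termA as the scored query; push the large disjunction into a
// non-scoring TermsFilter instead of a BooleanQuery of OR clauses.
List<Term> terms = new ArrayList<Term>();
for (String id : friendIds) {
    terms.add(new Term("friend", id));
}
TermsFilter filter = new TermsFilter(terms);
TopDocs hits = searcher.search(new TermQuery(new Term("name", "sriram")), filter, 10);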
Re: Performance measurements
No, I do not need scoring. This is a pure retrieval query, which matches what we used to do with Unicorn at Facebook; something like:

(name:sriram AND (friend:1 OR friend:2 ...))

This automatically gives us second degree (friends of friends). With Unicorn, we would always get sub-millisecond performance even for n>500. Should I assume that Lucene is that much worse, or is it that this use case has not been optimized?

Sriram.
Re: Performance measurements
Unicorn sounds like it was optimized for graph search. Specialized search engines can in fact beat out generalized search engines for specific use cases.

Scoring has been a major focus of Lucene. Non-scored filters are also available, but the query parsers are focused (exclusively) on scored search.

As Adrien indicates, try using raw Lucene filters and you should get much better results. Whether even that will compete with a use-case-specific (graph) search engine remains to be seen.

-- Jack Krupansky
Re: Performance measurements
On Wed, Jul 24, 2013 at 10:24 AM, Jack Krupansky wrote:
> Unicorn sounds like it was optimized for graph search. Specialized search engines can in fact beat out generalized search engines for specific use cases.

Yes and no (I worked on it). Yes, there are many aspects of Unicorn that have been optimized for graph search. But the tests I am running have very little to do with those optimizations. I am still learning about Lucene and have suspected that the scoring framework (which has to be very general) may be contributing to the performance issues. With Unicorn, we made a decision to do all scoring after retrieval and not during retrieval.

> Scoring has been a major focus of Lucene. Non-scored filters are also available, but the query parsers are focused (exclusively) on scored search.

When you say "filter", do you mean a step performed after retrieval? Or is it yet another retrieval operation?

> As Adrien indicates, try using raw Lucene filters and you should get much better results. Whether even that will compete with a use-case-specific (graph) search engine remains to be seen.

Thanks (I will study this more).

Sriram.
Re: Performance measurements
I think I've exhausted my expertise in Lucene filters, but I think you can wrap a query with a filter and also wrap a filter with a query. So, for IndexSearcher.search, you could take a filter and wrap it with ConstantScoreQuery. So, if a BooleanQuery got wrapped as a filter, it could be wrapped as a CSQ for search so that no scoring would be done.

-- Jack Krupansky
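A sketch of that wrapping, assuming Lucene 4.3 (filter is a Filter such as the TermsFilter above, and searcher an existing IndexSearcher):

import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Every hit gets the same constant score, so no per-document scoring
// work is done during retrieval.
Query csq = new ConstantScoreQuery(filter);
TopDocs hits = searcher.search(csq, 10);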
lucene indexwriter crash
Hi, I am using Lucene 4 to index very big data. The indexer crashed after three days (147 GB of current index size). I find the stack trace weird; any ideas on this will be helpful.

Exception in thread "main" java.io.FileNotFoundException: /ir/data/data/collections/KBA/2013/index/1322092800-1326499200/_9cx_Lucene40_0.tim (Input/output error)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
        at org.apache.lucene.store.FSDirectory$FSIndexOutput.<init>(FSDirectory.java:512)
        at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.java:289)
        at org.apache.lucene.store.TrackingDirectoryWrapper.createOutput(TrackingDirectoryWrapper.java:62)
        at org.apache.lucene.codecs.BlockTreeTermsWriter.<init>(BlockTreeTermsWriter.java:160)
        at org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat.fieldsConsumer(Lucene40PostingsFormat.java:304)
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.addField(PerFieldPostingsFormat.java:130)
        at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:335)
        at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
        at org.apache.lucene.index.TermsHash.flush(TermsHash.java:117)
        at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
        at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:82)
        at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:482)
        at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:419)
        at org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:313)
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:386)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1445)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1124)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1105)
        at kba.NewCorpusReader.createLuceneIndex(NewCorpusReader.java:747)
        at kba.NewCorpusReader.main(NewCorpusReader.java:825)

-- Thanks, A
Re: Tokenize String using Operators(Logical Operator, : operator etc)
Greetings, I have written a custom tokenizer class which extends the Lucene Tokenizer class. Thanks for all the replies.

Regards, DJ
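A minimal sketch of such a tokenizer, assuming Lucene 4.x (the class name and the choice to emit ':' as its own token are illustrative, not the poster's actual code):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Splits on whitespace and emits ':' as a standalone operator token.
// Offsets are omitted for brevity; a production tokenizer would also
// track OffsetAttribute.
public final class OperatorTokenizer extends Tokenizer {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private int pushback = -1; // one-character pushback buffer

    public OperatorTokenizer(Reader input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = nextChar()) != -1) {
            if (Character.isWhitespace(c)) {
                if (sb.length() > 0) break;                   // end of current word
            } else if (c == ':') {
                if (sb.length() > 0) { pushback = c; break; } // emit the word first
                termAtt.append(':');                          // emit operator alone
                return true;
            } else {
                sb.append((char) c);
            }
        }
        if (sb.length() == 0) return false;                   // nothing left
        termAtt.append(sb);
        return true;
    }

    private int nextChar() throws IOException {
        if (pushback != -1) { int ch = pushback; pushback = -1; return ch; }
        return input.read();
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pushback = -1;
    }
}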
2 exceptions in IndexWriter
Recently I've found my unit tests fail sometimes, but not always. I use Lucene 4.3.0. After investigation, I found that when I try to open an IndexWriter for a disk directory, sometimes it throws this exception:

org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock@/tmp/test-idx/write.lock

The default timeout is 1000 ms; when I set it to 3000 ms, this exception seems to disappear. But I think 1000 ms should be enough; how can this happen, and what's the recommended number?

Sometimes it throws another exception, which I think is more serious:

read past EOF: SimpleFSIndexInput

Each test purges the index folder, yet this exception still happens sometimes. I don't know the reason; how can I fix this exception?
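A sketch of the timeout change described, assuming Lucene 4.3 (the analyzer choice and path are illustrative). Note that a lock timeout usually means another IndexWriter still holds the lock on that directory, often one that was never closed, so raising the timeout may just hide the real problem:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Raise the write-lock timeout from the 1000 ms default to 3000 ms.
IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_43,
    new StandardAnalyzer(Version.LUCENE_43));
cfg.setWriteLockTimeout(3000);
IndexWriter writer = new IndexWriter(FSDirectory.open(new File("/tmp/test-idx")), cfg);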