Re: Lucene 4.0 Index Format Finalization Timetable
While we are in constant sync due to the merge, Lucene would still be updated multiple times before a Solr 4 release, and that could happen at any time - so it's really not any different.

On Wednesday, December 7, 2011, Jamie Johnson wrote:
> Yeah, the biggest issue for us is we're using the SolrCloud features.
> While I see some good things related to the Lucene and Solr code bases
> being merged, this is certainly a frustrating aspect of it, as I don't
> require some of the changes that are in Lucene 4.0 (notwithstanding
> anything that SolrCloud requires, that is).
>
> I think the best solution (assuming it works) is to try to lock a
> version of Lucene 4.0 while upgrading Solr. I'll have to test to see
> whether this works, but at least it's something.
>
> On Wed, Dec 7, 2011 at 9:02 AM, Mike Sokolov wrote:
>> My personal view, as a bystander with no more information than you, is
>> that one has to assume there will be further index format changes before
>> a 4.0 release. This is based on the number of changes in the last 9
>> months and the amount of activity on the dev list.
>>
>> For us the implication is that we need to stick with 3.x for now. You
>> might be in a different situation if you really need the 4.0 changes.
>> Maybe you can just stick with the current trunk and take responsibility
>> for patching critical bugfixes, hoping you won't have to recreate your
>> index too many times...
>>
>> -Mike
>>
>> On 12/06/2011 09:48 PM, Jamie Johnson wrote:
>>> I suppose that's fair enough. Some quick googling shows that this has
>>> been asked many times with pretty much the same response. Sorry to
>>> add to the noise.
>>>
>>> On Tue, Dec 6, 2011 at 9:34 PM, Darren Govoni wrote:
>>>> I asked here[1] and it said "Ask again later."
>>>>
>>>> [1] http://8ball.tridelphia.net/
>>>>
>>>> On 12/06/2011 08:46 PM, Jamie Johnson wrote:
>>>>> Thanks Robert. Is there a timetable for that? I'm trying to gauge
>>>>> whether it is appropriate to push for my organization to move to the
>>>>> current Lucene 4.0 implementation (we're using SolrCloud, which is
>>>>> built against trunk) or whether it's expected there will be changes
>>>>> to what is currently on trunk. I'm not looking for anything hard,
>>>>> just trying to plan as much as possible, understanding that this is
>>>>> one of the implications of using trunk.
>>>>>
>>>>> On Tue, Dec 6, 2011 at 6:48 PM, Robert Muir wrote:
>>>>>> On Tue, Dec 6, 2011 at 6:41 PM, Jamie Johnson wrote:
>>>>>>> Is there a timetable for when it is expected to be finalized?
>>>>>>
>>>>>> It will be finalized when Lucene 4.0 is released.
>>>>>>
>>>>>> --
>>>>>> lucidimagination.com

--
- Mark
http://www.lucidimagination.com
Re: "read past EOF" when merge
Can you file a JIRA, Markus? This is probably related to the new code that uses Directory for replication.

- Mark

On Nov 2, 2012, at 6:53 AM, Markus Jelsma wrote:
> Hi,
>
> For what it's worth, we have seen similar issues with Lucene/Solr from this
> week's trunk. The issue manifests itself when it wants to replicate. The
> servers had not been taken offline and did not crash when this happened.
>
> 2012-10-30 16:12:51,061 WARN [solr.handler.ReplicationHandler] - [http-8080-exec-3] - :
> Exception while writing response for params:
> file=_p_Lucene41_0.doc&command=filecontent&checksum=true&generation=6&qt=/replication&wt=filestream
> java.io.EOFException: read past EOF:
> MMapIndexInput(path="/opt/solr/cores/openindex_h/data/index.20121030152234973/_p_Lucene41_0.doc")
>     at org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:100)
>     at org.apache.solr.handler.ReplicationHandler$DirectoryFileStream.write(ReplicationHandler.java:1065)
>     at org.apache.solr.handler.ReplicationHandler$3.write(ReplicationHandler.java:932)
>
> Markus
>
> -----Original message-----
>> From: Michael McCandless
>> Sent: Fri 02-Nov-2012 11:46
>> To: java-user@lucene.apache.org
>> Subject: Re: "read past EOF" when merge
>>
>> Are you able to reproduce the corruption?
>>
>> If at any time you accidentally had two writers open on the same
>> index, it could have created this corruption.
>>
>> Writing to an index over NFS ought to be OK; however, it's not well
>> tested. You should use SimpleFSLockFactory (not the default
>> NativeFSLockFactory).
>>
>> The more "typical" way people use NFS is to write to an index on a
>> local disk, and then have other machines read from that index over NFS.
>>
>> In any event, performance is usually much worse than with local disks...
>>
>> Mike McCandless
>> http://blog.mikemccandless.com
>>
>> On Thu, Nov 1, 2012 at 10:32 PM, superruiye wrote:
>>> Oh, thanks - I didn't know about CheckIndex before... I used it to fix my
>>> corrupted index, and it is OK now.
>>> I use NFS to share my index, with no change to the LockFactory.
>>> How can I avoid this problem, rather than only fixing the index after it
>>> is suddenly broken?
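A minimal sketch of the two suggestions above - SimpleFSLockFactory for an index on NFS, and CheckIndex for detecting/repairing corruption (3.x-era API; the path is made up, and exact signatures vary a bit across versions):

  Directory dir = FSDirectory.open(new File("/shared/index"),
                                   new SimpleFSLockFactory());

  CheckIndex checker = new CheckIndex(dir);
  CheckIndex.Status status = checker.checkIndex();
  if (!status.clean) {
    // WARNING: fixIndex drops unrecoverable segments - those documents are lost
    checker.fixIndex(status);
  }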
Re: Lucene 4.1 tentative release
We are hoping for 4.1 very soon! With the holidays it is difficult to say - but 4.1 talk has been going on for some time now. It's really a matter of wrapping up some short-term work and getting some folks to do the release work. I don't think anyone can give you a date, but it's certainly in the works!

- Mark

On Dec 12, 2012, at 6:50 AM, Ramprakash Ramamoorthy wrote:
> Hello,
>
> Any 'tentative' release date for 4.1 would help. I know it is
> difficult to pin down a date, but I still couldn't resist asking, as we
> could plan accordingly. Thanks in advance.
>
> --
> With Thanks and Regards,
> Ramprakash Ramamoorthy,
> India.
Re: Luke?
If anyone is able to donate some effort, a nice future scenario would be for Luke to come fully up to date with every Lucene release:
https://issues.apache.org/jira/browse/LUCENE-2562

- Mark

On Mar 15, 2013, at 5:58 AM, Eric Charles wrote:
> For the record, I happily use Luke (with Lucene 4.1) compiled from
> https://github.com/sonarme/luke. It is also mavenized (shipped with a
> pom.xml).
>
> Thx, Eric
>
> On 14/03/2013 09:10, dizh wrote:
>> OK, tomorrow I will put it somewhere such as GitHub or Google Code.
>>
>> But I really didn't look into the details; when I compiled the Luke
>> source, I found about ten errors.
>>
>> Most are TermEnum API changes, so I fixed them.
[ANNOUNCE] Apache Lucene 4.2.1 released
April 2013, Apache Lucene™ 4.2.1 available

The Lucene PMC is pleased to announce the release of Apache Lucene 4.2.1.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

This release contains numerous bug fixes, optimizations, and improvements, some of which are highlighted below. The release is available for immediate download at:
  http://lucene.apache.org/core/mirrors-core-latest-redir.html

See the CHANGES.txt file included with the release for a full list of details.

Lucene 4.2.1 Release Highlights:

* Lucene 4.2.1 includes 9 bug fixes and 3 optimizations, including a fix for a serious bug that could result in the loss of an index. Please read CHANGES.txt for a full list of changes.

Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html).

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using has not replicated the release yet. If that is the case, please try another mirror. This also applies to Maven access.

Happy searching,

Lucene/Solr developers
[ANN] Lucene/Solr Meetup in NYC on May 11th
If you haven't heard, there is a Lucene/Solr meetup in New York next week:
http://www.meetup.com/NYC-Apache-Lucene-Solr-Meetup/calendar/13325754/

The scheduled talks are (in addition to lightning talks):

Solr 1.5 and Beyond: Yonik Seeley, author of Solr, co-founder, Lucid Imagination
Topics will include new faceting functionality, new function queries, increased scalability, field collapsing, and spatial search. There will also be a discussion about the recently announced Lucene/Solr merge: the rationale, its implications, and plans for its completion. The talk will span features already included in trunk, features slated for the next release, as well as incomplete features under consideration for future releases.

Cool Linguistic Tricks to Apply to Search Results: A source-code-level demonstration using LingPipe
Breck Baldwin, Founder, LingPipe: The talk will cover post-processing options for Twitter searches. I will cover clustering and classification, with some potential for conceptual indexing at the level of persons/orgs/locations. There will be a demo available for download that runs the examples and contains the source.

--
- Mark
http://www.lucidimagination.com
Re: NumericField API
On 6/1/10 9:34 AM, Mindaugas Žakšauskas wrote:
> It's just an early observation, as historically Lucene has been doing an
> amazing job in terms of API stability.

Yes it has :) Get ready for even more change in that area, though :)

--
- Mark
http://www.lucidimagination.com
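For readers who haven't used it, a quick sketch of the NumericField API under discussion (2.9/3.0-era; the field name and values are made up):

  Document doc = new Document();
  doc.add(new NumericField("price", Field.Store.YES, true).setDoubleValue(9.99));
  // range queries run against the same trie-encoded field:
  Query q = NumericRangeQuery.newDoubleRange("price", 5.0, 15.0, true, true);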
[ANN] Free technical webinar: Mastering the Lucene Index: Wednesday, August 11, 2010 11:00 AM PST / 2:00 PM EST / 20:00 CET
Hey all - apologies for the quick cross-post. Just to let you know, Andrzej is giving a free webinar this Wednesday. His presentations are always fantastic, so check it out:

Lucid Imagination presents a free technical webinar:
Mastering the Lucene Index
Wednesday, August 11, 2010, 11:00 AM PST / 2:00 PM EST / 20:00 CET
Sign up here: http://www.eventsvc.com/lucidimagination/081110?trk-AP

Lucene/Solr index implementation is critical to the performance of your search application and the quality of your results -- and not just at indexing time. If you're developing applications in Lucene/Solr, your index will reward care and attention -- adding power to your running search application -- all the more so as you inevitably increase the scope of your query traffic and the dimensions of your data.

Join Andrzej Bialecki, Lucene committer and inventor of the Luke index utility, for an advanced workshop on cutting-edge techniques for keeping your Lucene/Solr index at its peak potential. Andrzej will discuss and present essential strategies for index post-processing, including:

* Single-pass index splitting -- reshaping indexes for flexible deployment
* Index pruning, filtering, and multi-tiered search, or how to serve indexes (mostly) from RAM
* Bit-wise search -- how to find the best bit-wise matches, and applications in text fingerprinting

About the presenter: Andrzej Bialecki is a committer on the Apache Lucene project, a Lucene PMC member, and chairman of the Apache Nutch project. He is also the author of Luke, the Lucene Index Toolbox. Andrzej participates in many commercial projects that use Lucene, Solr, Nutch, and Hadoop to implement enterprise and vertical search.

Sign up here: http://www.eventsvc.com/lucidimagination/081110?trk-AP
Re: Difference between regular Highlighter and Fast Vector Highlighter ?
The general and short answer is:

Highlighter: highlights more query types and has a fairly rich API, but doesn't scale well to very large documents (though https://issues.apache.org/jira/browse/LUCENE-2939 is going to help a lot here). It does not require that you store term vectors, but is faster if you do.

FVH: works with fewer query types and requires that you store term vectors - but scales better than the standard Highlighter to very large documents.

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org

On Apr 1, 2011, at 8:32 AM, shrinath.m wrote:
> I was wondering what the difference is between Lucene's two highlighter
> implementations. I saw the javadoc of FVH, but it only says "another
> implementation of Lucene Highlighter"...
>
> Can someone throw some more light on this?
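For a side-by-side feel of the two APIs, a minimal sketch (3.x-era; the "content" field name and the query/reader/analyzer/storedText variables are assumed):

  // Standard Highlighter: driven by a TokenStream; term vectors optional
  QueryScorer scorer = new QueryScorer(query, "content");
  Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter(), scorer);
  TokenStream ts = TokenSources.getAnyTokenStream(reader, docId, "content", analyzer);
  String snippet = highlighter.getBestFragment(ts, storedText);

  // FastVectorHighlighter: "content" must be indexed with
  // TermVector.WITH_POSITIONS_OFFSETS, but large documents highlight faster
  FastVectorHighlighter fvh = new FastVectorHighlighter();
  String snippet2 = fvh.getBestFragment(fvh.getFieldQuery(query), reader, docId, "content", 100);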
Re: NRT consistency
On Apr 10, 2011, at 4:34 AM, Em wrote:
> Hello list,
>
> I am currently trying to understand Lucene's near-real-time feature, which
> is covered in "Lucene in Action, Second Edition".
>
> Let's say I have a distributed system with a master and a slave.
>
> In Solr, replication is handled by checking for differences in the index
> directory and copying those differences over to keep the indices consistent.
>
> How is this possible within an NRT system? Is there any way to take
> snapshots of the index writer's internal buffer and send them to the slave?

I think for near real time, Solr index replication may not be appropriate. Though I think it would be cool to use Andrzej's mythical single-pass index splitter to create a single+ doc segment that could be shipped around.

Most likely, a system that just sends each doc to each replica is going to work a lot better. That introduces other issues of course - some of which we hope to alleviate with further SolrCloud work.

> Regards,
> Em

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org
Re: NRT consistency
On Apr 11, 2011, at 1:05 PM, Em wrote:
> Thank you both!
>
> Mark, could you explain what you mean? I have never heard of such an
> index splitter. BTW: the idea of having a segment per document sounds a
> lot like a recipe for a too-many-file-descriptors exception :)

This is just an idea for rebalancing, I suppose - an index splitter lets you split up an index, and there is a multi-pass splitter in contrib. So if you wanted to move a few documents around (to rebalance after a couple of servers go down, perhaps), you might split out another index containing just the docs you want to move, and then ship that already-analyzed-and-indexed bunch of documents off to other servers. A sketch of driving the splitter follows below.

> Mike, as you said, the segments are flushed as normal.
> Let's say my server dies for whatever reason. When restarting it and
> reopening the index writer: does the IW delete the flushed file because
> it is not mentioned in the segments file, or how does Lucene handle this
> internally?
>
> Regards,
> Em

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org
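A rough sketch of the contrib multi-pass splitter mentioned above (3.x contrib/misc; the paths are made up, and the exact split() signature should be checked against your version):

  IndexReader input = IndexReader.open(FSDirectory.open(new File("/indexes/source")));
  Directory[] outputs = new Directory[] {
      FSDirectory.open(new File("/indexes/part1")),
      FSDirectory.open(new File("/indexes/part2"))
  };
  // the last argument chooses sequential doc-id ranges vs. round-robin assignment
  new MultiPassIndexSplitter().split(input, outputs, false);
  input.close();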
Re: NRT consistency
On Apr 11, 2011, at 2:41 PM, Otis Gospodnetic wrote:
> I think what's being described here is a lot like what I *think*
> ElasticSearch does, where there is no single master, and index changes
> made to any node get propagated to N-1 other nodes (N = number of index
> replicas). I'm not sure how it deals with situations where "incompatible"
> index changes are made to the same index via 2 different nodes at the
> same time. Is that what vector clocks are about?

Right - you have to have some sort of conflict detection/resolution - Amazon Dynamo uses vector clocks for this.

> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> [...]

- Mark Miller
lucidimagination.com

Lucene/Solr User Conference
May 25-26, San Francisco
www.lucenerevolution.org
Re: Extracting span terms using WeightedSpanTermExtractor
Sorry - kind of my fault. When I fixed this to use maxDocCharsToAnalyze, I didn't set a default other than 0, because I didn't really count on this being used outside of the Highlighter - which always sets maxDocCharsToAnalyze with its own default. You've got to explicitly set it higher than 0 for now.

Feel free to create a JIRA issue and we can give it its own default greater than 0.

- Mark Miller
lucidimagination.com

On Jul 6, 2011, at 5:34 PM, Jahangir Anwari wrote:
> I have a CustomHighlighter that extends the SolrHighlighter and overrides
> the doHighlighting() method. Then for each document I am trying to extract
> the span terms so that later I can use them to get the span positions. I
> tried to get the weightedSpanTerms using WeightedSpanTermExtractor but was
> unsuccessful. Below is the code that I have. Is there something missing
> that needs to be added to get the span terms?
>
> // in CustomHighlighter.java
> @Override
> public NamedList doHighlighting(DocList docs, Query query, SolrQueryRequest req, String[] defaultFields) throws IOException {
>   NamedList highlightedSnippets = super.doHighlighting(docs, query, req, defaultFields);
>   IndexReader reader = req.getSearcher().getIndexReader();
>   String[] fieldNames = getHighlightFields(query, req, defaultFields);
>   for (String fieldName : fieldNames) {
>     QueryScorer scorer = new QueryScorer(query, null);
>     scorer.setExpandMultiTermQuery(true);
>     scorer.setMaxDocCharsToAnalyze(51200);
>
>     DocIterator iterator = docs.iterator();
>     for (int i = 0; i < docs.size(); i++) {
>       int docId = iterator.nextDoc();
>       System.out.println("DocId: " + docId);
>       TokenStream tokenStream = TokenSources.getTokenStream(reader, docId, fieldName);
>       WeightedSpanTermExtractor wste = new WeightedSpanTermExtractor(fieldName);
>       wste.setExpandMultiTermQuery(true);
>       wste.setWrapIfNotCachingTokenFilter(true);
>
>       Map weightedSpanTerms = wste.getWeightedSpanTerms(query, tokenStream, fieldName); // this is always empty
>       System.out.println("weightedSpanTerms: " + weightedSpanTerms.values());
>     }
>   }
>   return highlightedSnippets;
> }
>
> Thanks,
> Jahangir
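To spell out the workaround: a value greater than 0 has to reach the extractor before getWeightedSpanTerms is called. A sketch - note that in some versions setMaxDocCharsToAnalyze is protected, so it may need to be exposed via a subclass, as here:

  WeightedSpanTermExtractor wste = new WeightedSpanTermExtractor(fieldName) {
    { setMaxDocCharsToAnalyze(51200); } // anything > 0; at 0 nothing is extracted
  };
  wste.setExpandMultiTermQuery(true);
  wste.setWrapIfNotCachingTokenFilter(true);
  Map weightedSpanTerms = wste.getWeightedSpanTerms(query, tokenStream, fieldName);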
Re: Extracting span terms using WeightedSpanTermExtractor
On Jul 7, 2011, at 5:14 PM, Jahangir Anwari wrote:
> I did notice a strange issue though. When the query is just a
> PhraseQuery (e.g. "everlasting glory"), getWeightedSpanTerms() returns all
> the span terms along with their span positions. But when the query is a
> BooleanQuery containing phrase and non-phrase terms (e.g. "everlasting
> glory" +unity), getWeightedSpanTerms() returns all the span terms, but the
> span positions are returned only for the phrase terms (i.e. "everlasting"
> and "glory"). Span positions for the non-phrase term (i.e. "unity") are
> empty. Any ideas why this could be happening?

Positions are only collected for "position sensitive" queries. The Highlighter framework that I plugged this into already runs through the TokenStream one token at a time - to highlight a TermQuery, there is no need to consult positions: just highlight every occurrence seen while marching through the TokenStream. Which means there is no need to find those positions either.

If you are looking for those positions, here is a patch to calculate them for TermQuerys as well. If you open a JIRA issue, it seems like a reasonable option to add to the class.

Index: lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java
===================================================================
--- lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java	(revision 1143407)
+++ lucene/contrib/highlighter/src/java/org/apache/lucene/search/highlight/WeightedSpanTermExtractor.java	(working copy)
@@ -133,7 +133,7 @@
       sp.setBoost(query.getBoost());
       extractWeightedSpanTerms(terms, sp);
     } else if (query instanceof TermQuery) {
-      extractWeightedTerms(terms, query);
+      extractWeightedSpanTerms(terms, new SpanTermQuery(((TermQuery)query).getTerm()));
     } else if (query instanceof SpanQuery) {
       extractWeightedSpanTerms(terms, (SpanQuery) query);
     } else if (query instanceof FilteredQuery) {

- Mark Miller
lucidimagination.com
Re: Extracting span terms using WeightedSpanTermExtractor
On Jul 8, 2011, at 5:43 AM, Jahangir Anwari wrote:
> I don't think this is the best solution; I am open to other alternatives.

Could also make it static public where it is? Either way.

- Mark Miller
lucidimagination.com
[Announce] Lucene-Eurocon Call for Participation Closes Friday, JULY 15
Hey all - just a friendly FYI reminder:

CALL FOR PARTICIPATION CLOSES FRIDAY, JULY 15!
TO SUBMIT A TOPIC, GO TO: http://2011.lucene-eurocon.org/pages/cfp

Now in its second year, Apache Lucene Eurocon 2011 comes to Barcelona, Spain, providing an unparalleled opportunity for European search application developers and technologists to connect and network. The conference takes place October 19 - 20, preceded by two days of optional training workshops October 17 - 18.

Get Involved Today! The Call for Participation Closes This Week!

Consider presenting at Apache Lucene EuroCon 2011. Submit your ideas by July 15. If you have a great Solr or Lucene story to tell, the community wants to hear about it. Share your expertise and innovations! To submit a topic, go to: http://2011.lucene-eurocon.org/pages/cfp

Sample topics of interest include:

* Lucene and Solr in the Enterprise (case studies, implementation, return on investment, etc.)
* “How We Did It” Development Case Studies
* Relevance in Practice
* Spatial/Geo Search
* Lucene and Solr in the Cloud
* Scalability and Performance Tuning
* Large Scale Search
* Real Time Search
* Data Integration/Data Management
* Tika, Nutch and Mahout
* Faceting and Categorization
* Lucene & Solr for Mobile Applications
* Multi-language Support
* Indexing and Analysis Techniques
* Advanced Topics in Lucene & Solr Development

Want to be added to the conference mailing list? Is your organization interested in sponsorship opportunities? Please send an email to i...@lucene-eurocon.org

Best Regards,
Suzanne Kushner
Lucid Imagination Corporate Marketing
www.lucidimagination.com

DATE: OCTOBER 17 - 20, 2011
LOCATION: Hotel Meliá Barcelona, C/ Avenida Sarriá, 50, Barcelona - SPAIN 08029, Tel: (0034) 93 4106060

Apache Lucene EuroCon 2011 is presented by Lucid Imagination, the commercial entity for Apache Solr/Lucene Open Source Search; proceeds of the conference benefit The Apache Software Foundation. "Lucene" and "Apache Solr" are trademarks of the Apache Software Foundation.

- Mark Miller
lucidimagination.com
Re: Questions on index Writer
My advice:

Don't close the IndexWriter - just call commit. Don't worry about forcing merges - let them happen as they do when you call commit. If you are going to use the IndexWriter again, you generally do not want to close it; calling commit is the preferred option.

- Mark Miller
lucidimagination.com

On Jul 15, 2011, at 3:03 PM, Saurabh Gokhale wrote:
> Hi All,
>
> I have the following questions about the Lucene IndexWriter. I am using
> version 3.1.0.
>
> While indexing documents:
> 1. When is a good time to commit changes (indexWriter.commit), or should
> I just close the writer after the indexing is done so that the commit
> happens automatically?
> 2. When is a good time to merge indexes (indexWriter.maybeMerge())? Is it
> just before committing the changes, or after indexing, say, X number of
> documents? (I recently upgraded from 2.9.4 to 3.1, and I see that Lucene
> 3.1 generates a lot of small index files while indexing documents.)
>
> I also have a problem where my Lucene index files sometimes get deleted
> from the index folder. I am not sure what code snippet is causing the
> existing index files to accidentally get removed.
>
> My indexer runs in a thread loop where it indexes files whenever they are
> available. When no more files are available, the indexer thread closes
> the writer and goes to sleep; after a specific time, it again opens a new
> IndexWriter on the same folder and starts indexing any new files.
>
> A. Is this a wrong way to index files?
> B. Because I close the index and open it again later, am I seeing my
> Lucene index files getting deleted?
>
> Thanks
>
> Saurabh
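A minimal sketch of the long-lived-writer pattern described above (Lucene 3.1-style API; nextBatch() and waitForMoreFiles() are hypothetical stand-ins for the indexer thread's file source):

  Directory dir = FSDirectory.open(new File("/path/to/index"));
  IndexWriter writer = new IndexWriter(dir,
      new IndexWriterConfig(Version.LUCENE_31, new StandardAnalyzer(Version.LUCENE_31)));

  while (running) {                     // the indexer thread's loop
    for (Document doc : nextBatch()) {  // hypothetical document source
      writer.addDocument(doc);
    }
    writer.commit();                    // durable checkpoint; the writer stays open
    waitForMoreFiles();                 // hypothetical sleep/wake
  }
  writer.close();                       // only at shutdown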
Re: Search within a sentence (revisited)
On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
> Mark Miller's 'SpanWithinQuery' patch seems to have the same issue.

If I remember right (it's been more than a couple of years), I did index the sentence markers at the same position as the last word in the sentence. And I think the limitation that I ate was that the word could belong to both its true sentence and the one after it.

- Mark Miller
lucidimagination.com
Re: Search within a sentence (revisited)
On Jul 20, 2011, at 7:44 PM, Mark Miller wrote:
> On Jul 20, 2011, at 11:27 AM, Peter Keegan wrote:
>> Mark Miller's 'SpanWithinQuery' patch seems to have the same issue.
>
> If I remember right (it's been more than a couple of years), I did index
> the sentence markers at the same position as the last word in the
> sentence. And I think the limitation that I ate was that the word could
> belong to both its true sentence and the one after it.

Perhaps you could index the sentence marker at both the last word of the sentence and the first word of the next sentence, if there is one. That would seem to solve the above limitation as well?

- Mark Miller
lucidimagination.com
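To make the double-marker idea concrete, in the style of the test streams later in this thread, the token/increment arrays might look like this (a sketch, where END is the sentence-marker token):

  // "1 2 3 . 4 5 6 ." - END is indexed at the position of the last word of a
  // sentence (increment 0) and again at the position of the first word of the
  // next one ("4" takes increment 0, sharing the second END's position)
  String[] TOKENS     = {"1", "2", "3", END, END, "4", "5", "6", END};
  int[]    INCREMENTS = { 1,   1,   1,   0,   1,   0,   1,   1,   0 };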
Re: Search within a sentence (revisited)
Hey Peter,

Getting sucked back into Spans...

That test should pass now - I uploaded a new patch to https://issues.apache.org/jira/browse/LUCENE-777

Further tests may be needed though.

- Mark

On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
> Hi Mark,
>
> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
> ('getTerms' removed). The last test fails (search for "1" and "3").
>
> package org.apache.lucene.search.spans;
>
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.RandomIndexWriter;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.TermQuery;
> import org.apache.lucene.search.spans.SpanNearQuery;
> import org.apache.lucene.search.spans.SpanQuery;
> import org.apache.lucene.search.spans.SpanTermQuery;
> import org.apache.lucene.util.LuceneTestCase;
>
> public class TestSentence extends LuceneTestCase {
>   public static final String field = "field";
>   public static final String START = "^";
>   public static final String END = "$";
>
>   public void testSetPosition() throws Exception {
>     Analyzer analyzer = new Analyzer() {
>       @Override
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new TokenStream() {
>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>           private final int[] INCREMENTS = {1, 1, 1, 0, 1, 1, 1, 0, 1};
>           private int i = 0;
>
>           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>
>           @Override
>           public boolean incrementToken() {
>             assertEquals(TOKENS.length, INCREMENTS.length);
>             if (i == TOKENS.length)
>               return false;
>             clearAttributes();
>             termAtt.append(TOKENS[i]);
>             offsetAtt.setOffset(i, i);
>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>             i++;
>             return true;
>           }
>         };
>       }
>     };
>     Directory store = newDirectory();
>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>     Document d = new Document();
>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>     writer.addDocument(d);
>     IndexReader reader = writer.getReader();
>     writer.close();
>     IndexSearcher searcher = newSearcher(reader);
>
>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>     SpanQuery[] clauses = new SpanQuery[2];
>     clauses[0] = makeSpanTermQuery("1");
>     clauses[1] = makeSpanTermQuery("2");
>     SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(hits.length, 1);
>
>     clauses[1] = makeSpanTermQuery("4");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(hits.length, 0);
>
>     PhraseQuery pq = new PhraseQuery();
>     pq.add(new Term(field, "3"));
>     pq.add(new Term(field, "4"));
>     hits = searcher.search(pq, null, 1000).scoreDocs;
>     assertEquals(hits.length, 1);
>
>     clauses[1] = makeSpanTermQuery("3");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(hits.length, 1);
>   }
>
>   public SpanTermQuery makeSpanTermQuery(String text) {
>     return new SpanTermQuery(new Term(field, text));
>   }
> }
Re: Search within a sentence (revisited)
Yeah, it's off trunk - I'll submit a 3X patch in a bit - I just have to change that to an IndexReader, I believe.

- Mark

On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:
> Does this patch require the trunk version? I'm using 3.2 and
> 'AtomicReaderContext' isn't there.
>
> Peter
>
> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller wrote:
>> Hey Peter,
>>
>> Getting sucked back into Spans...
>>
>> That test should pass now - I uploaded a new patch to
>> https://issues.apache.org/jira/browse/LUCENE-777
>>
>> Further tests may be needed though.
>>
>> - Mark
>>
>> On Jul 21, 2011, at 9:28 AM, Peter Keegan wrote:
>>> Hi Mark,
>>>
>>> Here is a unit test using a version of 'SpanWithinQuery' modified for 3.2
>>> ('getTerms' removed). The last test fails (search for "1" and "3").
>>>
>>> [...]
Re: Search within a sentence (revisited)
I just uploaded a patch for 3X that will work for 3.2.

On Jul 21, 2011, at 4:25 PM, Mark Miller wrote:
> Yeah, it's off trunk - I'll submit a 3X patch in a bit - I just have to
> change that to an IndexReader, I believe.
>
> - Mark
>
> On Jul 21, 2011, at 4:01 PM, Peter Keegan wrote:
>> Does this patch require the trunk version? I'm using 3.2 and
>> 'AtomicReaderContext' isn't there.
>>
>> Peter
>>
>> On Thu, Jul 21, 2011 at 3:07 PM, Mark Miller wrote:
>>> Hey Peter,
>>>
>>> Getting sucked back into Spans...
>>>
>>> That test should pass now - I uploaded a new patch to
>>> https://issues.apache.org/jira/browse/LUCENE-777
>>>
>>> Further tests may be needed though.
>>>
>>> - Mark
>>>
>>> [...]
Re: Search within a sentence (revisited)
Thanks Peter - if you supply the unit tests, I'm happy to work on the fixes.

I can likely look at this later today.

- Mark Miller
lucidimagination.com

On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:
> Hi Mark,
>
> Sorry to bug you again, but there's another case that fails the unit test
> (search within the second sentence), as shown here in the last test:
>
> package org.apache.lucene.search.spans;
>
> [imports as in the earlier test in this thread]
>
> public class TestSentence extends LuceneTestCase {
>   public static final String field = "field";
>   public static final String START = "^";
>   public static final String END = "$";
>
>   public void testSetPosition() throws Exception {
>     Analyzer analyzer = new Analyzer() {
>       @Override
>       public TokenStream tokenStream(String fieldName, Reader reader) {
>         return new TokenStream() {
>           private final String[] TOKENS = {"1", "2", "3", END, "4", "5", "6", END, "9"};
>           private final int[] INCREMENTS = {1, 1, 1, 0, 1, 1, 1, 0, 1};
>           private int i = 0;
>           PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
>           CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>           OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
>
>           @Override
>           public boolean incrementToken() {
>             assertEquals(TOKENS.length, INCREMENTS.length);
>             if (i == TOKENS.length)
>               return false;
>             clearAttributes();
>             termAtt.append(TOKENS[i]);
>             offsetAtt.setOffset(i, i);
>             posIncrAtt.setPositionIncrement(INCREMENTS[i]);
>             i++;
>             return true;
>           }
>         };
>       }
>     };
>     Directory store = newDirectory();
>     RandomIndexWriter writer = new RandomIndexWriter(random, store, analyzer);
>     Document d = new Document();
>     d.add(newField("field", "bogus", Field.Store.YES, Field.Index.ANALYZED));
>     writer.addDocument(d);
>     IndexReader reader = writer.getReader();
>     writer.close();
>     IndexSearcher searcher = newSearcher(reader);
>     SpanTermQuery startSentence = makeSpanTermQuery(START);
>     SpanTermQuery endSentence = makeSpanTermQuery(END);
>     SpanQuery[] clauses = new SpanQuery[2];
>     clauses[0] = makeSpanTermQuery("1");
>     clauses[1] = makeSpanTermQuery("2");
>     SpanNearQuery allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     SpanWithinQuery query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     ScoreDoc[] hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(1, hits.length);
>
>     clauses[1] = makeSpanTermQuery("4");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(0, hits.length);
>
>     PhraseQuery pq = new PhraseQuery();
>     pq.add(new Term(field, "3"));
>     pq.add(new Term(field, "4"));
>     System.out.println("query: " + pq);
>     hits = searcher.search(pq, null, 1000).scoreDocs;
>     assertEquals(1, hits.length);
>
>     clauses[0] = makeSpanTermQuery("4");
>     clauses[1] = makeSpanTermQuery("6");
>     allKeywords = new SpanNearQuery(clauses, Integer.MAX_VALUE, false); // SpanAndQuery equivalent
>     query = new SpanWithinQuery(allKeywords, endSentence, 0);
>     System.out.println("query: " + query);
>     hits = searcher.search(query, null, 1000).scoreDocs;
>     assertEquals(1, hits.length);
>   }
>
>   public SpanTermQuery makeSpanTermQuery(String text) {
>     return new SpanTermQuery(new Term(field, text));
>   }
> }
Re: Search within a sentence (revisited)
Sorry Peter - I introduced this problem with some kind of typo-type issue. I somehow changed an includeSpans variable to excludeSpans - but I certainly didn't mean to; it makes no sense. So I'm not sure how it happened, and I'm surprised the tests that passed still passed!

We could probably use even more tests before feeling too confident here…

I've attached a patch for 3X with the new test and fix (changed that include back to exclude).

- Mark Miller
lucidimagination.com

On Jul 25, 2011, at 10:29 AM, Mark Miller wrote:
> Thanks Peter - if you supply the unit tests, I'm happy to work on the fixes.
>
> I can likely look at this later today.
>
> - Mark Miller
> lucidimagination.com
>
> On Jul 25, 2011, at 10:14 AM, Peter Keegan wrote:
>> Hi Mark,
>>
>> Sorry to bug you again, but there's another case that fails the unit test
>> (search within the second sentence), as shown here in the last test:
>>
>> [...]
Re: Search within a sentence (revisited)
As long as you are happy with the results, I'm good. It's always nice to have an excuse to dip back into Lucene. I just don't want you to feel overconfident in the code without proper testing of it - I coded to fix the broken tests rather than taking the time to write a bunch more corner-case tests, as I likely should have if I were going to commit this thing.

- Mark Miller
lucidimagination.com

On Jul 26, 2011, at 8:56 AM, Peter Keegan wrote:
> Thanks Mark! The new patch is working fine with the tests and a few more.
> If you have particular test cases in mind, I'd be happy to add them.
>
> Thanks,
> Peter
>
> On Mon, Jul 25, 2011 at 5:56 PM, Mark Miller wrote:
>> Sorry Peter - I introduced this problem with some kind of typo-type
>> issue. I somehow changed an includeSpans variable to excludeSpans - but
>> I certainly didn't mean to; it makes no sense. So I'm not sure how it
>> happened, and I'm surprised the tests that passed still passed!
>>
>> We could probably use even more tests before feeling too confident here…
>>
>> I've attached a patch for 3X with the new test and fix (changed that
>> include back to exclude).
>>
>> - Mark Miller
>> lucidimagination.com
>>
>> [...]
Re: implicit closing of an IndexWriter
On Jul 26, 2011, at 9:52 AM, Clemens Wyss wrote:
> Side note: I am using threads when writing, and these threads are (by
> design) interrupted from time to time.

Perhaps you are seeing this: https://issues.apache.org/jira/browse/LUCENE-2239

- Mark Miller
lucidimagination.com
Re: optimize with num segments > 1 index keeps growing
On Sep 9, 2011, at 3:35 PM, Robert Muir wrote: > On Fri, Sep 9, 2011 at 3:07 PM, Uwe Schindler wrote: >> Hi, >> >> This is still some kind of bug, because expungeDeletes is documented to >> remove all deletes. Maybe we need to modify MergePolicy? >> > > we should correct the javadocs for expungeDeletes here I think: so > that it's more consistent with the javadocs for optimize? > > "Requests an expunge operation..." ? > +1 - it's a documentation bug now. - Mark Miller lucidimagination.com 2011.lucene-eurocon.org | Oct 17-20 | Barcelona - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
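For reference, the call under discussion, with the semantics the thread settles on spelled out; a minimal sketch against the 3.x API, with the Directory handling assumed:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

// Sketch: expungeDeletes() *requests* that segments holding deletions be
// merged away; the MergePolicy decides what actually happens, so it is not
// a hard guarantee that every deleted doc is physically removed.
void expunge(Directory dir) throws java.io.IOException {
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(
            Version.LUCENE_34, new StandardAnalyzer(Version.LUCENE_34)));
    writer.expungeDeletes();
    writer.close();
}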
Re: ElasticSearch
The XML query parser can map to Lucene one to one as well - hasn't seemed to pick up enough steam to be included with Solr yet, but there has been some commotion so it's likely to go in at some point. Not enough demand yet I guess. https://issues.apache.org/jira/browse/SOLR-839 XML Query Parser Support -- - Mark http://www.lucidimagination.com On Thu, Nov 17, 2011 at 6:11 PM, Peter Karich wrote: > > > > I don't think it's possible. > > Eh, of course it's possible (if I would understand it I would do it. no, > no, just joking ;)) > > and yes, Solr is shorter for some common use cases. I don't think > that there is a 'best', but JSON can map 1:1 to Lucene. > > The biggest problem with ES's syntax is that you can have super big > queries where you miss the big picture or some closing bracket (probably > would be better ;)) > => so this makes it sometimes harder to 'parse' for humans (for bigger > queries) and more chatty > > The biggest problem with Solr's syntax is that you need to escape here > and there and you have all the different brackets and dots (e.g. for > ranges, local params, term filter, ...), > which makes it hard to parse for *non*-humans and sub-intelligent people > IMO. An advantage is that you can put the URL into the browser with > Solr, which is only possible via additional software for ES (called > Elasticsearch-head), although some parameters are available as URL > parameters as well in ES > > Regards, > Peter. > > > > On Thu, Nov 17, 2011 at 3:44 PM, Michael McCandless > > wrote: > >> Maybe someone can post the equivalent query in ElasticSearch? > > I don't think it's possible. Hoss threw in the kitchen sink into his > > 'contrived' example. > > Here's a super simple example: > >
> > JSON:
> >
> > {
> >   "sort" : [
> >     { "age" : {"order" : "asc"} }
> >   ],
> >   "query" : {
> >     "term" : { "user" : "jack" }
> >   }
> > }
> >
> > Solr's HTTP:
> >
> > q=user:jack&sort=age asc
> >
> > -Yonik > > http://www.lucidimagination.com > > > > > > -- > http://jetsli.de news reader for geeks > >
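For comparison, the same "user jack, sorted by age ascending" request written directly against the Lucene API; a sketch that assumes "age" was indexed as an integer field suitable for sorting:

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Sketch: the JSON/HTTP examples above, expressed as raw Lucene calls.
TopDocs userJackByAge(IndexSearcher searcher) throws IOException {
    return searcher.search(
            new TermQuery(new Term("user", "jack")),        // term user=jack
            null,                                           // no filter
            10,                                             // top 10 hits
            new Sort(new SortField("age", SortField.INT))); // age asc
}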
Re: Regarding Compression Tool
Have you considered storing your indexes server-side? I haven't used compression but usually the trade-off of compression is CPU usage which will also be a drain on battery life. Or maybe consider how important the highlighter is to your users - is it worth the trade-off of either disk space or battery life? If it's more of a nice-to-have then maybe hold off on the feature for a later release until you've had some feedback and some more time to figure out the best solution. Of course I don't know much about your application, so take my advice with a grain of salt. On Mon, Sep 16, 2013 at 2:22 AM, Jebarlin Robertson wrote: > I am using Apache Lucene in Android. I have around 1 GB of Text documents > (Logs). When I Index these text documents using this > *new Field(ContentIndex.KEY_TEXTCONTENT, contents, Field.Store.YES, > Field.Index.ANALYZED,TermVector.WITH_POSITIONS_OFFSETS)*, the index > directory is consuming 1.59 GB. > But without the Field Store it will be around 0.59 GB of indexed size. If the > Lucene indexing is taking this much space to create the index and to store the > original text just to use the highlight feature, it will be a big problem for > mobile devices. So I just want some help: are there any alternative > ways to do this without occupying more space, to use the highlight feature in > Android-powered devices. > > > On Sun, Sep 15, 2013 at 3:26 AM, Erick Erickson >wrote: > > > bq: I thought that I can use the CompressionTool to minimize the memory > > size. > > > > This doesn't make a lot of sense. Highlighting needs the raw data to > > figure out what to highlight, so I don't see how the CompressionTool > > will help you there. > > > > And unless you have a huge document and only a very few of them, then > > the memory occupied by the uncompressed data should be trivial > > compared to the various low-level caches. This really is seeming like > > an XY problem. Perhaps if you backed up and explained _why_ this > > seems important to do people could be more helpful. > > > > > > Best, > > Erick > > > > > > On Sat, Sep 14, 2013 at 12:21 PM, Jebarlin Robertson > >wrote: > > > > > Thank you very much Erick. Actually I was using the Highlighter tool, which > > > needs the entire data to be stored to get the relevant searched > sentence. > > > But when I use that, it was consuming more memory (indexed data size + > > > Store.YES - the entire content) than the actual documents' size. > > > I thought that I can use the CompressionTool to minimize the memory size. > > > You can help, if there are any possibilities or ways to store the entire > > > content and to use the highlighter feature. > > > > > > Thank you > > > > > > > > > On Fri, Sep 13, 2013 at 6:54 PM, Erick Erickson < erickerick...@gmail.com > > > >wrote: > > > > > > > Compression is for the _stored_ data, which is not searched. Ignore > > > > the compression and ensure that you index the data. > > > > > > > > The compressing/decompressing for looking at stored > > > > values is, I believe, done at a very low level that you don't > > > > need to care about at all. > > > > > > > > If you index the data in the field, you shouldn't have to do > > > > anything special to search it. > > > > > > > > Best, > > > > Erick > > > > > > > > > > > > On Fri, Sep 13, 2013 at 1:19 AM, Jebarlin Robertson < > jebar...@gmail.com > > > > >wrote: > > > > > > > > > Hi, > > > > > > > > > > I am trying to store all the Field values using CompressionTool, But > > > > When I > > > > > search for any content, it is not finding any results.
> > > > > > > > > > Can you help me, how to create the Field with CompressionTool to > add > > to > > > > the > > > > > Document and how to decompress it when searching for any content in > > it. > > > > > > > > > > -- > > > > > Thanks & Regards, > > > > > Jebarlin Robertson.R > > > > > > > > > > > > > > > > > > > > > -- > > > Thanks & Regards, > > > Jebarlin Robertson.R > > > GSM: 91-9538106181. > > > > > > > > > -- > Thanks & Regards, > Jebarlin Robertson.R > GSM: 91-9538106181. > -- Mark J. Miller Blog: http://www.developmentalmadness.com LinkedIn: http://www.linkedin.com/in/developmentalmadness
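One pattern that addresses the question above (a hedged sketch; the 3.x field API is assumed): index the text for search without storing it, store a compressed copy as a binary field, and decompress only when displaying or highlighting a hit:

import java.util.zip.DataFormatException;
import org.apache.lucene.document.CompressionTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: searchable text is indexed but not stored; the stored copy is
// compressed bytes, trading decompression CPU for disk space on the device.
Document makeDoc(String contents) {
    Document doc = new Document();
    doc.add(new Field("text", contents, Field.Store.NO, Field.Index.ANALYZED,
                      Field.TermVector.WITH_POSITIONS_OFFSETS));
    doc.add(new Field("textZip", CompressionTools.compressString(contents),
                      Field.Store.YES));
    return doc;
}

// At display time, decompress the stored bytes and hand the original text
// to the highlighter.
String originalText(Document hit) throws DataFormatException {
    return CompressionTools.decompressString(hit.getBinaryValue("textZip"));
}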
[ANNOUNCE] Apache Lucene 4.5.1 released.
October 2013, Apache Lucene™ 4.5.1 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.5.1 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. Lucene 4.5.1 includes 8 bug fixes. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy searching, Lucene/Solr developers - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
[ANNOUNCE] Apache Lucene 4.10.3 released
December 2014, Apache Lucene™ 4.10.3 available The Lucene PMC is pleased to announce the release of Apache Lucene 4.10.3 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. The release is available for immediate download at: http://lucene.apache.org/core/mirrors-core-latest-redir.html Lucene 4.10.3 includes 12 bug fixes. See the CHANGES.txt file included with the release for a full list of changes and further details. Please report any feedback to the mailing lists (http://lucene.apache.org/core/discussion.html) Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access. Happy Holidays, Mark Miller http://www.about.me/markrmiller - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Lucene in action
Nature abhors being anything but an author by name on a second tech book. The ruse is up after one, when you have the inputs crystallized and the hourly wage in hand. Hard to find anything but executive producers after that. I'd shoot for a persuasive crowdfunding attempt.
Re: Analyzer at Query time
Dino Korah wrote: Hi All, If I am to completely avoid the query parser and use the BooleanQuery along with TermQuery, RangeQuery, PrefixQuery, PhraseQuery, etc, do the search words still get to the Analyzer before actually doing the real search? Many thanks, Dino Answer: no The QueryParser applies the analyzer and builds a Query object tree based on the results. You will have to apply the analyzer yourself if you're going to forgo QueryParser. - Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
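A minimal sketch of "applying the analyzer yourself", against the 2.x-era TokenStream API; the OR-ing of terms into a BooleanQuery mirrors roughly what QueryParser does for free text, and the field and text are parameters:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Sketch (2.x-era API): run the raw words through the analyzer and build
// TermQuerys from whatever tokens come out, so the query terms match the
// tokens that were written at index time.
BooleanQuery analyzedQuery(Analyzer analyzer, String field, String text)
        throws IOException {
    BooleanQuery query = new BooleanQuery();
    TokenStream stream = analyzer.tokenStream(field, new StringReader(text));
    for (Token token = stream.next(); token != null; token = stream.next()) {
        query.add(new TermQuery(new Term(field, token.termText())),
                  BooleanClause.Occur.SHOULD);
    }
    return query;
}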
Re: phrases and slop
Andy Goodell wrote: I thought I understood phrases and slop until one of my coworkers brought by the following example For a document that contains "quick brown fox" "quick brown fox"~0 "quick fox brown"~2 "fox quick brown"~3 all match. I would have expected "fox quick brown" to require a 4 instead of a 3, two to transpose brown and fox, two to transpose quick and fox. Why is this only 3? - andy g - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] I think it's this: push fox past quick for move 1, then past brown for move 2, then into the last spot for move 3: quick brown fox. - Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
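A sketch that makes the move counting concrete (the field name is arbitrary): slop is the total number of position moves needed to line the query terms up with the document, not a count of pairwise transpositions:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

// Sketch: against a document holding "quick brown fox" (positions 0,1,2):
//   "quick brown fox"~0 -> 0 moves
//   "quick fox brown"~2 -> 2 moves
//   "fox quick brown"~3 -> 3 moves: fox steps past quick (1), past brown (2),
//                          then into the final slot (3)
PhraseQuery foxQuickBrown() {
    PhraseQuery q = new PhraseQuery();
    q.add(new Term("f", "fox"));
    q.add(new Term("f", "quick"));
    q.add(new Term("f", "brown"));
    q.setSlop(3); // slop 2 would not match "quick brown fox"
    return q;
}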
Re: Performance, yet again
Andre Rubin wrote: Hi all, Most of our queries are very simple, of the type: Query query = new PrefixQuery(new Term(LABEL_FIELD, prefix)); Hits hits = searcher.search(query, new Sort(new SortField(LABEL_FIELD))) You might want to check out Solr's ConstantScorePrefixQuery and compare performance. Which sometimes result in 10, 20, sometimes 40 thousand hits. I get good performance if hits.length is 20,000 or less (less than 0.5 seconds). However, if it is 40,000 or more, querying takes over a second, up to 2.5 seconds. The point here is that this solution is not scaling. Any ideas I can try? I already exhausted the ideas from http://wiki.apache.org/lucene-java/ImproveSearchingSpeed I was reading about TopDocs and TopFieldDocs. Is this search method (using TopDocs) preferred over Hits? Also, there's no constructor for them without a Filter, can I just pass null? It is preferred over Hits. Hits has been deprecated and you should really migrate away from it. Is it possible to pre-sort the index, so I don't have to every time I perform a query? Any other ideas? I think in general, sorting and prefix query can be slower operations in Lucene (though sorting is generally pretty fast after the field caches are loaded). You might try the first couple suggestions there though, and others may fill in other steps you can take as well. - Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
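For reference, a sketch of the Hits-free version of the query above, using the TopFieldDocs API (2.4-era signature; the field name and result count are assumptions):

import java.io.IOException;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

// Sketch (2.4-era API): the same prefix search, but asking only for the top
// n sorted hits instead of iterating a deprecated Hits object.
ScoreDoc[] topByLabel(IndexSearcher searcher, String prefix, int n)
        throws IOException {
    TopDocs top = searcher.search(
            new PrefixQuery(new Term("label", prefix)),
            null,                       // no filter
            n,                          // only the docs you will display
            new Sort(new SortField("label", SortField.STRING)));
    return top.scoreDocs;
}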
Re: Performance, yet again
Andre Rubin wrote: On Tue, Sep 2, 2008 at 10:16 AM, Mark Miller <[EMAIL PROTECTED]> wrote: Andre Rubin wrote: Hi all, Most of our queries are very simple, of the type: Query query = new PrefixQuery(new Term(LABEL_FIELD, prefix)); Hits hits = searcher.search(query, new Sort(new SortField(LABEL_FIELD))) You might want to check out Solr's ConstantScorePrefixQuery and compare performance. I'm not familiar with Solr. It is not standard Lucene, is it? Sorry about that. Solr is a search server that is a subproject of the Lucene Apache project. You can just copy the Query from Solr's source code and use it with Lucene. ConstantScorePrefixQuery may be faster for you than PrefixQuery and it doesn't have MaxClause exception issues when your prefix matches too many terms in the index. Please report back the speed difference if you can. http://lucene.apache.org/solr/ Which sometimes result in 10, 20, sometimes 40 thousand hits. I get good performance if hits.length is 20,000 or less (less than 0.5 seconds). However, if it is 40,000 or more, querying takes over a second, up to 2.5 seconds. The point here is that this solution is not scaling. Any ideas I can try? I already exhausted the ideas from http://wiki.apache.org/lucene-java/ImproveSearchingSpeed I was reading about TopDocs and TopFieldDocs. Is this search method (using TopDocs) preferred over Hits? Also, there's no constructor for them without a Filter, can I just pass null? It is preferred over Hits. Hits has been deprecated and you should really migrate away from it. I was trying, before, to use it, but it doesn't seem as straightforward as Hits. Is there example code somewhere? I think work was done on this when Hits was deprecated. Anyone know? Is it possible to pre-sort the index, so I don't have to every time I perform a query? Any other ideas? I think in general, sorting and prefix query can be slower operations in Lucene (though sorting is generally pretty fast after the field caches are loaded). You might try the first couple suggestions there though, and others may fill in other steps you can take as well. - Mark Thanks, Mark. Andre - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Lucene Memory Leak
You should really close the IndexSearcher rather than the directory. Andy33 wrote: I have a memory leak in my lucene search code. I am able to run a few queries fine, but I eventually run out of memory. Please note that I do close and set to null the ivIndexSearcher object elsewhere. Here is the code I am using...

private synchronized Hits doQuery(String field, String queryStr, Sort sortOrder, String indexDirectory) throws Exception {
  Directory directory = null;
  try {
    Analyzer analyzer = new StandardAnalyzer();
    directory = FSDirectory.getDirectory(indexDirectory);
    //search the index
    ivIndexSearcher = new IndexSearcher(directory);
    QueryParser parser = new QueryParser(field, analyzer);
    Query query = parser.parse(queryStr);
    Hits results = ivIndexSearcher.search(query, sortOrder);
    return results;
  } finally {
    if (null != directory) {
      directory.close();
    }
    directory = null;
  }
}

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
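For reference, a sketch of the suggested shape (2.x API): one long-lived searcher shared across queries and closed only at shutdown, with the deprecated Hits swapped for TopFieldDocs so nothing depends on a live searcher after the call; the class name and result count are assumptions:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.TopFieldDocs;
import org.apache.lucene.store.FSDirectory;

// Sketch: one searcher instead of a Directory per query. Closing the
// IndexSearcher releases the underlying reader; reopening every query both
// leaks resources and throws away Lucene's warmed caches.
class Searching {
    private final IndexSearcher searcher;

    Searching(String indexDirectory) throws Exception {
        searcher = new IndexSearcher(FSDirectory.getDirectory(indexDirectory));
    }

    TopFieldDocs doQuery(String field, String queryStr, Sort sort) throws Exception {
        Query query = new QueryParser(field, new StandardAnalyzer()).parse(queryStr);
        return searcher.search(query, null, 50, sort); // fetch only the top 50
    }

    void shutdown() throws Exception {
        searcher.close(); // close the searcher, not just the directory
    }
}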
Re: PhraseQuery issues - differences with SpanNearQuery
Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors in the edit distance in scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so distance shouldn't matter with spans (I'm sure Paul will correct me if I am wrong). - Mark Yannis Pavlidis wrote: Hi, I am having an issue when using the PhraseQuery which is best illustrated with this example: I have created 2 documents to emulate URLs. One with a URL of: "http://www.airballoon.com" and title "air balloon" and the second one with URL "http://www.balloonair.com" and title: "balloon air". Test1 (PhraseQuery) == Now when I use the phrase query with - title: "air balloon" ~2 I get back: url: "http://www.airballoon.com" - score: 1.0 url: "http://www.balloonair.com" - score: 0.57 Test2 (PhraseQuery) == Now when I use the phrase query with - title: "balloon air" ~2 I get back: url: "http://www.balloonair.com" - score: 1.0 url: "http://www.airballoon.com" - score: 0.57 Test3 (PhraseQuery) == Now when I use the phrase query with - title: "air balloon" ~2 title: "balloon air" ~2 I get back: url: "http://www.airballoon.com" - score: 1.0 url: "http://www.balloonair.com" - score: 1.0 Test4 (SpanNearQuery) === spanNear([title:air, title:balloon], 2, false) I get back: url: "http://www.airballoon.com" - score: 1.0 url: "http://www.balloonair.com" - score: 1.0 I would have expected that Test1, Test2 would actually return both URLs with score of 1.0 since I am setting the slop to 2. It seems though that Lucene really favors an absolute exact match. Is it safe to assume that for what I am looking for (basically score the docs the same regardless of whether someone is searching for "air balloon" or "balloon air") it would be better to use the SpanNearQuery rather than the PhraseQuery? Any input would be appreciated. Thanks in advance, Yannis. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
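A sketch of the unordered span query from Test4 above, for anyone wanting to reproduce it in code:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Sketch: spanNear([title:air, title:balloon], 2, false) from Test4;
// slop 2 and inOrder=false, so "air balloon" and "balloon air" match
// (and, per the discussion, score) the same.
SpanQuery airNearBalloon() {
    return new SpanNearQuery(
            new SpanQuery[] {
                new SpanTermQuery(new Term("title", "air")),
                new SpanTermQuery(new Term("title", "balloon"))},
            2,      // slop
            false); // inOrder = false: either order matches
}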
Re: PhraseQuery issues - differences with SpanNearQuery
Paul Elschot wrote: On Thursday, 04 September 2008 20:39:13, Mark Miller wrote: Sounds like it's more in line with what you are looking for. If I remember correctly, the phrase query factors in the edit distance in scoring, but the SpanNearQuery will just use the combined idf for each of the terms in it, so distance shouldn't matter with spans (I'm sure Paul will correct me if I am wrong). SpanScorer will use the similarity slop factor for each matching span size to adjust the effective frequency. The span size is the difference in position between the first and last matching term, and idf is not used for scoring Spans. The reason why idf is not used could be that there is no basic score value associated with inner spans; only top level spans are scored by SpanScorer. For more details, please consult the SpanScorer code. Regards, Paul Elschot Right, my fault, it's the query normalization in the weight which uses idf (by pulling from each clause in the span). So it's kind of factored into the score, but not in the way I implied. Sorry, my bad on the info. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: PhraseQuery issues - differences with SpanNearQuery
SpanScorer will use the similarity slop factor for each matching span size to adjust the effective frequency. Regards, Paul Elschot You have pointed this out to me before. One day I will remember. Every time I look things over again I miss it, and I couldn't find that email in the archives. It's done here, if the original questioner is interested: SpanScorer

protected boolean setFreqCurrentDoc() throws IOException {
  if (!more) {
    return false;
  }
  doc = spans.doc();
  freq = 0.0f;
  while (more && doc == spans.doc()) {
    int matchLength = spans.end() - spans.start();
    freq += getSimilarity().sloppyFreq(matchLength);
    more = spans.next();
  }
  return more || (freq != 0);
}

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Frequently updated fields
You might check out the tagindex issue in JIRA as well. Haven't looked at it myself, but I believe it's supposed to be an option for this. Gerardo Segura wrote: I think the important question is: in general, how to cope with frequently changing fields. Karl Wettin wrote: Hi Wojciech, can you please give us a bit more specific information about the metadata fields that will change? I would recommend looking at creating filters from your primary persistency for query clauses such as unread/read, mailbox folders, etc. karl On 12 Sep 2008, at 13:57, Wojciech Strzałka wrote: Hi. I'm new to Lucene and I would like to get a few answers (they can be lame) I want to index a large amount of emails using Lucene (maybe SOLR), not only the contents but also some metadata like state or flags. The problem is that the metadata will change during the mail lifecycle; although much smaller, updating this information will require reindexing the whole mail content, which I see as a performance bottleneck. I have the data in a DB also, so my first question is: - are there any best practices to implement my needs (querying both Lucene & DB and then merging in memory? close one eye and re-index the whole content on every metadata change? others?) - is Lucene a good solution for my problem at all? - are there any plans to implement field updates in a more efficient way than delete/insert of the whole document? if yes, what's the time horizon? Best regards Wojtek - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
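A sketch of the filter idea Karl mentions: leave the immutable mail text in Lucene and build a Filter for the fast-changing read/unread state from the database at query time (2.x-era Filter API; the field name and the set of unread ids are hypothetical):

import java.io.IOException;
import java.util.BitSet;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.search.Filter;

// Sketch (2.x Filter API): the index stores each mail once, keyed by a
// stable "mailId" field; read/unread lives in the DB. At query time, turn
// the DB's set of unread ids into a Filter instead of reindexing the mail
// on every state change.
class UnreadFilter extends Filter {
    private final Set<String> unreadIds; // hypothetical: loaded from the DB

    UnreadFilter(Set<String> unreadIds) { this.unreadIds = unreadIds; }

    public BitSet bits(IndexReader reader) throws IOException {
        BitSet bits = new BitSet(reader.maxDoc());
        for (String id : unreadIds) {
            TermDocs td = reader.termDocs(new Term("mailId", id));
            while (td.next()) {
                bits.set(td.doc()); // mark every doc carrying this mail id
            }
            td.close();
        }
        return bits;
    }
}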
Re: StandardAnalyzer exclude numbers
[EMAIL PROTECTED] wrote: Hello Is it possible to exclude numbers using StandardAnalyzer just like SimpleAnalyzer? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] It's possible, but it's tricky. You would want to copy the StandardAnalyzer into your own Analyzer and then modify the grammar. StandardTokenizerImpl.jflex is where to look, but you will have to learn how to use/compile jflex (look at the build file) to build the parser classes. What you would do though, is start by trying to remove the digit from the Alphanum regex in StandardTokenizerImpl.jflex. You might want to rename alphanum after such a move. That may be as far as you need to go. - Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: StandardAnalyzer exclude numbers
Agreed. I am always diving into that analyzer too fast. Possibly premature optimization thoughts as well. But scanning the token afterwards in a filter and breaking/skipping if you find a number will be much easier and possibly not too much slower. Depends on how involved you are/want to get I suppose. Personally I would prefer to start a new analyzer for such a significant change, but for the average Lucene user, pre/post processing is always going to make more sense. Plus there is enough overlap in the code that I can see plenty of people preferring not to split off. 黄成 wrote: > why not use a token filter? > > On Mon, Sep 22, 2008 at 8:36 PM, Mark Miller <[EMAIL PROTECTED]> wrote: > > >> [EMAIL PROTECTED] wrote: >> >> >>> Hello >>> >>> Is it possible to exclude numbers using StandardAnalyzer just like >>> SimpleAnalyzer? >>> >>> - >>> To unsubscribe, e-mail: [EMAIL PROTECTED] >>> For additional commands, e-mail: [EMAIL PROTECTED] >>> >>> It's possible, but it's tricky. You would want to copy the StandardAnalyzer >>> >> into your own Analyzer and then modify the grammar. >> StandardTokenizerImpl.jflex is where to look, but you will have to learn how >> to use/compile jflex (look at the build file) to build the parser classes. >> What you would do though, is start by trying to remove the digit from the >> Alphanum regex in StandardTokenizerImpl.jflex. You might want to rename >> alphanum after such a move. That may be as far as you need to go. >> >> >> - Mark >> >> >> - >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> >> > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
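A sketch of the token-filter approach being discussed (2.4-era Token-reuse API): drop any token containing a digit after StandardTokenizer, rather than recompiling the jflex grammar:

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Sketch (2.4-era API): skip tokens that contain any digit. Slower than
// changing the grammar, but far simpler to write and maintain.
class NoNumbersFilter extends TokenFilter {
    NoNumbersFilter(TokenStream in) { super(in); }

    public Token next(Token reusableToken) throws IOException {
        for (Token t = input.next(reusableToken); t != null;
                t = input.next(reusableToken)) {
            if (!containsDigit(t.termBuffer(), t.termLength())) {
                return t; // keep the token
            }
            // else: drop it and pull the next one
        }
        return null; // end of stream
    }

    private static boolean containsDigit(char[] buf, int len) {
        for (int i = 0; i < len; i++) {
            if (Character.isDigit(buf[i])) return true;
        }
        return false;
    }
}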
Re: sharing SearchIndexer
simon litwan wrote: Hi all, I tried to reuse the IndexSearcher among all of the threads that are doing searches, as described in (http://wiki.apache.org/lucene-java/LuceneFAQ#head-48921635adf2c968f7936dc07d51dfb40d638b82). This works fine, but our application does continuous indexing, so the index is changing and the IndexSearcher initialized at startup is not notified to reload the index. Is there a way to force the IndexSearcher to reload the index if the index has changed? Thanks in advance, simon - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] You want to reopen the Reader under the IndexSearcher, or open a new IndexSearcher. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
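A sketch of the reopen pattern (2.4-era API): periodically check the shared reader and swap in a new searcher when the index has changed. Note the hedge in the comments; production code also needs to let in-flight searches finish before closing the old reader, which is omitted here:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// Sketch (2.4 API): reopen() returns the same reader if nothing changed, or
// a cheap incremental reader if it did; swap the searcher and close the old
// reader. Real code must also guard against closing a reader that running
// searches still use (e.g. via reference counting); omitted here.
synchronized IndexSearcher refresh(IndexSearcher current) throws IOException {
    IndexReader old = current.getIndexReader();
    IndexReader latest = old.reopen();
    if (latest == old) {
        return current;          // index unchanged, keep the searcher
    }
    old.close();                 // unsafe if searches are still in flight!
    return new IndexSearcher(latest);
}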
Re: QueryParser
Right, just don't share the same instance across threads. - Mark On Oct 18, 2008, at 3:11 PM, "Rafael Almeida" <[EMAIL PROTECTED]> wrote: QueryParser's documentation says: "Note that QueryParser is not thread-safe." It only means that the same instance of QueryParser can't be used by multiple threads, right? But if each thread has its own QueryParser instance, then it's OK, right? BTW, the link http://lucene.apache.org/java/docs/queryparsersyntax.html on http://lucene.apache.org/java/2_4_0/api/core/org/apache/lucene/queryParser/QueryParser.html seems to be broken. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
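A small sketch of the per-thread pattern (the field name and analyzer are assumptions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;

// Sketch: QueryParser instances are cheap, so give each thread its own
// rather than synchronizing on a shared one.
private static final ThreadLocal<QueryParser> PARSER = new ThreadLocal<QueryParser>() {
    protected QueryParser initialValue() {
        return new QueryParser("contents", new StandardAnalyzer());
    }
};

// usage inside any thread:
// Query q = PARSER.get().parse("lucene AND thread");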
Re: Hiring etiquette
Richard Marr wrote: Hi all, Is there a mailing-list-appropriate way to hire coders with Lucene experience? I don't want to just spam the list because I don't want to crap where I live. I'm a programmer, not a recruiter, if that makes any difference. Cheers, Rich - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] Generally, people just throw out the request to the list and no one really complains, but I do think it's frowned upon. I'm sure someone else can give the 'official' stance (since we are not a job board/list I assume it's against). You might instead limit your email to those that have agreed to be contacted at http://wiki.apache.org/lucene-java/Support - Mark - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Multi -threaded indexing of large number of PDF documents
It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I remember right, the IndexWriter is now pretty efficient, and there isn't much need to index to smaller indexes and then merge. There is a lot of juggling to get wrong with that approach. - Mark Sudarsan, Sithu D. wrote: Hi, We are trying to index large collection of PDF documents, sizes varying from few KB to few GB. Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for text extraction) and on Windows as well as CentOS Linux. Used java -Xms and -Xmx options, both at 1080m, even though we have 4GB on Windows and 32 GB on Linux with sufficient swap space. With just one thread, though it takes time, the indexing happens. To speed up, we tried multi-threaded approach with one Indexwriter for each thread. After all the threads finish their indexing, they are merged. With about 100 sample files and 10 threads, the program works pretty well and it does speed up. But, when we run on document collection of about 25GB, couple of threads just hang, while the rest have completed their indexing. The program never gracefully exits, and the threads that seem to have died ensure that the final index merging does not take place. The program needs to be manually terminated. Tried both with simple analyzer as well as standard analyzer, with similar results. Any useful tips / solutions welcome. Thanks in advance, Sithu Sudarsan Graduate Research Assistant, UALR & Visiting Researcher, CDRH/OSEL [EMAIL PROTECTED] [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
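A sketch of the single-writer setup suggested above (2.4-era constructor): extraction threads share one IndexWriter, which is internally thread-safe, so there are no per-thread indexes and no final merge step. nextPdf() and extractPdf() are hypothetical stand-ins for the work queue and the PDFBox call:

import java.io.File;
import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

// Sketch: all extraction threads call addDocument() on the same writer;
// text extraction (the slow part) still runs in parallel, and there is no
// merge-of-small-indexes step to get wrong.
final IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/index"),
        new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

Runnable worker = new Runnable() {
    public void run() {
        try {
            for (File pdf = nextPdf(); pdf != null; pdf = nextPdf()) {
                Document doc = extractPdf(pdf); // hypothetical PDFBox wrapper
                writer.addDocument(doc);        // safe from many threads
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
};
// start N of these workers; when they all finish: writer.close();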
Re: Multi -threaded indexing of large number of PDF documents
Glen Newton wrote: 2008/10/23 Mark Miller <[EMAIL PROTECTED]>: It sounds like you might have some thread synchronization issues outside of Lucene. To simplify things a bit, you might try just using one IndexWriter. If I remember right, the IndexWriter is now pretty efficient, and there isn't much need to index to smaller indexes and then merge. There is a lot of juggling to get wrong with that approach. While I agree it is easier to have a single IndexWriter, if you have multiple cores you will get significant speed-ups with multiple IndexWriters, even with the impact of merging at the end. #IndexWriters = # physical cores is a reasonable rule of thumb. General speed-up estimate: # cores * 0.6 - 0.8 over a single IndexWriter YMMV When I get around to it, I'll re-run my tests varying the # of IndexWriters & post. -Glen Hey Mr McCandless, what's up with that? Can IndexWriter be made to be as efficient as using multiple writers? Where do you suppose the hold up is? Number of threads doing merges? Sync contention? I hate the idea of multiple IndexWriter/Readers being more efficient than a single instance. In an ideal Lucene world, a single instance would hide the complexity and use the number of threads needed to match multiple instance performance. - Mark Sudarsan, Sithu D. wrote: Hi, We are trying to index large collection of PDF documents, sizes varying from few KB to few GB. Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for text extraction) and on Windows as well as CentOS Linux. Used java -Xms and -Xmx options, both at 1080m, even though we have 4GB on Windows and 32 GB on Linux with sufficient swap space. With just one thread, though it takes time, the indexing happens. To speed up, we tried multi-threaded approach with one Indexwriter for each thread. After all the threads finish their indexing, they are merged. With about 100 sample files and 10 threads, the program works pretty well and it does speed up. But, when we run on document collection of about 25GB, couple of threads just hang, while the rest have completed their indexing. The program never gracefully exits, and the threads that seem to have died ensure that the final index merging does not take place. The program needs to be manually terminated. Tried both with simple analyzer as well as standard analyzer, with similar results. Any useful tips / solutions welcome. Thanks in advance, Sithu Sudarsan Graduate Research Assistant, UALR & Visiting Researcher, CDRH/OSEL [EMAIL PROTECTED] [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Change the merge factor for an existing index?
Just change it. Merges will start obeying the new merge factor seamlessly. - Mark On Oct 27, 2008, at 1:07 PM, Tom Saulpaugh <[EMAIL PROTECTED]> wrote: Hello, We are currently using lucene v2.1 and we are planning to upgrade to lucene v2.4. Can we change the merge factor for an existing index and then add more documents to that index? Is there some kind of upgrade path like using optimize to move an existing index to a different merge factor? Thanks, Tom - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: OutOfMemory Problems Lucene 2.4 / Tomcat
How many fields are you sorting on? Lots of unique terms in those fields? - Mark On Oct 29, 2008, at 6:03 PM, "Todd Benge" <[EMAIL PROTECTED]> wrote: Hi, I'm the lead engineer for search on a large website using lucene for search. We're indexing about 300M documents in ~ 100 indices. The indices add up to ~ 60G. The indices are sorted into 4 different MultiSearchers with the largest handling ~50G. The code is basically like the following:

private static MultiSearcher searcher;

public void init(File[] files) throws IOException {
  IndexSearcher[] searchers = new IndexSearcher[files.length];
  int i = 0;
  for (File file : files) {
    searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file));
  }
  searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
  return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4. Performance is good but servers are consistently hanging with OutOfMemory errors. We're allocating 4G in the heap to each server. Is there any way to control the amount of memory Lucene consumes for caching? Any other suggestions on fixing the memory errors? Thanks, Todd - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: OutOfMemory Problems Lucene 2.4 / Tomcat
The term, terminfo, indexreader internals stuff is prob on the low end compared to the size of your field caches (needed for sorting). If you are sorting by String I think the space needed is 32 bits x number of docs + an array to hold all of the unique terms. So checking 300 million docs (I know you are actually breaking it up smaller than that, but for example) and ignoring things like String chars being variable byte lengths and storing the length, etc, and randomly picking 50,000 unique terms at 6 chars per: 32 bits x 300,000,000 + 50,000 x 6 x 16 bits = 1,144.98 megabytes That's per field you're sorting on. If you are sorting on an int field it should be closer to 32 bits x num docs (shorts, 16 bits x num docs, etc). So you have those field caches, plus the IndexReader terminfo, term stuff, plus whatever RAM your app needs beyond Lucene. 4 gig might just not *quite* cut it is my guess. Todd Benge wrote: There's usually only a couple sort fields and a bunch of terms in the various indices. The terms are user entered on various media so the number of terms is very large. Thanks for the help. Todd On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote: Hi, I'm the lead engineer for search on a large website using lucene for search. We're indexing about 300M documents in ~ 100 indices. The indices add up to ~ 60G. The indices are sorted into 4 different MultiSearchers with the largest handling ~50G. The code is basically like the following:

private static MultiSearcher searcher;

public void init(File[] files) throws IOException {
  IndexSearcher[] searchers = new IndexSearcher[files.length];
  int i = 0;
  for (File file : files) {
    searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file));
  }
  searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
  return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4. Performance is good but servers are consistently hanging with OutOfMemory errors. We're allocating 4G in the heap to each server. Is there any way to control the amount of memory Lucene consumes for caching? Any other suggestions on fixing the memory errors? Thanks, Todd - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: OutOfMemory Problems Lucene 2.4 / Tomcat
Michael's got some great points (he's the Lucene master), especially possibly turning off norms if you can, but for an index like that I'd recommend Solr. Solr sharding can be scaled to billions (at least a billion or two, anyway) with few limitations (of course there are a few). Plus it has further caching options, IndexReader refresh management, etc etc etc - Mark On Oct 29, 2008, at 10:30 PM, "Todd Benge" <[EMAIL PROTECTED]> wrote: Thanks Mark. I appreciate the help. I thought our memory may be low but wanted to verify whether there is any way to control memory usage. I think we'll likely upgrade the memory on the machines but that may just delay the inevitable. Wondering if anyone else has encountered similar issues with indices of a similar size. I've been thinking we will need to move to a clustered solution and have been reading on Hadoop, Nutch, Solr & Terracotta for possibilities such as index sharding. Has anyone implemented a solution using Hadoop or Terracotta for a large-scale system? Just wondering the pros/cons of the various approaches. Thanks, Todd On Wed, Oct 29, 2008 at 6:07 PM, Mark Miller <[EMAIL PROTECTED]> wrote: The term, terminfo, indexreader internals stuff is prob on the low end compared to the size of your field caches (needed for sorting). If you are sorting by String I think the space needed is 32 bits x number of docs + an array to hold all of the unique terms. So checking 300 million docs (I know you are actually breaking it up smaller than that, but for example) and ignoring things like String chars being variable byte lengths and storing the length, etc, and randomly picking 50,000 unique terms at 6 chars per: 32 bits x 300,000,000 + 50,000 x 6 x 16 bits = 1,144.98 megabytes That's per field you're sorting on. If you are sorting on an int field it should be closer to 32 bits x num docs (shorts, 16 bits x num docs, etc). So you have those field caches, plus the IndexReader terminfo, term stuff, plus whatever RAM your app needs beyond Lucene. 4 gig might just not *quite* cut it is my guess. Todd Benge wrote: There's usually only a couple sort fields and a bunch of terms in the various indices. The terms are user entered on various media so the number of terms is very large. Thanks for the help. Todd On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote: Hi, I'm the lead engineer for search on a large website using lucene for search. We're indexing about 300M documents in ~ 100 indices. The indices add up to ~ 60G. The indices are sorted into 4 different MultiSearchers with the largest handling ~50G. The code is basically like the following:

private static MultiSearcher searcher;

public void init(File[] files) throws IOException {
  IndexSearcher[] searchers = new IndexSearcher[files.length];
  int i = 0;
  for (File file : files) {
    searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file));
  }
  searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
  return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4. Performance is good but servers are consistently hanging with OutOfMemory errors. We're allocating 4G in the heap to each server. Is there any way to control the amount of memory Lucene consumes for caching? Any other suggestions on fixing the memory errors? Thanks, Todd - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Document marked as deleted
John G wrote: I have an index with a particular document marked as deleted. If I use the search method that returns TopDocs and that deleted document satisfies the search criteria, will it be included in the returned TopDocs object even though it has been marked as deleted? Thanks in advance. John G. Nope. It will still be loaded in the field cache and used for corpus statistics I believe, but it won't be returned in search results, no matter which search method on searcher you are using. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: OutOfMemory Problems Lucene 2.4 / Tomcat
20 fields on a huge index? Wow - not sure there is a ton you can do with that...anyone have any suggestions for that one? Distributed should help I suppose, but that's a lot of sort fields for a large index. If LUCENE-831 ever gets off the ground you will be able to change the cache used, and possibly use something that spills over to disk. PabloS wrote: Hi, I'm having a similar problem with my application, although we are using lucene 2.3.2. The problem we have is that we are required to sort on most of the fields (20 at least). Is there any way of changing the cache being used? I can't seem to find a way, since the cache is being accessed using the FieldCache.DEFAULT static field. Any tip would be appreciated, otherwise I'll have to start looking for a clustered solution like Todd. Thanks in advance. Pablo markrmiller wrote: The term, terminfo, indexreader internals stuff is prob on the low end compared to the size of your field caches (needed for sorting). If you are sorting by String I think the space needed is 32 bits x number of docs + an array to hold all of the unique terms. So checking 300 million docs (I know you are actually breaking it up smaller than that, but for example) and ignoring things like String chars being variable byte lengths and storing the length, etc, and randomly picking 50,000 unique terms at 6 chars per: 32 bits x 300,000,000 + 50,000 x 6 x 16 bits = 1,144.98 megabytes That's per field you're sorting on. If you are sorting on an int field it should be closer to 32 bits x num docs (shorts, 16 bits x num docs, etc). So you have those field caches, plus the IndexReader terminfo, term stuff, plus whatever RAM your app needs beyond Lucene. 4 gig might just not *quite* cut it is my guess. Todd Benge wrote: There's usually only a couple sort fields and a bunch of terms in the various indices. The terms are user entered on various media so the number of terms is very large. Thanks for the help. Todd On 10/29/08, Todd Benge <[EMAIL PROTECTED]> wrote: Hi, I'm the lead engineer for search on a large website using lucene for search. We're indexing about 300M documents in ~ 100 indices. The indices add up to ~ 60G. The indices are sorted into 4 different MultiSearchers with the largest handling ~50G. The code is basically like the following:

private static MultiSearcher searcher;

public void init(File[] files) throws IOException {
  IndexSearcher[] searchers = new IndexSearcher[files.length];
  int i = 0;
  for (File file : files) {
    searchers[i++] = new IndexSearcher(FSDirectory.getDirectory(file));
  }
  searcher = new MultiSearcher(searchers);
}

public Searcher getSearcher() {
  return searcher;
}

We're seeing a high cache rate with Term & TermInfo in Lucene 2.4. Performance is good but servers are consistently hanging with OutOfMemory errors. We're allocating 4G in the heap to each server. Is there any way to control the amount of memory Lucene consumes for caching? Any other suggestions on fixing the memory errors? Thanks, Todd - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance of never optimizing
Am I missing your benchmark algorithm somewhere? We need it. Something doesn't make sense. - Mark Justus Pendleton wrote: Howdy, I have a couple of questions regarding some Lucene benchmarking and what the results mean[3]. (Skip to the numbered list at the end if you don't want to read the lengthy exegesis :) I'm a developer for JIRA[1]. We are currently trying to get a better understanding of Lucene, and our use of it, to cope with the needs of our larger customers. These "large" indexes are only a couple hundred thousand documents but our problem is compounded by the fact that they have a relatively high rate of modification (=delete+insert of new document) and our users expect these modification to show up in query results pretty much instantly. Our current default behaviour is a merge factor of 4. We perform an optimization on the index every 4000 additions. We also perform an optimize at midnight. Our fundamental problem is that these optimizations are locking the index for unacceptably long periods of time, something that we want to resolve for our next major release, hopefully without undermining search performance too badly. In the Lucene javadoc there is a comment, and a link to a mailing list discussion[2], that suggests applications such as JIRA should never perform optimize but should instead set their merge factor very low. In an attempt to understand the impact of a) lowering the merge factor from 4 to 2 and b) never, ever optimizing on an index (over the course of years and millions of additions/updates) I wanted to try to benchmark Lucene. I used the contrib/benchmark framework and wrote a small algorithm that adds documents to an index (using the Reuters doc generator), does a search, does an optimize, then does another search. All the pretty pictures can be seen at: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs I have several questions, hopefully they aren't overwhelming in their quantity :-/ 1. Why does the merge factor of 4 appear to be faster than the merge factor of 2? 2. Why does non-optimized searching appear to be faster than optimized searching once the index hits ~500,000 documents? 3. There appears to be a fairly sizable performance drop across the board around 450,000 documents. Why is that? 4. Searching performance appears to decrease towards a fairly pessimistic 20 searches per second (for a relatively simple search). Is this really what we should expect long-term from Lucene? 5. Does my benchmark even make sense? I am far from an expert on benchmarking so it is possible I'm not measuring what I think I am measuring. Thanks in advance for any insight you can provide. This is an area that we very much want to understand better as Lucene is a key part of JIRA's success, Cheers, Justus JIRA Developer [1]: http://www.atlassian.com [2]: http://www.gossamer-threads.com/lists/lucene/java-dev/47895 [3]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Performance of never optimizing
Been a while since I've been in the benchmark stuff, so I am going to take some time to look at this when I get a chance, but off the cuff I think you are opening and closing the reader for each search. Try using the openreader task before the 100 searches and then the closereader task. That will ensure you are reusing the same reader for each search. Hope to analyze further soon. - Mark Justus Pendleton wrote: On 03/11/2008, at 11:07 PM, Mark Miller wrote: Am I missing your benchmark algorithm somewhere? We need it. Something doesn't make sense. I thought I had included it at [1] before, but apparently not; my apologies for that. I have updated that wiki page. I'll also reproduce it here:

{ "Rounds"
    ResetSystemErase
    { CreateIndex }
    { AddDoc } : NUM_DOCS
    { CloseIndex }
    [ "UnoptSearch" Search ] : 100
    { "Optimize" OpenIndex Optimize CloseIndex }
    [ "OptSearch" Search ] : 100
    NewRound
} : 6

NUM_DOCS increases by 5,000 for each iteration. What constitutes a "proper warm up before measuring"? [1]: http://confluence.atlassian.com/display/JIRACOM/Lucene+graphs Cheers, Justus - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: searchable archives
Or Nabble or MarkMail - Mark On Nov 7, 2008, at 3:33 PM, Dragon Fly <[EMAIL PROTECTED]> wrote: http://www.gossamer-threads.com/lists/lucene/java-user/ Date: Fri, 7 Nov 2008 14:27:38 -0700 From: [EMAIL PROTECTED] To: java-user@lucene.apache.org Subject: searchable archives Hey, Is this list available somewhere that you can search the entire archives at one time? Thanks, Chad - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Multisearcher
Not out of the box, but it's fairly trivial to copy MultiSearcher and modify it so that a different query goes to each subsearcher. - Mark On Nov 8, 2008, at 5:45 AM, "Shishir Jain" <[EMAIL PROTECTED]> wrote: Hi, Doc1: Field1, Field2 Doc2: Field1, Field2 If I create the index such that Field1 is stored in index1 and Field2 is stored in index2, can I use MultiSearcher to search for Field1 in index1 and Field2 in index2 and get the merged results? Thanks & Regards, Shishir Jain - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: ScoreDoc
Excuse me. Some unchecked logic there concerning HitCollector. A HitCollector hits all matching documents, not all documents. Sometimes that can be a lot. With TopDocs, you only ask for the top scoring documents, which is usually fewer than all matching docs, and generally what people are interested in. Sorry for the confusion there - need to double check what I write... Mark Miller wrote: There is definitely some stale javadoc in Lucene here and there. All of what you're talking about has been shaken up recently with the deprecation of Hits. Hits used to pretty much be considered the non-expert API, but it's been tossed in favor of the TopDocs APIs. The HitCollector stuff has been marked expert because a lot of people get into trouble using something that hits every doc in the index on a search, not just the matching docs from the search. If you don't understand what's going on, you can, and many have, make some pretty slow code. The expert stuff just means, understand what's going on before you start to play here ;) I don't necessarily think it doesn't belong in a tutorial - assuming the guy who wrote the tutorial understood what he was doing. As for the stale javadoc though, I'm sure patches would be welcome ;) It's a group of volunteers all scratching their own itches here, so it's likely you will find things like that. Best bet is to pitch in when you see it, and I'm sure one of the committers will apply your patch if it's appropriate. - Mark ChadDavis wrote: In fact, the search method used to populate the collector used in that sample code also claims to be low level. It suggests using the IndexSearcher.search( query ) method instead, but that method is deprecated. Lower-level search API. HitCollector.collect(int,float) is called for every matching document. Applications should only use this if they need *all* of the matching documents. The high-level search API (Searcher.search(Query)) is usually more efficient, as it skips non-high-scoring hits. Note: The score passed to this method is a raw score. In other words, the score will not necessarily be a float whose value is between 0 and 1. Is this just stale documentation? On Sun, Nov 9, 2008 at 3:28 PM, ChadDavis <[EMAIL PROTECTED]>wrote: The sample code uses a ScoreDoc array to hold the hits. ScoreDoc[] hits = collector.topDocs().scoreDocs; But the JavaDoc says "Expert: Returned by low-level search implementations." Why would the tutorial sample code use an "expert" api? - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Highlighter and Phrase Queries
Check out the SpanScorer. - Mark On Nov 10, 2008, at 8:25 AM, "Sertic Mirko, Bedag" <[EMAIL PROTECTED] > wrote: [EMAIL PROTECTED] I am searching for a solution to make the Highlighter run properly in combination with phrase queries. I want to highlight text with a phrase query like "windows printserver", and I get the following highlighted: "windows printservers" are good blah blah "windows" manages "printserver" blah blah. So the phrases and the single terms are highlighted, but I just want to highlight the phrases. How could this be done? Thanks in advance Mirko - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Boosting results
Michael McCandless wrote: But: it's slow to load a field for the first time. LUCENE-1231 (column-stride fields) aims to greatly speed up the load time. Test it out though. In some recent testing I was doing it was *way* faster than I thought it would be based on what I had been reading. Of course if every term is unique, it's going to be worse, but even with like 10 mil docs and a few hundred thousand uniques, either I was doing something wrong, or even on my 4200rpm laptop hd, it loaded like nothing (of course even a second load and then a search is much slower than just a warmed search though). Was hoping to see some advantage with a payload implementation with LUCENE-831, but it really didn't seem to help... It's also memory-consuming. Finally, you might want to instead look at Solr, which provides facet counting out of the box, rather than roll your own... Mike Stefan Trcek wrote: On Friday 07 November 2008 18:46:17 Michael McCandless wrote: Sorting populates the field cache (internal to Lucene) for that field, meaning it loads all values for all docs and holds them in memory. This makes the first query slow, and, consumes RAM, in proportion to how large your index is. Can you point me to the API for accessing these cached values? I'd like to have a function like: "List all unique values of the categories (A, B, C...) for documents that match this query". i.e. for a query "text:john" show up categories=(A,B) Doc 1: category=A text=john Doc 2: category=B text=mary Doc 3: category=B text=john Doc 4: category=C text=mary This is intended for search refinement (I use about 200 categories). Sorry for hijacking this thread. Stefan - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
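A sketch of Stefan's "unique categories for documents matching a query", using the field cache plus a HitCollector (2.x-era API; it assumes "category" is an untokenized field with one value per doc, and entries may be null for docs without the field):

import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.HitCollector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Sketch (2.x API): load the per-document category from the field cache,
// then collect the categories of matching docs. The first call pays the
// cache-load cost; later calls reuse the cached array.
Set<String> categoriesFor(IndexSearcher searcher, Query query) throws IOException {
    IndexReader reader = searcher.getIndexReader();
    final String[] byDoc = FieldCache.DEFAULT.getStrings(reader, "category");
    final Set<String> found = new TreeSet<String>();
    searcher.search(query, new HitCollector() {
        public void collect(int doc, float score) {
            if (byDoc[doc] != null) {
                found.add(byDoc[doc]); // doc ids are reader-relative here
            }
        }
    });
    return found;
}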
Re: AW: Highlighter and Phrase Queries
Check out the unit tests for the highlighter; there are a bunch of examples. It's pretty much the same as using the standard scorer, except that it requires a cached token filter so that the tokenstream can be read more than once. Once you pass in the SpanScorer to the Highlighter though, it works just like the non-phrase/span-aware Highlighter. - Mark Sertic Mirko, Bedag wrote: Hi Thank you for your response. Are there examples available? Regards Mirko -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: Monday, 10 November 2008 14:45 To: java-user@lucene.apache.org Subject: Re: Highlighter and Phrase Queries Check out the SpanScorer. - Mark On Nov 10, 2008, at 8:25 AM, "Sertic Mirko, Bedag" <[EMAIL PROTECTED] > wrote: [EMAIL PROTECTED] I am searching for a solution to make the Highlighter run properly in combination with phrase queries. I want to highlight text with a phrase query like "windows printserver", and I get the following highlighted: "windows printservers" are good blah blah "windows" manages "printserver" blah blah. So the phrases and the single terms are highlighted, but I just want to highlight the phrases. How could this be done? Thanks in advance Mirko - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
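A sketch of the cached-tokenstream wiring being described (hedged; this assumes the 2.4-era contrib-highlighter SpanScorer constructor taking a CachingTokenFilter):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.SpanScorer;

// Sketch: SpanScorer needs a CachingTokenFilter because the token stream
// is read once to find the matching spans and again by the Highlighter to
// build fragments; reset() rewinds the cached stream between the two passes.
String highlight(Analyzer analyzer, Query query, String field, String text)
        throws IOException {
    CachingTokenFilter tokens = new CachingTokenFilter(
            analyzer.tokenStream(field, new StringReader(text)));
    Highlighter highlighter = new Highlighter(new SpanScorer(query, field, tokens));
    tokens.reset(); // rewind the cached stream before highlighting
    return highlighter.getBestFragments(tokens, text, 3, "...");
}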
Re: AW: AW: Highlighter and Phrase Queries
Right, it will work the same as the standard Highlighter except that it highlights spans and phrase queries based on position. Sertic Mirko, Bedag wrote: Ok, I will do. I guess it will also work with BooleanQueries and combined Term/Wildcard/Phrase Queries? -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: Monday, 10 November 2008 15:38 To: java-user@lucene.apache.org Subject: Re: AW: Highlighter and Phrase Queries Check out the unit tests for the highlighter; there are a bunch of examples. It's pretty much the same as using the standard scorer, except that it requires a cached token filter so that the tokenstream can be read more than once. Once you pass in the SpanScorer to the Highlighter though, it works just like the non-phrase/span-aware Highlighter. - Mark Sertic Mirko, Bedag wrote: Hi Thank you for your response. Are there examples available? Regards Mirko -Original Message- From: Mark Miller [mailto:[EMAIL PROTECTED] Sent: Monday, 10 November 2008 14:45 To: java-user@lucene.apache.org Subject: Re: Highlighter and Phrase Queries Check out the SpanScorer. - Mark On Nov 10, 2008, at 8:25 AM, "Sertic Mirko, Bedag" <[EMAIL PROTECTED] > wrote: [EMAIL PROTECTED] I am searching for a solution to make the Highlighter run properly in combination with phrase queries. I want to highlight text with a phrase query like "windows printserver", and I get the following highlighted: "windows printservers" are good blah blah "windows" manages "printserver" blah blah. So the phrases and the single terms are highlighted, but I just want to highlight the phrases. How could this be done? Thanks in advance Mirko - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: ScoreDoc
There is definitely some stale javadoc in Lucene here and there. All of what you're talking about has been shaken up recently with the deprecation of Hits. Hits used to pretty much be considered the non-expert API, but it's been tossed in favor of the TopDocs APIs. The HitCollector stuff has been marked expert because a lot of people get into trouble using something that hits every doc in the index on a search, not just the matching docs from the search. If you don't understand what's going on, you can - and many have - write some pretty slow code. The expert stuff just means: understand what's going on before you start to play here ;) I don't necessarily think it doesn't belong in a tutorial - assuming the guy who wrote the tutorial understood what he was doing. As for the stale javadoc though, I'm sure patches would be welcome ;) It's a group of volunteers all scratching their own itches here, so it's likely you will find things like that. Best bet is to pitch in when you see it, and I'm sure one of the committers will apply your patch if it's appropriate. - Mark ChadDavis wrote: In fact, the search method used to populate the collector used in that sample code also claims to be low level. It suggests using the IndexSearcher.search(query) method instead, but that method is deprecated. Lower-level search API. HitCollector.collect(int,float) is called for every matching document. Applications should only use this if they need *all* of the matching documents. The high-level search API (Searcher.search(Query)) is usually more efficient, as it skips non-high-scoring hits. Note: The score passed to this method is a raw score. In other words, the score will not necessarily be a float whose value is between 0 and 1. Is this just stale documentation? On Sun, Nov 9, 2008 at 3:28 PM, ChadDavis <[EMAIL PROTECTED]> wrote: The sample code uses a ScoreDoc array to hold the hits. ScoreDoc[] hits = collector.topDocs().scoreDocs; But the JavaDoc says "Expert: Returned by low-level search implementations." Why would the tutorial sample code use an "expert" api?
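For reference, the TopDocCollector pattern from the 2.4-era tutorial code looks like this (searcher, query and the "title" field are assumed to exist already):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocCollector;

  public class TopDocsSketch {
    static void printTopTen(IndexSearcher searcher, Query query) throws Exception {
      TopDocCollector collector = new TopDocCollector(10); // keep only the 10 best hits
      searcher.search(query, collector);
      ScoreDoc[] hits = collector.topDocs().scoreDocs;
      for (int i = 0; i < hits.length; i++) {
        Document doc = searcher.doc(hits[i].doc); // hits[i].score holds the raw score
        System.out.println(doc.get("title"));     // "title" is a placeholder field name
      }
    }
  }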
Re: IndexSearcher and multi-threaded performance
Nice! An 8 core machine with a test ready to go! How about trying the read only mode that was added to 2.4 on your IndexReader? And if you are on unix and could try trunk and use the new NIOFSDirectory implementation...that would be awesome. Those two additions are our current hope for what you're seeing...would be nice to know if we need to try for more (or if we need to petition the smart people that work on that stuff to try for more ;) ). - Mark Dmitri Bichko wrote: Hi, I'm pretty new to Lucene, so please bear with me if this has been covered before. The wiki suggests sharing a single IndexSearcher between threads for best performance (http://wiki.apache.org/lucene-java/ImproveSearchingSpeed). I've tested running the same set of queries with: multiple threads sharing the same searcher, with a separate searcher for each thread, both shared/private with a RAMDirectory in-memory index, and (just for fun) in multiple JVMs running concurrently (the results are in milliseconds to complete the whole job):

threads  multi-jvm  shared  per-thread  ram-shared  ram-thread
1        72997      70883   72573       60308       60012
2        33147      48762   35973       25498       25734
4        16229      46828   21267       13127       27164
6        13088      47240   14028        9858       29917
8         9775      47020   10983        8948       10440
10        8721      50132   11334        9587       11355
12        7290      49002   11798        9832
16        9365      47099   12338       11296

The shared searcher indeed behaves better with a RAM-based index, but what's going on with the disk-based one? It's basically not scaling beyond two threads. Am I just doing something completely wrong here? The test consists of about 1,500 Boolean OR queries with 1-10 PhraseQueries each, with 1-20 Terms per PhraseQuery. I'm using a HitCollector to count the hits, so I'm not retrieving any results. The index is about 5GB and 20 million documents. This is running on a 8 x quad-core Opteron machine with plenty of RAM to spare. Any idea why I would see this behaviour? Thanks, Dmitri
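For anyone following along, the read-only open added in 2.4 is a one-line change (the index path is a placeholder):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class ReadOnlyOpenSketch {
    static IndexReader openReadOnly() throws Exception {
      Directory dir = FSDirectory.getDirectory("/path/to/index"); // placeholder path
      // 'true' requests a read-only reader, which avoids synchronization on deletion checks
      return IndexReader.open(dir, true);
    }
  }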
Re: IndexSearcher and multi-threaded performance
And if you are on unix and could try trunk and use the new NIOFSDirectory implementation...that would be awesome. Woah...that made 2.4 too. A 2.4 release will allow both optimizations. Many thanks!
Re: IndexSearcher and multi-threaded performance
Dmitri Bichko wrote: 32 cores, actually :) Glossed over that - even better! Killer machine to be able to test this on. I reran the test with read-only turned on (I changed how the time is measured a little, it should be more consistent):

threads  fs-thread  ram-thread  fs-shared  ram-shared
1        71877      54739       73986      61595
2        34949      26735       43719      28935
3        25581      26885       38412      19624
4        20511      31742       38712      15059
5        19235      24345       39685      12509
6        16775      26896       39592      10841
7        17147      18296       46678      10183
8        18327      19043       39886      10048
9        16885      18721       40342       9483
10       17832      30757       44706      10975
11       17251      21199       39947       9704
12       17267      36284       40208      10996

I can't seem to get NIOFSDirectory working, though. Calling NIOFSDirectory.getDirectory("foo") just returns an FSDirectory. That's a good point, and points out a bug in Solr trunk for me. Frankly I don't see how it's done. There is no code I can see/find to use it rather than FSDirectory. Still assuming there must be a way, but I don't see it... - Mark Any ideas? Cheers, Dmitri
Re: IndexSearcher and multi-threaded performance
Mark Miller wrote: That's a good point, and points out a bug in Solr trunk for me. Frankly I don't see how it's done. There is no code I can see/find to use it rather than FSDirectory. Still assuming there must be a way, but I don't see it... Ah - brain freeze. What else is new :) You have to set the system property to change implementations: org.apache.lucene.FSDirectory.class is the property; set it to the class. Been a long time...
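In code, that looks roughly like this (the property must be set before FSDirectory.getDirectory is first called in the JVM; the path is a placeholder):

  import org.apache.lucene.store.FSDirectory;

  public class NioDirectorySketch {
    static FSDirectory openNio() throws Exception {
      // Select the FSDirectory implementation via the system property described above
      System.setProperty("org.apache.lucene.FSDirectory.class",
          "org.apache.lucene.store.NIOFSDirectory");
      return FSDirectory.getDirectory("/path/to/index"); // now backed by NIOFSDirectory
    }
  }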
Re: IndexSearcher and multi-threaded performance
+1 - Mark On Nov 12, 2008, at 4:50 AM, Michael McCandless <[EMAIL PROTECTED]> wrote: I think we really should open up a non-static way to choose a different FSDirectory impl? EG maybe add an optional Class to FSDirectory.getDirectory? Or maybe give NIOFSDirectory a public ctor? Or something? Mike
Re: IndexSearcher and multi-threaded performance
I'm thinking about it, so if someone else doesn't get something together before I have some free time... It's just not clear to me at the moment how best to do it. Michael McCandless wrote: Any takers for pulling a patch together...? Mike Mark Miller wrote: +1 - Mark
Re: Lucene implementation/performance question
If you're new to Lucene, this might be a little much (and maybe I'm not fully understanding the problem), but you might try: Add the attributes to the words in a payload with a PayloadAnalyzer. Do searching as normal. Use the new PayloadSpanUtil class to get the payloads for the matching words. (Think of the PayloadSpanUtil as a highlighter - you give it a query, it gives you the payloads of the terms that match.) The PayloadSpanUtil class is a bit experimental, but I'll fix anything you run into with it. - Mark Greg Shackles wrote: Hi Erick, Thanks for the response, sorry that I was somewhat vague in the reasoning for my implementation in the first post. I should have mentioned that the word details are not details of the Lucene document, but are attributes about the word that I am storing. Some examples are position on the actual page, color, size, bold/italic/underlined, and most importantly, the text as it appeared on the page. The reason the last one matters is that things like punctuation, spacing and capitalization can vary between the result and the search term, and can affect how I need to process the results afterwards. I am certainly open to the idea of a new approach if it would improve on things; I admit I am new to Lucene, so if there are options I'm unaware of I'd love to learn about them. Just to sum it up with an example, let's say we have a page of text that stores "This is a page of text." We want to search for the text "of text", which would span multiple words in the word index. The final result would need to contain "of" and "text", along with the details about each as described before. I hope this is more helpful! - Greg On Wed, Nov 12, 2008 at 11:17 AM, Erick Erickson <[EMAIL PROTECTED]> wrote: If I may suggest, could you expand upon what you're trying to accomplish? Why do you care about the detailed information about each word? The reason I'm suggesting this is "the XY problem". That is, people often ask for details about a specific approach when what they really need is a different approach. There are TermFrequencies, TermPositions, TermVectorOffsetInfo and a bunch of other stuff that I don't know the details of that may work for you if we had a better idea of what it is you're trying to accomplish... Best Erick On Wed, Nov 12, 2008 at 10:47 AM, Greg Shackles <[EMAIL PROTECTED]> wrote: I hope this isn't a dumb question or anything, I'm fairly new to Lucene so I've been picking it up as I go pretty much. Without going into too much detail, I need to store pages of text, and for each word on each page, store detailed information about it. To do this, I have 2 indexes: 1) pages: this stores the full text of the page, and identifying information about it 2) words: this stores a single word, along with the page it was on, and is stored in the order they appear on the page. When doing a search, not only do I need to return the page it was found on, but also the details of the matching words. Since I couldn't think of a better way to do it, I first search the pages index and find any matching pages. Then I iterate the words on those pages to find where the match occurred. Obviously this is costly as far as execution time goes, but at least it only has to get done for matching pages rather than every page. Searches still take way longer than I'd like though, and the bottleneck is almost entirely in the code to find the matches on the page. One simple optimization I can think of is to store the pages in smaller blocks so that the scope of the iteration is made smaller. This is not really ideal, since I also need the ability to narrow down results based on other words that can/can't appear on the same page, which would mean storing 3 full copies of every word on every page (one in each of the 3 resulting indexes). I know this isn't a Java performance forum so I'll try to keep this Lucene related, but has anyone done anything similar to this, or have any comments/ideas on how to improve it? I'm in the process of trying to speed things up since I need to perform many searches often over very large sets of pages. Thanks! - Greg
Re: Lucene implementation/performance question
Here is a great power point on payloads from Michael Busch: www.us.apachecon.com/us2007/downloads/AdvancedIndexingLucene.ppt. Essentially, you can store metadata at each term position, so it's an excellent place to store attributes of the term - they are very fast to load, efficient, etc. You can check out the spans test classes for a small example using the PayloadSpanUtil...it's actually fairly simple and short, and the main reason I consider it experimental is that it hasn't really been used too much to my knowledge (who knows though). If you have a problem, you'll know quickly and I'll fix quickly. It should work fine though. Overall, the approach wouldn't take that much code, so I don't think you'd be out a lot of time. The PayloadSpanUtil takes an IndexReader and a query and returns the payloads for the terms in the IndexReader that match the query. If you end up with multiple docs in the IndexReader, be sure to isolate the query down to the exact doc you want the payloads from (the span scoring mode of the highlighter actually puts the doc in a fast MemoryIndex which only holds one doc, and uses an IndexReader from the MemoryIndex). Greg Shackles wrote: Hey Mark, This sounds very interesting. Is there any documentation or examples I could see? I did a quick search but didn't really find much. It might just be that I don't know how payloads work in Lucene, but I'm not sure how I would see this actually doing what I need. My reasoning is this...you'd have an index that stores all the text for a particular page. Would you be able to attach payload information to individual words on that page? In my head it seems like that would be the job of a second index, which is exactly why I added the word index. Any details you can give would be great as I need to keep moving on this project quickly. I will also say that I'm somewhat wary of using an experimental class since this is a really important project that really won't be able to wait on a lot of development cycles to get the class fully working. That said, if it can give me serious speed improvements it's definitely worth considering. - Greg
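For reference, driving the utility looks roughly like this - a sketch, not code from the thread; the package location of PayloadSpanUtil is assumed from the 2.4-era API, and reader/query are assumed to exist (ideally a reader over just the one doc of interest):

  import java.util.Collection;
  import java.util.Iterator;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.payloads.PayloadSpanUtil; // package assumed for 2.4

  public class PayloadLookupSketch {
    static void dumpPayloads(IndexReader reader, Query query) throws Exception {
      PayloadSpanUtil psu = new PayloadSpanUtil(reader);
      Collection payloads = psu.getPayloadsForQuery(query); // a Collection of byte[]
      for (Iterator it = payloads.iterator(); it.hasNext();) {
        byte[] payload = (byte[]) it.next();
        // decode your per-word attributes from the payload bytes here
      }
    }
  }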
Re: Lucene implementation/performance question
Greg Shackles wrote: Thanks! This all actually sounds promising, I just want to make sure I'm thinking about this correctly. Does this make sense? Indexing process: 1) Get a list of all words for a page and their attributes, stored in some sort of data structure 2) Concatenate the text from those words (space separated) into a string that represents the entire page 3) When adding the page document to the index, run it through a custom analyzer that attaches the payloads to the tokens * this would have to follow along in the word list from #1 to get the payload information for each token * would also have to tokenize the word we are storing to see how many Lucene tokens it would translate to (to make sure the right payloads go with the right tokens) Right, sounds like you have it spot on. That second * from 3 looks like a possible tricky part. I haven't totally analyzed the searching process yet since I want to get my head around the storage part first, but I imagine that would be the easier part anyway. Does this approach sound reasonable? Sounds good. My other concern is your comment about isolating results. If I'm reading it correctly, it means that I'd have to do the search in multiple passes: one to get the individual docs containing the matches, and then one query for each of those to get the payloads within them? Right...you'd do it essentially how highlighting works...you do the search to get the docs of interest, and then redo the search somewhat to get the highlights/payloads for an individual doc at a time. You are redoing some work, but if you think about it, getting that info for every match (there could be tons) doesn't make much sense when someone might just look at the top couple results, or say 10 at a time. Depends on your use case whether it's feasible or not though. Most find it efficient enough to do highlighting with, so I'm assuming it should be good enough here. Thanks again for your help on this one. - Greg
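For step 3 above, a rough sketch of a token filter that attaches per-word metadata as payloads - the lookupAttributes helper is hypothetical and stands in for however you fetch a word's attributes; the next(Token) contract is the 2.4-era TokenStream API:

  import java.io.IOException;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.index.Payload;

  public class WordAttributePayloadFilter extends TokenFilter {
    public WordAttributePayloadFilter(TokenStream input) {
      super(input);
    }

    public Token next(Token reusableToken) throws IOException {
      Token token = input.next(reusableToken);
      if (token != null) {
        // Hypothetical lookup of this word's attributes (position, color, bold, etc.)
        byte[] attributes = lookupAttributes(token.term());
        token.setPayload(new Payload(attributes));
      }
      return token;
    }

    private byte[] lookupAttributes(String word) {
      return new byte[0]; // placeholder: encode your real attributes here
    }
  }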
Re: LUCENE-831 (complete cache overhaul) -> mem use
It's hard to predict the future of LUCENE-831. I would bet that it will end up in Lucene at some point in one form or another, but it's hard to say if that form will be what's in the available patches (I'm a contrib committer so I won't have any real say in that, so take that prediction with a grain of salt). It has strong ties to other issues and a committer hasn't really had their whack at it yet. Having said that though, LUCENE-831 allows for two types for dealing with field values: either the old style int/string/long/etc arrays, or, for a small speed hit and faster reopens, an ArrayObject type that is basically an Object that can provide access to one or two real or virtual arrays. So technically you could use an ArrayObject that had a sparse implementation behind it. Unfortunately, you would have to implement new CacheKeys to do this. Trivial to do, but it reveals our LUCENE-831 problem of exponential CacheKey increases with every new little option/idea, and the juggling of which to use. I haven't thought about it, but I'm hoping an API tweak can alleviate some of this. - Mark Britske wrote: Hi, I recently saw activity on LUCENE-831 (Complete overhaul of FieldCache API/Implementation) which I have an interest in. I posted previously on this with my concern that, given the current default cache, I sometimes get OOM errors because I have a lot of fields which are sorted on, which ultimately causes the FieldCache to grow greater than available RAM. Ultimately I want to subclass the new pluggable FieldCache of LUCENE-831 to offload to disk (using ehcache or memcachedB or something) but haven't found the time yet. What I would like to know for now is whether the newly implemented standard cache in LUCENE-831 uses another strategy of caching than the standard FieldCache in Lucene. i.e.: the normal cache consumes memory while generating a field cache entry for every document in Lucene, even though the document hasn't got that field set. Since my documents are very sparse in these fields I want to sort on, it would differ a lot when documents that don't have the field in question set don't add up in the used memory. So am I lucky? Or would I indeed have to cook up something myself? Thanks and best regards, Geert-Jan
Re: LUCENE-831 (complete cache overhaul) -> mem use
Like I said, it's pretty easy to add this, but it's also going to suck. Kind of exposes the fact that it's missing the right extensibility at the moment. Things are still a bit ugly overall. You're going to need new CacheKeys for the data types you want to support. A CacheKey builds and provides access to the field data and is simply:

  public abstract class CacheKey {
    public abstract CacheData buildData(IndexReader r);
    public abstract boolean equals(Object o);
    public abstract int hashCode();
    public boolean isMergable();
    public CacheData mergeData(int[] starts, CacheData[] data);
    public boolean usesObjectArray();
  }

For a sparse storage implementation you would use an object array, so have usesObjectArray return true; isMergable can then be false and you don't have to support the mergeData method. In buildData you will load your object array and return it. Here is an array backed IntObjectArrayCacheKey build method:

  public CacheData buildData(IndexReader reader) throws IOException {
    final int[] retArray = getIntArray(reader);
    ObjectArray fieldValues = new ObjectArray() {
      public Object get(int index) {
        return new Integer(retArray[index]);
      }
    };
    return new CacheData(fieldValues);
  }

  protected int[] getIntArray(IndexReader reader) throws IOException {
    final int[] retArray = new int[reader.maxDoc()];
    TermDocs termDocs = reader.termDocs();
    TermEnum termEnum = reader.terms(new Term(field, ""));
    try {
      do {
        Term term = termEnum.term();
        if (term == null || term.field() != field) break;
        int termval = parser.parseInt(term.text());
        termDocs.seek(termEnum);
        while (termDocs.next()) {
          retArray[termDocs.doc()] = termval;
        }
      } while (termEnum.next());
    } finally {
      termDocs.close();
      termEnum.close();
    }
    return retArray;
  }

So it should be fairly straightforward to return a sparse-implementation-backed object array from your new CacheKey (SparseIntObjectArrayCacheKey or something). Now some more ugliness: you can turn on the ObjectArray CacheKeys by setting the system property 'use.object.array.sort' to true. This will cause FieldSortedHitQueue to return ScoreDocComparators that use the standard ObjectArray CacheKeys: IntObjectArrayCacheKey, FloatObjectArrayCacheKey, etc. The method that builds each comparator type knows what type to build for and whether to use primitive arrays or ObjectArrays, i.e. (from FieldSortedHitQueue):

  static ScoreDocComparator comparatorDoubleOA(final IndexReader reader, final String fieldname)

does this (it has to provide the CacheKey and know the return type):

  final ObjectArray fieldOrder = (ObjectArray) reader.getCachedData(new DoubleObjectArrayCacheKey(field)).getCachePayload();

So you have to either change all of the ObjectArray comparator builders to use your CacheKeys:

  final ObjectArray fieldOrder = (ObjectArray) reader.getCachedData(new SparseIntObjectArrayCacheKey(field)).getCachePayload();

Or you have to add more options in FieldSortedHitQueue.CacheEntry.buildData(IndexReader reader) and more static comparator builders in FieldSortedHitQueue that use the right CacheKeys. Obviously not very extensibility friendly at the moment. I'm sure with some thought, things could be much better. If you decide to jump into any of this, let me know if you have any suggestions or feedback. - Mark Britske wrote: That ArrayObject suggestion makes sense to me. It almost seemed as if you were referring to this option (or at least the interfaces needed to implement it) as already available as 1 out of 2 options in 831? Could you give me a hint at where I have to be looking to extend what you're suggesting? A new Cache, CacheFactory and CacheKey implementation for all types of CacheKeys? This may sound a bit ignorant, but it would be my first time getting my head around the internals of an API instead of merely using it to embed in a client application, so any help is highly appreciated. Thanks for your help, Geert-Jan
Re: InstantiatedIndex help
Check out the docs at: http://lucene.apache.org/java/2_4_0/api/contrib-instantiated/index.html There is a performance graph there to check out. The code should be fairly straightforward - you can make an InstantiatedIndex that's empty, or seed it with an IndexReader. Then you can make an InstantiatedIndexReader or InstantiatedIndexWriter, which take the InstantiatedIndex as a constructor arg. You should be able to just wrap that InstantiatedIndexReader in a regular Searcher. Darren Govoni wrote: Hi gang, I am trying to trace the 2.4 API to create an InstantiatedIndex, but it's rather difficult to connect directory, reader, search, index etc. just reading the javadocs. I have a (POI - plain old index) directory already and want to create a faster InstantiatedIndex and IndexSearcher to query it like before. What's the proper order to do this? Also, if anyone has any empirical data on the performance or reliability of InstantiatedIndex, I'd be curious. Thanks for the tips! Darren
Re: InstantiatedIndex help
Can you start with an empty index? Then how about: // Adding these iindex = InstantiatedIndex() ireader = iindex.indexReaderFactory() isearcher = IndexSearcher(ireader) If you want a copy from another IndexReader though, you have to get that reader from somewhere, right? - Mark Darren Govoni wrote: Hi Mark, Thanks for the tips. Here's what I will try (pseudo-code): endirectory = RAMDirectory("index/dictionary.en") ensearcher = IndexSearcher(endirectory) // Adding these reader = ensearcher.getIndexReader() iindex = InstantiatedIndex(reader) ireader = iindex.indexReaderFactory() isearcher = IndexSearcher(ireader) Kind of a roundabout way to get an InstantiatedIndex I guess, but maybe there's a briefer way? Thank you. Darren
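In Java rather than pseudo-code, the round trip looks roughly like this (a sketch against the 2.4 contrib-instantiated API):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.instantiated.InstantiatedIndex;
  import org.apache.lucene.store.instantiated.InstantiatedIndexReader;

  public class InstantiatedSketch {
    static IndexSearcher toMemorySearcher(IndexReader diskReader) throws Exception {
      InstantiatedIndex iindex = new InstantiatedIndex(diskReader); // copies the index into RAM
      InstantiatedIndexReader ireader = iindex.indexReaderFactory();
      return new IndexSearcher(ireader); // wraps like any other IndexReader
    }
  }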
Re: Spread of lucene score
excitingComm2 wrote: Hi everybody, as far as I know the Lucene score is an arbitrary number between 0.0 and 1.0. Is it correct that the scores in my result set are always normalised to this spread, or is it possible to get higher scores? Regards, John W. Hits is the class that did the normalizing, and it's deprecated. TopDocs didn't normalize last I checked, so you could get scores > 1 from there.
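A quick sketch of reading raw scores from TopDocs (searcher and query are assumed to exist; the null argument just means no filter):

  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.TopDocs;

  public class RawScoreSketch {
    static float bestRawScore(IndexSearcher searcher, Query query) throws Exception {
      TopDocs topDocs = searcher.search(query, null, 10);
      // Raw scores are not normalized and can be greater than 1.0
      return topDocs.totalHits > 0 ? topDocs.scoreDocs[0].score : 0f;
    }
  }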
Re: Lucene implementation/performance question
Yeah, discussion came up on order and I believe we punted - it's up to you to track order and sort at the moment. I think that was to prevent those that didn't need it from paying the sort cost, but I have to go find that discussion again (maybe it's in the issue?). I'll look at the whole idea again though. Greg Shackles wrote: On Wed, Nov 19, 2008 at 12:33 PM, Greg Shackles <[EMAIL PROTECTED]> wrote: In the searching phase, I would run the search across all page documents, and then for each of those pages, do a search with PayloadSpanUtil.getPayloadsForQuery that made it so it only got payloads for each page at a time. The function returns a Collection of payloads as far as I can tell, so is there any way of knowing which payloads go together? That is to say, if you were to do a search for "lucene rocks" on the page and it appeared 3 times, you would get back 6 payloads in total. Is there a quick way of knowing how to group them in the collection? Just a follow-up on my post now that I was able to see what the real data looks like when it comes back from PayloadSpanUtil. The order of payload terms in the collection doesn't seem useful, as I suspect it is somehow related to the order they are stored in the index itself. Because of that, grouping them is going to be difficult as I suspected, but this seems like something Lucene should be able to do for me. Is that not correct? I'd like to keep as much of the logic as possible out of my own implementation for the sake of performance, so if there is some way to do this, I would love to know. Thanks! By the way, the Payloads feature is really cool! Definitely way better than how I was doing things originally. : ) - Greg
Re: # of fields, performance
There is not much impact as long as you turn off norms for the majority of them. - Mark On Dec 2, 2008, at 8:47 AM, Darren Govoni <[EMAIL PROTECTED]> wrote: Hi, I saw this question asked before without a clear answer. Pardon me if I missed it in the archive elsewhere. Is there a serious degradation of performance when using a high number of fields per document? Like 100's? Is the impact more on the write than the read? What are the performance characteristics with a high number of fields, and is anyone using indexes this way? Thank you for any thoughts. Darren
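A sketch of turning norms off per field at index time (the field name and value are placeholders):

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  public class OmitNormsSketch {
    static Document docWithManyFields() {
      Document doc = new Document();
      Field f = new Field("attr1", "some value", Field.Store.NO, Field.Index.ANALYZED);
      f.setOmitNorms(true); // saves the 1 byte per document per field that norms cost
      doc.add(f);
      return doc;
    }
  }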
Re: lucene nicking my memory ?
Careful here. Not only do you need to pass -server, but you need the ability to use it :) It will silently not work if it's not there, I believe. Oddly, the JRE doesn't seem to come with the server hotspot implementation. The JDK always does appear to. Probably varies by OS to some degree. Some awesome options for visually watching garbage collection: straight visualgc, the NetBeans visualgc plugin, or the awesome VisualVM with its visualgc plugin. Eric Bowman wrote: Are you not passing -server on the command line? You need to do that. In my experience with Sun JVM 1.6.x, the default GC strategy is really amazingly good, as long as you pass -server. If passing -server doesn't fix it, I would recommend enabling the various verbose GC logs and watching what happens there, and using the Sun tools to analyze it a bit. If you do require specific heap tuning, the verbose GC logging will steer you in the right direction. Good luck! -Eric Michael McCandless wrote: Are you actually hitting OOME? Or, you're watching heap usage and it bothers you that the GC is taking a long time (allowing too much garbage to use up heap space) before sweeping? One thing to try (only for testing) might be a lower and lower -Xmx until you do hit OOME; then you'll know the "real" memory usage of the app. Mike Magnus Rundberget wrote: Sure. Tried with the following: Java version: build 1.5.0_16-b06-284 (dev), 1.5.0_12 (production). OS: Mac OS X Leopard (dev) and Windows XP (dev), Windows 2003 (production). Container: Jetty 6.1 and Tomcat 5.5 (the latter is used both in dev and production). Current JVM options: -Xms512m -Xmx1024M -XX:MaxPermSize=256m ... tried a few GC settings as well but nothing that has helped (rather slowed things down). Production hardware runs 2 Xeon dual core processors. In production our memory reaches the 1024 limit after a while (a few hours) and at some point it stops responding to forced GC (using JConsole). Need to dig quite a bit more to figure out the exact prod settings, but safe to say the memory usage pattern can be recreated on different hardware configs, with different OS's, different 1.5 JVMs and different containers (Jetty and Tomcat). cheers Magnus On Dec 3, 2008, at 13:10, Glen Newton wrote: Hi Magnus, Could you post the OS, version, RAM size, swap size, Java VM version, hardware, #cores, VM command line parameters, etc? This can be very relevant. Have you tried other garbage collectors and/or tuning as described in http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html? 2008/12/3 Magnus Rundberget <[EMAIL PROTECTED]>: Hi, We have an application using Tomcat, Spring etc. and Lucene 2.4.0. Our index is about 100MB (in test) and has about 20 indexed fields. Performance is pretty good, but we are experiencing a very high usage of memory when searching. Looking at JConsole during a somewhat silly scenario (but it illustrates the problem); (allocated 512 MB min heap space, max 1024): 0. Initially memory usage is about 70MB 1. Search for the word "er", heap memory usage goes up by 100-150MB 1.1 Wait for 30 seconds... memory usage stays the same (i.e. no GC) 2. Search for the word "og", heap memory usage goes up another 50-100MB 2.1 See 1.1 ...and so on until it seems to reach the 512 MB limit, and then a garbage collection is performed - i.e. garbage collection doesn't seem to occur until it "hits the roof". We believe the scenario is similar in production, where our heap space is limited to 1.5 GB. Our search is basically as follows: 1. Open an IndexSearcher 2. Build a boolean query searching across 4 fields (title, summary, content and daterangestring MMDD) 2.1 Sort on title 3. Perform search 4. Iterate over hits to build a set of custom result objects (pretty small, as we don't include content in these) 5. Close searcher 6. Return result objects. You should not close the searcher: it can be shared by all queries. What happens when you warm Lucene with a (large) number of queries: do things stabilize over time? A 100MB index is (relatively) very small for Lucene (I have 100GB indexes). What kind of response times are you getting, independent of memory usage? -glen We have tried various options based on entries on this mailing list: a) Cache the IndexSearcher - same results b) Remove sorting - same result c) In point 4, only iterating over a limited amount of hits rather than the whole collection - same result in terms of memory usage, but obviously increased performance d) Using RAMDirectory vs FSDirectory - same result, only initial heap usage is higher using RAMDirectory (in conjunction with a cached IndexSearcher). Doing some profiling using YourKit shows a huge number of char[], int[] and String[], and an ever increasing number of Lucene related objects. Reading through the mailing lists, suspicions are that our problem is related to ThreadLocals and memory not being re
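For reference, a typical launch line on a Sun 1.5/1.6 JVM that forces the server VM and turns on verbose GC logging (the jar name is a placeholder; the heap settings are just the ones from this thread):

  java -server -Xms512m -Xmx1024m -XX:MaxPermSize=256m \
       -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
       -jar myapp.jar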
Re: NPE inside org.apache.lucene.index.SegmentReader.getNorms
Sounds familiar. This may actually be in JIRA already. - Mark On Dec 3, 2008, at 6:25 PM, "Teruhiko Kurosaka" <[EMAIL PROTECTED]> wrote: Mike, You are right. There was an error on my part. I think I was, in effect, making a SpanNearQuery object of: new SpanNearQuery(new SpanQuery[0], 0, true); -Original Message- From: Michael McCandless [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 03, 2008 10:47 AM To: java-user@lucene.apache.org Subject: Re: NPE inside org.apache.lucene.index.SegmentReader.getNorms Actually I think something "outside" Lucene is probably setting that field. How did you create the Query that you are searching on? Mike
Re: Open IndexReader read-only
Chris Bamford wrote: So does that mean if you don't explicitly open an IndexReader, the IndexSearcher will do it for you? Or what? Right. The IndexReader takes a Directory, and the IndexSearcher takes an IndexReader - there are sugar constructors though - an IndexSearcher will also accept a String file path, which will be used to create a Directory, which is used to create an IndexReader. It will also take a Directory, which will be used to create an IndexReader. And it will also just accept the IndexReader. So you have to find how (or where) that IndexReader is being created and change the code so that you get to create it, and when you do, do it read-only. It should be easier than that roundabout info sounds. - Mark
Re: Fragment Highlighter Phrase?
Ian Vink wrote: Is there a way to get phrases counted in the list of fragments that come back from Highlighter.GetBestFragments() in general? It seems to only take words into account. Ian Not sure I fully understand, but have you tried the SpanScorer? It allows the Highlighter to work with phrase/span queries. - Mark
Re: Open IndexReader read-only
Look for the static factory methods on IndexReader. - Mark Chris Bamford wrote: Thanks Mark. I have identified the spot where I need to do the surgery. However, I discover that IndexReader is abstract, but it seems crazy that I need to make a concrete class for which I have no need to add any of my own logic... Is there a suitable subclass I can use? The documented ones - FilterIndexReader, InstantiatedIndexReader, MultiReader, ParallelReader - all seem too complicated for what I need. My only requirement is to open it read-only! Am I missing something?
Re: Open IndexReader read-only
Chris Bamford wrote: Mark > Look for the static factory methods on IndexReader. I take it you mean IndexReader.open(dir, true)? Yeah. If so, how do I then pass that into DelayCloseIndexSearcher() so that I can continue to rely on all the existing calls like: IndexReader reader = contentSearcher.getIndexReader(); Put another way, how do I associate the static IndexReader with an IndexSearcher object so I can use getIndexReader() to get it again? Find where that contentSearcher is being created. Use a different constructor to create the searcher - use the one that takes an IndexReader. Now you control the IndexReader creation, and you can use the read-only constructor option when you create it. That searcher is either using a constructor that takes an IndexReader, or a Directory, or a String. If it's using a String constructor, instead use the Directory factory that takes a String, make a Directory, and use it to make an IndexReader that you build the IndexSearcher with. If it's using a Directory, use that Directory to make the IndexReader that is used for your IndexSearcher. Thanks for your continued help with this :-) Chris
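Put together, the wiring might look like this (a sketch with a placeholder path; a plain IndexSearcher stands in here for the custom DelayCloseIndexSearcher from the thread):

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class ReadOnlySearcherSketch {
    static IndexSearcher open() throws Exception {
      Directory dir = FSDirectory.getDirectory("/path/to/index"); // placeholder path
      IndexReader reader = IndexReader.open(dir, true);           // read-only open
      IndexSearcher searcher = new IndexSearcher(reader);
      // searcher.getIndexReader() now hands back the same read-only reader
      return searcher;
    }
  }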
Re: Has anyone written SpanFuzzyQuery?
http://issues.apache.org/jira/browse/LUCENE-522 Note the bugs mentioned at the bottom. - Mark
Re: GWT port of Lucene's QueryParser
Paul Libbrecht wrote: Hello again list, has anyone tried to port or simply run the QueryParser of Lucene to GWT? It would look like a very nice thing to do, to provide direct rendering of the query interpretation (it could probably be made into a whole editor, e.g. removing or selecting parts of the query). thanks in advance paul I don't think it's worth the effort Paul (though it sounds like a fun thing to try, as long as JavaCC sticks to GWT-compatible core classes). It seems a lot easier to just run the QueryParser server side and move the results back and forth with RPC. As a side note, Mark Harwood worked on a cool GWT port of Luke. - Mark
Re: Field.omitTF
Drops positions as well. - Mark On Dec 18, 2008, at 4:57 PM, "John Wang" wrote: Hi: In Lucene 2.4, when Field.omitTF() is called, payload is disabled as well. Is this intentional? My understanding is payload is independent from the term frequencies. Thanks -John
Re: Field.omitTF
No, not a bug; certainly it's the intended behavior (though the name is a bit tricky, isn't it? I've actually thought about that in the past myself). If you check out the javadoc on Fieldable you'll find: /** Expert: If set, omit term freq, positions and payloads from postings for this field. */ void setOmitTf(boolean omitTf); - Mark John Wang wrote: Thanks Mark! I don't think it is documented (at least in the ones I've read); should this be considered a bug, or...? Thanks -John
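A sketch of setting it per field (the field name and value are placeholders; setOmitTf is the 2.4 Fieldable method quoted above):

  import org.apache.lucene.document.Field;

  public class OmitTfSketch {
    static Field bodyField(String text) {
      Field f = new Field("body", text, Field.Store.NO, Field.Index.ANALYZED);
      f.setOmitTf(true); // drops term frequencies, positions, and therefore payloads
      return f;
    }
  }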
Re: Approximate release date for Lucene 2.9
Well, look at the issues and see for yourself :) It's a subjective call I think. Here's my take: There are not going to be too many sweeping changes in the next release. There are tons of little bug fixes and improvements, but not a lot of the bullet point type stuff that you mention in your wishlist. It's a whole lot of little steps forward. When it comes to sorting, there are a couple of possible goodies coming in the next release: TrieRangeQuery has been added to contrib. Super awesome, super efficient, large scale sorting. Work is ongoing to change searching semantics so that sorting is much faster in many cases. In fact, there may be search speed improvements across the board in many cases (don't quote me). Sort FieldCache loading in the multi-segment case will likely also be *blazingly* faster. Also, Filters and FieldCaches may be pushed down to a single segment, making reopening sort FieldCaches *much* more efficient. That's a nice step towards realtime. RangeQuery, PrefixQuery and WildcardQuery will all have a constant score mode as well - this avoids max clause limits and is often much faster on very large indexes. LocalLucene, a very cool bit of code that allows geo search, might make contrib for the next release. Beyond that, there are a few more little gems, but it's a lot of little fixes and improvements more than big features. Column stride fields and flexible indexing will not be in the next release in my opinion, but a lot of progress towards flexible indexing has been made. Keep in mind that's a biased view of the next release - I worked on two of those issues. Be sure to take it all with a healthy grain of salt. - Mark Ganesh wrote: Does Lucene 2.9 have real time search? Any improvements in sorting? Any facility to store a payload per document (without updating the document)? Please highlight the important features. Regards Ganesh - Original Message - From: "Michael McCandless" To: Sent: Friday, December 19, 2008 3:40 AM Subject: Re: Approximate release date for Lucene 2.9 Well... there are a couple threads on java-dev discussing this "now": http://www.nabble.com/2.9-3.0-plan---Java-1.5-td20972994.html http://www.nabble.com/2.9,-3.0-and-deprecation-td20099343.html though they seem to have petered out. Also we have 29 open issues for 2.9: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&pid=12310110&fixfor=12312682&resolution=-1&sorter/field=priority&sorter/order=DESC For 2.4 it took at least a month to whittle the list down to 0. So it's hard to say? I'd love to see 2.9 out earlyish next year though. Mike Kay Kay wrote: Hi - I am just curious - what is the approximate release target date that we have for Lucene 2.9 (currently in beta in dev).
Re: Approximate release date for Lucene 2.9
Mark Miller wrote: TrieRangeQuery has been added to contrib. Super awesome, super efficient, large scale sorting. Sorry. It's way past my bedtime. Large scale numerical range searching. Sorting on the brain.
Re: Approximate release date for Lucene 2.9
Right, I was debating throwing that in myself - its great stuff, but I wasn't sure how much of a feature benefit it brought now. My understanding is that its main benefit is along the flexible indexing path and using multiple consumers eg its more setup for the goodness yet to come. My understanding is certainly less than yours though :) - Mark Michael McCandless wrote: The new extensible TokenStream API (based on AttributeSource) is also in 2.9. Mike Mark Miller wrote: Well look at the issues and see for yourself :) Its a subjective call I think. Heres my take: There are not going to be too many sweeping changes in the next release. There are tons of little bug fixes and improvements, but not a lot of the bullet point type stuff that you mention in your wishlist. Its a whole lot of little steps forward. When it comes to sorting, there a couple possible goodies coming in the next release: TrieRangeQuery has been added to contrib. Super awesome, super efficient, large scale sorting. Work is ongoing to change searching semantics so that sorting is much faster in many cases. In fact, their may be search speed improvements across the board in many cases (don't quote me ). Sort fieldcache loading in the multi segment case will likely also be *blazingly* faster. Also, Filters and Fieldcaches may be pushed down to a single segment, making reopening sort fieldcaches *much* more efficient. Thats a nice step towards realtime. RangeQuery, PrefixQuery and WildcardQuery will all have a constant score mode as well - this avoids maxclause limits and is often much faster on very large indexes. Locallucene, a very cool bit of code that allows geo search, might make contrib for the next release. Beyond that, there are a few more little gems, but its a lot of little fixes and improvements more than big features. Column stride fields and flexible indexing will not be in the next release in my opinion, but a lot of progress towards flexible indexing has been made. Keep in mind thats a biased view of the next release - I worked on two of those issues. Be sure to take it all with a healthy grain of salt. - Mark Ganesh wrote: Does Lucene 2.9 has real time search? Any improvements in sorting? Any facility to store a payload per document (without updating document)? Please highlight the important feature? Regards Ganesh - Original Message - From: "Michael McCandless" To: Sent: Friday, December 19, 2008 3:40 AM Subject: Re: Approximate release date for Lucene 2.9 Well... there are a couple threads on java-dev discussing this "now": http://www.nabble.com/2.9-3.0-plan---Java-1.5-td20972994.html http://www.nabble.com/2.9,-3.0-and-deprecation-td20099343.html though they seem to have petered out. Also we have 29 open issues for 2.9: https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&&pid=12310110&fixfor=12312682&resolution=-1&sorter/field=priority&sorter/order=DESC For 2.4 it took at least a month to whittle the list down to 0. So it's hard to say? I'd love to see 2.9 out earlyish next year though. Mike Kay Kay wrote: Hi - I am just curious - what is the approximate release target date that we have for Lucene 2.9 ( currently in beta in dev). 
Re: Optimize and Out Of Memory Errors
Lebiram wrote: Also, what are norms? Norms are a byte value per field stored in the index that is factored into the score. They are used for length normalization (shorter documents = more important) and index-time boosting. If you want either of those, you need norms. When norms are loaded up into an IndexReader, they are loaded into a byte[maxDoc] array for each field - so even if only one document out of 400 million has a field, it will still load byte[maxDoc] for that field (a lot of wasted RAM). Did you say you had 400 million docs and 7 fields? Google says that would be: 400 million x 7 bytes = 2,670.29 megabytes, on top of your other RAM usage.
Re: Optimize and Out Of Memory Errors
Mark Miller wrote: Lebiram wrote: Also, what are norms? Norms are a byte value per field stored in the index that is factored into the score. They are used for length normalization (shorter documents = more important) and index-time boosting. If you want either of those, you need norms. When norms are loaded up into an IndexReader, they are loaded into a byte[maxDoc] array for each field - so even if only one document out of 400 million has a field, it will still load byte[maxDoc] for that field (a lot of wasted RAM). Did you say you had 400 million docs and 7 fields? Google says that would be: 400 million x 7 bytes = 2,670.29 megabytes, on top of your other RAM usage.

Just to avoid confusion, that should really read a byte per document per field. If I remember right, it gives 255 boost possibilities, limited to 25 with length normalization.
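The arithmetic behind that figure, spelled out as plain Java (the numbers come straight from the messages above):

    public class NormsRamEstimate {
        public static void main(String[] args) {
            // One byte per document per field that has norms, regardless of
            // how many documents actually contain the field.
            long maxDoc = 400000000L;                      // 400 million docs
            int fieldsWithNorms = 7;
            long bytes = maxDoc * fieldsWithNorms;         // 2,800,000,000 bytes
            double megabytes = bytes / (1024.0 * 1024.0);  // ~2670.29 MB
            System.out.println(megabytes + " MB just for norms");
        }
    }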
Re: Optimize and Out Of Memory Errors
We don't know that norms are "the" problem. Luke is loading norms if it's searching that index. But what else is Luke doing? What else is your app doing? I suspect your app requires more RAM than Luke. How much RAM do you have, and how much are you allocating to the JVM? The norms are not necessarily the problem you have to solve - but it would appear they are taking up over 2 gig of memory. Unless you have some to spare (and it sounds like you may not), it could be a good idea to turn them off for particular fields. - Mark

Lebiram wrote: Is there a way to not factor norms data into scoring somehow? I'm just stumped as to how Luke is able to do a search (with limit) on the docs, but in my code it just dies with OutOfMemory errors. How does Luke not allocate these norms?
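A minimal sketch of turning norms off for a particular field at index time, using the Field.Index.ANALYZED_NO_NORMS constant from the 2.4-era API. The field name and text are invented, and note one caveat: if other documents were already indexed with norms for the same field, merges will keep norms for it.

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public class NoNormsExample {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(),
                true, IndexWriter.MaxFieldLength.LIMITED);
            Document doc = new Document();
            // ANALYZED_NO_NORMS tokenizes the field but writes no norms for it,
            // so readers never allocate a byte[maxDoc] array for this field.
            // The trade-off: no length normalization or index-time boosting.
            doc.add(new Field("body", "some long text ...",
                Field.Store.NO, Field.Index.ANALYZED_NO_NORMS));
            writer.addDocument(doc);
            writer.close();
        }
    }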
Re: about TopFieldDocs
Erick Erickson wrote: > The number of documents > is irrelevant here, what is relevant is the number of > distinct terms in your "fieldName" field. >

Depending on the size of your index, the number of docs will matter though. You have to store the unique terms in a String[] array, but you also store an int[] array the size of maxDoc that indexes into the unique terms array. Depending on your index, this could be as much of a cost as the unique terms, or more. It doesn't matter how many documents you get back for a particular search though - it's just a matter of how many docs are in the index. - Mark
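Those two arrays are exactly what FieldCache hands back for a String sort field. A small sketch against the 2.x API - the reader and field name are placeholders:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    public class StringIndexCost {
        public static void show(IndexReader reader, String field) throws IOException {
            FieldCache.StringIndex idx =
                FieldCache.DEFAULT.getStringIndex(reader, field);
            // lookup: one entry per distinct term - cost scales with term count
            String[] uniqueTerms = idx.lookup;
            // order: one int per document (maxDoc entries) pointing into
            // lookup - cost scales with index size at 4 bytes per doc
            int[] perDocOrdinals = idx.order;
            System.out.println(uniqueTerms.length + " distinct terms, "
                + perDocOrdinals.length + " docs in the cache");
        }
    }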
Re: ANNOUNCE: Welcome Patrick O'Leary as Contrib Committer
Welcome Patrick! +1 for LocalLucene. patrick o'leary wrote: Thanks Folks. I'm in the business well over a decade now; I started my career in my country of origin, Ireland, and have since lived & worked in the UK and the US. I've also traveled extensively, establishing development groups in remote offices for my company in a few countries. I've worked in several areas, from global publishing services, CRMs / fulfillment systems, and web server development, to technical operations, and for the past number of years have made a home for myself in search and local search. My background has been in CS, math and physics. And despite the rumors, my user name "pjaol" is actually an acronym of my full name, which is only ever used by my mother when I'm in trouble :-) It will be a pleasure to continue working with all of you, and thank you again for this honor. Thanks Patrick O'Leary

On Jan 16, 2009, at 1:54 PM, Ryan McKinley wrote: The PMC is pleased to announce that Patrick O'Leary has been voted to be a Lucene-Java Contrib committer. Patrick has contributed a great foundation for integrating spatial search with Lucene. I look forward to future development in this area. Patrick - traditionally we ask you to send out an introduction to the community; it's nice for folks to get a sense for who everyone is. Also check that your new svn karma works by adding yourself to the list of contrib committers. Welcome Patrick! ryan
Re: term offsets info seems to be wrong...
Okay Koji, hopefully I'll have better luck suggesting this, this time: have you tried http://issues.apache.org/jira/browse/LUCENE-1448 yet? I am not sure if it's in a state where it can be applied, but I hope it covers your issue.

On Fri, Jan 16, 2009 at 7:15 PM, Koji Sekiguchi wrote:
> Hello,
>
> I'm writing a highlighter by using term offsets info (yes, I borrowed the idea
> of LUCENE-644). In my highlighter, I'm seeing unexpected term offsets info
> when getting a multi-valued field.
>
> For example, if I indexed [" "," bbb "] (multi-valued), I got term info
> bbb(7,10). This is the expected result. But if I indexed [" aaa "," bbb "]
> (note that I used " aaa " instead of " "), I got term info bbb(6,9), which
> is unexpected. I would like to get the same offset info for bbb, because the
> field values are the same length.
>
> Please use the following program to see the problem I'm seeing. I'm
> using trunk:
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.WhitespaceAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> import org.apache.lucene.document.Field.Index;
> import org.apache.lucene.document.Field.Store;
> import org.apache.lucene.document.Field.TermVector;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.IndexWriter;
> import org.apache.lucene.index.IndexWriter.MaxFieldLength;
> import org.apache.lucene.index.TermPositionVector;
> import org.apache.lucene.index.TermVectorOffsetInfo;
> import org.apache.lucene.store.Directory;
> import org.apache.lucene.store.RAMDirectory;
>
> public class TermOffsetsTest {
>   public static void main(String[] args) throws Exception {
>     // create an index
>     Directory dir = new RAMDirectory();
>     Analyzer analyzer = new WhitespaceAnalyzer();
>     IndexWriter writer = new IndexWriter( dir, analyzer, true, MaxFieldLength.LIMITED );
>     Document doc = new Document();
>     doc.add( new Field( "f", " aaa ", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
>     //doc.add( new Field( "f", " ", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
>     doc.add( new Field( "f", " bbb ", Store.YES, Index.ANALYZED, TermVector.WITH_OFFSETS ) );
>     writer.addDocument( doc );
>     writer.close();
>
>     // print the offsets
>     IndexReader reader = IndexReader.open( dir );
>     TermPositionVector tpv = (TermPositionVector) reader.getTermFreqVector( 0, "f" );
>     for( int i = 0; i < tpv.getTerms().length; i++ ){
>       System.out.print( "term = \"" + tpv.getTerms()[i] + "\"" );
>       TermVectorOffsetInfo[] tvois = tpv.getOffsets( i );
>       for( TermVectorOffsetInfo tvoi : tvois ){
>         System.out.println( "(" + tvoi.getStartOffset() + "," + tvoi.getEndOffset() + ")" );
>       }
>     }
>     reader.close();
>   }
> }
>
> regards,
>
> Koji
Re: Group by in Lucene ?
Group-by in Lucene/Solr has not been solved in a great general way yet, to my knowledge. Ideally, we would want a solution that does not need to fit into memory. However, to do the grouping you need the value of the group-by field for each document, and as you are finding, this is not cheap to get. Currently, the efficient way to get it is to use a FieldCache. This, however, requires that every distinct value fit into memory.

Once you have efficient access to the values, you need to be able to efficiently group the results, again not bounded by memory (which we already are with the FieldCache). There are quite a few ways to do this. The simplest: group until you have used all the memory you want; then, for everything left, write anything that doesn't match an existing group to a file, and increment the group count for anything that does. Use the overflow file as the input for the next run, and repeat until there is no overflow. You can improve on that by partitioning the overflow file. And then there are a dozen other methods.

Solr has a patch in JIRA that uses a sorting method: first the results are sorted on the group-by field, then scanned through for grouping - all field values that are the same will be next to each other. Finally, if you really want to sort on a different field, another sort is applied. That's not ideal IMO, but it's a start. - Mark
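As a rough illustration of the in-memory FieldCache variant described above - not the overflow-file method, and not the Solr patch - here is a sketch against the 2.9-era API. The group-by field is passed in, and a real implementation would use a custom Collector instead of pulling every hit back through a TopDocs.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    public class SimpleGroupBy {
        // Count matching docs per distinct value of the group-by field.
        public static Map<String, Integer> groupCounts(IndexSearcher searcher,
                Query query, String groupField) throws IOException {
            IndexReader reader = searcher.getIndexReader();
            // One String per document - only workable while every distinct
            // value fits in RAM, which is the FieldCache limitation above.
            String[] values = FieldCache.DEFAULT.getStrings(reader, groupField);
            Map<String, Integer> counts = new HashMap<String, Integer>();
            // Sketch-grade: asks for every hit, which builds a maxDoc-sized
            // priority queue; a custom Collector avoids that entirely.
            TopDocs hits = searcher.search(query, Math.max(1, reader.maxDoc()));
            for (ScoreDoc sd : hits.scoreDocs) {
                String value = values[sd.doc];
                Integer c = counts.get(value);
                counts.put(value, c == null ? 1 : c + 1);
            }
            return counts;
        }
    }

Everything here is bounded by memory twice over - the FieldCache array and the counts map - which is exactly the limitation the overflow-file approach above is meant to remove.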