Re: Lucene Query
Oh sorry guys, ignore what I said. I am going to get myself a coffee. Uwe is absolutely correct here. On Aug 19, 2014, at 01:13 PM, Uwe Schindler wrote: Hi, Look at his docs. He has only 2 docs, the second one with 3 keywords. I would use a simple phrase query with a slop value < the Analyzer's positionIncrementGap. This is the gap between fields with the same name. A span or phrase cannot cross the gap if the slop is small enough, but large enough to find the terms next to each other. SpanQuery is not needed; PhraseQuery does all that's needed. Slop is like an edit distance over whole terms, and order does not matter. Uwe On 19 August 2014 22:05:23 MESZ, Tri Cao wrote: >OR operator does that, AND only returns docs with ALL terms present. > >Note that you have two options here >1. Create a BooleanQuery object (see the Java doc I linked below) and >programmatically >add the term queries with the following constraint: >http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST_NOT > >2. Use the Lucene classic QueryParser and pass in the query string "states >AND america AND united" > >I would suggest 1) if you are going to learn more about Lucene, and 2) >if you just want to get something out.
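Uwe's suggestion can be sketched end to end (Lucene 4.x API; the field name "label" comes from the thread, while the gap of 100 and slop of 10 are illustrative assumptions; note the base Analyzer's positionIncrementGap is 0, so it must be overridden for the gap trick to work):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class PhraseSlopDemo {

    // Analyzer with a large gap between values of a multi-valued field, so a
    // sloppy phrase cannot match across doc 2's three separate "label" values.
    static Analyzer gapAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String field, Reader reader) {
                Tokenizer tok = new StandardTokenizer(Version.LUCENE_47, reader);
                return new TokenStreamComponents(tok, new LowerCaseFilter(Version.LUCENE_47, tok));
            }
            @Override
            public int getPositionIncrementGap(String field) {
                return 100; // assumed gap; the base Analyzer default is 0
            }
        };
    }

    public static TopDocs run() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter iw = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_47, gapAnalyzer()));

        Document doc1 = new Document();
        doc1.add(new TextField("label", "United States of America", Field.Store.NO));
        iw.addDocument(doc1);

        Document doc2 = new Document(); // three values of the same field
        doc2.add(new TextField("label", "United", Field.Store.NO));
        doc2.add(new TextField("label", "America", Field.Store.NO));
        doc2.add(new TextField("label", "States", Field.Store.NO));
        iw.addDocument(doc2);
        iw.close();

        PhraseQuery q = new PhraseQuery();
        q.add(new Term("label", "states"));
        q.add(new Term("label", "united"));
        q.add(new Term("label", "america"));
        q.setSlop(10); // larger than the reorder distance, smaller than the gap

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        return searcher.search(q, 10);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run().totalHits); // only doc 1 should match
    }
}
```

Doc 2's three field values end up roughly 100 positions apart, so a slop of 10 can reorder "states united america" within doc 1 but cannot bridge doc 2's gaps.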
> >Hope this helps, >Tri > >On Aug 19, 2014, at 12:17 PM, Jin Guang Zheng wrote: > >Thanks for reply, but won't BooleanQuery return both doc1 and doc2 with >query: > >label:States AND label:America AND label:United > >Best, >Jin > > >On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao wrote: > > > given that example, the easy way is a boolean AND query of all >the terms: > > > > > > >http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html > > > > However, if your corpus is more sophisticated you'll find that >relevance > > ranking is not always that trivial :) > > > > On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng > wrote: > > > > Hi, > > > > I am wondering if someone can help me on this: > > > > I have index: > > > > doc 1 -- label: United States of America > > > > doc 2 -- label: United > > doc 2 -- label: America > > doc 2 -- label: States > > > > I am wondering how to generate a query with terms: states >united america > > > > so only doc 1 returns. > > > > > > I was thinking SpanNearQuery, but can't make it work. > > > > Thanks, > > Jin > > > > -- Uwe Schindler H.-H.-Meier-Allee 63, 28213 Bremen http://www.thetaphi.de
Re: Lucene Query
Whoops, the constraint should be MUST to force all terms present: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST On Aug 19, 2014, at 01:05 PM, "Tri Cao" wrote: OR operator does that, AND only returns docs with ALL terms present. Note that you have two options here 1. Create a BooleanQuery object (see the Java doc I linked below) and programmatically add the term queries with the following constraint: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST_NOT 2. Use the Lucene classic QueryParser and pass in the query string "states AND america AND united" I would suggest 1) if you are going to learn more about Lucene, and 2) if you just want to get something out. Hope this helps, Tri On Aug 19, 2014, at 12:17 PM, Jin Guang Zheng wrote: Thanks for the reply, but won't BooleanQuery return both doc1 and doc2 with the query: label:States AND label:America AND label:United Best, Jin On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao wrote: > given that example, the easy way is a boolean AND query of all the terms: > > > http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html > > However, if your corpus is more sophisticated you'll find that relevance > ranking is not always that trivial :) > > On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng wrote: > > Hi, > > I am wondering if someone can help me on this: > > I have index: > > doc 1 -- label: United States of America > > doc 2 -- label: United > doc 2 -- label: America > doc 2 -- label: States > > I am wondering how to generate a query with terms: states united america > > so only doc 1 returns. > > > I was thinking SpanNearQuery, but can't make it work. > > Thanks, > Jin
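The corrected option 1 can be sketched as follows (Lucene 4.x API; the field name "label" is taken from the thread):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class AllTermsQuery {
    // Build a conjunction: every term MUST be present in the given field.
    public static BooleanQuery build(String field, String... terms) {
        BooleanQuery q = new BooleanQuery();
        for (String t : terms) {
            q.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.MUST);
        }
        return q;
    }

    public static void main(String[] args) {
        System.out.println(build("label", "states", "america", "united"));
        // prints: +label:states +label:america +label:united
    }
}
```

As Jin points out later in the thread, this conjunction alone matches both doc 1 and the multi-valued doc 2; combining it with a positional constraint (Uwe's sloppy phrase) is what separates them.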
Re: Lucene Query
OR operator does that, AND only returns docs with ALL terms present. Note that you have two options here 1. Create a BooleanQuery object (see the Java doc I linked below) and programmatically add the term queries with the following constraint: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanClause.Occur.html#MUST_NOT 2. Use the Lucene classic QueryParser and pass in the query string "states AND america AND united" I would suggest 1) if you are going to learn more about Lucene, and 2) if you just want to get something out. Hope this helps, Tri On Aug 19, 2014, at 12:17 PM, Jin Guang Zheng wrote: Thanks for the reply, but won't BooleanQuery return both doc1 and doc2 with the query: label:States AND label:America AND label:United Best, Jin On Tue, Aug 19, 2014 at 2:07 PM, Tri Cao wrote: > given that example, the easy way is a boolean AND query of all the terms: > > > http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html > > However, if your corpus is more sophisticated you'll find that relevance > ranking is not always that trivial :) > > On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng wrote: > > Hi, > > I am wondering if someone can help me on this: > > I have index: > > doc 1 -- label: United States of America > > doc 2 -- label: United > doc 2 -- label: America > doc 2 -- label: States > > I am wondering how to generate a query with terms: states united america > > so only doc 1 returns. > > > I was thinking SpanNearQuery, but can't make it work. > > Thanks, > Jin
Re: Lucene Query
given that example, the easy way is a boolean AND query of all the terms: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/search/BooleanQuery.html However, if your corpus is more sophisticated you'll find that relevance ranking is not always that trivial :) On Aug 19, 2014, at 11:00 AM, Jin Guang Zheng wrote: Hi, I am wondering if someone can help me on this: I have index: doc 1 -- label: United States of America doc 2 -- label: United doc 2 -- label: America doc 2 -- label: States I am wondering how to generate a query with terms: states united america so only doc 1 returns. I was thinking SpanNearQuery, but can't make it work. Thanks, Jin
Re: Calculate Term Frequency
Erick, Solr's termfreq implementation also uses DocsEnum, with the assumption that frequencies are requested on ascending doc IDs, which is valid when scoring from the hit list. If the frequency is requested for an out-of-order doc, a new DocsEnum has to be created. Bianca, can you explain your use case in more detail? What did you mean by having a new document? A new document is added to the index? Then you already have to reopen the searcher/reader anyway to get a new DocsEnum. On Aug 19, 2014, at 08:26 AM, Erick Erickson wrote: Hmmm, I'm not at all an expert here, but Solr has a function query "termfreq" that does what you're doing, I think? I wonder if the code for that function query would be a good place to copy (or even make use of)? See TermFreqValueSource... Maybe not helpful at all, but... Erick On Tue, Aug 19, 2014 at 7:04 AM, Bianca Pereira wrote: > Hi everybody, > > I would like to know your suggestions to calculate Term Frequency in a > Lucene document. Currently I am using MultiFields.getTermDocsEnum, > iterating through the DocsEnum 'de' returned and getting the frequency with > de.freq() for the desired document. > > My solution gives me the result I want but I am having time issues. For > instance, I want to calculate the term frequency for a given term for N > documents in a sequence. Then, every time I have a new document I have to > retrieve exactly the same DocsEnum again and iterate until I find the > document I want. Of course I cannot cache DocsEnum (yes, I made this huge > mistake) because it is an iterator. > > Do you have any suggestions on how I can get Term Frequency in a fast way? > The only suggestion I had up to now was "Do it programmatically, don't use > Lucene". Should this be the solution? > > Thank you. > > Regards, > Bianca Pereira - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
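The DocsEnum access pattern under discussion can be sketched like this (Lucene 4.x API; a sketch, not Solr's actual TermFreqValueSource code). The key point is that the postings enum is forward-only, so for many documents you should sort the doc IDs ascending and reuse one enum with repeated advance() calls rather than recreating it per document:

```java
import java.io.IOException;

import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRef;

public class TermFreqs {
    // Term frequency of `term` in the document `docID`.
    public static int termFreq(IndexReader reader, String field, String term, int docID)
            throws IOException {
        DocsEnum de = MultiFields.getTermDocsEnum(
                reader, MultiFields.getLiveDocs(reader), field, new BytesRef(term));
        if (de == null) {
            return 0; // term does not occur anywhere in the index
        }
        // advance() skips directly to the first doc >= docID instead of
        // scanning the whole postings list with nextDoc().
        int d = de.advance(docID);
        return (d == docID) ? de.freq() : 0;
    }
}
```

For Bianca's sequential case, hoisting the `getTermDocsEnum` call out of the loop and calling `advance()` once per (sorted) document avoids rebuilding the enum every time.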
Re: deleteDocument with NRT
Solr has the notion of "soft commit" and "hard commit". A soft commit means Solr will reopen a new searcher. A hard commit means a flush to disk. All the update/delete logic is in Lucene; Solr doesn't maintain deleted doc IDs. It does maintain its own caches though. On Jul 14, 2014, at 03:09 AM, Ganesh wrote: How does Solr handle this scenario... Is it reopening the reader after every delete, OR does it maintain the list of deleted documents in cache? Regards Ganesh On 7/11/2014 4:00 AM, Tri Cao wrote: > You need to reopen your searcher after deleting. From the Java doc for > SearcherManager: > > In addition you should periodically call maybeRefresh. > While it's possible to call this just before running each query, this > is discouraged since it penalizes the unlucky queries that do the > reopen. It's better to use a separate background thread, that > periodically calls maybeRefresh. Finally, be sure to call close > once you are done. > > On Jul 10, 2014, at 01:56 PM, Jamie wrote: > > Hi > > I am using NRT search with the SearcherManager class. When the user > > elects to delete some documents, writer.deleteDocuments(terms) is called. > > The problem is that deletes are not immediately visible. What does it > > take to make them so? Even after calling commit(), the deleted > > documents are still returned. > > What is the recommended way to obtain a near real-time search result that > > immediately reflects all deleted documents? > > Much appreciated > > Jamie
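The delete-then-refresh cycle described here can be sketched as follows (Lucene 4.x; the Lucene equivalent of Solr's "soft commit" is the NRT reopen, no hard commit needed for visibility):

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherManager;

public class NrtDelete {
    // Delete documents matching `term` and make the deletes visible by
    // reopening an NRT searcher -- no flush to disk (hard commit) required.
    public static void deleteAndRefresh(IndexWriter writer, SearcherManager mgr, Term term)
            throws Exception {
        writer.deleteDocuments(term);
        mgr.maybeRefresh(); // usually called from a background thread instead
    }

    public static int count(SearcherManager mgr, Query q) throws Exception {
        IndexSearcher s = mgr.acquire();
        try {
            return s.search(q, 1).totalHits;
        } finally {
            mgr.release(s); // always release what you acquire
        }
    }
}
```

The manager would be built as `new SearcherManager(writer, true, new SearcherFactory())`; passing `true` for applyAllDeletes is what makes deletions show up after a refresh.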
Finding words not followed by other words
This is actually a tough problem in general: polysemy and word sense disambiguation. In your case, I think it's more that you'll need to do some named entity resolution to differentiate "George Washington" from "George Washington Carver", as they are two different entities. Do you have a list of all the entity names in your corpus (either manually curated or from some pattern matching)? If you do, one thing you can do is to write a tokenizer that emits one token for each entity. So, for example, the string "George Washington" emits a token like _George_Washington_, "George Washington Carver" emits _George_Washington_Carver_, etc. There are open source NLP libraries that do this, but the quality varies, as it will most likely depend on your domain and training data set. Hope this helps, Tri On Jul 11, 2014, at 07:20 AM, Michael Ryan wrote: I'm trying to solve the following problem... I have 3 documents that contain the following contents: 1: "George Washington Carver blah blah blah." 2: "George Washington blah blah blah." 3: "George Washington Carver blah blah blah. George Washington blah blah blah." I want to create a query that matches documents 2 and 3, but not 1. That is, I want to find documents that mention "George Washington". It's okay if they also mention "George Washington Carver", but I don't want documents that only mention "George Washington Carver". So simply doing something like this does not solve it: "George Washington" NOT "George Washington Carver" Is there a Query type that does this out of the box? I've looked at the various types of span queries, but none of them seem to do this. I think it should be theoretically possible given the position data that Lucene stores... -Michael
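One span-based construction may also be worth testing for this exact pattern, even though it wasn't settled in the thread: SpanNotQuery keeps matches of an inner span that do not overlap an exclusion span, which is "George Washington not followed by Carver" at the position level rather than a boolean NOT over the whole document. A sketch (Lucene 4.x API; the field name "content" is an assumption):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanNotQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

public class NotFollowedBy {
    // Matches "george washington" occurrences that do not overlap a
    // "george washington carver" occurrence. A doc that only ever says
    // "George Washington Carver" produces no match; a doc with a
    // standalone "George Washington" still matches.
    public static SpanQuery build(String field) {
        SpanQuery gw = phrase(field, "george", "washington");
        SpanQuery gwc = phrase(field, "george", "washington", "carver");
        return new SpanNotQuery(gw, gwc);
    }

    private static SpanQuery phrase(String field, String... words) {
        SpanQuery[] clauses = new SpanQuery[words.length];
        for (int i = 0; i < words.length; i++) {
            clauses[i] = new SpanTermQuery(new Term(field, words[i]));
        }
        return new SpanNearQuery(clauses, 0, true); // exact, in-order phrase
    }
}
```

This sidesteps the tokenizer work when the only goal is the positional exclusion; the entity-token approach above is still the better fit when many entity names need consistent handling.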
Re: deleteDocument with NRT
You need to reopen your searcher after deleting. From the Java doc for SearcherManager: In addition you should periodically call maybeRefresh. While it's possible to call this just before running each query, this is discouraged since it penalizes the unlucky queries that do the reopen. It's better to use a separate background thread that periodically calls maybeRefresh. Finally, be sure to call close once you are done. On Jul 10, 2014, at 01:56 PM, Jamie wrote: Hi I am using NRT search with the SearcherManager class. When the user elects to delete some documents, writer.deleteDocuments(terms) is called. The problem is that deletes are not immediately visible. What does it take to make them so? Even after calling commit(), the deleted documents are still returned. What is the recommended way to obtain a near real-time search result that immediately reflects all deleted documents? Much appreciated Jamie
Re: How to handle words that stem to stop words
I think emitting two tokens for "vans" is the right (and potentially the only) way to do it. You could also control the dictionary of terms that require this special treatment. Is there any reason you're not happy with this approach? On Jul 06, 2014, at 11:48 AM, Arjen van der Meijden wrote: Hello list, We have a fairly large Lucene database for a 30+ million post forum. Users post and search for all kinds of things. To make sure users don't have to type exact matches, we combine a WordDelimiterFilter with a (Dutch) SnowballFilter. Unfortunately users sometimes find examples of words that get stemmed to a word that's basically a stop word. Or reversely, where a very common word is stemmed so that it becomes the same as a rare word. We do index stop words, so theoretically they could still find their result. But when a rare word is stemmed in such a way that it yields a million hits, that makes it very unusable... One example is the Dutch word 'van' which is the equivalent of 'of' in English. A user tried to search for the shoe brand 'vans', which gets stemmed to 'van' and obviously gives useless results. I already noticed the 'KeywordRepeatFilter' to index/search both 'vans' and 'van' and the StemmerOverrideFilter to try and prevent these cases. Are there any other solutions for these kinds of problems? Best regards, Arjen van der Meijden
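The KeywordRepeatFilter approach Arjen mentions can be sketched as an analyzer chain (Lucene 4.x; a sketch assuming the Dutch Snowball stemmer from the thread):

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.miscellaneous.KeywordRepeatFilter;
import org.apache.lucene.analysis.miscellaneous.RemoveDuplicatesTokenFilter;
import org.apache.lucene.analysis.snowball.SnowballFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class KeywordRepeatAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        Tokenizer tok = new StandardTokenizer(Version.LUCENE_47, reader);
        TokenStream ts = new LowerCaseFilter(Version.LUCENE_47, tok);
        // Emit each token twice: one copy marked as a keyword...
        ts = new KeywordRepeatFilter(ts);
        // ...which the stemmer leaves alone, so "vans" survives next to "van".
        ts = new SnowballFilter(ts, "Dutch");
        // Collapse the pair again when stemming didn't change anything.
        ts = new RemoveDuplicatesTokenFilter(ts);
        return new TokenStreamComponents(tok, ts);
    }
}
```

A StemmerOverrideFilter placed before the SnowballFilter can additionally pin exact stems for a curated dictionary of problem words, which is the "control the dictionary" option mentioned above.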
Re: Can Lucene based application be made to work with Scaled Elastic Beanstalk environemnt on Amazon Web Services
I would just use S3 as a data push mechanism. In your servlet's init(), you could download the index from S3 and unpack it to a local directory, then initialize your Lucene searcher on that directory. Downloading from S3 to EC2 instances is free, and 5GB would take a minute or two. Also, if you pack the index inside your war file, the new instance has to download that data anyway. The big advantage is it also allows you to update your index without repacking your deployment .war. Just upload the new index to the same location in S3, then restart your webapp :) Hope this helps, Tri On Jun 27, 2014, at 04:13 AM, Paul Taylor wrote: Hi I have a simple WAR based web application that uses Lucene-created indexes to provide search results in an XML format. It works fine locally, but I want to deploy it using Elastic Beanstalk within Amazon Web Services. Problem 1 is that the WAR definition doesn't seem to provide a location for data files (rather than config files), so when I deploy the WAR with EB it doesn't work at first because it has no access to the data (Lucene indexes). However, I solved this by connecting to the underlying EC2 instance and copying the Lucene indexes from S3 to the instance, and ensuring the file location is defined in the WAR's web.xml file. Problem 2 is more problematic. I'm looking at AWS and EB because I wanted a way to deploy the application with little ongoing admin overhead, and I like the way EB does load balancing and auto scaling for you, starting and stopping additional instances as required to meet demand. However, these automatically started instances will not have access to the index files. Possible solutions could be 1. Is there a location I can store the data index within the WAR itself? The index is only 5GB so I do have space on my root disk to store the indexes in the WAR if there is a way to use them; Tomcat would also need to unwar the file at deployment, and I can't see if Tomcat on AWS does this. 2.
A way for EC2 instances to be started with data preloaded in some way (BTW I'm aware of CloudSearch but it's not an avenue I want to go down) Does anybody have any experience of this, please? Paul
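The init()-time download suggested above could look roughly like this (a sketch using the AWS SDK for Java v1; the bucket name, key prefix, and local path are hypothetical, not from the thread):

```java
import java.io.File;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.GetObjectRequest;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.MMapDirectory;

public class SearchServlet extends HttpServlet {
    // Hypothetical names -- adjust to your deployment.
    private static final String BUCKET = "my-search-bucket";
    private static final String PREFIX = "index/current/";

    private volatile IndexSearcher searcher;

    // Map an S3 key like "index/current/_0.cfs" to a local name "_0.cfs".
    static String localName(String key, String prefix) {
        return key.substring(prefix.length());
    }

    @Override
    public void init() throws ServletException {
        try {
            File dir = new File(System.getProperty("java.io.tmpdir"), "lucene-index");
            dir.mkdirs();
            AmazonS3 s3 = new AmazonS3Client(); // uses the instance's IAM role
            // For indexes with >1000 files, page through with
            // listNextBatchOfObjects(); one listing is enough for a sketch.
            for (S3ObjectSummary o : s3.listObjects(BUCKET, PREFIX).getObjectSummaries()) {
                s3.getObject(new GetObjectRequest(BUCKET, o.getKey()),
                             new File(dir, localName(o.getKey(), PREFIX)));
            }
            searcher = new IndexSearcher(DirectoryReader.open(new MMapDirectory(dir)));
        } catch (Exception e) {
            throw new ServletException("failed to fetch index from S3", e);
        }
    }
}
```

Every auto-scaled instance then bootstraps itself identically, and updating the index is just uploading new files under the same prefix and restarting the webapp, as described above.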
Re: search performance
This is an interesting performance problem and I think there is probably not a single answer here, so I'll just lay out the steps I would take to tackle this: 1. What is the variance of the query latency? You said the average is 5 minutes, but is that due to some really bad queries, or do most queries have the same perf? 2. We kind of assume that index size and number of docs is the issue here. Can you validate that assumption by indexing 10M, 50M, … docs and seeing how much worse the performance gets as a function of size? 3. What is the average hit count for the bad queries? If your queries match a lot of hits, scoring will be very expensive. While you only ask for the 1000 top-scored docs, Lucene still needs to score all the hits to get those 1000 docs. If this is the case, there could be some workarounds, but let's make sure that it's indeed the situation we are dealing with here. Hope this helps, Tri On Jun 01, 2014, at 11:50 PM, Jamie wrote: Greetings Despite following all the recommended optimizations (as described at http://wiki.apache.org/lucene-java/ImproveSearchingSpeed), in some of our installations search performance has reached the point where it is unacceptably slow. For instance, in one environment the total index size is 200GB, with 150 million documents indexed. With NRT enabled, search speed is roughly 5 minutes on average. The server resources are: 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux. The only thing we haven't yet done is to upgrade Lucene from 4.7.x to 4.8.x. Is this likely to make any noticeable difference in performance? Clearly, longer term, we need to move to a distributed search model. We thought to take advantage of the distributed search features offered in Solr; however, our solution is very tightly integrated with Lucene directly (since Solr didn't exist when we started out). Moving to Solr now seems like a daunting prospect.
We've also been following the Katta project with interest, but it doesn't appear to support distributed indexing, and development on it seems to have stalled. It would be nice if there were a distributed search project at the Lucene level that we could use. I realize this is a rather vague question, but are there any further suggestions on ways to improve search performance? We need cheap and dirty ideas, as well as longer term advice on a possible path forward. Much appreciated Jamie
Re: maxDoc/numDocs int fields
I ran into this issue before, and after some digging I don't think there is an easy way to accommodate long IDs in Lucene. So I decided to go with sharding documents into multiple indexes. It turned out to be a good decision in my case because I would have to shard the index anyway for performance reasons. (There are queries that require collecting and scoring a large portion of the index.) On Mar 21, 2014, at 09:41 AM, Artem Gayardo-Matrosov wrote: Hi Oli, Thanks for your reply, I thought about this, but it feels like making a crude, inefficient implementation of what's already in Lucene -- CompositeReader, isn't it? It would involve writing my own CompositeCompositeReader which would forward the requests to the underlying CompositeReaders... Is there a better way? Thanks, Artem. On Fri, Mar 21, 2014 at 6:33 PM, Oliver Christ wrote: > Can you split your corpus across multiple Lucene instances? > > Cheers, Oli > > -Original Message- > From: Artem Gayardo-Matrosov [mailto:ar...@gayardo.com] > Sent: Friday, March 21, 2014 12:29 PM > To: java-user@lucene.apache.org > Subject: maxDoc/numDocs int fields > > Hi all, > > I am using Lucene to index a large corpus of text, with every word being a > separate document (this is something I cannot change), and I am hitting a > limitation of the CompositeReader only supporting Integer.MAX_VALUE > documents. > > Is there any way to work around this limitation? For the moment I have > implemented my own DirectoryReader and BaseCompositeReader to at least make > them support documents from Integer.MIN_VALUE to -1 (for twice as many > documents supported); the problem is that all the APIs are restricted to > use the int type, and after the docID value wraps back to 0, I have no way > to restore the original docID. > > -- > Thanks in advance, > Artem. -- Artem.
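The sharding approach amounts to routing each logical (64-bit) document ID to one of N independent indexes and fanning queries out to every shard, merging the per-shard top-k lists at the application level. A minimal sketch of the two pure-logic pieces (not the poster's actual code; the shard count and the `{id, score}` hit encoding are illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ShardRouter {
    // Deterministically route a 64-bit document ID to one of n shards.
    public static int shardFor(long id, int n) {
        // floorMod keeps negative IDs in [0, n) too
        return (int) Math.floorMod(id, (long) n);
    }

    // Merge per-shard hit lists into a global top-k. Each hit is encoded as
    // a double[]{logicalId, score} for brevity of the sketch.
    public static List<double[]> mergeTopK(List<List<double[]>> perShard, int k) {
        List<double[]> all = new ArrayList<double[]>();
        for (List<double[]> hits : perShard) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((double[] h) -> h[1]).reversed());
        return all.subList(0, Math.min(k, all.size()));
    }
}
```

Each shard stays safely under the 2B-document ceiling, and as noted above the split tends to pay for itself in query parallelism anyway.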
Re: How to search for terms containing negation
StandardAnalyzer has a constructor that takes a stop word set, so I guess you can pass it an empty set: http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/standard/StandardAnalyzer.html#StandardAnalyzer(org.apache.lucene.util.Version, org.apache.lucene.analysis.util.CharArraySet) QueryParser is probably ok. I rarely use this parser but I don't think it recognizes "not" in its grammar. Hope this helps, Tri On Mar 17, 2014, at 12:46 PM, Natalia Connolly wrote: Hi Tri, Thank you so much for your message! Yes, it looks like the negation terms have indeed been filtered out; when I query on "no" or "not", I get no results. I am just using StandardAnalyzer and the classic QueryParser: Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_47); QueryParser parser = new QueryParser(Version.LUCENE_47, field, analyzer); Which analyzer/parser would you recommend? Thank you again, Natalia On Mon, Mar 17, 2014 at 3:35 PM, Tri Cao <tm...@me.com> wrote: Natalia, First make sure that your analyzers (both index and query analyzers) do not filter out these as stop words. I think the standard StopFilter list has "no" and "not". You can try to see if your index has these terms by querying for "no" as a TermQuery. If there is no match for that query, then you know for sure they have been filtered out. The next thing to check is your query parser. What query parser are you using? Some parsers actually understand the "not" term and rewrite it to a negation query. Hope this helps, Tri On Mar 17, 2014, at 12:02 PM, Natalia Connolly <natalia.v.conno...@gmail.com> wrote: Hi All, Is there any way I could construct a query that would not automatically exclude negation terms (such as "no", "not", etc)? For example, I need to find strings like "not happy", "no idea", "never available". I tried using a simple analyzer with combinations such as "not AND happy", and similar patterns, but it does not work. Any help would be appreciated! Natalia
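The stop-word-free setup can be sketched in a few lines (Lucene 4.x; the field name "body" is an assumption, and the same analyzer must be used at both index and query time):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class NoStopWords {
    // StandardAnalyzer with an empty stop set: "no", "not", "never", etc.
    // are kept as normal terms instead of being removed.
    public static final Analyzer ANALYZER =
            new StandardAnalyzer(Version.LUCENE_47, CharArraySet.EMPTY_SET);

    public static Query parse(String field, String queryText) throws Exception {
        return new QueryParser(Version.LUCENE_47, field, ANALYZER).parse(queryText);
    }

    public static void main(String[] args) throws Exception {
        // A quoted phrase keeps both terms now that nothing is stopped out.
        System.out.println(parse("body", "\"not happy\""));
    }
}
```

Any index built with the default StandardAnalyzer would need to be reindexed with this analyzer first, since the stop words were dropped at index time as well.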
Re: How to search for terms containing negation
Natalia, First make sure that your analyzers (both index and query analyzers) do not filter out these as stop words. I think the standard StopFilter list has "no" and "not". You can try to see if your index has these terms by querying for "no" as a TermQuery. If there is no match for that query, then you know for sure they have been filtered out. The next thing to check is your query parser. What query parser are you using? Some parsers actually understand the "not" term and rewrite it to a negation query. Hope this helps, Tri On Mar 17, 2014, at 12:02 PM, Natalia Connolly wrote: Hi All, Is there any way I could construct a query that would not automatically exclude negation terms (such as "no", "not", etc)? For example, I need to find strings like "not happy", "no idea", "never available". I tried using a simple analyzer with combinations such as "not AND happy", and similar patterns, but it does not work. Any help would be appreciated! Natalia
Re: IndexWriter croaks on large file
John, Sure, you can add identical documents to the index if you like. I don't think Lucene requires a unique ID field; only Solr does. Lucene documents have internal doc IDs auto-generated when indexing or merging index segments. If I remember correctly, Lucene 4.1 started doing cross-document compression, so if you manage to index similar documents in the same chunk, it may help to reduce your stored fields. Hope this helps, Tri On Feb 19, 2014, at 04:51 AM, John Cecere wrote: Thanks Tri. I've tried a variation of the approach you suggested here and it appears to work well. Just one question. Will there be a problem with adding multiple Document objects to the IndexWriter that have the same field names and values for the StoredFields? They all have different TextFields (the content). I've tried doing this and haven't found any problems with it, but I'm just wondering if there's anything I should be aware of. Regards, John On 2/14/14 4:37 PM, Tri Cao wrote: As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach though :) I do agree that indexing huge documents doesn't seem to have a lot of value; even when you know a doc is a hit for a certain query, how are you going to display the results to users? John, for huge data sets, it's usually a good idea to roll your own distributed indexes, and model your data schema very carefully. For example, if you are going to index log files, one reasonable idea is to make every 5 minutes of logs a document. Regards, Tri On Feb 14, 2014, at 01:20 PM, Glen Newton <glen.new...@gmail.com> wrote: You should consider making each _line_ of the log file a (Lucene) document (assuming it is a log-per-line log file) -Glen On Fri, Feb 14, 2014 at 4:12 PM, John Cecere <john.cec...@oracle.com> wrote: I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have control over the size of the documents that go into my database. Sometimes my customer's log files end up really big. I'm willing to have huge indexes for these things. Wouldn't just changing from int to long for the offsets solve the problem? I'm sure it would probably have to be changed in a lot of places, but why impose such a limitation? Especially since it's using an InputStream and only dealing with a block of data at a time. I'll take a look at your suggestion. Thanks, John On 2/14/14 3:20 PM, Michael McCandless wrote: Hmm, why are you indexing such immense documents? In 3.x Lucene never sanity checked the offsets, so we would silently index negative (int overflow'd) offsets into e.g. term vectors. But in 4.x, we now detect this and throw the exception you're seeing, because it can lead to index corruption when you index the offsets into the postings. If you really must index such enormous documents, maybe you could create a custom tokenizer (derived from StandardTokenizer) that "fixes" the offsets before setting them? Or maybe just doesn't even set them. Note that position can also overflow, if your documents get too large. Mike McCandless http://blog.mikemccandless.com On Fri, Feb 14, 2014 at 1:36 PM, John Cecere <john.cec...@oracle.com> wrote: I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file >2GB in size, it dies with the following exception: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648, endOffset=-2147483647 Essentially, I'm doing this: Directory directory = new MMapDirectory(indexPath); Analyzer analyzer = new StandardAnalyzer(); IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer); IndexWriter iw = new IndexWriter(directory, iwc); InputStream is = ; InputStreamReader reader = new InputStreamReader(is); Document doc = new Document(); doc.add(new StoredField("fileid", fileid)); doc.add(new StoredField("pathname", pathname)); doc.add(new TextField("content", reader)); iw.addDocument(doc); It's the IndexWriter addDocument method that throws the exception. In looking at the Lucene source code, it appears that the offsets being used internally are int, which makes it somewhat obvious why this is happening. This issue never happened when I used Lucene 3.6.0. 3.6.0 was perfectly capable of handling a file over 2GB in this manner. What has changed and how do I get around this? Is Lucene no longer capable of handling files this large, or is there some other way I should be doing this? Here's the full stack trace sans my code: java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648, endOffset=-2147483647 at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.set
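The "chunk the input" advice from the thread can be sketched as follows (a sketch, not the poster's code; the chunk size and the "chunk" field are illustrative). Each chunk becomes its own Document sharing the same stored identification fields, which keeps every per-document offset comfortably below the int limit:

```java
import java.io.BufferedReader;
import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;

public class ChunkedFileIndexer {
    // Index a huge file as many smaller documents, `linesPerDoc` lines each.
    // Returns the number of chunk documents written.
    public static int index(IndexWriter iw, String fileid, String pathname,
                            BufferedReader reader, int linesPerDoc) throws IOException {
        StringBuilder buf = new StringBuilder();
        int lines = 0, chunks = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            buf.append(line).append('\n');
            if (++lines == linesPerDoc) {
                addChunk(iw, fileid, pathname, chunks++, buf.toString());
                buf.setLength(0);
                lines = 0;
            }
        }
        if (buf.length() > 0) {
            addChunk(iw, fileid, pathname, chunks++, buf.toString());
        }
        return chunks;
    }

    private static void addChunk(IndexWriter iw, String fileid, String pathname,
                                 int chunkNo, String content) throws IOException {
        Document doc = new Document();
        doc.add(new StoredField("fileid", fileid));     // identical values on every
        doc.add(new StoredField("pathname", pathname)); // chunk is fine, per the thread
        doc.add(new StoredField("chunk", chunkNo));     // lets you reassemble hits
        doc.add(new TextField("content", content, Field.Store.NO));
        iw.addDocument(doc);
    }
}
```

Search hits then identify a (fileid, chunk) pair instead of a whole multi-gigabyte file, which also makes displaying the matching region to users tractable.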
Re: IndexWriter croaks on large file
As docIDs are ints too, it's most likely he'll hit the limit of 2B documents per index with that approach anyway :) I do agree that indexing huge documents doesn't seem to have a lot of value: even when you know a doc is a hit for a certain query, how are you going to display the results to users?

John, for huge data sets, it's usually a good idea to roll your own distributed indexes and model your data schema very carefully. For example, if you are going to index log files, one reasonable idea is to make every 5 minutes of logs a document.

Regards,
Tri

On Feb 14, 2014, at 01:20 PM, Glen Newton wrote:

You should consider making each _line_ of the log file a (Lucene) document (assuming it is a log-per-line log file).

-Glen

On Fri, Feb 14, 2014 at 4:12 PM, John Cecere wrote:

I'm not sure in today's world I would call 2GB 'immense' or 'enormous'. At any rate, I don't have control over the size of the documents that go into my database. Sometimes my customers' log files end up really big. I'm willing to have huge indexes for these things.

Wouldn't just changing from int to long for the offsets solve the problem? I'm sure it would probably have to be changed in a lot of places, but why impose such a limitation? Especially since it's using an InputStream and only dealing with a block of data at a time.

I'll take a look at your suggestion.

Thanks,
John

On 2/14/14 3:20 PM, Michael McCandless wrote:

Hmm, why are you indexing such immense documents?

In 3.x, Lucene never sanity-checked the offsets, so we would silently index negative (int-overflowed) offsets into e.g. term vectors. But in 4.x we now detect this and throw the exception you're seeing, because it can lead to index corruption when you index the offsets into the postings.

If you really must index such enormous documents, maybe you could create a custom tokenizer (derived from StandardTokenizer) that "fixes" the offsets before setting them? Or maybe just doesn't even set them.

Note that position can also overflow if your documents get too large.

Mike McCandless
http://blog.mikemccandless.com

On Fri, Feb 14, 2014 at 1:36 PM, John Cecere wrote:

I'm having a problem with Lucene 4.5.1. Whenever I attempt to index a file >2GB in size, it dies with the following exception:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648, endOffset=-2147483647

Essentially, I'm doing this:

Directory directory = new MMapDirectory(indexPath);
Analyzer analyzer = new StandardAnalyzer();
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_45, analyzer);
IndexWriter iw = new IndexWriter(directory, iwc);

InputStream is = ...;
InputStreamReader reader = new InputStreamReader(is);

Document doc = new Document();
doc.add(new StoredField("fileid", fileid));
doc.add(new StoredField("pathname", pathname));
doc.add(new TextField("content", reader));
iw.addDocument(doc);

It's the IndexWriter addDocument method that throws the exception. In looking at the Lucene source code, it appears that the offsets being used internally are ints, which makes it somewhat obvious why this is happening.

This issue never happened when I used Lucene 3.6.0; 3.6.0 was perfectly capable of handling a file over 2GB in this manner. What has changed, and how do I get around this? Is Lucene no longer capable of handling files this large, or is there some other way I should be doing this?

Here's the full stack trace sans my code:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, startOffset=-2147483648, endOffset=-2147483647
    at org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl.setOffset(OffsetAttributeImpl.java:45)
    at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:183)
    at org.apache.lucene.analysis.standard.StandardFilter.incrementToken(StandardFilter.java:49)
    at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
    at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:82)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
    at org.apache.lucene.index.DocFieldProcessor.processDocument(DocFieldProcessor.java:248)
    at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:254)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:446)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1551)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1221)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1202)

Thanks,
John

--
John Cecere
Principal Engineer - Oracle Corporation
732-987-4317 / john.cec...@oracle.com

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
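Mike's point about int offsets can be reproduced with a few lines of plain Java: once a tokenizer has consumed more than Integer.MAX_VALUE characters, the next offset wraps around to a negative value, which is exactly the startOffset=-2147483648 in John's stack trace. This is a standalone sketch (OffsetOverflowDemo is not a Lucene class, just an illustration of the arithmetic):

```java
// Demonstrates why a tokenizer tracking character offsets in an int
// produces negative offsets once the input passes ~2 GB of characters.
public class OffsetOverflowDemo {

    // Simulates advancing the running character offset by one token's length.
    static int advance(int offset, int tokenLength) {
        return offset + tokenLength; // silently wraps past Integer.MAX_VALUE
    }

    public static void main(String[] args) {
        int offset = Integer.MAX_VALUE;      // ~2.1 billion chars consumed so far
        int next = advance(offset, 1);       // one more character...
        System.out.println(next);            // -2147483648, as in the exception
    }
}
```

This is also why simply catching the exception doesn't help: every subsequent offset in the stream is already corrupt.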
Re: Collector is collecting more than the specified hits
If I understand correctly, you'd like to short-circuit the execution when you reach the desired number of hits. Unfortunately, I don't think there's a graceful way to do that right now in Collector. To stop collecting, you need to throw an IOException (or a subtype of it) and catch the exception later in your code.

Regards,
Tri

On Feb 14, 2014, at 09:36 AM, saisantoshi wrote:

I am not interested in the scores at all. My requirement is simple: I only need the first 100 hits, or the numHits I specify (irrespective of their scores). The collector should stop after collecting the numHits specified. Is there a way to tell the collector to stop after collecting numHits? Please correct me if I am wrong. I am trying to do the following:

public void collect(int doc) throws IOException {
    if (collector.getTotalHits() <= maxHits) {
        // this way, I can stop it from collecting once getTotalHits exceeds numHits
        delegate.collect(doc);
    }
}

I have to write a separate collector extending Collector because I am not able to get the call to getTotalHits() if I am using PositiveScoresOnlyCollector.
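The "throw to stop" pattern Tri describes can be sketched without any Lucene dependency. EarlyTerminationDemo and CollectTerminatedException below are hypothetical stand-ins for a real Collector and the search loop that drives it; the point is only the control flow of aborting via an exception and catching it afterwards:

```java
import java.io.IOException;

// Sketch of terminating hit collection early by throwing an IOException
// subtype out of collect() and catching it around the search loop.
public class EarlyTerminationDemo {

    // Marker exception used solely to abort the collection loop.
    static class CollectTerminatedException extends IOException {}

    // Simulates a searcher driving collect() for every matching doc.
    static int collectUpTo(int[] docs, int maxHits) {
        int collected = 0;
        try {
            for (int doc : docs) {
                if (collected >= maxHits) {
                    throw new CollectTerminatedException(); // stop the "search"
                }
                collected++; // a real collector would call delegate.collect(doc)
            }
        } catch (CollectTerminatedException expected) {
            // Expected: the hits gathered so far are still valid.
        }
        return collected;
    }

    public static void main(String[] args) {
        System.out.println(collectUpTo(new int[1000], 100)); // prints 100
    }
}
```

With the real API the try/catch wraps the IndexSearcher.search(query, collector) call, since collect() is invoked from inside it.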
Re: incrementally indexing
If you want to index your hard drive, you'll need to keep a copy of the current file system's directory/file structure. Otherwise, you won't be able to remove from your index files that have been deleted.

On Jul 5, 2012, at 12:18 PM, Erick Erickson wrote:

> Hmmm, it's not quite clear what the problem is. But let's
> say you have indexed your hard drive. Somewhere you'll
> have to keep a record of what you've done, say the timestamp
> when you started looking at your hard drive to index it.
>
> Next time you run, you simply index only the files that have changed
> since the last timestamp, assuming you want any changed
> documents on your disk to reflect those changes. That's usually
> what's meant by "incremental indexing": you only add new/changed
> data to your index.
>
> Hope that helps
> Erick
>
> On Wed, Jul 4, 2012 at 7:09 AM, wrote:
>> Hello,
>>
>> First, I beg your pardon for my poor English.
>>
>> I am building a Java application using Lucene 3.6 to index the hard
>> drive. I've read that you can index incrementally, but I can't find
>> how to enable that: every time I index the hard disk it overwrites the
>> existing index and rebuilds it from scratch, with the consequent cost
>> in indexing time.
>>
>> If someone could help me.
>>
>> Regards and thanks in advance
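Erick's timestamp approach might look like the following sketch. IncrementalScan and changedSince are hypothetical names, and the actual IndexWriter addDocument/updateDocument calls (and the deletion bookkeeping mentioned above) are omitted; this only shows selecting the files that need re-indexing:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: find files under a directory modified since the last indexing run,
// so only those are fed back to the IndexWriter ("incremental indexing").
public class IncrementalScan {

    static List<Path> changedSince(Path root, long lastRunMillis) throws IOException {
        List<Path> changed = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(root)) {
            for (Path p : stream) {
                if (Files.isRegularFile(p)
                        && Files.getLastModifiedTime(p).toMillis() > lastRunMillis) {
                    changed.add(p); // candidate for (re-)indexing
                }
            }
        }
        return changed;
    }
}
```

After the run, persist the new timestamp somewhere (a properties file next to the index is enough) so the next scan starts from it. A recursive walk (Files.walk) would be used for nested directories.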
Re: custom scoring
Hi,

After reading through the IndexSearcher code, it seems I have to do the following:

- implement a custom Collector to collect not just the doc IDs and scores, but the fields I care about as well
- extend ScoreDoc to hold the extra fields
- when I get back a TopDocs from a search() call, go through the TopDocs and apply the constraints I need to

I think this will work, but I have some concerns about performance. What would you think?

Thanks,
Tri.

On Apr 06, 2012, at 10:06 AM, Tri Cao wrote:

Hi all,

What would be the best approach for a custom scoring that requires a "global" view of the result set? For example, I have a field called "color", and I would like constraints such that there are at most 3 docs with color:red and 4 docs with color:blue in the first 16 hits. The items should still be sorted by their relevance scores after the constraints are applied.

Thanks,
Tri.
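The third step of the plan above (the constraint pass over TopDocs) could be sketched as follows. Hit and firstPage are hypothetical names standing in for the extended ScoreDoc and the post-search filtering; the list is assumed to already be in descending score order:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: walk ranked hits in score order and cap how many docs of each
// "color" may appear on the first page, preserving relative score order.
public class ConstrainedRerank {

    record Hit(int doc, String color) {}

    static List<Hit> firstPage(List<Hit> ranked, Map<String, Integer> caps, int pageSize) {
        Map<String, Integer> seen = new HashMap<>();
        List<Hit> page = new ArrayList<>();
        for (Hit h : ranked) {                       // ranked: sorted by score desc
            if (page.size() == pageSize) break;
            int limit = caps.getOrDefault(h.color(), Integer.MAX_VALUE);
            int count = seen.getOrDefault(h.color(), 0);
            if (count < limit) {
                seen.put(h.color(), count + 1);
                page.add(h);                         // score order is preserved
            }
        }
        return page;
    }
}
```

The performance concern is real but bounded: if the caps only apply to the first 16 hits, collecting a modestly larger TopDocs (say, a few hundred) and filtering is cheap compared to the search itself; skipped docs simply fall to later pages.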
custom scoring
Hi all,

What would be the best approach for a custom scoring that requires a "global" view of the result set? For example, I have a field called "color", and I would like constraints such that there are at most 3 docs with color:red and 4 docs with color:blue in the first 16 hits. The items should still be sorted by their relevance scores after the constraints are applied.

Thanks,
Tri.