Re: full text as input ?
Hunter Peress wrote: is it efficient and feasible to use Lucene to do full-text comparisons? E.g. take an entire text that's reasonably large (e.g. more than 10 words) and find the result set within the Lucene search index that is statistically similar to all the text.

I do this kind of stuff all the time, no problem. I think this came up a month ago - probably appears monthly. For another variation search for "MoreLikeThis" in the list - it's code I mailed in that I haven't, yet, checked in.

Anyway, if you want to search for docs that are similar to a source document, you can call this method to generate a similarity query. 'srch' is the source doc, 'a' is your analyzer, 'field' is the field that stores the body, e.g. "contents", and 'stop' is an optional Set of stop words to ignore as an optimization - it's not needed if the Analyzer ignores stop words, but if you keep stop words you might still want to ignore them in this kind of query as they probably won't help.

public static Query formSimilarQuery(String srch, Analyzer a, String field, Set stop)
        throws org.apache.lucene.queryParser.ParseException, IOException {
    TokenStream ts = a.tokenStream(field, new StringReader(srch));
    org.apache.lucene.analysis.Token t;
    BooleanQuery tmp = new BooleanQuery();
    Set already = new HashSet(); // terms seen so far, to skip duplicates
    while ((t = ts.next()) != null) {
        String word = t.termText();
        if (stop != null && stop.contains(word))
            continue; // ignore stop words
        if (!already.add(word))
            continue; // term already in the query
        TermQuery tq = new TermQuery(new Term(field, word));
        tmp.add(tq, false, false); // optional clause: not required, not prohibited
    }
    // tbd, from the Lucene in Action book
    // https://secure.manning.com/catalog/view.php?book=hatcher2&item=source
    // exclude myself
    //likeThisQuery.add(new TermQuery(
    //        new Term("isbn", doc.get("isbn"))), false, true);
    return tmp;
}

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
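The heart of formSimilarQuery is the stop-filter/dedup loop: keep each unique non-stop term, in first-seen order, and turn it into one optional clause. Stripped of the Lucene classes, that logic can be sketched in plain Java (the token list and stop set below are made-up illustrations, not from the post):

```java
import java.util.*;

public class TermCollector {
    // Collect unique, non-stop terms in first-seen order, as formSimilarQuery
    // does with its 'already' set. Each surviving term would become one
    // optional TermQuery in the similarity query.
    static List<String> collectTerms(List<String> tokens, Set<String> stop) {
        Set<String> already = new LinkedHashSet<String>();
        for (String word : tokens) {
            if (stop != null && stop.contains(word)) continue; // skip stop words
            already.add(word); // a Set silently drops duplicate terms
        }
        return new ArrayList<String>(already);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("the", "quick", "fox", "the", "fox", "jumps");
        Set<String> stop = new HashSet<String>(Arrays.asList("the"));
        System.out.println(collectTerms(tokens, stop)); // [quick, fox, jumps]
    }
}
```

Making every clause optional (rather than required) is what lets partially overlapping documents still match and be ranked by how many terms they share.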
Re: stop words and index size
: The corpus is the English Wikipedia, and I indexed the title and body of : the articles. I used a list of 525 stop words. : : With stopwords removed the index is 227MB. : With stopwords kept the index is 331MB.

That doesn't seem horribly surprising. Consider that for every Term in the index, Lucene is keeping track of the list of <doc, freq> pairs for every document that contains that term. Assume that something has to be in at least 25% of the docs before you decide it's worth making it a stop word. Your URL indicates you are dealing with 400k docs, which means that for each stop word, the space needed to store the int pairs is...

(4B + 4B) * 100,000 =~ 780KB (per stop word Term, minimum)

...not counting any indexing structures that may be used internally to improve the lookup of a Term. Assuming some of those words are in more or less than 25% of your documents, that could easily account for a difference of 100MB.

I suspect that an interesting exercise would be to use some of the code I've seen tossed around on this list that lets you iterate over all Terms and find the most common ones to help you determine your stopword list programmatically. Then remove/reindex any documents that have each word as you add it to your stoplist (one word at a time) and watch your index shrink.

-Hoss
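Hoss's back-of-the-envelope figure checks out. A quick sanity check of the arithmetic, using only the numbers given in the post (400k docs, a 25% threshold, and two 4-byte ints per posting):

```java
public class PostingSpace {
    // Rough bytes needed for one term's <docId, freq> posting list:
    // two 4-byte ints per document containing the term.
    static long postingBytes(long docsWithTerm) {
        return docsWithTerm * (4 + 4);
    }

    public static void main(String[] args) {
        long totalDocs = 400000;           // figure from the URL in the post
        long docsWithTerm = totalDocs / 4; // assume a stop word is in 25% of docs
        long kb = postingBytes(docsWithTerm) / 1024;
        System.out.println(kb + " KB per stop-word term, minimum"); // ~781 KB
    }
}
```

With 525 stop words, even a fraction of them near that density easily adds up to the ~100MB difference observed.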
full text as input ?
is it efficient and feasible to use Lucene to do full-text comparisons? E.g. take an entire text that's reasonably large (e.g. more than 10 words) and find the result set within the Lucene search index that is statistically similar to all the text.
stop words and index size
Does anyone know how much stop words are supposed to affect the index size? I did an experiment of building an index once with, and once without, stop words. The corpus is the English Wikipedia, and I indexed the title and body of the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB. With stopwords kept the index is 331MB.

Thus, the index grows by 45% in this case, which I found surprising, as I expected it to not grow as much. I haven't dug into the details of the Lucene file formats but thought compression (field/term vector/sparse lists/vints) would negate the effect of stopwords to a large extent. Some more details + a link to my stopword list are here: http://www.searchmorph.com/weblog/index.php?id=36

-- Dave
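For the record, the 45% growth figure follows directly from the two reported sizes (it is really about 45.8%):

```java
public class IndexGrowth {
    // Percentage growth from the smaller index size to the larger one.
    static double growthPercent(double withoutMB, double withMB) {
        return (withMB - withoutMB) / withoutMB * 100.0;
    }

    public static void main(String[] args) {
        // sizes reported in the post: 227MB without stop words, 331MB with
        System.out.printf("index grows by %.1f%%%n", growthPercent(227.0, 331.0));
    }
}
```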
RE: Multi-threading problem: couldn't delete segments
On Thu, 2005-01-13 at 12:33, David Townsend wrote: > Just read your old post. I'm not quite sure whether I've read this correctly. > Is the search worker thread also doing deletes from the index? > > "a test script is going that is hitting the search > part of our application (I think the script also updates and deletes > Documents, but I am not sure.)" > > Deleting also locks the index, so maybe the IndexWriter is waiting for the > search thread to release the lock.

I checked with my co-worker, and his script is doing a search, modifying assets (which deletes and re-inserts) and then deleting them. This is going on while new Documents are being added to the index from another thread. (Due to some weirdness in our application, it is also trying to delete Documents that don't exist before inserting them -- should be harmless, though.)

I control access to the index with a lock object during all write accesses to the index, including deletes. You can see the code here: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=2068605&attachId=1

Luke
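The "lock object during all write accesses" approach Luke describes is just a shared monitor that every add/delete path must enter before touching the index. A minimal stand-alone sketch of the pattern (no Lucene classes; the counter and worker threads are hypothetical stand-ins for the indexing and deleting threads):

```java
public class WriteLockDemo {
    private static final Object indexLock = new Object();
    static int writes = 0; // stand-in for mutations of the index

    static void writeToIndex() {
        synchronized (indexLock) { // serialize adds and deletes behind one monitor
            writes++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        writes = 0;
        Thread[] workers = new Thread[4]; // stand-ins for indexing/deleting threads
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) writeToIndex();
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
        System.out.println(writes); // 4000 - no lost updates
    }
}
```

Note this only protects against concurrent *writes* within one process; as the rest of the thread discusses, a reader holding a file open is a separate problem that the monitor does not address.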
RE: Multi-threading problem: couldn't delete segments
On Thu, 2005-01-13 at 12:25, David Townsend wrote: > The problem could be you're writing to an index with multiple processes. This > can happen if you're using a shared file system (NFS?). We saw this problem > when we had two IndexWriters getting access to a single index at the same > time. Usually if you're working on a single machine the file locks prevent > this from happening.

No, there is a single process with multiple threads (synchronized). The filesystem is NTFS.

Luke
RE: Multi-threading problem: couldn't delete segments
Just read your old post. I'm not quite sure whether I've read this correctly. Is the search worker thread also doing deletes from the index?

"a test script is going that is hitting the search part of our application (I think the script also updates and deletes Documents, but I am not sure.)"

Deleting also locks the index, so maybe the IndexWriter is waiting for the search thread to release the lock.

-Original Message- From: David Townsend Sent: 13 January 2005 18:26 To: 'Lucene Users List' Subject: RE: Multi-threading problem: couldn't delete segments

The problem could be you're writing to an index with multiple processes. This can happen if you're using a shared file system (NFS?). We saw this problem when we had two IndexWriters getting access to a single index at the same time. Usually if you're working on a single machine the file locks prevent this from happening.

-Original Message- From: Luke Francl [mailto:[EMAIL PROTECTED] Sent: 13 January 2005 18:13 To: Lucene Users List Subject: Re: Multi-threading problem: couldn't delete segments

I didn't get any response to this post so I wanted to follow up (you can read the full description of my problem in the archives: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=11986). Here's an additional piece of information: I wrote a small program to confirm that on Windows, you can't rename a file while another thread has it open. If I am performing a search, is it possible that the IndexReader is holding open the "segments" file when there is an attempt by my indexing code to overwrite it with File.renameTo()?

Thanks, Luke Francl

On Thu, 2005-01-06 at 17:43, Luke Francl wrote: > We are having a problem with Lucene in a high concurrency > create/delete/search situation. I thought I fixed all these problems, > but I guess not.
RE: Multi-threading problem: couldn't delete segments
The problem could be you're writing to an index with multiple processes. This can happen if you're using a shared file system (NFS?). We saw this problem when we had two IndexWriters getting access to a single index at the same time. Usually if you're working on a single machine the file locks prevent this from happening.

-Original Message- From: Luke Francl [mailto:[EMAIL PROTECTED] Sent: 13 January 2005 18:13 To: Lucene Users List Subject: Re: Multi-threading problem: couldn't delete segments

I didn't get any response to this post so I wanted to follow up (you can read the full description of my problem in the archives: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=11986). Here's an additional piece of information: I wrote a small program to confirm that on Windows, you can't rename a file while another thread has it open. If I am performing a search, is it possible that the IndexReader is holding open the "segments" file when there is an attempt by my indexing code to overwrite it with File.renameTo()?

Thanks, Luke Francl

On Thu, 2005-01-06 at 17:43, Luke Francl wrote: > We are having a problem with Lucene in a high concurrency > create/delete/search situation. I thought I fixed all these problems, > but I guess not.
Re: Multi-threading problem: couldn't delete segments
I didn't get any response to this post so I wanted to follow up (you can read the full description of my problem in the archives: http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]&msgNo=11986). Here's an additional piece of information: I wrote a small program to confirm that on Windows, you can't rename a file while another thread has it open. If I am performing a search, is it possible that the IndexReader is holding open the "segments" file when there is an attempt by my indexing code to overwrite it with File.renameTo()?

Thanks, Luke Francl

On Thu, 2005-01-06 at 17:43, Luke Francl wrote: > We are having a problem with Lucene in a high concurrency > create/delete/search situation. I thought I fixed all these problems, > but I guess not.
Search failed with a "File not found" error
I was indexing at the time and I was under the impression that was safe, but it looks like the indexer may have removed a file that the search was trying to access. Is there something I should be doing to lock the index?

Thanks, Jim.

java.io.FileNotFoundException: /db/lucene/oasis/Clarify_Closed/_2meu.fnm (No such file or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:200)
    at org.apache.lucene.store.FSInputStream$Descriptor.<init>(FSDirectory.java:376)
    at org.apache.lucene.store.FSInputStream.<init>(FSDirectory.java:405)
    at org.apache.lucene.store.FSDirectory.openFile(FSDirectory.java:268)
    at org.apache.lucene.index.FieldInfos.<init>(FieldInfos.java:53)
    at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:109)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:94)
    at org.apache.lucene.index.IndexReader$1.doBody(IndexReader.java:122)
    at org.apache.lucene.store.Lock$With.run(Lock.java:109)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:111)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:95)
Ref: Re: Ref: Re: IndexSearcher and number of occurrences
Great, thanks for your help, I understand things quickly but I need lots of explanation .. ;-)

For whoever is interested .. I was using:

int id = hits.doc(i);

instead of:

int id = hits.id(i);

Tchõ
Bertrand

On Jan 13, 2005, at 10:17 AM, Bertrand VENZAL wrote: > Hi, > > Thanks for your quick answer, I understood what you meant by using the > IndexSearcher to get the TermFreqVector. But, you use an int as an id to > find the term frequency so I suppose that it is the position number in the > IndexReader vector. > My problem is: during the indexing phase, I can store the id, but if a > document is deleted and recreated later on (like in an update), this will > change my vector and all the ids previously set will be no more correct. > Am I right on this point? or am I missing something ...

Yes, the Document id (the one Lucene uses) is not to be relied on long-term. But in the example you'd get it from Hits immediately after a search, and thus it would be accurate and usable. You do not need to store the id during indexing - Lucene maintains it and gives it to you from Hits.

Erik
calculate score
Hello, how does Lucene calculate the score of a given document? The class DefaultSimilarity contains some parts of this formula (e.g. tf, idf), but how do these parts work together?

Thanks, Michael Scholz
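The pieces combine multiplicatively per query term and are summed over the terms of the query. Schematically (as of the Lucene 1.x DefaultSimilarity defaults, simplified and ignoring query normalization):

```latex
\mathrm{score}(q,d) \;=\; \mathrm{coord}(q,d)\,\sum_{t \in q}
  \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d)
```

where, per DefaultSimilarity: tf(t,d) = sqrt(freq of t in d); idf(t) = 1 + ln(numDocs / (docFreq(t) + 1)); coord(q,d) rewards documents matching more of the query's terms (overlap / maxOverlap); and norm(t,d) folds in the field's lengthNorm, 1/sqrt(number of terms in the field), so shorter fields score higher for the same tf.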
Re: Ref: Re: IndexSearcher and number of occurrences
On Jan 13, 2005, at 10:17 AM, Bertrand VENZAL wrote:

Hi, Thanks for your quick answer, I understood what you meant by using the IndexSearcher to get the TermFreqVector. But, you use an int as an id to find the term frequency so I suppose that it is the position number in the IndexReader vector. My problem is: during the indexing phase, I can store the id, but if a document is deleted and recreated later on (like in an update), this will change my vector and all the ids previously set will be no more correct. Am I right on this point? or am I missing something ...

Yes, the Document id (the one Lucene uses) is not to be relied on long-term. But in the example you'd get it from Hits immediately after a search, and thus it would be accurate and usable. You do not need to store the id during indexing - Lucene maintains it and gives it to you from Hits.

Erik
Ref: Re: IndexSearcher and number of occurrences
Hi, Thanks for your quick answer, I understood what you meant by using the IndexSearcher to get the TermFreqVector. But, you use an int as an id to find the term frequency so I suppose that it is the position number in the IndexReader vector. My problem is: during the indexing phase, I can store the id, but if a document is deleted and recreated later on (like in an update), this will change my vector and all the ids previously set will be no more correct. Am I right on this point? or am I missing something ... thanks ...

From: Erik Hatcher
To: "Lucene Users List"
Subject: Re: IndexSearcher and number of occurrences

On Jan 13, 2005, at 5:03 AM, Bertrand VENZAL wrote: > Hi all, > > I'm quite new in this mailing list. I've many difficulties to find the > number of occurrences of a word in a document. I need to use IndexSearcher > because of the query, but the score returned is not what I'm looking for. > I found in the mailing list the class TermDocs but it seems to work only > with IndexReader. > > If anyone can give a hand on this one, I will appreciate ...

Perhaps this technique is what you're looking for: set the field(s) you're interested in capturing frequency on to be vectored. You'll see that flag as additional overloaded methods on the Field. You'll still need to use an IndexReader, but that is no problem. Construct an IndexReader and use it to construct the IndexSearcher that you'll also use. Here's some snippets of code:

// During indexing, "subject" field was added like this:
doc.add(Field.UnStored("subject", subject, true));
...
// now during searching...
IndexReader reader = IndexReader.open(directory);
...
// from your Hits, get the document id
int id = hits.id(i);
TermFreqVector vector = reader.getTermFreqVector(id, "subject");

Now read up on the TermFreqVector API to get at the frequency of a specific term.

Erik
Re: MySql and Lucene
On Thu, 2005-01-13 at 12:36 +0100, Daniel Cortes wrote: > I want to know your opinion about this: > > I've a new portal, and Lucene is the search engine. This portal is an > integration of a lot of open-source software. > phpBB (MySQL) is our choice for the forum, and I have to make the > searches done with the search engine include the forum. > I think that I have 2 options: > - Every new post in the forum is indexed in both MySQL and the Lucene > index (storing fields that I want to show in the results, for > example author, title, date, ...). > It means that I've almost a total copy of the MySQL data in my Lucene index. > - Or do the search with Lucene and afterwards do a SQL query in the > servlet, but then how do I show the results? I can't show first the Lucene > results and after them the forum's results. > Any idea? > thks

If space wasn't an issue I would just duplicate the data in Lucene because that makes things easiest. If space is a concern you could store the post's primary key in Lucene as the only stored field. Then do a search on Lucene, get the list of matching posts and pull out the rest of the information from MySQL.

-- Miles Barr <[EMAIL PROTECTED]> Runtime Collective Ltd.
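Miles's second option (store only the primary key in Lucene, then pull the rows from MySQL) has one subtlety worth showing: you must re-impose Lucene's score order when you hydrate the rows, since a SQL `IN (...)` query gives no ordering guarantee. A self-contained sketch of that join, with a HashMap standing in for the phpBB posts table (all ids and titles here are hypothetical):

```java
import java.util.*;

public class HydrateInScoreOrder {
    // Walk the hit ids in Lucene's ranking order and look each row up,
    // rather than trusting whatever order the database returned.
    static List<String> hydrate(List<String> hitIds, Map<String, String> rows) {
        List<String> ordered = new ArrayList<String>();
        for (String id : hitIds) ordered.add(rows.get(id));
        return ordered;
    }

    public static void main(String[] args) {
        // post ids as a Lucene search would return them, best match first
        List<String> hitIds = Arrays.asList("42", "7", "19");

        // stand-in for: SELECT post_id, post_subject FROM posts WHERE post_id IN (...)
        Map<String, String> rows = new HashMap<String, String>();
        rows.put("7", "Re: search plugin");
        rows.put("19", "Forum integration");
        rows.put("42", "Lucene vs MySQL fulltext");

        System.out.println(hydrate(hitIds, rows));
        // [Lucene vs MySQL fulltext, Re: search plugin, Forum integration]
    }
}
```

With this pattern the Lucene index stays tiny (one stored key per post) and MySQL remains the single source of truth for display fields.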
MySql and Lucene
I want to know your opinion about this:

I've a new portal, and Lucene is the search engine. This portal is an integration of a lot of open-source software. phpBB (MySQL) is our choice for the forum, and I have to make the searches done with the search engine include the forum. I think that I have 2 options:

- Every new post in the forum is indexed in both MySQL and the Lucene index (storing fields that I want to show in the results, for example author, title, date, ...). It means that I've almost a total copy of the MySQL data in my Lucene index.
- Or do the search with Lucene and afterwards do a SQL query in the servlet, but then how do I show the results? I can't show first the Lucene results and after them the forum's results.

Any idea? thks
Re: IndexSearcher and number of occurrences
Bertrand VENZAL writes: > > I'm quite new in this mailing list. I've many difficulties to find the > number of occurrences of a word in a document. I need to use IndexSearcher > because of the query, but the score returned is not what I'm looking for. > I found in the mailing list the class TermDocs but it seems to work only > with IndexReader. >

The use of a searcher does not prevent the use of a reader (in fact the searcher relies on a reader). So I'd use the searcher to find the document and a reader to get the frequency using IndexReader.termDocs. Depending on how many frequencies you're interested in, the term vector support might be of interest.

HTH Morus
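Conceptually, the frequency that IndexReader.termDocs hands back for a (term, document) pair is just the number of times the term occurred in that document's token stream at indexing time. Stripped of the Lucene classes, that is nothing more than (the token list below is a made-up example):

```java
import java.util.*;

public class TermFrequency {
    // Count occurrences of one term in a document's token stream -
    // the number Lucene stores per posting and exposes via termDocs.
    static int freq(List<String> docTokens, String term) {
        int n = 0;
        for (String t : docTokens) {
            if (t.equals(term)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(freq(doc, "be")); // 2
    }
}
```

The difference is that Lucene precomputes these counts at indexing time, so termDocs is a lookup rather than a scan over the document text.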
Re: IndexSearcher and number of occurrences
On Jan 13, 2005, at 5:03 AM, Bertrand VENZAL wrote:

Hi all, I'm quite new in this mailing list. I've many difficulties to find the number of occurrences of a word in a document. I need to use IndexSearcher because of the query, but the score returned is not what I'm looking for. I found in the mailing list the class TermDocs but it seems to work only with IndexReader. If anyone can give a hand on this one, I will appreciate ...

Perhaps this technique is what you're looking for: set the field(s) you're interested in capturing frequency on to be vectored. You'll see that flag as additional overloaded methods on the Field. You'll still need to use an IndexReader, but that is no problem. Construct an IndexReader and use it to construct the IndexSearcher that you'll also use. Here's some snippets of code:

// During indexing, "subject" field was added like this:
doc.add(Field.UnStored("subject", subject, true));
...
// now during searching...
IndexReader reader = IndexReader.open(directory);
...
// from your Hits, get the document id
int id = hits.id(i);
TermFreqVector vector = reader.getTermFreqVector(id, "subject");

Now read up on the TermFreqVector API to get at the frequency of a specific term.

Erik
IndexSearcher and number of occurrences
Hi all, I'm quite new in this mailing list. I've many difficulties to find the number of occurrences of a word in a document. I need to use IndexSearcher because of the query, but the score returned is not what I'm looking for. I found in the mailing list the class TermDocs but it seems to work only with IndexReader. If anyone can give a hand on this one, I will appreciate ...

Tchõ Bertrand