Re: Lucene and Eastern languages (Japanese, Korean and Chinese)
On Tuesday 24 July 2007 at 13:01 -0700, Shaw, James wrote:
> Hi, guys,
> I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
> the Snowball stemmers only include European languages. Does stemming
> not make sense for ideograph-based languages (i.e., no stemming is
> needed for Japanese, Korean and Chinese)?

No.

> Also for spell checking, does the default Lucene SpellChecker work for
> Japanese, Korean and Chinese? Does edit distance make sense for these
> languages?

Japanese uses groups of ideograms, and Levenshtein distance doesn't make much sense when words are only a few characters long, but I'm not a CJK expert.

M.
Re: What replaced org.apache.lucene.document.Field.Text?
Hi Andy, I think:

Field.Text("name", "value");

has been replaced with:

new Field("name", "value", Field.Store.YES, Field.Index.TOKENIZED);

Patrick

On 25/07/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

Please reference "How do I get code written for Lucene 1.4.x to work with Lucene 2.x?" http://wiki.apache.org/lucene-java/LuceneFAQ#head-86d479476c63a2579e867b75d4faa9664ef6cf4d

Andy

-Original Message-
From: Lindsey Hess [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 25, 2007 12:31 PM
To: Lucene
Subject: What replaced org.apache.lucene.document.Field.Text?

I'm trying to get some relatively old Lucene code to compile (please see below), and it appears that Field.Text has been deprecated. Can someone please suggest what I should use in its place? Thank you.

Lindsey

public static void main(String args[]) throws Exception {
    String indexDir = System.getProperty("java.io.tmpdir", "tmp") + System.getProperty("file.separator") + "address-book";
    Analyzer analyzer = new WhitespaceAnalyzer();
    boolean createFlag = true;
    IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);
    Document contactDocument = new Document();
    contactDocument.add(Field.Text("type", "individual"));
    contactDocument.add(Field.Text("name", "Zane Pasolini"));
    contactDocument.add(Field.Text("address", "999 W. Prince St."));
    contactDocument.add(Field.Text("city", "New York"));
    contactDocument.add(Field.Text("province", "NY"));
    contactDocument.add(Field.Text("postalcode", "10013"));
    contactDocument.add(Field.Text("country", "USA"));
    contactDocument.add(Field.Text("telephone", "1-212-345-6789"));
    writer.addDocument(contactDocument);
    writer.close();
}
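For reference, here is the same snippet ported to the Lucene 2.x API as a minimal sketch. Field.Text(name, value) stored and tokenized the value, which maps to Field.Store.YES plus Field.Index.TOKENIZED in 2.x; everything else is unchanged from Lindsey's code:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class AddressBookIndexer {
        public static void main(String[] args) throws Exception {
            String indexDir = System.getProperty("java.io.tmpdir", "tmp")
                + System.getProperty("file.separator") + "address-book";
            Analyzer analyzer = new WhitespaceAnalyzer();
            boolean createFlag = true;
            IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag);

            Document contactDocument = new Document();
            // Field.Text(name, value) == stored + tokenized in the 2.x API
            contactDocument.add(new Field("type", "individual", Field.Store.YES, Field.Index.TOKENIZED));
            contactDocument.add(new Field("name", "Zane Pasolini", Field.Store.YES, Field.Index.TOKENIZED));
            contactDocument.add(new Field("address", "999 W. Prince St.", Field.Store.YES, Field.Index.TOKENIZED));
            contactDocument.add(new Field("city", "New York", Field.Store.YES, Field.Index.TOKENIZED));
            contactDocument.add(new Field("province", "NY", Field.Store.YES, Field.Index.TOKENIZED));
            contactDocument.add(new Field("postalcode", "10013", Field.Store.YES, Field.Index.TOKENIZED));
            contactDocument.add(new Field("country", "USA", Field.Store.YES, Field.Index.TOKENIZED));
            contactDocument.add(new Field("telephone", "1-212-345-6789", Field.Store.YES, Field.Index.TOKENIZED));

            writer.addDocument(contactDocument);
            writer.close();
        }
    }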
Re: Search for null
You will be unable to search for fields that do not exist, which is what you originally wanted to do; instead you can do something like:

-Establish the query that will select all non-null values

TermQuery tq1 = new TermQuery(new Term("field","value1"));
TermQuery tq2 = new TermQuery(new Term("field","value2"));
...
TermQuery tqn = new TermQuery(new Term("field","valuen"));
BooleanQuery query = new BooleanQuery();
query.add(tq1, BooleanClause.Occur.SHOULD);
query.add(tq2, BooleanClause.Occur.SHOULD);
...
query.add(tqn, BooleanClause.Occur.SHOULD);

OR perhaps a range query if your values are contiguous

Term start = new Term("field","198805");
Term end = new Term("field","198810");
Query query = new RangeQuery(start, end, true);

OR just use the QueryParser

Query query = QueryParser.parse(parseCriteria, "field", new StandardAnalyzer());

-Create the QueryFilter

QueryFilter queryFilter = new QueryFilter(query);

-flip the bits

final BitSet filterBitSet = queryFilter.bits(reader);
filterBitSet.flip(0, filterBitSet.size());

Now you have a filter that contains the documents matching the opposite of what the query specified, and you can use it in subsequent queries.

Dan

On Tue, 2007-07-24 at 09:40 -0700, Jay Yu wrote:
> daniel rosher wrote:
> > Perhaps you can use a filter in the following way.
> > -Create a filter (via QueryFilter) that would contain all documents that
> > do not have null values for the field
> Interesting: what does the QueryFilter look like? Isn't it just as hard
> as finding out which docs have null values for the field?
> I would really like to know your trick here.
> > -flip the bits of the filter so that it now contains documents that have
> > null values for a field
> > -Use the filter in conjunction with subsequent queries.
> > This would also help with performance, as filters are simply bitsets and
> > can cheaply be stored, generated once and used often.
> > Dan
> > On Mon, 2007-07-23 at 13:57 -0700, Jay Yu wrote:
> >> If you want performance, a better way might be to assign some special
> >> string/value (if it's easy to create) to the missing field of docs and
> >> index the field without tokenizing it. Then you may search for that
> >> special value to find the docs.
> >> Jay
> >> Les Fletcher wrote:
> >>> Does this particular range query have any significant performance issues?
> >>> Les
> >>> Erik Hatcher wrote:
> On Jul 23, 2007, at 11:32 AM, testn wrote:
> > Is it possible to search for documents where the specified field
> > doesn't exist or the field value is null?
> This is from Solr, so I'm not sure off the top of my head if this mojo
> applies by itself, but a search for -fieldname:[* TO *] will result in
> all documents that do not have the specified field.
> Erik
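Putting Dan's steps together, here is a minimal self-contained sketch of the inverted-filter idea, assuming Lucene 2.x; the index path, field name and term values are hypothetical, and the BitSet is cloned before flipping because QueryFilter caches the bits it hands out per reader:

    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Filter;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.QueryFilter;
    import org.apache.lucene.search.TermQuery;

    public class NullFieldFilterDemo {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical location
            IndexSearcher searcher = new IndexSearcher(reader);

            // A query matching every document that has some (non-null) value
            // in "field"; the enumerated values are hypothetical
            BooleanQuery nonNull = new BooleanQuery();
            nonNull.add(new TermQuery(new Term("field", "value1")), BooleanClause.Occur.SHOULD);
            nonNull.add(new TermQuery(new Term("field", "value2")), BooleanClause.Occur.SHOULD);

            // Clone before flipping: QueryFilter caches the BitSet per reader
            BitSet bits = (BitSet) new QueryFilter(nonNull).bits(reader).clone();
            bits.flip(0, reader.maxDoc()); // set bits now mark docs where "field" is null/missing
            final BitSet nullBits = bits;

            Filter nullFilter = new Filter() {
                public BitSet bits(IndexReader r) {
                    return nullBits; // only valid for the reader it was built from
                }
            };

            // Use the inverted filter with any subsequent query
            Hits hits = searcher.search(new TermQuery(new Term("title", "lucene")), nullFilter);
            System.out.println(hits.length() + " matching docs with a null field");
        }
    }

Note the flipped bits are tied to the IndexReader they were built from, so the filter has to be rebuilt whenever the reader is reopened.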
Recovering from a Crash
We were affected by the great SF outage yesterday, and apparently the indexing machine crashed without being shut down properly.

I've taken a backup of the indexes, which has the usual smattering of write.lock, segments.gen, .cfs, .fdt, .fnm, .fdx etc. files and looks to be about the right size. However, if I start up my indexer with that directory it shrinks to a fraction of its size (500 times smaller) and (obviously) contains virtually no documents.

The data appears to be there - please tell me that I'm doing something stupid and I can recover from this.

Simon
Re: Recovering from a Crash
"Simon Wistow" <[EMAIL PROTECTED]> wrote: > We were affected by the great SF outage yesterday and apparently the > indexing machine crashed without being shutdown properly. Eek, sorry! We are so reliant on electricity these days > I've taken a backup of the indexes which has the usual smattering of > write.lock segments.gen, .cfs, .fdt, .fnm and .fdx etc files and looks > to be about the right size. Hmm, how do you do your backups? Is there a segments_N file present in the backup? It's somewhat spooky that you have a write.lock present because that means you backed up while a writer was actively writing to the index which is a bit dangerous because if the timing is unlucky (backup does an "ls" but before it can copy the segments_N file a commit has happened) you could fail to copy a segments_N file. It's best to either pause the writer for backpus to occur (simplest) or make a custom deletion policy that safely allows the backup to slowly copy even while indexing is continuing (advanced). > However, if I start up my indexer with that directory it shrinks to a > fraction of its size (500 times smaller) and (obviously) contains > virtually no documents. It seems like the segments_N file may be missing? > The data appears to be there - please tell me that I'm doing something > stupid and I can recover from this. Maybe try other (older) backups to see if they have the segments_N file? Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Recovering from a Crash
On Wed, Jul 25, 2007 at 10:08:56AM +0100, me said:
> The data appears to be there - please tell me that I'm doing something
> stupid and I can recover from this.

It appears that by deleting the write.lock files everything has recovered.

Is this best practice? Or have I just done something so terribly wrong that I've brought about the end of the universe?
Re: Lucene and Eastern languages (Japanese, Korean and Chinese)
Mathieu Lecarme schrieb:
> On Tuesday 24 July 2007 at 13:01 -0700, Shaw, James wrote:
>> Hi, guys,
>> I found Analyzers for Japanese, Korean and Chinese, but not stemmers;
>> the Snowball stemmers only include European languages. Does stemming
>> not make sense for ideograph-based languages (i.e., no stemming is
>> needed for Japanese, Korean and Chinese)?
> No.

This is not quite correct: Chinese doesn't need any stemming, but Japanese is not completely ideograph-based and it could use stemming. I doubt anyone has done this, besides some commercial software for the Japanese market. I don't know about Korean.

>> Also for spell checking, does the default Lucene SpellChecker work for
>> Japanese, Korean and Chinese? Does edit distance make sense for these
>> languages?
> Japanese uses groups of ideograms, and Levenshtein distance doesn't make
> much sense when words are only a few characters long, but I'm not a CJK expert.
>
> M.

Edit distance only seems to work with Latin-character-based (written) languages. Spell checking Chinese, Japanese (and Korean?) is more or less pointless, as they are entered using input methods, which should produce "correct" words.

Best regards,
Max

--
Maximilian Hütter
blue elephant systems GmbH
Wollgrasweg 49
D-70599 Stuttgart
Tel: (+49) 0711 - 45 10 17 578
Fax: (+49) 0711 - 45 10 17 573
e-mail : [EMAIL PROTECTED]
Sitz : Stuttgart, Amtsgericht Stuttgart, HRB 24106
Geschäftsführer: Joachim Hörnle, Thomas Gentsch, Holger Dietrich
Re: Recovering from a Crash
On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said:
> It's somewhat spooky that you have a write.lock present because that
> means you backed up while a writer was actively writing to the index
> which is a bit dangerous because if the timing is unlucky (backup does
> an "ls" but before it can copy the segments_N file a commit has
> happened) you could fail to copy a segments_N file. It's best to
> either pause the writer while backups occur (simplest) or make a
> custom deletion policy that safely allows the backup to slowly copy
> even while indexing is continuing (advanced).

Sorry, I should have been clearer - I took the backup of the state of the index when the machine restarted after the crash. I did have another backup from a day or so ago, but I was hoping not to have to reindex a day's worth of data (which is a lot).

Our backup strategy is currently:

1) Stop the writer (and let write tasks queue up)
2) cp -lr indexes indexes-
3) Restart the writer

Is this something approximating best practice?

Simon
Re: Recovering from a Crash
"Simon Wistow" <[EMAIL PROTECTED]> wrote: > On Wed, Jul 25, 2007 at 05:19:31AM -0400, Michael McCandless said: > > It's somewhat spooky that you have a write.lock present because that > > means you backed up while a writer was actively writing to the index > > which is a bit dangerous because if the timing is unlucky (backup does > > an "ls" but before it can copy the segments_N file a commit has > > happened) you could fail to copy a segments_N file. It's best to > > either pause the writer for backpus to occur (simplest) or make a > > custom deletion policy that safely allows the backup to slowly copy > > even while indexing is continuing (advanced). > > Sorry, I should have been clearer - I took the backup of the state of > the index when the machine restarted after the crash. I did have another > backup from a day or so ago but I was hoping to not have to reindex a > days worth of data (which is alot). Ahhh, OK. But do you have a segments_N file? > Our backup strategy is currently - > 1) Stop the writer (and let write tasks queue up) > 2) cp -lr indexes indexes- > 3) Restart the writer > > Is this something approximating best practice? Yes, this is perfect. This is the "simple" option I described. The more complex option is to use a custom deletion policy which enables you to safely do backups (even if the copy process is slow) without pausing the write task (indexing). Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Recovering from a Crash
> > The data appears to be there - please tell me that I'm doing something
> > stupid and I can recover from this.
>
> It appears by deleting the write.lock files everything has recovered.

Hmmm -- it's odd that the existence of the write.lock caused you to lose most of your index. All that should have happened here is that, on creating a new writer, it would throw a LockObtainTimedOut exception saying it could not obtain the write lock. I don't see how this would cause most of your index to be deleted...

> Is this best practice? Have I just done something so terribly wrong
> that I've brought about the end of the universe?

Universe seems intact on my end :) But yes, deleting the write.lock is the right thing to do in this case. You can also switch to native locking (NativeFSLockFactory); then the OS would free the lock, so you would not have to delete the write.lock manually...

Mike
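For reference, a minimal sketch of switching to native locking; the index location is hypothetical, and the constructor details assume Lucene 2.1+, where LockFactory and NativeFSLockFactory were introduced:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.NativeFSLockFactory;

    public class NativeLockDemo {
        public static void main(String[] args) throws Exception {
            File indexDir = new File("/path/to/index"); // hypothetical location
            // Native OS locks are released when the JVM process exits (even on
            // a crash), so a stale write.lock cannot survive a restart.
            Directory dir = FSDirectory.getDirectory(indexDir, new NativeFSLockFactory(indexDir));
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
            // ... index documents ...
            writer.close();
        }
    }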
Re: Recovering from a Crash
On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said:
> Ahhh, OK. But do you have a segments_N file?

Yup.

> Yes, this is perfect. This is the "simple" option I described. The
> more complex option is to use a custom deletion policy which enables
> you to safely do backups (even if the copy process is slow) without
> pausing the write task (indexing).

I vaguely remember seeing something about that go past.

Is there any documentation on custom deletion policies? Or example code for such a beast? At the moment, at any given point, we have to have disk space for 3x the index size: the index, the backup we've just taken, and the previous backup we're just about to delete. Since our indexes are large, even 2x is quite an issue.

I've read through JIRA LUCENE-710, but a more point-and-drool explanation would be useful to someone who hasn't been up all night :)

Simon
Re: Recovering from a Crash
"Simon Wistow" <[EMAIL PROTECTED]> wrote: > On Wed, Jul 25, 2007 at 05:49:41AM -0400, Michael McCandless said: > > Ahhh, OK. But do you have a segments_N file? > > Yup. OK, though I still don't understand why the existence of "write.lock" caused you to lose most of your index on creating a new writer. > > Yes, this is perfect. This is the "simple" option I described. The > > more complex option is to use a custom deletion policy which enables > > you to safely do backups (even if the copy process is slow) without > > pausing the write task (indexing). > > I vaguely remember seeing something about that going past. > > Is there any documentation on custom deletion policies? Or example code > for such a beast? At the moment at any given point we have to have disk > space to allow for 3x Index size - index, backup we've just taken and > previous backup we're just about to delete. Since our indexes are large > even 2x is quite an issue. > > I've read through JIRA LUCENE-710 but a more point-and-drool explanation > would be useful to someone who hasn't been up all night :) Good question ... there is no good documentation, sample code, etc., for this as of yet ... I've been secretly hoping the first person who creates this deletion policy would share it :) I don't think it's very difficult to create. If that doesn't happen sometime soon I'll try to make time to create an example. Mike - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: StandardTokenizer is slowing down highlighting a lot
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets. JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years ago :) switched to JFlex, which for roughly the same grammar would sometimes be up to 10x (!) faster. You can have a look at our JFlex specification at: http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup This one seems more complex than the StandardAnalyzer's but it's much faster anyway. If anyone is interested, I could prepare a JFlex based Analyzer equivalent (to the extent possible) to current StandardAnalyzer, which might offer nice indexing and highlighting speed-ups. Best, Staszek -- Stanislaw Osinski, [EMAIL PROTECTED] http://www.carrot-search.com
Which field matched ?
This problem has been baffling me for quite some time now, and has no perfect solution in the forum!

I have 10 documents, each with 10 fields with "parameterName and parameterValue". Now, when I search for some term and get 5 hits, how do I find out which paramName-value pair matched?

I am seeking an optimal solution for this. Explanation, the highlighter, etc. are some of the solutions, but not the best, since the highlighter performs very badly for wildcard queries, and Explanation is generally not a nice way of doing this! I am talking really large datasets here.

Any help highly appreciated.
Re: StandardTokenizer is slowing down highlighting a lot
I would be very interested. I have been playing around with ANTLR to see if it is any faster than JavaCC, but haven't seen great gains in my simple tests. I had not considered trying JFlex. I am sure a faster StandardAnalyzer would be greatly appreciated. StandardAnalyzer appears widely used and is horrendously slow. Even better would be a StandardAnalyzer that could have different recognizers enabled/disabled. For example, dropping NUM recognition, if you don't need it, gains something like a 25% speed-up in the current StandardAnalyzer.

- Mark

Stanislaw Osinski wrote:

Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets.

JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years ago :) switched to JFlex, which for roughly the same grammar would sometimes be up to 10x (!) faster. You can have a look at our JFlex specification at: http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup

This one seems more complex than the StandardAnalyzer's, but it's much faster anyway. If anyone is interested, I could prepare a JFlex-based Analyzer equivalent (to the extent possible) to the current StandardAnalyzer, which might offer nice indexing and highlighting speed-ups.

Best, Staszek
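For anyone wanting to measure this themselves, a rough micro-benchmark sketch; the field name and sample text are arbitrary, and a serious measurement would use representative documents and a warmed-up JVM:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class AnalyzerBenchmark {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < 200; i++) {
                sb.append("The quick brown fox, item 4711, jumps over http://example.com today. ");
            }
            String text = sb.toString();

            long start = System.currentTimeMillis();
            long tokens = 0;
            for (int i = 0; i < 1000; i++) {
                TokenStream ts = analyzer.tokenStream("contents", new StringReader(text));
                while (ts.next() != null) {
                    tokens++; // consume the stream, just as indexing would
                }
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(tokens + " tokens in " + elapsed + " ms");
        }
    }

Swap in another Analyzer to compare; tokens per millisecond is the number to watch.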
Re: Which field matched ?
Currently, we use regular-expression pattern matching to get hold of which field matched. Again a poor solution, since we have to agree on a subset of Lucene's query syntax that the pattern matching can handle; we cannot use Boolean queries etc. in this case.

makkhar wrote:
>
> This problem has been baffling me for quite some time now, and has no
> perfect solution in the forum!
>
> I have 10 documents, each with 10 fields with "parameterName and
> parameterValue". Now, when I search for some term and get 5 hits, how do I
> find out which paramName-value pair matched?
>
> I am seeking an optimal solution for this. Explanation, the highlighter, etc.
> are some of the solutions, but not the best, since the highlighter performs
> very badly for wildcard queries, and Explanation is generally not a nice way
> of doing this! I am talking really large datasets here.
>
> Any help highly appreciated.
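For completeness, and with the caveat the thread already raises (explain() is expensive on large result sets), here is a rough sketch of the per-field Explanation check; the index path, field names and term are all hypothetical:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class WhichFieldMatched {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index"); // hypothetical
            String[] fields = { "param1", "param2", "param3" };           // hypothetical
            String term = "searchterm";

            BooleanQuery query = new BooleanQuery();
            for (int f = 0; f < fields.length; f++) {
                query.add(new TermQuery(new Term(fields[f], term)), BooleanClause.Occur.SHOULD);
            }

            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                int docId = hits.id(i);
                for (int f = 0; f < fields.length; f++) {
                    // A per-field sub-query scores above 0 only if that field matched
                    Explanation exp = searcher.explain(new TermQuery(new Term(fields[f], term)), docId);
                    if (exp.getValue() > 0.0f) {
                        System.out.println("doc " + docId + " matched on " + fields[f]);
                    }
                }
            }
        }
    }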
Re: StandardTokenizer is slowing down highlighting a lot
I am sure a faster StandardAnalyzer would be greatly appreciated.

I'm increasing the priority of that task, then :)

StandardAnalyzer appears widely used and horrendously slow. Even better would be a StandardAnalyzer that could have different recognizers enabled/disabled. For example, dropping NUM recognition if you don't need it in the current StandardAnalyzer gains like 25% speed.

That's a good idea, though I'd need to check whether, in JFlex's case, the grammar makes a considerable performance difference.

Staszek

--
Stanislaw Osinski, [EMAIL PROTECTED]
http://www.carrot-search.com
Re: StandardTokenizer is slowing down highlighting a lot
On Jul 25, 2007, at 7:19 AM, Stanislaw Osinski wrote:

Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar as it is already about as simple as it gets.

JavaCC is slow indeed. We used it for a while for Carrot2, but then (3 years ago :) switched to JFlex, which for roughly the same grammar would sometimes be up to 10x (!) faster. You can have a look at our JFlex specification at: http://carrot2.svn.sourceforge.net/viewvc/carrot2/trunk/carrot2/components/carrot2-util-tokenizer/src/org/carrot2/util/tokenizer/parser/jflex/JFlexWordBasedParserImpl.jflex?view=markup

This one seems more complex than the StandardAnalyzer's but it's much faster anyway. If anyone is interested, I could prepare a JFlex based Analyzer equivalent (to the extent possible) to current StandardAnalyzer, which might offer nice indexing and highlighting speed-ups.

+1. I think a lot of people would be interested in a faster StandardAnalyzer.
Lucene Highlighter linkage Error
Hello! I am working with Tomcat. I have put the Lucene highlighter.jar in the lib folder, and I have created an extra CSS rule that gives the search word a yellow background, so the search word should now be highlighted. I have a dataTable into which the result of the following Lucene method is loaded:

[code]
public void search(String q, File index, String[] fields, ArrayList subresult, int numresults) throws Exception {
    Directory fsDir = FSDirectory.getDirectory(index, false);
    IndexSearcher is = new IndexSearcher(fsDir);
    Analyzer analyzer = new StandardAnalyzer();
    Fragmenter fragmenter = new SimpleFragmenter(100);
    QueryParser queryparser = new MultiFieldQueryParser(fields, analyzer);
    Query query = queryparser.parse(q);
    Hits hits = is.search(query);
    IndexReader reader = null;
    query = query.rewrite(reader); // note: reader is null here
    QueryScorer scorer = new QueryScorer(query);
    SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("", ""); // the tag arguments were lost in transmission
    Highlighter high = new Highlighter(formatter, scorer);
    high.setTextFragmenter(fragmenter);
    numresults = numresults == -1 || numresults > hits.length() ? hits.length() : numresults;
    String rating = "";
    // loop condition reconstructed; the '<' and '>' comparisons were eaten in transit
    for (int i = 0; i < numresults; i++) {
        if (hits.score(i) > schwelli) {
            float f = hits.score(i);
            if (0.9f <= f) { rating = "**"; }
            else if (0.8f <= f && f < 0.9f) { rating = "*"; }
            else if (0.7f <= f && f < 0.8f) { rating = ""; }
            else if (0.6f <= f && f < 0.7f) { rating = "***"; }
            else if (0.5f <= f && f < 0.6f) { rating = "**"; }
            else if (f <= 0.5f) { rating = "*"; }
            Document doc = hits.doc(i);
            String abstracts = doc.get("ABSTRACTS");
            String title = doc.get("TITLE");
            TokenStream abstract_stream = analyzer.tokenStream(q, new StringReader(abstracts));
            TokenStream title_straem = analyzer.tokenStream(q, new StringReader(title));
            String fragment_abstract = high.getBestFragments(abstract_stream, abstracts, 5, "...");
            String fragment_title = high.getBestFragments(title_straem, title, 5, "...");
            if (fragment_title.length() == 0) {
                setAusgabeTitle(doc.get("TITLE"));
            } else {
                setAusgabeTitle(fragment_title);
            }
            if (fragment_abstract.length() == 0) {
                setAusgabeAbstract(doc.get("ABSTRACTS"));
            } else {
                setAusgabeAbstract(fragment_abstract);
            }
            //list.add(i+1+"\t"+q+"\t"+doc.get(entry_medline)+"\t"+hits.score(i)+"\t"+abstract_stream+"\t"+title_straem+"\t"+"MEDLINE");
            /*int No = i; subresult.add((new Integer(No)).toString());*/
            subresult.add(doc.get(entry_medline));
Re: Fine Tuning Lucene implementation
Hey Guys,

I need to know how I can use the HitCollector class. I am using Hits and looping over all the possible document hits (it turns out I am looping 92 times; for 300 searches, that's 300*92!). Can I avoid this using HitCollector? I can't seem to understand how it's used.

thanks a lot,

Askar

On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote:
>
> Askar,
> why do you need to add +id:?
> thanks,
> dt,
> www.ejinz.com
> search engine news forms
> - Original Message -
> From: "Askar Zaidi" <[EMAIL PROTECTED]>
> To: ; <[EMAIL PROTECTED]>
> Sent: Wednesday, July 25, 2007 12:39 AM
> Subject: Re: Fine Tuning Lucene implementation
>
> > Hey Hira,
> >
> > Thanks so much for the reply. Much appreciate it.
> >
> > Quote:
> >
> > Would it be possible to just include a query clause?
> > - i.e., instead of just contents:<term>, also add +id:<your id>
> >
> > How can I do that?
> >
> > I see my query as:
> >
> > +contents:harvard +contents:business +contents:review
> >
> > where the search phrase was: harvard business review
> >
> > Now how can I add +id:<your id>?
> >
> > This would give me that one exact document I am looking for, for that id. I
> > don't have to iterate through hits.
> >
> > thanks,
> >
> > Askar
> >
> > On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >>
> >> I'm no expert on this (so please accept the comments in that context)
> >> but 2 things seem weird to me:
> >>
> >> 1. Iterating over each hit is an expensive proposition. I've often
> >> seen people recommending a HitCollector.
> >>
> >> 2. It seems that doBodySearch() is essentially saying, do this search
> >> and return the score pertinent to this ID (using an exhaustive loop).
> >> Would it be possible to just include a query clause?
> >> - i.e., instead of just contents:<term>, also add +id:<your id>
> >>
> >> In general though, I think your algorithm seems inefficient (if I
> >> understand it correctly): if I want to search for one term among 3 in
> >> a "collection" of 300 documents (as defined by some external attribute),
> >> I will wind up executing 300 x 3 searches, and for each search that is
> >> executed, I will iterate over every Hit, even if I've already found the
> >> one that I "care about".
> >>
> >> What would break if you:
> >> 1. Included "creator" in the Lucene index (or, filtered out the Hits
> >> using a BitSet or something like it)
> >> 2. Executed 1 search
> >> 3. Collected the results of the first N Hits (where N is some
> >> reasonable limit, like 100 or 500)
> >>
> >> -h
> >>
> >> On Tue, 2007-07-24 at 20:14 -0400, Askar Zaidi wrote:
> >>
> >> > Sure.
> >> > public float doBodySearch(Searcher searcher, String query, int id){
> >> >     try{
> >> >         score = search(searcher, query, id);
> >> >     }
> >> >     catch(IOException io){}
> >> >     catch(ParseException pe){}
> >> >     return score;
> >> > }
> >> >
> >> > private float search(Searcher searcher, String queryString, int id)
> >> >         throws ParseException, IOException {
> >> >     // Build a Query object
> >> >     QueryParser queryParser = new QueryParser("contents", new KeywordAnalyzer());
> >> >     queryParser.setDefaultOperator(QueryParser.Operator.AND);
> >> >     Query query = queryParser.parse(queryString);
> >> >     // Search for the query
> >> >     Hits hits = searcher.search(query);
> >> >     Document doc = null;
> >> >     // Examine the Hits object to see if there were any matches
> >> >     int hitCount = hits.length();
> >> >     for (int i = 0; i < hitCount; i++) {
> >> >         doc = hits.doc(i);
> >> >         String str = doc.get("item");
> >> >         int tmp = Integer.parseInt(str);
> >> >         if (tmp == id)
> >> >             score = hits.score(i);
> >> >     }
> >> >     return score;
> >> > }
> >> >
> >> > I really need to optimize doBodySearch(...) as this takes the most time.
> >> >
> >> > thanks guys,
> >> > Askar
> >> >
> >> > On 7/24/07, N. Hira <[EMAIL PROTECTED]> wrote:
> >> >
> >> >     Could you show us the relevant source from doBodySearch()?
> >> >
> >> >     -h
> >> >
> >> >     On Tue, 2007-07-24 at 19:58 -0400, Askar Zaidi wrote:
> >> >     > I ran some tests and it seems that the slowness is from
> >> >     Lucene calls when I
> >> >     > do "doBodySearch"; if I remove that call, Lucene gives me
> >> >     results in 5
> >> >     > seconds, otherwise it takes about 50 seconds.
> >> >     >
> >> >     > But I need to do Body search and that field contains lots of
> >> >     text. The field
> >> >     > is . How can I optimize that?
> >> >     >
> >> >     > thanks,
> >> >     > Askar
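To answer the HitCollector question directly: subclass it, pass it to Searcher.search(Query, HitCollector), and collect() is called once for every matching document. A rough sketch of this particular use case, paired with FieldCache so the itemID check is an array lookup rather than a document load; the "item" field name comes from the posted code, wantedId is hypothetical, and the field must be indexed un-tokenized with one integer value per document:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.HitCollector;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ScoreForItem {
        public static float scoreForItem(IndexSearcher searcher, IndexReader reader,
                                         Query query, final int wantedId) throws IOException {
            // Loaded once per reader and cached; maps doc number -> item id
            final int[] itemIds = FieldCache.DEFAULT.getInts(reader, "item");
            final float[] result = new float[] { 0.0f };
            searcher.search(query, new HitCollector() {
                public void collect(int doc, float score) {
                    // Called for every matching doc, in doc-id order
                    if (itemIds[doc] == wantedId) {
                        result[0] = score;
                    }
                }
            });
            return result[0];
        }
    }

One caveat: collect() sees raw scores, whereas Hits normalizes scores to at most 1.0, so any threshold logic tuned against Hits.score() would need adjusting.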
Re: Fine Tuning Lucene implementation
Hi Askar,

I suggest we take a step back and ask the question: what are you trying to accomplish? That is, what is your application trying to do? Forget the code, etc.; just explain what you want the end result to be, and we can work from there. Based on what you have described, I am not sure you need access to the hits. It seems like you just need to make better queries.

Is your itemID a unique identifier? If yes, then you shouldn't need to loop over hits at all, as you should only ever have one result IF your query contains a required term. Also, if this is the case, why do you need to do a search at all? Haven't you already identified the items of interest when you did your select query in the database? Or is it that you want to score the item based on some terms as well? If that is the case, there are other ways of doing this and we can discuss them.

-Grant

On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote:

Hey Guys,

I need to know how I can use the HitCollector class. I am using Hits and looping over all the possible document hits (it turns out I am looping 92 times; for 300 searches, that's 300*92!). Can I avoid this using HitCollector? I can't seem to understand how it's used.

thanks a lot,

Askar
Re: Fine Tuning Lucene implementation
Hi Grant,

Thanks for the response. Here's what I am trying to accomplish:

1. Iterate over itemID (unique) in the database using one SQL query.
2. For every itemID found, run 4 searches on the Lucene index.
3. doTagSearch(itemID); collect score
4. doTitleSearch(itemID...); collect score
5. doSummarySearch(itemID...); collect score
6. doBodySearch(itemID); collect score

These scores are then added and I get a total score for each unique item in the database.

Lucene Index has: So if I am running a body search, I have 92 hits from over 300 documents for a query. I already know my hit with the .

For instance, from step (1), if itemID 16 is passed to all the 4 searches, I just need to get the score of the document which has itemID field = 16. I don't have to iterate over all the hits.

I suppose I have to change my query to look for where itemID=16. Can you guide me as to how to do it?

thanks a ton,

Askar

On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> Hi Askar,
>
> I suggest we take a step back and ask the question: what are you
> trying to accomplish? That is, what is your application trying to
> do? Forget the code, etc.; just explain what you want the end result
> to be and we can work from there. Based on what you have described,
> I am not sure you need access to the hits. It seems like you just
> need to make better queries.
>
> Is your itemID a unique identifier? If yes, then you shouldn't need
> to loop over hits at all, as you should only ever have one result IF
> your query contains a required term. Also, if this is the case, why
> do you need to do a search at all? Haven't you already identified
> the items of interest when you did your select query in the
> database? Or is it that you want to score the item based on some
> terms as well? If that is the case, there are other ways of doing
> this and we can discuss them.
>
> -Grant
Re: Search for null
What if I do not know all the possible values of that field, which is the typical case in free-text search?

daniel rosher wrote:

You will be unable to search for fields that do not exist, which is what you originally wanted to do; instead you can do something like:

-Establish the query that will select all non-null values

TermQuery tq1 = new TermQuery(new Term("field","value1"));
...
TermQuery tqn = new TermQuery(new Term("field","valuen"));
BooleanQuery query = new BooleanQuery();
query.add(tq1, BooleanClause.Occur.SHOULD);
...
query.add(tqn, BooleanClause.Occur.SHOULD);

OR perhaps a range query if your values are contiguous, OR just use the QueryParser.

-Create the QueryFilter

QueryFilter queryFilter = new QueryFilter(query);

-flip the bits

final BitSet filterBitSet = queryFilter.bits(reader);
filterBitSet.flip(0, filterBitSet.size());

Now you have a filter that contains the documents matching the opposite of what the query specified, and you can use it in subsequent queries.

Dan
Re: Fine Tuning Lucene implementation
So, you really want a single Lucene score (based on the scores of your 4 fields) for every itemID, correct? And this score consists of scoring the title, tag, summary and body against some keywords, correct?

Here's what I would do:

while (rs.next())
{
    doc = getDocument(itemId); // Get your document, including contents, from your database; no need even to put them in Lucene, although you could add the doc to a MemoryIndex (see contrib/memory)
    Run your 4 searches against that memory index to get your score. Even better, combine your query into a single query that searches all 4 fields at once; then Lucene will combine the score for you
}

MemoryIndex info can be found at http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html

-Grant

On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote:

Hi Grant,

Thanks for the response. Here's what I am trying to accomplish:

1. Iterate over itemID (unique) in the database using one SQL query.
2. For every itemID found, run 4 searches on the Lucene index.
3. doTagSearch(itemID); collect score
4. doTitleSearch(itemID...); collect score
5. doSummarySearch(itemID...); collect score
6. doBodySearch(itemID); collect score

These scores are then added and I get a total score for each unique item in the database.

Lucene Index has: So if I am running a body search, I have 92 hits from over 300 documents for a query. I already know my hit with the .

For instance, from step (1), if itemID 16 is passed to all the 4 searches, I just need to get the score of the document which has itemID field = 16. I don't have to iterate over all the hits.

I suppose I have to change my query to look for where itemID=16. Can you guide me as to how to do it?

thanks a ton,

Askar
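A rough sketch of the MemoryIndex suggestion; the field names follow the thread, the four strings stand in for whatever the select query pulls from the database, and MemoryIndex lives in Lucene's contrib/memory jar:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class ItemScorer {
        public static float scoreItem(String title, String tag, String summary,
                                      String body, String keywords) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();

            // One throwaway in-memory "document" per item
            MemoryIndex index = new MemoryIndex();
            index.addField("title", title, analyzer);
            index.addField("tag", tag, analyzer);
            index.addField("summary", summary, analyzer);
            index.addField("contents", body, analyzer);

            // One query over all four fields; Lucene combines the per-field scores
            String[] fields = { "title", "tag", "summary", "contents" };
            QueryParser parser = new MultiFieldQueryParser(fields, analyzer);
            Query query = parser.parse(keywords);

            return index.search(query); // 0.0f means no match
        }
    }

This avoids the 300 x 4 searches against the main index entirely: each item is scored in isolation as it streams out of the database.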
Re: Fine Tuning Lucene implementation
Instead of refactoring the code, would there be a way to just modify the query in each search routine?

Such as "search contents:<keywords> and item:<itemID>"; this means it would just collect the score of that one document whose itemID field = the itemID passed from while (rs.next()).

I just need to collect the score of the document already in the index.

Would there be a way to modify the query? Add a clause?

thanks,
Askar

On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> So, you really want a single Lucene score (based on the scores of
> your 4 fields) for every itemID, correct? And this score consists of
> scoring the title, tag, summary and body against some keywords, correct?
>
> Here's what I would do:
>
> while (rs.next())
> {
>     doc = getDocument(itemId); // Get your document, including
> contents, from your database; no need even to put them in Lucene,
> although you could add the doc to a MemoryIndex (see contrib/memory)
>     Run your 4 searches against that memory index to get your
> score. Even better, combine your query into a single query that
> searches all 4 fields at once; then Lucene will combine the score for you
> }
>
> MemoryIndex info can be found at http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html
>
> -Grant
Re: Fine Tuning Lucene implementation
Yes, you can do that.

On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote:

Here's what I mean: http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields

title:"The Right Way" AND text:go

Although I am not searching for the title "the right way", I am looking for the score by specifying a unique field (itemID). When I do System.out.println(query); I get:

+contents:Harvard +contents:Business +contents:Review

Can I just add:

+contents:Harvard +contents:Business +contents:Review +itemID=id ??

That query would just return one document.
Haven't you already identified the items of interest when you did your select query in the database? Or is it that you want to score the item based on some terms as well? If that is the case, there are other ways of doing this and we can discuss them. -Grant On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote: Hey Guys, I need to know how I can use the HitCollector class. I am using Hits and looping over all the possible document hits (turns out it's 92 times I am looping; for 300 searches, it's 300*92 !!). Can I avoid this using HitCollector? I can't seem to understand how it's used. thanks a lot, Askar On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote: Askar, why do you need to add +id:? thanks, dt, www.ejinz.com search engine news forms - Original Message - From: "Askar Zaidi" <[EMAIL PROTECTED]> To: ; <[EMAIL PROTECTED]> Sent: Wednesday, July 25, 2007 12:39 AM Subject: Re: Fine Tuning Lucene implementation Hey Hira, Thanks so much for the reply. Much appreciate it. Quote: Would it be possible to just include a query clause? - i.e., instead of just contents:, also add +id: How can I do that? I see my query as: +contents:harvard +contents:business +contents:review where the search phrase was: harvard business review Now how can I add +id: ?? This would give me that one exact document I am looking for, for that id. I d
Re: Fine Tuning Lucene implementation
Here's what I mean: http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields title:"The Right Way" AND text:go Although I am not searching for the title "the right way", I am looking for the score by specifying a unique field (itemID). When I do System.out.println(query); I get: +contents:Harvard +contents:Business +contents:Review Can I just add: +contents:Harvard +contents:Business +contents:Review +itemID=id ?? That query would just return one document. On 7/25/07, Askar Zaidi <[EMAIL PROTECTED]> wrote: > > Instead of refactoring the code, would there be a way to just modify the > query in each search routine? > > Such as, "search contents: and item:"; This means it would > just collect the score of that one document whose itemID field = itemID > passed from while (rs.next()). > > I just need to collect the score of the already in the index. > > Would there be a way to modify the query? Add a clause? > > thanks, > Askar > > > On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > > > So, you really want a single Lucene score (based on the scores of > > your 4 fields) for every itemID, correct? And this score consists of > > scoring the title, tag, summary and body against some keywords, correct? > > > > Here's what I would do: > > > > while (rs.next()) > > { > > doc = getDocument(itemId); // Get your document, including > > contents from your database; no need even to put them in Lucene, > > although you could add the doc to a MemoryIndex (see contrib/memory) > > Run your 4 searches against that memory index to get your > > score. Even better, combine your query into a single query that > > searches all 4 fields at once, then Lucene will combine the score for > > you > > } > > > > MemoryIndex info can be found at > > http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html > > > > -Grant > > > > On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote: > > > > > Hi Grant, > > > > > > Thanks for the response. Here's what I am trying to accomplish: > > > > > > 1. Iterate over itemID (unique) in the database using one SQL query. > > > 2. For every itemID found, run 4 searches on the Lucene index. > > > 3. doTagSearch(itemID); collect score > > > 4. doTitleSearch(itemID...); collect score > > > 5. doSummarySearch(itemID...); collect score > > > 6. doBodySearch(itemID); collect score > > > > > > These scores are then added and I get a total score for each unique > > > item in the database. > > > > > > Lucene Index has: > > > > > > So if I am running a body search, I have 92 hits from over 300 > > > documents for a query. I already know my hit with the . > > > > > > For instance, from step (1) if itemID 16 is passed to all the 4 > > > searches, I just need to get the score of the document which has > > > itemID field = 16. I don't have to iterate over all the hits. > > > > > > I suppose I have to change my query to look for where > > > itemID=16. Can you guide me as to how to do it? > > > > > > thanks a ton, > > > > > > Askar > > > > > > On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > >> > > >> Hi Askar, > > >> > > >> I suggest we take a step back and ask the question: what are you > > >> trying to accomplish? That is, what is your application trying to > > >> do? Forget the code, etc.; just explain what you want the end result > > >> to be and we can work from there. Based on what you have described, > > >> I am not sure you need access to the hits. It seems like you just > > >> need to make better queries. > > >> > > >> Is your itemID a unique identifier? If yes, then you shouldn't need > > >> to loop over hits at all, as you should only ever have one result IF > > >> your query contains a required term. Also, if this is the case, why > > >> do you need to do a search at all? Haven't you already identified > > >> the items of interest when you did your select query in the > > >> database? Or is it that you want to score the item based on some > > >> terms as well. If that is the case, there are other ways of doing > > >> this and we can discuss them. > > >> > > >> -Grant > > >> > > >> On Jul 25, 2007, at 10:10 AM, Askar Zaidi wrote: > > >> > > >>> Hey Guys, > > >>> > > >>> I need to know how I can use the HitCollector class. I am using > > >>> Hits and looping over all the possible document hits (turns out it's 92 times > > >>> I am looping; for 300 searches, it's 300*92 !!). Can I avoid this using > > >>> HitCollector? I can't seem to understand how it's used. > > >>> > > >>> thanks a lot, > > >>> > > >>> Askar > > >>> > > >>> On 7/25/07, Dmitry <[EMAIL PROTECTED]> wrote: > > > > Askar, > > why do you need to add +id:? > > thanks, > > dt, > > www.ejinz.com > > search engine news forms > > - Original Message - > > From: "Askar Zaidi" <[EMAIL PROTECTED]>
Re: Search for null
In this case you should look at the source for RangeFilter.java. Using this you could create your own filter using TermEnum and TermDocs to find all documents that had some value for the field. You would then flip this filter (perhaps write a FlipFilter.java that takes an existing filter in its constructor, for reuse) to get all documents that didn't have a value for this field (i.e. null values). Depending on the time it takes to generate these filters, you could then cache this filter with CachingWrapperFilter for subsequent searches. Dan On Wed, 2007-07-25 at 08:57 -0700, Jay Yu wrote: > what if I do not know all possible values of that field, which is a > typical case in a free text search? > > daniel rosher wrote: > > You will be unable to search for fields that do not exist, which is what > > you originally wanted to do; instead you can do something like: > > > > -Establish the query that will select all non-null values > > > > TermQuery tq1 = new TermQuery(new Term("field","value1")); > > TermQuery tq2 = new TermQuery(new Term("field","value2")); > > ... > > TermQuery tqn = new TermQuery(new Term("field","valuen")); > > BooleanQuery booleanQuery = new BooleanQuery(); > > booleanQuery.add(tq1,BooleanClause.Occur.SHOULD); > > booleanQuery.add(tq2,BooleanClause.Occur.SHOULD); > > ... > > booleanQuery.add(tqn,BooleanClause.Occur.SHOULD); > > > > OR perhaps a range query if your values are contiguous > > > > Term start = new Term("field","198805"); > > Term end = new Term("field","198810"); > > Query query = new RangeQuery(start, end, true); > > > > OR just use the QueryParser > > > > Query query = QueryParser.parse(parseCriteria, "field", new StandardAnalyzer()); > > > > -Create the QueryFilter > > > > QueryFilter queryFilter = new QueryFilter(query); > > > > -flip the bits > > > > final BitSet filterBitSet = queryFilter.bits(reader); > > filterBitSet.flip(0,filterBitSet.size()); > > > > Now you have a filter that contains documents matching the opposite of > > that specified by the query, and can use it in subsequent queries > > > > Dan > > > > On Tue, 2007-07-24 at 09:40 -0700, Jay Yu wrote: > >> daniel rosher wrote: > >>> Perhaps you can use a filter in the following way. > >>> > >>> -Create a filter (via QueryFilter) that would contain all documents that > >>> do not have null values for the field > >> Interesting: what does the QueryFilter look like? Isn't it just as hard > >> as finding out what docs have the null values for the field? > >> I'd really like to know your trick here. > >>> -flip the bits of the filter so that it now contains documents that have > >>> null values for a field > >>> -Use the filter in conjunction with subsequent queries. > >>> > >>> This would also help with performance, as filters are simply bitsets and > >>> can cheaply be stored, generated once and used often. > >>> > >>> Dan > >>> > >>> On Mon, 2007-07-23 at 13:57 -0700, Jay Yu wrote: > If you want performance, a better way might be to assign some special > string/value (if it's easy to create) to the missing field of docs and > index the field without tokenizing it. Then you may search for that > special value to find the docs. > > Jay > > Les Fletcher wrote: > > Does this particular range query have any significant performance > > issues? > > > > Les > > > > Erik Hatcher wrote: > >> On Jul 23, 2007, at 11:32 AM, testn wrote: > >>> Is it possible to search for the document that specified field > >>> doesn't exist > >>> or such field value is null?
> >> This is from Solr, so I'm not sure off the top of my head if this mojo > >> applies by itself, but a search for -fieldname:[* TO *] will result in > >> all documents that do not have the specified field. > >> > >> Erik > >> > >> > >> - > >> To unsubscribe, e-mail: [EMAIL PROTECTED] > >> For additional commands, e-mail: [EMAIL PROTECTED] > >> > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > > <> > >>> Daniel Rosher > >>> Developer > >>> > >>> > >>> d: 0207 3489 912 > >>> t: 0870 2020 121 > >>> f: 0870 2020 131 > >>> m: > >>> http://www.hotonline.com/ > >>> > >>> > >>> > >>> > >>> > >>> > >>> - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - > >>> - - - - - - - - - - - - - - - - - - > >>> This message is sent in confidence for the addressee only. It may contain > >>> privileged > >>> information. The contents are not to be disclosed to anyone other than > >>> the address
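[A minimal sketch of the FlipFilter Dan describes, assuming the Lucene 2.x Filter API where a filter exposes its matches as a java.util.BitSet via bits(IndexReader); the class name and the maxDoc() sizing are illustrative choices, not from the original message:]

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    // Wraps an existing filter and inverts its matches, so the documents
    // the wrapped filter did NOT match (e.g. docs with no value for a
    // field) become the ones this filter keeps.
    public class FlipFilter extends Filter {
        private final Filter wrapped;

        public FlipFilter(Filter wrapped) {
            this.wrapped = wrapped;
        }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet bits = (BitSet) wrapped.bits(reader).clone();
            // Flip up to maxDoc(), not bits.size(): BitSet.size() rounds up
            // to a word boundary and would turn on bits for documents that
            // do not exist. Note the flipped set may also include deleted docs.
            bits.flip(0, reader.maxDoc());
            return bits;
        }
    }

As Dan notes, wrapping the result in a CachingWrapperFilter means the bit set is computed once and reused across searches against the same reader.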
Re: Fine Tuning Lucene implementation
Hey guys, One last question and I think I'll have an optimized algorithm. How can I build a query in my program? This is what I am doing: QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer()); queryParser.setDefaultOperator(QueryParser.Operator.AND); Query q = queryParser.parse(queryString); So doing System.out.println(q) shows: +contents:harvard +contents:business +contents:review I'd like to modify Query q to read: +contents:harvard +contents:business +contents:review +itemID: (id passed in the search method) So this would pick the one document I need from the index and give me the score. I don't have to iterate over Hits. Any clues? I can't find any examples on query building. thanks! Askar On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > Yes, you can do that. > > > On Jul 25, 2007, at 12:31 PM, Askar Zaidi wrote: > > > Here's what I mean: > > > > http://lucene.apache.org/java/docs/queryparsersyntax.html#Fields > > > > title:"The Right Way" AND text:go > > > > > > Although I am not searching for the title "the right way", I am > > looking for the score by specifying a unique field (itemID). > > > > When I do System.out.println(query); > > > > I get: > > > > +contents:Harvard +contents:Business +contents:Review > > > > Can I just add: > > > > +contents:Harvard +contents:Business +contents:Review > > +itemID=id ?? > > > > That query would just return one document. > > > > On 7/25/07, Askar Zaidi <[EMAIL PROTECTED]> wrote: > >> > >> Instead of refactoring the code, would there be a way to just > >> modify the query in each search routine? > >> > >> Such as, "search contents: and item:"; This means it > >> would just collect the score of that one document whose itemID field = > >> itemID passed from while (rs.next()). > >> > >> I just need to collect the score of the already in the > >> index. > >> > >> Would there be a way to modify the query? Add a clause? > >> > >> thanks, > >> Askar > >> > >> > >> On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > >>> > >>> So, you really want a single Lucene score (based on the scores of > >>> your 4 fields) for every itemID, correct? And this score > >>> consists of scoring the title, tag, summary and body against some > >>> keywords, correct? > >>> > >>> Here's what I would do: > >>> > >>> while (rs.next()) > >>> { > >>> doc = getDocument(itemId); // Get your document, including > >>> contents from your database, no need even to put them in Lucene, > >>> although you could add the doc to a MemoryIndex (see contrib/memory) > >>> Run your 4 searches against that memory index to get your > >>> score. Even better, combine your query into a single query that > >>> searches all 4 fields at once, then Lucene will combine the score > >>> for you > >>> } > >>> > >>> MemoryIndex info can be found at > >>> http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/index/memory/package-summary.html > >>> > >>> -Grant > >>> > >>> On Jul 25, 2007, at 11:45 AM, Askar Zaidi wrote: > >>> > Hi Grant, > > Thanks for the response. Here's what I am trying to accomplish: > > 1. Iterate over itemID (unique) in the database using one SQL > query. > 2. For every itemID found, run 4 searches on the Lucene index. > 3. doTagSearch(itemID); collect score > 4. doTitleSearch(itemID...); collect score > 5. doSummarySearch(itemID...); collect score > 6. doBodySearch(itemID); collect score > > These scores are then added and I get a total score for each unique > item in > the database. > > Lucene Index has: > > So if I am running a body search, I have 92 hits from over 300 > documents for > a query. I already know my hit with the . > > For instance, from step (1) if itemID 16 is passed to all the 4 > searches, I > just need to get the score of the document which has itemID field = > 16. I > don't have to iterate over all the hits. > > I suppose I have to change my query to look for where > itemID=16. > Can you guide me as to how to do it? > > thanks a ton, > > Askar > > On 7/25/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > > > Hi Askar, > > > > I suggest we take a step back and ask the question: what are you > > trying to accomplish? That is, what is your application trying to > > do? Forget the code, etc.; just explain what you want the end > > result > > to be and we can work from there. Based on what you have > > described, > > I am not sure you need access to the hits. It seems like you just > > need to make better queries. > > > > Is your itemID a unique identifier? If yes, then you shouldn't > > need > > to loop over hits at all, as you should only ever have
Re: What replaced org.apache.lucene.document.Field.Text?
Andy, Patrick, Thank you. I replaced Field.Text with new Field("name", "value", Field.Store.YES, Field.Index.TOKENIZED); and it works just fine. Cheers, Lindsey Patrick Kimber <[EMAIL PROTECTED]> wrote: Hi Andy I think: Field.Text("name", "value"); has been replaced with: new Field("name", "value", Field.Store.YES, Field.Index.TOKENIZED); Patrick On 25/07/07, [EMAIL PROTECTED] wrote: > Please reference How do I get code written for Lucene 1.4.x to work with > Lucene 2.x? > http://wiki.apache.org/lucene-java/LuceneFAQ#head-86d479476c63a2579e867b75d4faa9664ef6cf4d > > > Andy > -Original Message- > From: Lindsey Hess [mailto:[EMAIL PROTECTED] > Sent: Wednesday, July 25, 2007 12:31 PM > To: Lucene > Subject: What replaced org.apache.lucene.document.Field.Text? > > I'm trying to get some relatively old Lucene code to compile (please see > below), and it appears that Field.Text has been deprecated. Can someone > please suggest what I should use in its place? > > Thank you. > > Lindsey > > > > public static void main(String args[]) throws Exception > { > String indexDir = > System.getProperty("java.io.tmpdir", "tmp") + > System.getProperty("file.separator") + "address-book"; > Analyzer analyzer = new WhitespaceAnalyzer(); > boolean createFlag = true; > > IndexWriter writer = new IndexWriter(indexDir, analyzer, createFlag); > Document contactDocument = new Document(); > contactDocument.add(Field.Text("type", "individual")); > > contactDocument.add(Field.Text("name", "Zane Pasolini")); > contactDocument.add(Field.Text("address", "999 W. Prince St.")); > contactDocument.add(Field.Text("city", "New York")); > contactDocument.add(Field.Text("province", "NY")); > contactDocument.add(Field.Text("postalcode", "10013")); > contactDocument.add(Field.Text("country", "USA")); > contactDocument.add(Field.Text("telephone", "1-212-345-6789")); > writer.addDocument(contactDocument); > writer.close(); > } > > > - > Fussy? Opinionated? Impossible to please? Perfect. Join Yahoo!'s user > panel and lay it on us. > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - Boardwalk for $500? In 2007? Ha! Play Monopoly Here and Now (it's updated for today's economy) at Yahoo! Games.
Re: Fine Tuning Lucene implementation
On Jul 25, 2007, at 1:26 PM, Askar Zaidi wrote: Hey guys, One last question and I think I'll have an optimized algorithm. How can I build a query in my program? This is what I am doing: QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer()); queryParser.setDefaultOperator(QueryParser.Operator.AND); Query q = queryParser.parse(queryString); Just concatenate it onto your string: Query q = queryParser.parse(queryString + " +itemID:" + itemID); - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
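[For anyone who prefers not to splice strings, a hedged sketch of the same thing built programmatically. It assumes the itemID field was indexed as a single untokenized term and that `id` holds the value passed into the search method; neither assumption is confirmed in the thread:]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Parse the keyword part as before, then AND in the itemID clause.
    QueryParser queryParser = new QueryParser("contents", new StandardAnalyzer());
    queryParser.setDefaultOperator(QueryParser.Operator.AND);
    Query keywords = queryParser.parse(queryString);

    BooleanQuery combined = new BooleanQuery();
    combined.add(keywords, BooleanClause.Occur.MUST);
    // TermQuery bypasses the analyzer, so this only matches if itemID was
    // indexed untokenized (an assumption, e.g. Field.Index.UN_TOKENIZED).
    combined.add(new TermQuery(new Term("itemID", id)), BooleanClause.Occur.MUST);

Either way, the required itemID clause means at most one document can match, so there is at most one hit and its score is the one you want.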
MoreLikeThis for multiple documents
Hello, I'm looking to extract significant terms characterizing a set of documents (which in turn relate to a topic). This basically comes down to functionality similar to determining the terms with the greatest offer weight (as used for blind relevance feedback), or maximizing tf.idf (as is done in MoreLikeThis). Is there anything like this already implemented, or do I need to iterate through all documents in the set "manually", re-tokenize each one (or maybe use TermVectors), and then calculate the weight for each term? Thanks, Jens - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
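[Nothing ready-made for a whole document set comes to mind (MoreLikeThis works per document), so below is a rough sketch of the "manual" TermVector route mentioned in the question, assuming term vectors were stored for the field at index time. The index path, document IDs and field name are placeholders:]

    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermFreqVector;

    public class SetTerms {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index"); // placeholder path
            int[] docIds = {1, 2, 3};  // the document set (placeholder IDs)
            String field = "contents"; // placeholder field name

            // Aggregate term frequencies over the whole document set.
            Map tf = new HashMap();    // term -> aggregated frequency
            for (int d = 0; d < docIds.length; d++) {
                TermFreqVector tv = reader.getTermFreqVector(docIds[d], field);
                if (tv == null) continue; // no term vector stored for this doc
                String[] terms = tv.getTerms();
                int[] freqs = tv.getTermFrequencies();
                for (int i = 0; i < terms.length; i++) {
                    Integer cur = (Integer) tf.get(terms[i]);
                    tf.put(terms[i], new Integer(cur == null ? freqs[i] : cur.intValue() + freqs[i]));
                }
            }

            // Weight each candidate term by aggregated tf x idf over the set.
            int numDocs = reader.numDocs();
            for (Iterator it = tf.entrySet().iterator(); it.hasNext();) {
                Map.Entry e = (Map.Entry) it.next();
                int df = reader.docFreq(new Term(field, (String) e.getKey()));
                double idf = Math.log((double) numDocs / (df + 1));
                double weight = ((Integer) e.getValue()).intValue() * idf;
                System.out.println(e.getKey() + "\t" + weight); // or keep top-N in a priority queue
            }
            reader.close();
        }
    }

The weighting here is plain tf x idf; substituting an offer-weight style formula is a local change to the loop at the end.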
Assembling a query from multiple fields
Hi all, Apologies for the cryptic subject line, but I couldn't think of a more descriptive one-liner to describe my problem/question to you all. Still fairly new to Lucene here, although I'm hoping to have more of a clue once I get a chance to read "Lucene In Action". I am implementing a search engine using Lucene for a web application. It is not really a free-text search like some other, more standard implementations. The requirement is for the search to be as easy and user-friendly as possible, so instead of specifying the field to search in the query itself - such as ip:192.168.102.230 - and being parsed with QueryParser, the field is selected via an HTML element, and the search keywords are entered in a text field. As far as I can tell, I basically have two options: (1) Manually prepend the field identifier to the query text, for example: String fullQuery = field + ":" + queryText; then parse this query normally with QueryParser, OR (2) Since I know it is only going to be searching one term, manually create a TermQuery with a Term object representing what the user typed in, for example: Query query = new TermQuery(new Term(field, queryText)); Is there any advantage or disadvantage to either of these, or is one preferable over the other? My gut tells me that directly creating the TermQuery is more efficient since it doesn't have to perform parsing, but I'm not sure. I have other questions, too, but I don't want to get ahead of myself. One at a time... :) Appreciate any help you all might have! -- Joe Attardi
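[For what it's worth, a short sketch of the practical difference between the two options; `field` and `queryText` stand in for the form inputs. Option 1 analyzes and parses the text, so query-syntax characters in user input (colons, *, ~) should be escaped; option 2 matches the raw term exactly, which assumes the field was indexed untokenized:]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    // Option 1: analyzed and parsed; escape the input so syntax
    // characters are treated literally.
    QueryParser parser = new QueryParser(field, new StandardAnalyzer());
    Query q1 = parser.parse(QueryParser.escape(queryText));

    // Option 2: no analysis, no parsing; matches only if the exact term
    // exists in the index (e.g. indexed with Field.Index.UN_TOKENIZED).
    Query q2 = new TermQuery(new Term(field, queryText));

The parsing cost itself is usually negligible; the real difference is whether the terms stored in the index match the raw input or its analyzed form.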
Re: StandardTokenizer is slowing down highlighting a lot
On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs have advanced, exception-as-control-flow has gotten comparably slower. Does JFlex have a jar associated with it? It's GPL (although you can freely use the files it generates under any license), so if there were other non-generated files required, we wouldn't be able to incorporate them. -Yonik - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Linear Hashing in Lucene?
Hey, Some common questions about Lucene. 1. Does an ontology wrapper exist in the Lucene implementation? 2. Does Lucene use linear hashing? thanks, DT, www.ejinz.com Search news - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Search for null
On Thursday 26 July 2007 03:12:20 daniel rosher wrote: > In this case you should look at the source for RangeFilter.java. > > Using this you could create your own filter using TermEnum and TermDocs > to find all documents that had some value for the field. That's certainly the way to do it for speed. For the least code you can probably do... BooleanFilter f = new BooleanFilter(); f.add(new FilterClause(RangeFilter.More("field", ""), BooleanClause.Occur.MUST_NOT)); Filter cached = new CachingWrapperFilter(f); Daniel -- Daniel Noll Nuix Pty Ltd Suite 79, 89 Jones St, Ultimo NSW 2007, Australia Ph: +61 2 9280 0699 Web: http://nuix.com/ Fax: +61 2 9212 6902 - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Highlighter strategy in Lucene
What kind of highlighter strategy is Lucene using? thanks, Dt www.ejinz.com Search Engine for News - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Displaying results in the order
Is there a way to update a document in the Index without causing any change to the order in which it comes up in searches? thanks, DT, www.ejinz.com Search everything news, tech, movies, music - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Strange Error while deleting Documents from index while indexing.
Hi, I am dumping the database tables into Lucene documents. I am doing it like this: 1. Get the rowset from the database to be stored as a Lucene Document. 2. Open an IndexReader and check if the rows are already indexed. If indexed, delete them and add the new rowset. Continue this till the end. 3. Close the IndexReader. 4. Open an IndexWriter. 5. Write the same rowset into the index. 6. Delete the rowset from the database. 7. Repeat the same process [step 1 - step 7] while there are records in the database. This is how I am doing indexing and deletion. Some key points: 1. A new IndexWriter is opened only when no instance is available; otherwise the same IndexWriter is reused, i.e. my IndexWriter opens once in step 4 and the whole process after that makes use of it. 2. But I open and close an IndexReader for each deletion. 3. I optimize the IndexWriter after a certain threshold is crossed. Now my problem is: on the first deletion of a document (if present) in step 2 and closing of the IndexReader in step 3, I get no error. But in the second loop, I get an error while trying to close the IndexReader. The error is: Unable to cast object of type 'System.Collections.DictionaryEntry' to type 'System.String'. Stack trace: at Lucene.Net.Index.IndexFileDeleter.DeleteFiles(ArrayList files) at Lucene.Net.Index.IndexFileDeleter.DeleteFiles() at Lucene.Net.Index.IndexFileDeleter.CommitPendingFiles() at Lucene.Net.Index.IndexReader.Commit() at Lucene.Net.Index.IndexReader.Close() at QueryDatabaseForIndexing.Program.Main(String[] args) in E:\Test Applications\ORS Lucene Developments\July 25\TotalIndexingAndSearching_25_july\TotalIndexingAndSearching\QueryDatabaseForIndexing\Program2.cs:line 159 I don't know the cause of this error. I am in real need of help. Please help me find the error. -- View this message in context: http://www.nabble.com/Strange-Error-while-deleting-Documents-from-index-while-indexing.-tf4149570.html#a11804824 Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: StandardTokenizer is slowing down highlighting a lot
On 25/07/07, Yonik Seeley <[EMAIL PROTECTED]> wrote: On 7/25/07, Stanislaw Osinski <[EMAIL PROTECTED]> wrote: > JavaCC is slow indeed. JavaCC is a very fast parser for a large document... the issue is small fields and JavaCC's use of an exception for flow control at the end of a value. As JVMs have advanced, exception-as-control-flow has gotten comparably slower. In Carrot2 we tokenize mostly very short documents (search results), so in this context JFlex proved much faster. I did a very rough performance test of Highlighter using JFlex and JavaCC-generated analyzers with medium-sized documents (up to ~1kB), and JFlex was still faster. What size would a 'large' document be? Does JFlex have a jar associated with it? It's GPL (although you can freely use the files it generates under any license), so if there were other non-generated files required, we wouldn't be able to incorporate them. You need the JFlex jar only to generate the tokenizer (one Java class). The generated tokenizer is standalone and doesn't need the JFlex jar to run. Staszek
Re: Fine Tuning Lucene implementation
Hey Guys, Thanks for all the responses. I finally got it working with some query modification. The idea was to pick an itemID from the database and, for that itemID in the index, get the scores across 4 fields; add them up and ta-da! I still have to verify my scores. Thanks a ton. I'll be active on this list from now on and try to answer questions to which I was seeking answers. later, Askar On 7/25/07, Doron Cohen <[EMAIL PROTECTED]> wrote: > > "Askar Zaidi" wrote: > > > ... Here's what I am trying to accomplish: > > > > 1. Iterate over itemID (unique) in the database using one SQL query. > > 2. For every itemID found, run 4 searches on the Lucene index. > > 3. doTagSearch(itemID); collect score > > 4. doTitleSearch(itemID...); collect score > > 5. doSummarySearch(itemID...); collect score > > 6. doBodySearch(itemID); collect score > > > > These scores are then added and I get a total score for each > > unique item in the database. > > Joining this late I might be missing something. Still I > would like to understand better *what* you are trying to do > here (before going into the *how*). > > By your description above, my understanding is this: > > 1. Assume one table in the DB, with textual > columns: ItemID (unique), Title, Summary, Body, Tags. > 2. The ItemID column is a unique key in the table. > 3. Assume entries in the ItemID column look like > this: itemID=127, itemID=75, etc. > 4. Some of the other columns (not the ItemID column) > can contain IDs as well. > 5. You are iterating over the ItemID column, and, > for each value (each ID), ranking all the documents > in the index (all the rows in that table) for > occurrences of that ID. > > Is that so? > > If so, you are actually trying to find, for each row (doc), > which (other) rows (docs) "refer" to it most. Right? > Is this really a textual search problem? > > For instance, if row X has N references to row Z, > and row Y has N+1 references to row Z, but the length > of the text in row Y is much more than that of row X, > would you expect row X to rank higher, because it is > shorter (what Lucene is likely to do), or that row Y > will rank higher, because it has slightly more > references to row Z? > > In another email you have this: > > > Can I just add: > > > > +contents:Harvard +contents:Business +contents:Review +itemID=77 ?? > > > > That query would just return one document. > > Which is different than the above - it has a textual > task, not only an ID. Are you interested here in all docs > (rows) that reference itemID=77, or only want to check > whether the specific row whose ID is itemID=77 satisfies > the textual part of this query? > > This brings us back to the start point: perhaps it would > help more if you once again define the task/problem you > are trying to solve? Forget about loops and doXyzSearch() > methods - just define input; output; logic; > > Regards, > Doron > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
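[For the archive, roughly what the combined query the thread converged on could look like, sketched under the same assumptions as earlier (itemID indexed untokenized; `searcher`, `keywords` and `itemId` are placeholder names):]

    // One query whose score sums the four field matches, restricted to
    // a single itemID; field names follow the thread.
    String[] fields = {"title", "tag", "summary", "body"};
    BooleanQuery perField = new BooleanQuery();
    for (int i = 0; i < fields.length; i++) {
        QueryParser p = new QueryParser(fields[i], new StandardAnalyzer());
        perField.add(p.parse(keywords), BooleanClause.Occur.SHOULD);
    }
    BooleanQuery query = new BooleanQuery();
    query.add(perField, BooleanClause.Occur.MUST);   // scoring part
    query.add(new TermQuery(new Term("itemID", itemId)),
              BooleanClause.Occur.MUST);             // selector part
    Hits hits = searcher.search(query); // at most one hit; its score
                                        // combines the four field scores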
java gc with a frequently changing index?
Hi, I am indexing a set of constantly changing documents. The change rate is moderate (about 10 docs/sec over a 10M document collection with a 6G total size) but I want to be right up to date with the index (ideally within a second, but within 5 seconds is acceptable). Right now I have code that adds new documents to the index and deletes old ones using updateDocument() in the 2.1 IndexWriter. In order to see the changes, I need to recreate the IndexReader/IndexSearcher every second or so. I am not calling optimize() on the index in the writer, and the mergeFactor is 10. The problem I am facing is that Java GC is terrible at collecting the IndexSearchers I am discarding. I usually have a 3 msec query time, but I get GC pauses of 300 msec to 3 sec (I assume it is collecting the "tenured" generation in these pauses, which is my old IndexSearcher). I've tried "-Xincgc", "-XX:+UseConcMarkSweepGC -XX:+UseParNewGC" and calling System.gc() right after I close the old index, without much luck (I get the pauses down to 1 sec, but get 3x as many; I want < 25 msec pauses). So my question is, should I be avoiding reloading my index in this way? Should I keep a separate IndexReader (which only deletes old documents) and one for new documents? Is there a standard technique for a quickly changing index? Thanks, Tim - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
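[Not a GC fix as such, but the usual pattern for a rapidly changing index is to share one searcher, swap it atomically at each checkpoint, and warm the new one before the swap so queries never hit a cold reader. A sketch with placeholder names and a placeholder warm-up query; real code would reference-count readers before closing:]

    import java.io.IOException;

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class SearcherManagerSketch {
        private volatile IndexSearcher current; // shared by query threads
        private final String indexDir;

        public SearcherManagerSketch(String indexDir) throws IOException {
            this.indexDir = indexDir;
            this.current = new IndexSearcher(indexDir);
        }

        public IndexSearcher get() {
            return current;
        }

        // Called once a second, after the writer commits its changes.
        public synchronized void reopen() throws IOException {
            IndexSearcher next = new IndexSearcher(indexDir);
            // Warm the new searcher so the first user query is not slow.
            next.search(new TermQuery(new Term("body", "warmup")));
            IndexSearcher previous = current;
            current = next;
            // Caveat: in a multithreaded server you must not close while
            // searches are in flight; reference-count readers in practice.
            previous.close();
        }
    }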
Re: Fine Tuning Lucene implementation
"Askar Zaidi" wrote: > ... Heres what I am trying to accomplish: > > 1. Iterate over itemID (unique) in the database using one SQL query. > 2. For every itemID found, run 4 searches on Lucene Index. > 3. doTagSearch(itemID) ; collect score > 4. doTitleSearch(itemID...) ; collect score > 5. doSummarySearch(itemID...) ; collect score > 6. doBodySearch(itemID) ; collect score > > These scores are then added and I get a total score for each > unique item in the database. oining this late I might be missing something. Still I would like to understand better *what* you are trying to do here (before going into the *how*). By your description above, my understanding is this: 1. Assume one table in the DB, with textual columns: ItemID(unique), Title, Summary, Body, Tags. 2. The ItemID columns is a unique key in the table. 3. Assume entries in the ItemID column looks like this: itemID=127, itemID=75, etc. 4. Some of the other columns (not the ItemID column) can contain IDs as well. 5. You are iterating over the ItemID column, and, for each value, (each ID), ranking all the documents in the index (all the rows in that table) for occurrences of that ID. Is that so? If so, you are actually trying to find for each row (doc), which (other) rows (docs) "refer" to it most. Right? Is this really a textual search problem? For instance, if rows X has N references to row Z, and row Y has N+1 references to row Z, but the length of the text in row Z is much more than that of row X, would you expect row X to rank higher, because it is shorter (what Lucene is likely to do) or that row Y will rank higher, because it has slightly more references to row Z? In another email you have this: > Can I just add: > > +contents:Harvard +contents:Business +contents: Review +itemID=77 ?? > > That query would just return one document. Which is different than the above - it has a textual task, not only ID. Are you interested here in all docs (rows) that reference itemID=77 or only want to check if the specific row whose ID is itemID=77, satisfies the textual part of this query? This brings back to the start point: perhaps it would help more if you once again define the task/problem you are trying to solve? Forget about loops and doXyzSearch() methods - just define input; output; logic; Regards, Doron - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Query parsing?
On Wednesday 25 July 2007 00:44, Lindsey Hess wrote: > Now, I do not need Lucene to index anything, but I'm wondering if Lucene > has query parsing classes that will allow me to transform the queries. The Lucene QueryParser class can parse the format described at http://lucene.apache.org/java/docs/queryparsersyntax.html. To adapt it to other formats, the javacc grammar needs to be modified. To output in yet another format, either the Java code would need to be modified or you'd need to write some new Java code that iterates over the object produced by the QueryParser. In other words: this is not what Lucene's QueryParser was made for, and it's not too simple unless you're already familiar with javacc. Regards Daniel -- http://www.danielnaber.de - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
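[To make Daniel's last option concrete, a small hedged sketch against the Lucene 2.x API: parse a query, then walk the resulting object tree, which is where you would emit your target syntax. It only handles the TermQuery/BooleanQuery cases; other query types would need their own branches:]

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class QueryWalker {
        public static void walk(Query q) {
            if (q instanceof BooleanQuery) {
                BooleanClause[] clauses = ((BooleanQuery) q).getClauses();
                for (int i = 0; i < clauses.length; i++) {
                    // Recurse; emit AND/OR/NOT for your target format here.
                    System.out.println("occur=" + clauses[i].getOccur());
                    walk(clauses[i].getQuery());
                }
            } else if (q instanceof TermQuery) {
                Term t = ((TermQuery) q).getTerm();
                System.out.println(t.field() + " = " + t.text());
            } // PhraseQuery, WildcardQuery, ... would need their own branches
        }

        public static void main(String[] args) throws Exception {
            QueryParser p = new QueryParser("contents", new StandardAnalyzer());
            walk(p.parse("+type:individual +name:zane")); // sample input
        }
    }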
Delete corrupted doc
Hi guys, Is there a way of deleting a document that, because of some corruption, got a docID larger than maxDoc()? I'm trying to do this but I get this exception: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 106577 at org.apache.lucene.util.BitVector.set(BitVector.java:53) at org.apache.lucene.index.SegmentReader.doDelete(SegmentReader.java:301) at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:674) at org.apache.lucene.index.MultiReader.doDelete(MultiReader.java:125) at org.apache.lucene.index.IndexReader.deleteDocument(IndexReader.java:674) at teste.DeleteError.main(DeleteError.java:9) Thanks