[ANNOUNCE]: Lucene Server
I am glad to introduce a new project on SourceForge that is related to Lucene. Lucene Server is a Java server application for simply creating and managing Jakarta Lucene indexes. It is designed to help you integrate Lucene in distributed environments. The first release, 0.1, is available for download. I hope it will be useful to somebody. http://sourceforge.net/projects/luceneserver/ Remi COCULA.
Strange search results with wildcard - Bug?
Hi all, first, here's how to reproduce the problem: go to http://www.denic.de/en/special/index.jsp and enter obscure service in the search field. You'll get 132 hits. Now enter obscure service* - and you only get 1 hit. The above website is running Lucene 1.3rc3, but I was able to reproduce this locally with 1.4.1. Here are my local results with controlled pseudo-documents; perhaps you can see a pattern:

Searching for 00700* gets two documents: 007001 action and 007002 handle.
Searching for handle gets two documents: 007002 handle and 011010 handle.
Searching for 00700* handle gets two documents: 007002 handle and 011010 handle. But where is 007001 action?
Searching for handle 00700* gets two documents: 007001 action and 007002 handle. But where is 011010 handle?

We're using the MultiFieldQueryParser and the Snowball stemmers, if that makes any difference. Many thanks in advance for any pointers, Ulrich - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: [...] searching for 00700* handle gets two documents: 007002 handle and 011010 handle. But where is 007001 action? [...] We're using the MultiFieldQueryParser and the Snowball Stemmers, if that makes any difference.

Your number/handle samples look OK to me if the default operator is AND. Note that wildcard expressions are not analyzed, so if service is stemmed to anything different from service, it's not surprising that service* doesn't find it. I think you should look at a) what the analyzed form of your terms is, and b) what the rewritten query looks like (there's a rewrite method for Query that expands wildcard queries into basic queries). HTH Morus
Re: Strange search results with wildcard - Bug?
Morus Walter wrote: Your number/handle samples look ok to me if the default operator is AND.

But it's OR ;-) Using AND explicitly I get different results, and using OR explicitly I get the same results as documented.

Note that wildcard expressions are not analyzed so if service is stemmed to anything different from service, it's not surprising that service* doesn't find it.

OK, I didn't know that, but it makes sense. Perhaps the phenomenon on the live pages is different from my local test installation. I was just looking for a comparable case on our live pages, but the real problem is in pages that I'm developing locally and which look similar to the number/handle example.

I think you should look at a) what's the analyzed form of your terms and b) how the rewritten query looks (there's a rewrite method for query that expands wildcard queries into basic queries).

Will do, thank you very much. However, how do I get at the analyzed form of my terms? Ulrich
Re: Strange search results with wildcard - Bug?
Ulrich Mayring writes: Will do, thank you very much. However, how do I get at the analyzed form of my terms?

Instantiate the analyzer, create a token stream feeding it your input, loop over the tokens, and output the results. Morus
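A minimal sketch of Morus's suggestion using the Lucene 1.4 analysis API. The analyzer choice and sample text are assumptions; substitute whatever analyzer you actually index with:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;

// Print the analyzed (stemmed) form of each input term, so you can see
// why an unanalyzed wildcard such as service* can miss stemmed terms.
public class ShowTokens {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new SnowballAnalyzer("English");   // sandbox analyzer
        TokenStream stream = analyzer.tokenStream(
                "contents", new StringReader("obscure service handle"));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println(token.termText());
        }
    }
}
```

With an English stemmer, service typically comes out as something like servic, which would explain why the unanalyzed service* finds nothing.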
MultiSearcher + Sort
Guys, apologies - am I doing something wrong, or is there a bug in Lucene on the Linux OS when using MultiSearcher with Sort? Please, somebody reply to me ASAP. Tested with both lucene-1.4-final.jar and lucene-1.4.1.jar. The exception is raised on the Linux OS only; on Windows it works perfectly.

hits = multiSearcher.search(query, sortField);

Query string: (contents:gifts contents:articles) (path:gifts path:articles) (modified:gifts modified:articles) (filename:gifts filename:articles) (bookid:gifts bookid:articles) (creation:gifts creation:articles) (chapNme:gifts chapNme:articles) (itmName:gifts itmName:articles) (urltext:gifts urltext:articles) (itemCode:gifts itemCode:articles) (itemPrice:gifts itemPrice:articles) (pageid:gifts pageid:articles)

--- EXCEPTION START ---
The exception raised: file = SearchCreateArrayDataFiles.createArray1, Centralized Boolean Factor = false, SYSTEM IS STOPPING COMPILATION
java.lang.RuntimeException: no terms in field bookid - cannot determine sort type
    at org.apache.lucene.search.FieldCacheImpl.getAuto(FieldCacheImpl.java:319)
    at org.apache.lucene.search.FieldSortedHitQueue.comparatorAuto(FieldSortedHitQueue.java:326)
    at org.apache.lucene.search.FieldSortedHitQueue.getCachedComparator(FieldSortedHitQueue.java:167)
    at org.apache.lucene.search.FieldSortedHitQueue.<init>(FieldSortedHitQueue.java:58)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:118)
    at org.apache.lucene.search.MultiSearcher.search(MultiSearcher.java:141)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:51)
    at org.apache.lucene.search.Searcher.search(Searcher.java:41)
    /* at com.controlnet.indexing.search.SearchCreateArrayDataFiles.createArray1(SearchCreateArrayDataFiles.java:263)
     * at com.controlnet.indexing.search.SearchCreateArrayDataFiles.main(SearchCreateArrayDataFiles.java:308)
     */
--- EXCEPTION END ---

WITH WARM REGARDS, HAVE A NICE DAY [N.S.KARTHIK]
Clustering lucene's results
Dear all, I saw a post about an attempt to integrate Carrot2 with Lucene. It was a while ago, so I'm curious whether any outcome has been achieved. Anyway, as the project coordinator I can offer my help with such an integration; if you're looking for some ready-to-use code, there is a clustering plugin for Nutch that integrates one of the clustering algorithms from Carrot2 with Nutch; I'm sure porting it to Lucene wouldn't be a big problem. Regards, Dawid
Re: problem with get/setBoost of document fields
Hmm, OK, but how will I be able to set different boosts on fields if this value is not stored?! I don't really understand why I can set a boost factor that then isn't stored and used. What I want to do is weight my searchable index fields (type: Field.UnStored) with different factors, and if I'm not totally wrong this is done with setBoost when I create the doc and write it to the index... or is there another way to do this? Thanks, Bastian

Daniel Naber wrote: See the documentation for getBoost: Note: this value is not stored directly with the document in the index. Documents returned from IndexReader.document(int) and Hits.doc(int) may thus not have the same value present as when this field was indexed. Regards, Daniel
Re: problem with get/setBoost of document fields
The boost is not thrown away, but rather combined with the length normalization factor during indexing. So while your actual boost value is not stored directly in the index, it is taken into consideration for scoring appropriately. Erik

On Sep 23, 2004, at 8:17 AM, Bastian Grimm [Eastbeam GmbH] wrote: [...] what i want to do, is to weight my searchable index fields (type: Field.UnStored) with a different factors for those fields and if am not totally wrong this is done with set boost when i create the doc and write it to the index... or is there another way to do this? [...]
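For completeness, per-field boosts are set while the document is being built. A minimal sketch with the Lucene 1.4 API; the field names and boost factors here are illustrative, not from Bastian's code:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// The boost is set per Field before the document is written; it is
// folded into that field's norm at indexing time, so it affects
// scoring even though getBoost() on a retrieved document won't show it.
public class BoostedDoc {
    public static Document build(String title, String body) {
        Document doc = new Document();
        Field titleField = Field.UnStored("title", title);
        titleField.setBoost(3.0f);          // weight title matches higher
        Field bodyField = Field.UnStored("contents", body);
        doc.add(titleField);
        doc.add(bodyField);
        return doc;
    }
}
```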
RE: Clustering lucene's results
Hi Dawid, I would like to use Carrot2 with Lucene. Do you have examples? Thanks a lot, William.

From: Dawid Weiss, Subject: Clustering lucene's results, Date: Thu, 23 Sep 2004 13:36:03 +0200 [...]
Re: Clustering lucene's results
Hi William, No, I don't have examples, because I never used Lucene directly. Send me a sample index and an API that executes a query on this index (I need document titles, summaries or snippets, and an anchor (identifier), which can be a URL), and I'll try to write the integration code with Lucene. It is only a matter of writing a simple InputComponent instance, and that is really trivial (see Nutch's plugin code). Dawid

William W wrote: Hi Dawid, I would like to use Carrot2 with lucene. Do you have examples? Thanks a lot, William. [...]
Re: problem with get/setBoost of document fields
Thanks for your reply, Erik. So am I right that it's not possible to change the boost without re-indexing all files? That's not good... Or is it OK to only change the boosts and then optimize the index for the changes to take effect? If not, will I be able to boost those fields in the searcher? Thanks, Bastian

- The boost is not thrown away, but rather combined with the length normalization factor during indexing. So while your actual boost value is not stored directly in the index, it is taken into consideration for scoring appropriately.
RE: MultiSearcher + Sort
Karthik, I have a somewhat similar problem. Test the following: when you create a field, don't use Field(String); instead use Field(String, int), where the int is a constant for the field's type. Maybe this could help.

-Original message- From: Karthik N S, Sent: Thursday, 23 September 2004, 06:42 a.m., To: LUCENE, Subject: MultiSearcher + Sort [...]
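The root cause shows in the trace: FieldCacheImpl.getAuto tries to infer the sort type from the first term of the field, and one of the indexes on the Linux box apparently has no terms in bookid. A hedged sketch of the usual workaround, naming the sort type explicitly instead of relying on auto-detection (field name taken from the query in this thread; multiSearcher and query are the variables from the posted code):

```java
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Tell Lucene the type instead of letting it sniff the first term,
// which throws "no terms in field ... cannot determine sort type"
// when a sub-index has no terms in that field.
Sort sort = new Sort(new SortField("bookid", SortField.STRING));
Hits hits = multiSearcher.search(query, sort);
```

This avoids the auto-detection path, though documents that genuinely lack the field may still sort arbitrarily relative to each other.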
RE: Questions related to closing the searcher
The best way is to use IndexReader's getCurrentVersion() method to check whether the index has changed. If it has, just get a new Searcher. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(java.lang.String) Aviran

-Original Message- From: Edwin Tang, Sent: Wednesday, September 22, 2004 11:38 AM, Subject: Fwd: Questions related to closing the searcher

Hello, In my testing, it seems that if the searcher (in my case ParallelMultiSearcher) is not closed, the searcher will not pick up any new data that has been added to the index since it was opened. I'm wondering if this is a correct statement. Assuming the above is true, I went about closing the searcher with searcher.close(), then setting both the searcher and QueryParser to null, then did a System.gc(). The application will sleep for a set period of time, then resume processing another batch of queries against the index. When the application resumes, the following method is run:

    /**
     * Creates a {@link ParallelMultiSearcher} and {@link QueryParser} if they
     * do not already exist.
     *
     * @return 0 if successful or the objects already exist; -1 if failed.
     */
    private int getSearcher() {
        Analyzer analyzer;
        IndexSearcher[] searchers;
        int iReturn;
        Vector vector;

        if (logger.isDebugEnabled())
            logger.debug("Entering getSearcher()");

        if (searcher == null || parser == null) {
            analyzer = new CIAnalyzer(utility.sStopWordsFile);
            try {
                vector = new Vector();
                if (utility.bSearchAMX)
                    vector.add(new IndexSearcher(utility.amxIndexDir));
                if (utility.bSearchCOMTEX)
                    vector.add(new IndexSearcher(utility.comtexIndexDir));
                if (utility.bSearchDJNW)
                    vector.add(new IndexSearcher(utility.djnwIndexDir));
                if (utility.bSearchMoreover)
                    vector.add(new IndexSearcher(utility.moreoverIndexDir));
                searchers = (IndexSearcher[]) vector.toArray(new IndexSearcher[vector.size()]);
                searcher = new ParallelMultiSearcher(searchers);
                parser = new QueryParser("body", analyzer);
                iReturn = 0;
            } catch (IOException ioe) {
                logger.error("Error creating searcher", ioe);
                iReturn = -1;
            } catch (Exception e) {
                logger.error("Unexpected error while creating searcher", e);
                iReturn = -1;
            }
        } else
            iReturn = 0;

        if (logger.isDebugEnabled())
            logger.debug("Exiting getSearcher() with " + iReturn);
        return iReturn;
    } // End method getSearcher()

This seems to get me around the problem where the searcher was not picking up new data from the index. However, I run out of memory after 8 iterations of the application processing a batch query, sleeping, processing another batch query, sleeping, etc. I'm probably missing something completely obvious, but I'm just not seeing it. Can someone please tell me what I'm doing wrong? Thanks, Ed
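A hedged sketch of Aviran's suggestion: reopen only when the index has actually changed, and close the old searcher explicitly rather than relying on System.gc(). The class and variable names are illustrative, not from Edwin's code, and with several index directories you would need a version check per directory:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

// Reopen-on-change pattern. An unclosed IndexSearcher keeps its
// reader's file buffers alive, which is the usual cause of memory
// growth after many open/sleep/open cycles - close() must be called
// before dropping the reference.
public class SearcherHolder {
    private final String indexDir;
    private IndexSearcher searcher;
    private long lastVersion = -1;

    public SearcherHolder(String indexDir) {
        this.indexDir = indexDir;
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
        long current = IndexReader.getCurrentVersion(indexDir);
        if (searcher == null || current != lastVersion) {
            if (searcher != null) {
                searcher.close();      // release the old reader's files
            }
            searcher = new IndexSearcher(indexDir);
            lastVersion = current;
        }
        return searcher;
    }
}
```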
Re: Strange search results with wildcard - Bug?
Erik Hatcher wrote: Look at the AnalysisDemo referred to here: http://wiki.apache.org/jakarta-lucene/AnalysisParalysis Keep in mind that phrase queries do not support wildcards - they are analyzed, and any wildcard characters are likely stripped and cause tokens to split.

OK, I did all that and identified a basic case: if the user searches for 007001 handle, the MultiFieldQueryParser, which searches in the fields title and contents, changes that query to: (title:007001 +title:handl) (contents:007001 +contents:handl) So actually it has nothing to do with the wildcard; the problem comes from the + modifier - where does it originate? Obviously, this way I can never find a document without the term handle but with the number 007001. Kind regards, Ulrich
Re: problem with get/setBoost of document fields
You can change field boosts without re-indexing. http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#setNorm(int,%20java.lang.String,%20byte) Doug

Bastian Grimm [Eastbeam GmbH] wrote: [...] so i am right that its not possible to change the boost without reindexing all files? thats not good... [...]
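A hedged sketch of what the linked setNorm call looks like in practice; the index path, document number, and boost value are illustrative:

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

// Overwrite the stored norm (boost x length normalization) for one
// document's field without re-indexing. Norms are stored as a single
// byte, so the encoding is lossy and small adjustments may round away.
public class AdjustBoost {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        byte norm = Similarity.encodeNorm(2.0f);   // new effective boost
        reader.setNorm(12, "contents", norm);      // doc 12, field "contents"
        reader.close();
    }
}
```

Note that this replaces the norm byte outright, so the length-normalization component computed at index time is overwritten along with the old boost.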
Re: Strange search results with wildcard - Bug?
Ulrich Mayring wrote: If the user searches for 007001 handle, the MultiFieldQueryParser, which searches in the fields title and contents, changes that query to: (title:007001 +title:handl) (contents:007001 +contents:handl)

OK, I cleared this up; there was some invisible magic going on in the code, sorry for the inconvenience. Anyway: field1:foo field2:bar AND field3:true turns into field1:foo +field2:bar +field3:true If I drop the AND and use a + instead, then everything works as expected. Now, is this a bug or a feature that I haven't quite grasped? :) Ulrich
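What Ulrich is seeing matches a known quirk of QueryParser rather than a bug in his code: the parser has no boolean precedence, so AND simply marks the clauses on either side of it as required. A small sketch illustrating this (field name and analyzer are placeholders):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

public class PrecedenceDemo {
    public static void main(String[] args) throws Exception {
        // "foo bar AND baz" is not parsed as "foo OR (bar AND baz)";
        // AND makes both adjacent clauses required, leaving foo optional.
        QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
        Query query = parser.parse("foo bar AND baz");
        System.out.println(query.toString("contents"));   // roughly: foo +bar +baz
    }
}
```

Writing the modifiers explicitly (foo +bar +baz) avoids the ambiguity, which matches the observation that using + directly works as expected.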
Re: Clustering lucene's results
Dawid Weiss wrote: Hi William, No, I don't have examples because I never used Lucene directly. [...] I need document titles, summaries, or snippets and an anchor (identifier), can be an URL. [...]

Hi Dawid :-) I believe the approach for this component should be that you first initialize it by reading a mapping of Lucene index field names to logical names (metadata) like title, url, body, etc. The reason is that each index uses its own metadata schema, i.e., in Lucene-speak, its field names. Moreover, when you execute a query you get just a document id plus its score; it's up to you to build a snippet. There is code in the jakarta-lucene-sandbox CVS repository (the highlighter) to create snippets from the query and the hit list; take a look at that...

Send me such a snippet and I'll try to write the integration code with Lucene. It is only a matter of writing a simple InputComponent instance and this is really trivial (see Nutch's plugin code).

The basic usage scenario is that you open the IndexReader (either using a directory name as a String or a Directory instance), then create a Query instance, usually using QueryParser, and finally you search using IndexSearcher. You get a list of Hits, which you can use to get scores and the contents of the documents. Take a look at the IndexFiles and SearchFiles classes in the org.apache.lucene.demo package (under /src/demo). -- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist CEN/ISSS EC Workshop, ECIMF project chair EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org)
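The basic usage scenario Andrzej describes can be sketched against the 1.4 API; the index path, field names, and query text here are placeholders:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Open a searcher on an index directory, parse a query, walk the hits.
public class SearchSketch {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/path/to/index");
        Query query = QueryParser.parse("clustering", "body", new StandardAnalyzer());
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);          // stored fields of the hit
            System.out.println(hits.score(i) + "\t" + doc.get("title"));
        }
        searcher.close();
    }
}
```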
Re: demo HTML parser question
Hi Fred, We were originally attempting to use the demo HTML parser (Lucene 1.2), but as you know, it's for a demo. I think it's threaded to optimize on time, to allow the calling thread to grab the title or top message even though it's not done parsing the entire HTML document. That's just a guess; I would love to hear from others about this. Anyway, since it is a separate thread, a token error can kill it, and there is no way for the calling thread to know about it. We had to create our own HTML parser, since we only cared about grabbing the entire text from the HTML document, and we also wanted to avoid the extra thread. We also do a lot of SKIPping for minimal EOF errors (HTML documents in email almost never follow standards). For your HTML needs, you might want to check out other JavaCC HTML parsers from the JavaCC web site. Roy.

On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote: Hi, I've been working with the HTML parser demo that comes with Lucene and I'm trying to understand why it's multi-threaded and, more importantly, how to exit gracefully on errors. I've discovered that if I throw an exception in the front-end static code (main(), etc.), the JVM hangs instead of exiting. Presumably this is because there are threads hanging around doing something, but I'm not sure what! Any pointers? I just want to exit gracefully on an error, such as a required meta tag being missing or similar. Thanks, Fred
compiling 1.4 source
Hi guys, So we started upgrading to 1.4, and we need to add some of our own custom code. After compiling with ant, I noticed that the 1.4 ant script builds a jar called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar. I'm pretty sure I did not download the wrong source. Is this just a wrong name in the properties, or does the source code actually contain Lucene 1.5 rc1 code? Roy.
Re: compiling 1.4 source
If you obtained the 1.4.1 source distribution, then you're fine, and it's simply an issue with the properties. We keep the properties set to the _next_ version of Lucene (or to a beta/rc version label) to prevent the CVS HEAD codebase from building under a release label when it is very likely not the same. If you obtained the source from CVS HEAD, you're using code that has been greatly modified since the 1.4.1 release. Erik

On Sep 23, 2004, at 12:13 PM, [EMAIL PROTECTED] wrote: [...] I noticed that the 1.4 ant script builds a jar called lucene-1.5-rc1-dev.jar, not lucene-1.4-final.jar. [...]
Re: Clustering lucene's results
Hi Andrzej :) Yep, OK, I'll take a look at it after I come back from abroad (next week). I just wanted to save myself some time and have some already-written code that fetches the information we need for clustering; you know what I mean, I'm sure. But I'll start from scratch when I get back. D.

Andrzej Bialecki wrote: [...] Take a look at the IndexFiles and SearchFiles classes in the org.apache.lucene.demo package (under /src/demo). [...]
Power Point Processing
Hi, Does anyone know a good tool for processing an MS PowerPoint file (*.ppt) into plain text, so we can use Lucene to index it? I looked at Jakarta POI and only see that Word and Excel documents can be processed; some JavaDoc pages mention ppt, but the status is not clear to me. Thanks very much for your help, Lisheng
Re: Clustering lucene's results
Hi Dawid, The demos (under /src/demo) are very good. They have the basic usage scenario. Thanks, Andrzej. William.

Andrzej Bialecki wrote: [...] The basic usage scenario is that you open the IndexReader (either using a directory name as a String or a Directory instance), then create a Query instance, usually using QueryParser, and finally you search using IndexSearcher. [...]
Re: demo HTML parser question
[EMAIL PROTECTED] wrote: We were originally attempting to use the demo html parser (Lucene 1.2), but as you know, its for a demo. I think its threaded to optimize on time, to allow the calling thread to grab the title or top message even though its not done parsing the entire html document.

That's almost right. I originally wrote it that way to avoid ever having to buffer the entire text of the document: the document is indexed while it is parsed. But, as observed, this has lots of problems and was probably a bad idea. Could someone provide a patch that removes the multi-threading? We'd simply use a StringBuffer in HTMLParser.jj to collect the text. Calls to pipeOut.write() would be replaced with text.append(). Then have the HTMLParser's constructor parse the page before returning, rather than spawn a thread, and getReader() would return a StringReader. The public API of HTMLParser need not change at all, and lots of complex threading code would be thrown away. Anyone interested in coding this? Doug
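A minimal sketch of the single-threaded design Doug describes: parse the whole page up front into a StringBuffer, then hand the collected text back through the unchanged getReader() signature. The trivial tag-stripping parse() here stands in for the generated JavaCC parser, and the class name is illustrative:

```java
import java.io.Reader;
import java.io.StringReader;

// Parse fully in the constructor (no thread), collect text in a
// StringBuffer, and keep the same public getReader() contract.
public class SimpleHtmlText {
    private final StringBuffer text = new StringBuffer();

    public SimpleHtmlText(String html) {
        parse(html);                 // done before the constructor returns
    }

    private void parse(String html) {
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) text.append(c);
        }
    }

    public Reader getReader() {      // same public API as the demo HTMLParser
        return new StringReader(text.toString());
    }
}
```

Because parsing finishes before the constructor returns, any parse error surfaces as an ordinary exception on the calling thread, and nothing is left running to hang the JVM.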
Re: Clustering lucene's results
yeah... I know there have to be demos... I tried to be lazy, you know :) Anyway, as I told Andrzej -- I'll take a look at it (and with a pleasure) after I come back. i don't think the delay will matter much. And if it does, ask Andrzej -- he has excellent experience with both projects -- he's just very shy by nature and doesn't talk much, hehe. D. William W wrote: Hi Dawid, The demos (under /src/demo) are very good. They have the basic usage scenario. Thanks Andrzej. William. Dawid Weiss wrote: Hi William, No, I don't have examples because I never used Lucene directly. If you provide me with a sample index and an API that executes a query on this index (I need document titles, summaries, or snippets and an anchor (identifier), can be an URL). Hi Dawid :-) I believe the approach to this component should be that you first initialize it by reading a mapping of Lucene index field names to logical names (metadata) like title, url, body, etc. The reason is that each index uses its own metadata schema, i.e. in Lucene-speak, the field names. Moreover, when you execute a query you get just a document id plus its score. It's up to you to build a snippet. There is a code in the jakarta-lucene-sandbox CVS repo. (highlighter) to create snippets from the query and the hit list, take a look at this... Send me such a snippet and I'll try to write the integration code with Lucene. It is only a matter of writing a simple InputComponent instance and this is really trivial (see Nutch's plugin code). The basic usage scenario is that you open the IndexReader (either using directory name as a String or a Directory instance), and then create a Query instance, usually using QueryParser, and finally you search using IndexSearcher. You get a list of Hits, which you can use to get scores, and the contents of the documents. Take a look at the IndexFiles and SearchFiles classes in org.apache.lucene.demo package (under /src/demo). 
-- Best regards, Andrzej Bialecki - Software Architect, System Integration Specialist, CEN/ISSS EC Workshop, ECIMF project chair, EU FP6 E-Commerce Expert/Evaluator - FreeBSD developer (http://www.freebsd.org)
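The basic usage scenario described above might look like the following sketch against the Lucene 1.4-era API. The index directory name "index" and the "contents"/"title" field names are assumptions; it also needs the Lucene jar on the classpath, so it is a guide rather than a drop-in program.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

// Hedged sketch of the open-reader / parse-query / search flow.
public class SearchSketch {
    public static void main(String[] args) throws Exception {
        // Open the IndexReader (a directory name as a String here).
        IndexReader reader = IndexReader.open("index");
        IndexSearcher searcher = new IndexSearcher(reader);

        // Build a Query, usually via QueryParser.
        Query query = QueryParser.parse("obscure service", "contents",
                                        new StandardAnalyzer());

        // Search and walk the Hits: score plus stored document fields.
        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + "\t" + doc.get("title"));
        }
        searcher.close();
    }
}
```

The IndexFiles and SearchFiles demo classes mentioned above show the same flow in full, including building the index in the first place.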
Document contents split among different Fields
I am working on extending Lucene to support documents with special islands of an XML language, and I want to index the islands differently from the text. My current plan is to break the document's contents into two Fields, one with all the text and one with all the special islands, and use a different Analyzer on each Field.

In heading down this road, I realized that this approach breaks the whole model of Token as it supports highlighting. Token seems designed to store offsets within a given Field, so if you break a document up into pieces, the offsets are meaningless in terms of the original source document.

Am I right in saying that the design of Token's support for highlighting really only supports having the entire document stored as one monolithic contents Field? Has anyone tackled indexing multiple content Fields before that could shed some light?

Thanks, Greg Langmead Design Science, Inc., How Science Communicates http://www.dessci.com
Re: Document contents split among different Fields
Greg Langmead wrote: Am I right in saying that the design of Token's support for highlighting really only supports having the entire document stored as one monolithic contents Field?

No, I don't think so.

Has anyone tackled indexing multiple content Fields before that could shed some light?

Do you need highlights from all fields? If so, then you can use: TextFragment[] getBestTextFragments(TokenStream, ...); with a TokenStream for each field, then select the highest scoring fragments across all fields. Would that work for you?

Doug
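A per-field version of what Doug suggests might look like the sketch below, using the sandbox highlighter. The field names and stored-text arguments are assumptions, and the exact getBestTextFragments signature should be checked against the sandbox sources you have; this is a guide to the shape, not a tested program (it needs the Lucene and highlighter jars).

```java
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TextFragment;

// Hedged sketch: score fragments per field, then compare across fields.
public class MultiFieldHighlight {
    static TextFragment[] fragmentsFor(Query query, Analyzer analyzer,
                                       String field, String storedText)
            throws Exception {
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        // One TokenStream per field: offsets are relative to that
        // field's own text, so they stay meaningful within the field.
        TokenStream tokens =
            analyzer.tokenStream(field, new StringReader(storedText));
        return highlighter.getBestTextFragments(tokens, storedText,
                                                false, 3);
    }
}
```

Calling this once for the text field and once for the island field, then sorting the combined fragments by getScore(), gives the "highest scoring fragments across all fields" selection Doug describes.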
RE: Document contents split among different Fields
Doug Cutting wrote: Do you need highlights from all fields? If so, then you can use: TextFragment[] getBestTextFragments(TokenStream, ...); with a TokenStream for each field, then select the highest scoring fragments across all fields. Would that work for you?

Thanks for the reply. I can't find code like this in the lucene or lucene-demo packages -- is this something implemented, or did you mean it as an example? Once I get a text fragment, are you proposing using it to do a secondary search within the source document, to match the fragment?

I would like to do highlighting on content from either of my Fields, but I think that even if I didn't, I'd have the same problem, because I'll have punched holes in the text Field and the positional data within the Field no longer reflects the position in the source.

I think that if I want to pick the document apart into pieces like this, then I need to do some work to restore global positional data, by squirreling away the sizes of the holes I punch (the sizes of the XML islands, from the text Field's point of view, and the sizes of the text runs, from the island Field's point of view). If I store a special textual escape within the Field data that records the length of each gap, then I can read those escapes when tokenizing the Field and add the number stored therein to the Token offset, restoring the global positional data.

Does that make sense? I'm concerned this does violence to Lucene's model, which I've only been studying for a couple of weeks now.

Greg
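The escape-and-correct idea above can be sketched in plain Java. The "{GAP:n}" marker syntax is made up for illustration, and the sketch assumes markers are whitespace-delimited; a real implementation would emit such escapes while splitting the document into Fields and consume them in each Field's Tokenizer, adding the running total to every Token's start and end offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Self-contained sketch of restoring source-document offsets from a
// field whose text has holes recorded as "{GAP:n}" escapes, where n is
// the number of source characters removed at that point.
public class GapOffsets {
    // Matches either a gap marker or a run of non-space characters.
    private static final Pattern PIECE =
        Pattern.compile("\\{GAP:(\\d+)\\}|\\S+");

    // Returns "token:sourceOffset" pairs, where sourceOffset is the
    // token's start offset in the original document.
    public static List<String> tokensWithSourceOffsets(String field) {
        List<String> out = new ArrayList<String>();
        int holes = 0;    // total source characters removed so far
        int markers = 0;  // characters occupied by the markers themselves
        Matcher m = PIECE.matcher(field);
        while (m.find()) {
            if (m.group(1) != null) {
                holes += Integer.parseInt(m.group(1));
                markers += m.group().length();
            } else {
                // Field offset, minus marker text, plus hole sizes.
                out.add(m.group() + ":" + (m.start() - markers + holes));
            }
        }
        return out;
    }
}
```

For example, if a 20-character XML island between "alpha" and "beta" is replaced by "{GAP:20}" in the text Field, "beta" is reported at its original source offset rather than its position within the field string.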