Problem in WebLucene
Hello, I'm trying to use WebLucene in our application. I have created the index using the IndexRunner class successfully. When I try to access the web application at http://localhost:8080/weblucene/search?dir=blog&query=query I get a blank page, with the following error in the console:

Caught error: java.io.IOException: D:\home\weblucene\webapp\WEB-INF\var\blog\index not a directory

Where should I set the path for the WebLucene directory? Where could the problem be? Thanks in advance!
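One way to narrow this down is to check, from plain Java, whether the path in the error message actually exists and is a directory (a file of the same name, or a missing parent folder, produces exactly this IOException). A minimal diagnostic sketch; the class name and the hard-coded path (copied from the error above) are just for illustration:

```java
import java.io.File;

public class IndexPathCheck {
    // Report why a configured index path might make Lucene complain
    // "not a directory": the path may be missing, or exist as a plain file.
    public static String diagnose(String path) {
        File f = new File(path);
        if (!f.exists())      return "missing";
        if (!f.isDirectory()) return "not a directory";
        return "ok";
    }

    public static void main(String[] args) {
        // Path taken from the error message above.
        String indexPath = "D:\\home\\weblucene\\webapp\\WEB-INF\\var\\blog\\index";
        System.out.println(indexPath + ": " + diagnose(indexPath));
    }
}
```

If this prints "missing", the dir=blog parameter is being resolved against a WEB-INF/var root that doesn't contain your index; WebLucene's directory root is typically set in its configuration rather than in code.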
RE: Search Lucene documents returns 0 hits
Thanks Lars, thanks heaps!

-----Original Message-----
From: Lars Klevan [mailto:[EMAIL PROTECTED]
Sent: Friday, October 08, 2004 3:30 AM
To: Lucene Users List
Subject: RE: Search Lucene documents returns 0 hits

Use BooleanQuery to combine multiple Queries:

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("type", "stockSingle")), true, false);
query.add(new TermQuery(new Term("seqNo", "1000")), true, false);
...

-----Original Message-----
From: Fred Yu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 06, 2004 5:36 PM
To: Lucene Users List
Subject: RE: Search Lucene documents returns 0 hits

Hi Lars, thanks for that! That solved my problem. By the way, I need to build a QueryFilter using a MultiTermQuery. How do I create a MultiTermQuery object that contains three terms, e.g.

new Term("type", "stockSingle");
new Term("code", "1234");
new Term("seqNo", "1000");

Thanks, Fred

-----Original Message-----
From: Lars Klevan [mailto:[EMAIL PROTECTED]
Sent: Thursday, October 07, 2004 9:59 AM
To: Lucene Users List
Subject: RE: Search Lucene documents returns 0 hits

If you're indexing with a Keyword field you need to use a TermQuery; QueryParser will only work for Text fields. The reason for this is that both the Text field and the QueryParser use the Analyzer to chop up the input into searchable chunks. Depending on the Analyzer this includes converting to lower case, stripping trailing "s" and "ing", and removing stopwords like "the" and "and". The TermQuery and Keyword field both treat the input exactly as is. -Lars

-----Original Message-----
From: Fred Yu [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 06, 2004 3:25 PM
To: [EMAIL PROTECTED]
Subject: Search Lucene documents returns 0 hits

Hi, does anyone know why Lucene returns 0 hits when there are in fact three matches? Attached are two Java classes that reproduce the problem. In the example, I created a Keyword field "type" for each document added.
Lucene can correctly find the documents if I use a "Text" field instead of a "Keyword" field. Thanks in advance, Fred

package test;

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class IndexItems {
    public static void main(String[] args) throws IOException {
        try {
            IndexWriter writer = new IndexWriter("/test/index", new StandardAnalyzer(), true);
            indexDocs(writer);
            writer.optimize();
            writer.close();
            System.out.println("index finished.");
        } catch (IOException e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }

    private static void indexDocs(IndexWriter writer) throws IOException {
        Document document = new Document();
        document.add(Field.Keyword("type", "stockSingle"));
        document.add(Field.Text("desc", "test single 1"));
        writer.addDocument(document);

        document = new Document();
        document.add(Field.Keyword("type", "stockSingle"));
        document.add(Field.Text("desc", "test single 2"));
        writer.addDocument(document);

        document = new Document();
        document.add(Field.Keyword("type", "stockItem"));
        document.add(Field.Text("desc", "test single 3"));
        writer.addDocument(document);
    }
}

package test;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.queryParser.QueryParser;

public class SearchItems {
    public static void main(String[] args) {
        try {
            Searcher searcher = new IndexSearcher("/test/index");
            QueryParser qp = new QueryParser("type", new StandardAnalyzer());
            Query query = qp.parse("type:stockSingle");
            Hits hits = searcher.search(query);
            System.out.println("search found: " + hits.length() + " total matching documents");
            searcher.close();
        } catch (Exception e) {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
        }
    }
}

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
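The mismatch Lars describes can be seen without Lucene at all: StandardAnalyzer lower-cases terms as it tokenizes, while Field.Keyword stores the value verbatim, so the term the parser produces never equals the indexed one. A minimal sketch of that comparison (plain Java, no Lucene classes; this illustrates the mechanism, not Lucene's actual code path):

```java
public class KeywordMismatch {
    public static void main(String[] args) {
        String indexedTerm = "stockSingle";              // stored verbatim by Field.Keyword
        String parsedTerm  = indexedTerm.toLowerCase();  // what a lower-casing analyzer emits

        // A TermQuery does an exact comparison, so these must be equal to get a hit:
        System.out.println("indexed: " + indexedTerm);
        System.out.println("parsed:  " + parsedTerm);
        System.out.println("match:   " + indexedTerm.equals(parsedTerm)); // false -> 0 hits
    }
}
```

Building the TermQuery directly on the verbatim string, as in Lars's reply, sidesteps the analyzer entirely.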
Re: multifield-boolean vs singlefield-enum query performance
Tea Yu wrote: For the following implementations: 1) storing boolean values in fields X and Y separately; 2) storing the same info in a single field XY as one of four enumerated values X, Y, B, N, meaning only X is true, only Y is true, both are true, or both are false. Is there a significant performance gain when we substitute "X:T OR Y:T" with "XY:B", and a significant loss when substituting "X:T" with "XY:X OR XY:B"? Or are they negligible?

As with most performance questions, it's best to try both and measure! It depends on the size of your index, the relative frequencies of X and Y, etc. Doug
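Doug's "try both and measure" can be done with a tiny harness around the two query formulations. A rough sketch (the class name is made up, and the Runnable below is a placeholder where the actual searcher.search(query) call would go):

```java
public class MicroTimer {
    // Run the given work several times and return the mean wall-clock
    // nanoseconds per run. Crude, but enough to compare the single-field
    // enum query against the two-field boolean query on a real index.
    public static long meanNanos(Runnable runQuery, int iterations) {
        long total = 0;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            runQuery.run();          // placeholder for e.g. searcher.search(query)
            total += System.nanoTime() - start;
        }
        return total / iterations;
    }

    public static void main(String[] args) {
        Runnable dummy = () -> { /* run one query here */ };
        System.out.println("mean ns per run: " + meanNanos(dummy, 1000));
    }
}
```

Run it once with the "XY:B" query and once with "X:T OR Y:T" against the same index, discarding the first few iterations to let caches warm up.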
Re: Analyzer reuse
Yes, you can reuse analyzers. The only performance gain will come from not having to create the objects and not having garbage-collection overhead. I create one for each of my index-reading threads.

On Thu, 07 Oct 2004 16:59:38, sam s <[EMAIL PROTECTED]> wrote:
> Hi,
> Can an instance of an analyzer be reused?
> If yes, will it give any performance gain?
>
> sam
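The "one per index-reading thread" approach can be expressed with a ThreadLocal, so each thread lazily creates and then keeps its own instance. A sketch using a stand-in class in place of a real Analyzer (the names here are made up; in real code the factory would be something like StandardAnalyzer::new):

```java
public class PerThreadInstance {
    // Stand-in for an expensive-to-create object such as an Analyzer.
    static class FakeAnalyzer {}

    // Each thread gets its own instance on first access and reuses it
    // afterwards: one allocation per thread, not one per query.
    static final ThreadLocal<FakeAnalyzer> ANALYZERS =
            ThreadLocal.withInitial(FakeAnalyzer::new);

    public static void main(String[] args) {
        FakeAnalyzer a = ANALYZERS.get();
        FakeAnalyzer b = ANALYZERS.get();
        System.out.println("same instance within a thread: " + (a == b));
    }
}
```

This keeps the reuse benefit described above without sharing one instance across threads.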
Analyzer reuse
Hi, Can an instance of an analyzer be reused? If yes, will it give any performance gain? sam
Re: Arabic analyzer
Someone posted an Arabic analyzer about a year ago; however, I don't think the licensing was very friendly, and we no longer use it. We have a cross-language system that works with Arabic (among other languages). We have written several stemmers based on the literature that perform pretty well and were not too difficult to implement (but they are not available as open source at this point). Light stemming seems to work much better in IR applications than aggressive stemmers, due to the problems with roots discussed earlier. -Grant

--
Grant Ingersoll
Sr. Software Engineer
Center for Natural Language Processing
Syracuse University School of Information Studies
http://www.cnlp.org

>>> [EMAIL PROTECTED] 10/7/2004 8:45:42 AM >>>

Dawid Weiss wrote:

>> nothing to do with each other; furthermore, Arabic uses phonetic indicators on each letter, called diacritics, that change the way you pronounce the word, which in turn changes the word's meaning, so two words spelled exactly the same way with different diacritics will mean two separate things,

> Just to point out the fact: most Slavic languages also use diacritic marks (above, like the 'acute' or 'dot' marks, or below, like the Polish 'ogonek' mark). Some people argue that they can be stripped off the text upon indexing and that the queries usually disambiguate the context of the word.

Hmm. This brings up a question: the algorithmic stemmer package from Egothor works quite well for Polish (http://www.getopt.org/stempel); wouldn't it work well for Arabic, too? I lack the necessary expertise to evaluate results (knowing only two or three Arabic words ;-) ), but I can certainly help someone get started with testing...
--
Best regards,
Andrzej Bialecki
- Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
- FreeBSD developer (http://www.freebsd.org)
Re: Clustering lucene's results
No problem. Let people know if it worked for you -- I look forward to hearing your experiences (good or bad). Dawid

William W wrote: Thanks Dawid ! :)

From: Dawid Weiss <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: Lucene Users List <[EMAIL PROTECTED]>
Subject: Re: Clustering lucene's results
Date: Thu, 07 Oct 2004 10:39:26 +0200

Hi William, OK, here is some demo code I've put together that shows how you can achieve clustering of Lucene's results. I hope this will get you started on your projects. If you have questions, please don't hesitate to ask -- cross-posts to carrot2-developers would be a good idea too. The code (plus the binaries, so that you don't have to check out all of Carrot2 ;) is at: http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Take a look at Demo.java -- it is the main link between Lucene and Carrot2. Play with the parameters; I used 100 as the number of search results to be clustered. Adjust it to your needs.

int start = 0;
int requiredHits = 100;

I hope the code will be self-explanatory. Good luck, Dawid

From the readme file:

An example of using Carrot2 components to cluster search results from Lucene.

===

Prerequisites
--
You must have an index created with Lucene, containing documents with the following fields: url, title, summary. The Lucene demo works with exactly these fields -- I just indexed all of Lucene's source code and documentation using the following lines:

mkdir index
java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create -index index .

The index is now in the 'index' folder. Remember that the quality of snippets and titles heavily influences the output of the clustering; in fact, the above example index of Lucene's API is not too good because most queries will return nonsensical cluster labels (see below).
Building Carrot2-Lucene demo
--
Basically, you should have all of the Carrot2 source code checked out and issue the build command:

ant -Dcopy.dependencies=true

All of the required libraries and Carrot2 components will end up in the 'tmp/dist/deps-carrot2-lucene-example-jar' folder. You can also spare yourself some time and download the precompiled binaries I've put at: http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Now, once you have the compiled binaries, issue the following command (all on one line, of course):

java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
  com.dawidweiss.carrot.lucene.Demo index query

The first argument is the location of the Lucene index created before. The second argument is a query. In the output you should get clusters and at most three documents from every cluster:

Results for: query
Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s

:> Search Lucene Rc1 Dev API
- F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev API)
- F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html org.apache.lucene.search (Lucene 1.5-rc1-dev API)
- F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
(and 19 more)

:> Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html Jakarta Lucene API
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html Jakarta Lucene - Who We Are - Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html Jakarta Lucene - Overview - Jakarta Lucene
(and 12 more)

If you look at the source code of Demo.java, there are plenty of things apt for customization -- number of results from each cluster, number of displayed clusters (I would cut it to some
reasonable number, say 10 or 15 -- the further a cluster is from the "top", the less likely it is to be important). Also keep in mind that some Carrot2 components produce hierarchical clusters. This demonstration works with the "flat" version of the Lingo algorithm, so you don't need to worry about that. Hope this gets you started with using Carrot2 and Lucene. Please let me know about any successes or failures. Dawid
Re: Arabic analyzer
I'd be happy to help anyone test this out; my Arabic is pretty good. Nader
Re: indexing numeric entities?
Maybe inline? The sample document (an XML file referencing http://www.w3.org/2001/XMLSchema-instance) contains the text "japan" and the Japanese string "フィールドサービスエンジニア". Indexing the above document using the HTMLParser demo and the CJKAnalyzer, only the term "japan" is found in the content. This is not correct, is it? Should I convert the entities by hand? Sorry for the mess I sent before.

--
The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. ASML is neither liable for the proper and complete transmission of the information contained in this communication, nor for any delay in its receipt.
Re: indexing numeric entities?
I guess something went wrong; Daan Hoogland wrote:

> Hello, does anyone do indexing of numeric entities for Japanese characters? I have (non-X)HTML containing those entities and need to index and search them.
>
> Can the CJKAnalyzer index a string like "●入社"? It seems to be ignored completely when used with the demo. There was talk on this list of fixes for the demo HTMLParser; do these address this issue? When I look at the code it seems that the entities should have been interpreted before indexing. What am I missing? Any comment please? Or a pointer to a howto for dumm^H^H^H^H^H westerners?

Indexing the attached document using the HTMLParser demo and the CJKAnalyzer, only the term "japan" is found in the content. This is not correct, is it? Should I convert the entities by hand?

thanks,
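Until the demo HTMLParser handles them, numeric character references can indeed be converted by hand before the text reaches the analyzer. A minimal regex-based sketch (the class name is made up) that handles both the decimal form `&#20837;` and the hexadecimal form `&#x5165;`:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {
    private static final Pattern REF = Pattern.compile("&#(x?)([0-9a-fA-F]+);");

    // Replace decimal (&#NNNN;) and hexadecimal (&#xNNNN;) character
    // references with the characters they denote, so an analyzer such as
    // CJKAnalyzer sees real CJK characters instead of entity text.
    public static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int radix = m.group(1).isEmpty() ? 10 : 16;
            int codePoint = Integer.parseInt(m.group(2), radix);
            m.appendReplacement(out,
                    Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("japan &#x5165;&#x793E;"));
    }
}
```

Running the raw HTML through something like this before tokenization should make the Japanese terms searchable; note the sketch does not validate malformed references.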
Re: *term search
[EMAIL PROTECTED] wrote: .. and here is the way to do it: (See attached file: SUPPOR~1.RAR)

Hi all, I got the solution from Iouli to enable prefix queries (*term). In fact, you can find the solution in the Lucene source: a comment in QueryParser.jj says how to enable prefix queries. I did so ... but I found a lot of bugs. If you define WildTerm as

| | ( [ "*", "?" ] ))*

then a lot of constructions will be validated, and you will get a lot of errors ... for example "" and "+" are considered valid, * is considered valid, and they generate TooManyBooleanClausesExceptions. I'm not very good at creating regular expressions, but I successfully use the following construction:

| (<_TERM_CHAR> | ( [ "*", "?" ] ))* ) | ( [ "*", "?" ] <_TERM_START_CHAR> (<_TERM_CHAR> | ( [ "*", "?" ] ) )* )

Can anyone improve the construction and update the comment in QueryParser.jj? Thanks a lot, Sergiu

Erik Hatcher wrote:

On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote:
> I want to discuss a little problem: Lucene doesn't support *term-like queries.

First of all, this is untrue. WildcardQuery itself most definitely supports wildcards at the beginning.

> I would like to use "*schreiben".

The dilemma you've encountered is that QueryParser prevents queries that begin with a wildcard.

> So my question is if there is a simple solution for implementing the functionality mentioned above. Maybe subclassing one class and overwriting some methods will suffice.

It will require more than that in this case. You will need to create a custom parser that allows the grammar you'd like. Feel free to use the JavaCC source code to QueryParser as a basis for your customizations.
Erik
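The wildcard semantics under discussion (* matching any run of characters, ? matching exactly one) can be illustrated without Lucene by translating the pattern into an anchored regular expression. A sketch of the matching semantics only, not QueryParser's or WildcardQuery's actual implementation:

```java
import java.util.regex.Pattern;

public class WildcardMatch {
    // Translate a Lucene-style wildcard pattern (* = any sequence,
    // ? = any single character) into a regex and test one term against it.
    public static boolean matches(String pattern, String term) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*')      regex.append(".*");
            else if (c == '?') regex.append('.');
            else               regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), term);
    }

    public static void main(String[] args) {
        // A leading-wildcard query like "*schreiben" matches terms
        // ending in "schreiben":
        System.out.println(matches("*schreiben", "beschreiben")); // true
        System.out.println(matches("*schreiben", "beschrieben")); // false
    }
}
```

The reason leading wildcards are disabled by default is cost, not capability: the term cannot be used as a prefix to seek into the term dictionary, so every term must be enumerated, much like the exhaustive scan this sketch would imply.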
RE: leakage in RAMDirectory ?
Hi, the major issue is that when using FSDirectory and indexing to a directory there are no missing entries, whereas when indexing with RAMDirectory I get missing entries. Currently I am investigating which entries are missing; since the application is configured to shut down in the event of an exception, either all get indexed or none. Rupinder

>-----Original Message-----
>From: Daniel Naber [mailto:[EMAIL PROTECTED]
>Sent: 06 October 2004 20:22
>To: Lucene Users List
>Subject: Re: leakage in RAMDirectory ?
>
>On Tuesday 05 October 2004 20:31, Rupinder Singh Mazara wrote:
>
>> (there are 18746 records in the table.)
>> Using a database result set, I loop over all the records,
>> creating a document object and indexing into ramDirectory and then onto
>> the file system.
>>
>> When I open an IndexReader and output numDocs I get 18740,
>
>It seems even in this case some documents are lost. Do you maybe ignore
>exceptions? Could you build a self-contained test case that shows the
>problem? The interesting question is of course *which* documents are lost
>and whether the behaviour is reproducible. The test case will either help
>you fix the bug in your code, or it will help us fix the bug in Lucene,
>if there is any.
>
>Regards
> Daniel
>
>--
>http://www.danielnaber.de
Re: Clustering lucene's results
Nope, because the example I showed is based on the "local interfaces" pipeline, and output-xsltrenderer is for remote components only. Anyway, I don't think it makes much sense -- if you need XSLT badly, just modify the source code to output the results as XML and put an XSLT filter on top of what it returns. Shouldn't be too hard. Dawid

Albert Vila wrote: That's great, thanks Dawid. Just a question: how can I modify your code in order to use the carrot2-output-xsltrenderer to output the clustering results in an HTML page? Can you provide an example? Thanks

Dawid Weiss wrote: Hi William, OK, here is some demo code I've put together that shows how you can achieve clustering of Lucene's results. I hope this will get you started on your projects. If you have questions, please don't hesitate to ask -- cross-posts to carrot2-developers would be a good idea too. The code (plus the binaries, so that you don't have to check out all of Carrot2 ;) is at: http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Take a look at Demo.java -- it is the main link between Lucene and Carrot2. Play with the parameters; I used 100 as the number of search results to be clustered. Adjust it to your needs.

int start = 0;
int requiredHits = 100;

I hope the code will be self-explanatory. Good luck, Dawid

From the readme file:

An example of using Carrot2 components to cluster search results from Lucene.

Prerequisites: You must have an index created with Lucene, containing documents with the following fields: url, title, summary. The Lucene demo works with exactly these fields -- I just indexed all of Lucene's source code and documentation using the following lines:

mkdir index
java -Djava.ext.dirs=build org.apache.lucene.demo.IndexHTML -create -index index .

The index is now in the 'index' folder. Remember that the quality of snippets and titles heavily influences the output of the clustering; in fact, the above example index of Lucene's API is not too good, because most queries will return nonsensical cluster labels (see below).

Building the Carrot2-Lucene demo: Basically you should have all of the Carrot2 source code checked out and issue the build command:

ant -Dcopy.dependencies=true

All of the required libraries and Carrot2 components will end up in the 'tmp/dist/deps-carrot2-lucene-example-jar' folder. You can also spare yourself some time and download the precompiled binaries I've put at: http://www.cs.put.poznan.pl/dweiss/tmp/carrot2-lucene.zip

Now, once you have the compiled binaries, issue the following command (all on one line of course):

java -Djava.ext.dirs=tmp\dist;tmp\dist\deps-carrot2-lucene-example-jar \
  com.dawidweiss.carrot.lucene.Demo index query

The first argument is the location of the Lucene index created before. The second argument is a query. In the output you should get clusters and at most three documents from every cluster:

Results for: query
Timings: index opened in: 0,181s, search: 0,13s, clustering: 0,721s
:> Search Lucene Rc1 Dev API
- F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/class-use/Query.html Uses of Class org.apache.lucene.search.Query (Lucene 1.5-rc1-dev API)
- F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-summary.html org.apache.lucene.search (Lucene 1.5-rc1-dev API)
- F:/Repositories/cvs.apache.org/jakarta-lucene/build/docs/api/org/apache/lucene/search/package-use.html Uses of Package org.apache.lucene.search (Lucene 1.5-rc1-dev API)
(and 19 more)
:> Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/src/java/overview.html Jakarta Lucene API
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/whoweare.html Jakarta Lucene - Who We Are - Jakarta Lucene
- F:/Repositories/cvs.apache.org/jakarta-lucene/docs/index.html Jakarta Lucene - Overview - Jakarta Lucene
(and 12 more)

If you look at the source code of Demo.java, there are plenty of things apt for customization -- the number of results from each cluster, the number of displayed clusters (I would cut it to some reasonable number, say 10 or 15 -- the further a cluster is from the "top", the less likely it is to be important). Also keep in mind that some Carrot2 components produce hierarchical clusters. This demonstration works with the "flat" version of the Lingo algorithm, so you don't need to worry about that. Hope this gets you started with using Carrot2 and Lucene. Please let me know about any successes or failures. Dawid
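A toy, self-contained sketch (this is not Carrot2, and it is far simpler than Lingo) of the general idea behind flat, label-based grouping of result titles, just to make the cluster output above less abstract. All names are made up:

```java
// Toy "clustering": group result titles under their shared terms and keep
// only labels shared by at least two titles. Real algorithms like Lingo do
// vastly more (phrase extraction, term weighting, label quality scoring).
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyClusters {

    static Map<String, List<String>> cluster(List<String> titles) {
        Map<String, List<String>> byTerm = new TreeMap<>();
        for (String title : titles) {
            for (String term : title.toLowerCase().split("\\W+")) {
                if (term.length() < 4) continue; // crude short-word filter
                byTerm.computeIfAbsent(term, k -> new ArrayList<>()).add(title);
            }
        }
        // A label only makes a cluster if more than one title shares it.
        byTerm.values().removeIf(docs -> docs.size() < 2);
        return byTerm;
    }

    public static void main(String[] args) {
        List<String> titles = List.of(
            "Uses of Class Query", "Query Parser Overview", "Who We Are");
        System.out.println(cluster(titles));
    }
}
```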
Re: Clustering lucene's results
That's great, thanks Dawid. Just a question: how can I modify your code in order to use the carrot2-output-xsltrenderer to output the clustering results in an HTML page? Can you provide an example? Thanks

Dawid Weiss wrote: [full demo write-up snipped; quoted in the reply above]

--
Albert Vila
R&D Project Director
http://www.imente.com
902 933 242
[iMente: "La información con más beneficios"]
Re: indexing numeric entities?
Daan Hoogland wrote:
>Hello,
>
>Does anyone do indexing of numeric entities for Japanese characters? I
>have (non-X)HTML containing those entities and need to index and search
>them.
>
Can the CJKAnalyzer index a string like "&#x25CF;&#20837;&#31038;"? It seems to be ignored completely when used with the demo. There was talk on this list of fixes for the demo HTMLParser; do these address this issue? When I look at the code, it seems that the entities should have been interpreted before indexing. What am I missing? Any comment, please? Or a pointer to a howto for dumm^H^H^H^H^H westerners? thanks,
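A minimal, self-contained sketch (making no claims about what the demo HTMLParser actually does; class and method names are made up) of decoding numeric character references before the text reaches an analyzer, so that the CJK characters exist as real codepoints in the token stream:

```java
// Sketch: decode decimal (&#NNNN;) and hexadecimal (&#xNNNN;) character
// references into their Unicode characters prior to indexing.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecode {

    private static final Pattern NUMERIC = Pattern.compile("&#(x?)([0-9a-fA-F]+);");

    static String decode(String html) {
        Matcher m = NUMERIC.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int radix = m.group(1).isEmpty() ? 10 : 16;
            int codePoint = Integer.parseInt(m.group(2), radix);
            // toChars handles codepoints outside the BMP as surrogate pairs.
            m.appendReplacement(sb,
                Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // &#x25CF; is a black circle; &#20837;&#31038; are CJK characters.
        System.out.println(decode("&#x25CF;&#20837;&#31038;"));
    }
}
```

Named entities (&amp;amp;, &amp;eacute;, ...) would need a lookup table on top of this; only numeric references are handled here.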
Re: Arabic analyzer
There is a way of writing an Arabic stemmer; it's just not a weekend project. I've seen the translate/stem option as well, and even tried it with Lucene. We've implemented Lucene on our database, and we have about a million records in our DB with 19 indexed fields (some of which are CLOBs) in each record. The free-text fields in each record are in many cases Arabic; we do not provide stemming on those simply because I couldn't find a valid stemming or translation option which held up to proper testing. Some were OK, but after collecting data from user searches (averaging out at 5 searches per second), the Arabic stemming options would not be able to manage user expectations, which is what it comes down to; sometimes theory does not translate well to practice. Nader Henein

Dawid Weiss wrote: [quoting Nader:] nothing to do with each other; furthermore, Arabic uses phonetic indicators on each letter, called diacritics, that change the way you pronounce the word, which in turn changes the word's meaning, so two words spelled exactly the same way with different diacritics will mean two separate things. [Dawid:] Just to point out the fact: most Slavic languages also use diacritic marks (above, like the 'acute' or 'dot' marks, or below, like the Polish 'ogonek' mark). Some people argue that they can be stripped off the text upon indexing and that the queries usually disambiguate the context of the word. It is just a digression. Now back to the Arabic stemmer -- there has to be a way of doing it. I know Vivisimo has clustering options for Arabic. They must be using a stemmer (and an English translation dictionary), although it might be a commercial one. Take a look: http://vivisimo.com/search?v:file=cnnarabic D.
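The diacritic-stripping idea mentioned for Slavic languages can be sketched with JDK Unicode normalization alone (illustrative only, with made-up names; and per the point above, this is exactly what you would not want to do blindly for Arabic, where diacritics are meaning-bearing):

```java
// Sketch: decompose to NFD, then drop combining marks (Unicode general
// category Mn). Works for marks that decompose, e.g. e-acute; note that
// some letters, like Polish l-with-stroke, do not decompose and survive.
import java.text.Normalizer;

public class DiacriticStrip {

    static String strip(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{Mn}+", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("caf\u00E9"));  // e-acute becomes plain e
    }
}
```

The same call would also strip Arabic harakat (they are category Mn), which illustrates why a stemmer, not a character-level hack, is needed there.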
Indexing XML using lucene_xml_indexing
Hello.. I have tried indexing XML files as a standalone application. Has anyone tried indexing XMLs using lucene_xml_indexing from isogen.com for a web application, similar to the demo 'luceneweb'? It would be great if I could get a web demo for indexing XMLs. Expecting some guidance !! Thanks.
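This is not lucene_xml_indexing itself, but a minimal sketch of the core step it would perform (element and field names here are made up): pull text out of chosen XML elements with the JDK's built-in DOM parser, so each value can then be added as a field on a Lucene Document:

```java
// Sketch: extract per-element text from an XML document. In a real indexer,
// each extracted value would become a Lucene Field (e.g. title, body).
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlFields {

    // Parse an XML string with the JDK's built-in DOM parser.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException("parse failed", e);
        }
    }

    // Text content of the first element with the given tag name, or null.
    static String textOf(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        Document d = parse("<doc><title>Indexing XML</title><body>some text</body></doc>");
        System.out.println(textOf(d, "title"));
        System.out.println(textOf(d, "body"));
    }
}
```

For large files a streaming SAX handler would be preferable to DOM, but the field-extraction idea is the same.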