Re: lucene index parser problem

2004-09-08 Thread sergiu gordea
maybe you should encode the html code ... Patrick Burleson wrote: Why oh why did you send this to the tomcat lists? Don't cross post! Especially when the question doesn't even apply to one of the lists. Patrick On Tue, 7 Sep 2004 16:35:35 -0400, hui liu [EMAIL PROTECTED] wrote: Hi, I have such

Re: pdf in Chinese

2004-09-08 Thread Chandan Tamrakar
Which analyzer are you using to index Chinese PDF documents? I think you should use CJKAnalyzer. - Original Message - From: [EMAIL PROTECTED] [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 11:27 AM Subject: pdf in Chinese Hi all, i use pdfbox to parse

RE: pdf in Chinese

2004-09-08 Thread Alex Kiselevski
Hi, Can you please advise me of any solution for a Hebrew analyzer? -Original Message- From: Chandan Tamrakar [mailto:[EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 11:15 AM To: Lucene Users List Subject: Re: pdf in Chinese which analyzer you are using to index chinese pdf documents ? I

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread René Hackl
Hi Bill, - But even if it didn't, the second problem is that the query formed would be +(title:cutting title:lucene) +(author:cutting author:lucene) That is, if the word Lucene was in both the author field and the title field, the match would fit. This clearly isn't what the searcher

Re: Use of explain() vs search()

2004-09-08 Thread Erik Hatcher
Could you create a simple piece of code (using a RAMDirectory) that demonstrates this issue? Erik On Sep 8, 2004, at 12:35 AM, Minh Kama Yie wrote: Hi all, Sorry I should clarify my last point. The search() would return no hits, but the explain() using the apparently invalid docId

Re: pdf in Chinese

2004-09-08 Thread [EMAIL PROTECTED]
It is not about the analyzer; I need to read the text from the PDF file first. - Original Message - From: Chandan Tamrakar [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 4:15 PM Subject: Re: pdf in Chinese which analyzer you are using to index

*term search

2004-09-08 Thread sergiu gordea
Hi all, I want to discuss a little problem: Lucene doesn't support *Term-like queries. I know that this can bring a lot of results into memory and is therefore restricted. I think that allowing this kind of search and limiting the amount of returned results would be a more useful

Re: *term search

2004-09-08 Thread Erik Hatcher
On Sep 8, 2004, at 6:26 AM, sergiu gordea wrote: I want to discuss a little problem, lucene doesn't support *Term like queries. First of all, this is untrue. WildcardQuery itself most definitely supports wildcards at the beginning. I would like to use *schreiben. The dilemma you've encountered
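
A minimal sketch of why a leading wildcard such as *schreiben costs more than a trailing one, assuming the index keeps its terms in sorted order. This is plain Java, not Lucene's WildcardQuery; the class and method names are hypothetical, and the term list is made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration only; this is not Lucene code.
final class LeadingWildcardSketch {

    // "*suffix": there is no fixed prefix to seek to, so every term is inspected.
    static List<String> matchSuffix(TreeSet<String> terms, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                hits.add(term);
            }
        }
        return hits;
    }

    // "prefix*": jump to the first candidate in sorted order and stop at the first miss.
    static List<String> matchPrefix(TreeSet<String> terms, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "abschreiben", "lesen", "schreiben", "schreiber", "umschreiben"));
        System.out.println(matchSuffix(terms, "schreiben")); // [abschreiben, schreiben, umschreiben]
        System.out.println(matchPrefix(terms, "schreib"));   // [schreiben, schreiber]
    }
}
```

The suffix search touches all five terms, while the prefix search touches only three before it stops.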

Re: *term search

2004-09-08 Thread Morus Walter
sergiu gordea writes: Hi all, I want to discuss a little problem, lucene doesn't support *Term like queries. I know that this can bring a lot of results in the memory and therefore it is restricted. That's not the reason for the restriction. That's possible with a* also. The

Re: *term search

2004-09-08 Thread iouli . golovatyi
.. and here is the way to do it: (See attached file: SUPPOR~1.RAR) Erik Hatcher

Re: pdf in Chinese

2004-09-08 Thread Ben Litchfield
This appears to be more of a PDFBox issue than a lucene issue, please post an issue to the PDFBox site. Also note, that because of certain encodings that a PDF writer can use, it is impossible to extract text from all PDF documents. Ben On Wed, 8 Sep 2004, [EMAIL PROTECTED] wrote: it is not

RE: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread Wermus Fernando
Bill, I don't receive any .java. Could you send it again? Thanks. -Mensaje original- De: Bill Janssen [mailto:[EMAIL PROTECTED] Enviado el: Martes, 07 de Septiembre de 2004 10:06 p.m. Para: Lucene Users List CC: Ali Rouhi Asunto: MultiFieldQueryParser seems broken... Fix

Re: Moving from a single server to a cluster

2004-09-08 Thread Nader Henein
Hey Ben, We've been using a distributed environment with three servers and three separate indices for the past 2 years, since the first stable Lucene release, and it has been great. Recently, for the past two months, I've been working on a redesign for our Lucene app and I've shared my

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread sergiu gordea
The class is at the end of the message. But I think that a better solution is the one suggested by René: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116 Wermus Fernando wrote: Bill, I don't receive any .java. Could you send it again? Thanks. -Mensaje original-

Re: pdf in Chinese

2004-09-08 Thread Chas Emerick
I'm not aware of any Java library that can reliably extract Chinese text from PDF documents. We're planning on supporting Chinese, Japanese, and Korean in version 2 of PDFTextStream, but there's no doubt that it's a huge challenge. Chas Emerick | [EMAIL PROTECTED] PDFTextStream: fast PDF

PDF-Text Performance comparison

2004-09-08 Thread Ben Litchfield
On Wed, 8 Sep 2004, Chas Emerick wrote: PDFTextStream: fast PDF text extraction for Java applications http://snowtide.com/home/PDFTextStream/ For those that have not seen, snowtide.com has done a performance comparison against several Java PDF-Text libraries, including Snowtide's

RE: Moving from a single server to a cluster

2004-09-08 Thread David Townsend
Would it be cheeky to ask you to post the docs to the group? It would be interesting to read how you've tackled this. -Original Message- From: Nader Henein [mailto:[EMAIL PROTECTED] Sent: 08 September 2004 13:57 To: Lucene Users List Subject: Re: Moving from a single server to a cluster

Re: PDF-Text Performance comparison

2004-09-08 Thread Chas Emerick
Ben, Wow, thanks for the plug! :-) Truthfully, I was worried that our open-source brethren might feel slighted by the comparison -- that's partially why we wanted to make sure it was as thorough and transparent as possible so that anyone could review the results for themselves. I'm glad that

Re: Moving from a single server to a cluster

2004-09-08 Thread Nader Henein
be a pleasure, just didn't want to mislead someone down the wrong way. Give me a few days and I'll have the new version up. Nader

Re: Moving from a single server to a cluster

2004-09-08 Thread Praveen Peddi
We went through the same scenario as yours. We recently made our application clusterable, and I wrote our own version of a JDBC directory (similar to the SQLDirectory posted by someone) with our own caching. It was great for searching, but indexing had become a real bottleneck. So we have decided to move

RE: -- TomCat/Lucene, filesystem

2004-09-08 Thread Will Allen
I think you might be referring to the XML files you keep in C:\Program Files\Apache\Tomcat\conf\Catalina\localhost. I have a file with the contents (myapp.xml): <?xml version='1.0' encoding='utf-8'?> <Context docBase="C:/work/aggregation/myapp/web" path="/myapp" reloadable="true"> </Context> -Original

where is the SnowBallAnalyzer?

2004-09-08 Thread Wermus Fernando
I have to look more carefully, but why isn't SnowballAnalyzer in the org.apache.lucene.analysis.snowball package? I have Lucene 1.4. I'm writing my own Spanish stemmer.

Re: where is the SnowBallAnalyzer?

2004-09-08 Thread Ernesto De Santis
It is in snowball-1.0.jar. I sent it to you in a private email. Bye Ernesto. - Original Message - From: Wermus Fernando [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 1:12 PM Subject: where is the SnowBallAnalyzer? I have to look better, but why the

IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread David Spencer
http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/search/IndexSearcher.html#close() What is the intent of IndexSearcher.close()? I want to know how, in a web app, one can stop a search that's in progress - use case is a user is limited to one search at a time, and when one (expensive)

Full web search engine package using Lucene

2004-09-08 Thread Anne Y. Zhang
Hi, I am assisting a professor with an IR course. We need to provide the students with a full-featured search engine package, and the professor prefers that it be powered by Lucene. Since I am new to Lucene, can anyone tell me where I can get such a package? We also want the

Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Is it safe to change the compound file format option at any time during the life of an index? Can I build an index with it off, then turn it on, and call optimize, and have a compound file formatted index? And then later, turn it on, call optimize again, and go back the other way? The

Re: Full web search engine package using Lucene

2004-09-08 Thread Anne Y. Zhang
Thanks, David. But it seems that this is downloadable. Could you please provide me the link for download? Thank you very much! Ya - Original Message - From: David Spencer [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 2:43 PM Subject: Re:

Re: Compound File Format question

2004-09-08 Thread Andrzej Bialecki
Armbrust, Daniel C. wrote: Is it safe to change the compound file format option at any time during the life of an index? Can I build an index with it off, then turn it on, and call optimize, and have a compound file formatted index? And then later, turn it on, call optimize again, and go back

Re: Full web search engine package using Lucene

2004-09-08 Thread Bernhard Messer
Anne Y. Zhang wrote: Thanks, David. But it seems that this is downloadable. Could you please provide me the link for download? Thank you very much! http://www.nutch.org/release/ Ya - Original Message - From: David Spencer [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent:

RE: Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Hmm, I tried that in Luke - but it doesn't seem to take. When I uncheck the use compound file check box, and then select optimize, it doesn't change anything. I guess I should just write some code already :) Dan -Original Message- From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]

Re: Full web search engine package using Lucene

2004-09-08 Thread Anne Y. Zhang
Thanks a lot! Ya - Original Message - From: Bernhard Messer [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, September 08, 2004 3:38 PM Subject: Re: Full web search engine package using Lucene Anne Y. Zhang wrote: Thanks, David. But it seems that this is

RE: Compound File Format question

2004-09-08 Thread Armbrust, Daniel C.
Ahh - two new discoveries: You have to add a document, remove a document, and then call optimize. Then everything works (nearly as expected) The version of Lucene that ships with Luke still has the broken optimize code in it that didn't clean up after itself - so you need to just download

Re: IndexSearcher.close() and aborting searches in progress

2004-09-08 Thread Otis Gospodnetic
Dave, I haven't tried this, but I think this would be messy. Lucene needs to keep index files open, so that when you pull a Document from Hits, it can read this stuff from those files. If you close index files, you are likely to get some NPEs or some such. I don't think you'll find a ready to

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread Bill Janssen
René, Thanks for your note. I'd think that if a user specified a query cutting lucene, with an implicit AND and the default fields title and author, they'd expect to see a match in which both cutting and lucene appears. That is, (title:cutting OR author:cutting) AND (title:lucene OR
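
The expansion Bill describes can be sketched in a few lines of plain Java. This only builds the query string; it is not the MultiFieldQueryParser fix attached to the thread, the class and method names are hypothetical, and the closing author:lucene clause is an assumption, since the preview above is cut off before it.

```java
// Hypothetical illustration only; this is not Lucene code and not the attached fix.
final class MultiFieldExpansionSketch {

    // For each term, OR it across all default fields, then AND the per-term groups together.
    static String expand(String[] fields, String[] terms) {
        StringBuilder query = new StringBuilder();
        for (int t = 0; t < terms.length; t++) {
            if (t > 0) {
                query.append(" AND ");
            }
            query.append('(');
            for (int f = 0; f < fields.length; f++) {
                if (f > 0) {
                    query.append(" OR ");
                }
                query.append(fields[f]).append(':').append(terms[t]);
            }
            query.append(')');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"title", "author"};
        String[] terms = {"cutting", "lucene"};
        // prints: (title:cutting OR author:cutting) AND (title:lucene OR author:lucene)
        System.out.println(expand(fields, terms));
    }
}
```

Grouping by term rather than by field is the point: each word must appear somewhere, but not necessarily in every field.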

Re: indexing size

2004-09-08 Thread Dmitry Serebrennikov
Niraj Alok wrote: Hi PA, Thanks for the detail! Since we are using lucene to store the data also, I guess I would not be able to use it. By the way, I could be wrong, but I think the 35% figure you referenced in your first e-mail actually does not include any stored fields. The deal with

maximum index size

2004-09-08 Thread Chris Fraschetti
I know the index size is very dependent on the content being indexed... but running on a Unix-based machine w/o a file size limit, best case scenario... what is the largest number of documents that can be indexed? I've seen throughout the list mentions of millions of documents.. 8 million, 20

Re: maximum index size

2004-09-08 Thread Otis Gospodnetic
Given adequate hardware, it can. Take a look at nutch.org. Nutch uses Lucene at its core. Otis --- Chris Fraschetti [EMAIL PROTECTED] wrote: I know the index size is very dependent on the content being index... but running on a unix based machine w/o a filesize limit, best case

Re: maximum index size

2004-09-08 Thread Doug Cutting
Chris Fraschetti wrote: I've seen throughout the list mentions of millions of documents.. 8 million, 20 million, etc etc.. but can lucene potentially handle billions of documents and still efficiently search through them? Lucene can currently handle up to 2^31 documents in a single index. To a
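
To put the 2^31 figure in concrete terms, here is a small arithmetic sketch in plain Java. It assumes the bound comes from document numbers being Java ints, which the preview above does not state; the class name, the helper names, and the 5 billion example are hypothetical.

```java
// Hypothetical illustration only; this is not Lucene code.
final class DocLimitSketch {

    // 2^31 as a long, the per-index figure quoted in the message above.
    static long maxDocs() {
        return 1L << 31;
    }

    // Ceiling division: how many separate indexes a larger collection would need at that limit.
    static long indexesNeeded(long totalDocs) {
        return (totalDocs + maxDocs() - 1) / maxDocs();
    }

    public static void main(String[] args) {
        System.out.println(maxDocs());                     // 2147483648
        System.out.println(Integer.MAX_VALUE);             // 2147483647, the largest Java int
        System.out.println(indexesNeeded(5_000_000_000L)); // 3
    }
}
```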