Index MSOffice Documents

2004-06-25 Thread Sergiu Gordea
this and will a better source code. Congratulations to all people involved in development of the Jakarta project and its subprojects, Sergiu Gordea PS: ExeConverteImpl uses an external stand-alone application (like antiword or pdf2txt) to extract the text. /* @(#) CWK 1.4 07.06.2004 * * Copyright 2003

Re: Index MSOffice Documents

2004-06-28 Thread Sergiu Gordea
. Sergiu - Original Message - From: Sergiu Gordea [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED]; [EMAIL PROTECTED] Cc: POI Users List [EMAIL PROTECTED] Sent: Friday, June 25, 2004 8:42 AM Subject: Index MSOffice Documents Hi all, I'm working on a project in which we

Re: Searching against Database

2004-07-15 Thread Sergiu Gordea
Hi again, I'm thinking of getting the list of IDs from the database and the list of hits from the Lucene index, and creating a comparator in order to eliminate the non-permitted hits from the list. Which solution do you think is better? Thanks, Sergiu Sergiu Gordea wrote: Hi, I have a similar problem
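
The idea in this message (take the permitted IDs from the database and drop every hit that is not among them) can be sketched in plain Java. This is only an illustration under stated assumptions: the class name `PermittedHitFilter`, the use of string IDs, and the sample data are all hypothetical, and the code works on ID lists instead of real Lucene `Hits`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PermittedHitFilter {

    // Keeps only the hit IDs that also appear in the permitted set,
    // preserving the ranking order of the search results.
    public static List<String> filter(List<String> hitIds, Set<String> permittedIds) {
        List<String> visible = new ArrayList<>();
        for (String id : hitIds) {
            if (permittedIds.contains(id)) {
                visible.add(id);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        // Hypothetical data: IDs returned by the index, in ranking order.
        List<String> hitIds = Arrays.asList("doc7", "doc2", "doc9", "doc4");
        // Hypothetical data: IDs the database says this user may see.
        Set<String> permittedIds = new HashSet<>(Arrays.asList("doc2", "doc4", "doc5"));
        System.out.println(filter(hitIds, permittedIds)); // prints [doc2, doc4]
    }
}
```

A hash set lookup keeps the filtering linear in the number of hits, so no comparator or sorting step is needed for this variant.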

Re: Searching against Database

2004-07-15 Thread Sergiu Gordea
AND group:developers to the user's query. Then you will not have to merge results. -Will -Original Message- From: Sergiu Gordea [mailto:[EMAIL PROTECTED] Sent: Thursday, July 15, 2004 2:58 AM To: Lucene Users List Subject: Re: Searching against Database Hi, I have a similar problem. I'm working on a web
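
Will's suggestion of appending a group clause to the user's query could look roughly like the sketch below. The field name `group` comes from the message; the helper name `restrictToGroup` and the wrapping parentheses are assumptions added here so that the restriction applies to the whole user query and not only to its last clause.

```java
public class GroupRestrictedQuery {

    // Wraps the user's query in parentheses and additionally requires the group field.
    // Note: the group value is inserted as-is; special query characters are not escaped.
    public static String restrictToGroup(String userQuery, String group) {
        return "(" + userQuery + ") AND group:" + group;
    }

    public static void main(String[] args) {
        System.out.println(restrictToGroup("coffee OR tea", "developers"));
        // prints (coffee OR tea) AND group:developers
    }
}
```

The resulting string would then be handed to the query parser in place of the raw user input, so the index itself enforces the restriction and no merging of result lists is required.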

rebuild index

2004-07-22 Thread Sergiu Gordea
Hi all, I have a question related to reindexing of documents with Lucene. We want to implement the functionality of rebuilding the Lucene index. That means I want to delete all documents in the index and to add newer versions. All information I need to reindex is kept in the database so that I have a

Re: rebuild index

2004-07-22 Thread Sergiu Gordea
feel that this is the right way to solve the problem. Sergiu Aviran wrote: Why don't you just build a new index in a different location and at the end add the missing documents from the old index to the new one, and then delete the old index. Aviran -Original Message- From: Sergiu Gordea
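
The last step of Aviran's suggestion (put the freshly built index in place and delete the old one) can be sketched with plain `java.nio.file` calls. This is a minimal sketch, not a Lucene API: the class name `IndexSwap`, the `.old` suffix, and the directory names are hypothetical, both directories are assumed to be on the same file system, and copying the missing documents from the old index is not covered.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class IndexSwap {

    // Replaces the live index directory with a freshly built one.
    public static void swap(Path liveDir, Path freshDir) {
        Path retired = liveDir.resolveSibling(liveDir.getFileName() + ".old");
        try {
            deleteRecursively(retired);        // clear leftovers from an earlier failed swap
            if (Files.exists(liveDir)) {
                Files.move(liveDir, retired);  // step 1: move the old index aside
            }
            Files.move(freshDir, liveDir);     // step 2: put the new index in place
            deleteRecursively(retired);        // step 3: delete the old index
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> toDelete;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order so that files come before their parent directories.
            toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : toDelete) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("index-swap-demo");
        Path live = Files.createDirectory(base.resolve("index"));
        Path fresh = Files.createDirectory(base.resolve("index.new"));
        Files.createFile(live.resolve("old-segment"));
        Files.createFile(fresh.resolve("new-segment"));
        swap(live, fresh);
        System.out.println(Files.exists(live.resolve("new-segment"))); // true
        System.out.println(Files.exists(live.resolve("old-segment"))); // false
    }
}
```

Any searcher or reader that still has the old directory open would need to be closed before the swap; on Windows in particular, open files usually cannot be moved or deleted.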

Re: continous index update

2004-07-28 Thread Sergiu Gordea
you have to delete the documents using IndexReader and write the Documents using IndexWriter, both of them place a lock on the index file, so ... you cannot work with both of them at the same time. (you get errors when you have an open IndexWriter and try to delete a document with an

Re: Having common word in the search

2004-08-02 Thread Sergiu Gordea
I have the same problem. Right now I think it is not possible to do what you want by using MultiFieldQueryParser. Right now I implemented a query normalization for our product, but I consider that the best way is to take the source code and to implement: Query q =

Re: search exception in servlet!

2004-08-02 Thread Sergiu Gordea
Probably it will be a good idea to provide the stack trace of the error you get. It's a little bit hard to guess the error in the code you provided. Sergiu xuemei li wrote: hi, all I am using lucene to search. It works fine before I put the code into the doPost of servlet. But after that it will

I appologize for this email...

2004-09-01 Thread Sergiu Gordea
Sorry, I sent this email to transfer my contacts between the Mozilla and Thunderbird email clients. Sergiu

Re: lucene index parser problem

2004-09-08 Thread sergiu gordea
maybe you should encode the html code ... Patrick Burleson wrote: Why oh why did you send this to the tomcat lists? Don't cross post! Especially when the question doesn't even apply to one of the lists. Patrick On Tue, 7 Sep 2004 16:35:35 -0400, hui liu [EMAIL PROTECTED] wrote: Hi, I have such

*term search

2004-09-08 Thread sergiu gordea
Hi all, I want to discuss a little problem: Lucene doesn't support *Term-like queries. I know that this can bring a lot of results into memory and therefore it is restricted. I think that allowing this kind of search and limiting the amount of returned results would be a more useful

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-08 Thread sergiu gordea
The class is at the end of the message. But I think that a better solution is the one suggested by Rene: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116 Wermus Fernando wrote: Bill, I don't receive any .java. Could you send it again? Thanks. -Mensaje original-

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-09 Thread sergiu gordea
Hi Bill, I think that more people are waiting for this patch of MultifieldIndexParser. It would be nice if it were included in the next release candidate. All the best, Sergiu Bill Janssen wrote: René, Thanks for your note. I'd think that if a user specified a query cutting lucene, with

Re: Handling user queries (Was: Re: MultiFieldQueryParser seems broken... Fix attached.)

2004-09-09 Thread sergiu gordea
René Hackl wrote: is it a problem if the users will search coffee OR tea as a search string in the case that MultiFieldQueryParser is modified as Bill suggested, and the default operator is set to AND? No. There's not a problem with the proposed correction to MFQP. MFQP should work the way

Re: MultiFieldQueryParser seems broken... Fix attached.

2004-09-10 Thread sergiu gordea
. I reckon there has been a discussion (and solution :-) on how to achieve the functionality you've been after: http://issues.apache.org/eyebrowse/[EMAIL PROTECTED]msgId=1798116 I'm not sure if this would be the same though. Best regards, René Hi all, I took the code indicated by Rene but I've

Re: OutOfMemory example

2004-09-13 Thread sergiu gordea
I have a few comments regarding your code ... 1. Why do you use RAMDirectory and not the hard disk? 2. as John said, you should reuse the index instead of creating it each time in the main function if(!indexExists(File indexFile)) IndexWriter writer = new IndexWriter(directory, new

Re: OutOfMemory example

2004-09-13 Thread sergiu gordea
and in deterministic way produce OutOfMemoryError. That's all. Jiri. -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Monday, September 13, 2004 5:16 PM To: Lucene Users List Subject: Re: OutOfMemory example I have a few comments regarding your code ... 1. Why do you use

Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
String queryString = "\"what is java\""; Query q = QueryParser.parse(queryString, field, new StandardAnalyzer()); System.out.println(q.toString()); This is enough to get started; consult the Lucene API for more information. Sergiu Natarajan.T wrote: Hi, Thanks for your mail, that link says only

Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
Natarajan.T wrote: Hi, Thanks for your response. For example search keyword is like below... Language what is java Token 1: language Token 2: what is java(like google) Regards, Natarajan. Lucene works exactly as you describe above with a simple correction ... The analyzer has a list of

Re: Search PharseQuery

2004-09-14 Thread sergiu gordea
you luck, Sergiu Regards, Natarajan. -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 14, 2004 7:38 PM To: Lucene Users List Subject: Re: Search PharseQuery Natarajan.T wrote: Hi, Thanks for your response. For example search keyword

Re: QueryParser.parse() and Lucene1.4.1

2004-09-17 Thread sergiu gordea
Hi Polima, It seems to me that your query string is not correct ... (A AND -(B)) AND = + NOT = - In Lucene the AND and NOT operators are mapped internally to +/-, (AND and NOT are supported only because they are coming from natural language) so ... A + - (B) makes no sense ... Sergiu Polina Litvak

Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
Hi Fred, I think that we can help you if you provide us your code, and the context in which it is used. We need to see how you open and close the searcher and the reader, and what operations you are doing on the index. All the best, Sergiu Fred Toth wrote: Hi, I have built a nice lucene

Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
Hi Fred, That's right, there are many references to this kind of problem in the lucene-user list. These suggestions were already made, but I'll list them once again: 1. One way to use the IndexSearcher is to use your code, but I don't encourage users to do that IndexReader reader =

Re: indexes won't close on windows

2004-09-20 Thread sergiu gordea
Fred Toth wrote: Hi Sergiu, Thanks for your suggestions. I will try using just the IndexSearcher(String...) and see if that makes a difference in the problem. I can confirm that I am doing a proper close() and that I'm checking for exceptions. Again, the problem is not with the search function,

Re: Using lucene in Tomcat

2004-09-28 Thread sergiu gordea
mahaveer jain wrote: Hi all, I have implemented lucene search for my documents and db successfully. Now my problem is, the index i created is indexing to my local disk, i want the index to be created with reference to my server. Right now I index C:/tomcat/webapps/jetspeed/document, but I want to

Re: different analyzer all produce the same index?

2004-10-04 Thread sergiu gordea
Daan Hoogland wrote: Hi all, I try to create different indices using different Analyzer-classes. I tried standard, german, russian, and cjk. They all produce exactly the same index file (md5-wise). There are over 280 pages so I expected at least some differences. Take a look in the lucene

Re: BooleanQuery - Too Many Clases on date range.

2004-10-04 Thread Sergiu Gordea
Chris Fraschetti wrote: absolutely, limiting the user's query is no problem here. I've currently implemented the lucene javascript to catch a lot of user queries that could cause issues.. blank queries, ? or * at the beginning of query, etc etc... but I couldn't think of a way to prevent the user

Re: *term search

2004-10-07 Thread sergiu gordea
8, 2004, at 6:26 AM, sergiu gordea wrote: I want to discuss a little problem, lucene doesn't support *Term like queries. First of all, this is untrue. WildcardQuery itself most definitely supports wildcards at the beginning. I would like to use

Re: Null or no analyzer

2004-10-20 Thread sergiu gordea
Erik Hatcher wrote: On Oct 20, 2004, at 9:55 AM, Aviran wrote: AFAIK if the term Election 2004 will be between quotation marks this should work fine. No, it won't. The Analyzer will analyze it, and the WhitespaceAnalyzer would split it into two tokens [Election] and [2004]. This is a tricky
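
Erik's point about the two tokens can be illustrated without Lucene. The sketch below only mimics splitting on whitespace with `String.split`; it is an assumption-laden stand-in, not the real `WhitespaceAnalyzer`, and the class name `WhitespaceSplitDemo` is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceSplitDemo {

    // Splits text on runs of whitespace, roughly what a whitespace-based analyzer does.
    public static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Election 2004")); // prints [Election, 2004]
    }
}
```

A field that was indexed as the single untokenized value "Election 2004" would therefore not match either of the two tokens produced at query time.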

Re: Null or no analyzer

2004-10-20 Thread Sergiu Gordea
Rupinder Singh Mazara wrote: hi the basic problem here is that there are data sources which contain a) id, b) text, c) title, d) authors and e) subject heading. text, title and authors need to be tokenized; the subject heading can be one or more words, the subject must also be tokenized,

Re: Null or no analyzer

2004-10-21 Thread sergiu gordea
Erik Hatcher wrote: I don't like the idea of users having to know how a field was indexed though. That seems to defeat the purpose of a general-purpose QueryParser. Erik I agree with that, but maybe Lucene should provide some subclasses of QueryParser that deal with these problems. I'm just a

Re: Null or no analyzer

2004-10-21 Thread sergiu gordea
Erik Hatcher wrote: On Oct 21, 2004, at 5:38 AM, sergiu gordea wrote: Erik Hatcher wrote: I don't like the idea of users having to know how a field was indexed though. That seems to defeat the purpose of a general-purpose QueryParser. Erik I agree with that, but maybe Lucene should provide

Re: Need advice: what pdf lib to use?

2004-10-25 Thread sergiu gordea
[EMAIL PROTECTED] wrote: Hi Iouli, If you don't think it is illegal, you can hack the pdfbox code to remove the protection ... Sergiu PDFbox stumbles also with class java.io.IOException with message: - You do not have permission to extract text in case the doc is copy/print protected. I

Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread sergiu gordea
of course POI, for open source. There are some commercial products based on POI also. for WORD consider textmining.org for XLS, POI does anything you need for powerpoint there is one commercial (it's about 1000$), but you can also find some source code in archives. All the best, Sergiu [EMAIL

Re: Need advice: what pdf lib to use?

2004-10-25 Thread sergiu gordea
Ben Litchfield wrote: In order to write software that consumes PDF documents you must agree to a list of conditions. One of those conditions is that permissions specified by the author of the PDF document are respected. PDFBox complies with this statement, if there is software that does not then

Re: Need advice: what Word/Excel/PowerPoint lib to use?

2004-10-25 Thread Sergiu Gordea
Genty Jean-Paul wrote: At 17:05 25/10/2004, you wrote: of course POI, for open source. There are some commercial products based on POI also. for WORD consider textmining.org for XLS, POI does anything you need for powerpoint there is one commercial (it's about 1000$), but you can also find some

Re: new version of NewMultiFieldQueryParser

2004-10-27 Thread sergiu gordea
Bill Janssen wrote: I'm not sure this solution is very robust I think I already sent an email with a better code... Sergiu Thanks to something Doug said when I first opened this discussion, I went back and looked at my implementation. He said, Can't we just do this in getFieldQuery?.

Re: new version of NewMultiFieldQueryParser

2004-10-28 Thread sergiu gordea
Bill Janssen wrote: I'm not sure this solution is very robust Thanks, but I'm pretty sure it *is* robust. Can you please offer a specific critique? Always happy to learn and improve :-). Try to see the behavior if you want to have a single term query just something like: robust

Re: Searching for a path

2004-10-29 Thread sergiu gordea
Bill Tschumy wrote: I have a need to search an index for documents that were taken from particular files in the filesystem. Each document in the index has a field named url that is created using: doc.add(Field.Text("url", urlStr)); I understand this is both stored and indexed. My search

Re: new version of NewMultiFieldQueryParser

2004-10-29 Thread sergiu gordea
Bill Janssen wrote: Try to see the behavior if you want to have a single term query just something like: robust .. and print out the query string ... Sure, that works fine. For instance, if you have the three default fields title, authors, and contents, the one-word search robust

Re: new version of NewMultiFieldQueryParser

2004-10-29 Thread sergiu gordea
Morus Walter wrote: Bill Janssen writes: Try to see the behavior if you want to have a single term query just something like: robust .. and print out the query string ... Sure, that works fine. For instance, if you have the three default fields title, authors, and contents, the

Re: jaspq: dashed numerical values tokenized differently

2004-11-01 Thread sergiu gordea
Daniel Taurat wrote: Hi, I have just another stupid parser question: There seems to be a special handling of the dash sign - different from Lucene 1.2 at least in Lucene 1.4.RC3 StandardAnalyzer. From the behaviour you describe I think that the dash sign is removed from the text by the

Re: How do Lucene applications deal with API changes?

2004-11-03 Thread sergiu gordea
Bill Janssen wrote: Thanks to Bill Tschumy, who points out that Lucene 1.4.21 *breaks* the API exported by 1.4 by removing a parameter from QueryParser.getFieldQuery(). That means that my NewMultiFieldQueryParser also breaks, since it overrides that method. To fix, just remove the Analyzer

Re: one huge index or many small ones?

2004-11-04 Thread Sergiu Gordea
javier muguruza wrote: Hi Javier, I think your optimization should take care of the response time of search queries. I assume that this is the variable you need to optimize. Probably it will be a good thing to read first the lucene benchmarks:

Re: one huge index or many small ones?

2004-11-05 Thread sergiu gordea
refactor your code in the future. So ... choose one solution and implement the first prototype, and keep in mind that your information is managed by the database, and lucene is just your search module. Sergiu On Thu, 04 Nov 2004 19:01:53 +0100, Sergiu Gordea [EMAIL PROTECTED] wrote: javier

Re: Is there an easy way to have indexing ignore a CVS subdirectory in the index directory?

2004-11-05 Thread sergiu gordea
Chuck Williams wrote: Otis, thanks for looking at this. The stack trace of the exception is below. I looked at the code. It wants to delete every file in the index directory, but fails to delete the CVS subdirectory entry (presumably because it is marked read-only; the specific exception is

Re: Lucene : avoiding locking (incremental indexing)

2004-11-15 Thread sergiu gordea
Luke Shannon wrote: I like the sound of the Queue approach. I also don't like that I have to forcefully unlock the index. Personally I don't like the Queue approach... because I already implemented multithreading in our application to improve its performance. In our application indexing is not

Re: Lucene : avoiding locking (incremental indexing)

2004-11-16 Thread Sergiu Gordea
[EMAIL PROTECTED] wrote: I am interested in pursuing experienced peoples' understanding as I have half the queue approach developed already. well I think that experienced people developed lucene :) they offered us the possibility to use multithreading and concurrent searching. Of course ..

Re: Thread safety

2004-12-02 Thread sergiu gordea
Otis Gospodnetic wrote: 1. yes 2. yes error, meaningful, it depends what you find meaningful :) 3. searcher will still find the document, unless you close it and reopen it (searcher) ... What about LockException? I tried to index objects in a thread and to use a IndexSearcher to search

Re: restricting search result

2004-12-06 Thread Sergiu Gordea
Paul wrote: Hi, how would you restrict the search results for a certain user? I'm indexing all the existing data in my application but there are certain access levels so some users should see more results than others. Each lucene document has a field with an internal id and I want to restrict on

Re: Lucene index files from two different applications.

2004-12-21 Thread Sergiu Gordea
Gururaja H wrote: Hi ! Have two applications. Both are supposed to write Lucene index files and the WebApplication is supposed to read these index files. Here are the questions: 1. Can two applications write index files, in the same directory, at the same time ? if you implement the

Re: addIndexes() Question

2004-12-23 Thread Sergiu Gordea
I think you should change your plans a little bit, and think that your goal is to create a fast search engine, not a fast indexing engine. When you plan to index a lot of documents then it is possible to create a lot of segments (if you don't optimize the index) and the search will be very slow

Re: reading fields selectively

2005-01-25 Thread sergiu gordea
Hi to all lucene developers, The read fields selectively feature would be very useful for me. Do you plan to include it in the next lucene releases? I can patch lucene, but I will need to do it each time I upgrade my version, and probably I would need to run the unit tests, and this is just

Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote: Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS-word 'Save As HTML files' function? maybe you can try this library...

Re: Duplicate Hits

2005-02-01 Thread sergiu gordea
Erik Hatcher wrote: On Feb 1, 2005, at 10:51 AM, Jerry Jalenak wrote: OK - but I'm dealing with indexing between 1.5 and 2 million documents, so I really don't want to 'batch' them up if I can avoid it. And I also don't think I can keep an IndexRead open to the index at the same time I have an

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Hi Karl, I already submitted a piece of code that removes the html tags. Search for my previous answer in this thread. Best, Sergiu Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of sourcecode (which is preferably very short and simple

Re: *term

2005-02-02 Thread sergiu gordea
Tim Lebedkov (UPK) wrote: Hi, is there a way to make QueryParser accept *term? yes, if you apply a patch to the lucene sources. Search for *term search in lucene archive. Best, Sergiu thank you --Tim

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Karl Koch wrote: Hi, yes, but the library you are using is quite big. I was thinking that a 5kB code could actually do that. That sourceforge project is doing much more than that but I do not need it. you need just the htmlparser.jar 200k. ... you know ... the functionality is strongly

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Karl Koch wrote: I am in control of the html, which means it is well-formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very-short solutions for that? if you are using only correctly formatted HTML pages and you are in control of
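
For well-formed HTML that the author controls, a very short tag stripper can be sketched with a regular expression. The reply above is truncated, so this is not necessarily the code that was posted; the class name `SimpleTagStripper` is hypothetical, and `java.util.regex` needs Java 1.4 or later, which matters for the Java 1.1 constraint raised later in the thread.

```java
import java.util.regex.Pattern;

public class SimpleTagStripper {

    // Matches anything that looks like a tag: '<', then any characters except '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern SPACES = Pattern.compile("\\s+");

    // Replaces every tag with a space and collapses the remaining whitespace.
    public static String strip(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return SPACES.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Lucene</h1><p>Indexing <b>HTML</b> pages</p></body></html>";
        System.out.println(strip(html)); // prints Lucene Indexing HTML pages
    }
}
```

The sketch does not decode entities such as `&amp;` and keeps the text inside script or style elements, so it only suits simple, trusted input like the XML-derived pages described above.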

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Kauler, Leto S wrote: Another very cheap but robust solution, in case you use Linux, is to make lynx parse your pages: lynx -dump page.html > page.txt. This will strip out all html and script, style, csimport tags. And you will have a .txt file ready for indexing. Best, Sergiu We index the

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote: Hello Sergiu, thank you for your help so far. I appreciate it. I am working with Java 1.1 which does not include regular expressions. Why are you using Java 1.1? Are you so limited in resources? What operating system do you use? I assume that you just need to index the html

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
-Strip-1.04/Strip.pm Otis --- sergiu gordea [EMAIL PROTECTED] wrote: Karl Koch wrote: I am in control of the html, which means it is well-formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote: I apologise in advance, if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing is something I cannot use since I work with Java 1.1 on

Re: Starts With x and Ends With x Queries

2005-02-06 Thread sergiu gordea
Hi Erick, In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?. I don't read that as saying you cannot use an initial wildcard character, but rather as if you use a leading wildcard character you risk
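
Why a leading wildcard risks being slow can be shown with a sorted set of terms standing in for the index's term dictionary. This is a conceptual sketch only: the class name `WildcardCostDemo` and the sample terms are hypothetical, and a `TreeSet` is used here as a simplified model, not as a description of Lucene's internal data structures.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class WildcardCostDemo {

    // term* : jump straight to the first term >= prefix and stop at the first non-match.
    public static List<String> withPrefix(TreeSet<String> terms, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix)) {
                break; // sorted order guarantees no later term can match
            }
            matches.add(term);
        }
        return matches;
    }

    // *term : there is no prefix to seek to, so every term has to be examined.
    public static List<String> withSuffix(TreeSet<String> terms, String suffix) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (term.endsWith(suffix)) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "index", "indexer", "indexing", "reindex", "search", "searcher"));
        System.out.println(withPrefix(terms, "index")); // prints [index, indexer, indexing]
        System.out.println(withSuffix(terms, "index")); // prints [index, reindex]
    }
}
```

With a fixed prefix the scan touches only the matching range of terms, while a leading wildcard forces a pass over the whole dictionary, which is the risk the javadoc wording warns about.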

Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Hi Erik, I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery does not support leading wildcard characters. Erik From what I

Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
to index our pages. Best, Sergiu -Original Message- From: sergiu gordea [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 08, 2005 10:38 AM To: Lucene Users List Subject: Re: Starts With x and Ends With x Queries From what I was reading in the mailing list there are more lucene users

Re: Starts With x and Ends With x Queries

2005-02-08 Thread sergiu gordea
Erik Hatcher wrote: On Feb 8, 2005, at 10:37 AM, sergiu gordea wrote: Hi Erik, I'm not changing any functionality. WildcardQuery will still support leading wildcard characters, QueryParser will still disallow them. All I'm going to change is the javadoc that makes it sound like WildcardQuery

Re: HELP! JIT error when searching... Lucene 1.3 on Java 1.1

2005-02-08 Thread sergiu gordea
Karl Koch wrote: When I switch to Java 1.2, I can also not run it. Also I cannot index anything. I have no idea why... Can somebody help me? I think you are a pioneer in this domain :) . I'm not very familiar with the lucene source code, but I think it uses the advantages of java 1.3 and 1.4.

Re: Search Performance

2005-02-19 Thread sergiu gordea
Michael Celona wrote: My index is changing in real time constantly... in this case I guess this will not work for me any suggestions... using a singleton pattern for your index searcher makes sense anyway ... I don't think that you change the index after each search. the computing
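
The advice to share one searcher, and to reopen it only when the index has actually changed, might be sketched as follows. Everything here is an assumption made for illustration: the generic holder `SharedSearcher`, the `Supplier` used to open a searcher, and the caller-supplied version number are hypothetical and do not correspond to a specific Lucene API.

```java
import java.util.function.Supplier;

public class SharedSearcher<S> {

    private final Supplier<S> opener;  // hypothetical factory that opens a new searcher
    private S current;                 // the instance shared by all callers
    private long openedVersion = -1;   // index version the current instance was opened for

    public SharedSearcher(Supplier<S> opener) {
        this.opener = opener;
    }

    // Returns the shared searcher, opening a new one only if the index version changed.
    // A real implementation would also close the replaced instance once it is unused.
    public synchronized S get(long indexVersion) {
        if (current == null || indexVersion != openedVersion) {
            current = opener.get();
            openedVersion = indexVersion;
        }
        return current;
    }

    public static void main(String[] args) {
        int[] opens = {0};
        SharedSearcher<String> shared =
                new SharedSearcher<>(() -> "searcher#" + (++opens[0]));
        System.out.println(shared.get(1)); // prints searcher#1
        System.out.println(shared.get(1)); // prints searcher#1 (reused)
        System.out.println(shared.get(2)); // prints searcher#2 (index changed)
    }
}
```

Even with an index that changes constantly, many searches usually happen between two changes, so the cost of opening a searcher is paid once per change and not once per query.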