Re: Backup strategies

2004-11-16 Thread Doug Cutting
Christoph Kiehl wrote: I'm curious about your strategy to backup indexes based on FSDirectory. If I do a file based copy I suspect I will get corrupted data because of concurrent write access. My current favorite is to create an empty index and use IndexWriter.addIndexes() to copy the current

Re: Parsing .ppt

2004-11-16 Thread Bernhard Messer
Hi, I tested the implementation. It seems to work with basic PowerPoint slides. The problem I have is that it doesn't extract special characters such as German umlauts. Has anybody already addressed this problem? Thanks, Bernhard Magnus Johansson wrote: There's some code using POI at

Re: Backup strategies

2004-11-16 Thread Nader Henein
We've recently implemented something similar, with the backup process creating a file (much like the lock files during indexing) that the IndexWriter recognizes (a tweak) so that it doesn't attempt to start an indexing run or a delete while the file is there. It wasn't that much work, actually. Nader Doug Cutting
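The marker-file idea described above can be sketched without any Lucene calls. This is only a minimal sketch under stated assumptions: the marker name `backup.lock`, the class `BackupMarker` and its methods are hypothetical and are not part of Lucene or of the poster's actual tweak.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: a backup job and an indexing job coordinate
// through a marker file in the index directory. Not Lucene API.
final class BackupMarker {
    // Assumed marker name; any name both processes agree on would do.
    static final String MARKER_NAME = "backup.lock";

    private BackupMarker() {}

    // Called by the backup process before it copies files.
    // Returns false if a marker already exists (another backup is running).
    static boolean begin(Path indexDir) {
        try {
            Files.createFile(indexDir.resolve(MARKER_NAME)); // fails if the file exists
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called by the backup process once the copy has finished.
    static void end(Path indexDir) {
        try {
            Files.deleteIfExists(indexDir.resolve(MARKER_NAME));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Called by the indexing code before it starts an indexing run or a delete.
    static boolean backupInProgress(Path indexDir) {
        return Files.exists(indexDir.resolve(MARKER_NAME));
    }
}
```

The indexing side would call `backupInProgress` and postpone its work while it returns true. The check and the subsequent write are not atomic in this sketch, so a real version would also need the backup to wait for any writer that is already running.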

Re: IndexSearcher Refresh

2004-11-16 Thread Otis Gospodnetic
I don't think so, you have to forget or close the old one and create a new instance. Otis --- Ravi [EMAIL PROTECTED] wrote: Is there a way to refresh the IndexSearcher object with the newly added documents to the index instead of creating a new object? Thanks in advance, Ravi.

Re: lock file paths

2004-11-16 Thread Otis Gospodnetic
Good question. I'm not looking at the API now, but I don't recall any methods that would let you know where Lucene decided to store its locks. You could peek at the source and follow its logic, though. Otis --- [EMAIL PROTECTED] wrote: Hey guys, Quick question... is there a way to get the

Whitespace Analyzer not producing expected search results

2004-11-16 Thread lee . a . carroll
Hi, We have indexed a set of web files (JSP, JS, XSLT, Java properties and HTML) using the Lucene Whitespace Analyzer. The purpose is to allow developers to find where code / functions are used and defined across a large and disparate content management repository. Hopefully to aid code

Re: document ID and performance

2004-11-16 Thread Doug Cutting
Yan Pujante wrote: I want to run a very fast search that simply returns the matching document id. Is there any way to associate the document id returned in the hit collector with the internal document ID stored in the index? Does anybody have any idea how to do that? Ideally you would want to be able

Searching and indexing from different processes (applications)

2004-11-16 Thread K Kim
Hi. I just started to play around with Lucene. I was wondering if searching and indexing can be done simultaneously from two different processes. For example, searching is serviced from a web application, while indexing is done periodically from a stand-alone application.

Re: IndexSearcher Refresh

2004-11-16 Thread Luke Shannon
It would be nice if the IndexSearcher contained a method that could return the last modified date of the index folder it was created with. This would make it easier to know when you need to create a new Searcher. - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene

how do you work with PDF

2004-11-16 Thread Miguel Angel
Hi, I need to know how you work with PDF; please describe the process. Thanks... -- Miguel Angel Angeles R. Asesoria en Conectividad y Servidores Telf. 97451277

Re: how do you work with PDF

2004-11-16 Thread Luke Shannon
www.pdfbox.org Once you get the package installed the code you can use is: Document doc = LucenePDFDocument.getDocument(file); writer.addDocument(doc); This method returns the PDF in Lucene document format. Luke - Original Message - From: Miguel Angel [EMAIL PROTECTED] To:

Re: Searching and indexing from different processes (applications)

2004-11-16 Thread Morus Walter
K Kim writes: I just started to play around with Lucene. I was wondering if searching and indexing can be done simultaneously from two different processes. For example, searching is serviced from a web application, while indexing is done periodically from a

Re: IndexSearcher Refresh

2004-11-16 Thread Otis Gospodnetic
This will help: http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/index/IndexReader.html#getCurrentVersion(org.apache.lucene.store.Directory) Otis --- Luke Shannon [EMAIL PROTECTED] wrote: It would be nice if the IndexSearcher contained a method that could return the last modified

RE: Searching and indexing from different processes (applications)

2004-11-16 Thread Cocula Remi
I have created a tool that may answer your question. It is called Lucene Server (http://luceneserver.sourceforge.net/). It is a tool for integrating Lucene into distributed environments (via RMI). A new release is under development. It will include a paginated search service using

how about this google

2004-11-16 Thread Miguel Angel
When you type amig into the Google search box and press ENTER, Google sometimes shows a suggestion such as "Perhaps you meant amigus". How can I implement this? -- Miguel Angel Angeles R. Asesoria en Conectividad y Servidores Telf. 97451277

Re: IndexSearcher Refresh

2004-11-16 Thread Luke Shannon
Yes it will. Thanks. - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, November 16, 2004 10:28 AM Subject: Re: IndexSearcher Refresh This will help:

Re: how about this google

2004-11-16 Thread Bloomfield Nutrition
You want to use n-grams on your Lucene index, then pick the highest ranking score. For a demo: http://www.searchmorph.com/kat/spell.jsp or http://www.searchmorph.com/pub/ngramspeller/NGramSpeller.java for source code. TB http://www.shopbloomfield.com On Tue, 16 Nov 2004, Miguel Angel wrote:
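To illustrate the n-gram idea above without a Lucene index, here is a minimal in-memory sketch. It is not the NGramSpeller code linked in the message: the class `NGramSuggest`, the Dice-coefficient scoring and the plain word list standing in for the index are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for an n-gram spelling index.
final class NGramSuggest {
    private NGramSuggest() {}

    // Character n-grams of a word, e.g. grams("amig", 2) gives {am, mi, ig}.
    static Set<String> grams(String word, int n) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i + n <= word.length(); i++) {
            out.add(word.substring(i, i + n));
        }
        return out;
    }

    // Dice coefficient over the two n-gram sets: 2 * |common| / (|A| + |B|).
    static double score(String a, String b, int n) {
        Set<String> ga = grams(a, n);
        Set<String> gb = grams(b, n);
        if (ga.isEmpty() || gb.isEmpty()) {
            return 0.0;
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    // Returns the highest-scoring candidate, or null if none shares an n-gram.
    static String suggest(String typed, List<String> dictionary, int n) {
        String best = null;
        double bestScore = 0.0;
        for (String candidate : dictionary) {
            double s = score(typed, candidate, n);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }
}
```

With bigrams, `suggest("amig", Arrays.asList("amigo", "lucene", "index"), 2)` picks "amigo", since it shares three of its four bigrams with the typed word. An index-backed version would presumably store the n-grams as terms and let the search score do the ranking.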

Re: Whitespace Analyzer not producing expected search results

2004-11-16 Thread Erik Hatcher
Try using a TermQuery instead of QueryParser to see if you get the results you expect. Exact case matters. Also, when troubleshooting issues with QueryParser, it is helpful to see what the actual Query returned is - try displaying its toString output. Erik On Nov 16, 2004, at 6:25

Lucene - index fields design question

2004-11-16 Thread Venkatraju
Hi, I am a new user of Lucene, so please point me to documentation/archives if these issues have been covered before. I plan to use Lucene in an application with the following (fairly standard) requirements: - Index documents that contain a title, author, date and content - It is fairly common to

RAM, FS Directory: Problem during merge

2004-11-16 Thread Ravi Rao
All, Lucene 1.4 final. I have an index that has to be updated frequently. A search may happen at any time. I implemented this by indexing into a RAMDirectory and then merging with an FSDirectory at regular intervals (or sometimes when a search is requested). This seems to work quite well.

Re: Lucene : avoiding locking (incremental indexing)

2004-11-16 Thread jeichels
I am interested in experienced people's understanding, as I have half the queue approach developed already. I am not following why you don't like the queue approach, Sergiu. From what I gathered from this board, if you do lots of updates, the opening of the IndexWriter is very

RE: Lucene - index fields design question

2004-11-16 Thread Chuck Williams
I do most of these same things and made these relevant design decisions: 1. Use a combination of query expansion to search across multiple fields and field concatenation to create document fields that combine separate object fields. I use multiple fields only when it is important to weight them

_4c.fnm missing

2004-11-16 Thread Luke Shannon
I received the error below when I was attempting to overwhelm my system with incremental update requests. What is the file it is looking for? I checked the index. It contains: _4c.del _4d.cfs deletable segments Where does _4c.fnm come from? Here is the error: Unable to create the create

Re: how do you work with PDF

2004-11-16 Thread Chas Emerick
Alternatively, you can use PDFTextStream (http://snowtide.com). It also has an easy-to-use Lucene API, with code that looks like this: Document doc = PDFDocumentFactory.buildPDFDocument(pdfFile, config); indexWriter.addDocument(doc); One of the nice advantages of this is that the resulting Lucene

Re: _4c.fnm missing

2004-11-16 Thread Otis Gospodnetic
Field names are stored in the field info file, with suffix .fnm. - see http://jakarta.apache.org/lucene/docs/fileformats.html The .fnm should be inside the .cfs file (cfs files are compound files that contain all index files described at the above URL). Maybe you can provide the code that

Re: Lucene : avoiding locking (incremental indexing)

2004-11-16 Thread Sergiu Gordea
[EMAIL PROTECTED] wrote: I am interested in experienced people's understanding, as I have half the queue approach developed already. Well, I think that experienced people developed Lucene :) They offered us the possibility to use multithreading and concurrent searching. Of course ..

Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
It consistently breaks when I run more than 10 concurrent incremental updates. I can post the code on Bugzilla (hopefully when I get to the site it will be obvious how I can post things). Luke - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL

BooleanQuery - TooManyClauses Issue

2004-11-16 Thread Joe Krause
Hey Folks, I just inherited a deployed Lucene based application that started throwing the following exception: org.apache.lucene.search.BooleanQuery$TooManyClauses at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:79) at

Re: _4c.fnm missing

2004-11-16 Thread Nader Henein
What kind of incremental updates are you doing? We update our index every 15 minutes with 100 ~ 200 documents, and we're writing to a 6 GB memory-resident index; the IndexWriter runs one instance at a time. So what kind of increments are we talking about? It takes a bit of doing to

Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
The schedule is determined by the users of the system. Basically when the user(s) change the content (adding/deleting a folder or file, modify a file's content) through a web based interface a re-index is required of the content. This could happen 20 times in the span of a few seconds or once in

Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
This is the latest error I have received: "IndexReader out of date and no longer valid for delete, undelete, or setNorm operations". I need to synchronize this process more carefully. I think this goes back to the point that during my incremental update I sometimes need to forcefully clear the lock on

Re: _4c.fnm missing

2004-11-16 Thread Luke Francl
On Tue, 2004-11-16 at 14:57, Luke Shannon wrote: This is the latest error I have received: IndexReader out of date and no longer valid for delete, undelete, or setNorm operations What you need to do is check the version number of the index to determine if you need to open a new IndexReader
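The version check suggested above can be sketched as a small holder that reopens its resource only when the version number changes. This is a hedged sketch, not Lucene code: `VersionedHolder` is a hypothetical class, the version supplier is assumed to wrap something like the `IndexReader.getCurrentVersion(Directory)` call linked earlier in this thread, and the opener is assumed to create the new reader or searcher.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical holder: hands out a cached instance and reopens it
// only when the reported version differs from the one it was opened at.
final class VersionedHolder<T> {
    private final LongSupplier currentVersion; // assumed to report the index version
    private final Supplier<T> opener;          // assumed to open a new reader/searcher
    private long openedVersion;
    private T instance;

    VersionedHolder(LongSupplier currentVersion, Supplier<T> opener) {
        this.currentVersion = currentVersion;
        this.opener = opener;
    }

    // Returns the cached instance, reopening it first if the version moved.
    synchronized T get() {
        long version = currentVersion.getAsLong();
        if (instance == null || version != openedVersion) {
            instance = opener.get();
            openedVersion = version;
        }
        return instance;
    }
}
```

Closing the stale instance is deliberately left out; in a real application the old reader could still be serving searches, so when to close it is a separate decision.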

Re: _4c.fnm missing

2004-11-16 Thread Otis Gospodnetic
'Concurrent' and 'updates' in the same sentence sounds like a possible source of the problem. You have to use a single IndexWriter and it should not overlap with an IndexReader that is doing deletes. Otis --- Luke Shannon [EMAIL PROTECTED] wrote: It consistently breaks when I run more than 10

Re: _4c.fnm missing

2004-11-16 Thread Nader Henein
That's it, you need to batch your updates. It comes down to whether you need to give your users search accuracy to the second. Take your database, put an is_dirty column on the master table of the object you're indexing, run a scheduled task every x minutes, and have your process read the objects
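The batching idea above can be sketched in memory. This is only an illustration under assumptions: `DirtyBatch` is a hypothetical class, a set of IDs stands in for the is_dirty flag in the database, and the actual re-indexing of each drained ID is left to the caller.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory stand-in for an is_dirty flag:
// changes are recorded cheaply and indexed later in one batch.
final class DirtyBatch {
    private final Set<String> dirty = new LinkedHashSet<>();

    // Called whenever a user changes an object; repeated marks collapse into one.
    synchronized void markDirty(String id) {
        dirty.add(id);
    }

    // Called by the single scheduled indexing task: returns the pending IDs
    // in first-marked order and clears the set.
    synchronized List<String> drain() {
        List<String> batch = new ArrayList<>(dirty);
        dirty.clear();
        return batch;
    }
}
```

A single-threaded `ScheduledExecutorService` using `scheduleWithFixedDelay` is one way to run the drain every x minutes, which also keeps index writes from overlapping. Twenty edits to the same object within a few seconds then cost one re-index instead of twenty.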

Re: _4c.fnm missing

2004-11-16 Thread Luke Shannon
It doesn't have to be to the second. If things take a few minutes it's ok. It looks like the first lock issue I'm hitting in my program is when I try to delete from the index for the first time. No writer has been created yet, only the reader, so I am not sure why it thinks it's locked. -

Re: BooleanQuery - TooManyClauses Issue

2004-11-16 Thread Paul Elschot
On Tuesday 16 November 2004 21:35, Joe Krause wrote: Hey Folks, I just inherited a deployed Lucene based application that started throwing the following exception: org.apache.lucene.search.BooleanQuery$TooManyClauses ... I did some research regarding this error and found out that the default

Re: BooleanQuery - TooManyClauses Issue

2004-11-16 Thread Luke Francl
On Tue, 2004-11-16 at 16:32, Paul Elschot wrote: Once you approach 1000 days, you'll get the same problem again, so you might want to use a filter for the dates. See DateFilter and the archives on MMDD. Can anyone point to a good example of how to use the DateFilter? Thanks, Luke

Re: BooleanQuery - TooManyClauses Issue

2004-11-16 Thread Edwin Tang
This is what I have been doing with DateFilter DateFilter dateFilter = new DateFilter(published, lLimitDate, System.currentTimeMillis()); TopFieldDocs docs = searcher.search(parser.parse(sSearchPhrase), dateFilter, utility.iMaxResults, new Sort(sortFields)); Ed --- Luke Francl [EMAIL

Need help with filtering

2004-11-16 Thread Edwin Tang
Hello, I have been using DateFilter to limit my search results to a certain date range. I am now asked to replace this filter with one where my search results have document IDs greater than a given document ID. This document ID is assigned during indexing and is a Keyword field. I've browsed

Re: Is opening IndexReader multiple times safe?

2004-11-16 Thread Satoshi Hasegawa
Thank you, Luke. I decided to branch (use multiple try/catch clauses) so that I know if the IndexReader is open or not. Your remark on locking was helpful for my understanding of Lucene anyway. - Original Message - From: "Luke Shannon" [EMAIL PROTECTED] To: "Lucene

Best Implementation of Next and Prev in Lucene

2004-11-16 Thread Ramon Aseniero
Hi All, What's the best implementation of displaying the Next and Prev search result in Lucene? Thanks, Ramon

Index Locking Issues Resolved...I hope

2004-11-16 Thread Luke Shannon
Hello; I think I have solved my locking issues. I just made it through the set of test cases that previously resulted in index locking errors. I just removed the method from my code that checks for an index lock and forcefully removes it after 1 minute. Hopefully it never needs to be put back in.

Re: Index Locking Issues Resolved...I hope

2004-11-16 Thread jeichels
Very cool Luke. I am not quite there yet. I am half way through implementing the queue approach, but I have hit walls that are making me sit back and figure out my strategy. I have a struts/tomcat/ojb/mysql project that can potentially have a million records and growing over time and

Re: COUNT SUBINDEX [IN MERGERINDEX]

2004-11-16 Thread Otis Gospodnetic
Once the index is merged there is only 1 index - there are no subindices. Otis --- Karthik N S [EMAIL PROTECTED] wrote: Hi Guys, Apologies. Can somebody tell me which API to use to count the number of subindexes in a merged index? Thanks in advance

Re: Need help with filtering

2004-11-16 Thread Nader Henein
Well, if the document ID is a number (even if it isn't really), you could use a range query, or just rebuild your index using that specific field as a sorted field. But if it is numeric, be aware that using an integer limits how high your numbers can get. Nader Edwin Tang wrote: Hello, I have been
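One practical detail behind the range-query suggestion: a Keyword field holds a string term, and string terms compare lexicographically, so "9" sorts after "10". A common workaround, not stated in the message and shown here only as an assumption, is to zero-pad the IDs to a fixed width at indexing time. The class `PaddedId` and the width of 10 are hypothetical.

```java
// Hypothetical helper: fixed-width, zero-padded IDs so that string order
// matches numeric order for non-negative values.
final class PaddedId {
    // Assumed width; 10 digits is enough for any non-negative int.
    static final int WIDTH = 10;

    private PaddedId() {}

    static String pad(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("negative id: " + id);
        }
        String digits = Long.toString(id);
        if (digits.length() > WIDTH) {
            throw new IllegalArgumentException("id too large: " + id);
        }
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = digits.length(); i < WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }
}
```

The same padding would have to be applied both when the field is indexed and when the lower bound of the range is built, otherwise the comparison falls back to raw string order.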

RE: COUNT SUBINDEX [IN MERGERINDEX]

2004-11-16 Thread Karthik N S
Hi guys, Apologies. So a merged index is again a single index [the sum of the subindexes...]. In that case, if one of the fields is of type 'Field.Keyword', which is unique across the subindexes [before merging], and if I want to count this unique field in a merged index [after

Re: Index Locking Issues Resolved...I hope

2004-11-16 Thread Chris Lamprecht
MySQL does offer a basic fulltext search (with MyISAM tables), but it doesn't really approach the functionality of Lucene, such as pluggable tokenizers, stemming, etc. I think MS SQL server has fulltext search as well, but I have no idea if it's any good. See