Lock handling

2004-08-25 Thread Claes Holmerson
Hello, I am interested to hear how people handle locked indexes, for example when catching an IOException like below. java.io.IOException: Lock obtain timed out: Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock at org.apache.lucene.store.Lock.obtain(Lock.java:58) at

what is wrong with query

2004-08-25 Thread Alex Kiselevski
Hi, please tell me what is wrong with this query: author:( +name AND full name~) AND book:( +university) Alex Kiselevsky

Re: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
You'll have to give us more information than that... What is the problem you are seeing? I'll assume that you get no results. Tell us of the structure of your documents and how you index every field. Concerning your syntax, if you are using the distributed query parser, you don't need the +

RE: what is wrong with query

2004-08-25 Thread Alex Kiselevski
I use QueryParser and I got an exception: org.apache.lucene.queryParser.ParseException: Encountered "~" at line 1, column 44. Was expecting one of: <AND> ... <OR> ... <NOT> ... "+" ... "-" ... "(" ... ")" ... "^" ... <QUOTED> ... <TERM> ... <SLOP> ... <PREFIXTERM> ...

RE: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
From: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html Fuzzy Searches Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance algorithm. To do a fuzzy search use the tilde, ~, symbol at the end of a Single word Term. I haven't used fuzzy searches, but it

Re: Lock handling

2004-08-25 Thread Otis Gospodnetic
Hello, If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the same index), you will see this error. Lucene has no way of telling whether the lock file was left over from a previous process, or whether it's a valid lock file because another process is currently indexing documents or

Re: worddoucments search

2004-08-25 Thread Santosh
I have gone through textmining.org and I am able to extract the text in string format, but how can I get it into Lucene document format? - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, August 24, 2004 11:54 PM Subject: Re:

Re: Lucene Search Applet

2004-08-25 Thread Simon mcIlwaine
Hi Jon, Where do I go to get the attached files? Many Thanks Simon - Original Message - From: Jon Schuster [EMAIL PROTECTED] To: 'Lucene Users List' [EMAIL PROTECTED] Sent: Monday, August 23, 2004 6:25 PM Subject: RE: Lucene Search Applet Hi all, The changes I made to get past

Re: worddoucments search

2004-08-25 Thread Otis Gospodnetic
That part you have to do yourself. It is easy: just create a new Document, create an appropriate Field, give it a name and the string value you got with the textmining.org library, then add the Field to your Document, and then add the Document to the index with IndexWriter. Look at one of the

How not to show results with the same score?

2004-08-25 Thread B. Grimm [Eastbeam GmbH]
Hi there, I browsed through the list and tried several different searches, but I could not find what I'm looking for. I have an index which is generated by a bot collecting websites. There are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1 these different URLs have the same
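Taking the subject line literally, one option is to post-process the ranked hits and keep only the first hit of each run of equal scores. The sketch below is a rough illustration under that assumption: `Hit` is a hypothetical stand-in type, not a Lucene class, and the scores are made up. Two unrelated pages can tie on score, so this may hide legitimate results.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: drop adjacent hits whose score equals the previously kept hit's score. */
final class CollapseByScore {

    /** Hypothetical stand-in for a search hit; not part of Lucene. */
    static final class Hit {
        final String url;
        final float score;

        Hit(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /** Expects hits in ranked order; keeps the first hit of each run of equal scores. */
    static List<Hit> collapse(List<Hit> rankedHits) {
        List<Hit> kept = new ArrayList<Hit>();
        for (Hit hit : rankedHits) {
            boolean sameAsPrevious = !kept.isEmpty()
                    && Float.compare(kept.get(kept.size() - 1).score, hit.score) == 0;
            if (!sameAsPrevious) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative data only; the URLs follow the pattern mentioned in the post.
        List<Hit> hits = Arrays.asList(
                new Hit("www.domain.de/article/1", 0.9f),
                new Hit("www.domain.de/article/1?page=1", 0.9f),
                new Hit("www.domain.de/article/2", 0.5f));
        for (Hit hit : collapse(hits)) {
            System.out.println(hit.url + " " + hit.score);
        }
    }
}
```

If the duplicates really are the same page under different URLs, normalising the URL at crawl time is probably more robust than filtering on score at query time.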

Hebrew Analyzer

2004-08-25 Thread Alex Kiselevski
Hi, has anybody heard of a Hebrew Analyzer? Alex Kiselevsky

Re: what is wrong with query

2004-08-25 Thread Erik Hatcher
That is correct... fuzzy searches are only on a per-term basis. If what you meant, though, was a phrase query (full near name) you have to add an explicit slop factor like "full name"~5 Erik On Aug 25, 2004, at 2:19 AM, Stephane James Vaucher wrote: From:

Re: worddoucments search

2004-08-25 Thread Chandan Tamrakar
Santosh, please read the Lucene API docs. Once you can extract a string from a Word doc using the textmining APIs, try writing it to a temp file and indexing that. If you are able to index PDFs and normal files, what trouble will you face indexing a string extracted from Word docs? Please also read

Re: Lock handling

2004-08-25 Thread Otis Gospodnetic
My suggestion was referring to a timestamp that could be obtained via java.io.File, not something provided by Lucene. Otis --- Claes Holmerson [EMAIL PROTECTED] wrote: Yes, looking at the time of the lock was an idea I had but I could not find anything like a time stamp. Am I missing
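A minimal sketch of the timestamp idea, using only `java.io.File.lastModified()`. The class name, the one-hour threshold and the fallback path are assumptions for illustration. Age alone does not prove a lock is stale, since a long-running indexer can legitimately hold an old lock, so treat the result as a hint rather than permission to delete the file.

```java
import java.io.File;

/** Sketch: decide whether a lock file looks stale, using only java.io.File. */
final class StaleLockCheck {

    /** Pure age test, so the policy can be checked without touching the disk. */
    static boolean isOlderThan(long lastModifiedMillis, long nowMillis, long maxAgeMillis) {
        // File.lastModified() returns 0L when the file does not exist; treat that as "no lock".
        if (lastModifiedMillis <= 0L) {
            return false;
        }
        return nowMillis - lastModifiedMillis > maxAgeMillis;
    }

    /** Applies the age test to a lock file on disk. */
    static boolean looksStale(File lockFile, long maxAgeMillis) {
        return isOlderThan(lockFile.lastModified(), System.currentTimeMillis(), maxAgeMillis);
    }

    public static void main(String[] args) {
        // Hypothetical default path; pass the real lock file as the first argument.
        File lock = new File(args.length > 0 ? args[0] : "/tmp/example-write.lock");
        long oneHourMillis = 60L * 60L * 1000L; // assumed threshold, tune per application
        System.out.println(lock + " looks stale: " + looksStale(lock, oneHourMillis));
    }
}
```

A sensible threshold would sit comfortably above the longest indexing run the application is expected to perform.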

lucene 1.4 in maven repository

2004-08-25 Thread Zilverline info
Hi, Can anyone tell me why there is no lucene 1.4 jar in the maven repository @ http://www.ibiblio.org/maven/lucene/jars/ ? Who makes them available? It would be very convenient to be able to get the latest version from there (or anywhere else) regards, Michael Franken

Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
I've used Lucene for a long time, but only in the most basic way. I have a custom analyzer and a slightly hacked query parser, but in general it's the basic add document/remove document/query documents cycle. In my system, I'm indexing a store of external documents, maintaining an index for

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Claes Holmerson
Avi Drissman wrote: I've used Lucene for a long time, but only in the most basic way. I have a custom analyzer and a slightly hacked query parser, but in general it's the basic add document/remove document/query documents cycle. In my system, I'm indexing a store of external documents,

Re: Lucene Search Applet

2004-08-25 Thread Simon mcIlwaine
Hi Jon, I modified the three files exactly the way you said, using separate declaration and static initializer block, but for IndexWriter I had to change 4 of the variables because they were final. Then I updated the Lucene JAR file with the three files in the appropriate directory. But I'm still

Re: How to implement KWIC (KeyWord In Context) display

2004-08-25 Thread yinjin
Hi, Otis, Thank you very much. I'll try it. Best, Ying - Original Message - From: Otis Gospodnetic [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Tuesday, August 24, 2004 5:55 PM Subject: Re: How to implement KWIC (KeyWord In Context) display Hello Ying, Take a

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Otis Gospodnetic
What if all Documents in your index contained some flag field + an 'add date' field. Then you could make a query such as: flag:1 and sort it by 'add date' field, taking only the very first hit as the most recently added Document. Otis --- Avi Drissman [EMAIL PROTECTED] wrote: I've used Lucene

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Bernhard Messer
Avi, I would prefer the second approach. If you already store the date and time when the doc was indexed, you could use the following trick to get the last document added to the index: IndexReader ir = IndexReader.open("/tmp/testindex"); int maxDoc = ir.maxDoc();

Introduction to Lucene [was Re: worddoucments search]

2004-08-25 Thread Steven Rowe
A collection of links to introductory level Lucene articles (including one in simplified Chinese and one in Turkish) is available on the Lucene Wiki at: URL:http://wiki.apache.org/jakarta-lucene/IntroductionToLucene Steve Otis Gospodnetic wrote: that part you have to do yourself. It is easy,

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote: If you already store the date time when the doc was indexed, you could use the following trick to get the last document added to the index: while (--maxDoc > 0) { Yes, but that's a linear search :( On Aug 25, 2004, at 11:25 AM, Otis

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Otis Gospodnetic
The more documents match, the slower the search; how long your particular search would take I cannot tell, though - you should just test it out and see. I never needed to use the trick with a flag field in all documents, but I know others do it. Otis --- Avi Drissman [EMAIL PROTECTED] wrote:

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll
[EMAIL PROTECTED] 8/25/2004 11:50:01 AM On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote: If you already store the date time when the doc was indexed, you could use the following trick to get the last document added to the index: while (--maxDoc > 0) { Yes, but that's a

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote: You are right, in the worst case, this would be linear, No, in _all_ cases this would be linear. I would bet, that on average, arguably nearly all cases, you would go through very few iterations before finding the doc you are interested in Then

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll
Avi, I may be confused, as I understand it you said you were interested in the last document indexed, Bernhard's code does that. Lucene adds documents sequentially, so counting backwards from the maxDoc() should get you the last indexed document pretty quickly. If all documents were deleted,
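The backward scan discussed in this thread can be modelled without Lucene on the classpath. In the sketch below, a `java.util.BitSet` of deleted document numbers is an assumed stand-in for `IndexReader.isDeleted(int)`, and the `maxDoc` argument stands in for `IndexReader.maxDoc()`; the class and method names are made up for illustration.

```java
import java.util.BitSet;

/** Sketch: find the highest-numbered document that has not been deleted. */
final class LastLiveDoc {

    /**
     * Scans downward from maxDoc - 1 and returns the first document number
     * that is not marked deleted, or -1 if every slot is deleted.
     */
    static int lastLiveDoc(int maxDoc, BitSet deleted) {
        for (int doc = maxDoc - 1; doc >= 0; doc--) {
            if (!deleted.get(doc)) {
                return doc;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        BitSet deleted = new BitSet();
        deleted.set(8); // pretend the two most recent documents were deleted
        deleted.set(9);
        System.out.println(lastLiveDoc(10, deleted)); // prints 7
    }
}
```

The number of iterations equals the number of deleted documents at the tail of the index plus one. That is linear only in the worst case where most recent documents are deleted, and a single step when the newest document is still live.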

Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 12:25 PM, Grant Ingersoll wrote: I may be confused, as I understand it you said you were interested in the last document indexed, Yes, I see what you meant. I'm sorry. That's actually an interesting option. Is getting the timestamp of the last document indexed a good enough

Re: How not to show results with the same score?

2004-08-25 Thread Paul Elschot
On Wednesday 25 August 2004 12:21, B. Grimm [Eastbeam GmbH] wrote: Hi there, I browsed through the list and tried several different searches, but I could not find what I'm looking for. I have an index which is generated by a bot collecting websites. There are sites like www.domain.de/article/1 and

Time to index documents

2004-08-25 Thread Hetan Shah
Hello all, Is there a way to reduce the indexing time when the indexer is indexing about 30,000+ files? It takes roughly 6-7 hours to do this. I am using the IndexHTML class to create the index out of HTML files. Another issue that I see is every once in a while I get the

Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
I don't think that the demo parser is meant as a production system component. You can look at Tidy or NekoHtml. They clean up your HTML and are probably optimised. sv On Wed, 25 Aug 2004, Hetan Shah wrote: Hello all, Is there a way to reduce the indexing time taken when the indexer is

Re: Time to index documents

2004-08-25 Thread Hetan Shah
Do you have any pointers to sample code for them? Would highly appreciate it. Thanks. -H Stephane James Vaucher wrote: I don't think that the demo parser is meant as a production system component. You can look at Tidy or NekoHtml. They clean up your HTML and are probably optimised. sv On Wed,

Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
JGuru explanation: http://www.jguru.com/faq/view.jsp?EID=1074228 I have no sample code for neko, I think nutch uses it though. For tidy, you can look at ant in the sandbox:

Content from multiple folders in single index

2004-08-25 Thread John Greenhill
Hi, I suspect this is an easy one but I didn't see a reference in the FAQ so I thought I'd ask. I have a file structure like this: web - pages - downloads (pdf docs) - include I want to index the HTML in pages and the PDFs in downloads, but not the HTML in include, so I don't want to
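One way to approach this is to walk the web root once, decide per file whether it belongs in the index, and then feed every selected file to a single index. The sketch below covers only the selection step with plain `java.io.File`; the indexing call itself is left out. It assumes that pages, downloads and include sit directly under the web root and that file extensions identify the content type. The class and method names are hypothetical.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Sketch: pick the files under a web root that should go into one index. */
final class IndexableFiles {

    /** Decides, from a path relative to the web root, whether a file should be indexed. */
    static boolean shouldIndex(String relativePath) {
        String path = relativePath.replace('\\', '/').toLowerCase(Locale.ROOT);
        if (path.startsWith("pages/")) {
            return path.endsWith(".html") || path.endsWith(".htm");
        }
        if (path.startsWith("downloads/")) {
            return path.endsWith(".pdf");
        }
        return false; // include/ and everything else is skipped
    }

    /** Returns every file under root that shouldIndex accepts. */
    static List<File> collect(File root) {
        List<File> selected = new ArrayList<File>();
        walk(root, "", selected);
        return selected;
    }

    private static void walk(File dir, String prefix, List<File> selected) {
        File[] children = dir.listFiles(); // null if dir is missing or not a directory
        if (children == null) {
            return;
        }
        for (File child : children) {
            String relative = prefix + child.getName();
            if (child.isDirectory()) {
                walk(child, relative + "/", selected);
            } else if (shouldIndex(relative)) {
                selected.add(child);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical default root; pass the real web root as the first argument.
        File root = new File(args.length > 0 ? args[0] : "web");
        for (File file : collect(root)) {
            System.out.println(file.getPath());
        }
    }
}
```

Keeping the decision in `shouldIndex` means the folder rules live in one place, and each accepted file can be handed to the HTML or PDF handling code according to its extension.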

RE: Time to index documents

2004-08-25 Thread Karthik N S
Hi Hetan, That's the major problem with non-standardized tags in the HTML documents you are indexing, resulting in lag in the indexing process. If you can, tweak the HTMLParser.jj file under '/demo/html' in lucene.zip [you need some knowledge of JavaCC for this].

RE: Time to index documents

2004-08-25 Thread Stephane James Vaucher
Hetan, If you are using a corpus with multiple editors, I suggest that you use a cleaner like Tidy, as there might be weird stuff appearing in the HTML. sv On Thu, 26 Aug 2004, Karthik N S wrote: Hi Hetan, That's the major problem with non-standardized tags in the HTML documents you are