Lock handling

2004-08-25 Thread Claes Holmerson
Hello,
I am interested to hear how people handle locked indexes, for example 
when catching an IOException like below.

java.io.IOException: Lock obtain timed out:
Lock@/tmp/lucene-0b978f2c0aa12e8dcdbd5b0df491bfc4-write.lock
   at org.apache.lucene.store.Lock.obtain(Lock.java:58)
   at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:223)
   at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:213)
As far as I can tell, there is no good way to tell whether the lock is 
only temporary (working as it should), or if it was created by a process 
that later died, and therefore can not remove it. How can I detect the 
latter case, and how should I best handle it?

Thanks,
Claes
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


what is wrong with query

2004-08-25 Thread Alex Kiselevski

Hi, please tell me what is wrong with this query:
author:( +name AND "full name"~) AND book:( +university)


Alex Kiselevsky
Speech Technology        Tel: 972-9-776-43-46
RD, Amdocs - Israel      Mobile: 972-53-63 50 38
mailto:[EMAIL PROTECTED]




The information contained in this message is proprietary of Amdocs,
protected from disclosure, and may be privileged.
The information is intended to be conveyed only to the designated recipient(s)
of the message. If the reader of this message is not the intended recipient,
you are hereby notified that any dissemination, use, distribution or copying of
this communication is strictly prohibited and may be unlawful.
If you have received this communication in error, please notify us immediately
by replying to the message and deleting it from your computer.
Thank you.

Re: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
You'll have to give us more information than that...

What is the problem you are seeing? I'll assume that you get no results.

Tell us of the structure of your documents and how you index every field.

Concerning your syntax, if you are using the distributed query parser, you
don't need the + before name, nor the + before university as they will be
added by the parser.

sv

On Wed, 25 Aug 2004, Alex Kiselevski wrote:


 Hi, pls,
 Tell me what is wrong with query:
 author:( +name AND full name~) AND book:( +university)





RE: what is wrong with query

2004-08-25 Thread Alex Kiselevski

I use QueryParser
And I got an exception :
org.apache.lucene.queryParser.ParseException: Encountered ~ at line 1, column 44.
Was expecting one of:
    AND ...
    OR ...
    NOT ...
    + ...
    - ...
    ( ...
    ) ...
    ^ ...
    QUOTED ...
    TERM ...
    SLOP ...
    PREFIXTERM ...
    WILDTERM ...
    [ ...
    { ...
    NUMBER ...

    at org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1045)
    at org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:925)
    at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:562)
    at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:500)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:108)
    at com.stp.corr.cv.search.CVSearcher.getMatchedResults(CVSearcher.java:89)
    at com.stp.test.CVTest.main(CVTest.java:223)




RE: what is wrong with query

2004-08-25 Thread Stephane James Vaucher
From: http://jakarta.apache.org/lucene/docs/queryparsersyntax.html

Fuzzy Searches

Lucene supports fuzzy searches based on the Levenshtein Distance, or
Edit Distance algorithm. To do a fuzzy search use the tilde, ~, symbol
at the end of a Single word Term.

I haven't used fuzzy searches, but it seems to indicate that it can only
be used with single word terms. The query parser might have been written
to support that (the output indicates that as well).

HTH,
sv
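
For readers unfamiliar with the edit distance mentioned in the quoted documentation, here is a minimal Java sketch of the textbook Levenshtein computation. It is not Lucene's own implementation, and the names EditDistance and levenshtein are made up for illustration. It compares two single terms character by character, which may help explain why the tilde applies to one word at a time.

```java
/** Sketch of the classic dynamic-programming edit distance (not Lucene's own code). */
final class EditDistance {

    /** Number of single-character insertions, deletions or substitutions turning a into b. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a to the first j chars of b
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance from the first i chars of a to the empty string
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A fuzzy term query then amounts to accepting indexed terms whose distance to the query term is small relative to the term length; the exact threshold Lucene applies is not shown here.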




Re: Lock handling

2004-08-25 Thread Otis Gospodnetic
Hello,

If you use Lucene incorrectly (e.g. 2 IndexWriters writing to the same
index), you will see this error.  Lucene has no way of telling whether
the lock file was left over from a previous process, or whether it's a
valid lock file because another process is currently indexing documents
or some such.
You could try adding some logic to your app, though.  For instance, you
can look at the lock file's timestamp and, if it is clearly too old to
belong to a running process, use the IndexReader.unlock(...) method to
forcibly unlock the index.

Otis




Re: worddoucments search

2004-08-25 Thread Santosh
I have gone through textmining.org and am able to extract the text as a
string, but how can I get it into Lucene's Document format?
- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 11:54 PM
Subject: Re: worddoucments search


 As I just answered in a separate email to Ryan - we used the textmining.org
 library, too, as an example of something that is easier to use than POI.
 It's been a while since I wrote that chapter, so it slipped my mind when I
 replied.  Yes, use textmining.org first, you'll be able to include it in
 your code in 2 minutes.  Good stuff.

 Otis








Re: Lucene Search Applet

2004-08-25 Thread Simon mcIlwaine
Hi Jon,

Where do I go to get the attached files?

Many Thanks

Simon

- Original Message - 
From: Jon Schuster [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 6:25 PM
Subject: RE: Lucene Search Applet


 Hi all,

 The changes I made to get past the System.getProperty issues are essentially
 the same in the three files org.apache.lucene.index.IndexWriter,
 org.apache.lucene.store.FSDirectory, and
 org.apache.lucene.search.BooleanQuery.

 Change the static initializations from a form like this:

     public static long WRITE_LOCK_TIMEOUT =
         Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
                                             "1000"));

 to a separate declaration and static initializer block like this:

     public static long WRITE_LOCK_TIMEOUT;
     static
     {
         try
         {
             // May throw a SecurityException in an unsigned applet.
             WRITE_LOCK_TIMEOUT =
                 Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
                                                     "1000"));
         }
         catch ( Exception e )
         {
             // Fall back to the default when the property cannot be read.
             WRITE_LOCK_TIMEOUT = 1000;
         }
     }

 As before, the variables are initialized when the class is loaded, but if
 the System.getProperty fails, the variable still gets initialized to its
 default value in the catch block.

 You can use a separate static block for each variable, or put them all into
 a single static block. You could also add a setter for each variable if you
 want the ability to set the value separately from the class init.

 In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are
 marked final, which I had to remove to do the initialization as described.

 I've also attached the three modified files if you want to just copy and
 paste.

 --Jon

 -Original Message-
 From: Simon mcIlwaine [mailto:[EMAIL PROTECTED]
 Sent: Monday, August 23, 2004 7:37 AM
 To: Lucene Users List
 Subject: Re: Lucene Search Applet

 Hi,

 Just used the RODirectory and I'm now getting the following error:
 java.security.AccessControlException: access denied
 (java.util.PropertyPermission user.dir read). I reckon this is what Jon
 was on about with System.getProperty() within certain files, because I'm
 using an applet. Is this correct, and if so can someone show me one of the
 hacked files so that I know what I need to modify?

 Many Thanks

 Simon
 .
 - Original Message -
 From: Simon mcIlwaine [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Monday, August 23, 2004 3:12 PM
 Subject: Re: Lucene Search Applet

  Hi Stephane,
 
  A bit of a stupid question, but how do you mean set the system property
  disableLuceneLocks=true? Can I do it with a call from the FSDirectory API
  or do I have to actually hack the code? Also, if I do use RODirectory, how
  do I go about using it? Do I have to update the Lucene JAR archive file
  with the RODirectory class included? I tried using it and it's not
  recognising the class.
 
  Many Thanks
 
  Simon
 
  - Original Message -
  From: Stephane James Vaucher [EMAIL PROTECTED]
  To: Lucene Users List [EMAIL PROTECTED]
  Sent: Monday, August 23, 2004 2:22 PM
  Subject: Re: Lucene Search Applet
 
 
   Hi Simon,
  
   Does this work? From FSDirectory api:
  
   If the system property 'disableLuceneLocks' has the String value of
   true, lock creation will be disabled.
  
   Otherwise, I think there was a Read-Only Directory hack:
  
  
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html
  
   HTH,
   sv
  
   On Mon, 23 Aug 2004, Simon mcIlwaine wrote:
  
Thanks Jon, that works by putting the jar file in the archive attribute.
Now I'm getting the disablelock error because of the unsigned applet. Do I
just comment out the code anywhere System.getProperty() appears in the
files that you specified and then update the JAR archive? Is it possible
you could show me one of the hacked files so that I know what I'm
modifying? Does anyone else know if there is another way of doing this
without having to hack the source code?
   
Many thanks.
   
Simon
   
- Original Message -
From: Jon Schuster [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, August 21, 2004 2:08 AM
Subject: Re: Lucene Search Applet
   
   
 I have Lucene working in an applet and I've seen this problem only when
 the jar file really was not available (typo in the jar name), which is
 what you'd expect. It's possible that the classpath for your
 application is not the same as the classpath for the applet; perhaps
 they're using different VMs or JREs from different locations.

 Try referencing the Lucene jar file in the archive attribute of the
 applet tag.

 Also, to get Lucene to work from an unsigned applet, I had to modify a
 few classes that call System.getProperty(), because the properties that
 were being requested were disallowed for applets. I think the classes
 were IndexWriter, FSDirectory, and BooleanQuery.

 --Jon


 On 

Re: worddoucments search

2004-08-25 Thread Otis Gospodnetic
That part you have to do yourself.  It is easy: create a new
Document, create an appropriate Field, give it a name and the string
value you got from the textmining.org library, add the Field to your
Document, and then add the Document to the index with IndexWriter.

Look at one of the articles about Lucene to get started.  I wrote one
called something like "Introduction to Text Indexing with Lucene".  You
probably want to read that one to get going.

Otis




How not to show results with the same score?

2004-08-25 Thread B. Grimm [Eastbeam GmbH]
hi there,
i browsed through the list and tried several different searches, but i did 
not find what i'm looking for.

i have an index which is generated by a bot collecting websites. there 
are pages like www.domain.de/article/1 and www.domain.de/article/1?page=1.
these different urls have the same content, and when you search for a 
matching word, both are returned, which is correct.

they have exactly the same score because of their content and so on, so 
i would like to know if it's possible to "group by" (as in mysql) 
the returned score, so that only the first match is collected into 
Hits and all following matches with the same score are ignored.

it would be great if anyone has an idea how to do that.
thanks and have a nice day.
bastian
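
One way to read the request above is as a post-processing step over the ranked results. The Java sketch below shows that idea only; Hit is a hypothetical stand-in type, not a Lucene class, and the method assumes the results arrive sorted by descending score.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: drop every hit whose score equals that of the hit just before it. */
final class ScoreDedup {

    /** Minimal stand-in for a search hit; hypothetical, not part of Lucene. */
    static final class Hit {
        final String url;
        final float score;

        Hit(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    /**
     * Keeps the first hit of every run of equal scores.
     * Assumes the input is already sorted by descending score.
     */
    static List<Hit> firstPerScore(List<Hit> sortedHits) {
        List<Hit> kept = new ArrayList<Hit>();
        Float lastScore = null; // null until the first hit has been seen
        for (Hit h : sortedHits) {
            if (lastScore == null || Float.compare(lastScore.floatValue(), h.score) != 0) {
                kept.add(h);
                lastScore = Float.valueOf(h.score);
            }
        }
        return kept;
    }
}
```

Equal scores are only a heuristic: two unrelated pages can score the same for a given query, so comparing a normalized URL or a hash of the page content at indexing time is probably a safer way to suppress duplicates.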


Hebrew Analyzer

2004-08-25 Thread Alex Kiselevski

Hi, has anybody heard of a Hebrew Analyzer?

Alex Kiselevsky
Speech Technology        Tel: 972-9-776-43-46
RD, Amdocs - Israel      Mobile: 972-53-63 50 38
mailto:[EMAIL PROTECTED]





Re: what is wrong with query

2004-08-25 Thread Erik Hatcher
That is correct... fuzzy searches are only on a per-term basis.
If what you meant, though, was a phrase query ("full" near "name"), you
have to add an explicit slop factor, like "full name"~5

Erik

Re: worddoucments search

2004-08-25 Thread Chandan Tamrakar
Santosh,
please read the Lucene API documentation.

Once you can get a string from the Word document using the textmining APIs,
try writing it to a temporary file and indexing that.

If you are able to index PDFs and normal files, what trouble will you face
indexing a string extracted from Word documents? Please also read and search
the previous postings; they should help you understand more about Lucene.


- Original Message - 
From: Karthik N S [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, August 25, 2004 4:21 PM
Subject: RE: worddoucments search


 Hi Santosh,

 Please: if you have downloaded the Lucene (zip) bundle, first try to read
 docs/index.html, which is in the bundle. If you are still in trouble, then
 approach the forum for help. [Unnecessarily asking silly questions will be
 ignored.]


 Karthik






Re: Lock handling

2004-08-25 Thread Otis Gospodnetic
My suggestion was referring to a timestamp that could be obtained via
java.io.File, not something provided by Lucene.

Otis
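
A minimal sketch of the java.io.File idea, for illustration only. The class name StaleLockCheck and the ten-minute threshold are assumptions, not anything Lucene provides, and the age of a lock file is a heuristic: it cannot prove that the owning process has died, so the threshold should comfortably exceed your longest indexing run.

```java
import java.io.File;

/** Sketch: decide whether a lock file looks abandoned, judging only by its age. */
final class StaleLockCheck {

    /**
     * Returns true if the lock file exists and was last modified more than
     * maxAgeMillis before nowMillis. A missing file is never reported as stale.
     */
    static boolean isStale(File lockFile, long maxAgeMillis, long nowMillis) {
        if (!lockFile.exists()) {
            return false; // nothing to clean up
        }
        long age = nowMillis - lockFile.lastModified();
        return age > maxAgeMillis;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java StaleLockCheck <path-to-write.lock>");
            return;
        }
        // Pass the path printed in the "Lock obtain timed out" message.
        File lock = new File(args[0]);
        long tenMinutes = 10L * 60L * 1000L; // assumed threshold, tune for your jobs
        if (isStale(lock, tenMinutes, System.currentTimeMillis())) {
            System.out.println("Lock looks stale; a forced unlock may be appropriate.");
        } else {
            System.out.println("Lock is recent or absent; retry later.");
        }
    }
}
```

If the check reports a stale lock, the forced unlock itself would be the IndexReader.unlock(...) call mentioned earlier in the thread; it is left out here so the sketch needs nothing beyond the JDK.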

--- Claes Holmerson [EMAIL PROTECTED] wrote:

 Yes, looking at the time of the lock was an idea I had but I could not
 find anything like a time stamp. Am I missing something obvious here?
 
 Claes
 
 
 -- 
 Claes Holmerson
 Polopoly - Cultivating the information garden
 Kungsgatan 88, SE-112 27 Stockholm, SWEDEN
 Direct: +46 8 506 782 59
 Mobile: +46 704 47 82 59
 Fax:  +46 8 506 782 51
 [EMAIL PROTECTED], http://www.polopoly.com
 
 
 



lucene 1.4 in maven repository

2004-08-25 Thread Zilverline info
Hi,
Can anyone tell me why there is no lucene 1.4 jar in the maven 
repository @ http://www.ibiblio.org/maven/lucene/jars/ ? Who makes them 
available? It would be very convenient to be able to get the latest 
version from there (or anywhere else)

regards,
 Michael Franken


Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
I've used Lucene for a long time, but only in the most basic way. I 
have a custom analyzer and a slightly hacked query parser, but in 
general it's the basic add document/remove document/query documents 
cycle.

In my system, I'm indexing a store of external documents, maintaining 
an index for full-text querying. However, I might be turned off when 
documents are added, and then when I'm restarted, I'm going to need to 
determine the timestamp of the last document added to the index so that 
I can pick up where I left off.

There are three approaches to doing this, two using Lucene. I don't 
know how I would do the two Lucene approaches, or even if they're 
possible.

1. Just keep a file in parallel with the index, reading and writing the 
timestamp of the last indexed document in it. I know how to do this, 
but I don't like the idea of keeping a separate file.

2. Drop a timestamp onto each document as it's indexed. I've attached 
timestamp fields to documents in the past so that I could do range 
queries on them. However, I don't know how to do a query like "the 
document with the latest timestamp", or even if that's possible.

3. Create a dummy document (with some unique field identifier so you 
could quickly query for it) with a field "last timestamp". This is a 
"global value storage" approach, as you could just store any field with 
any value on it. But I'd be updating this timestamp field a lot, which 
means that every time I updated the index I'd have to remove this 
special document and reindex it. Is there any way to update the value 
of a field in a document directly in the index without removing and 
adding it again to the index? The field I'd want to update would just 
be stored, not indexed or tokenized.

Thanks for your help in guiding my exploration into the capabilities of 
Lucene.

Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Claes Holmerson
Avi Drissman wrote:
I've used Lucene for a long time, but only in the most basic way. I 
have a custom analyzer and a slightly hacked query parser, but in 
general it's the basic add document/remove document/query documents 
cycle.

In my system, I'm indexing a store of external documents, maintaining 
an index for full-text querying. However, I might be turned off when 
documents are added, and then when I'm restarted, I'm going to need to 
determine the timestamp of the last document added to the index so 
that I can pick up where I left off.

There are three approaches to doing this, two using Lucene. I don't 
know how I would do the two Lucene approaches, or even if they're 
possible.

1. Just keep a file in parallel with the index, reading and writing 
the timestamp of the last indexed document in it. I know how to do 
this, but I don't like the idea of keeping a separate file. 
This is similar to the way I chose (I used a property file for this, and 
stored certain data within it, in the index directory). I didn't like 
the idea at first either, but later I thought - why not? It is the 
simplest way. As long as the file name is not used by Lucene, I thought 
it should be safe.
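
A minimal sketch of the property-file approach described above, assuming a file that lives next to the index and holds a single key. The class name IndexState and the key "lastIndexed" are made up for illustration; they are not part of Lucene, and the file name is whatever you choose as long as Lucene does not use it.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: keeps the last-indexed timestamp in a properties file. */
final class IndexState {
    private static final String KEY = "lastIndexed"; // assumed key name

    /** Returns the stored timestamp, or fallback if the file, key, or number is missing. */
    static long load(File stateFile, long fallback) {
        if (!stateFile.exists()) {
            return fallback;
        }
        Properties props = new Properties();
        try (InputStream in = new FileInputStream(stateFile)) {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(KEY);
        if (value == null) {
            return fallback;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    /** Overwrites the file with the given timestamp. */
    static void save(File stateFile, long timestamp) {
        Properties props = new Properties();
        props.setProperty(KEY, Long.toString(timestamp));
        try (OutputStream out = new FileOutputStream(stateFile)) {
            props.store(out, "indexer state");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

On restart you would call IndexState.load(...) once and resume from the returned value; after each successful batch you would call IndexState.save(...). Reading and writing cost the same no matter how large the index is.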

Claes


Re: Lucene Search Applet

2004-08-25 Thread Simon mcIlwaine
Hi Jon,

I modified the three files exactly the way you said, using a separate
declaration and static initializer block, but for IndexWriter I had to change
4 of the variables because they were final. Then I updated the Lucene JAR
file with the three files in the appropriate directory. But I'm still
getting the error: java.security.AccessControlException: access denied
(java.util.PropertyPermission user.dir read). What am I doing wrong? I was
unable to download the files you attached to your last mail. Is it
possible you could send them to my work address: [EMAIL PROTECTED]

Many Thanks

Simon


- Original Message - 
From: Jon Schuster [EMAIL PROTECTED]
To: 'Lucene Users List' [EMAIL PROTECTED]
Sent: Monday, August 23, 2004 6:25 PM
Subject: RE: Lucene Search Applet


 Hi all,

 The changes I made to get past the System.getProperty issues are
 essentially the same in the three files
 org.apache.lucene.index.IndexWriter, org.apache.lucene.store.FSDirectory,
 and org.apache.lucene.search.BooleanQuery.

 Change the static initializations from a form like this:

   public static long WRITE_LOCK_TIMEOUT =
     Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
       "1000"));

 to a separate declaration and static initializer block like this:

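 public static long WRITE_LOCK_TIMEOUT;
 static
 {
   try
   {
     // May throw a SecurityException inside an unsigned applet.
     WRITE_LOCK_TIMEOUT =
       Integer.parseInt(System.getProperty("org.apache.lucene.writeLockTimeout",
         "1000"));
   }
   catch ( Exception e )
   {
     // Fall back to the default when the property cannot be read.
     WRITE_LOCK_TIMEOUT = 1000;
   }
 }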

 As before, the variables are initialized when the class is loaded, but if
 the System.getProperty fails, the variable still gets initialized to its
 default value in the catch block.

 You can use a separate static block for each variable, or put them all
 into a single static block. You could also add a setter for each variable
 if you want the ability to set the value separately from the class init.
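
 The same pattern can be folded into one helper so each variable does not need its own try/catch. This is only a sketch: LockSettings, readLong and the getter/setter names are hypothetical and are not taken from the Lucene source; only the property name comes from the message above.

```java
/** Hypothetical holder class showing the pattern; not Lucene source. */
final class LockSettings {
    // Initialized at class load; falls back to 1000 if the property is unreadable.
    private static long writeLockTimeout =
        readLong("org.apache.lucene.writeLockTimeout", 1000L);

    /** Reads a numeric system property, returning fallback on any failure. */
    static long readLong(String name, long fallback) {
        try {
            String raw = System.getProperty(name);
            return raw == null ? fallback : Long.parseLong(raw.trim());
        } catch (SecurityException | NumberFormatException e) {
            // Sandboxed code may not read properties; bad values are ignored too.
            return fallback;
        }
    }

    static long getWriteLockTimeout() {
        return writeLockTimeout;
    }

    /** Setter, so the value can be changed separately from class init. */
    static void setWriteLockTimeout(long millis) {
        writeLockTimeout = millis;
    }
}
```

 Catching only SecurityException and NumberFormatException, rather than Exception, keeps unrelated failures visible while still covering the sandboxed-applet case.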

 In the FSDirectory class, the variables DISABLE_LOCKS and LOCK_DIR are
 marked final, which I had to remove to do the initialization as described.

 I've also attached the three modified files if you want to just copy and
 paste.

 --Jon

 -Original Message-
 From: Simon mcIlwaine [mailto:[EMAIL PROTECTED]
 Sent: Monday, August 23, 2004 7:37 AM
 To: Lucene Users List
 Subject: Re: Lucene Search Applet

 Hi,

 Just used the RODirectory and I'm now getting the following error:
 java.security.AccessControlException: access denied
 (java.util.PropertyPermission user.dir read). I'm reckoning that this is
 what Jon was on about with System.getProperty() within certain files,
 because I'm using an applet. Is this correct, and if so, can someone show
 me one of the hacked files so that I know what I need to modify?

 Many Thanks

 Simon
 .
 - Original Message -
 From: Simon mcIlwaine [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Monday, August 23, 2004 3:12 PM
 Subject: Re: Lucene Search Applet

  Hi Stephane,
 
  A bit of a stupid question, but how do you mean set the system property
  disableLuceneLocks=true? Can I do it from a call to the FSDirectory API or
  do I have to actually hack the code? Also, if I do use RODirectory, how do
  I go about using it? Do I have to update the Lucene JAR archive file with
  the RODirectory class included? I tried using it and it's not recognising
  the class.
 
  Many Thanks
 
  Simon
 
  - Original Message -
  From: Stephane James Vaucher [EMAIL PROTECTED]
  To: Lucene Users List [EMAIL PROTECTED]
  Sent: Monday, August 23, 2004 2:22 PM
  Subject: Re: Lucene Search Applet
 
 
   Hi Simon,
  
   Does this work? From FSDirectory api:
  
   If the system property 'disableLuceneLocks' has the String value of
   true, lock creation will be disabled.
  
   Otherwise, I think there was a Read-Only Directory hack:
  
  
http://www.mail-archive.com/[EMAIL PROTECTED]/msg05148.html
  
   HTH,
   sv
  
   On Mon, 23 Aug 2004, Simon mcIlwaine wrote:
  
Thanks Jon, that works by putting the jar file in the archive attribute.
Now I'm getting the disablelock error because of the unsigned applet. Do I
just comment out the code anywhere System.getProperty() appears in the
files that you specified and then update the JAR archive? Is it possible
you could show me one of the hacked files so that I know what I'm
modifying? Does anyone else know if there is another way of doing this
without having to hack the source code?
   
Many thanks.
   
Simon
   
- Original Message -
From: Jon Schuster [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Saturday, August 21, 2004 2:08 AM
Subject: Re: Lucene Search Applet
   
   
 I have Lucene working in an applet and I've seen this problem only when
 the jar file really was not available (typo in the jar name), which is
 what you'd expect. It's possible that the classpath for your
 application is not the 

Re: How to implement KWIC (KeyWord In Context) display

2004-08-25 Thread yinjin
Hi, Otis,

Thank you very much. I'll try it.

Best,
Ying
- Original Message - 
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 5:55 PM
Subject: Re: How to implement KWIC (KeyWord In Context) display


 Hello Ying,
 
 Take a look at Lucene Highlighter in Lucene Sandbox:
 http://jakarta.apache.org/lucene/docs/lucene-sandbox/
 
 Otis
 
 --- yinjin [EMAIL PROTECTED] wrote:
 
  Hello all,
  
  Does anyone know how to implement KWIC display using Lucene? I'd like
  to display the result similar to google search.
  
  Thanks for any help,
  Ying
  
  
 
 



Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Otis Gospodnetic
What if all Documents in your index contained some flag field + an 'add
date' field.  Then you could make a query such as: flag:1 and sort it
by 'add date' field, taking only the very first hit as the most
recently added Document.

Otis




Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Bernhard Messer
Avi,
I would prefer the second approach. If you already store the date and time 
when the doc was indexed, you could use the following trick to get the 
last document added to the index:

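   // needs org.apache.lucene.index.IndexReader
   // and org.apache.lucene.document.Document
   IndexReader ir = IndexReader.open("/tmp/testindex");

   // maxDoc() is one greater than the largest document number
   int maxDoc = ir.maxDoc();
   // >= 0 so that document 0 is checked as well
   while (--maxDoc >= 0) {
     // skip deleted documents; the first live one was added last
     if (!ir.isDeleted(maxDoc)) {
       Document doc = ir.document(maxDoc);
       System.out.println(doc.getField("indexDate"));
       break;
     }
   }
   ir.close();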

What do you think about this implementation? No extra properties, nothing 
to worry about. All the information is within your index.

regards
Bernhard


Introduction to Lucene [was Re: worddoucments search]

2004-08-25 Thread Steven Rowe
A collection of links to introductory level Lucene articles (including 
one in simplified Chinese and one in Turkish) is available on the 
Lucene Wiki at:

URL:http://wiki.apache.org/jakarta-lucene/IntroductionToLucene
Steve
Otis Gospodnetic wrote:
that part you have to do yourself.  It is easy, just create a new
Document, create an appropriate Field, give it a name and the string
value you got with textmining.org library, then add the Field to your
Document, and then add the Document to the index with IndexWriter.
Look at one of the articles about Lucene to get started.  I wrote one
called something like Introduction to Text Indexing with Lucene.  You
probably want to read that one to get going.
Otis
--- Santosh [EMAIL PROTECTED] wrote:
I have gone through textmining.org, and I am able to extract text in
string format. But how can I get it into Lucene document format?
- Original Message -
From: Otis Gospodnetic [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Tuesday, August 24, 2004 11:54 PM
Subject: Re: worddoucments search
As I just answered in a separate email to Ryan - we used the
textmining.org library, too, as an example of something that is easier
to use than POI.  It's been a while since I wrote that chapter, so it
slipped my mind when I replied.  Yes, use textmining.org first, you'll
be able to include it in your code in 2 minutes.  Good stuff.
Otis



Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:
If you already store the date time when the doc was index, you could 
use the following trick to get the last document added to the index:

   while (--maxDoc >= 0) {
Yes, but that's a linear search :(
On Aug 25, 2004, at 11:25 AM, Otis Gospodnetic wrote:
What if all Documents in your index contained some flag field + an 'add
date' field.  Then you could make a query such as: flag:1 and sort it
by 'add date' field, taking only the very first hit as the most
recently added Document.
That's a very clever approach. I'm currently using Lucene 1.3, so I 
hadn't thought about using the new sorting abilities. I'd need to move 
to 1.4, of course.

A question, though: how efficient is it to make a query that matches 
all documents and then sort it? I'm looking for something as small as I 
can; after all, storing the last date in a file separate from the index 
is O(1)...

Thanks!
Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Otis Gospodnetic
The more documents match, the slower the search; how long your
particular search would take I cannot tell, though - you should just
test it out and see.

I never needed to use the trick with a flag field in all documents, but
I know others do it.

Otis




Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll


 [EMAIL PROTECTED] 8/25/2004 11:50:01 AM 
On Aug 25, 2004, at 11:39 AM, Bernhard Messer wrote:

 If you already store the date time when the doc was index, you could

 use the following trick to get the last document added to the index:

while (--maxDoc >= 0) {

Yes, but that's a linear search :(


You are right, in the worst case this would be linear, but that would
require you to delete a lot of documents.  I would bet that on average,
arguably in nearly all cases, you would go through very few iterations
before finding the doc you are interested in.




Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 11:57 AM, Grant Ingersoll wrote:
You are right, in the worst case, this would be linear,
No, in _all_ cases this would be linear.
I would bet, that on average,
arguably nearly all cases, you would go through very few iterations
before finding the doc you are interested in
Then you don't understand what I'm trying to do. I'm trying to find the 
document with the biggest value for the field. That would involve 
checking the field's value in every document to ensure this.

Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Grant Ingersoll
Avi,

I may be confused; as I understand it, you said you were interested in
the last document indexed, and Bernhard's code does that.  Lucene adds
documents sequentially, so counting backwards from maxDoc() should
get you the last indexed document pretty quickly.  If all documents were
deleted, then this would go through all documents; otherwise, it is
going to find it pretty quickly.  It doesn't have to traverse
all of the documents, it just has to find the first document that is
not deleted (since we are starting at the end of the list and going
backward).
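
To make the cost argument concrete, here is a plain-Java model of the backward scan. It does not use Lucene: the boolean array is an assumed stand-in for IndexReader.isDeleted(), and the class and method names are invented for this illustration.

```java
/** Plain-Java model of the backward scan over document numbers. */
final class LastLiveDoc {
    /** Returns the highest non-deleted document number, or -1 if every slot is deleted. */
    static int find(boolean[] deleted) {
        for (int doc = deleted.length - 1; doc >= 0; doc--) {
            if (!deleted[doc]) {
                return doc;
            }
        }
        return -1;
    }

    /** Number of slots the scan inspects before it stops. */
    static int stepsNeeded(boolean[] deleted) {
        int last = find(deleted);
        return last < 0 ? deleted.length : deleted.length - last;
    }
}
```

The scan stops at the first live slot from the end, so the number of steps depends only on how many of the most recently added documents have been deleted, not on the size of the index.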




Re: Advanced timestamp usage (or global value storage)

2004-08-25 Thread Avi Drissman
On Aug 25, 2004, at 12:25 PM, Grant Ingersoll wrote:
I may be confused, as I understand it you said you were interested in
the last document indexed,
Yes, I see what you meant. I'm sorry.
That's actually an interesting option. Is getting the timestamp of the 
last document indexed a good enough solution or must I find the latest 
timestamp of all indexed documents? I'd have to ponder that for a 
while.
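
The difference between the two questions can be sketched in plain Java. The array below is an assumed stand-in for the timestamps of the documents in the order they were added; the class and method names are hypothetical, and timestamps are assumed to be non-negative.

```java
/** Contrasts "timestamp of the last document added" with "largest timestamp". */
final class LatestTimestamp {
    /** Timestamp of the most recently added entry, or -1 if there are none. */
    static long ofLastAdded(long[] timestampsInIndexOrder) {
        int n = timestampsInIndexOrder.length;
        return n == 0 ? -1L : timestampsInIndexOrder[n - 1];
    }

    /** Largest timestamp overall; this has to look at every entry. */
    static long maximum(long[] timestampsInIndexOrder) {
        long max = -1L;
        for (long t : timestampsInIndexOrder) {
            if (t > max) {
                max = t;
            }
        }
        return max;
    }
}
```

The two results agree whenever documents are indexed in timestamp order. If documents can be indexed out of order, only the full scan (or a sorted query) gives the latest timestamp.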

Avi
--
Avi 'rlwimi' Drissman
[EMAIL PROTECTED]
Argh! This darn mail server is trunca


Re: How not to show results with the same score?

2004-08-25 Thread Paul Elschot
On Wednesday 25 August 2004 12:21, B. Grimm [Eastbeam GmbH] wrote:
 hi there,

 i browsed through the list and had some different searches but i do not
 find, what i'm looking for.

 i got an index which is generated by a bot, collecting websites. there
 are sites like www.domain.de/article/1 and www.domain.de/article/1?page=1
 these different urls have the same content and when you search for a
 matching word, both are returned, which is correct.

 they have exactly the same score because of their content and so on, so
 i would like to know if it's possible to group by (as in mysql, of course)
 the returned score, so that only the first match is collected into
 Hits and all following matches with the same score are ignored.

 it would be great if anyone has an idea how to do that.

You can implement your own HitCollector and pass it to IndexSearcher.search().
Have a look at the javadocs of the org.apache.lucene.search package;
it's quite straightforward. The PriorityQueue from the
util package is useful to collect results. For every distinct score you could
store an int[] of document numbers in there while collecting the hits.
Basically you'll end up implementing your own Hits class.
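
A rough sketch of the collecting logic, written without Lucene so it stands on its own. The collect(int doc, float score) method mirrors the shape of HitCollector.collect; in a real collector you would extend HitCollector instead. A TreeMap is swapped in here for the PriorityQueue mentioned above, purely to keep the example short, and only the first document per score is kept rather than an int[] of all of them. The class name is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

/** Keeps only the first document seen for each distinct score. */
final class DistinctScoreCollector {
    // score -> first document collected with that score, highest score first
    private final TreeMap<Float, Integer> firstDocByScore =
        new TreeMap<>(Collections.reverseOrder());

    /** Called once per matching document, like HitCollector.collect. */
    void collect(int doc, float score) {
        firstDocByScore.putIfAbsent(score, doc);
    }

    /** Up to limit document numbers, best score first, one per distinct score. */
    List<Integer> topDocs(int limit) {
        List<Integer> docs = new ArrayList<>();
        for (Integer doc : firstDocByScore.values()) {
            if (docs.size() >= limit) {
                break;
            }
            docs.add(doc);
        }
        return docs;
    }
}
```

Equal scores do not prove equal content, so two unrelated pages that happen to score the same would also be collapsed by this approach.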

For URLs that have the same content, it's better
to store multiple URLs for the same document. However, this
merging is normally done by a crawler because the same contents
means the same outgoing URLs. Crawlers also keep track
of multiple host names resolving to the same IP address.
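
One way to merge URLs with identical content before indexing is to key pages by a digest of their text. This is a sketch under assumptions, not how any particular crawler does it: UrlGrouper and its methods are hypothetical names, and an exact SHA-256 match only catches byte-identical content.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Groups URLs whose page content is byte-for-byte identical. */
final class UrlGrouper {
    private final Map<String, List<String>> urlsByDigest = new LinkedHashMap<>();

    /** Records the URL; returns true if this content has not been seen before. */
    boolean add(String url, String content) {
        String key = digest(content);
        List<String> urls = urlsByDigest.get(key);
        boolean isNew = (urls == null);
        if (isNew) {
            urls = new ArrayList<>();
            urlsByDigest.put(key, urls);
        }
        urls.add(url);
        return isNew;
    }

    /** All URLs recorded for this content, in insertion order. */
    List<String> urlsFor(String content) {
        List<String> copy = new ArrayList<>();
        List<String> urls = urlsByDigest.get(digest(content));
        if (urls != null) {
            copy.addAll(urls);
        }
        return copy;
    }

    private static String digest(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

You would index a page only when add(...) returns true, and store the list from urlsFor(...) with that one document.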

In case you need to crawl and index an intranet or more, have a look
at Nutch.

Regards,
Paul Elschot







Time to index documents

2004-08-25 Thread Hetan Shah
Hello all,
Is there a way to reduce the indexing time taken when the indexer is 
indexing about 30,000+ files? It is roughly taking around 6-7 hours to 
do this. I am using the IndexHTML class to create the index out of HTML files.

Another issue that I see is every once in a while I get the following 
output on the screen.

adding ../31/1104852.html
Parse Aborted: Encountered \ at line 7, column 1.
Was expecting one of:
ArgName ...
= ...
TagEnd ...
Any suggestions on preventing this from happening?
Thanks in advance.
-H
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
I don't think that the demo parser is meant as a production 
system component. You can look at Tidy or NekoHtml. They clean up your HTML 
and are probably optimised.

sv




Re: Time to index documents

2004-08-25 Thread Hetan Shah
Do you have any pointers for sample code for them?
Would highly appreciate it.
Thanks.
-H

Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
JGuru explanation: 
http://www.jguru.com/faq/view.jsp?EID=1074228

I have no sample code for neko, I think nutch uses it though. For tidy, 
you can look at ant in the sandbox:

http://cvs.apache.org/viewcvs.cgi/jakarta-lucene-sandbox/contributions/ant/src/main/org/apache/lucene/ant/HtmlDocument.java?rev=1.3&view=markup

HTH,
sv




Content from multiple folders in single index

2004-08-25 Thread John Greenhill
Hi,

I suspect this is an easy one but I didn't see a reference in the FAQ's
so I thought I'd ask. I have a file structure like this:

web
  - pages
  - downloads (pdf docs)
  - include

I want to index the html in pages and the pdf's in downloads, but not
the html in include, so I don't want to start my index at web. I've
modified the IndexHTML in demo to do the pdf's. 

What is the best way to do this? Thanks for your suggestions.
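
One possible shape for this, sketched in plain Java: pass the indexer a list of start directories instead of a single root, and filter by file extension. IndexableFiles and its methods are hypothetical names, the extension list is an assumption, and the result would still have to be fed to whatever code builds the Lucene documents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Collects HTML and PDF files from several start directories. */
final class IndexableFiles {
    /** Walks each root in turn; directories not listed are never visited. */
    static List<Path> collect(List<Path> roots) {
        List<Path> found = new ArrayList<>();
        for (Path root : roots) {
            try (Stream<Path> walk = Files.walk(root)) {
                found.addAll(walk.filter(Files::isRegularFile)
                                 .filter(IndexableFiles::wanted)
                                 .sorted()
                                 .collect(Collectors.toList()));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return found;
    }

    /** True for the file types this sketch assumes should be indexed. */
    static boolean wanted(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".html") || name.endsWith(".htm") || name.endsWith(".pdf");
    }
}
```

With the layout above, the roots would be web/pages and web/downloads; web/include is skipped simply because it is not in the list.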

John
 


RE: Time to index documents

2004-08-25 Thread Karthik N S
Hi Hetan


   That's the major problem: the non-standardized tags in the HTML documents
   you are indexing, resulting in lag in the indexing process.


   If you can, tweak the HTMLParser.jj file under '/demo/html' within
   lucene.zip
   [you have to have some knowledge of JavaCC for this].



Karthik




RE: Time to index documents

2004-08-25 Thread Stephane James Vaucher
Hetan,

If you are using a corpus with multiple editors, I suggest that you 
use a cleaner like tidy as there might be weird stuff appearing in the 
html.

sv
