Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Hi, I took a look at Andrey Grishin's Russian character problem and found 
something strange while we tried to debug it. It seems that he has 
avoided the usual querying-with-a-different-encoding-than-indexed 
problem, as he can dump out correctly encoded Russian at all 
points in his application.

Are the strings for terms treated differently than the text stored in 
text fields? The reason I ask is that his Russian words are correct in 
the stored text fields, but show up faulty in a terms() dump. If he 
had a character encoding problem in his application, the fields should 
show up faulty as well, I think. Even stranger, I use Lucene 1.2 
successfully with UTF-8, ISO-8859-1, ISO-8859-5 and ISO-8859-7. Why does 
this problem show up with Russian (Cp1251) and not the other encodings?

Strangeness number two is the theory that if a Russian word was 
skewed to, say, 0d66539qw upon indexing, and the problem was just a 
consistent encoding problem, wouldn't a query for the same word be skewed to 
0d66539qw as well, and be found anyway?
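The consistent-skew theory can be checked directly in plain Java. The sketch below (the class name and the Cyrillic word are made up for illustration) shows why the theory only holds when the platform default charset is a single-byte one; with a multi-byte default like UTF-8, the skew is lossy and indexed terms and query terms need not collapse to the same garbage:

```java
import java.nio.charset.Charset;

public class CharsetSkewDemo {
    public static void main(String[] args) {
        Charset cp1251 = Charset.forName("Cp1251");
        String russian = "\u0441\u0435\u0442\u0438"; // an arbitrary Cyrillic word
        byte[] raw = russian.getBytes(cp1251);

        // If the platform default charset is a single-byte one (e.g. ISO-8859-1),
        // the skew is one-to-one: skewed index terms would still match equally
        // skewed query terms, as the theory above predicts.
        Charset latin1 = Charset.forName("ISO-8859-1");
        String skewed = new String(raw, latin1);
        String recovered = new String(skewed.getBytes(latin1), cp1251);
        System.out.println(recovered.equals(russian)); // true

        // But with a multi-byte default (e.g. UTF-8), invalid byte sequences
        // decode to U+FFFD, so distinct words collapse onto the same garbage
        // and the skew is no longer consistent between indexing and querying.
        String lossy = new String(raw, Charset.forName("UTF-8"));
        System.out.println(lossy.contains("\uFFFD")); // true
    }
}
```

This would explain why the skewed query never finds the skewed index term when the default charset is multi-byte.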

mvh karl øie


Begin forwarded message:

From: Andrey Grishin [EMAIL PROTECTED]
Date: Thu Nov 21, 2002  15:13:33 Europe/Oslo
To: Karl Oie [EMAIL PROTECTED]
Subject: Re: How to include strange characters??

Yes, you are right - there are no Russian words in the returned terms :(((
I've just executed the following
--
IndexReader r =
    IndexReader.open("C:\\j\\jakarta-tomcat-4.1.12\\index\\ukrenergo");
TermEnum e = r.terms();
while (e.next()) {
  Term term = e.term();
  System.out.println("term : " + term.text());
}
--
and got no Russian words in the result;
there are some strange terms returned instead of Russian:
term : 0d4xvp70w
term : 0d66539qw
term : 0d67les2o
term : 0d6eqgic0
etc.

So, I think we found the problem. This is great :)), thank you...
but how to fix it?
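One common fix for this class of bug (a sketch, not taken from this thread) is to decode the bytes exactly once, at the I/O boundary, naming the charset the content was actually written in, instead of rebuilding strings via new String(s.getBytes(...)). The class name and file handling below are illustrative:

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;

public class DecodeOnce {
    // Decode the bytes exactly once, at the I/O boundary, naming the
    // charset the file was actually written in. After this the String
    // is plain Unicode and can go straight to the analyzer/IndexWriter.
    public static String readCp1251(File file) throws IOException {
        StringBuilder sb = new StringBuilder();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "Cp1251"));
        try {
            int c;
            while ((c = in.read()) != -1) sb.append((char) c);
        } finally {
            in.close();
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Round-trip check with a temp file standing in for real content.
        File f = File.createTempFile("cp1251", ".txt");
        f.deleteOnExit();
        String word = "\u0441\u0435\u0442\u0438"; // arbitrary Cyrillic text
        OutputStream out = new FileOutputStream(f);
        out.write(word.getBytes("Cp1251"));
        out.close();
        System.out.println(readCp1251(f).equals(word)); // true
    }
}
```

Once the text is decoded correctly at read time, no further charset round-trips are needed anywhere in the indexing or query path.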




- Original Message -
From: Karl Øie [EMAIL PROTECTED]
To: Andrey Grishin [EMAIL PROTECTED]
Sent: Thursday, November 21, 2002 3:56 PM
Subject: Re: How to include strange characters??


another thing to check is whether IndexReader.terms() actually
contains your term.

mvh karl oie

On Thursday, Nov 21, 2002, at 14:31 Europe/Oslo, Andrey Grishin wrote:


Karl,
I have the same problem with Lucene search within Russian content.
I tried all your advice, but Lucene still can't find anything :(
I indexed the content using the Cp1251 charset:

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD, text));

and I am searching using the same charset:
String txt = "..."; // a Russian word; the original is garbled in the archive
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new
    Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

and Lucene can't find anything.
Also, I checked for the DecodeInterceptor in my server.xml - there
isn't any.
I tried UTF-8/16 and got the same result.
If I list all the index's content by iterating an IndexReader, I can see
that my Russian content is stored in the index...
Can you please help me? Do you have any more ideas about what else can
be done here to fix this problem?

I will appreciate any help.
Thanks, Andrey.

P.S.
I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS





--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: problems with search on Russian content

2002-11-22 Thread Karl Øie
Sorry, my bad! Didn't read this informative post :-)

mvh karl øie


On Thursday, Nov 21, 2002, at 16:35 Europe/Oslo, Otis Gospodnetic wrote:


Look at the CHANGES.txt document in CVS - there is some new stuff in the
org.apache.lucene.analysis.ru package that you will want to use.
Get Lucene from the nightly build...

Otis

--- Andrey Grishin [EMAIL PROTECTED] wrote:

Hi All,
I have a problem with searching on Russian content using Lucene 1.2.

I indexed the content using the Cp1251 charset:

text = new String(text.getBytes("Cp1251"));
doc.add(Field.Text(CONTENT_FIELD, text));


and I am searching using the same charset:

String txt = "..."; // a Russian word; the original is garbled in the archive
txt = new String(txt.getBytes("Cp1251"));
PrefixQuery query = new PrefixQuery(new
    Term(PortalHTMLDocument.CONTENT_FIELD, txt));
hits = searcher.search(query);

or

Analyzer analyzer = new StandardAnalyzer();
String txt = "..."; // a Russian phrase; the original is garbled in the archive
txt = new String(txt.getBytes("Cp1251"));
Query query = QueryParser.parse(txt,
    PortalHTMLDocument.CONTENT_FIELD, analyzer);

hits = searcher.search(query);


and Lucene can't find anything.
Also, I checked for the DecodeInterceptor in my server.xml - there
isn't any.

I tried UTF-8/16 and got the same result.

Also, if I list all the index's content by iterating an IndexReader, I can
see that my Russian content is stored in the index...
Can you please help me? Do you have any more ideas about what else
can be done here to fix this problem?

I will appreciate any help.
Thanks, Andrey.

P.S.
I am using lucene 1.2, tomcat 4.1.12, jdk 1.4.1 on Win2000 AS



__
Do you Yahoo!?
Yahoo! Mail Plus – Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com





PDF parser

2002-11-22 Thread Thomas Chacko
What's the best parser available to extract text from PDF documents? Expecting a reply 
ASAP.

Thanks in advance
Thomas Chacko


Re: PDF parser

2002-11-22 Thread Borkenhagen, Michael (ofd-ko zdfin)
There are different parsers available - every parser has its own advantages
and disadvantages.
I use a combination of PDFBox http://www.pdfbox.org/ and Etymon PJ
http://www.etymon.com/pjc/, because their APIs are very simple. Both of them
parse PDF into a format of their own and provide interfaces to get at the PDF
document's contents.

Other developers on this list prefer JPedal http://www.jpedal.org/, which
parses PDF into XML and provides an XML tree with the PDF document's contents, but the 
documentation isn't very detailed.

Micha

-Original Message-
From: Thomas Chacko [mailto:[EMAIL PROTECTED]]
Sent: Friday, 22 November 2002 15:26
To: Lucene Users List
Subject: PDF parser


What's the best parser available to extract text from PDF documents?
Expecting a reply ASAP.

Thanks in advance
Thomas Chacko






How does delete work?

2002-11-22 Thread Rob Outar
Hello all,

I used the delete(Term) method, then I looked at the index files; only one
file changed, _1tx.del. I found references to the file still in some of the
index files, so my question is: how does Lucene handle deletes?

Thanks,

Rob






Re: How does delete work?

2002-11-22 Thread Scott Ganyo
It just marks the record as deleted.  The record isn't actually removed 
until the index is optimized.

Scott
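The mark-then-compact scheme Scott describes can be modeled in a few lines of plain Java. This is a toy sketch with made-up names, not Lucene's actual file format: delete() only flips a bit (the role the .del file plays), and the data survives until a later compaction rewrites the store.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model of mark-then-compact deletion. delete() only flips a bit;
// the documents themselves survive until a compaction rewrites the
// store without the marked entries. Names are illustrative, not Lucene API.
public class TombstoneDemo {
    private final List<String> docs = new ArrayList<>();
    private final BitSet deleted = new BitSet();

    public void add(String doc) { docs.add(doc); }
    public void delete(int id) { deleted.set(id); } // cheap: no data moves
    public int size() { return docs.size(); }       // still counts tombstones

    // "optimize": rewrite the store, dropping tombstoned entries
    public List<String> compact() {
        List<String> live = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++)
            if (!deleted.get(i)) live.add(docs.get(i));
        return live;
    }

    public static void main(String[] args) {
        TombstoneDemo idx = new TombstoneDemo();
        idx.add("a"); idx.add("b"); idx.add("c");
        idx.delete(1);
        System.out.println(idx.size());    // 3: data still present on "disk"
        System.out.println(idx.compact()); // [a, c]: space reclaimed by rewrite
    }
}
```

This is why deleting only produced a new .del file: the mark is cheap, and the actual reclamation is deferred to a segment rewrite.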

Rob Outar wrote:

Hello all,

I used the delete(Term) method, then I looked at the index files; only one file changed, _1tx.del. I found references to the file still in some of the index files, so my question is how does Lucene handle deletes?

Thanks,

Rob




--
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?





Updating documents

2002-11-22 Thread Rob Outar
I have something odd going on. I have code that updates documents in the
index, so I have to delete each one and then re-add it. When I re-add the document,
I immediately do a search on the newly added field, which fails. However, if
I rerun the query a second time, it works?? I have the Searcher class as an
attribute of my search class; does it not see the new changes? It seems that
when it is reinitialized with the changed index, it is then able to search on
the newly added field.

Let me know if anyone has encountered this.

Thanks,

Rob







RE: Updating documents

2002-11-22 Thread Rob Outar
There is a reloading issue but I do not think lastModified is it:

static long lastModified(Directory directory)
  Returns the time the index in this directory was last modified.
static long lastModified(File directory)
  Returns the time the index in the named directory was last
modified.
static long lastModified(String directory)
  Returns the time the index in the named directory was last
modified.

Do I need to create a new instance of IndexSearcher each time I search?

Thanks,

Rob


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Friday, November 22, 2002 12:20 PM
To: Lucene Users List
Subject: Re: Updating documents


Don't you have to make use of the lastModified method (I think in
IndexSearcher) to 'reload' your instance of IndexSearcher? I'm
pulling this from some old, not very fresh memory...

Otis

--- Rob Outar [EMAIL PROTECTED] wrote:
I have something odd going on. I have code that updates documents in the
index, so I have to delete each one and then re-add it. When I re-add the
document, I immediately do a search on the newly added field, which fails.
However, if I rerun the query a second time, it works?? I have the Searcher
class as an attribute of my search class; does it not see the new changes?
It seems that when it is reinitialized with the changed index, it is then
able to search on the newly added field.

 Let me know if anyone has encountered this.

 Thanks,

 Rob










Re: Updating documents

2002-11-22 Thread Otis Gospodnetic
Btw. I have posted the code for this before, so you can find it in the
list archives.

Otis

--- Scott Ganyo [EMAIL PROTECTED] wrote:
 Not each time you search, but if you've modified the index since you 
 opened the searcher, you need to create a new searcher to get the
 changes.
 
 Scott
 
 






Re: How does delete work?

2002-11-22 Thread Otis Gospodnetic
This is via mergeFactor?

--- Doug Cutting [EMAIL PROTECTED] wrote:
The data is actually removed the next time its segment is merged.
Optimizing forces it to happen, but it will also eventually happen as
more documents are added to the index, without optimization.
 




Re: Updating documents

2002-11-22 Thread Doug Cutting
A deletion is only visible in other IndexReader instances created after 
the IndexReader where you made the deletion is closed.  So if you're 
searching using a different IndexReader, you need to re-open it after 
the deleting IndexReader is closed.  The lastModified method helps you 
to figure out when this is required.  The standard idiom is to cache the 
lastModified date returned when a reader is opened, then check it 
against the current value before each search.  When it is different, 
re-open.
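The check-and-reopen idiom can be sketched generically. In the sketch below, the version and opener suppliers stand in for IndexReader.lastModified() and for opening a new searcher; the class and method names are illustrative, not Lucene API:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Generic sketch of the idiom: remember the index version seen at open
// time, and re-open only when the current version differs. The two
// suppliers stand in for IndexReader.lastModified(dir) and for opening
// a new IndexSearcher; the names here are illustrative, not Lucene API.
public class SearcherCache<T> {
    private final LongSupplier version;
    private final Supplier<T> opener;
    private long openedVersion;
    private T current;

    public SearcherCache(LongSupplier version, Supplier<T> opener) {
        this.version = version;
        this.opener = opener;
    }

    public synchronized T current() {
        long now = version.getAsLong();
        if (current == null || now != openedVersion) {
            current = opener.get(); // old instance left to the GC, per the note below
            openedVersion = now;
        }
        return current;
    }

    public static void main(String[] args) {
        AtomicLong ver = new AtomicLong(1);
        SearcherCache<Object> cache = new SearcherCache<>(ver::get, Object::new);
        Object a = cache.current();
        System.out.println(a == cache.current()); // true: version unchanged
        ver.incrementAndGet();                    // simulate an index update
        System.out.println(a == cache.current()); // false: re-opened
    }
}
```

All searching threads call current() and get the same cached instance until the version changes, at which point the next caller triggers a re-open.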

Note: If you have many searching threads, it is most efficient for them 
to share an IndexReader.  But if one thread closes the reader while 
others are still searching it, then those searches may crash.  So, when 
re-opening the index, don't immediately close the old one. Rather, just 
let the garbage collector close its open files. The only problem with 
this approach is that, if your index changes more frequently than the 
garbage collector collects the old readers, then you can run out of file handles.

Hmm. It would probably make things simpler if an IndexReader cached its 
lastModified date when it was opened, so that applications don't have to 
do this themselves to find out whether an IndexReader is out of date...

Doug





Re: How does delete work?

2002-11-22 Thread Doug Cutting
Merging happens constantly as documents are added.  Each document is 
initially added in its own segment, and pushed onto the segment stack. 
Whenever there are mergeFactor segments on the top of the stack that are 
the same size, these are merged together into a new single segment that 
replaces them.  So, if mergeFactor is 10, and you've added 122 
documents, the stack will have five segments, as follows:
  document 121
  document 120
  documents 110-119
  documents 100-109
  documents 0-100
The next merge will happen after document 129 is added, when a new 
segment will replace the segments for document 120 through document 129 
with a new single segment.
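The merge rule above can be simulated with a stack of segment sizes. This is a toy model for intuition, not Lucene code; the class and method names are made up:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class SegmentStackDemo {
    // Simulate the merge rule: each added document pushes a 1-document
    // segment; whenever mergeFactor equal-sized segments sit on top of
    // the stack, they are merged into one segment that replaces them.
    public static Deque<Integer> addDocs(int numDocs, int mergeFactor) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (int i = 0; i < numDocs; i++) {
            stack.push(1);
            while (topRunLength(stack) >= mergeFactor) {
                int merged = 0;
                for (int j = 0; j < mergeFactor; j++) merged += stack.pop();
                stack.push(merged);
            }
        }
        return stack; // top of the stack prints first
    }

    private static int topRunLength(Deque<Integer> stack) {
        int top = stack.peek(), run = 0;
        for (int s : stack) {
            if (s == top) run++; else break;
        }
        return run;
    }

    public static void main(String[] args) {
        System.out.println(addDocs(122, 10)); // [1, 1, 10, 10, 100]
    }
}
```

With 122 documents and mergeFactor 10 this reproduces the five segments listed above. Notice also that the segment sizes read off like the digits of the document count, e.g. addDocs(85, 10) yields five 1-document and eight 10-document segments.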

It's actually a little more complicated than that, since (among other 
reasons) after documents are deleted a segment's size will no longer be 
exactly a power of the mergeFactor.

Doug





large index - slow optimize()

2002-11-22 Thread Otis Gospodnetic
Hello,

I am building an index with a few million documents, and every X documents
added to the index I call optimize() on the IndexWriter.
I have noticed that as the index grows this call takes more and more
time, even though the number of new segments that need to be merged is
the same between every optimize() call.
I suspect this is normal and not a bug, but is there no way around
it? Do you know which part takes longer and longer as the index grows?

Thanks,
Otis






Re: How does delete work?

2002-11-22 Thread Otis Gospodnetic
I see - so every mergeFactor documents they are combined into a single
new segment in the index, and only when optimize() is called do those
multiple segments get merged into a single segment.
In your example below, that would mean that optimize() was called after
document 100 was added, hence a single segment with documents 0-100.
Is this right?

Thanks,
Otis





RE: large index - slow optimize()

2002-11-22 Thread Armbrust, Daniel C.
Note - this is not a fact; this is what I think I know about how it works.

My working assumption has been that it's just a matter of disk speed: during optimize, 
the entire index is copied into new files, and at the end the old one is 
removed. So the more GB you have to copy, the longer it takes.

This is also the reason that you need double the size of your index available on the 
drive in order to perform an optimize, correct? Or does this only apply when you are 
merging indexes?


Dan







Re: has this exception been seen before

2002-11-22 Thread Chris D



I am getting this problem as well, but have not been able to pinpoint the 
cause.

A tip for those doing a complete re-index: you can save a lot of 
time by creating a new index and then merging the old files into the new 
index. One disadvantage here is that you may have to re-point your app to 
the new index. I find that the bug prevents the old index from being 
deleted on Win2K.


_
The new MSN 8: smart spam protection and 2 months FREE*  
http://join.msn.com/?page=features/junkmail





Readability score?

2002-11-22 Thread petite_abeille
Hello,

This is slightly off topic but...

Does anyone have a handy library to compute a readability score?

Something like the Flesch Reading Ease score & Co.:

http://thibs.menloschool.org/~djwong/docs/wordReadabilityformulas.html

Would you like to share? :-)

Thanks.

R.






Re: How does delete work?

2002-11-22 Thread Doug Cutting
No, in my example optimize() was never called.  The merge rule operates 
recursively.  So, after 99 documents had been added the segment stack 
contained nine indexes with ten documents and nine with one document. 
When the hundredth document was added, the nine one document segments 
were popped off the stack and merged into a single segment that was 
pushed onto the stack.  So then the top of the stack had ten segments 
each containing ten documents, i.e., mergeFactor segments of the same 
size, and these ten segments were then merged into a single segment of 
100 documents.  So adding the 100th document triggered two merges.

(One error in my previous example: the 100 document segment actually 
contains documents 0-99, not 0-100.)

A corollary of this is, when mergeFactor is 10 and no deletions have 
been made, the segments correspond to the digits in the decimal 
representation of the number of documents in the index.  So, an 85 
document index has eight segments with 10 documents and five segments 
with one document.  (This is somewhat of a simplification, as Lucene 
automatically merges single document segments before ever writing them 
to disk as an optimization.)

It is most beneficial to call IndexWriter.optimize() only when you know 
you won't be adding documents to an index for a while, but will be 
searching it a lot.  Calling optimize() periodically while indexing 
mostly just slows things down.

Doug





Question on having IndexReader and IndexWriter simultaneously

2002-11-22 Thread Herman Chen
Hi,

According to my experimentation, I am unable to create an IndexWriter
while any IndexReader/Searcher is open on the same index. Since all
search threads share one IndexReader, each time I need to create an
IndexWriter I have to wait until all searches are done so that I can close the
IndexReader. Only then am I able to create an IndexWriter. Does this
concurrency problem really exist? One problem I have now is
starvation of modification threads. Thanks.

--
Herman




Date Range - I've searched FAQs and mail list archive..... no help..... Really

2002-11-22 Thread Michael Caughey
Part of my problem seems to be that the RangeQuery object isn't acting as it should, 
per the FAQ and other mailing list entries.
I'm using Lucene 1.2.

I have a field in my index called DATE. I'd like to do a date range search on it. I 
am using strings in the format yyyyMMdd.

I have the following dates in my Index:
20021105
20021126
20021113
20021115
20021103
20021125
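Incidentally, the yyyyMMdd format is usable for range searches because lexicographic order on such strings matches chronological order, which is what term-based range matching relies on. A quick plain-Java check (the class name is illustrative):

```java
public class LexDateDemo {
    // yyyyMMdd strings compare lexicographically in the same order as the
    // dates they denote, so a term t is in range iff lower <= t <= upper
    // as plain strings - no date parsing needed at search time.
    public static int countInRange(String[] terms, String lower, String upper) {
        int hits = 0;
        for (String t : terms)
            if (lower.compareTo(t) <= 0 && t.compareTo(upper) <= 0) hits++;
        return hits;
    }

    public static void main(String[] args) {
        String[] dates = {"20021105", "20021126", "20021113",
                          "20021115", "20021103", "20021125"};
        // the range from this posting: all six dates fall inside it
        System.out.println(countInRange(dates, "20021101", "20021131")); // 6
        // a narrower range picks out only the mid-November dates
        System.out.println(countInRange(dates, "20021110", "20021120")); // 2
    }
}
```

So if the range machinery were working, all six indexed dates should match the query below.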


When I use the following code to search, I get an exception.
*NOTE: I'm using the MultiFieldQueryParser because in some cases I check other fields; 
I've simplified this one to demonstrate (and to run my tests isolated from other factors).

   IndexSearcher searcher = new IndexSearcher("myindex");
   SimpleAnalyzer analyzer = new SimpleAnalyzer();
   String[] fields = new String[1];
   fields[0] = "DATE";

  String buff = "( DATE:[20021101 - 20021131] )";
  Query query = MultiFieldQueryParser.parse(buff, fields, analyzer);
  searcher.search(query);

I get the following error:
java.lang.IllegalArgumentException: At least one term must be non-null

If buff = "( DATE:20021101 - 20021131 )",
as well as
if buff = "( DATE:(20021101 - 20021131) )",
I simply get no results.

I have added the date to the document by both
Field.Text("DATE", dateStr);
and
Field.Keyword("DATE", dateStr);

I have also tried to build the queries up by creating objects. One of the things I 
notice is that if I use the RangeQuery object, there are no spaces on either side of 
the '-'.

The documents I created have the following fields:
TITLE, DESCRIPTION and DATE.
If I search on TITLE or DESCRIPTION or a combination of both, I get results just fine.

Am I doing something stupid, or is this a bug? It seems, based on what I read, that the 
example above, where buff = "( DATE:[20021101 - 20021131] )", is correct and 
should work.

I published the complete source in an earlier posting called "Problem with Range". It 
also contains a stack trace of the error.

Thanks in advance,
Michael





Re: Question on having IndexReader and IndexWriter simultaneously

2002-11-22 Thread Otis Gospodnetic
Sounds like a problem outside Lucene.
Can you create a self-contained class that demonstrates the problem?
If you cannot, it probably is not a Lucene problem.

Otis
