Re: [Lucene.Net] Test case for: possible infinite loop bug in portuguese snowball stemmer?

2011-09-13 Thread Robert Stewart
Here is a test case:

string text = @"Califórnia";

Lucene.Net.Analysis.KeywordTokenizer tokenizer = new KeywordTokenizer(new 
StringReader(text));

Lucene.Net.Analysis.Snowball.SnowballFilter stemmer =
new Lucene.Net.Analysis.Snowball.SnowballFilter(tokenizer, "Portuguese");

Lucene.Net.Analysis.Token token;

while ((token = stemmer.Next()) != null)
{
    System.Console.WriteLine(token.TermText());
}

This seems to go into an infinite loop; the call to stemmer.Next() never returns.  
I am not sure if this is the only stemmer I am having trouble with, and it happens 
to us on a near-daily basis.
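
For reference, here is a minimal harness around the same repro (my own sketch, not 
part of Lucene.Net; the 5-second limit and the variable names are arbitrary) that 
runs the loop on a background thread, so a hung Next() call reports a failure 
instead of freezing the test runner:

// Sketch: run the tokenize/stem loop on a worker thread and give up
// if it does not finish within 5 seconds.
var worker = new System.Threading.Thread(() =>
{
    var tokenizer = new Lucene.Net.Analysis.KeywordTokenizer(
        new System.IO.StringReader("Califórnia"));
    var stemmer = new Lucene.Net.Analysis.Snowball.SnowballFilter(tokenizer, "Portuguese");
    Lucene.Net.Analysis.Token token;
    while ((token = stemmer.Next()) != null)
    {
        System.Console.WriteLine(token.TermText());
    }
});
worker.IsBackground = true;
worker.Start();
if (!worker.Join(System.TimeSpan.FromSeconds(5)))
{
    System.Console.WriteLine("FAIL: stemmer appears to be stuck in an infinite loop");
}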

Thanks,
Bob


On Sep 13, 2011, at 9:37 AM, Robert Stewart wrote:

 Are there any known issues with snowball stemmers (portuguese in particular) 
 going into some infinite loop?  I have a problem that happens on a recurring 
 basis where IndexWriter locks up on AddDocument and never returns (it has 
 taken up to 3 days before we realize it), requiring manual killing of the 
 process.  It seems to happen only on portuguese documents from what I can 
 tell so far, and the stack trace when thread is aborted is always as follows:
 
 System.Threading.ThreadAbortException: Thread was being aborted.
   at System.RuntimeMethodHandle._InvokeMethodFast(IRuntimeMethodInfo method, 
 Object target, Object[] arguments, SignatureStruct sig, MethodAttributes 
 methodAttributes, RuntimeType typeOwner)
   at System.RuntimeMethodHandle.InvokeMethodFast(IRuntimeMethodInfo method, 
 Object target, Object[] arguments, Signature sig, MethodAttributes 
 methodAttributes, RuntimeType typeOwner)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
 invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean 
 skipVisibilityChecks)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
 invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
 System.SystemException: System.Threading.ThreadAbortException: Thread was 
 being aborted.
   at System.RuntimeMethodHandle._InvokeMethodFast(IRuntimeMethodInfo method, 
 Object target, Object[] arguments, SignatureStruct sig, MethodAttributes 
 methodAttributes, RuntimeType typeOwner)
   at System.RuntimeMethodHandle.InvokeMethodFast(IRuntimeMethodInfo method, 
 Object target, Object[] arguments, Signature sig, MethodAttributes 
 methodAttributes, RuntimeType typeOwner)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
 invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean 
 skipVisibilityChecks)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
 invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
   at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
   at Lucene.Net.Analysis.TokenStream.IncrementToken()
   at Lucene.Net.Index.DocInverterPerField.ProcessFields(Fieldable[] fields, 
 Int32 count)
   at Lucene.Net.Index.DocFieldProcessorPerThread.ProcessDocument()
   at Lucene.Net.Index.DocumentsWriter.UpdateDocument(Document doc, Analyzer 
 analyzer, Term delTerm)
   at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)
 
 
 Is there another list of contrib/snowball issues?  I have not been able to 
 reproduce a small test case yet however.  Have there been any such issues 
 with stemmers in the past?
 
 Thanks,
 Bob



[Lucene.Net] possible infinite loop bug in portuguese snowball stemmer?

2011-09-13 Thread Robert Stewart
Are there any known issues with snowball stemmers (portuguese in particular) 
going into some infinite loop?  I have a problem that happens on a recurring 
basis where IndexWriter locks up on AddDocument and never returns (it has taken 
up to 3 days before we realize it), requiring manual killing of the process.  
It seems to happen only on portuguese documents from what I can tell so far, 
and the stack trace when thread is aborted is always as follows:

System.Threading.ThreadAbortException: Thread was being aborted.
   at System.RuntimeMethodHandle._InvokeMethodFast(IRuntimeMethodInfo method, 
Object target, Object[] arguments, SignatureStruct sig, MethodAttributes 
methodAttributes, RuntimeType typeOwner)
   at System.RuntimeMethodHandle.InvokeMethodFast(IRuntimeMethodInfo method, 
Object target, Object[] arguments, Signature sig, MethodAttributes 
methodAttributes, RuntimeType typeOwner)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean 
skipVisibilityChecks)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
System.SystemException: System.Threading.ThreadAbortException: Thread was being 
aborted.
   at System.RuntimeMethodHandle._InvokeMethodFast(IRuntimeMethodInfo method, 
Object target, Object[] arguments, SignatureStruct sig, MethodAttributes 
methodAttributes, RuntimeType typeOwner)
   at System.RuntimeMethodHandle.InvokeMethodFast(IRuntimeMethodInfo method, 
Object target, Object[] arguments, Signature sig, MethodAttributes 
methodAttributes, RuntimeType typeOwner)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
invokeAttr, Binder binder, Object[] parameters, CultureInfo culture, Boolean 
skipVisibilityChecks)
   at System.Reflection.RuntimeMethodInfo.Invoke(Object obj, BindingFlags 
invokeAttr, Binder binder, Object[] parameters, CultureInfo culture)
   at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
   at Lucene.Net.Analysis.Snowball.SnowballFilter.Next()
   at Lucene.Net.Analysis.TokenStream.IncrementToken()
   at Lucene.Net.Index.DocInverterPerField.ProcessFields(Fieldable[] fields, 
Int32 count)
   at Lucene.Net.Index.DocFieldProcessorPerThread.ProcessDocument()
   at Lucene.Net.Index.DocumentsWriter.UpdateDocument(Document doc, Analyzer 
analyzer, Term delTerm)
   at Lucene.Net.Index.IndexWriter.AddDocument(Document doc, Analyzer analyzer)


Is there another list of contrib/snowball issues?  I have not been able to 
reproduce a small test case yet however.  Have there been any such issues with 
stemmers in the past?

Thanks,
Bob

[Lucene.Net] How to add document to more than one index (but only analyze once)?

2011-09-09 Thread Robert Stewart
Is it possible to add a document to more than one index at the same time, such 
that document fields are only analyzed one time?  For instance, to add document 
to both a master index, and a smaller near real-time index.  I would like to 
avoid analyzing document fields more than once, but I don't see how that is 
possible at all using the Lucene API.

Thanks,
Bob

Re: [Lucene.Net] How to add document to more than one index (but only analyze once)?

2011-09-09 Thread Robert Stewart
That sounds like a good plan.  How will that affect existing merge scheduling?  
For the master index I use a merge factor of 2.
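
Just so I am sure I follow the suggestion below, something along these lines is 
what I would try (only a sketch; analyzer, doc, nrtWriter and masterWriter stand 
in for my own objects):

// Analyze the document once into a small RAMDirectory, then merge that
// segment into both the NRT index and the master index without re-analyzing.
var ramDir = new Lucene.Net.Store.RAMDirectory();
var ramWriter = new Lucene.Net.Index.IndexWriter(ramDir, analyzer,
    true, Lucene.Net.Index.IndexWriter.MaxFieldLength.UNLIMITED);
ramWriter.AddDocument(doc);   // analysis happens once, here
ramWriter.Close();

nrtWriter.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { ramDir });
masterWriter.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { ramDir });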


On Sep 9, 2011, at 11:44 AM, digy digy wrote:

 How about indexing the new document(s) in memory using RAMDirectory then
 calling indexWriter.AddIndexesNoOptimize for NRT & master index?
 
 DIGY
 
 On Fri, Sep 9, 2011 at 5:33 PM, Robert Stewart robert_stew...@epam.com wrote:
 
 Is it possible to add a document to more than one index at the same time,
 such that document fields are only analyzed one time?  For instance, to add
 document to both a master index, and a smaller near real-time index.  I
 would like to avoid analyzing document fields more than once but I dont see
 if that is possible at all using Lucene API.
 
 Thanks,
 Bob



Re: [Lucene.Net] Lucene Steroids

2011-07-07 Thread Robert Stewart
I have built something similar using NTFS hard links and re-using existing 
local snapshot files.  It has been running in production for 3+ years now with 
more than 100 million docs, and distributes new snapshots from the master 
servers every minute.  It does not use rsync at all; it only leverages Lucene's 
unique file names - it copies only the files not already present on the slaves, 
and uses NTFS hard links to bring existing local files into the new snapshot 
directory.  On the masters, it likewise uses NTFS hard links to create a new 
snapshot of the master index, and the slaves simply look for new snapshot 
directories on the master servers.  When a new directory shows up, a slave 
compares it against its existing local snapshot to see which files are new on 
the master (or have been deleted by the master), and then copies only the new 
files.  No explicit commit operations are sent, and there is no explicit 
communication between masters and slaves (slaves just watch a remote directory 
for new snapshot sub-directories).  This has worked great with no problems at 
all.  All of this was built before SOLR was available on Windows.  Going 
forward we are transitioning to Java and SOLR on Linux (it is just too hard to 
keep up with improvements otherwise, IMO).
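
In case it helps anyone, the hard-link step is just a thin P/Invoke wrapper over 
the Win32 CreateHardLink call; a rough sketch (not our production code, the names 
are made up) looks like this:

using System;
using System.IO;
using System.Runtime.InteropServices;

static class SnapshotLinker
{
    // Win32 CreateHardLink: makes lpFileName point at the same file data as
    // lpExistingFileName, so no bytes are copied.
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
    static extern bool CreateHardLink(string lpFileName, string lpExistingFileName, IntPtr lpSecurityAttributes);

    // Hard-link every file from the current local snapshot into the new snapshot
    // directory; only files missing locally still need to be copied from the master.
    public static void LinkExistingFiles(string currentSnapshot, string newSnapshot)
    {
        Directory.CreateDirectory(newSnapshot);
        foreach (string file in Directory.GetFiles(currentSnapshot))
        {
            string target = Path.Combine(newSnapshot, Path.GetFileName(file));
            if (!CreateHardLink(target, file, IntPtr.Zero))
                throw new IOException("CreateHardLink failed for " + file);
        }
    }
}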



On Jul 6, 2011, at 8:22 PM, Guilherme Balena Versiani wrote:

 Hi,
 
 I am working on a derived work of Solr for .NET. The purpose is to obtain a 
 similar solution of Lucene replication available at Solr, but without the 
 need to port all Solr code.
 
 There is a SnapShooter, SnapPuller and a SnapInstaller. The SnapShooter does 
 similar work as in Solr script. The SnapPuller uses cwRsync to replicate the 
 database between machines, but without storing the 
 snapshot.current.MACHINENAME files on master, as cwRsync does not support sync 
 with the server. The SnapInstaller tries to substitute the Lucene database 
 files in-place -- the Lucene application should use a SteroidsFSDirectory 
 that creates a special SteroidsFSIndexInput that permits to rename files in 
 use; after that, SnapInstaller sends a commit operation through a Windows 
 named pipe to the application to reset its current IndexSearcher instance.
 
 This solution has the suggestive name of Lucene Steroids, and was hosted in 
 BitBucket.org. What is the best way to continue to distribute it? Should I 
 continue to maintain it on BitBucket.org or should I apply to Lucene.NET 
 project (I don't know how) to include it on Contrib modules?
 
 The current code is available at http://bitbucket.org/guibv/lucene.steroids. 
 The work is incomplete; the first stable version should be available on next 
 few days.
 
 Best regards,
 Guilherme Balena Versiani.



[Lucene.Net] alternatives to FSDirectory for multi-threaded search performance

2011-06-16 Thread Robert Stewart
What are the recommended best practices for using FSDirectory vs. RamDirectory, 
etc. for use in multi-threaded search?

In a previous version of Lucene.Net (1.9) I used a modified FSDirectory 
implementation which used a pool of open FileStream objects for each segment 
file, and handed them out in round-robin fashion from the Clone() method.  That 
way multiple threads could read most segment files in parallel.  It definitely 
increased multithreaded search performance quite a bit.  My indexes are quite 
large (100+ million docs) and I cannot load entire segments into RAM using 
RamDirectory.

My question is what is the best practice here?  Is using a pool of descriptors 
as described above the best idea?
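
For what it is worth, the round-robin part of what I described is nothing more 
than something like this (a sketch of the idea only, not the actual modified 
FSDirectory; each Clone() call would pull the next pre-opened stream from a pool 
like this one):

// Minimal round-robin pool of pre-opened read-only FileStreams for one segment
// file; concurrent readers get different streams instead of sharing one descriptor.
class FileStreamPool
{
    private readonly System.IO.FileStream[] _streams;
    private int _next;

    public FileStreamPool(string path, int size)
    {
        _streams = new System.IO.FileStream[size];
        for (int i = 0; i < size; i++)
            _streams[i] = new System.IO.FileStream(path, System.IO.FileMode.Open,
                System.IO.FileAccess.Read, System.IO.FileShare.Read);
    }

    public System.IO.FileStream Next()
    {
        int i = System.Threading.Interlocked.Increment(ref _next);
        return _streams[(i & 0x7fffffff) % _streams.Length];
    }
}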

Thanks
Bob

[Lucene.Net] Score(collector) called for each subReader - but not what I need

2011-06-10 Thread Robert Stewart
As I previously tried to explain, I have custom query for some pre-cached 
terms, which I load into RAM in efficient compressed form.  I need this for 
faster searching and also for much faster faceting.  So what I do is process 
incoming query and replace certain sub-queries with my own CachedTermQuery 
objects, which extend Query.  Since these are not per-segment, I only want 
scorer.Score(collector) called once, not once for each segment in my index.  
Essentially what happens now when I run a search is that it collects the same 
documents N times, once for each segment.  Is there any way to combine 
different Scorers/Collectors such that I can control when it enumerates 
collection by multiple sub-readers, and when not to?  This all worked in a 
previous version of Lucene because enumerating sub-indexes (segments) was 
pushed to a lower level inside the Lucene API, and now it is elevated to a 
higher level.
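
For context, the per-segment contract I am fighting is the 2.9 Collector API, 
which (roughly, from memory) looks like the sketch below: Lucene calls 
SetNextReader once per segment and passes segment-local doc IDs to Collect, so 
the docBase has to be added back to get index-wide IDs.  My CachedTermQuery 
already produces index-wide IDs, which is why being driven once per segment 
collects the same documents repeatedly.

// Plain global-doc-ID collector under the per-segment API (illustrative only).
class GlobalDocIdCollector : Lucene.Net.Search.Collector
{
    private int _docBase;
    public System.Collections.Generic.List<int> Hits = new System.Collections.Generic.List<int>();

    public override void SetScorer(Lucene.Net.Search.Scorer scorer) { }

    public override void Collect(int doc)
    {
        Hits.Add(_docBase + doc);   // translate segment-local ID to index-wide ID
    }

    public override void SetNextReader(Lucene.Net.Index.IndexReader reader, int docBase)
    {
        _docBase = docBase;
    }

    public override bool AcceptsDocsOutOfOrder()
    {
        return true;
    }
}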

Thanks
Bob


On Jun 9, 2011, at 4:33 PM, Robert Stewart wrote:

 I found the problem.  The problem is that I have a custom query optimizer, 
 and that replaces certain TermQuery's within a Boolean query with a custom 
 Query and this query has its own weight/scorer that retrieves matching 
 documents from an in-memory cache (and that is not Lucene backed).  But it 
 looks like my custom hitcollectors are now wrapped in a HitCollectorWrapper 
 which assumes Collect() needs called for multiple segments - so it is adding 
 a start offset to the doc ID that comes from my custom query implementation.  
 I looked at the new Collector class and it seems it works the same way 
 (assumes it needs to set the next index reader with some offset).  How can I 
 make my custom query work with the new API (so that there is basically a 
 single segment in RAM that my query uses, but still other query clauses in 
 same boolean query use multiple lucene segments)?  I am sure that is not 
 clear and will try to provide more detail soon.
 
 Thanks
 Bob
 
 
 On Jun 9, 2011, at 1:48 PM, Digy wrote:
 
 Sorry no idea. Maybe optimizing the index with 2.9.2 can help to detect the
 problem.
 DIGY
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Thursday, June 09, 2011 8:40 PM
 To: lucene-net-...@lucene.apache.org
 Subject: Re: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
 
 I tried converting index using IndexWriter as follows:
 
 Lucene.Net.Index.IndexWriter writer = new IndexWriter(TestIndexPath + "_2.9",
 new Lucene.Net.Analysis.KeywordAnalyzer());
 
 writer.SetMaxBufferedDocs(2);
 writer.SetMaxMergeDocs(100);
 writer.SetMergeFactor(2);
 
 writer.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { new
 Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(TestIndexPath)) });
 
 writer.Commit();
 
 
 That seems to work (I get what looks like a valid index directory at least).
 
 But still when I run some tests using IndexSearcher I get the same problem
 (I get documents in Collect() which are larger than IndexReader.MaxDoc()).
 Any idea what the problem could be?  
 
 BTW, this is a problem because I lookup some fields (date ranges, etc.) in
 some custom collectors which filter out documents, and it assumes I dont get
 any documents larger than maxDoc.
 
 Thanks,
 Bob
 
 
 On Jun 9, 2011, at 12:37 PM, Digy wrote:
 
 One more point, some write operations using Lucene.Net 2.9.2 (add, delete,
 optimize etc.) upgrades automatically your index to 2.9.2.
 But if your index is somehow corrupted(eg, due to some bug in 1.9) this
 may
 result in data loss.
 
 DIGY
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Thursday, June 09, 2011 7:06 PM
 To: lucene-net-...@lucene.apache.org
 Subject: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
 
 I have a Lucene index created with Lucene.Net 1.9.  I have a multi-segment
 index (non-optimized).   When I run Lucene.Net 2.9.2 on top of that index,
 I
 get IndexOutOfRange exceptions in my collectors.  It is giving me document
 IDs that are larger than maxDoc.  
 
 My index contains 377831 documents, and IndexReader.MaxDoc() is returning
 377831, but I get documents from Collect() with large values (for instance
 379018).  Is an index built with Lucene.Net 1.9 compatible with 2.9.2?  If
 not, is there some way I can convert it (in production we have many
 indexes
 containing about 200 million docs so I'd rather convert existing indexes
 than rebuilt them).
 
 Thanks
 Bob
 
 
 



Re: [Lucene.Net] Score(collector) called for each subReader - but not what I need

2011-06-10 Thread Robert Stewart
No, I will try it though. Thanks.

Bob


On Jun 10, 2011, at 12:37 PM, Digy wrote:

 Have you tried to use Lucene.Net as is, before working on optimizing your
 code? There are a lot of speed improvements in it since 1.9.
 There is also a Faceted Search project in contrib.
 (https://cwiki.apache.org/confluence/display/LUCENENET/Simple+Faceted+Search
 )
 
 DIGY
 
 
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Friday, June 10, 2011 7:14 PM
 To: lucene-net-...@lucene.apache.org
 Subject: [Lucene.Net] Score(collector) called for each subReader - but not
 what I need
 
 As I previously tried to explain, I have custom query for some pre-cached
 terms, which I load into RAM in efficient compressed form.  I need this for
 faster searching and also for much faster faceting.  So what I do is process
 incoming query and replace certain sub-queries with my own CachedTermQuery
 objects, which extend Query.  Since these are not per-segment, I only want
 scorer.Score(collector) called once, not once for each segment in my index.
 Essentially what happens now if I have a search is it collects the same
 documents N times, 1 time for each segment.  Is there anyway to combine
 different Scorers/Collectors such that I can control when it enumerates
 collection by multiple sub-readers, and when not to?  This all worked in
 previous version of Lucene because enumerating sub-indexes (segments) was
 pushed to a lower level inside Lucene API and now it is elevated to a higher
 level.
 
 Thanks
 Bob
 
 
 On Jun 9, 2011, at 4:33 PM, Robert Stewart wrote:
 
 I found the problem.  The problem is that I have a custom query
 optimizer, and that replaces certain TermQuery's within a Boolean query
 with a custom Query and this query has its own weight/scorer that retrieves
 matching documents from an in-memory cache (and that is not Lucene backed).
 But it looks like my custom hitcollectors are now wrapped in a
 HitCollectorWrapper which assumes Collect() needs called for multiple
 segments - so it is adding a start offset to the doc ID that comes from my
 custom query implementation.  I looked at the new Collector class and it
 seems it works the same way (assumes it needs to set the next index reader
 with some offset).  How can I make my custom query work with the new API (so
 that there is basically a single segment in RAM that my query uses, but
 still other query clauses in same boolean query use multiple lucene
 segments)?  I am sure that is not clear and will try to provide more detail
 soon.
 
 Thanks
 Bob
 
 
 On Jun 9, 2011, at 1:48 PM, Digy wrote:
 
 Sorry no idea. Maybe optimizing the index with 2.9.2 can help to detect
 the
 problem.
 DIGY
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Thursday, June 09, 2011 8:40 PM
 To: lucene-net-...@lucene.apache.org
 Subject: Re: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
 
 I tried converting index using IndexWriter as follows:
 
 Lucene.Net.Index.IndexWriter writer = new
 IndexWriter(TestIndexPath + "_2.9",
 new Lucene.Net.Analysis.KeywordAnalyzer());
 
 writer.SetMaxBufferedDocs(2);
 writer.SetMaxMergeDocs(100);
 writer.SetMergeFactor(2);
 
 writer.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { new
 Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(TestIndexPath)) });
 
 writer.Commit();
 
 
 That seems to work (I get what looks like a valid index directory at
 least).
 
 But still when I run some tests using IndexSearcher I get the same
 problem
 (I get documents in Collect() which are larger than
 IndexReader.MaxDoc()).
 Any idea what the problem could be?  
 
 BTW, this is a problem because I lookup some fields (date ranges, etc.)
 in
 some custom collectors which filter out documents, and it assumes I dont
 get
 any documents larger than maxDoc.
 
 Thanks,
 Bob
 
 
 On Jun 9, 2011, at 12:37 PM, Digy wrote:
 
 One more point, some write operations using Lucene.Net 2.9.2 (add,
 delete,
 optimize etc.) upgrades automatically your index to 2.9.2.
 But if your index is somehow corrupted(eg, due to some bug in 1.9) this
 may
 result in data loss.
 
 DIGY
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Thursday, June 09, 2011 7:06 PM
 To: lucene-net-...@lucene.apache.org
 Subject: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
 
 I have a Lucene index created with Lucene.Net 1.9.  I have a
 multi-segment
 index (non-optimized).   When I run Lucene.Net 2.9.2 on top of that
 index,
 I
 get IndexOutOfRange exceptions in my collectors.  It is giving me
 document
 IDs that are larger than maxDoc.  
 
 My index contains 377831 documents, and IndexReader.MaxDoc() is
 returning
 377831, but I get documents from Collect() with large values (for
 instance
 379018).  Is an index built with Lucene.Net 1.9 compatible with 2.9.2?
 If
 not, is there some way I can convert it (in production we have many
 indexes
 containing about 200 million docs so I'd rather convert existing indexes than rebuild them).

[Lucene.Net] index version compatibility (1.9 to 2.9.2)?

2011-06-09 Thread Robert Stewart
I have a Lucene index created with Lucene.Net 1.9.  I have a multi-segment 
index (non-optimized).   When I run Lucene.Net 2.9.2 on top of that index, I 
get IndexOutOfRange exceptions in my collectors.  It is giving me document IDs 
that are larger than maxDoc.  

My index contains 377831 documents, and IndexReader.MaxDoc() is returning 
377831, but I get documents from Collect() with large values (for instance 
379018).  Is an index built with Lucene.Net 1.9 compatible with 2.9.2?  If not, 
is there some way I can convert it (in production we have many indexes 
containing about 200 million docs so I'd rather convert existing indexes than 
rebuild them).

Thanks
Bob

[Lucene.Net] index version compatibility (1.9 to 2.9.2)?

2011-06-09 Thread Robert Stewart
I have a Lucene index created with Lucene.Net 1.9.  I have 
a multi-segment index (non-optimized).   When I run 
Lucene.Net 2.9.2 on top of that index, I get 
IndexOutOfRange exceptions in my collectors.  It is giving me document IDs that 
are larger than maxDoc.

My index contains 377831 documents, and IndexReader.MaxDoc() is returning 
377831, but I get documents from Collect() with large values (for instance 
379018).  Is an index built with Lucene.Net 1.9 compatible 
with 2.9.2?  If not (and I assume it is not), is there some way I can convert 
existing indexes? (In production we have many indexes containing about 200 
million docs, so I'd much rather convert existing indexes than rebuild them.)

Thanks
Bob


Re: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?

2011-06-09 Thread Robert Stewart
I tried converting index using IndexWriter as follows:

Lucene.Net.Index.IndexWriter writer = new IndexWriter(TestIndexPath + "_2.9", new 
Lucene.Net.Analysis.KeywordAnalyzer());

writer.SetMaxBufferedDocs(2);
writer.SetMaxMergeDocs(100);
writer.SetMergeFactor(2);

writer.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { new 
Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(TestIndexPath)) });
  
writer.Commit();


That seems to work (I get what looks like a valid index directory at least).

But still, when I run some tests using IndexSearcher I get the same problem (I 
get documents in Collect() which are larger than IndexReader.MaxDoc()).  Any 
idea what the problem could be?  

BTW, this is a problem because I look up some fields (date ranges, etc.) in some 
custom collectors which filter out documents, and they assume I don't get any 
documents larger than maxDoc.
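
In the meantime, just to make the symptom visible, a collector guard along these 
lines (purely illustrative, not a fix) logs the out-of-range IDs:

// Log any doc ID the collector receives that is at or above MaxDoc(); the normal
// field lookups / date-range filtering would go where the comment is.
class RangeCheckingCollector : Lucene.Net.Search.HitCollector
{
    private readonly int _maxDoc;

    public RangeCheckingCollector(Lucene.Net.Index.IndexReader reader)
    {
        _maxDoc = reader.MaxDoc();
    }

    public override void Collect(int doc, float score)
    {
        if (doc >= _maxDoc)
        {
            System.Console.WriteLine("Out-of-range doc ID: " + doc + " (maxDoc=" + _maxDoc + ")");
            return;
        }
        // ... normal field lookups / filtering here ...
    }
}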

Thanks,
Bob


On Jun 9, 2011, at 12:37 PM, Digy wrote:

 One more point: some write operations using Lucene.Net 2.9.2 (add, delete,
 optimize, etc.) automatically upgrade your index to 2.9.2.
 But if your index is somehow corrupted (e.g., due to some bug in 1.9), this may
 result in data loss.
 
 DIGY
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Thursday, June 09, 2011 7:06 PM
 To: lucene-net-...@lucene.apache.org
 Subject: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
 
 I have a Lucene index created with Lucene.Net 1.9.  I have a multi-segment
 index (non-optimized).   When I run Lucene.Net 2.9.2 on top of that index, I
 get IndexOutOfRange exceptions in my collectors.  It is giving me document
 IDs that are larger than maxDoc.  
 
 My index contains 377831 documents, and IndexReader.MaxDoc() is returning
 377831, but I get documents from Collect() with large values (for instance
 379018).  Is an index built with Lucene.Net 1.9 compatible with 2.9.2?  If
 not, is there some way I can convert it (in production we have many indexes
 containing about 200 million docs so I'd rather convert existing indexes
 than rebuilt them).
 
 Thanks
 Bob
 



Re: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?

2011-06-09 Thread Robert Stewart
I found the problem.  I have a custom query optimizer that replaces certain 
TermQuerys within a BooleanQuery with a custom Query, and this query has its 
own weight/scorer that retrieves matching documents from an in-memory cache 
(one that is not Lucene-backed).  But it looks like my custom hit collectors 
are now wrapped in a HitCollectorWrapper, which assumes Collect() needs to be 
called for multiple segments - so it is adding a start offset to the doc ID 
that comes from my custom query implementation.  I looked at the new Collector 
class and it seems to work the same way (it assumes it needs to set the next 
index reader with some offset).  How can I make my custom query work with the 
new API, so that there is basically a single segment in RAM that my query 
uses, but other query clauses in the same BooleanQuery still use multiple 
Lucene segments?  I am sure that is not clear, and I will try to provide more 
detail soon.

Thanks
Bob


On Jun 9, 2011, at 1:48 PM, Digy wrote:

 Sorry no idea. Maybe optimizing the index with 2.9.2 can help to detect the
 problem.
 DIGY
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Thursday, June 09, 2011 8:40 PM
 To: lucene-net-...@lucene.apache.org
 Subject: Re: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
 
 I tried converting index using IndexWriter as follows:
 
 Lucene.Net.Index.IndexWriter writer = new IndexWriter(TestIndexPath + "_2.9",
 new Lucene.Net.Analysis.KeywordAnalyzer());
 
 writer.SetMaxBufferedDocs(2);
 writer.SetMaxMergeDocs(100);
 writer.SetMergeFactor(2);
 
 writer.AddIndexesNoOptimize(new Lucene.Net.Store.Directory[] { new
 Lucene.Net.Store.SimpleFSDirectory(new DirectoryInfo(TestIndexPath)) });
 
 writer.Commit();
 
 
 That seems to work (I get what looks like a valid index directory at least).
 
 But still when I run some tests using IndexSearcher I get the same problem
 (I get documents in Collect() which are larger than IndexReader.MaxDoc()).
 Any idea what the problem could be?  
 
 BTW, this is a problem because I lookup some fields (date ranges, etc.) in
 some custom collectors which filter out documents, and it assumes I dont get
 any documents larger than maxDoc.
 
 Thanks,
 Bob
 
 
 On Jun 9, 2011, at 12:37 PM, Digy wrote:
 
 One more point, some write operations using Lucene.Net 2.9.2 (add, delete,
 optimize etc.) upgrades automatically your index to 2.9.2.
 But if your index is somehow corrupted(eg, due to some bug in 1.9) this
 may
 result in data loss.
 
 DIGY
 
 -Original Message-
 From: Robert Stewart [mailto:robert_stew...@epam.com] 
 Sent: Thursday, June 09, 2011 7:06 PM
 To: lucene-net-...@lucene.apache.org
 Subject: [Lucene.Net] index version compatibility (1.9 to 2.9.2)?
 
 I have a Lucene index created with Lucene.Net 1.9.  I have a multi-segment
 index (non-optimized).   When I run Lucene.Net 2.9.2 on top of that index,
 I
 get IndexOutOfRange exceptions in my collectors.  It is giving me document
 IDs that are larger than maxDoc.  
 
 My index contains 377831 documents, and IndexReader.MaxDoc() is returning
 377831, but I get documents from Collect() with large values (for instance
 379018).  Is an index built with Lucene.Net 1.9 compatible with 2.9.2?  If
 not, is there some way I can convert it (in production we have many
 indexes
 containing about 200 million docs so I'd rather convert existing indexes
 than rebuilt them).
 
 Thanks
 Bob