Compatibility problems between the AnalyzerWrapper API and the MultiTerms.getTerms API

2020-04-14 Thread 小鱼儿
I'm using AnalyzerWrapper to apply per-field analyzers for special indexing:

PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(..);
// PerFieldAnalyzerWrapper is a subclass of Lucene's AnalyzerWrapper
IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
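
Spelled out a bit more, the setup looks roughly like this (the KeywordAnalyzer
and the "myField" mapping are just placeholders, not my real code):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

IndexWriter openWriter(Directory dir) throws IOException {
  // Fields listed in the map get their own analyzer; all other fields
  // fall back to the default analyzer passed as the first argument.
  Map<String, Analyzer> perField = new HashMap<>();
  perField.put("myField", new KeywordAnalyzer()); // placeholder field name

  PerFieldAnalyzerWrapper analyzer =
      new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
  IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
  return new IndexWriter(dir, iwc);
}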

However, I found that when I later used MultiTerms.getTerms to load that
field's term dictionary, the terms look as if the field had still been
analyzed by Lucene's StandardAnalyzer.
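
The check I'm doing looks roughly like this (a minimal sketch; the Directory
and the field name are placeholders). MultiTerms.getTerms merges the term
dictionaries of all segments for one field:

import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiTerms;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

void dumpTerms(Directory dir) throws IOException {
  try (DirectoryReader reader = DirectoryReader.open(dir)) {
    // Returns null if no document has this field.
    Terms terms = MultiTerms.getTerms(reader, "myField"); // placeholder name
    if (terms != null) {
      TermsEnum te = terms.iterator();
      for (BytesRef term = te.next(); term != null; term = te.next()) {
        System.out.println(term.utf8ToString()); // the indexed tokens
      }
    }
  }
}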

I had to use another trick to bypass this problem (a custom IndexableField
class that does the per-field custom analysis, which I won't detail here).
I guess MultiTerms.getTerms is an experimental API, so it's not consistent
with AnalyzerWrapper?


RE: Need suggestion in replacing forceMerge(1) with an alternative which consumes less space

2020-04-14 Thread Uwe Schindler
Hi,

from what you are describing, it is not clear what you are seeing. Asking the 
question about "forceMerge(1)" looks like an XY problem 
(https://en.wikipedia.org/wiki/XY_problem).

(1) forceMerge(1) should almost never be used; it is only for some very special 
circumstances (like indexes that are read-only and will never be updated again). 
If you forceMerge an index, its "internal structure" gets corrupted and later 
merging never works again like it should. This forces you to forceMerge it 
over and over.

(2) forceMerge does not solve the problem you are asking about! What you see 
might just be a side effect of something else!

(3) you say: 

> the Lucene Document is getting corrupted (data is not updated correctly;
> data from different rows gets merged).

This looks like an issue in your code. Be sure to create new Documents and pass 
them to IndexWriter. Documents may be indexed asynchronously (depending on how 
you set up everything), so it looks like you are changing already created/existing 
Documents while indexing is still in progress.
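
A minimal sketch of the safe pattern (the field names are made up; the point
is that every call builds a fresh Document):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

void updateRow(IndexWriter writer, String id, String body) throws IOException {
  // Build a brand-new Document for every update; never mutate a Document
  // instance that another thread may still be indexing.
  Document doc = new Document();
  doc.add(new StringField("id", id, Field.Store.YES));
  doc.add(new TextField("body", body, Field.Store.NO));
  // Atomically removes the previous document with this id and adds the new one.
  writer.updateDocument(new Term("id", id), doc);
}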

> 2. When we try the updateDocument method for a single record, it is not
> reflected in the IndexReader until the count reaches 8. Once the count exceeds
> that, the records are visible to the IndexReader (creating 8 segment files). Is
> there any alternative for reducing this segment file creation?

Segments are perfectly fine and required to make incremental updates work 
correctly. What you say about "up to 8" does not make sense: Lucene has no 
mechanism that makes visibility depend on the number of segments. The issue 
you are seeing is more likely wrong usage of the real-time readers. 
IndexReaders are point-in-time snapshots: when you call getReader on the 
IndexWriter, you get a reader that does not change anymore (a point-in-time 
snapshot). To see the updates, you have to open a new reader. There is 
SearcherManager to help with that: it manages a pool of searchers/IndexReaders 
and takes care of reopening them when the underlying index data changes.
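
A rough sketch of that pattern (a minimal example, not your exact setup):

import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;

class Searching {
  private final SearcherManager manager;

  Searching(IndexWriter writer) throws IOException {
    // Create once and share across all searching threads.
    manager = new SearcherManager(writer, null);
  }

  void afterUpdates() throws IOException {
    // Makes recent index changes visible to newly acquired searchers.
    manager.maybeRefresh();
  }

  int docCount() throws IOException {
    // Every search acquires a point-in-time searcher and must release it.
    IndexSearcher searcher = manager.acquire();
    try {
      return searcher.getIndexReader().numDocs();
    } finally {
      manager.release(searcher);
    }
  }
}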

> 3. The above two issues are resolved by forceMerge(1), but that is not feasible
> for our use case because it takes 3X memory. We are creating indexes for huge
> amounts of data.

Don't use forceMerge, especially not to work around an issue that comes from 
wrong multi-threading code and a basic misunderstanding of IndexReaders and 
their relationship to IndexWriters.

Uwe

-
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de

> -----Original Message-----
> From: Jyothsna Bavisetti 
> Sent: Tuesday, April 14, 2020 7:56 AM
> To: java-user@lucene.apache.org
> Subject: Need suggestion in replacing forceMerge(1) with an alternative which
> consumes less space
> 
> Hi,
> 
> 
> 
> 1. We upgraded from Lucene 4.6 to 8+. After upgrading, we are facing an issue
> with Lucene index creation.
> 
> We are indexing in a multi-threaded environment. When we create bulk indexes,
> the Lucene Document is getting corrupted (data is not updated correctly;
> data from different rows gets merged).
> 
> 2. When we try the updateDocument method for a single record, it is not
> reflected in the IndexReader until the count reaches 8. Once the count exceeds
> that, the records are visible to the IndexReader (creating 8 segment files). Is
> there any alternative for reducing this segment file creation?
> 
> 3. The above two issues are resolved by forceMerge(1), but that is not feasible
> for our use case because it takes 3X memory. We are creating indexes for huge
> amounts of data.
> 
> 
> 
> 4. IndexWriter Config:
> 
> analyzer=com.datanomic.director.casemanagement.indexing.AnalyzerFactory$MA
> ramBufferSizeMB=64.0
> maxBufferedDocs=-1
> mergedSegmentWarmer=null
> delPolicy=com.datanomic.director.casemanagement.indexing.engines.TimedDeletionPolicy
> commit=null
> openMode=CREATE_OR_APPEND
> similarity=org.apache.lucene.search.similarities.BM25Similarity
> mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=-1, maxMergeCount=-1,
> ioThrottle=true
> codec=Lucene80
> infoStream=org.apache.lucene.util.InfoStream$NoOutput
> mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergeAtOnceExplicit=30,
> maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0,
> segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022207999E12, noCFSRatio=0.1,
> deletesPctAllowed=33.0]
> indexerThreadPool=org.apache.lucene.index.DocumentsWriterPerThreadPool@24348e05
> readerPooling=true
> perThreadHardLimitMB=1945
> useCompoundFile=false
> commitOnClose=true
> indexSort=null
> checkPendingFlushOnUpdate=true
> softDeletesField=null
> readerAttributes={}
> writer=org.apache.lucene.index.IndexWriter@23a84a99
> 
> 
> 
> Please suggest some alternatives to forceMerge, and some ideas for dealing
> with IndexWriter.commit in a multi-threaded environment and for committing
> data while updating a single record.
> 
> 
> 
> 
> 
> Thanks,
> 
> Jyothsna