Re: Best strategy migrate indexes

2022-11-07 Thread Trejkaz
dy using lucenemigrator > > > > > > What do you mean with "lucenemigrator"? Is it a public tool? > > > > I am trying to create a tool to read docs from a lucene5 index and > > generate lucene9 documents from them (with docValues). That might work, > >

Re: Best strategy migrate indexes

2022-11-02 Thread Trejkaz
ene9 to avoid package conflicts. > > Thanks! > > El mar, 1 nov 2022 a las 0:35, Trejkaz () escribió: > > > Well... > > > > There's a way, but I wouldn't necessarily recommend it. > > > > You can write custom migration code against some version of Lucene >

Re: Best strategy migrate indexes

2022-10-31 Thread Trejkaz
Well... There's a way, but I wouldn't necessarily recommend it. You can write custom migration code against some version of Lucene which supports doc values, to create doc values fields. It's going to involve writing a FilterCodecReader which wraps your real index and then pretends to also have

Re: TermPositions (Lucene 3.3) replacement?

2021-07-08 Thread Trejkaz
Hi. What you probably want is `TermsEnum.postings(PostingsEnum reuse, int flags)`, with `PostingsEnum.POSITIONS` in the flags. I'd also recommend using the `TermsEnum` to iterate the terms instead of using your own loop, as working with postings works better if you do. TX On Fri, 9 Jul 2021 at

Fwd: org.apache.lucene.index.DirectoryReader Javadocs

2020-12-10 Thread Trejkaz
> May I request to add more info into Lucene > org.apache.lucene.index.DirectoryReader about readOnly=true attribute and > > more info on readerAttributes parameters please? Referring to the current documentation:

Re: Port on iOS

2020-08-21 Thread Trejkaz
This looks interesting. https://github.com/lukhnos/LuceneSearchDemo-iOS On Sat, 22 Aug 2020 at 00:12, Saad Umar wrote: > > I want to run Lucene with iOS, how do I do that > > -- > > Best, > > Saad Umar > > Senior Software Engineer > > *Avanza Solutions (Pvt.) Ltd.* > > Office # 14-B, Fakhri

Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-30 Thread Trejkaz
On Mon, 27 Jul 2020 at 19:24, Adrien Grand wrote: > > It's interesting you're not seeing the same slowdown on the other field. > How hard would it be for you to test what the performance is if you > lowercase the name of the digest algorithms, ie. "md5;[md5 value in hex]", > etc. The reason I'm

Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-27 Thread Trejkaz
to make seeking to the values faster. ^^;; TX On Mon, 27 Jul 2020 at 17:08, Adrien Grand wrote: > > Alex, this issue you linked is about the terms dictionary of doc values. > Trejkaz linked the correct issue which is about the terms dictionary of the > inverted index. > > I

Fwd: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-26 Thread Trejkaz
Hi all. I've been tracking down slow seeking performance in TermsEnum after updating to Lucene 8.5.1. On 8.5.1: SegmentTermsEnum.seekExact: 33,829 ms (70.2%) (remaining time in our code) SegmentTermsEnumFrame.loadBlock: 29,104 ms (60.4%) CompressionAlgorithm$2.read:

Re: CheckIndex complaining about -1 for norms value

2020-06-14 Thread Trejkaz
computed the norm value)? > > On Thu, Jun 11, 2020 at 8:45 AM Trejkaz wrote: > > > Well, > > > > We're using the default Lucene similarity. But as far as I know, we've > > always disabled norms as well. So I'm surprised I'm even seeing norms > > mentioned in the

Re: CheckIndex complaining about -1 for norms value

2020-06-11 Thread Trejkaz
r became 0 or something. About the only thing I'm sure about at the moment is that whatever is going on is weird. TX On Thu, 11 Jun 2020 at 15:38, Adrien Grand wrote: > > Hi Trejkaz, > > Negative norm values are legal. The problem here is that Lucene expects > that documents that

CheckIndex complaining about -1 for norms value

2020-06-10 Thread Trejkaz
Hi all. We use CheckIndex as a post-migration sanity check and are seeing this quirk, and I'm wondering whether negative norms is even legit or whether it should have been treated as if it were zero... TX 0.00% total deletions; 378 documents; 0 deleteions Segments file=segments_1 numSegments=1

Re: StanardFilter Question : https://issues.apache.org/jira/browse/LUCENE-8356

2019-06-25 Thread Trejkaz
this affects the rest of my code very much but luckily there is > > Math.toIntExact which throws ArithmeticException when the number is a > > long value outside the integer limit. > > > > In my case I will not exceed the integer limit anyway. > > > > > > Best reg
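For anyone skimming this thread, here is a minimal sketch of the Math.toIntExact behaviour mentioned in the quote. The class name, the helper method and the fallback value are made up for illustration; only Math.toIntExact and ArithmeticException come from the JDK.

```java
final class ToIntExactSketch {
    /**
     * Narrows a long to an int. Math.toIntExact throws ArithmeticException
     * when the value does not fit in an int, so this hypothetical helper
     * catches that and returns the supplied fallback instead.
     */
    static int narrowOrFallback(long value, int fallback) {
        try {
            return Math.toIntExact(value);
        } catch (ArithmeticException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(narrowOrFallback(42L, -1));
        System.out.println(narrowOrFallback(Integer.MAX_VALUE + 1L, -1));
    }
}
```

Whether you want a fallback or would rather let the exception propagate depends on the caller; if the values can never exceed the int range, letting it throw is probably the safer signal.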

Re: StanardFilter Question : https://issues.apache.org/jira/browse/LUCENE-8356

2019-06-24 Thread Trejkaz
I did the research on this one because it confused me as well, but it seems it was a no-op. So the replacement is just to remove it from the filter chain. We have a backwards compatibility filter factory, so we deal with it by keeping around a compatibility implementation which just does nothing

Re: IntField to IntPoint

2019-06-05 Thread Trejkaz
How we would do it: - update the index format to v7 (this in itself is fiddly but there are ways) - open the index in-place migrated: - get all the leaf indices and wrap each in a new subclass of FilterCodecReader - override getPointsReader() on that subclass to return a

Re: JapaneseAnalyzer's system vs user dict

2019-05-26 Thread Trejkaz
On Sun, 26 May 2019 at 23:49, Namgyu Kim wrote: > I think so about that approach. > It's not user-friendly and it is not good for the user. I think it's better to get the parameters in JapaneseTokenizer. > > What do you think about this? A way to override the system dictionary would be

Re: How can I decode geo point postings?

2019-03-31 Thread Trejkaz
On Mon, Apr 1, 2019 at 5:32 AM David Smiley wrote: > > Yup. And if you have the original lat/lon then you can forgo the > complexity of reverse-engineering it from postings. It has been a long day. I did manage to reverse engineer it by reversing the stuff in geoCodedToPrefixCodedBytes - to

Re: How can I decode geo point postings?

2019-03-27 Thread Trejkaz
On Mon, Mar 11, 2019 at 1:15 PM Trejkaz wrote: > > Hi all. > > I'm attempting to migrate from GeoPointField to LatLonPoint so that we > might have a hope in updating to Lucene 7. The first hurdle I'm > hitting is while writing the migration code. > > I inserted a sing

How can I decode geo point postings?

2019-03-10 Thread Trejkaz
Hi all. I'm attempting to migrate from GeoPointField to LatLonPoint so that we might have a hope in updating to Lucene 7. The first hurdle I'm hitting is while writing the migration code. I inserted a single document with one geo point in it on Lucene 6.6, and when I iterate the postings, I see

Re: Lucene sort in AaZz order

2019-01-15 Thread Trejkaz
On Wed, Jan 16, 2019 at 2:29 AM Adrien Grand wrote: > > Assuming that you need case-insensitive sort, the most straightforward > way to do this would be to index the lowercase family name: > SortedDocValuesField("by_name", new > BytesRef(family.getName().toLowerCase(Locale.ROOT))). > > It is also
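The idea in the quote (sort on a lowercased key rather than the raw name) can be tried out without Lucene at all. The following is only a plain-Java sketch of that idea; the class and method names are invented, and it does not touch SortedDocValuesField.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class LowercaseSortKeySketch {
    /** Builds the sort key the same way the quoted suggestion does: lowercase with Locale.ROOT. */
    static String sortKey(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    /** Returns a copy of the names ordered by their lowercase key, leaving the input untouched. */
    static List<String> sortedCaseInsensitive(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.comparing(LowercaseSortKeySketch::sortKey));
        return copy;
    }
}
```

Sorting the raw strings would put every uppercase-initial name ahead of the lowercase ones, which is the "AaZz" problem in the thread title. Using Locale.ROOT keeps the key independent of the JVM's default locale.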

Re: Efficient way to define large Boolean Occur.FILTER clause in Lucene 6

2018-06-26 Thread Trejkaz
On Tue, Jun 26, 2018 at 7:02 PM, Hasenberger, Josef wrote: > However, I have a feeling that the conversion from Long values to Terms is > rather inefficient for large collections and also uses a lot of memory. > To ease conversion overhead somewhat, I created a class that converts a > Long value

Re: Recommendation for doing a search plus collecting extra information?

2018-03-26 Thread Trejkaz
On Mon, Oct 12, 2015 at 4:32 AM, Alan Woodward <a...@flax.co.uk> wrote: > Hi Trejkaz, > > You can still use a standard collector if you don’t need to worry about > multi-threaded search. > It sounds as though what you want to do is implement your own Collector that >

Re: Recommendation for doing a search plus collecting extra information?

2018-03-05 Thread Trejkaz
I did some experiments. As it turns out, changing SortedNumericSortField to SortField had no effect on the timings at all. However, changing the SortField.Type from LONG to INT makes queries come back 3 times faster. (20ms vs. 6.5ms comparing the fastest runs for each.) Why would using int be 3

Re: Recommendation for doing a search plus collecting extra information?

2018-02-28 Thread Trejkaz
On Mon, Oct 12, 2015 at 3:28 PM, Uwe Schindler wrote: > Hi, > > it may sound a bit stupid, but you can do the following: > > If you search for a docvalues (previously fieldcache) field in lucene, the > returned TopFieldDocs contains also the field values > that were sorted

Re: Lucene with Database

2017-12-27 Thread Trejkaz
On Thu, Dec 28, 2017 at 1:07 AM, Riccardo Tasso wrote: > Hi, > I am not aware of any lucene integration with rdbms Derby has a plugin of some sort. I haven't tried it so I have no idea what it actually does, but it looks like it adds table functions which you could join

Re: UnsupportedOperationException from Outputs.merge, during addIndexes

2017-12-11 Thread Trejkaz
On Mon, Dec 11, 2017 at 10:59 PM, Adrien Grand wrote: > This means the FST builder is fed twice with the same key, so it tries to > merge their outputs. This should not happen since the terms dictionary > deduplicates terms. > > Do you get additional errors if you enable

UnsupportedOperationException from Outputs.merge, during addIndexes

2017-12-10 Thread Trejkaz
Hi all. I have an addIndexes call which in my over-weekend run threw an UnsupportedOperationException from deep inside Lucene's code. I'm wondering what sort of condition this is expected to occur in. The source postings it's writing might be corrupt in some way, and if I figure out what way

Re: Lucene 6.1.0 index upgrade

2017-11-10 Thread Trejkaz
On Sat, Nov 11, 2017 at 7:09 AM, Krishnamurthy, Kannan wrote: > Never mind my previous question, understood what you meant about the impact > to norms after > looking at the uses of CreatedMajorVersion in various Similarity classes. It > almost looks >

Re: How to fetch documents for which field is not defined

2017-07-16 Thread Trejkaz
On Sat, Jul 15, 2017 at 8:12 PM, Uwe Schindler wrote: > That is the "Solr" answer. But it is slow like hell. > > In Lucene there is a native query named FieldValueQuery already for this. > It requires DocValues enabled for the field. > > IMHO, the best and fastest variant (also

Re: DocValue update methods don't appear to throw exception if the document doesn't exist

2017-07-07 Thread Trejkaz
On Thu, Jul 6, 2017 at 8:28 PM, Joe Ye wrote: > Thanks very much TX! > > Regarding "But the updates don't actually occur during the call", could you > elaborate on this a bit more? So when would the actual update occur, by > which I mean persisting to disk? The same as any

Re: DocValue update methods don't appear to throw exception if the document doesn't exist

2017-07-04 Thread Trejkaz
On Tue, 4 Jul 2017 at 22:39, Joe Ye wrote: > Hi, > > I'm using Lucene core 6.6. > > I noticed an issue that DocValue update methods > (indexWriter.updateNumericDocValue > & indexWriter.updateBinaryDocValue) don't appear to throw exception or > return any error code if the

Re: Ways to store and search tens of billions of text document content in one lucene index

2017-06-23 Thread Trejkaz
On Fri, Jun 23, 2017 at 4:24 PM, Ranganath B N wrote: > Hi, [cutting X-Y problem stuff] > What strategies do you recommend for this task "Ways to store and search > tens of billions > of text document content in one lucene index"? so that I can accomplish >

Re: Improving Performance by Combining Multiple Fields into Single Field

2017-06-22 Thread Trejkaz
On Thu, Jun 22, 2017 at 3:23 PM, aravinth thangasami wrote: > Hi, > Reading through the web, How elastic search's *_source* field stores > entire document and use *_source* for field retrieving. > Does it better than *document.get* or loading entire

Re: Does forceMerge(1) not always merge to one segment?

2017-05-21 Thread Trejkaz
On Mon, May 22, 2017 at 3:36 PM, Uwe Schindler <u...@thetaphi.de> wrote: > Hi Trejkaz, > > yes, it calls forceMerge, but this is just a "trick" to look at each segment > while merging. But finally it > decides on the version number of each segment, if it gets merged

Does forceMerge(1) not always merge to one segment?

2017-05-21 Thread Trejkaz
We're using IndexUpgrader to upgrade indexes. The 4.10.4 version of this appears to be implemented with a call to forceMerge(1). But when I look at the result for one particular index here, I see that it has 11 segments after doing the merge. When the 5.5.2 version was then run against the same

Re: will lucene traverse all segments to search a 'primary key'term or will it stop as soon as it get one?

2017-04-20 Thread Trejkaz
On Fri, Apr 21, 2017 at 1:09 PM, 马可阳 wrote: > Let’s say I have a user info index and user id is the ‘primary key’. So when > I do a userid term search, > will lucene traverse all segments to search a 'primary key'term or will it > stop as soon as it get one? > > If it is the

QueryNode / query parser performance

2017-04-12 Thread Trejkaz
So... I know none of this work is possible to contribute back to Lucene because the API I've ended up with is too different, but I thought I would share anyway. For a query with 10,000 terms: Before any changes: ~7s Change 1: Change QueryNodeImpl to hold an immutable list of children and only

Weird cloning in QueryNode implementations

2017-04-10 Thread Trejkaz
Hi all. Something queer I found while looking at QueryNode implementations is this sort of thing: @Override public FieldQueryNode cloneTree() throws CloneNotSupportedException { FieldQueryNode fqn = (FieldQueryNode) super.cloneTree(); fqn.begin = this.begin;

Re: Is there some sensible way to do giant BooleanQuery or similar lazily?

2017-04-03 Thread Trejkaz
On Mon, Apr 3, 2017 at 6:25 PM, Adrien Grand wrote: > Large boolean queries can cause a lot of random access as each sub clause > is advanced one after the other. Even in the case that everything fits in > the filesystem cache, the fact that the heap needs to be rebalanced

Is there some sensible way to do giant BooleanQuery or similar lazily?

2017-04-02 Thread Trejkaz
Hi all. We have this one kind of query where you essentially specify a text file which contains the actual query to search for. The catch is that the text file can be large. Our custom query currently computes the set of matching docs up-front, and then when queries come in for one LeafReader,

Re: Index error

2017-03-30 Thread Trejkaz
What if totalHits > 1? TX - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Range queries get misinterpreted when parsed twice via the "Standard" parsers

2017-03-09 Thread Trejkaz
On Fri, 10 Mar 2017 at 01:19, Erick Erickson wrote: > There has never been a guarantee that going back and forth between a > parsed query and its string representation is idempotent. so this > isn't supported. Maybe delete the toQueryString method... There is a

Re: Grouping in Lucene queries giving unexpected results

2017-02-16 Thread Trejkaz
On Fri, Feb 17, 2017 at 11:14 AM, Erick Erickson wrote: > Lucene query logic is not strict Boolean logic, the article above explains > why. tl;dr it mostly comes down to scoring and syntax. The scoring argument will depend on how much you care. (My care for scoring is

Re: Grouping in Lucene queries giving unexpected results

2017-02-16 Thread Trejkaz
On Fri, Feb 17, 2017 at 5:42 AM, Michael Peterson wrote: > I have a question about the meaning and behavior of grouping behavior with > Lucene queries. For this query: host:host_1 AND (NOT location:location_5) The right hand side is: NOT location:location_5 Which

Re: How do I write in 3.x format to an upgradeded index using Lucene 4.10

2017-01-31 Thread Trejkaz
> If we take our old 3.x index and apply IndexUpgrader to it, we end up with a > 4.10 index. > There are several lucene 4.x files created in the index directory and no > errors are thrown. > However, it appears that the index data is still in the 3.x format, namely it > remains: > "thanks",

Re: [Deep Esoterica] How do point codecs work?

2017-01-24 Thread Trejkaz
On Tue, Jan 24, 2017 at 10:21 PM, Michael McCandless <luc...@mikemccandless.com> wrote: > Hi Trejkaz, > > A normal codec would call visitor.compare on smaller and smaller cells > (1D ranges for the 1D case) of the byte[] space and depending on that > result would call one

[Deep Esoterica] How do point codecs work?

2017-01-23 Thread Trejkaz
Hi all. I'm considering writing a migration to copy existing doc values into points (after which I will discard their postings). So essentially I have to implement three things: public void intersect(String fieldName, IntersectVisitor visitor) throws IOException public byte[]

Re: Where did earthDiameter go?

2017-01-20 Thread Trejkaz
On Wed, Jan 18, 2017 at 5:43 AM, Adrien Grand wrote: > I think the reason why there was no deprecation notice is that this code > was considered as internal code rather than something that we explicitly > expose to users as an API. Hmm...

Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2017-01-17 Thread Trejkaz
On Wed, Jan 18, 2017 at 6:07 AM, Adrien Grand wrote: > > We are open to feedback, what issues are you having with > ConstantScoreWeight? It is true that it does not bring much compared to > Weight anymore now that we removed query normalization. The only useful > thing it has

Re: Weird corruption symptom, not making sense

2017-01-16 Thread Trejkaz
On Tue, Jan 17, 2017 at 9:31 AM, Uwe Schindler wrote: > ...or a JVM bug. We have seen those around PagedBytes in the past. What Java > version? Actually I just did a bit more digging and found this: https://issues.apache.org/jira/browse/LUCENE-6948 And of course, even though

Weird corruption symptom, not making sense

2017-01-16 Thread Trejkaz
I have this thing where our UninvertingReader is getting an ArrayIndexOutOfBoundsException in production. I'm sure the index is corrupt, but I tried investigating the code and it still seems a bit odd. Caused by: java.lang.ArrayIndexOutOfBoundsException: -48116 at

Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2017-01-11 Thread Trejkaz
On Thu, Jan 12, 2017 at 1:02 PM, Kumaran Ramasubramanian wrote: > I always use filter when i need to add more than 1024 ( for no scoring > cases ). If filter is removed in lucene 6, what will happen to > maxbooleanclauses limit? Am i missing anything? That sounds like a

Where did earthDiameter go?

2017-01-11 Thread Trejkaz
Hi. I don't know why, but we have some kind of esoteric logic in our own code to simplify a circle on the Earth to a bounding box, clearly something to do with computing geo queries. double lonMin = -180.0, lonMax = 180.0; if (!closeToPole(latMin, latMax)) { double D =

Re: Replacement for Filter-as-abstract-class in Lucene 5.4?

2017-01-11 Thread Trejkaz
On Thu, Jan 21, 2016 at 4:25 AM, Adrien Grand wrote: > Uwe, maybe we could promote ConstantScoreWeight to an experimental API and > document how to build simple queries based on it? In the future now, looking at Lucene 6.3 Javadocs, where Filter is now gone, and it seems that

Re: CPU usage 100% during search

2017-01-02 Thread Trejkaz
On Tue, Jan 3, 2017 at 5:26 AM, Rajnish kamboj wrote: > > Hi > > The CPU usage goes upto 100% during search. Isn't that ideal? Or would you prefer your searches to be slow, blocked by I/O? TX

Re: Email id tokenizer (actual email id & multiple terms)

2016-12-21 Thread Trejkaz
On Wed, Dec 21, 2016 at 11:23 PM, suriya prakash wrote: > Hi, > > Thanks for your reply. > > I might have one or more emailds in a single record. Just so you know, you can add the same field more than once with the field analysed by KeywordAnalyzer, and it will still become

Re: Email id tokenizer (actual email id & multiple terms)

2016-12-20 Thread Trejkaz
On Wed, Dec 21, 2016 at 1:21 AM, Ahmet Arslan wrote: > Hi, > > You can index whole address in a separate field. > Otherwise, how would you handle positions of the split tokens? > > By the way, speed of phrase search may be just fine, so consider trying first. Speed

Re: Opposite of SpanFirstQuery - Searching for documents by last term in a field

2016-12-13 Thread Trejkaz
On Wed, Dec 12, 2012 at 3:04 AM, Ian Lea wrote: > The javadoc for SpanFirstQuery says it is a special case of > SpanPositionRangeQuery so maybe you can use the latter directly, > although you might need to know the position of the last term which > might be a problem. > >

ReaderManager, more drama with things not being closed before closing the Directory

2016-10-19 Thread Trejkaz
Hi all. I seem to have a situation where ReaderManager is reducing a refCount to 0 before it actually releases all its references. It's difficult because it's all mixed up in our framework for multiple ReaderManagers, which I'm still not convinced works because the concurrency is impossible to

Re: What does "found existing value for PerFieldPostingsFormat.format" mean?

2016-10-17 Thread Trejkaz
Continuation, found a bug but I'm not sure whether it's in Lucene or Lucene's Javadoc. In MultiFields: @SuppressWarnings({"unchecked","rawtypes"}) @Override public Iterator iterator() { Iterator subIterators[] = new Iterator[subs.length]; for(int

Re: What does "found existing value for PerFieldPostingsFormat.format" mean?

2016-10-17 Thread Trejkaz
Additional investigation: The index has two segments. Both segments have this "path-position" in the FieldInfo only once. The settings look the same: FieldInfo in first sub-reader: name = "path-position" number = 6 docValuesType = NONE storeTermVector = false

What does "found existing value for PerFieldPostingsFormat.format" mean?

2016-10-17 Thread Trejkaz
Hi all. Does anyone know what this error message means? found existing value for PerFieldPostingsFormat.format, field=path-position, old=Lucene50, new=Lucene50 java.lang.IllegalStateException: found existing value for PerFieldPostingsFormat.format, field=path-position, old=Lucene50,

Re: Performance of Prefix, Wildcard and Regex queries?

2016-10-16 Thread Trejkaz
On Sat, Oct 15, 2016 at 1:21 AM, Rajnish Kamboj wrote: > Hi > > Performance of Prefix, Wildcard and Regex queries? > Does Lucene internally optimizes this (using rewrite or something else) or > I have to manually create specific queries depending on input pattern. > > Example

Re: Lucene Query Parser Special Characters

2016-10-13 Thread Trejkaz
On Fri, Oct 14, 2016 at 2:47 AM, Ashley Ryan wrote: > Obviously, our work around of escaping the angle brackets works as we need > it to, but it seems to me that your documentation is incorrect. Am I > misunderstanding the documentation or conflating the issue I'm seeing

Re: complex disjoint search query

2016-10-12 Thread Trejkaz
On Thu, Oct 13, 2016 at 5:04 AM, Mikhail Khludnev wrote: > Hello, > Why not "To:local.one -(To:[* TO local.one} To:{local.one TO *)" ? That would not match example 2: > 2. To:other.one, third.one, This alone would match 1 and 2, but not 3: To:[* TO local.one} OR

Re: Default LRUQueryCache causing OOO exception

2016-10-12 Thread Trejkaz
On Thu, Oct 13, 2016 at 6:32 AM, Michael McCandless wrote: > You must be calling SearcherManager.maybeRefresh periodically, which > does open new NRT readers. > > Can you please triple check that you do in fact always release() after > an acquire(), in a finally clause?

Re: Crazy increase of MultiPhraseQuery memory usage in Lucene 5 (compared with 3)

2016-10-06 Thread Trejkaz
Thought I would try some thread necromancy here, because nobody replied about this a year ago. Now we're on 5.4.1 and the numbers changed a bit again. Recording best times for each operation. Indexing: 5.723 s SpanQuery: 25.13 s MultiPhraseQuery: (waited 10 minutes and it hasn't

Re: What version is this index?

2016-09-19 Thread Trejkaz
On Mon, Sep 19, 2016 at 3:41 PM, Trejkaz <trej...@trypticon.org> wrote: > The version checking code then says that because format < 9 and format >= 11, > the index must be Lucene 3.0. Obviously I meant format < -9 and format >= -11. Just in case this confuses anyon

What version is this index?

2016-09-18 Thread Trejkaz
Hi all. I have an index in my hands where we have: 1197474657 _0.fdt 270297 _0.fdx 7737 _0.fnm 520 _0.si 377812472 _0.tvd 216765 _0.tvx 182245906 _0_Lucene50_0.doc 4121910583 _0_Lucene50_0.pos 197539330

Re: MultiFields#getTerms docs clarification

2016-08-30 Thread Trejkaz
On Mon, Aug 29, 2016 at 8:23 PM, Michael McCandless wrote: > Seems like you need to scrutinize exactly what documents were indexed in step > 3? > > How exactly did you copy documents out of the old index? Note that > when Lucene's IndexReader returns a Document, it's

Re: MultiFields#getTerms docs clarification

2016-08-28 Thread Trejkaz
Updating this with newly-obtained info. 1. The original index was created in Lucene 3.x. In 3.x, if I call getMin(), it returns non-empty values. So far so good. 2. The index then gets migrated to 5.x using multiple IndexUpgrader steps. Now, when I call getMin(), it still returns a non-empty

Unknown type flag: 6 at CompressingStoredFieldsReader.readField

2016-08-23 Thread Trejkaz
Hi all. Someone apparently got this assertion failure on one of their indexes which they had been storing on a network drive for some stupid reason: AssertionError: Unknown type flag: 6 at

Re: docid is just a signed int32

2016-08-18 Thread Trejkaz
On Thu, Aug 18, 2016 at 11:55 PM, Adrien Grand wrote: > No, IndexWriter enforces that the number of documents cannot go over > IndexWriter.MAX_DOCS (which is a bit less than 2^31) and > BaseCompositeReader computes the number of documents in a long variable and > ensures it is
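The quoted point about summing document counts in a long before checking the limit can be sketched in plain Java. This is not Lucene's code: the class and method are hypothetical, and the constant is an assumption about what IndexWriter.MAX_DOCS is in recent releases (the quote only says "a bit less than 2^31"), so check it against your version.

```java
final class DocCountSketch {
    // Assumption: mirrors IndexWriter.MAX_DOCS in recent Lucene releases.
    // The thread itself only says the limit is "a bit less than 2^31".
    static final int ASSUMED_MAX_DOCS = Integer.MAX_VALUE - 128;

    /**
     * Adds up per-sub-reader document counts in a long, so the running total
     * cannot silently wrap around, and only narrows back to int once the
     * total is known to be within the limit.
     */
    static int totalMaxDoc(int[] subReaderMaxDocs) {
        long total = 0;
        for (int maxDoc : subReaderMaxDocs) {
            total += maxDoc;
        }
        if (total > ASSUMED_MAX_DOCS) {
            throw new IllegalArgumentException(
                "Too many documents: " + total + " exceeds " + ASSUMED_MAX_DOCS);
        }
        return (int) total;
    }
}
```

Doing the sum in an int instead would overflow to a negative number for two large sub-readers, which is the failure mode the long accumulator is there to avoid.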

Re: MultiFields#getTerms docs clarification

2016-08-13 Thread Trejkaz
On Fri, Aug 12, 2016 at 11:51 PM, Michael McCandless wrote: > Getting an empty BytesRef back from Terms.getMin() means Lucene thinks you > indexed an empty (zero length) token. Lucene (unfortunately) allows this. > Is it possible you did that? > > If not, can you make

MultiFields#getTerms docs clarification

2016-08-12 Thread Trejkaz
Hi all. The docs on MultiFields#getTerms state: > This method may return null if the field does not exist. Does this mean: (a) The method *will* return null if the field does not exist. (b) The method will *not necessarily* return null if the field does not exist. I think we've seen a

Re: Any compatiblity issue in the upgrade from Lucene Core 3.2.0 to Core 6.1.0?

2016-08-08 Thread Trejkaz
On Tue, Aug 9, 2016 at 12:36 PM, 郑文兴 wrote: > Then it sounds like that "re-index all the sources in 6.x" is the most > feasible way, :(. If you can, that's what I would do. There are newer features you'll want to use anyway and migrating in doc values and the like is not the

Re: Any compatiblity issue in the upgrade from Lucene Core 3.2.0 to Core 6.1.0?

2016-08-08 Thread Trejkaz
On Mon, Aug 8, 2016 at 1:37 PM, Erick Erickson wrote: > Yes. Lucene only guarantees back-compatibility with > indexes for one major version. That is, a 4.x release can > read a 3.x Lucene index. But a 5.x will not read a 3.x. > > So you have some options here: > 1>

Re: Dubious error message?

2016-08-05 Thread Trejkaz
On Fri, Aug 5, 2016 at 2:51 PM, Erick Erickson wrote: > Question 2: Not that I know of > > Question 2.1. It's actually pretty difficult to understand why a single _term_ > can be over 32K and still make sense. This is not to say that a > single _text_ field can't be over

Dubious error message?

2016-08-04 Thread Trejkaz
Trying to add a document, someone saw: java.lang.IllegalArgumentException: Document contains at least one immense term in field="bcc-address" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The
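Since the limit in that message is on the UTF-8 encoding rather than on the number of Java chars, a quick pre-check can be sketched as follows. The 32766 figure is taken from the error message above; the class and method names are hypothetical, and this is a check you would run yourself before indexing, not something Lucene provides under this name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ImmenseTermSketch {
    /** Maximum term length in UTF-8 bytes, as quoted in the error message. */
    static final int MAX_TERM_UTF8_BYTES = 32766;

    /** True if the term's UTF-8 encoding is longer than the quoted limit. */
    static boolean isImmense(String term) {
        return term.getBytes(StandardCharsets.UTF_8).length > MAX_TERM_UTF8_BYTES;
    }

    /** Small helper for building test strings of a given length. */
    static String repeat(char c, int count) {
        char[] chars = new char[count];
        Arrays.fill(chars, c);
        return new String(chars);
    }
}
```

Note that a string of non-ASCII characters can cross the limit with far fewer chars than 32766, so checking String.length() alone would not be enough.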

Exception in the logs from IndexUpgrader (ArrayIndexOutOfBoundsException from FixedBitSet.set)

2016-08-02 Thread Trejkaz
Hi all. Someone saw IndexUpgrader from 4.10.4 throw this when upgrading their index: Caused by: java.lang.ArrayIndexOutOfBoundsException: 191 at org.apache.lucene.util.FixedBitSet.set(FixedBitSet.java:252) at

Re: How to get the index for a document after a search over multiple indexes

2016-06-14 Thread Trejkaz
On Wed, Jun 15, 2016 at 6:08 AM, Mark Shapiro wrote: > private static IndexSearcher getSearcher( String[] indexDirs ) throws > Exception { > IndexReader[] readers = new IndexReader[indexDirs.length]; > FSDirectory[] directorys = new FSDirectory[indexDirs.length]; > >

Re: How to get the index for a document after a search over multiple indexes

2016-06-13 Thread Trejkaz
On Tue, Jun 14, 2016 at 9:01 AM, Mark Shapiro wrote: > How can I find the single index associated with each Document returned by a > search over > multiple indexes? The document number is not enough, I want to save the > index also so > that later I can retrieve the file

LRUQueryCache appears to be messing with me

2016-03-02 Thread Trejkaz
Hi all. I spent a while trying to track down some weird behaviour where our custom queries that work off information outside Lucene were returning out of date information. It looks like what's happening is that LRUQueryCache compares the queries, decides that they're the same (equals() does

Re: Is there a way to share IndexReader data sensibly across independent callers?

2016-02-25 Thread Trejkaz
So it turns out I still have problems. I wanted to return a proxy reader that the caller could close like normal. I wanted to do this for two reasons: 1. This: try (IndexReader reader = sharer.acquireReader(...)) { ... } Looks much nicer than this: IndexReader reader =
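The "proxy the caller can close like normal" idea from this message can be sketched with plain reference counting. Everything here is hypothetical (SharedResourceSketch, Lease, acquire, releaseOwner); it is not ReaderManager and not the sharer from the message, just a generic illustration of handing out closeable leases over one shared resource.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SharedResourceSketch<T extends AutoCloseable> {
    private final T resource;
    private int refCount = 1;       // starts at 1 for the owner's own reference
    private boolean ownerReleased;

    SharedResourceSketch(T resource) {
        this.resource = resource;
    }

    /** Hands out a lease; the underlying resource stays open until every lease is closed. */
    synchronized Lease<T> acquire() {
        if (refCount == 0) {
            throw new IllegalStateException("resource already closed");
        }
        refCount++;
        return new Lease<>(this);
    }

    /** Drops the owner's reference; calling it more than once has no further effect. */
    void releaseOwner() {
        synchronized (this) {
            if (ownerReleased) {
                return;
            }
            ownerReleased = true;
        }
        decRef();
    }

    private void decRef() {
        boolean last;
        synchronized (this) {
            last = (--refCount == 0);
        }
        if (last) {
            try {
                resource.close();   // the last reference closes the real resource
            } catch (Exception e) {
                throw new IllegalStateException("closing shared resource failed", e);
            }
        }
    }

    /** What callers see: closing it releases one reference, and double close is harmless. */
    static final class Lease<T extends AutoCloseable> implements AutoCloseable {
        private final SharedResourceSketch<T> owner;
        private final AtomicBoolean released = new AtomicBoolean();

        private Lease(SharedResourceSketch<T> owner) {
            this.owner = owner;
        }

        T get() {
            if (released.get()) {
                throw new IllegalStateException("lease already closed");
            }
            return owner.resource;
        }

        @Override
        public void close() {
            if (released.compareAndSet(false, true)) {
                owner.decRef();
            }
        }
    }
}
```

Because Lease implements AutoCloseable, callers can use it in try-with-resources, which is the shape the message says it prefers. The per-lease flag is there so that a caller closing twice cannot decrement the count twice.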

Re: Are "position" and "position increment" actually the exact same concept?

2016-02-14 Thread Trejkaz
On Tue, Feb 9, 2016 at 2:39 AM, András Péteri wrote: > It's only the naming of FieldQueryNode's property that seems ambiguous to > me. The caller of setPositionIncrement(int), AnalyzerQueryNodeProcessor > [1], computes absolute term positions and stores that value in

Re: Is there a way to share IndexReader data sensibly across independent callers?

2016-02-09 Thread Trejkaz
On Wed, Feb 10, 2016 at 3:17 AM, Michael McCandless wrote: > Why do you need to close the Directory? It should be light weight. > But if you really do need it, can't you subclass ReaderManager and > override afterClose to close the directory? I guess that's the next

Re: Is there a way to share IndexReader data sensibly across independent callers?

2016-02-08 Thread Trejkaz
On Tue, Feb 9, 2016 at 2:10 AM, Sanne Grinovero wrote: > Hi, > you should really try to reuse the same opened Directory, like you > suggest without closing it until your application is "done" with it in > all its threads (normally on application shutdown). > Keeping a

Is there a way to share IndexReader data sensibly across independent callers?

2016-02-04 Thread Trejkaz
Hi all. Suppose 100 independent callers are opening the same index like this: try (Directory directory = FSDirectory.open(path1); IndexReader reader = DirectoryReader.open(directory)) { // keeps reader open for a long time } Someone complains that we're using a lot

Are "position" and "position increment" actually the exact same concept?

2016-02-01 Thread Trejkaz
I found the following code in PhraseQueryNodeBuilder: PhraseQuery.Builder builder = new PhraseQuery.Builder(); List children = phraseNode.getChildren(); if (children != null) { for (QueryNode child : children) { TermQuery termQuery = (TermQuery) child

Re: Determine if Merge is triggered in SOLR

2016-01-31 Thread Trejkaz
On Mon, Feb 1, 2016 at 5:59 AM, abhi Abhishek wrote: > Hi All, > any suggestions/ ideas? Start by not cross-posting to irrelevant mailing lists. TX

SlopQueryNodeBuilder is wrecking the query the child node generated now?

2016-01-17 Thread Trejkaz
Hi all. We have a custom QueryNode in our parser which creates a subclass of PhraseQuery. I find since updating to 5.3.1, SlopQueryNodeBuilder is replacing it with a fresh PhraseQuery. Previously, it used to just set the slop on the existing one, which allowed our custom subclasses straight

Fwd: Replacement for Filter-as-abstract-class in Lucene 5.4?

2016-01-14 Thread Trejkaz
Hi all. Filter is now deprecated, which I already knew was in the pipeline. The docs say: "Use Query objects instead: when queries are wrapped in a ConstantScoreQuery or in a BooleanClause.Occur.FILTER clause, they automatically disable the score computation so the Filter class

Re: Lucene not closing an IndexOutput when writing?

2015-12-09 Thread Trejkaz
On Wed, Dec 9, 2015 at 10:26 PM, Michael McCandless wrote: > That said, Lucene tries hard to close this file handle, e.g. if an > in-memory segment is aborted because of e.g. an interrupt exception at > a bad time. > > So, yes, please try to make a test showing that we

Lucene not closing an IndexOutput when writing?

2015-12-08 Thread Trejkaz
We have a Directory implementation that keeps track of who doesn't close their IndexInput and IndexOutput. In some test which is attempting to index documents and ultimately timed out for other reasons (presumably triggering an interrupt, admittedly not the sort of thing libraries are usually

Re: Determine whether a MatchAllQuery or a Query with atleast one Term

2015-11-29 Thread Trejkaz
On Mon, Nov 30, 2015 at 4:15 PM, Sandeep Khanzode wrote: > I want to check whether the net effect of this query (bool or otherwise) is a > MatchAllQuery (i.e. without > any terms) or a query with at least one term, or numeric range. Or both. *:* OR

Re: Dubious stuff spotted in LowerCaseFilter

2015-10-22 Thread Trejkaz
On Thu, Oct 22, 2015 at 7:05 PM, Uwe Schindler wrote: > Hi, > >> Setting aside the fact that Character.toLowerCase is already dubious in some >> locales (e.g. Turkish), > > This is not true. Character.toLowerCase() works locale-independent. > It is only String.toLowerCase that
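The distinction in the quote is easy to see with the JDK alone. A small sketch, with made-up class and method names, comparing Character.toLowerCase (no locale involved) against String.toLowerCase with a Turkish locale:

```java
import java.util.Locale;

final class LowercaseLocaleSketch {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    /** Character.toLowerCase takes no locale, so 'I' always maps to 'i'. */
    static char lowerChar(char c) {
        return Character.toLowerCase(c);
    }

    /** String.toLowerCase with a Turkish locale maps 'I' to dotless i (U+0131). */
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    /** String.toLowerCase with Locale.ROOT behaves the same regardless of the JVM default. */
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }
}
```

So the locale-sensitive surprises come from String.toLowerCase when it picks up a locale (explicitly or from the JVM default), not from the char-level method.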

Dubious stuff spotted in LowerCaseFilter

2015-10-21 Thread Trejkaz
Hi all. LowerCaseFilter uses CharacterUtils.toLowerCase to perform its work. The latter method looks like this: public final void toLowerCase(final char[] buffer, final int offset, final int limit) { assert buffer.length >= limit; assert offset <=0 && offset <= buffer.length; for (int i =

Re: Recommendation for doing a search plus collecting extra information?

2015-10-11 Thread Trejkaz
On Mon, Oct 12, 2015 at 6:32 AM, Alan Woodward <a...@flax.co.uk> wrote: > Hi Trejkaz, > > You can still use a standard collector if you don’t need to worry about > multi-threaded search. It sounds > as though what you want to do is implement your own Collector that wi

Re: Recommendation for doing a search plus collecting extra information?

2015-10-07 Thread Trejkaz
On Thu, Oct 8, 2015 at 1:16 PM, Erick Erickson wrote: > This may be an "XY" problem, you're asking how to do X thinking > it will solve Y without telling us what Y is. > > What do you want to _do_ with the DV values you look up for each hit? Keep them around as the ID to

Re: Recommendation for doing a search plus collecting extra information?

2015-10-07 Thread Trejkaz
On Thu, Oct 8, 2015 at 1:48 PM, Erick Erickson wrote: > First off, the internal Lucene doc ID has never been stable as long as any > segment merging of whatever style was going on, don't quite know > where you're getting that idea. > > It sounds like what you're really
